An Analysis of Different Text Representation Schemes for an Immune Clustering Algorithm

Matheus A. Ferraria; Pedro P. Balbi; Leandro N. de Castro

doi:10.1007/978-3-031-82073-1_25

Book chapter

Distributed Computing and Artificial Intelligence, 21st International Conference, Vol.1259, pp.250-260

Lecture Notes in Networks and Systems, Springer Nature Switzerland

02-18-2025

DOI: https://doi.org/10.1007/978-3-031-82073-1_25

Abstract

This research investigates the challenges and effectiveness of various text representation methods (standard vector, grammar-based, and distributed) when applied to clustering short texts. The study explores Bag-of-Words for the standard vector representation; Linguistic Inquiry and Word Count (LIWC), Part-of-Speech Tagging (POS-Tagging), and the Medical Research Council Psycholinguistic Database (MRC) for grammar-based representations; and Word2Vec, fastText, Doc2Vec, and SentenceBERT for distributed representations. Using the aiNet bio-inspired clustering algorithm, the results reveal, surprisingly, that grammar-based representations perform competitively despite their simplicity, while standard vector representations exhibit known challenges such as high dimensionality. The study contributes insights into the properties of different text representations, providing a foundation for optimizing their application in clustering tasks with short and informal texts.

Keywords: Clustering; Information retrieval; Natural Computing; NLP; Text Mining

Details

Title: An Analysis of Different Text Representation Schemes for an Immune Clustering Algorithm
Creators: Matheus A. Ferraria; Pedro P. Balbi; Leandro N. de Castro
Contributors: Ravikumar Chinthaginjala (Editor); Pawel Sitek (Editor); Nasro Min-Allah (Editor); Kenji Matsui (Editor); Sascha Ossowski (Editor); Sara Rodríguez (Editor)
Publication Details: Distributed Computing and Artificial Intelligence, 21st International Conference, Vol.1259, pp.250-260
Series: Lecture Notes in Networks and Systems
Publisher: Springer Nature Switzerland; Cham
Number of pages: 11
Grant note: CNPq: PQ 303356/2022-7; CAPES: 88881.694458/2022-01, 88887.310281/2018-00; FAPESP: 2021/11905-0
Supported by CNPq through research grant PQ 303356/2022-7; by CAPES through the projects STIC-AmSud (CAMA) No. 88881.694458/2022-01 and Mackenzie-PrInt No. 88887.310281/2018-00; and by FAPESP through grant 2021/11905-0.
Identifiers: 99384138284006570
Academic Unit: Department of Computing and Software Engineering
Language: English
Resource Type: Book chapter
