A new similarity measure for vector space models in text classification and information retrieval

EMİNAĞAOĞLU, METE

doi:10.1177/0165551520968055

EMİNAĞAOĞLU M.

JOURNAL OF INFORMATION SCIENCE, vol.48, no.4, pp.463-476, 2022 (SCI-Expanded, SSCI, Scopus)

Publication Type: Article / Full Article
Volume: 48 Issue: 4
Publication Date: 2022
DOI: 10.1177/0165551520968055
Journal Name: JOURNAL OF INFORMATION SCIENCE
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Social Sciences Citation Index (SSCI), Scopus, Academic Search Premier, FRANCIS, IBZ Online, Periodicals Index Online, ABI/INFORM, Aerospace Database, Analytical Abstracts, Applied Science & Technology Source, Business Source Elite, Business Source Premier, Communication Abstracts, Compendex, Computer & Applied Sciences, EBSCO Education Source, Education Abstracts, Index Islamicus, Information Science and Technology Abstracts, INSPEC, Library and Information Science Abstracts, Library Literature and Information Science, Library, Information Science & Technology Abstracts (LISTA), Metadex, Civil Engineering Abstracts
Pages: pp.463-476
Keywords: Distance metrics, K-means, k-nearest neighbours, Rocchio classifier, similarity measures, text classification, vector space models
Affiliated with Dokuz Eylül University: Yes

Abstract

Various models, methodologies and algorithms are available today for document classification, information retrieval and other text mining applications and systems. One family is the vector space-based models, which have distance metrics or similarity measures at their core. Vector space-based models are a fast and simple alternative for processing textual data; however, their accuracy, precision and reliability still need significant improvement. In this study, a new similarity measure is proposed that can be used effectively with vector space models and related algorithms such as k-nearest neighbours (k-NN) and Rocchio, as well as with some clustering algorithms such as K-means. The proposed similarity measure is tested on benchmark data sets in Turkish and English, and the results are compared with those of standard metrics: Euclidean distance, Manhattan distance, Chebyshev distance, Canberra distance, Bray-Curtis dissimilarity, Pearson correlation coefficient and Cosine similarity. The results are promising and indicate that the proposed similarity measure could be used as an alternative within suitable algorithms and models for information retrieval, document clustering and text classification.
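
For orientation, the standard baseline measures named in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the study: the proposed similarity measure is not defined in this record and is therefore not implemented here. The function names, the zero-denominator conventions and the small `knn_predict` helper are assumptions made for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def canberra(a, b):
    # Assumption: terms where both components are zero are skipped.
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom > 0:
            total += abs(x - y) / denom
    return total

def bray_curtis(a, b):
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(x + y) for x, y in zip(a, b))
    return num / den if den else 0.0

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: similarity with a zero vector is defined as 0.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pearson(a, b):
    # Pearson correlation equals the cosine similarity of mean-centred vectors.
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])

def knn_predict(query, examples, k=3, similarity=cosine_similarity):
    """Hypothetical k-NN helper: majority vote among the k most similar
    (vector, label) examples, with a pluggable similarity function."""
    ranked = sorted(examples, key=lambda ex: similarity(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because `knn_predict` takes the similarity function as a parameter, any new measure with the same two-vector signature could be substituted for the baseline. The distance functions would first need to be converted to similarities, for example by negation.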