A new similarity measure for vector space models in text classification and information retrieval


EMİNAĞAOĞLU M.

JOURNAL OF INFORMATION SCIENCE, cilt.48, sa.4, ss.463-476, 2022 (SCI-Expanded) identifier identifier

  • Yayın Türü: Makale / Tam Makale
  • Cilt numarası: 48 Sayı: 4
  • Basım Tarihi: 2022
  • Doi Numarası: 10.1177/0165551520968055
  • Dergi Adı: JOURNAL OF INFORMATION SCIENCE
  • Derginin Tarandığı İndeksler: Science Citation Index Expanded (SCI-EXPANDED), Social Sciences Citation Index (SSCI), Scopus, Academic Search Premier, FRANCIS, IBZ Online, Periodicals Index Online, ABI/INFORM, Aerospace Database, Analytical Abstracts, Applied Science & Technology Source, Business Source Elite, Business Source Premier, Communication Abstracts, Compendex, Computer & Applied Sciences, EBSCO Education Source, Education Abstracts, Index Islamicus, Information Science and Technology Abstracts, INSPEC, Library and Information Science Abstracts, Library Literature and Information Science, Library, Information Science & Technology Abstracts (LISTA), Metadex, Civil Engineering Abstracts
  • Sayfa Sayıları: ss.463-476
  • Anahtar Kelimeler: Distance metrics, K-means, k-nearest neighbours, Rocchio classifier, similarity measures, text classification, vector space models
  • Dokuz Eylül Üniversitesi Adresli: Evet

Özet

There are various models, methodologies and algorithms that can be used today for document classification, information retrieval and other text mining applications and systems. One of them is the vector space-based models, where distance metrics or similarity measures lie at the core of such models. Vector space-based model is one of the fast and simple alternatives for the processing of textual data; however, its accuracy, precision and reliability still need significant improvements. In this study, a new similarity measure is proposed, which can be effectively used for vector space models and related algorithms such as k-nearest neighbours (k-NN) and Rocchio as well as some clustering algorithms such as K-means. The proposed similarity measure is tested with some universal benchmark data sets in Turkish and English, and the results are compared with some other standard metrics such as Euclidean distance, Manhattan distance, Chebyshev distance, Canberra distance, Bray-Curtis dissimilarity, Pearson correlation coefficient and Cosine similarity. Some successful and promising results have been obtained, which show that this proposed similarity measure could be alternatively used within all suitable algorithms and models for information retrieval, document clustering and text classification.