A New Similarity Measure for Document Classification and Text Mining


EMİNAĞAOĞLU M., GÖKŞEN Y.

Conference on Economies of the Balkan and Eastern European Countries, Bucharest, Romania, 10-12 May 2019, pp. 353-366

  • Publication Type: Conference Paper / Full-Text Paper
  • Volume Number:
  • DOI: 10.18502/kss.v4i1.5999
  • City of Publication: Bucharest
  • Country of Publication: Romania
  • Pages: pp. 353-366
  • Keywords: text mining, document classification, similarity measures, k-NN, Rocchio algorithm
  • Affiliated with Dokuz Eylül University: Yes

Abstract

Accurate, efficient, and fast processing of textual data and classification of electronic documents has become a key factor in knowledge management and related businesses. Text mining, information retrieval, and document classification systems have a strong positive impact on digital libraries and electronic content management, e-marketing, electronic archives, customer relationship management, decision support systems, and copyright infringement and plagiarism detection, all of which directly affect economics, businesses, and organizations. In this study, we propose a new similarity measure that can be used with the k-nearest neighbors (k-NN) and Rocchio algorithms, two well-known algorithms for document classification, information retrieval, and other text mining tasks. We tested the new similarity measure on several structured textual data sets and compared the results with standard distance metrics and similarity measures such as cosine similarity, Euclidean distance, and the Pearson correlation coefficient. The results are promising and suggest that the proposed similarity measure could be used as an alternative within suitable algorithms, methods, and models for text mining, document classification, and related knowledge management systems.
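
For orientation, the three baseline measures named in the abstract, and the way a similarity function plugs into k-NN, can be sketched as follows. This is a minimal illustration only: the abstract does not define the proposed measure, so it is not reproduced here. The function names, the `knn_classify` helper, and the toy term-frequency vectors and labels are all assumptions made for this example, not material from the paper.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pearson_correlation(u, v):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    centered_u = [a - mean_u for a in u]
    centered_v = [b - mean_v for b in v]
    return cosine_similarity(centered_u, centered_v)


def knn_classify(query, training, k=3, similarity=cosine_similarity):
    """Hypothetical helper: majority vote among the k most similar documents.

    `training` is a list of (vector, label) pairs; `similarity` must return
    larger values for more similar vectors.
    """
    ranked = sorted(training, key=lambda item: similarity(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy term-frequency vectors (invented for illustration).
training = [
    ([3, 0, 1, 0], "sports"),
    ([2, 1, 0, 0], "sports"),
    ([0, 2, 0, 3], "finance"),
    ([0, 1, 1, 2], "finance"),
]
query = [1, 0, 0, 0]

label_cosine = knn_classify(query, training, k=3)
# A distance is turned into a similarity by negating it.
label_euclid = knn_classify(
    query, training, k=3, similarity=lambda u, v: -euclidean_distance(u, v)
)
```

Because `knn_classify` takes the similarity function as a parameter, any alternative measure with the same "larger is more similar" convention could be swapped in without changing the classifier itself.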