A New Similarity Measure for Document Classification and Text Mining

EMİNAĞAOĞLU M., GÖKŞEN Y.

Conference on Economies of the Balkan and Eastern European Countries, Bucharest, Romania, 10-12 May 2019, pp. 353-366 (Full-Text Paper)

Publication Type: Conference Paper / Full-Text Paper
Volume Number:
DOI: 10.18502/kss.v4i1.5999
City of Publication: Bucharest
Country of Publication: Romania
Pages: pp. 353-366
Keywords: text mining, document classification, similarity measures, k-NN, Rocchio algorithm
Open Archive Collection: AVESİS Open Access Collection
Affiliated with Dokuz Eylül University: Yes

Abstract

Accurate, efficient, and fast processing of textual data and classification of electronic documents have become a key factor in knowledge management and related businesses. Text mining, information retrieval, and document classification systems have a strong positive impact on digital libraries, electronic content management, e-marketing, electronic archives, customer relationship management, decision support systems, and the detection of copyright infringement and plagiarism, all of which directly affect economies, businesses, and organizations. In this study, we propose a new similarity measure that can be used with the k-nearest neighbors (k-NN) and Rocchio algorithms, two well-known algorithms for document classification, information retrieval, and other text mining purposes. We tested the proposed similarity measure on several structured textual data sets and compared the results with standard distance metrics and similarity measures such as cosine similarity, Euclidean distance, and the Pearson correlation coefficient. The results are promising and suggest that the proposed similarity measure could be used as an alternative within suitable algorithms, methods, and models for text mining, document classification, and related knowledge management systems.
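
For readers unfamiliar with the baselines named in the abstract, the following is a minimal sketch of cosine similarity, Euclidean distance, and the Pearson correlation coefficient, plugged into a simple k-NN majority vote. It does not implement the proposed similarity measure, which the abstract does not define. The function names, the toy term-frequency vectors, and the labels are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # convention chosen here for constant vectors
    return cov / math.sqrt(var_a * var_b)


def knn_classify(query, training, similarity, k=3):
    """Label the query by majority vote among the k most similar items.

    training is a list of (vector, label) pairs; similarity must return
    larger values for more similar vectors.
    """
    ranked = sorted(training, key=lambda item: similarity(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy term-frequency vectors (hypothetical data, not from the paper).
training = [
    ([3, 0, 1, 0], "sports"),
    ([2, 1, 0, 0], "sports"),
    ([0, 2, 0, 3], "finance"),
    ([0, 1, 1, 2], "finance"),
]
query = [1, 0, 0, 0]

print(knn_classify(query, training, cosine_similarity))
print(knn_classify(query, training, pearson_correlation))
# A distance is turned into a similarity by negating it.
print(knn_classify(query, training, lambda a, b: -euclidean_distance(a, b)))
```

Because the similarity function is passed in as a parameter, any alternative measure with the same signature could be swapped in and compared against these baselines on the same data.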