Comparison of different lemmatization approaches for information retrieval on Turkish text collection


ÖZTÜRKMENOĞLU O., ALPKOÇAK A.

International Symposium on INnovations in Intelligent SysTems and Applications, INISTA 2012, Trabzon, Türkiye, 2-4 July 2012

  • Publication Type: Conference Paper / Full-Text Paper
  • Volume number:
  • DOI: 10.1109/inista.2012.6246934
  • City of Publication: Trabzon
  • Country of Publication: Türkiye
  • Keywords: Information Retrieval, Lemmatization, Normalization, Turkish Information Retrieval
  • Affiliated with Dokuz Eylül University: Yes

Abstract

In this paper, we compare the performance of different lemmatization approaches for information retrieval (IR) over a Turkish text collection. A lemma is the "dictionary form" of a word, and lemmatization is the process of determining the lemma of a given word so that the different inflected forms of that word can be analyzed as a single item. We compared three lemmatizers and one fixed-length truncation approach. The first lemmatizer is based on a morphological analyzer for Turkish built with finite-state language processing technology; the second is the Dictionary-based Turkish Lemmatizer (DTL), which uses a radix-trie data structure; the third is a simple dictionary-based top-down parser; and the last approach truncates words at a fixed length. We assessed the lemmatizers on the Bilkent University Milliyet collection, which contains more than 400K documents. Performance was compared within an IR system using well-known IR evaluation metrics. The results show that lemmatization improves IR performance. The best results were achieved with DTL, the radix-trie-based lemmatizer, which also yielded the smallest number of terms in the IR system. © 2012 IEEE.
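
To make two of the compared ideas concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a plain (uncompressed) character trie in place of the radix trie used by DTL, a tiny hypothetical lemma dictionary, already-lowercased input, and a longest-prefix lookup rule that relies on Turkish being a suffixing language. The names `PrefixTrie`, `lemmatize`, and `truncate`, and the truncation length of 5, are assumptions chosen for illustration only.

```python
END = ""  # end-of-lemma marker; cannot collide with a single character key


class PrefixTrie:
    """Plain character trie holding dictionary lemmas (hypothetical stand-in
    for a radix trie, which would additionally merge single-child chains)."""

    def __init__(self, lemmas=()):
        self.root = {}
        for lemma in lemmas:
            self.add(lemma)

    def add(self, lemma):
        node = self.root
        for ch in lemma:
            node = node.setdefault(ch, {})
        node[END] = True

    def longest_prefix(self, word):
        """Return the longest stored lemma that is a prefix of word, or None."""
        node, best = self.root, None
        for i, ch in enumerate(word):
            node = node.get(ch)
            if node is None:
                break
            if END in node:
                best = word[: i + 1]
        return best


def lemmatize(word, trie):
    """Dictionary lookup: fall back to the surface form if no lemma matches."""
    return trie.longest_prefix(word) or word


def truncate(word, length=5):
    """Fixed-length truncation: keep only the first `length` characters."""
    return word[:length]


# Hypothetical three-entry dictionary, for illustration only.
trie = PrefixTrie(["kitap", "ev", "göz"])
print(lemmatize("kitaplar", trie))   # kitap
print(lemmatize("evlerimiz", trie))  # ev
print(truncate("evlerimiz"))         # evler
```

In this sketch, truncation maps "evlerimiz" to "evler" while the dictionary lookup maps it to "ev", which illustrates how a dictionary-based approach can conflate more inflected forms into a single index term. A pure prefix rule does not handle stem changes such as consonant softening, which a real lemmatizer would need to address.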