Effects of diacritics on Turkish information retrieval


ALPKOÇAK A., Ceylan M.

TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, vol.20, no.5, pp.787-804, 2012 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 20 Issue: 5
  • Publication Date: 2012
  • DOI Number: 10.3906/elk-1010-819
  • Journal Name: TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, TR DİZİN (ULAKBİM)
  • Pages: pp.787-804
  • Keywords: Turkish information retrieval, diacritics, document expansion, query expansion
  • Dokuz Eylül University Affiliated: Yes

Abstract

We investigate the effects of improper use of diacritics in the Turkish alphabet on information retrieval. A diacritic is a supplementary sign added to a letter to change its sound value, and the Turkish alphabet has 5 special letters derived from Latin by adding different diacritics. The statistical analysis performed in this study shows that retrieval performance decreases significantly when documents and queries use different letter forms, that is, when documents contain letters with diacritics while queries contain only standard Latin letters, or vice versa. To tackle this problem, we propose 3 approaches: token normalization by equivalence classes, document expansion, and query expansion. Experimental evaluations carried out on the Bilkent Turkish information retrieval test collection suggest that the proposed approaches are promising remedies.
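
To make two of the three approaches more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the 5 special letters are ç, ğ, ö, ş, and ü, that tokens are already lowercased, and that the dotted/dotless i pair is out of scope. The names `normalize` and `expand` are hypothetical. `normalize` illustrates token normalization by equivalence classes (every variant maps to one plain Latin representative), and `expand` illustrates expansion by generating every diacritic variant of a token, which could be applied to queries or, at indexing time, to documents.

```python
from itertools import product

# Assumed mapping between the five letters with diacritics and their
# plain Latin counterparts. The dotted/dotless i pair is left out.
DIACRITIC_TO_PLAIN = {"ç": "c", "ğ": "g", "ö": "o", "ş": "s", "ü": "u"}
PLAIN_TO_DIACRITIC = {plain: dia for dia, plain in DIACRITIC_TO_PLAIN.items()}


def normalize(token: str) -> str:
    """Map a lowercased token to its equivalence-class representative.

    The representative chosen here is the plain Latin spelling.
    """
    return "".join(DIACRITIC_TO_PLAIN.get(ch, ch) for ch in token)


def expand(token: str) -> set:
    """Return every spelling of a token obtained by toggling diacritics."""
    choices = []
    for ch in normalize(token):
        if ch in PLAIN_TO_DIACRITIC:
            # This position can hold either the plain or the diacritic form.
            choices.append((ch, PLAIN_TO_DIACRITIC[ch]))
        else:
            choices.append((ch,))
    return {"".join(combo) for combo in product(*choices)}


if __name__ == "__main__":
    print(normalize("göz"))
    print(sorted(expand("su")))
```

With normalization, a query typed as "goz" and a document containing "göz" meet at the same index term. With expansion, the number of variants doubles for each ambiguous letter, so a practical system would likely need to prune variants, for example against a vocabulary; that pruning step is an assumption and is not described in the abstract.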