Turkish word N-gram analyzing algorithms for a large scale Turkish corpus - TurCo


Çebi Y., Dalkılıç G.

International Conference on Information Technology - Coding and Computing, Nevada, United States of America, 5 - 7 April 2004, pp. 236-240

  • Publication Type: Conference Paper / Full Text Paper
  • Volume:
  • DOI: 10.1109/itcc.2004.1286638
  • City: Nevada
  • Country: United States of America
  • Page Numbers: pp. 236-240
  • Affiliated with Dokuz Eylül University: Yes

Abstract

To calculate statistical properties of a language, one first needs to collect samples of that language; such a sample is called a corpus. An unbalanced, large-scale Turkish text corpus (TurCo), with a size of approximately 362 MB and more than 50 million words, was prepared using 12 different resources, including web sites and novels in the Turkish language. Different algorithms were tested to obtain the n-gram (1 ≤ n ≤ 5) values. The efficiency of the algorithms was examined by applying them to each piece of the corpus one by one. Detailed results are given only for the two algorithms implemented without database tables, because all the other algorithms need more than one day to run, which makes those tests impractical.
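The abstract describes counting word n-grams for 1 ≤ n ≤ 5 over pieces of the corpus without database tables. The sketch below is only an illustration of that general technique, not the authors' implementation; the file name, tokenization, and in-memory Counter structure are assumptions made for the example.

```python
# Illustrative sketch of in-memory word n-gram counting (1 <= n <= 5),
# i.e. without database tables. Not the paper's actual algorithm.
from collections import Counter

def count_ngrams(path, max_n=5, encoding="utf-8"):
    """Return a dict mapping n -> Counter of word n-grams in the file."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    with open(path, encoding=encoding) as f:
        for line in f:
            words = line.split()  # naive whitespace tokenization (assumption)
            for n in range(1, max_n + 1):
                for i in range(len(words) - n + 1):
                    counts[n][tuple(words[i:i + n])] += 1
    return counts

# Example usage (file name is hypothetical):
# ngrams = count_ngrams("turco_part01.txt")
# print(ngrams[2].most_common(10))  # ten most frequent bigrams in this piece
```

Counting each corpus piece separately, as in this sketch, mirrors the abstract's approach of applying the algorithms to each piece of the corpus one by one; per-piece counts could then be merged into corpus-wide totals.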