Word statistics of Turkish language on a large scale text corpus - TurCo

Dalkiliç G., Çebi Y.

International Conference on Information Technology: Coding Computing, ITCC 2004, Las Vegas, NV, United States, 5-7 April 2004, vol. 2, pp. 319-323 (Full-Text Paper)

Publication Type: Conference Paper / Full-Text Paper
Volume: 2
DOI: 10.1109/itcc.2004.1286654
City of Publication: Las Vegas, NV
Country of Publication: United States
Pages: pp. 319-323
Affiliated with Dokuz Eylül University: Yes

Abstract

Determining the statistical properties of a natural language is one of the most important parts of language analysis. The Number of Different Words (NODW) and the Different Word Usage Ratio (DWUR) are among the general characteristics of a corpus. These values are described and calculated for the Turkish Corpus (TurCo). Word n-grams, which were computed for English years ago but could not previously be computed for Turkish because no large-scale corpus was available, are also calculated. The n-gram results were compared with those from the Brown corpus (a well-known corpus of English), and the similarity between TurCo and the Brown corpus was examined.
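
The quantities named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that DWUR is the ratio of distinct words to total words, a definition the abstract itself does not spell out, and it uses a naive letter-run tokenizer. The function names (`turkish_lower`, `tokenize`, `corpus_stats`, `word_ngrams`) and the sample sentence are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def turkish_lower(text):
    # Python's str.lower() maps "I" to "i" and "İ" to "i" plus a combining dot,
    # which is wrong for Turkish, so the two dotted/dotless pairs are mapped first.
    return text.replace("İ", "i").replace("I", "ı").lower()


def tokenize(text):
    # Naive tokenizer (an assumption): every run of Unicode letters is one word.
    return re.findall(r"[^\W\d_]+", turkish_lower(text))


def corpus_stats(tokens):
    # Returns (total words, NODW, DWUR).
    # ASSUMPTION: DWUR = NODW / total words; the abstract does not define it.
    total = len(tokens)
    nodw = len(set(tokens))
    dwur = nodw / total if total else 0.0
    return total, nodw, dwur


def word_ngrams(tokens, n):
    # Counts every contiguous sequence of n words.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


if __name__ == "__main__":
    sample = "Bir gün bir adam, bir gün!"  # hypothetical sample text
    tokens = tokenize(sample)
    print(corpus_stats(tokens))
    print(word_ngrams(tokens, 2).most_common(3))
```

On a real corpus the tokenizer would likely need to handle apostrophes, hyphenation, and numerals, and the counts would be accumulated file by file rather than held in a single list.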