Joint Tomek Links (JTL): An Innovative Approach to Noise Reduction for Enhanced Classification Performance


TÜYSÜZOĞLU G., DOĞAN Y., Kiyak E. O., Ersahin M., Ghasemkhani B., BİRANT K. U., ...

IEEE ACCESS, vol. 13, pp. 123059-123082, 2025 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 13
  • Publication Date: 2025
  • DOI: 10.1109/access.2025.3580290
  • Journal Name: IEEE ACCESS
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Directory of Open Access Journals
  • Pages: pp. 123059-123082
  • Keywords: Noise, Accuracy, Noise measurement, Machine learning, Random forests, Classification algorithms, Nearest neighbor methods, Data mining, Support vector machines, Training, Artificial intelligence, classification, data mining, machine learning, noise reduction, Tomek links
  • Dokuz Eylül University Affiliated: Yes

Abstract

Noisy data is a prevalent issue in data mining, significantly impacting the performance of classification algorithms. Mathematical methods are crucial in tackling this obstacle, particularly in optimizing noise detection and data preprocessing. This study proposes a novel approach, Joint Tomek Links (JTL), to identify and eliminate noisy instances by detecting pairs of nearest neighbors from different classes. It first finds the Tomek links and then refines a probabilistic method to determine which instance from a pair will be removed. In our approach, a random tree classifier serves as the base model. We conducted experiments on 40 benchmark datasets spanning various domains, achieving an average classification accuracy of 83.26% for JTL. The results demonstrate that JTL attains an average improvement of 5.33% in accuracy compared to the original classification with a random tree. Furthermore, JTL surpasses existing techniques, delivering a noteworthy accuracy gain of 12.30% on the same datasets. These findings underscore the effectiveness of JTL in enhancing data quality and boosting classification performance in data mining tasks.
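
The first step mentioned in the abstract, finding Tomek links, can be illustrated with a minimal sketch. It uses the standard definition of a Tomek link: two instances with different class labels that are each other's nearest neighbor. The function name `find_tomek_links`, the choice of Euclidean distance, and the toy data are assumptions made for illustration; this is not the authors' implementation, and the probabilistic rule JTL uses to decide which instance of a pair to remove is not described in this record, so the sketch stops at detection.

```python
from math import dist


def find_tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek links.

    Two points form a Tomek link when they carry different class labels
    and each is the other's nearest neighbor (Euclidean distance is an
    assumption of this sketch).
    """
    n = len(X)
    # nearest[i] is the index of the point closest to X[i], excluding i itself.
    nearest = [
        min((j for j in range(n) if j != i), key=lambda j: dist(X[i], X[j]))
        for i in range(n)
    ]
    links = []
    for i, j in enumerate(nearest):
        # Keep mutual nearest neighbors with different labels; i < j avoids duplicates.
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, j))
    return links


# Hypothetical toy data: points 2 and 3 are close together but differ in class.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0)]
y = [0, 0, 0, 1, 1]
links = find_tomek_links(X, y)
print(links)
```

On this toy data, points 0 and 1 are mutual nearest neighbors but share a label, so they are not linked. Points 2 and 3 are mutual nearest neighbors with different labels, so they form the only candidate pair for noise removal.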