Evaluating machine learning-based PM2.5 estimation using integrated high-resolution datasets across NDVI levels in an urban-industrialized region


Elbir T., Tuna Tuygun G., Gündoğdu S., Bilgiç E.

Environmental Pollution, vol. 382, 2025 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 382
  • Publication Date: 2025
  • DOI: 10.1016/j.envpol.2025.126734
  • Journal Name: Environmental Pollution
  • Journal Indexed In: Science Citation Index Expanded (SCI-EXPANDED), Scopus, PASCAL, Aerospace Database, Aqualine, Aquatic Science & Fisheries Abstracts (ASFA), BIOSIS, Biotechnology Research Abstracts, CAB Abstracts, Chemical Abstracts Core, Chimica, Communication Abstracts, Compendex, EMBASE, Environment Index, Food Science & Technology Abstracts, Geobase, Greenfile, MEDLINE, Metadex, Pollution Abstracts, Veterinary Science Database, Civil Engineering Abstracts
  • Keywords: Finer-resolution data, Machine learning, NDVI-Based categorization, PM2.5 estimation
  • Affiliated with Dokuz Eylül University: Yes

Abstract

Air pollution remains a critical public health and environmental challenge in rapidly urbanizing and industrialized regions worldwide. The Marmara Region of Türkiye, including the megacity of Istanbul, exemplifies such complexity due to intense industrial activity, dense population, and diverse land use. This study presents an innovative framework for estimating daily mean PM2.5 concentrations in the Marmara Region, where complex emissions and meteorological conditions make accurate prediction vital for effective air quality management. Four advanced machine learning models – Random Forest, Extreme Gradient Boosting, Categorical Boosting, and Light Gradient Boosting Machine (LightGBM) – were employed using a unique combination of high-resolution datasets, including MAIAC Aerosol Optical Depth at a 1-km resolution, ERA5 meteorological reanalysis, EDGARv8.1 emission inventories, Normalized Difference Vegetation Index (NDVI), Corine Land Cover, and Gridded Population of the World demographic data. LightGBM achieved the highest performance (R = 0.88, RMSE = 6.42 μg/m³), providing robust predictions across seasons and locations. Analysis based on NDVI revealed that areas with low vegetation had weaker model performance (R = 0.83), while other categories showed consistent performance (R ≈ 0.88–0.89). Notably, RMSE values improved as NDVI increased. Seasonal modeling showed the lowest performance in winter (R = 0.82) and the highest in autumn (R = 0.89). Feature importance analysis identified boundary layer height, solar radiation, and population density as key predictors, highlighting the interplay between atmospheric processes and human activities. Compared to existing studies, our approach, integrating multiple high-resolution datasets, effectively captures PM2.5 variability in complex urban environments. This study enhances understanding of PM2.5 dynamics in highly urbanized and industrialized regions and offers a scalable framework for high-resolution air quality modeling.
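
The abstract reports model skill as a correlation coefficient (R) and RMSE, broken down by NDVI category. The following is a minimal sketch of how such a per-category evaluation might be computed. It is not the authors' code: R is assumed here to be the Pearson correlation, and the NDVI bin edges, category labels, and function names are hypothetical, since the abstract does not state them.

```python
import numpy as np

# Hypothetical NDVI bin edges and labels; the abstract does not give thresholds.
NDVI_EDGES = [0.2, 0.4, 0.6]
NDVI_LABELS = ["low", "moderate", "high", "very high"]


def pearson_r(obs, pred):
    """Pearson correlation between observed and predicted PM2.5."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(np.corrcoef(obs, pred)[0, 1])


def rmse(obs, pred):
    """Root mean square error, in the same units as the inputs."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))


def metrics_by_ndvi(obs, pred, ndvi, edges=NDVI_EDGES, labels=NDVI_LABELS):
    """Compute sample count, R, and RMSE separately for each NDVI category."""
    obs, pred, ndvi = (np.asarray(a, dtype=float) for a in (obs, pred, ndvi))
    # np.digitize returns 0 for values below the first edge,
    # and len(edges) for values at or above the last edge.
    bins = np.digitize(ndvi, edges)
    results = {}
    for i, label in enumerate(labels):
        mask = bins == i
        if mask.sum() < 2:  # correlation needs at least two samples
            continue
        results[label] = {
            "n": int(mask.sum()),
            "R": pearson_r(obs[mask], pred[mask]),
            "RMSE": rmse(obs[mask], pred[mask]),
        }
    return results
```

Calling `metrics_by_ndvi(observed, predicted, ndvi)` on held-out station data would give one R and RMSE pair per vegetation category, which is the kind of comparison the abstract summarizes.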