Diagnostic Accuracy of Machine Learning Algorithms in Electrocardiogram-Based Heart Failure Detection: A Systematic Review and Meta-Analysis

Kılıç, Mustafa; Arayıcı, Mehmet; Akbilgiç, Oğuz; Yılmaz, Mehmet

doi:10.1016/j.cjca.2025.12.022

Kılıç M. E., Arayıcı M. E., Akbilgiç O., Yılmaz M. B.

CANADIAN JOURNAL OF CARDIOLOGY, vol. 42, pp. 1-16, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Review
Volume: 42
Publication Date: 2026
DOI: 10.1016/j.cjca.2025.12.022
Journal Name: CANADIAN JOURNAL OF CARDIOLOGY
Journal Indexes: Scopus, Science Citation Index Expanded (SCI-EXPANDED), BIOSIS, EMBASE, MEDLINE
Pages: pp. 1-16
Affiliated with Dokuz Eylül University: Yes

Abstract

Background

Artificial intelligence (AI) applied to the electrocardiogram (ECG) shows promise for detecting heart failure (HF), but reported performance is heterogeneous. A key ambiguity is the conflation of two distinct diagnostic targets: the structural abnormality of left ventricular systolic dysfunction (LVSD) and the clinical syndrome of HF.

Methods

Following PRISMA-DTA guidelines, this systematic review and meta-analysis analyzed 40 unique, non-overlapping patient cohorts. Diagnostic accuracy was synthesized using a hierarchical bivariate model, addressing ejection fraction (EF) threshold heterogeneity via stratification and multi-threshold analysis. Prespecified bivariate meta-regressions examined covariates including external validation status, lead configuration, and model architecture. A secondary analysis evaluated HF classification models.
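
As a rough illustration of what feeds a bivariate model, each study's 2×2 table is commonly converted to logit-transformed sensitivity and specificity with approximate within-study variances. The sketch below is not the authors' code: the function name `logit_accuracy`, the study counts, and the 0.5 continuity correction are assumptions made for illustration, and the between-study (hierarchical) part of the model is not fitted here.

```python
import math


def logit_accuracy(tp, fn, tn, fp, correction=0.5):
    """Return sensitivity, specificity, their logits, and approximate
    within-study variances of the logits for one 2x2 table."""
    # Assumption: add a continuity correction only when a cell is zero,
    # so that the logits and variances stay finite.
    if 0 in (tp, fn, tn, fp):
        tp, fn, tn, fp = (x + correction for x in (tp, fn, tn, fp))

    sens = tp / (tp + fn)
    spec = tn / (tn + fp)

    def logit(p):
        return math.log(p / (1 - p))

    return {
        "sens": sens,
        "spec": spec,
        "logit_sens": logit(sens),
        "logit_spec": logit(spec),
        # Large-sample variance approximations for the logits
        "var_logit_sens": 1 / tp + 1 / fn,
        "var_logit_spec": 1 / tn + 1 / fp,
    }


# Hypothetical counts for a single study (not taken from the review)
result = logit_accuracy(tp=86, fn=14, tn=162, fp=38)
```

A full bivariate analysis would then model the pair of logits across studies with a correlated random effect, which typically requires a mixed-model package rather than a few lines of code.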

Results

The primary analysis yielded a pooled sensitivity of 85.9% (95% CI 82.8–88.5%) and specificity of 80.9% (95% CI 75.8–85.1%), with a hierarchical summary receiver operating characteristic area under the curve (HSROC AUC) of 0.902 (95% CI 0.885–0.915). Performance varied significantly by target definition (LVSD vs. clinical HF) and EF threshold used. Meta-regression revealed that 12-lead ECGs (p=0.003) and convolutional neural network architectures (p=0.024) were associated with higher specificity. The secondary analysis (7 studies) yielded pooled sensitivity of 96.2% and specificity of 92.1%.
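
To make the pooled estimates easier to interpret, they can be converted into likelihood ratios and, for a given prevalence, predictive values. The sketch below uses the pooled sensitivity (85.9%) and specificity (80.9%) from the primary analysis; the 10% prevalence is a hypothetical value chosen for illustration, and the derived figures are simple arithmetic on the summary point, not results reported by the authors.

```python
def derived_metrics(sens, spec, prevalence):
    """Derive likelihood ratios, diagnostic odds ratio, and predictive
    values from a sensitivity/specificity pair and an assumed prevalence."""
    lr_pos = sens / (1 - spec)      # positive likelihood ratio
    lr_neg = (1 - sens) / spec      # negative likelihood ratio
    dor = lr_pos / lr_neg           # diagnostic odds ratio

    # Bayes' rule for predictive values at the assumed prevalence
    ppv = sens * prevalence / (sens * prevalence + (1 - spec) * (1 - prevalence))
    npv = spec * (1 - prevalence) / (spec * (1 - prevalence) + (1 - sens) * prevalence)

    return {"lr_pos": lr_pos, "lr_neg": lr_neg, "dor": dor, "ppv": ppv, "npv": npv}


# Pooled estimates from the primary analysis; prevalence is hypothetical
metrics = derived_metrics(sens=0.859, spec=0.809, prevalence=0.10)
```

With these inputs the positive likelihood ratio comes out at roughly 4.5. Because predictive values shift with prevalence, rerunning the function with a prevalence that matches the intended screening population gives a more relevant picture.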

Conclusions

AI-ECG demonstrates substantial but variable diagnostic performance that depends critically on target condition definition, EF thresholds, and methodological factors. Implementation must account for these dependencies and utilize precise, standardized endpoints.