CANADIAN JOURNAL OF CARDIOLOGY, vol. 42, no. 4, pp. 824-839, 2026 (SCI-Expanded, Scopus)
Background: Artificial intelligence (AI) applied to the electrocardiogram shows promise for detecting heart failure (HF), but reported performance is heterogeneous. A key ambiguity is the conflation of 2 distinct diagnostic targets: the structural abnormality of left ventricular systolic dysfunction and the clinical syndrome of HF. Methods: In this systematic review and meta-analysis, conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses of Diagnostic Test Accuracy Studies (PRISMA-DTA) guidelines, we analyzed 40 unique, nonoverlapping patient cohorts. Diagnostic accuracy was synthesized using a hierarchical bivariate model, with ejection fraction (EF) threshold heterogeneity addressed through stratification and multithreshold analysis. Prespecified bivariate meta-regressions were used to examine covariates including external validation status, lead configuration, and model architecture. In a secondary analysis, we evaluated HF classification models. Results: The primary analysis yielded a pooled sensitivity of 85.9% (95% confidence interval [CI], 82.8%-88.5%) and specificity of 80.9% (95% CI, 75.8%-85.1%), with a hierarchical summary receiver operating characteristic area under the curve of 0.902 (95% CI, 0.885-0.915). Performance varied significantly according to the target definition (left ventricular systolic dysfunction vs clinical HF) and the EF threshold used. Meta-regression revealed that 12-lead electrocardiograms (P = 0.003) and convolutional neural network architectures (P = 0.024) were associated with higher specificity values. The secondary analysis (7 studies) yielded a pooled sensitivity of 96.2% and specificity of 92.1%. Conclusions: AI applied to the electrocardiogram shows substantial but variable diagnostic performance that depends critically on target condition definition, EF thresholds, and methodological factors. Implementation must account for these dependencies and use precise, standardized end points.
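Illustrative note: a bivariate diagnostic accuracy model typically takes, for each study, the logit-transformed sensitivity and specificity together with their standard errors. The Python sketch below shows only that per-study step, not the pooling itself and not necessarily the authors' procedure. The function name `logit_accuracy`, the 0.5 continuity correction applied to every cell, the Wald-type interval, and the example counts are all assumptions made for illustration; none of them comes from the abstract.

```python
import math


def logit_accuracy(tp, fn, tn, fp, z=1.96):
    """Per-study sensitivity and specificity on the logit scale.

    Hypothetical helper for illustration. Uses a 0.5 continuity
    correction on every cell (an assumption) and a Wald-type
    interval back-transformed from the logit scale.
    """

    def summarize(events, total):
        # Corrected counts keep the logit finite when a cell is zero.
        pos = events + 0.5
        neg = (total - events) + 0.5
        logit = math.log(pos / neg)
        se = math.sqrt(1.0 / pos + 1.0 / neg)

        def inv_logit(x):
            return 1.0 / (1.0 + math.exp(-x))

        return {
            "estimate": events / total,  # raw proportion, uncorrected
            "logit": logit,
            "se": se,
            "ci": (inv_logit(logit - z * se), inv_logit(logit + z * se)),
        }

    return {
        "sensitivity": summarize(tp, tp + fn),
        "specificity": summarize(tn, tn + fp),
    }


# Made-up 2x2 counts, not data from any study in the review.
result = logit_accuracy(tp=86, fn=14, tn=81, fp=19)
print(result["sensitivity"]["estimate"], result["sensitivity"]["ci"])
print(result["specificity"]["estimate"], result["specificity"]["ci"])
```

In a full analysis, the logit pairs and their standard errors from all cohorts would be passed to a bivariate random-effects model that estimates the pooled values and the between-study correlation. That step requires a dedicated mixed-model routine and is omitted here.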