Deep learning matches radiologists in breast cancer detection using tomosynthesis
A systematic review and meta-analysis evaluated the diagnostic performance of deep learning (DL) algorithms for digital breast tomosynthesis (DBT) in breast cancer detection. The analysis pooled data from 13 studies encompassing 38,565 patients and compared stand-alone DL or DL-assisted diagnosis against interpretation by radiologists alone, across varying levels of reader experience. The primary outcomes were diagnostic performance metrics, including sensitivity, specificity, and area under the curve (AUC).
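For readers less familiar with these three metrics, the following is a minimal Python sketch of their standard definitions. The function names, the counts, and the scores are hypothetical and chosen only for illustration; none of them come from the review. The AUC is computed in its rank-based form: the probability that a randomly chosen cancer case receives a higher score than a randomly chosen non-cancer case, with ties counted as one half.

```python
from typing import Sequence


def sensitivity(tp: int, fn: int) -> float:
    """Share of true cancers that were flagged: TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("no positive cases")
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of non-cancers that were correctly cleared: TN / (TN + FP)."""
    if tn + fp == 0:
        raise ValueError("no negative cases")
    return tn / (tn + fp)


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Rank-based AUC: fraction of (positive, negative) pairs in which the
    positive case scores higher; a tie counts as half a win."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical reader study (illustrative numbers only, not from the review).
print(sensitivity(tp=88, fn=12))
print(specificity(tn=740, fp=260))
print(round(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]), 2))
```

Sensitivity and specificity depend on the threshold used to call a case positive, whereas the AUC summarizes performance across all thresholds, which is one reason meta-analyses typically report all three.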
The pooled sensitivity for stand-alone DL algorithms was 0.88 (95% CI 0.80-0.93), with a specificity of 0.74 (95% CI 0.59-0.85) and an AUC of 0.89 (95% CI 0.86-0.92). When compared to all radiologists, DL demonstrated an AUC of 0.89 versus 0.88 (P=.64), indicating no statistically significant difference. Similarly, compared to senior radiologists, DL AUC was 0.89 versus 0.90 (P=.48), again showing no significant difference.
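To make the pooled point estimates more concrete, the sketch below converts a sensitivity of 0.88 and a specificity of 0.74 into expected counts for a cohort of 1,000 examinations. The cancer prevalence of 1% is an assumption made purely for illustration and is not taken from the review; the included studies may have used enriched cohorts with very different prevalence, so the output should not be read as a prediction for any real screening program. The function name is likewise hypothetical.

```python
def expected_counts(sens: float, spec: float, prevalence: float, n: int = 1000) -> dict:
    """Expected confusion-matrix counts for n exams at a given prevalence."""
    if not (0.0 <= sens <= 1.0 and 0.0 <= spec <= 1.0 and 0.0 <= prevalence <= 1.0):
        raise ValueError("sens, spec, and prevalence must lie in [0, 1]")
    cancers = n * prevalence
    healthy = n - cancers
    tp = cancers * sens            # cancers flagged
    fn = cancers - tp              # cancers missed
    tn = healthy * spec            # healthy cases correctly cleared
    fp = healthy - tn              # healthy cases flagged (false alarms)
    flagged = tp + fp
    ppv = tp / flagged if flagged > 0 else float("nan")
    return {"tp": tp, "fn": fn, "tn": tn, "fp": fp, "ppv": ppv}


# Pooled stand-alone DL point estimates reported above; prevalence is ASSUMED.
counts = expected_counts(sens=0.88, spec=0.74, prevalence=0.01)
for name, value in counts.items():
    print(f"{name}: {value:.2f}")
```

Under this assumed prevalence, most flagged examinations would be false positives despite the high sensitivity. This shows why specificity matters as much as sensitivity when judging a stand-alone tool.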
However, DL showed significantly higher sensitivity than junior radiologists (0.88 vs. 0.76, P=.03). Importantly, DL assistance did not produce a statistically significant improvement in radiologists' diagnostic metrics, suggesting that current models act primarily as supplementary aids rather than definitive tools. This is consistent with the practice relevance note, which emphasizes DL's role in reducing missed findings among less experienced readers without raising overall human performance.
The review has several limitations. Meta-regression identified validation method as a significant source of heterogeneity, and the review calls for future prospective multimodal studies. The analysis is observational, reporting associations rather than causation, and its findings are limited to studies up to November 8, 2025. Funding and conflicts of interest were not reported.
In clinical practice, DL algorithms for DBT show strong diagnostic performance and higher sensitivity than junior radiologists, suggesting utility as adjunctive tools. However, they do not replace radiologist expertise and should be viewed as supplementary aids to detection, particularly in settings with less experienced readers.
Heterogeneity across the 13 included studies limits the certainty of the pooled estimates. Clinicians should interpret these findings in light of the meta-analytic design and the specific patient populations studied, and should avoid overstating DL capabilities.
Overall, the evidence supports integrating DL into DBT workflows to support radiologists, especially juniors, but not as a standalone solution. Continued research is needed to refine algorithms and validate performance in diverse clinical environments.