Ensemble MLLM strategies improve diagnostic performance for paediatric pneumonia on CXRs

Key Takeaway
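Ensemble MLLM strategies, particularly soft voting, may improve paediatric pneumonia detection on CXRs, but the evidence comes from a retrospective study and should be applied with caution until validated prospectively.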

This retrospective cohort study evaluated the diagnostic performance of ensemble strategies for multimodal large language models (MLLMs) in detecting paediatric pneumonia on chest X-rays (CXRs). The analysis included 2300 CXRs from paediatric patients across two independent hospital datasets. Fifteen MedGemma-4B-it agents independently classified each CXR into five pneumonia likelihood categories, and three ensemble methods (soft voting, majority voting, and GPTOSS-20B aggregation) were compared against the average performance of a single agent. Primary and secondary outcomes focused on diagnostic metrics, with no follow-up period reported.
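The difference between majority voting and soft voting can be illustrated with a minimal Python sketch. It assumes each agent returns a probability vector over the classes; the abstract does not describe how agent outputs were converted to scores, so the function names, the tie-breaking rule, and the three-class toy data below are all hypothetical and not taken from the study.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the smallest label (an assumption)."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)


def soft_vote(prob_rows):
    """Average the per-agent probability vectors; return (winning class, mean vector)."""
    n_agents = len(prob_rows)
    n_classes = len(prob_rows[0])
    mean = [sum(row[k] for row in prob_rows) / n_agents for k in range(n_classes)]
    best = max(range(n_classes), key=lambda k: mean[k])
    return best, mean


# Hypothetical outputs from three agents over three classes (toy data).
agent_probs = [
    [0.50, 0.45, 0.05],  # agent 1 narrowly prefers class 0
    [0.50, 0.45, 0.05],  # agent 2 narrowly prefers class 0
    [0.05, 0.90, 0.05],  # agent 3 strongly prefers class 1
]

# Each agent's hard label is its highest-probability class.
hard_labels = [max(range(len(row)), key=lambda k: row[k]) for row in agent_probs]

majority_label = majority_vote(hard_labels)
soft_label, mean_probs = soft_vote(agent_probs)
print(majority_label, soft_label, mean_probs)
```

In this toy case the two methods disagree: majority voting counts two narrow votes for class 0, while soft voting lets the one confident agent shift the averaged distribution towards class 1. Averaging also yields a continuous score per class, which is what threshold-free metrics such as AUROC operate on.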

Soft voting achieved statistically significant improvements over the average agent on several metrics. For the primary outcome, one-vs-rest (OvR) AUROC, p-values were 0.0002 in the balanced dataset and 0.0003 in the real-world dataset. Secondary outcomes also improved in both datasets: accuracy (p = 0.0008 and p < 0.0001), Cohen's kappa (p = 0.0006 and p = 0.0054), and one-vs-one (OvO) AUROC (p < 0.0001 and p = 0.0011). The F1-score improved in the balanced dataset only (p = 0.0028). Absolute values for these metrics were not reported.
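For readers unfamiliar with the primary metric, the following Python sketch shows one common way to compute a one-vs-rest AUROC for a multi-class problem: each class in turn is treated as "positive" against all others, a binary AUROC is computed, and the per-class values are averaged. Macro-averaging is an assumption here, since the abstract does not state which averaging scheme the authors used, and the toy labels and scores are invented for illustration.

```python
def binary_auroc(scores, positives):
    """AUROC as the probability that a positive case outranks a negative one.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for s, p in zip(scores, positives) if p]
    neg = [s for s, p in zip(scores, positives) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def ovr_auroc(y_true, prob_rows):
    """Macro-averaged one-vs-rest AUROC (averaging scheme is an assumption)."""
    n_classes = len(prob_rows[0])
    per_class = []
    for k in range(n_classes):
        scores = [row[k] for row in prob_rows]   # score for class k
        positives = [y == k for y in y_true]     # class k vs. the rest
        per_class.append(binary_auroc(scores, positives))
    return sum(per_class) / n_classes


# Toy data: six cases, three classes, hypothetical ensemble probabilities.
y_true = [0, 0, 1, 1, 2, 2]
probs = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.3, 0.5],
]

score = ovr_auroc(y_true, probs)
print(round(score, 4))
```

The one-vs-one (OvO) variant reported as a secondary metric instead averages binary AUROCs over every pair of classes, which makes it less sensitive to class imbalance.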

Safety and tolerability data, including adverse events and discontinuations, were not reported in the study. Limitations were not specified, and funding or conflicts of interest were not reported. The authors see potential for integration into emergency departments, but clinicians should note that the evidence is retrospective and observational, which precludes causal inferences and calls for validation in prospective settings.

Study Details

Study type: Cohort
Evidence: Level 3
Published: Apr 2026
Original Abstract
Background: Paediatric pneumonia is a leading cause of childhood morbidity and mortality worldwide. Chest X-rays (CXR) are an important diagnostic tool in the diagnosis of pneumonia, but shortages in specialist radiology services lead to clinically significant delays in CXR reporting. The ability to communicate findings both to clinicians and laypersons allows MLLMs to be deployed throughout clinical workflows, from image analysis to patient communication. However, MLLMs currently underperform state-of-the-art deep learning classifiers.

Objective: To evaluate the diagnostic accuracy of ensemble strategies with MLLMs compared to the baseline average agent for paediatric radiological pneumonia detection.

Methods: We conducted a retrospective cohort study using paediatric CXRs from two independent hospital datasets totalling 2300 CXRs. Fifteen MedGemma-4B-it agents independently classified each CXR into five pneumonia likelihood categories. Majority voting, soft voting, and GPTOSS-20B aggregation were compared against the average agent performance. The primary metric evaluated was OvR AUROC. Secondary metrics included accuracy, sensitivity, specificity, F1-score, Cohen's kappa, and OvO AUROC.

Results: Soft voting achieved improvements in OvR AUROC (p_balanced = 0.0002, p_real-world = 0.0003), accuracy (p_balanced = 0.0008, p_real-world < 0.0001), Cohen's Kappa (p_balanced = 0.0006, p_real-world = 0.0054) and OvO AUROC (p_balanced < 0.0001, p_real-world = 0.0011) across both datasets, and a superior F1-value (p_balanced = 0.0028) for the balanced dataset.

Conclusion: Soft voting enhances MedGemma's diagnostic discriminatory performance for paediatric radiological pneumonia detection. Our system enables privacy-preserving, near real-time clinical decision support with explainable outputs, having potential for integration into emergency departments. Our system's high specificity supports triage by flagging high-risk radiological pneumonia cases.