Ensemble MLLM strategies improve diagnostic performance for paediatric pneumonia on CXRs

AI Team Reads Kids' X-Rays Faster Than Waiting for Specialists

medRxiv · Published April 16, 2026 · Study authors: Tan, J.; Tang, P. H. · Editorial oversight: Dr. Lars van Dijk, PhD · Surgical, Procedural & Diagnostic

AI-generated summary of the cited source, checked by automated accuracy review.

Key Takeaway

Ensemble MLLM strategies show promise for paediatric pneumonia CXR analysis, but the evidence is retrospective and observational, so interpret it with caution.

This retrospective cohort study evaluated the diagnostic performance of ensemble strategies using multimodal large language models (MLLMs) for paediatric pneumonia on chest X-rays (CXRs). The analysis included 2300 CXRs from paediatric patients across two independent hospital datasets, comparing three ensemble methods (soft voting, majority voting, and GPTOSS-20B aggregation) against a baseline average agent. Primary and secondary outcomes were diagnostic metrics; no follow-up period was reported.

Main results showed that Soft voting achieved statistically significant improvements in multiple outcomes. For the primary outcome of OvR AUROC, p-values were 0.0002 in balanced and 0.0003 in real-world settings. Secondary outcomes also improved: accuracy had p-values of 0.0008 and <0.0001, Cohen's kappa had 0.0006 and 0.0054, OvO AUROC had <0.0001 and 0.0011, and F1-value had 0.0028 in balanced settings. Absolute numbers for these metrics were not reported.
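For readers unfamiliar with the primary metric, the following is a minimal sketch of a macro-averaged one-vs-rest (OvR) AUROC. It treats each class in turn as "positive" against all others, computes a binary AUROC, and averages the results. The function names, the three-class toy data, and the macro averaging are assumptions made for illustration; the source does not describe how the authors computed the metric.

```python
def binary_auroc(labels, scores):
    """Chance that a random positive case outscores a random negative one.

    Ties count as half a win. This is the rank-based definition of AUROC.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ovr_auroc(y_true, y_prob):
    """Macro-averaged one-vs-rest AUROC (assumes every class appears in y_true)."""
    n_classes = len(y_prob[0])
    aucs = []
    for c in range(n_classes):
        labels = [1 if y == c else 0 for y in y_true]  # class c vs the rest
        scores = [p[c] for p in y_prob]                # predicted probability of class c
        aucs.append(binary_auroc(labels, scores))
    return sum(aucs) / n_classes


# Hypothetical toy data: four images, three classes (not study data)
y_true = [0, 1, 2, 0]
y_prob = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.2, 0.7, 0.1],  # a class-0 image the model gets wrong
]
print(round(ovr_auroc(y_true, y_prob), 3))
```

A value of 1.0 means the model ranks every true case of a class above every other case; 0.5 is no better than chance.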

Safety and tolerability data, including adverse events and discontinuations, were not reported in the study. Limitations were not specified, and funding or conflicts of interest were not reported. The practice relevance suggests potential for integration into emergency departments, but clinicians should note the observational nature of the evidence, which precludes causal inferences and requires validation in prospective settings.

Fifteen AI agents working together spot pneumonia on children's chest X-rays more accurately.
Helps kids in ERs where radiologists are backed up or not available.
Still early — tested on past scans, not yet used in live hospitals.

A new study shows that pooling the opinions of many AI agents can catch childhood pneumonia on X-rays with surprising accuracy, offering hope for faster care in busy emergency rooms.

When every minute matters

Picture a worried parent sitting in an ER at 2 a.m. Their child is coughing hard and breathing fast. The doctor orders a chest X-ray.

Now the wait begins. A radiologist has to read that image before treatment decisions get made. In many hospitals, that specialist is not on site overnight.

Those delays can stretch for hours. And for a sick child, hours feel like forever.

A common illness with a hidden bottleneck

Pneumonia is one of the top reasons kids get very sick around the world. It inflames the tiny air sacs in the lungs, making breathing hard and painful.

Chest X-rays are a key tool for diagnosing it. But the problem is not the X-ray itself.

The problem is finding someone trained to read it quickly.

Many hospitals, especially smaller ones or those in rural areas, simply do not have enough radiologists. That shortage creates backlogs. Sick kids wait longer. Treatment starts later.

The old way of using AI

For years, scientists have tried to use AI to read X-rays. Most of these tools are called "deep learning classifiers."

They are very good at one job: saying "pneumonia" or "not pneumonia." But they cannot explain their thinking. They cannot talk to a doctor or a parent.

Newer AI models, called multimodal large language models (MLLMs), can do both. They can look at an image and then chat about what they see in plain language.

But here's the twist: until now, these chatty AI models have been less accurate than the older, silent ones. Researchers wanted to close that gap.

Many minds are better than one

So a team tried something clever. Instead of using one AI, they used fifteen.

Think of it like a medical panel. If you had one doctor look at an X-ray, you might get one opinion. But if fifteen doctors looked and then took a vote, you would probably get a more reliable answer.

That is the idea behind "ensemble" AI. Many agents look at the same image. Their answers get combined in different ways to reach a final call.

The researchers tested three ways of combining answers. One was a simple majority vote. Another, called "soft voting," weighed how confident each agent felt. A third used another AI to make sense of everyone's opinions.
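To make the difference between the first two methods concrete, here is a minimal sketch. It assumes each agent reports a probability for each of five likelihood categories; the function names and numbers are hypothetical, and the study may have derived agent confidence differently.

```python
from collections import Counter


def majority_vote(agent_probs):
    """Each agent votes for its single top category; the most common vote wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in agent_probs]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(agent_probs):
    """Average the agents' probabilities per category, then pick the highest."""
    n_classes = len(agent_probs[0])
    mean = [sum(p[c] for p in agent_probs) / len(agent_probs)
            for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


# Three hypothetical agents, five categories (index 0 = least likely pneumonia)
agents = [
    [0.05, 0.35, 0.30, 0.20, 0.10],  # leans toward category 1, but unsure
    [0.05, 0.35, 0.30, 0.20, 0.10],  # same hesitant lean
    [0.00, 0.00, 0.05, 0.15, 0.80],  # very confident in category 4
]

print(majority_vote(agents))  # two hesitant votes outnumber one confident vote
print(soft_vote(agents))      # the confident agent shifts the average
```

In this toy case the two methods disagree: majority voting counts heads, while soft voting lets one very confident agent outweigh two hesitant ones.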

The study in plain terms

The team pulled 2,300 children's chest X-rays from two independent hospital datasets. Because the two sets of images came from different places, the results help show whether the method works in more than one setting.

They ran each image through fifteen copies of a model called MedGemma-4B. Each AI rated the likelihood of pneumonia on a five-point scale. Then the voting methods compared notes.

The test was retrospective, meaning the AI looked at old scans that had already been reviewed by humans.

Soft voting won. Across both hospital datasets, it beat the average single-agent performance on nearly every measure that mattered.

Accuracy went up. Agreement with expert readings went up. And importantly, specificity was high, meaning the system did a good job of not crying wolf on healthy kids.

This doesn't mean this AI is reading X-rays in your local ER yet.

The numbers looked strong enough to suggest real promise. In statistics speak, the improvements were highly significant, with p-values well below the conventional 0.05 threshold.

Why doctors might actually trust this one

One feature sets this tool apart. Because it uses a language model, it can explain what it sees.

Instead of just flagging "pneumonia suspected," it can point to which parts of the lung look concerning and why. That transparency matters. Doctors are far more likely to trust AI they can question.

The system also runs locally, which means patient images never have to leave the hospital. That helps protect privacy — a big deal for children's medical records.

What this could mean for families

Right now, this is research. It is not a tool your pediatrician can use tomorrow.

But if future studies hold up, it could change how ERs handle suspected pneumonia overnight. The AI could flag high-risk cases for faster treatment, while low-risk images wait for the morning radiologist.

If your child needs a chest X-ray today, nothing about your care changes. Keep following your doctor's advice. Ask questions. Trust the process.

What the study could not prove

This study looked backward at stored images. It did not watch real doctors use the tool with real patients in real time.

That matters. A tool that works well on a clean dataset may act differently in a chaotic ER. The study also only tested one AI model in one size. Bigger or smaller models might perform differently.

And 2,300 X-rays, while meaningful, is still a limited sample compared to the millions of pediatric X-rays taken each year.

Where this research goes next

The next step is testing the system in live hospital settings. Researchers need to see how it performs when ER doctors use it during real shifts, with real time pressure.

After that come larger trials across more hospitals and more diverse patient groups. Regulatory review would follow before any tool like this could be formally deployed.

Medical AI moves carefully, and for good reason. When children's health is on the line, getting it right matters more than getting it fast.

Study Details

Study type: Cohort

Evidence: Level 3

Published: Apr 2026

Original Abstract

Background: Paediatric pneumonia is a leading cause of childhood morbidity and mortality worldwide. Chest X-rays (CXR) are an important diagnostic tool in the diagnosis of pneumonia, but shortages in specialist radiology services lead to clinically significant delays in CXR reporting. The ability to communicate findings both to clinicians and laypersons allows MLLMs to be deployed throughout clinical workflows, from image analysis to patient communication. However, MLLMs currently underperform state-of-the-art deep learning classifiers.

Objective: To evaluate the diagnostic accuracy of ensemble strategies with MLLMs compared to the baseline average agent for paediatric radiological pneumonia detection.

Methods: We conducted a retrospective cohort study using paediatric CXRs from two independent hospital datasets totalling 2300 CXRs. Fifteen MedGemma-4B-it agents independently classified each CXR into five pneumonia likelihood categories. Majority voting, soft voting, and GPTOSS-20B aggregation were compared against the average agent performance. The primary metric evaluated was OvR AUROC. Secondary metrics included accuracy, sensitivity, specificity, F1-score, Cohen's kappa, and OvO AUROC.

Results: Soft voting achieved improvements in OvR AUROC (p_balanced = 0.0002, p_real-world = 0.0003), accuracy (p_balanced = 0.0008, p_real-world < 0.0001), Cohen's kappa (p_balanced = 0.0006, p_real-world = 0.0054) and OvO AUROC (p_balanced < 0.0001, p_real-world = 0.0011) across both datasets, and a superior F1-value (p_balanced = 0.0028) for the balanced dataset.

Conclusion: Soft voting enhances MedGemma's diagnostic discriminatory performance for paediatric radiological pneumonia detection. Our system enables privacy-preserving, near real-time clinical decision support with explainable outputs, having potential for integration into emergency departments. Our system's high specificity supports triage by flagging high-risk radiological pneumonia cases.
