Simulation study of frontier AI models analyzing 10,000 synthetic Multiple Sclerosis cases reveals variable treatment safety

AI Doctors Miss Critical Clues in 10,000 Tests

AI-generated summary of the cited source, checked by automated accuracy review.

Key Takeaway
In synthetic MS simulations, frontier AI models usually listed the right diagnosis but varied widely in treatment safety, which argues for large-scale testing before clinical deployment.

This publication is a simulation study involving 10,000 synthetic Multiple Sclerosis cases, expanded from an initial 1,000. Frontier artificial intelligence models, specifically Gemini 3 Pro/Flash and GPT 5.2/5 mini, were instructed to analyze these cases. An automated evaluator compared the models' anatomical localization, differential diagnoses, investigations, and management plans against ground-truth labels, and blinded subspecialty experts validated 70 cases for realism and evaluator accuracy. The study population consisted entirely of synthetic cases rather than real patients.
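The comparison of model output against ground-truth labels can be pictured with a small sketch. The source does not describe how the study's evaluator works, so everything here is an assumption made for illustration: the `GroundTruth` and `ModelOutput` structures, the `evaluate` function, and the exact-match comparison are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GroundTruth:
    # Hypothetical label structure; the study's real schema is not given.
    diagnosis: str
    contraindicated: set[str] = field(default_factory=set)


@dataclass
class ModelOutput:
    # Hypothetical container for one model's answer to one case.
    differential: list[str]
    management: list[str]


def evaluate(truth: GroundTruth, output: ModelOutput) -> dict[str, bool]:
    """Score one model output against one case's ground-truth labels."""
    differential = {d.strip().lower() for d in output.differential}
    management = {m.strip().lower() for m in output.management}
    forbidden = {c.strip().lower() for c in truth.contraindicated}
    return {
        # Did the model list the true diagnosis anywhere in its differential?
        "diagnosis_in_differential": truth.diagnosis.lower() in differential,
        # Did the model recommend anything labelled unsafe for this case?
        "unsafe_management": bool(management & forbidden),
    }


truth = GroundTruth("multiple sclerosis", {"intravenous thrombolysis"})
output = ModelOutput(
    differential=["Multiple sclerosis", "Neuromyelitis optica"],
    management=["MRI brain and spine", "Intravenous thrombolysis"],
)
result = evaluate(truth, output)
print(result)
```

Scoring diagnosis and management separately mirrors the study's observation that a model can get the first right and the second wrong. A real evaluator would need far more than string matching, for example to handle synonyms and free-text plans.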

Key findings: blinded subspecialist review of 70 cases confirmed 100% synthetic case realism, and automated evaluation accuracy was 99.8% (95% CI 95.5 to 100), but specific clinical recommendations varied widely across the 1,000-case evaluation. Clinically appropriate steroid recommendations ranged from 7.2% (95% CI 5.6 to 8.8) for Gemini 3 Flash and 15.8% (95% CI 13.6 to 18.1) for Gemini 3 Pro to 23.5% (95% CI 20.8 to 26.1) for GPT 5 mini, with the Gemini models frequently overlooking contraindications such as active infection. Inappropriate intravenous thrombolysis recommendations were below 1% for Gemini models but reached 6.4% for GPT 5 mini and 9.6% for GPT 5.2. In the expanded 10,000-case evaluation, thrombolysis was recommended in 10.1% of cases lacking symptom timing information and in 2.9% of cases where symptoms were documented as more than 14 days old. All models included MS in the differential diagnoses in more than 91% of cases.
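To see how an interval like "7.2% (95% CI 5.6 to 8.8)" relates to sample size, here is a minimal sketch using the normal-approximation (Wald) interval for a proportion. The source does not say which interval method the authors used, and the denominator of 1,000 cases per model is an assumption taken from the abstract's description of the first evaluation stage; `wald_ci_percent` is a hypothetical helper name.

```python
import math


def wald_ci_percent(rate_percent: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% CI for a proportion, in percent.

    Assumes n independent cases; this is not necessarily the method
    the study's authors used.
    """
    p = rate_percent / 100.0
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    lower = max(0.0, p - half_width)
    upper = min(1.0, p + half_width)
    return round(lower * 100, 1), round(upper * 100, 1)


# Gemini 3 Flash steroid figure, assuming 1,000 cases.
print(wald_ci_percent(7.2, 1000))  # → (5.6, 8.8)
```

Under these assumptions the sketch reproduces the reported Gemini 3 Flash interval. For the other models it can differ from the published bounds in the last digit, which is expected when the exact counts and interval method are unknown.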

The authors note that current medical LLM evaluations largely rely on small collections of cases, which motivated the large-scale design. Because every case was synthetic, the study reports no patient outcomes, adverse events, or tolerability data. It also does not attribute the errors to model architecture. Consequently, one should not infer clinical outcomes from synthetic case simulations or assume patient safety based solely on automated evaluation accuracy.

The authors conclude that massive-scale simulation and automated interrogation should become standard for uncovering serious failures and implementing safety guardrails before clinical deployment exposes patients to risk. For practice, the takeaway is that AI tools need rigorous, large-scale testing before real-world use.
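The argument for scale comes down to simple arithmetic: if a failure occurs at rate p and cases are independent, the chance of seeing it at least once in n cases is 1 - (1 - p)^n. The sketch below works this out for a hypothetical failure rate of 0.1%. That rate and the test-set sizes are illustrative assumptions, not figures from the study.

```python
def prob_at_least_one(failure_rate: float, n_cases: int) -> float:
    """Probability of observing one or more failures in n independent cases."""
    return 1.0 - (1.0 - failure_rate) ** n_cases


# Hypothetical failure rate of 1 in 1,000; not a number from the study.
RATE = 0.001
for n in (50, 1000, 10000):
    print(f"{n:>6} cases: {prob_at_least_one(RATE, n):.3f}")
```

With these assumed numbers, a 50-case test set would surface the failure only about 5% of the time, 1,000 cases about 63% of the time, and 10,000 cases almost always. Estimating how often the failure occurs in a particular subgroup, rather than merely observing it once, requires still more cases.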

Imagine walking into a clinic with a severe headache. You need the right medicine fast. But what if the computer suggesting your treatment is missing a key detail?

That is exactly what happened in a massive new test of artificial intelligence.

Multiple sclerosis (MS) is a complex condition that affects the nerves in your brain and spine. It can cause weakness, vision loss, and numbness. Doctors must carefully rule out other causes before starting treatment.

AI tools can stumble here. In this study, they sometimes overlooked warning signs or suggested risky drugs. In a real clinic, that could put patients at risk.

The surprising shift

Until now, most tests of medical AI have used small collections of cases. Small tests can miss rare but serious errors.

This new study took a different approach. Researchers created 10,000 realistic MS cases. They used these to find hidden errors in the latest AI models.

What scientists didn't expect

The computers were very good at spotting the disease. They included MS on their list of possible diagnoses in more than 91% of cases. That sounds impressive, right?

But here is the twist. Being smart about the diagnosis did not mean being safe with the treatment plan.

Think of the AI like a very fast but confused intern. It can read the chart quickly. But it sometimes ignores the fine print.

For example, some steroids are great for MS flares. But they are dangerous if you have an active infection. The AI often missed this warning sign.

Another major error involved a powerful clot-busting drug. This drug is meant for emergencies such as a stroke that has just started, and it only helps in the first hours.

The AI suggested this drug for MS patients. It did this even when the patient's symptoms had been going on for weeks. That is not a stroke.

The team generated 10,000 synthetic cases. These were not real people. They were computer-generated scenarios with known correct answers.

They tested four top AI models. Two were from Google, and two were from OpenAI.

Human experts checked 70 cases first. They confirmed the fake cases looked real and that the automated scoring was accurate. Then the AI models were tested on 1,000 cases, and later on 10,000.

The results were clear. The AI models made serious mistakes.

One model, GPT 5.2, wrongly suggested the clot-busting drug in nearly 10% of cases. The Google models did so in fewer than 1%.

The Google models, in turn, rarely gave appropriate steroid recommendations. They often overlooked reasons not to give steroids, such as an active infection.

The study does not explain why these errors happened. It only shows how often they did.

This doesn't mean these errors happened to real patients.

That is a critical point to remember. These are not real patients. We are testing the software before it ever sees a human.

The study's authors say small tests cannot find these deep flaws.

You need thousands of cases to see the rare failures. Only then can we build better safety guards.

No real patients were treated in this study. The AI tools were tested only on made-up cases.

However, this research helps scientists build safer tools for the future. The aim is that when AI does help doctors, it is less likely to hurt patients.

Your doctor will always be the final decision-maker. They will use their experience to guide your care.

This study used fake cases. Real patients are more unpredictable.

Also, the study focused on MS. Other diseases might have different risks.

We must be careful not to overstate what we know. More research is needed.

Next, scientists will use these methods to test other diseases. They will look for more hidden dangers.

The goal is to make AI a true helper for doctors. Not a replacement.

We need to move fast but stay safe. This research is a step toward that goal.

Study Details

Evidence: Level 5
Published: Apr 2026
Original Abstract
Background: Current medical large language model (LLM) evaluations largely rely on small collections of cases, whereas rigorous safety testing requires large-scale, diverse, and complex cases with verifiable ground truth. Multiple Sclerosis (MS) provides an ideal evaluation model, with validated diagnostic criteria and numerous paraclinical tests informing differential diagnosis, investigation, and management.

Methods: We generated synthetic MS cases with ground-truth labels for diagnosis, localisation, and management. Four frontier LLMs (Gemini 3 Pro/Flash, GPT 5.2/5 mini) were instructed to analyse cases to provide anatomical localisation, differential diagnoses, investigations, and management plans. An automated evaluator compared these outputs to the ground-truth labels. Blinded subspecialty experts validated 70 cases for realism and automated evaluator accuracy. We then evaluated LLM decision-making across 1,000 cases and scaled to 10,000 to characterise rare, catastrophic failures.

Results: Subspecialist expert review confirmed 100% synthetic case realism and 99.8% (95% CI 95.5 to 100) automated evaluation accuracy. Across 1,000 generated MS cases, all LLMs successfully included MS in the differential diagnoses for more than 91% cases. However, diagnostic competence did not associate with treatment safety. Gemini 3 models had low rates of clinically appropriate steroid recommendations (Flash: 7.2% 95% CI 5.6 to 8.8; Pro: 15.8% 95% CI 13.6 to 18.1) compared to GPT 5 mini (23.5% 95% CI 20.8 to 26.1), frequently overlooking contraindications like active infection. OpenAI models inappropriately recommended acute intravenous thrombolysis for MS cases (9.6% GPT 5.2; 6.4% GPT 5 mini) compared to below 1% for Gemini models. Expanded evaluation (to 10,000 cases) probed these errors in detail. Thrombolysis was recommended in 10.1% of cases lacking symptom timing information and paradoxically persisted (2.9%) even when symptoms were explicitly documented as more than 14 days old.

Conclusion: Automated expert-level evaluation across 10,000 cases characterised artificial intelligence clinical blind spots hitherto invisible to small-scale testing. Massive-scale simulation and automated interrogation should become standard for uncovering serious failures and implementing safety guardrails before clinical deployment exposes patients to risk.