This publication reports a simulation study of 10,000 synthetic Multiple Sclerosis (MS) cases, scaled up from an initial 1,000. Four frontier artificial intelligence models, Gemini 3 Pro/Flash and GPT 5.2/5 mini, were instructed to analyze the cases and provide anatomical localization, differential diagnoses, investigations, and management plans. An automated evaluator compared these outputs against ground-truth labels, and blinded subspecialty experts validated both the realism of the cases and the accuracy of the evaluator. The study population consisted entirely of synthetic cases; no real patients were involved.
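To make the evaluation design concrete, the sketch below shows one way such a ground-truth comparison could be wired up. It is purely illustrative: the paper does not publish its case schema or evaluator code, so `SyntheticCase`, its fields, and both helper functions are hypothetical stand-ins for the kind of comparison the abstract describes.

```python
# Hypothetical sketch only; every name below is an assumption, not the
# authors' published schema or evaluator code.
from dataclasses import dataclass, field

@dataclass
class SyntheticCase:
    """A synthetic MS case paired with verifiable ground-truth labels."""
    vignette: str                       # generated clinical presentation
    true_localisation: str              # e.g. "optic nerve"
    true_differentials: set[str] = field(default_factory=set)
    contraindications: set[str] = field(default_factory=set)  # e.g. {"active infection"}

def includes_ms(model_differentials: set[str]) -> bool:
    """Headline diagnostic check: did the model list MS at all?"""
    return "multiple sclerosis" in {d.lower() for d in model_differentials}

def steroid_appropriate(recommended: bool, case: SyntheticCase) -> bool:
    """Toy scoring rule: a steroid recommendation counts as appropriate
    only when the case carries no contraindication."""
    return recommended and not case.contraindications
```

In the actual study this comparison was itself performed by an automated LLM evaluator, whose accuracy the blinded experts audited on 70 cases.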
Key findings: blinded subspecialist review confirmed synthetic case realism in 100% of audited cases, and automated evaluation accuracy was 99.8% (95% CI 95.5 to 100); despite this, specific clinical recommendations varied widely across models. Rates of clinically appropriate steroid recommendations ranged from 7.2% (95% CI 5.6 to 8.8) for Gemini 3 Flash to 23.5% (95% CI 20.8 to 26.1) for GPT 5 mini. Inappropriate intravenous thrombolysis recommendations remained below 1% for the Gemini models but reached 9.6% for GPT 5.2. Notably, thrombolysis was recommended in 10.1% of cases lacking symptom timing information, and in 2.9% of cases even when symptoms were explicitly documented as more than 14 days old. All models included MS in the differential diagnosis in more than 91% of cases.
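The reported intervals are consistent with a standard normal-approximation (Wald) confidence interval for a proportion over n = 1,000 cases, though the paper does not state which interval method or exact denominators were used. The minimal check below, with a hypothetical `wald_ci` helper, reproduces the Gemini 3 Flash interval under that assumption.

```python
# Minimal sketch assuming Wald 95% CIs on n = 1,000 cases; the paper does
# not specify its interval method or denominators, so this is a plausibility
# check, not a reproduction of the authors' analysis.
from math import sqrt

def wald_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for a proportion."""
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

lo, hi = wald_ci(0.072, 1000)      # Gemini 3 Flash steroid rate
print(f"{lo:.1%} to {hi:.1%}")     # -> 5.6% to 8.8%, matching the report
```

The other reported intervals differ from this approximation by at most one rounding digit, which may reflect subset denominators or a slightly different interval method.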
Regarding limitations, the authors note that current medical LLM evaluations largely rely on small collections of cases, which limits the generalizability of such findings and motivated this large-scale design. Because the study involved no real patients, no clinical safety data, such as adverse events or tolerability, were reported. The study also does not attribute the observed errors to particular model architectures, as no causal mechanism was explicitly identified. Consequently, readers should not infer clinical outcomes from synthetic case simulations, nor assume patient safety on the basis of automated evaluation accuracy alone.
The authors conclude that massive-scale simulation and automated interrogation should become standard practice for uncovering serious failures and implementing safety guardrails before clinical deployment exposes patients to risk. For practice, this underscores the need for rigorous testing environments prior to real-world application.
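A back-of-envelope calculation illustrates why scale matters for surfacing rare failures. Under the simplifying assumption of independent cases with a fixed per-case failure rate p (the 0.1% rate below is illustrative, not a figure from the paper), the chance of observing at least one failure in n cases is 1 - (1 - p)^n:

```python
# Illustrative only: assumes independent cases and a constant failure
# rate; the 0.1% rate is a made-up example, not a result from the paper.
def detection_probability(p: float, n: int) -> float:
    """P(observing at least one failure) at per-case failure rate p over n cases."""
    return 1 - (1 - p) ** n

for n in (1_000, 10_000):
    print(f"n={n:>6}: {detection_probability(0.001, n):.1%}")
# n=  1000: 63.2%   -- a 1-in-1,000 failure is easily missed
# n= 10000: 100.0%  -- (99.995%) near-certain to surface at scale
```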
Original Abstract
Background: Current medical large language model (LLM) evaluations largely rely on small collections of cases, whereas rigorous safety testing requires large-scale, diverse, and complex cases with verifiable ground truth. Multiple Sclerosis (MS) provides an ideal evaluation model, with validated diagnostic criteria and numerous paraclinical tests informing differential diagnosis, investigation, and management.

Methods: We generated synthetic MS cases with ground-truth labels for diagnosis, localisation, and management. Four frontier LLMs (Gemini 3 Pro/Flash, GPT 5.2/5 mini) were instructed to analyse cases to provide anatomical localisation, differential diagnoses, investigations, and management plans. An automated evaluator compared these outputs to the ground-truth labels. Blinded subspecialty experts validated 70 cases for realism and automated evaluator accuracy. We then evaluated LLM decision-making across 1,000 cases and scaled to 10,000 to characterise rare, catastrophic failures.

Results: Subspecialist expert review confirmed 100% synthetic case realism and 99.8% (95% CI 95.5 to 100) automated evaluation accuracy. Across 1,000 generated MS cases, all LLMs successfully included MS in the differential diagnoses for more than 91% of cases. However, diagnostic competence did not associate with treatment safety. Gemini 3 models had low rates of clinically appropriate steroid recommendations (Flash: 7.2%, 95% CI 5.6 to 8.8; Pro: 15.8%, 95% CI 13.6 to 18.1) compared to GPT 5 mini (23.5%, 95% CI 20.8 to 26.1), frequently overlooking contraindications like active infection. OpenAI models inappropriately recommended acute intravenous thrombolysis for MS cases (9.6% GPT 5.2; 6.4% GPT 5 mini) compared to below 1% for Gemini models. Expanded evaluation (to 10,000 cases) probed these errors in detail. Thrombolysis was recommended in 10.1% of cases lacking symptom timing information and paradoxically persisted (2.9%) even when symptoms were explicitly documented as more than 14 days old.

Conclusion: Automated expert-level evaluation across 10,000 cases characterised artificial intelligence clinical blind spots hitherto invisible to small-scale testing. Massive-scale simulation and automated interrogation should become standard for uncovering serious failures and implementing safety guardrails before clinical deployment exposes patients to risk.