
Comparative evaluation of simulated reasoning and self-verification for LLM psychiatric diagnosis

Key Takeaway
Self-verification prompts and large reasoning models may improve positive predictive value (PPV) in LLM psychiatric diagnosis of case vignettes, but the evidence remains preliminary and is not ready to guide clinical use.

This comparative evaluation of large language models (LLMs) for psychiatric diagnosis used 106 case vignettes. The study compared a basic inference approach (directly prompting models for diagnoses) with a self-verification approach that augmented the basic prompts with additional verification steps, across models from two vendors. Primary outcomes were two diagnostic performance metrics: sensitivity and positive predictive value (PPV). Sensitivity ranged from 0.732 to 0.817 across models and approaches, with no statistically significant fixed effects found for sensitivity. PPV ranged from 0.534 to 0.779, with the best overall performance from the o3-pro large reasoning model using self-verification (PPV 0.779). The authors reported statistically significant effects for prompt type (self-verification, p=0.007) and model type (large reasoning model, p=0.009), indicating that both factors were associated with improved PPV.

The authors note that the case vignettes were drawn from a single source (the DSM-5-TR Clinical Cases book) and that evaluation was limited to two model vendors, which restricts generalizability. They also highlight that this is not a clinical trial, so the results show associations rather than causation. Practice relevance is limited to the suggestion that manually crafted reasoning prompts and automated simulated reasoning could benefit future language-model applications in behavioral health; the authors caution against inferring that these approaches cause improved diagnosis in clinical practice or that LLMs are ready for autonomous psychiatric diagnosis.
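For readers less familiar with the two outcome metrics, the sketch below shows how sensitivity and PPV would be computed from diagnosis-level counts. The study does not publish its scoring code, so the function and variable names here are illustrative only:

```python
def diagnostic_metrics(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Illustrative computation of the study's two primary outcomes.

    tp: model diagnoses that match a reference diagnosis (true positives)
    fp: model diagnoses with no matching reference diagnosis (false positives)
    fn: reference diagnoses the model missed (false negatives)
    """
    sensitivity = tp / (tp + fn)  # share of true diagnoses the model recovered
    ppv = tp / (tp + fp)          # share of proposed diagnoses that were correct
    return sensitivity, ppv

# Example: 78 matched, 22 spurious, and 22 missed diagnoses
# -> sensitivity = 0.78, PPV = 0.78 (near the best reported result)
print(diagnostic_metrics(78, 22, 22))
```

Note that the two metrics trade off differently: adding speculative diagnoses can raise sensitivity while lowering PPV, which is why the study reports both.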

Study Details

Evidence Level: 5
Published: Apr 2026
Background: Large language models (LLMs) and, more recently, large reasoning models (LRMs) have rapidly garnered significant interest for application in psychiatry and behavioral health. However, recent studies have identified significant shortcomings and potential risks in the performance of LLM-based systems, complicating their application to psychiatric diagnosis. Two promising approaches to addressing these challenges and improving the efficacy of these models are simulated reasoning (SR) and self-verification (SV), in which additional "reasoning tokens" are used to guide model output, either during or after inference.

Objectives: We aimed to explore how the use of SR (via LRMs) and SV (via supplemental prompting) affects the psychiatric diagnostic performance of LLMs.

Methods: 106 case vignettes and associated diagnoses were extracted from the DSM-5-TR Clinical Cases book, with permission. Both an LLM and an LRM were selected from the latest available model generation for each of the two vendors studied (OpenAI and Google). Two inference approaches were developed: a Basic approach that directly prompted models to provide diagnoses, and an SV approach that augmented the Basic approach with additional prompts. All case vignettes were processed by the two LLMs, the two LRMs, and both inference approaches, and diagnostic performance was evaluated using sensitivity and positive predictive value (PPV). Binomial generalized linear mixed models were used to test for significant differences between model vendors (OpenAI, Google), model types (LLM, LRM), and the addition of an SV prompt.

Results: All vignettes were successfully processed by each model and inference approach. Sensitivity ranged from 0.732 to 0.817, and PPV ranged from 0.534 to 0.779. The best overall performance was found in the o3-pro LRM using SV, with a sensitivity of 0.782 and a PPV of 0.779. No statistically significant fixed effects were found for sensitivity. For PPV, statistically significant effects were found for prompt type (SV, p=0.007) and model type (LRM, p=0.009). No significant interaction effects were identified.

Conclusions: We found that both SR and SV yielded statistically significant improvements in PPV, without significant differences in sensitivity. The addition of the manually specified SV prompt improved the PPV even when simulated reasoning was used. This suggests that future efforts to apply language models in behavioral health could benefit from manually crafted reasoning prompts and automated SR.
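The abstract does not publish the exact prompts, so the following is only a minimal sketch of how the Basic versus SV inference passes described in the Methods might look against the OpenAI chat API. The prompt wording, the default model name, and the diagnose() helper are all hypothetical:

```python
from openai import OpenAI

client = OpenAI()

def diagnose(vignette: str, self_verify: bool, model: str = "gpt-4o") -> str:
    # Hypothetical prompt text; the study's actual prompts are not published,
    # and the model name is a placeholder for the models it evaluated.
    messages = [{
        "role": "user",
        "content": f"List the most likely DSM-5-TR diagnoses for this case:\n\n{vignette}",
    }]
    first = client.chat.completions.create(model=model, messages=messages)
    answer = first.choices[0].message.content

    if self_verify:
        # SV approach: feed the initial answer back with an additional
        # verification prompt, mirroring the paper's description of
        # augmenting the Basic approach with supplemental prompting.
        messages += [
            {"role": "assistant", "content": answer},
            {"role": "user", "content": (
                "Re-examine the vignette and your proposed diagnoses. "
                "Remove any diagnosis the vignette does not clearly support, "
                "then return the revised list."
            )},
        ]
        second = client.chat.completions.create(model=model, messages=messages)
        answer = second.choices[0].message.content

    return answer
```

In this sketch the SV pass reviews an already-generated answer, which is one plausible reading of "additional prompts ... after inference"; the authors' actual verification logic may differ.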
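For the analysis step, a binomial generalized linear mixed model with a random intercept per vignette could be approximated in Python with statsmodels' Bayesian mixed GLM. The study does not name its software (a frequentist fit, e.g. lme4's glmer in R, is equally plausible), and the data layout and column names below are assumptions:

```python
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Assumed long-format data: one row per (vignette, model, prompt) run.
# correct:  1 if the run's diagnosis was scored correct, else 0
# vendor:   "openai" or "google"
# lrm:      1 for a large reasoning model, 0 for a plain LLM
# sv:       1 if the self-verification prompt was added, 0 otherwise
df = pd.read_csv("runs.csv")  # hypothetical file

model = BinomialBayesMixedGLM.from_formula(
    "correct ~ vendor + lrm + sv",     # fixed effects tested in the paper
    {"vignette": "0 + C(vignette)"},   # random intercept for each vignette
    df,
)
result = model.fit_vb()  # variational Bayes approximation to the GLMM
print(result.summary())
```

The random intercept accounts for the fact that every model and prompt combination saw the same 106 vignettes, so outcomes within a vignette are correlated rather than independent.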