Comparative evaluation of simulated reasoning and self-verification for LLM psychiatric diagnosis
This is a comparative evaluation of large language models (LLMs) for psychiatric diagnosis using 106 case vignettes. The study compared a basic inference approach (direct prompting) with a self-verification approach that added further verification prompts, across models from two vendors. The primary outcomes were two diagnostic performance metrics: sensitivity and positive predictive value (PPV).

Sensitivity ranged from 0.732 to 0.817 across models and approaches, with no statistically significant fixed effects on sensitivity. PPV ranged from 0.534 to 0.779, with the best overall performance from the o3-pro large reasoning model using self-verification (PPV 0.779). The authors reported statistically significant effects of prompt type (self-verification, p=0.007) and model type (large reasoning model, p=0.009) on PPV, indicating that both factors were associated with improved PPV.

The authors note that the case vignettes were drawn from a single source (the DSM-5-TR Clinical Cases book) and that the evaluation covered only two model vendors, which limits generalizability. They also emphasize that this is not a primary trial, so the results show associations rather than causation. Practice relevance is limited to suggesting that manually crafted reasoning prompts and automated simulated reasoning could benefit future language model applications in behavioral health; the authors caution against concluding that these approaches cause improved diagnosis in clinical practice or that LLMs are ready for autonomous psychiatric diagnosis.
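To make the two prompting conditions concrete, the following is a minimal sketch of what direct prompting versus self-verification prompting could look like. The call_model function, the prompt wording, and the two-pass structure are illustrative assumptions, not the study's actual implementation.

# Hypothetical sketch only; call_model stands in for any chat-completion API
# that takes a prompt string and returns the model's text response.
DIRECT_PROMPT = (
    "You are a psychiatrist. Read the case vignette below and list the most "
    "likely DSM-5-TR diagnoses.\n\nVignette:\n{vignette}"
)

VERIFY_PROMPT = (
    "Vignette:\n{vignette}\n\nDraft diagnoses:\n{draft}\n\n"
    "Re-check each draft diagnosis against the vignette and DSM-5-TR criteria, "
    "remove any diagnosis that is not supported, and return the revised list."
)

def diagnose_direct(vignette, call_model):
    # Single-pass "basic inference" condition: one prompt, one answer.
    return call_model(DIRECT_PROMPT.format(vignette=vignette))

def diagnose_with_self_verification(vignette, call_model):
    # Two-pass condition: draft diagnoses first, then ask the model to
    # verify and revise its own output before giving the final answer.
    draft = call_model(DIRECT_PROMPT.format(vignette=vignette))
    return call_model(VERIFY_PROMPT.format(vignette=vignette, draft=draft))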
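For reference, sensitivity and PPV follow their standard definitions from counts of true positives (TP), false positives (FP), and false negatives (FN); how the study aggregated these counts across vignettes is not detailed here.

def sensitivity_and_ppv(tp, fp, fn):
    # Sensitivity (recall): proportion of reference diagnoses the model identified.
    sensitivity = tp / (tp + fn)
    # Positive predictive value (precision): proportion of model-generated
    # diagnoses that matched the reference standard.
    ppv = tp / (tp + fp)
    return sensitivity, ppv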