This is a review of a multi-center benchmarking study that evaluated the open-weight multimodal foundation model Qwen3-Omni against independent expert panels on 396 Mental Status Examination (MSE) classifications, spanning 10 domains and three longitudinal timepoints (T0, T1, T2) at UTHealth and Yale.
The authors report that expert agreement on MSE classifications was substantial (Gwet's AC1 = 0.87), while model alignment was only moderate (AC1 = 0.70-0.72). Model accuracy declined across visits: 84.8-87% at T0, 80-82% at T1, and 71-73% at T2. Although the model's overall pathology prediction rate approximated the experts', its errors increased 2.3- to 3.4-fold on items where the experts themselves disagreed. A 4-bit quantization analysis further showed that reducing model capacity disproportionately degraded inferential reasoning while preserving perceptual feature extraction.
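Since Gwet's AC1 anchors all of these comparisons, a worked example may help calibrate the reported values. The sketch below computes the two-rater AC1, defined as AC1 = (Pa - Pe) / (1 - Pe), where Pa is observed agreement and Pe = (1/(Q-1)) * sum_q pi_q * (1 - pi_q) is Gwet's chance-agreement term. The function and toy labels are illustrative only, not the study's panel-based scoring pipeline.

```python
from collections import Counter

def gwet_ac1(rater_a, rater_b, categories):
    """Gwet's AC1 for two raters labeling the same items.

    AC1 = (Pa - Pe) / (1 - Pe), where Pa is observed agreement and
    Pe = (1/(Q-1)) * sum_q pi_q * (1 - pi_q), with pi_q the average
    proportion of ratings in category q across both raters.
    """
    assert len(rater_a) == len(rater_b)
    n, q = len(rater_a), len(categories)
    # Observed agreement: fraction of items both raters labeled identically.
    pa = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Average per-category prevalence across the two raters.
    counts = Counter(rater_a) + Counter(rater_b)
    pi = {c: counts[c] / (2 * n) for c in categories}
    # Chance agreement under Gwet's first-order model.
    pe = sum(pi[c] * (1 - pi[c]) for c in categories) / (q - 1)
    return (pa - pe) / (1 - pe)

# Toy example: hypothetical binary "pathology present/absent" labels.
expert = ["present", "absent", "absent", "present", "absent",
          "absent", "present", "absent", "absent", "absent"]
model  = ["present", "absent", "present", "present", "absent",
          "absent", "absent", "absent", "absent", "absent"]
print(gwet_ac1(expert, model, ["present", "absent"]))  # ~0.655 on this toy data
```

Unlike Cohen's kappa, AC1's chance-agreement term stays small when one category dominates, which matters for MSE domains where pathology is rare.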
Key limitations noted include the model's systematic over-prediction of observable signs, its failures in inferential domains, and the linear degradation of model-to-expert agreement as clinical complexity intensified. The authors acknowledge that generalizability is limited to the two reported sites.
For practice, the study establishes inter-expert variance as a measurable baseline for psychiatric AI, while showing that true clinical translation requires higher-order diagnostic reasoning beyond perceptual extraction. Findings should be interpreted cautiously given the study's observational design and site-specific context.
Original Abstract
The Mental Status Examination (MSE) is the cornerstone of the psychiatric evaluation, yet validating artificial intelligence (AI) against the inherent variance of clinical judgment remains a critical bottleneck. Here we introduce a multi-center framework to benchmark the open-weight multimodal foundation model Qwen3-Omni against independent expert panels at two sites, UTHealth and Yale. Evaluating 396 classifications across 10 MSE domains and three longitudinal timepoints of increasing symptom severity, we found that experts achieved substantial agreement (Gwet's AC1 = 0.87), whereas the model achieved only moderate alignment (AC1 = 0.70-0.72). Even as the model's overall pathology prediction rate approximated the experts', the aggregate equilibrium masked a profound "clinical reasoning gap". Specifically, the model systematically over-predicted observable signs (e.g., speech, affect) while notably failing in inferential domains requiring the interpretation of latent mental content (e.g., delusions, perceptions). A 4-bit quantization analysis of the model confirmed this mechanistically: reducing model capacity disproportionately degraded inferential reasoning while preserving perceptual feature extraction. Furthermore, model-to-expert agreement degraded linearly as clinical complexity intensified across longitudinal visits (Accuracy: T0 = 84.8-87%; T1 = 80-82%; T2 = 71-73%), whereas expert consensus remained robust. Notably, model errors increased 2.3-to-3.4 fold where human experts disagreed. These findings establish inter-expert variance as an essential measurable baseline for psychiatric AI, demonstrating that true clinical translation requires models to move beyond multimodal perceptual extraction to achieve higher-order diagnostic reasoning.
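A note on the abstract's 4-bit quantization analysis: the paper does not report its quantization tooling, so the sketch below shows one common route (bitsandbytes NF4 via Hugging Face transformers) as an illustration, not the authors' exact setup. The checkpoint ID and auto-class are placeholders; the actual Qwen3-Omni release may require a different model class.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Hypothetical setup: checkpoint ID is a placeholder, not a confirmed name.
MODEL_ID = "Qwen/Qwen3-Omni"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
)

# Same weights at two precisions, so any performance gap between the two
# variants reflects reduced numeric capacity rather than different training.
model_full = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=quant_config, device_map="auto"
)
```

Holding the checkpoint fixed and varying only numeric precision is what lets the authors attribute the inferential-versus-perceptual gap to model capacity, which is the logic behind the abstract's mechanistic claim.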