
Large language models show potential for psychosis risk assessment from clinical interview transcripts

Key Takeaway
Consider LLMs as potential screening tools for psychosis risk, but recognize this is early evaluation evidence requiring clinical validation.

This evaluation study assessed whether 11 open-weight large language models (LLMs) could extract clinically meaningful information from PSYCHS interview transcripts to support psychosis risk assessment. The analysis included 373 participants (77.7% with clinical high risk for psychosis) with 678 interview transcripts, comparing LLM outputs against researcher-rated scores for classification and symptom severity assessment.

For psychosis risk status classification, larger models achieved the strongest performance, with Llama-3.3-70B showing accuracy of 0.80, sensitivity of 0.93, and specificity of 0.58. LLM-generated symptom severity and frequency scores demonstrated good correlation with researcher ratings, with intraclass correlation coefficients of 0.74 and 0.75 respectively. Performance disparities were minimal across most demographic groups but varied across different sites.
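The reported classification metrics follow the standard confusion-matrix definitions. A minimal sketch, using hypothetical counts (not the study's actual data) chosen only to illustrate the high-sensitivity, lower-specificity pattern described above:

```python
# Accuracy, sensitivity, and specificity from a binary confusion matrix.
# tp/fn/tn/fp counts below are illustrative, not taken from the study.
def classification_metrics(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)   # true positive rate: CHR-P correctly flagged
    specificity = tn / (tn + fp)   # true negative rate: non-CHR-P correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity

# Hypothetical example: many true positives, more frequent false positives
acc, sens, spec = classification_metrics(tp=90, fn=10, tn=60, fp=40)
print(acc, sens, spec)  # 0.75 0.9 0.6
```

A screening profile like this (sensitivity well above specificity) trades more false alarms for fewer missed at-risk individuals, which is the usual preference when flagged cases go on to human review.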

The generated summaries were largely faithful to source transcripts, with low rates of clinically relevant confabulation (3%). Errors primarily reflected over-pathologization of non-clinical experiences. Accuracy scaled with model size, though smaller models achieved competitive performance with substantially lower computational cost. Safety data, adverse events, and specific limitations were not reported in the available information.

These preliminary findings demonstrate that open-weight LLMs can assess psychosis risk from clinical interview transcripts, potentially supporting scalable, human-in-the-loop approaches to early detection. However, as an evaluation study without reported limitations or validation in clinical practice, these results require cautious interpretation and further investigation before considering implementation.

Study Details

Sample size: n = 373
Evidence: Level 5
Published: Apr 2026
Original Abstract
Psychosis prevention relies on early detection of individuals at clinical high risk for psychosis (CHR-P), yet detection remains limited, constraining preventive care, in part because clinical assessments require specialist interpretation of narrative interviews, limiting scalability. Here, we evaluate whether large language models (LLMs; deep learning models trained on large text corpora to process and generate language) can extract clinically meaningful information from such interviews to support psychosis risk assessment. We assessed 11 open-weight LLMs on 678 PSYCHS interview transcripts from 373 participants (77.7% CHR-P). Models inferred CHR-P status and estimated severity and frequency across 15 symptom domains, benchmarked against researcher-rated scores. Larger models achieved the strongest classification performance (Llama-3.3-70B: accuracy = 0.80, sensitivity = 0.93, specificity = 0.58). LLM-generated symptom scores showed good correlations with researcher-rated scores (ICCsev = 0.74, ICCfreq = 0.75). Performance disparities were minimal across most demographic groups but varied across sites. Generated summaries were largely faithful to source transcripts, with low rates of clinically relevant confabulation (3%). Errors primarily reflected over-pathologisation of non-clinical experiences. While accuracy scaled with model size, smaller models achieved competitive performance with substantially lower computational cost. These findings demonstrate that open-weight LLMs can assess psychosis risk from clinical interview transcripts, supporting scalable, human-in-the-loop approaches to early detection.