
Large language models show potential for psychosis risk assessment from clinical interview transcripts
AI language models show promise for assessing early psychosis risk from interviews

AI-generated summary of the cited source, checked by automated accuracy review.

Key Takeaway
Consider LLMs as potential screening tools for psychosis risk, but recognize this is early evaluation evidence requiring clinical validation.

This evaluation study assessed whether 11 open-weight large language models (LLMs) could extract clinically meaningful information from PSYCHS interview transcripts to support psychosis risk assessment. The analysis included 373 participants (77.7% with clinical high risk for psychosis) with 678 interview transcripts, comparing LLM outputs against researcher-rated scores for classification and symptom severity assessment.

For psychosis risk status classification, larger models achieved the strongest performance, with Llama-3.3-70B showing accuracy of 0.80, sensitivity of 0.93, and specificity of 0.58. LLM-generated symptom severity and frequency scores demonstrated good correlation with researcher ratings, with intraclass correlation coefficients of 0.74 and 0.75 respectively. Performance disparities were minimal across most demographic groups but varied across different sites.
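The three classification figures above are linked by a simple identity: overall accuracy is the average of sensitivity and specificity, weighted by the share of positive cases. The sketch below is illustrative arithmetic only, not part of the study's analysis, and the function names are hypothetical. It assumes the metrics were computed over one pooled set of cases. The 77.7% CHR-P share is reported per participant, and the share among the 678 transcripts is not given here, so using it as the case mix is an assumption.

```python
def accuracy_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Accuracy as the prevalence-weighted mean of sensitivity and specificity."""
    return prevalence * sensitivity + (1.0 - prevalence) * specificity


def implied_prevalence(accuracy: float, sensitivity: float, specificity: float) -> float:
    """Solve accuracy = p * sensitivity + (1 - p) * specificity for p."""
    if sensitivity == specificity:
        raise ValueError("prevalence cannot be recovered when sensitivity equals specificity")
    return (accuracy - specificity) / (sensitivity - specificity)


# Figures reported for Llama-3.3-70B in the summary above.
SENSITIVITY, SPECIFICITY, ACCURACY = 0.93, 0.58, 0.80

# Assumption: the participant-level CHR-P share (77.7%) is used as the case mix.
PARTICIPANT_CHR_SHARE = 0.777

if __name__ == "__main__":
    acc = accuracy_from_rates(SENSITIVITY, SPECIFICITY, PARTICIPANT_CHR_SHARE)
    share = implied_prevalence(ACCURACY, SENSITIVITY, SPECIFICITY)
    print(f"accuracy at participant-level case mix: {acc:.3f}")
    print(f"positive share implied by reported metrics: {share:.3f}")
```

Under these assumptions, the participant-level case mix gives an accuracy of about 0.85, while the reported 0.80 corresponds to a positive share of about 0.63. One possible explanation is that the transcript-level case mix differs from the participant-level one, but the summary does not report this. The identity also shows why accuracy alone says little when most cases are positive, and why the specificity of 0.58 deserves separate attention.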

The generated summaries were largely faithful to source transcripts, with low rates of clinically relevant confabulation (3%). Errors primarily reflected over-pathologization of non-clinical experiences. Accuracy scaled with model size, though smaller models achieved competitive performance with substantially lower computational cost. Safety data, adverse events, and specific limitations were not reported in the available information.

These preliminary findings demonstrate that open-weight LLMs can assess psychosis risk from clinical interview transcripts, potentially supporting scalable, human-in-the-loop approaches to early detection. However, as an evaluation study without reported limitations or validation in clinical practice, these results require cautious interpretation and further investigation before considering implementation.

Researchers conducted an evaluation study to see if artificial intelligence (AI) language models could help assess psychosis risk. They used 11 different open-weight AI models to analyze transcripts from 678 clinical interviews with 373 people, most of whom were considered at clinical high risk for psychosis. The goal was to see if the AI could accurately classify risk status and score symptom severity and frequency, comparing its results with ratings made by human researchers.

The study found that larger AI models performed the best. One specific model, Llama-3.3-70B, correctly classified risk status 80% of the time. The symptom scores generated by the AI also showed good agreement with the scores given by human researchers. The AI summaries were mostly faithful to the interview content, with only a 3% rate of making up clinically important details. However, when the AI made errors, they were mostly errors of 'over-pathologization,' meaning the AI tended to label normal human experiences as potential symptoms.

This research is an early step in exploring how AI might assist in mental health screening. The study did not report on safety concerns, as it was an analysis of existing transcripts, not a live clinical trial. The main reason to be careful is that this technology is not yet proven for real-world use. Performance also varied somewhat across different research sites. Readers should understand that this is a promising but preliminary finding. It suggests AI could one day help clinicians by quickly reviewing interview notes, but it is not a replacement for professional assessment and is far from being a standard tool.

What this means for you:
Early research suggests AI may help analyze interviews for psychosis risk, but it's not ready for clinical diagnosis.

Study Details

Sample size: n = 373
Evidence: Level 5
Published: Apr 2026
Original Abstract
Psychosis prevention relies on early detection of individuals at clinical high risk for psychosis (CHR-P), but such detection remains limited, constraining preventive care. Detection is constrained in part because clinical assessments require specialist interpretation of narrative interviews, limiting scalability. Here, we evaluate whether large language models (LLMs; deep learning models trained on large text corpora to process and generate language) can extract clinically meaningful information from such interviews to support psychosis risk assessment. We assessed 11 open-weight LLMs on 678 PSYCHS interview transcripts from 373 participants (77.7% CHR-P). Models inferred CHR-P status and estimated severity and frequency across 15 symptom domains, benchmarked against researcher-rated scores. Larger models achieved the strongest classification performance (Llama-3.3-70B: accuracy = 0.80, sensitivity = 0.93, specificity = 0.58). LLM-generated symptom scores showed good correlations with researcher-rated scores (ICCsev = 0.74, ICCfreq = 0.75). Performance disparities were minimal across most demographic groups but varied across sites. Generated summaries were largely faithful to source transcripts, with low rates of clinically relevant confabulation (3%). Errors primarily reflected over-pathologisation of non-clinical experiences. While accuracy scaled with model size, smaller models achieved competitive performance with substantially lower computational cost. These findings demonstrate that open-weight LLMs can assess psychosis risk from clinical interview transcripts, supporting scalable, human-in-the-loop approaches to early detection.