Large language models show potential for psychosis risk assessment from clinical interview transcripts
This evaluation study assessed whether 11 open-weight large language models (LLMs) could extract clinically meaningful information from PSYCHS interview transcripts to support psychosis risk assessment. The analysis included 373 participants (77.7% at clinical high risk for psychosis) and 678 interview transcripts, comparing LLM outputs against researcher-rated scores for both risk-status classification and symptom severity assessment.
For psychosis risk status classification, larger models achieved the strongest performance, with Llama-3.3-70B reaching an accuracy of 0.80, a sensitivity of 0.93, and a specificity of 0.58. LLM-generated symptom severity and frequency scores showed good agreement with researcher ratings, with intraclass correlation coefficients of 0.74 and 0.75, respectively. Performance disparities were minimal across most demographic groups, though performance varied across study sites.
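For orientation, these classification metrics follow the standard confusion-matrix definitions. The minimal Python sketch below is illustrative only, using hypothetical counts rather than the study's data, to show how accuracy, sensitivity, and specificity relate:

```python
# Illustrative only: hypothetical confusion-matrix counts, not the study's data.
tp, fn = 93, 7    # at-risk participants correctly classified / missed (hypothetical)
tn, fp = 29, 21   # not-at-risk participants correctly classified / flagged (hypothetical)

sensitivity = tp / (tp + fn)                 # true-positive rate
specificity = tn / (tn + fp)                 # true-negative rate
accuracy = (tp + tn) / (tp + fn + tn + fp)   # overall proportion correct

print(f"sensitivity={sensitivity:.2f}, "
      f"specificity={specificity:.2f}, accuracy={accuracy:.2f}")
```

With an imbalanced sample such as this one, accuracy is dominated by the majority class, which is why sensitivity and specificity are reported alongside it.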
The generated summaries were largely faithful to the source transcripts, with a low rate of clinically relevant confabulation (3%). Errors primarily reflected over-pathologization of non-clinical experiences. Accuracy scaled with model size, though smaller models achieved competitive performance at substantially lower computational cost. Safety data, adverse events, and specific limitations were not reported.
These preliminary findings suggest that open-weight LLMs can assess psychosis risk from clinical interview transcripts, potentially supporting scalable, human-in-the-loop approaches to early detection. However, as an evaluation study without reported limitations or validation in clinical practice, the results warrant cautious interpretation and further investigation before implementation is considered.