Review of five language models for rheumatology case evaluation with and without retrieval augmentation
This review evaluated five state-of-the-art language models on ten standardized, anonymized rheumatology cases. The models were assessed with and without retrieval-augmented generation (RAG) and with or without a predefined diagnosis. The primary outcomes were diagnostic and therapeutic accuracy, quantified using F1 scores, and factual consistency, assessed using the Retrieval-Augmented Generation Assessment Score (RAGAS).
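The review does not reproduce the scoring code, but the diagnostic and therapeutic F1 metrics can be illustrated with a minimal sketch. The function and the case data below are hypothetical, assuming F1 is computed from the overlap between a model's proposed items and a gold-standard list:

```python
# Illustrative sketch only (not the authors' pipeline): computing an
# F1 score by comparing a model's proposed diagnoses against a
# gold-standard list. All diagnoses shown are hypothetical examples.

def f1_score(predicted: set[str], gold: set[str]) -> float:
    """F1 = 2PR / (P + R), with precision and recall from set overlap."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)  # true positives: items in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical case: the model proposes two diagnoses, one of which
# matches the gold standard (precision 0.5, recall 1.0).
pred = {"rheumatoid arthritis", "gout"}
gold = {"rheumatoid arthritis"}
print(round(f1_score(pred, gold), 2))  # → 0.67
```

A per-case score like this would then typically be averaged across the ten cases to yield the aggregate percentages the review reports.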
The key synthesized finding is that Mixtral-8x7b-32768 with RAG achieved the highest diagnostic F1 score (72%) and the highest therapeutic F1 score (73%). This model also achieved the highest RAGAS score (81%). Without RAG, Nemotron-70b showed strong diagnostic performance (71%), and Qwen-Turbo performed well on therapeutic tasks (72%).
The authors acknowledge limitations, including that clinically relevant errors persisted across all models, the need for expert oversight, and the need for further real-world validation. Safety data were not reported.
The suggested practice relevance is that smaller language models combined with RAG can match or exceed the performance of larger models for clinical decision support while requiring significantly fewer computational resources. However, this conclusion rests on a small set of standardized cases, and the clinical validity of the reported model performance should not be overstated.