
Review of five language models for rheumatology case evaluation with and without retrieval augmentation

Key Takeaway
Smaller language models with retrieval augmentation can match larger models on standardized rheumatology cases, but expert oversight remains essential.

This review covers a study that evaluated five state-of-the-art language models on ten standardized, anonymized rheumatology cases. The models were assessed with and without retrieval-augmented generation (RAG), and with or without a predefined diagnosis. The primary outcomes were diagnostic and therapeutic accuracy, quantified using F1 scores, and factual consistency, assessed using the Retrieval-Augmented Generation Assessment Score (RAGAS).
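The study does not publish its scoring code, but the F1 metric itself is standard: the harmonic mean of precision and recall over a model's answers versus the reference answers. A minimal sketch, with purely illustrative counts (not taken from the study):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall.

    tp: answers the model gave that match the reference (true positives)
    fp: answers the model gave that are not in the reference (false positives)
    fn: reference answers the model missed (false negatives)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: 8 correct diagnoses, 3 spurious, 2 missed
print(round(f1_score(tp=8, fp=3, fn=2), 2))  # → 0.76
```

Under this definition, the reported 72% diagnostic F1 reflects a balance of precision and recall rather than raw accuracy, which matters when a model can list many candidate diagnoses.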

The key synthesized finding is that Mixtral-8x7b-32768 with RAG achieved the highest diagnostic F1 score (72%) and the highest therapeutic F1 score (73%). This model also achieved the highest RAGAS score (81%). Without RAG, Nemotron-70b showed strong diagnostic performance (71%), and Qwen-Turbo performed well on therapeutic tasks (72%).

The authors acknowledge limitations, including that clinically relevant errors persisted across all models, the need for expert oversight, and the need for further real-world validation. Safety data were not reported.

The practice relevance suggested is that smaller language models combined with RAG can match or exceed the performance of larger models for clinical decision support while requiring significantly fewer computational resources. However, this conclusion is based on a small set of standardized cases and should not be taken to overstate the clinical validity of the models' performance.

Study Details

Study type: Guideline
Evidence: Level 5
Published: May 2026
Background

Large language models (LLMs) are increasingly explored for clinical decision support but are limited by high computational and energy demands. Smaller language models (SLMs), particularly when combined with retrieval-augmented generation (RAG), may offer a more sustainable alternative. Rheumatology, characterized by diagnostic complexity and guideline-driven management, represents a suitable test domain.

Methods

Five state-of-the-art language models (GPT-4o, Mixtral-8x7b-32768, Llama-3.1-Nemotron-70b-Instruct, Qwen-Turbo 2.5, Claude-3.5-Sonnet) were evaluated regarding their suitability for clinical decision support using ten standardized, anonymized rheumatology cases. Models were assessed with and without RAG, and with or without a predefined diagnosis. Diagnostic and therapeutic accuracy were quantified using F1 scores. Factual consistency and relevance were assessed using the Retrieval-Augmented Generation Assessment Score (RAGAS).

Results

Mixtral-8x7b-32768 with RAG achieved the highest diagnostic (72%) and therapeutic (73%) F1 scores. Nemotron-70b showed strong diagnostic performance without RAG (71%), while Qwen-Turbo performed well in therapeutic recommendations without retrieval (72%). The highest RAGAS score was observed for Mixtral with RAG (81%). Performance regarding clinical decision support varied substantially across models and configurations.

Conclusion

SLMs combined with RAG can match or exceed the performance of larger LLMs for clinical decision support while requiring significantly fewer computational resources. Despite promising results, clinically relevant errors persisted across all models, underscoring the need for expert oversight and further real-world validation.