Preclinical evaluation of five LLMs on nephrology multiple-choice questions

Key Takeaway
LLM performance on nephrology multiple-choice questions varies across models, but these preclinical data do not support clinical use.

This is a preclinical evaluation of five open-source large language models (PodGPT, Llama 3.2-11B, Mistral-7B-Instruct-v0.2, Falcon3-10B-Instruct, Gemma-2-9B-it) answering multiple-choice questions from the Nephrology Self-Assessment Program (NephSAP). The primary outcome was accuracy, defined as the proportion of correctly answered questions.

PodGPT achieved the highest accuracy (64.77%), whereas Llama showed the lowest performance with an accuracy of 45.08%. For secondary outcomes, PodGPT had the lowest factual error rate (0.017), and Llama and Falcon achieved the lowest reasoning error rates (0.038). The quality of explanatory responses was assessed using BLEU, Word Error Rate (WER), cosine similarity, and Flesch-Kincaid Grade Level.
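The headline metrics above are straightforward to compute. A minimal sketch of accuracy, WER, and cosine similarity, assuming simple whitespace tokenization and bag-of-words vectors (the paper does not specify its tokenization or implementation):

```python
from collections import Counter
import math

def accuracy(predictions, answers):
    """Primary outcome: proportion of correctly answered questions."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[-1][-1] / len(ref)

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors of two texts."""
    va, vb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

In practice, BLEU and WER are usually computed with established libraries rather than hand-rolled code, and embedding-based cosine similarity would replace the bag-of-words vectors shown here.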

The authors acknowledge this is a single evaluation of a specific question set, with no reported follow-up, sample size, or population details. Generalizability to clinical practice is not established. No adverse events were reported, as this is a preclinical model evaluation.

Practice relevance is limited to the reported question-answering task; no causal claims are made. The findings should not be interpreted as evidence of clinical decision-support effectiveness.

Study Details

Evidence Level: 5
Published: Apr 2026
Background: Large Language Models (LLMs) have demonstrated strong performance in medical question-answering tasks, highlighting their potential for clinical decision support and medical education. However, their effectiveness in subspecialty areas such as nephrology remains underexplored. In this study, we assess the performance of open-source LLMs in answering multiple-choice questions from the Nephrology Self-Assessment Program (NephSAP) to better understand their capabilities and limitations within this specialized clinical domain.

Methods: We evaluated the performance of five open-source large language models (LLMs): PodGPT, a podcast-pretrained model focused on STEMM disciplines; Llama 3.2-11B; Mistral-7B-Instruct-v0.2; Falcon3-10B-Instruct; and Gemma-2-9B-it. Each model was tested on its ability to answer multiple-choice questions derived from the NephSAP. Model performance was quantified using accuracy, defined as the proportion of correctly answered questions. In addition, the quality of the models' explanatory responses was assessed using several natural language processing (NLP) metrics: Bilingual Evaluation Understudy (BLEU), Word Error Rate (WER), cosine similarity, and Flesch-Kincaid Grade Level (FKGL). For qualitative analysis, three board-certified nephrologists reviewed 40 randomly selected model responses to identify factual and clinical reasoning errors, with performance summarized as average error ratios based on the proportion of error-associated words per response.

Results: Among the evaluated models, PodGPT achieved the highest accuracy (64.77%), whereas Llama showed the lowest performance with an accuracy of 45.08%. Qualitative analysis showed that PodGPT had the lowest factual error rate (0.017), while Llama and Falcon achieved the lowest reasoning error rates (0.038).
Conclusions: This study highlights the importance of STEMM-based training to enhance the reasoning capabilities and reliability of LLMs in clinical contexts, supporting the development of more effective AI-driven decision-support tools in nephrology and other medical specialties.
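The qualitative error ratio described in the Methods is the proportion of error-associated words per response, averaged across reviewed responses. A minimal sketch, assuming reviewers flag error-associated phrases as text spans and whitespace tokenization (the paper's exact annotation scheme is not published):

```python
def error_ratio(response, error_spans):
    """Proportion of words in one response that fall inside reviewer-flagged
    error phrases. `error_spans` is a list of flagged phrases (assumed format)."""
    total_words = len(response.split())
    flagged_words = sum(len(span.split()) for span in error_spans)
    return flagged_words / total_words if total_words else 0.0

def mean_error_ratio(reviewed):
    """Average error ratio over (response, error_spans) pairs."""
    ratios = [error_ratio(resp, spans) for resp, spans in reviewed]
    return sum(ratios) / len(ratios)
```

Separate factual and reasoning error ratios, as reported in the Results, would simply apply this calculation to two independently flagged sets of spans.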