
Preclinical evaluation of five LLMs on nephrology multiple-choice questions

New AI Test Reveals Best Tool for Kidney Care

AI-generated summary of the cited source, checked by automated accuracy review.

Key Takeaway
Accuracy of open-source LLMs on nephrology exam questions varied widely (45.08% to 64.77%), and these preclinical data do not support clinical use.

This is a preclinical evaluation of five open-source large language models (PodGPT, Llama 3.2-11B, Mistral-7B-Instruct-v0.2, Falcon3-10B-Instruct, Gemma-2-9B-it) answering multiple-choice questions from the Nephrology Self-Assessment Program (NephSAP). The primary outcome was accuracy, defined as the proportion of correctly answered questions.
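The accuracy definition above is simple arithmetic: correct answers divided by total questions. A minimal sketch follows, assuming each answer is a single letter choice. The function name `accuracy` and the five-question example are hypothetical illustrations, not the authors' code or data.

```python
def accuracy(predicted, answer_key):
    """Return the proportion of questions answered correctly."""
    if len(predicted) != len(answer_key):
        raise ValueError("predicted and answer_key must have the same length")
    if not answer_key:
        raise ValueError("at least one question is required")
    # Count positions where the model's choice matches the key.
    correct = sum(p == k for p, k in zip(predicted, answer_key))
    return correct / len(answer_key)


# Hypothetical five-question example, not data from the study.
answer_key = ["A", "C", "B", "D", "A"]
model_answers = ["A", "C", "D", "D", "B"]
score = accuracy(model_answers, answer_key)
print(f"{score:.2%}")  # prints 60.00%
```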

PodGPT achieved the highest accuracy (64.77%), whereas Llama showed the lowest performance with an accuracy of 45.08%. For secondary outcomes, three board-certified nephrologists reviewed 40 randomly selected model responses: PodGPT had the lowest factual error rate (0.017), and Llama and Falcon achieved the lowest reasoning error rates (0.038). The quality of explanatory responses was assessed using BLEU, WER, cosine similarity, and Flesch-Kincaid Grade Level, but the abstract does not report those scores.
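Two of the text-quality metrics named above can be illustrated in a few lines. This is a sketch under stated assumptions: WER is computed here as word-level edit distance divided by reference length, and cosine similarity uses simple word-count vectors. The source does not say how the authors tokenized text or built vectors, so the helper names and the example sentences are hypothetical. BLEU and Flesch-Kincaid Grade Level are left out for brevity.

```python
import math
from collections import Counter


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between word-count vectors (an assumed representation)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical reference explanation and model explanation.
reference = "the kidneys filter waste products"
model_text = "the kidneys remove waste products"
wer = word_error_rate(reference, model_text)      # one substitution in five words
similarity = cosine_similarity(reference, model_text)
```

Lower WER and higher cosine similarity both indicate that a model's explanation is closer to the reference wording. Neither metric checks whether the explanation is clinically correct.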

This is a single evaluation on one question set. The abstract does not report the total number of questions, and the qualitative review covered only 40 randomly selected responses. Generalizability to clinical practice is not established. Follow-up, patient population, and adverse events do not apply, as this is a preclinical model evaluation.

Practice relevance is limited to the reported question-answering task, and the design does not support causal claims about why one model outperformed another. The findings should not be interpreted as evidence of clinical decision-support effectiveness.

Why Doctors Need Better Tools

Kidney disease affects millions of people worldwide. It often has no early symptoms.

By the time people feel sick, the damage might be done. Doctors need every tool they can get to catch problems early.

Current treatments are good, but they are not perfect. Some patients need very specific care plans.

This is where AI could help. It can review data faster than any human.

Old Ideas About AI Changed

We often hear that AI will replace doctors. This study does not support that idea.

At best, AI could be a helper, not a replacement. It needs to be trained on the right things.

Most AI models learn from general internet text. They may miss details that specialists rely on.

This study tested how well they could answer kidney doctor exam questions.

How Scientists Measured AI Skills

Researchers tested five different AI models. They gave them questions from a real kidney doctor exam.

They checked how many answers were correct. Three kidney specialists also reviewed 40 randomly chosen answers for mistakes.

Some mistakes were factual errors. Others were errors in clinical reasoning.

One model, PodGPT, was trained on podcasts about science, technology, engineering, mathematics, and medicine. It scored higher than the others.

Which AI Model Won the Test

PodGPT got about 65% of the answers right (64.77%). Another model called Llama got about 45% right (45.08%).

That is a gap of almost 20 percentage points. The authors suggest that training data matters a lot.

PodGPT also made the fewest factual mistakes in the answers the specialists reviewed.

Llama and Falcon made the fewest reasoning errors, even though Llama had the lowest score overall.

This does not mean you should ask AI for medical advice.

Why Accuracy Matters for You

Doctors cannot afford to make mistakes. A wrong answer could hurt a patient.

Kidney care involves many medicines and tests. One error can change a treatment plan.

The study shows that even the best model missed about one in three questions. These models are not ready for patients to use alone.

That is why safety comes first. Any answer from an AI still needs a doctor to check it.

What Happens Next for AI

More testing will happen before doctors use this in clinics. Researchers need to see how it works in real life.

Real life is messier than a test. Patients have complex histories and emotions.

The study only used exam questions. It did not test real patient cases.

This is why we must wait for more research.

Scientists will keep improving these models. They want to make them safer for hospitals.

Careful testing takes time to ensure patient safety. We will know more when studies with real patient cases are done.

For now, talk to your doctor about your health. Do not rely on AI for your care.

Study Details

Evidence: Level 5
Published: Apr 2026
Original Abstract
Background: Large Language Models (LLMs) have demonstrated strong performance in medical question-answering tasks, highlighting their potential for clinical decision support and medical education. However, their effectiveness in subspecialty areas such as nephrology remains underexplored. In this study, we assess the performance of open-source LLMs in answering multiple-choice questions from the Nephrology Self-Assessment Program (NephSAP) to better understand their capabilities and limitations within this specialized clinical domain.
Methods: We evaluated the performance of five open-source large language models (LLMs): PodGPT, a podcast-pretrained model focused on STEMM disciplines, Llama 3.2-11B, Mistral-7B-Instruct-v0.2, Falcon3-10B-Instruct, and Gemma-2-9B-it. Each model was tested on its ability to answer multiple-choice questions derived from the NephSAP. Model performance was quantified using accuracy, defined as the proportion of correctly answered questions. In addition, the quality of the models' explanatory responses was assessed using several natural language processing (NLP) metrics: Bilingual Evaluation Understudy (BLEU), Word Error Rate (WER), cosine similarity, and Flesch-Kincaid Grade Level (FKGL). For qualitative analysis, three board-certified nephrologists reviewed 40 randomly selected model responses to identify factual and clinical reasoning errors, with performance summarized as average error ratios based on the proportion of error-associated words per response.
Results: Among the evaluated models, PodGPT achieved the highest accuracy (64.77%), whereas Llama showed the lowest performance with an accuracy of 45.08%. Qualitative analysis showed that PodGPT had the lowest factual error rate (0.017), while Llama and Falcon achieved the lowest reasoning error rates (0.038).
Conclusions: This study highlights the importance of STEMM-based training to enhance the reasoning capabilities and reliability of LLMs in clinical contexts, supporting the development of more effective AI-driven decision-support tools in nephrology and other medical specialties.
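The abstract describes error rates as average ratios of error-associated words per response. One way to read that description is sketched below. The word counts are hypothetical, and the exact way reviewers marked error-associated words is not stated in the source, so this is an assumed interpretation, not the authors' procedure.

```python
def error_ratio(total_words, flagged_words):
    """Share of a response's words that reviewers tied to an error."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    if not 0 <= flagged_words <= total_words:
        raise ValueError("flagged_words must be between 0 and total_words")
    return flagged_words / total_words


def average_error_ratio(reviews):
    """Mean of per-response error ratios over (total_words, flagged_words) pairs."""
    if not reviews:
        raise ValueError("at least one reviewed response is required")
    ratios = [error_ratio(total, flagged) for total, flagged in reviews]
    return sum(ratios) / len(ratios)


# Hypothetical counts for three reviewed responses, not data from the study.
reviews = [(120, 3), (80, 0), (200, 10)]
avg = average_error_ratio(reviews)
print(f"{avg:.3f}")
```

Under this reading, a factual error rate of 0.017 would mean that, on average, under 2% of the words in a reviewed response were tied to a factual error.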