Large language models provide consistent informational quality but low verifiability for varicose veins information
AI models provide similar quality information on varicose veins

AI-generated summary of the cited source, checked by automated accuracy review.

Key Takeaway
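LLM responses about varicose veins were fluent and of broadly similar informational quality across models, but they showed low verifiability and were written above the recommended sixth-grade reading level.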

This benchmarking study evaluated the performance of five publicly accessible large language models (ChatGPT 5.2, DeepSeek-V3.2, Gemini 3 Pro, Grok 4.1, and Qwen3-Max) in providing information regarding varicose veins. The assessment focused on informational quality using DISCERN, EQIP, and GQS scores, as well as verifiability via JAMA scores and readability indices.

Results indicated that informational quality was broadly similar across all models, with DISCERN means of 46.50 to 50.75 (scale 16–80), EQIP means of 71.50 to 74.25 (scale 0%–100%), and GQS medians of 4.0 (scale 1–5). Between-model differences in these primary outcomes were small and not significant after Holm adjustment. All models exceeded the commonly recommended sixth-grade readability threshold, meaning the responses were harder to read than recommended for patient materials. Verifiability scores (JAMA, scale 0–4) were uniformly low, with means ranging from 0.00 to 0.25 and medians of 0.

The authors note that these results do not measure claim-level clinical accuracy or safety. Small differences in informational quality should be interpreted cautiously. The findings highlight a need for better provenance reporting and uncertainty communication in LLM outputs intended for patient use.
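The sixth-grade readability threshold can be made concrete with a grade-level formula. The source says only that six standard indices were used and does not name them, so the choice of the Flesch-Kincaid Grade Level here is an assumption, and the syllable counter is a crude illustrative heuristic rather than the study's method. A minimal sketch:

```python
import re


def fk_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level computed from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def count_syllables(word):
    """Crude heuristic (assumption): one syllable per run of vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fk_grade_of_text(text):
    """Estimate the grade level of non-empty English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return fk_grade(len(words), len(sentences), syllables)


def exceeds_sixth_grade(text):
    """True when the estimated grade is above 6, i.e. harder than recommended."""
    return fk_grade_of_text(text) > 6.0
```

A score above 6 means the text demands more than a sixth-grade reading level, which is the sense in which the models "exceeded" the threshold.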

When people look for health information online, they often turn to AI tools like ChatGPT or Gemini. A new study looked at how five different AI models answered 20 standardized patient questions about varicose veins. The questions were drawn from clinical guidelines to cover key decisions along the care pathway. The goal was to see if some models provided better quality information than others.

The results showed that the quality of information was broadly similar across all five models. The AI responses were fluent, but they were written above the sixth-grade reading level that is commonly recommended for patient materials. They also scored very low in one key area: verifiability. In their default settings, the AI tools rarely showed sources or other details that would let a reader check their claims.

Because these tools are becoming more common, this study highlights a real concern. The AI can write fluently, but its answers are hard to verify, and this study did not test whether the medical facts themselves were accurate or safe. Patients should be cautious when using AI for health concerns and should always talk to a doctor to confirm any information.

What this means for you:
AI models give similar quality information on varicose veins, but they rarely show sources you can check, and their answers are harder to read than recommended.

Common questions

Are the answers from different AI models equally good?

Yes, the study found that informational quality scores were broadly similar across all five AI models tested. While there were small differences in some scores, they were not significant enough to say one model was better than another at providing information about varicose veins.

Is the information provided by AI easy to read?

Not as easy as recommended. All of the AI models tested produced responses above the sixth-grade reading level that is commonly recommended for patient information, so some readers may find them hard to follow. Reading difficulty also varied more between models than information quality did.

Can I trust the sources provided by AI for medical info?

The study found that verifiability was uniformly low across all models. In their default settings, the AI tools rarely showed sources or other details that would let you check their claims about varicose veins. The study also did not test whether the answers were medically accurate, so you should always consult a doctor.

Study Details

Study type: Guideline
Evidence: Level 5
Published: Jun 2026
Original Abstract

Objectives: To benchmark the informational quality, verifiability indicators, and readability of publicly accessible large language model (LLM) responses to standardized, patient-facing varicose veins (VV) questions.

Methods: Twenty single-intent VV questions were derived from PubMed-indexed VV and chronic venous disease guidelines and consensus statements (search date: February 10, 2026). The question set was designed as a decision-critical benchmark across the care pathway rather than a prevalence-weighted sample of real-world patient queries. Five publicly accessible LLMs (ChatGPT 5.2, DeepSeek-V3.2, Gemini 3 Pro, Grok 4.1, and Qwen3-Max) were queried through their official web interfaces from February 10 to 12, 2026, under default settings, generating 100 responses. Each prompt was entered in a new privacy-mode session, and refusals or other non-responsive outputs were retained as returned. Two blinded clinicians independently rated DISCERN (16–80), EQIP (0%–100%), GQS (1–5), and the JAMA benchmark (0–4). In this study, JAMA was used as a structured measure of visible attribution and verifiability-related features rather than as a comprehensive measure of transparency for conversational AI. Readability was assessed using six standard indices. Interrater reliability was evaluated with ICC(A,1) and weighted Cohen's κ. Between-model differences were tested using Friedman tests with Kendall's W and Holm adjustment.

Results: Interrater reliability was high [DISCERN ICC(A,1) = 0.913; EQIP ICC(A,1) = 0.859; GQS κ = 0.883; JAMA κ = 0.864]. Informational-quality scores were broadly similar across models (DISCERN means, 46.50–50.75; EQIP means, 71.50–74.25; GQS medians, 4.0). JAMA scores were uniformly low (means, 0.00–0.25; medians, 0), indicating sparse visible attribution and limited verifiability cues in default outputs. Between-model differences in the primary informational-quality outcomes were small and were not significant after Holm adjustment. Readability differences were more pronounced, and all models exceeded commonly recommended sixth-grade readability thresholds.

Conclusions: Under default public-user settings, publicly accessible LLMs generated fluent VV responses with limited visible verifiability indicators and suboptimal readability. Differences in the primary informational-quality outcomes were modest and should be interpreted cautiously. This benchmark evaluates communication-related performance rather than claim-level clinical accuracy or safety. These findings support efforts to improve auditability, provenance reporting, and uncertainty communication, but these dimensions do not substitute for formal assessment of factual accuracy, guideline concordance, and clinical safety.
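The between-model comparison named in the Methods (Friedman test, Kendall's W, Holm adjustment) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names are hypothetical, the Friedman statistic omits the tie correction, and the p-value step (a chi-square tail with k − 1 degrees of freedom) is left to a statistics library.

```python
def average_ranks(values):
    """Rank values ascending; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(blocks):
    """Friedman chi-square for n blocks (questions) x k treatments (models).

    Assumption: no tie correction is applied.
    """
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)


def kendalls_w(chi2, n, k):
    """Kendall's W effect size (0 = no agreement, 1 = identical rankings)."""
    return chi2 / (n * (k - 1))


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


if __name__ == "__main__":
    # Hypothetical scores (not study data): 3 questions x 3 models.
    scores = [[48, 50, 52], [45, 47, 49], [50, 51, 53]]
    chi2 = friedman_statistic(scores)
    print(chi2, kendalls_w(chi2, n=len(scores), k=len(scores[0])))
    # Hypothetical raw p-values for three outcomes.
    print(holm_adjust([0.01, 0.04, 0.03]))
```

In this design each of the 20 questions would act as a block and each of the five models as a treatment, so ranking happens within a question. Holm adjustment then controls the family-wise error rate across the outcomes tested.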