
Large language models show varying accuracy for data extraction in sepsis guidelines

Key Takeaway
LLMs may reliably support background data extraction in systematic reviews, but outcome data extraction remains error-prone and requires human oversight.

This study assessed the utility of large language models (LLMs), specifically ChatGPT-4o, Claude 3 Sonnet, and Gemini 1.5 Pro, in extracting data from PDF files related to five clinical questions in the Japanese Clinical Practice Guidelines for the Management of Sepsis and Septic Shock 2024 (J-SSCG 2024). The researchers compared various prompt strategies, including original, chain-of-thought, and self-reflection (SR) prompts, against manual extraction by guideline members.
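The three prompt strategies the study compared can be illustrated with a minimal sketch. The wording and field list below are illustrative assumptions for demonstration, not the study's actual prompts:

```python
# Illustrative sketch of the three prompt strategies compared in the study.
# The task wording and field names are assumptions, not the actual
# J-SSCG 2024 extraction prompts.

BASE_TASK = (
    "Extract the following fields from the attached trial PDF: "
    "sample size, intervention, 28-day mortality."
)

def original_prompt() -> str:
    # Plain instruction with no added reasoning scaffold.
    return BASE_TASK

def chain_of_thought_prompt() -> str:
    # Ask the model to reason step by step before answering.
    return BASE_TASK + " Think step by step, then give the final values."

def self_reflection_prompt(first_answer: str) -> str:
    # Second pass over the model's own draft extraction; this extra
    # round trip is why SR prompts roughly doubled processing time.
    return (
        BASE_TASK
        + "\nYour previous answer was:\n" + first_answer + "\n"
        + "Re-check each value against the PDF and correct any errors."
    )
```

In this framing, the self-reflection strategy trades a second model call (and the added latency reported above) for a chance to catch missing or incorrect values.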

For background characteristics, the models achieved mean no-error proportions between 81.6% (ChatGPT-4o) and 92.4% (Claude 3 Sonnet). Inter-session consistency for background data extraction ranged from 76.3% (ChatGPT-4o) to 91.3% (Gemini 1.5 Pro). Accuracy for outcome data extraction was markedly more variable, ranging from 27.8% (Gemini 1.5 Pro) to 80.7% (Claude 3 Sonnet), and inter-session consistency for outcome data was also lower, between 44.8% (ChatGPT-4o) and 65.6% (Claude 3 Sonnet).
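The two headline metrics, the no-error proportion and inter-session consistency, are simple proportions over extracted fields. The sketch below uses invented toy data and an assumed exact-match scoring rule, so it shows the shape of the calculation rather than the study's actual scoring:

```python
# Toy illustration of the study's two accuracy metrics. The data are
# invented and exact-match scoring is an assumption.

def no_error_proportion(extracted: list[str], reference: list[str]) -> float:
    # Fraction of fields whose extracted value matches the manually
    # extracted reference standard.
    matches = sum(e == r for e, r in zip(extracted, reference))
    return matches / len(reference)

def inter_session_consistency(sessions: list[list[str]]) -> float:
    # Fraction of fields for which every session returned the same value.
    n_fields = len(sessions[0])
    stable = sum(
        len({session[i] for session in sessions}) == 1
        for i in range(n_fields)
    )
    return stable / n_fields

# 3 of 4 fields match the reference standard -> 0.75
print(no_error_proportion(["120", "drug A", "21%", "no"],
                          ["120", "drug A", "21%", "yes"]))

# Three sessions agree on 2 of 3 fields
print(inter_session_consistency([["120", "21%", "yes"],
                                 ["120", "21%", "no"],
                                 ["120", "21%", "yes"]]))
```

Note that a model can score well on one metric and poorly on the other: consistently wrong answers yield high consistency but a low no-error proportion.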

Processing times varied by task and prompt type. For standard prompts, background extraction took 29.2 to 39.7 s per article, while outcome extraction took 19.3 to 46.3 s per article. Using self-reflection prompts increased processing times, with background extraction taking 59.0 to 97.7 s and outcome extraction taking 52.7 to 107.1 s per article.

A primary limitation noted was that missing or incorrect values accounted for most extraction errors. While fabricated outputs were relatively uncommon, the disparity in outcome data accuracy suggests that LLMs can reliably support background data extraction in systematic reviews, but outcome data extraction remains challenging and requires human oversight.

Study Details

Study type: Guideline
Evidence: Level 5
Published: Apr 2026
Original Abstract
Systematic reviews depend on manual data extraction and synthesis, which are time-consuming and prone to human error. Although large language models (LLMs) have the potential to automate parts of this process, their accuracy, reproducibility, and efficiency across different models and prompt strategies remain insufficiently characterized.

This study evaluated the performance of three LLMs, ChatGPT-4o, Claude 3 Sonnet, and Gemini 1.5 Pro, for data extraction from trials addressing five clinical questions (CQs) in the Japanese Clinical Practice Guidelines for the Management of Sepsis and Septic Shock 2024 (J-SSCG 2024). Using portable document format files of eligible studies, LLMs extracted predefined background characteristics and clinical outcomes. Outputs generated using an original prompt were compared with those produced using chain-of-thought and self-reflection (SR) prompt strategies. Two independent reviewers assessed accuracy against a reference standard derived from manual extraction by the guideline members. Inter-session consistency across three sessions and processing time were also evaluated.

For background data extraction, mean no-error proportions ranged from 81.6% (ChatGPT-4o) to 92.4% (Claude 3 Sonnet) across models. For outcome data extraction, mean no-error proportions ranged from 27.8% (Gemini 1.5 Pro) to 80.7% (Claude 3 Sonnet). Missing or incorrect values accounted for most extraction errors, whereas fabricated outputs were relatively uncommon. Prompt engineering strategies resulted in only modest changes in extraction accuracy across models. Inter-session consistency ranged from 76.3% (ChatGPT-4o) to 91.3% (Gemini 1.5 Pro) for background data extraction and from 44.8% (ChatGPT-4o) to 65.6% (Claude 3 Sonnet) for outcome data extraction. Mean processing times ranged from 29.2 to 39.7 s per article for background data extraction and from 19.3 to 46.3 s for outcome data extraction using standard prompts. When SR prompts were used, processing times increased to 59.0 to 97.7 s for background data extraction and to 52.7 to 107.1 s for outcome data extraction.

LLMs can reliably support background data extraction in systematic reviews. However, outcome data extraction remains challenging, emphasizing the continued need for human oversight. Extraction performance varied across models and prompt engineering strategies.

Clinical Trial Registration: The study was registered in the University Hospital Medical Information Network (UMIN) clinical trials registry, identifier UMIN000054461.