When doctors create life-saving guidelines for sepsis and septic shock, they must sift through mountains of complex medical papers — a slow, manual process that demands intense focus. A recent study of large language models, including ChatGPT-4o and Claude 3 Sonnet, examines whether AI can help with this heavy lifting.
The results are a mixed bag. For basic background information, the AI performed quite well: Claude 3 Sonnet reached accuracy as high as 92.4% when extracting patient characteristics. This could save researchers a substantial amount of time, as the models processed these details in roughly 30 to 40 seconds per article.
However, the technology struggles when things get complicated. When extracting specific clinical outcomes, accuracy dropped sharply, with one model falling as low as 27.8%. The models were also inconsistent across sessions on these harder tasks, and most errors occurred because they missed or misread values.
While these tools can reliably support the easier parts of a medical evidence review, they are not ready to work unsupervised. For the most critical clinical results, human oversight remains essential to ensure the data is correct.