This retrospective cohort study analyzed the assessment and plan sections of clinical notes from 542 children aged 4-6 years with ADHD diagnoses in a California community pediatric network. Researchers used three generative large language models (Claude-3.5, GPT-4o, and LLaMA-3.3-70B) to identify parent training in behavior management (PTBM) recommendations, comparing performance against manual expert chart review.
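The paper does not publish its prompts or model-invocation details, so purely as an illustration, here is a minimal sketch of what the per-note classification step could look like using the Anthropic Python SDK. The prompt wording, model identifier, and output format below are all assumptions, not the study's reported method:

```python
# Hypothetical sketch: one LLM call per note's assessment-and-plan section.
# Prompt text and model version are illustrative assumptions only.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are reviewing the assessment and plan section of a pediatric clinical "
    "note for a child with ADHD. Does the clinician recommend parent training "
    "in behavior management (PTBM)? Answer YES or NO, then give a brief "
    "rationale quoting the supporting documentation."
)

def classify_note(assessment_and_plan: str) -> str:
    """Return the model's YES/NO label plus its explanatory rationale."""
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed; exact version not reported
        max_tokens=300,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": assessment_and_plan}],
    )
    return message.content[0].text
```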
The primary outcome was model performance in identifying PTBM recommendations. Claude-3.5 achieved a sensitivity of 0.89, PPV of 0.95, and F1-score of 0.92; LLaMA-3.3-70B achieved a sensitivity of 0.91, PPV of 0.89, and F1-score of 0.90; GPT-4o achieved a sensitivity of 0.82, PPV of 0.97, and F1-score of 0.89. Based on the best-performing model (Claude-3.5), 26.4% of patients (143/542) had documented PTBM recommendations.
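Because the F1-score is the harmonic mean of sensitivity and PPV, the reported figures can be cross-checked directly from each model's sensitivity/PPV pair; a minimal sketch using only the values quoted above:

```python
# F1 is the harmonic mean of sensitivity (recall) and PPV (precision); the
# reported scores are internally consistent when recomputed from each pair.
def f1(sensitivity: float, ppv: float) -> float:
    """Harmonic mean of sensitivity and PPV."""
    return 2 * sensitivity * ppv / (sensitivity + ppv)

reported = {
    "Claude-3.5":    (0.89, 0.95),  # -> 0.92
    "LLaMA-3.3-70B": (0.91, 0.89),  # -> 0.90
    "GPT-4o":        (0.82, 0.97),  # -> 0.89
}

for model, (sens, ppv) in reported.items():
    print(f"{model}: F1 = {f1(sens, ppv):.2f}")
```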
Safety and adverse events were not reported. Key limitations include the retrospective design, single healthcare network setting, and lack of reported p-values or confidence intervals. The study did not assess clinical outcomes or patient adherence to recommendations.
For practice, these findings suggest that LLMs can extract guideline-concordant PTBM recommendations from unstructured notes while providing clear explanations, supporting transparent quality-of-care measurement. However, the evidence is observational and does not establish causality or replace clinician review.
Original Abstract
Importance
Guideline-concordant care for young children with attention-deficit/hyperactivity disorder (ADHD) includes recommending parent training in behavior management (PTBM) as first-line treatment. However, assessing guideline adherence through manual chart review is time-consuming and costly, limiting scalable and timely quality-of-care measurement.
Objective
To evaluate the accuracy and explainability of large language models (LLMs) in identifying PTBM recommendations in pediatric electronic health record (EHR) notes as a scalable alternative to manual chart review.
Design, Setting, and Participants
This retrospective cohort study was conducted in a community-based pediatric healthcare network in California consisting of 27 primary care clinics. The study cohort included children aged 4-6 years with ≥2 primary care visits between 2020 and 2024 and ICD-10 diagnoses of ADHD or ADHD symptoms (n=542 patients). Clinical notes from the first ADHD-related visit were included. A stratified subset of 122 notes, including all cases with model disagreement, was manually annotated to assess model performance in identifying PTBM recommendations and rank model explanations.
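The abstract specifies only that the 122-note annotation subset was stratified and included every case of model disagreement; a hypothetical sketch of one such selection scheme follows (the DataFrame column names and the simple random fill are assumptions, not the study's reported sampling method):

```python
# Hypothetical sketch: keep every note on which the three models disagree,
# then fill the remainder of the subset with a random sample of agreement
# cases. Column names and the fill strategy are illustrative assumptions.
import pandas as pd

def build_annotation_subset(notes: pd.DataFrame, target_n: int = 122,
                            seed: int = 42) -> pd.DataFrame:
    """notes has one row per note, with boolean model predictions in
    'claude', 'gpt4o', and 'llama' columns."""
    preds = notes[["claude", "gpt4o", "llama"]]
    disagreements = notes[preds.nunique(axis=1) > 1]  # all disagreement cases
    agreements = notes.drop(disagreements.index)
    n_fill = max(target_n - len(disagreements), 0)
    fill = agreements.sample(n=min(n_fill, len(agreements)), random_state=seed)
    return pd.concat([disagreements, fill])
```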
Exposures
Assessment and plan sections of clinical notes were analyzed using three generative large language models (Claude-3.5, GPT-4o, and LLaMA-3.3-70B) to identify the presence of PTBM recommendations and generate explanatory rationales and documentation evidence.
Main Outcomes and Measures
Model performance in identifying PTBM recommendations (measured by sensitivity, positive predictive value [PPV], and F1-score) and qualitative explainability ratings of model-generated rationales (based on the QUEST framework).
Results
All three models demonstrated high performance compared with expert chart review. Claude-3.5 showed balanced performance (sensitivity=0.89, PPV=0.95, F1-score=0.92) and ranked highest in explainability. LLaMA-3.3-70B achieved sensitivity=0.91, PPV=0.89, and F1-score=0.90, ranking second for explainability. GPT-4o had the highest PPV (0.97) but the lowest sensitivity (0.82), with an F1-score of 0.89 and the lowest explainability ranking. Based on classifications from the best-performing model, Claude-3.5, 26.4% (143/542) of patients had documented PTBM recommendations at their first ADHD-related visit.
Conclusions and Relevance
LLMs can accurately extract guideline-concordant clinician recommendations for non-pharmacological ADHD treatment from unstructured clinical notes while providing clear explanations and supporting evidence. Evaluating model explainability as part of LLM implementation for medical chart review tasks can promote transparent and scalable solutions for quality-of-care measurement.