LLMs identify PTBM recommendations in ADHD notes with high performance in pediatric care
AI Can Spot ADHD Care Gaps in Seconds, Not Hours

AI-generated summary of the cited source, checked by automated accuracy review.

Key Takeaway
Consider LLMs as adjuncts for extracting PTBM recommendations from ADHD notes, but interpret results cautiously due to observational design.

This retrospective cohort study analyzed the assessment and plan sections of clinical notes from 542 children aged 4-6 years with ADHD diagnoses in a California community pediatric network. Researchers used three generative large language models (Claude-3.5, GPT-4o, and LLaMA-3.3-70B) to identify parent training and behavior management (PTBM) recommendations, comparing performance to manual expert chart review.

The primary outcome was model performance in identifying PTBM recommendations. For Claude-3.5, sensitivity was 0.89, PPV was 0.95, and F1-score was 0.92. For LLaMA-3.3-70B, sensitivity was 0.91, PPV was 0.89, and F1-score was 0.90. For GPT-4o, PPV was 0.97, sensitivity was 0.82, and F1-score was 0.89. Based on the best-performing model (Claude-3.5), 26.4% of patients (143/542) had documented PTBM recommendations.

Safety and adverse events were not reported. Key limitations include the retrospective design, single healthcare network setting, and lack of reported p-values or confidence intervals. The study did not assess clinical outcomes or patient adherence to recommendations.

For practice, the findings suggest that LLMs can extract guideline-concordant PTBM recommendations from unstructured notes and explain their reasoning clearly, which supports transparent quality-of-care measurement. However, this evidence is observational and does not establish causality or replace clinician review.

A new study shows AI can read doctor’s notes and find missed treatment steps for kids with ADHD.

Why a Doctor’s Note Matters More Than You Think

Imagine a parent brings their 5-year-old to the doctor. The child is struggling with focus, impulsivity, and behavior at home and school. The doctor diagnoses ADHD and writes a plan in the electronic health record.

But here’s the problem: That plan might include a key step—recommending parent training in behavior management—that the parent never hears about. Or the health system never knows if that recommendation was made.

Why? Because checking every doctor’s note to see if that step was followed is slow, expensive, and nearly impossible at scale.

Now, imagine an AI that can read thousands of notes in minutes and flag which ones document this recommendation and which don't.

That’s what this new study is about.

The ADHD Care Gap Most Parents Never See

ADHD is one of the most common childhood neurodevelopmental disorders. In the U.S., about 6% of children are diagnosed with ADHD.

For kids under age 6, guidelines strongly recommend starting with parent training in behavior management (PTBM)—not medication. This training teaches parents strategies to manage challenging behaviors at home.

But studies show that many young children with ADHD never get this recommendation. In this study, only about 26% of kids had it documented at their first ADHD-related visit.

Why the gap?

Doctors are busy. Notes are long and written in free text. Manually reviewing charts to see if PTBM was recommended is tedious and costly. So, health systems often skip it—meaning they can’t measure or improve care quality.

Old way: A team of humans reads through thousands of clinical notes, one by one, to check if PTBM was recommended. This takes months and costs a lot.

New way: An AI reads the notes in minutes, flags which ones mention PTBM, and explains why it made that call.

But here’s the twist: The AI isn’t just guessing. It’s instructed to look for PTBM recommendations in the “assessment and plan” section of the note and to provide evidence for its decision.

How AI Thinks Like a Doctor (Sort Of)

Think of the AI like a super-fast medical detective.

When a doctor writes a note, they often include a section called “Assessment and Plan.” This is where they summarize the diagnosis and next steps.

The AI scans this section for clues that PTBM was recommended. For example, it might look for phrases like:

  • “Recommend parent behavior training”
  • “Refer to behavioral therapy”
  • “Suggest parenting class”

It’s like a spell-checker for care quality—spotting what should be there.

But unlike a simple keyword search, the AI uses context. It understands that “parent training” might be phrased differently, and it can tell the difference between a recommendation and a general discussion.
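To see why a plain keyword search falls short, here is a minimal sketch of one. The pattern and the three example sentences are made up for illustration; they are not taken from the study's notes or prompts.

```python
import re

# Hypothetical keyword pattern: matches "parent training", "parenting class",
# "parent behavior training", and similar phrasings.
KEYWORDS = re.compile(r"parent(ing)?\s+(behavior\s+)?(training|class)", re.IGNORECASE)


def keyword_flag(text: str) -> bool:
    """Return True if the text contains one of the keyword phrases."""
    return KEYWORDS.search(text) is not None


# Invented example sentences, not real clinical notes.
notes = [
    "Recommend parent behavior training as first-line treatment.",  # a real recommendation
    "Parent training not indicated at this time.",                  # mentioned, but not recommended
    "Referred family to a positive parenting program.",             # recommended, different wording
]

flags = [keyword_flag(n) for n in notes]
for note, flag in zip(notes, flags):
    print(f"{flag!s:5}  {note}")
```

The keyword search flags the second sentence even though nothing was recommended, and it misses the third because the wording differs. Those are the kinds of cases where reading for context is expected to help.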

The study tested three AI models: Claude-3.5, GPT-4o, and LLaMA-3.3-70B. Each one read the same set of doctor’s notes and tried to identify whether PTBM was recommended.

How the Study Worked

Researchers looked at 542 children aged 4–6 years who were diagnosed with ADHD or ADHD symptoms between 2020 and 2024. All were seen in a California pediatric network with 27 clinics.

They took the first ADHD-related visit for each child and analyzed the doctor’s note.

A subset of 122 notes—including all cases where the AI models disagreed—was manually reviewed by experts. This helped measure how well the AI performed.
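Picking out the notes where the models disagreed can be sketched in a few lines. The note IDs and labels below are hypothetical, and the study's full sampling scheme for the rest of the 122 notes is not described here, so this covers only the disagreement step.

```python
def has_disagreement(labels: dict[str, bool]) -> bool:
    """True when the models did not all give the same answer for a note."""
    return len(set(labels.values())) > 1


# Hypothetical model outputs: True means "PTBM recommendation found".
predictions = {
    "note_001": {"claude": True, "gpt4o": True, "llama": True},
    "note_002": {"claude": True, "gpt4o": False, "llama": True},
    "note_003": {"claude": False, "gpt4o": False, "llama": False},
}

# Every note with any disagreement goes to expert review.
to_review = [note_id for note_id, labels in predictions.items() if has_disagreement(labels)]
print(to_review)
```

Sending every disagreement to the experts focuses their time on the hardest notes, where at least one model must be wrong.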

The goal: See if the AI could match expert human review in identifying PTBM recommendations.

All three AI models performed well, but one stood out.

Claude-3.5 was the most balanced:

  • Caught 89% of the notes in which experts found a PTBM recommendation (sensitivity)
  • Was right 95% of the time when it flagged a recommendation (positive predictive value)
  • F1-score, which combines those two measures: 0.92

LLaMA-3.3-70B was close behind:

  • 91% sensitivity
  • 89% positive predictive value
  • F1-score: 0.90

GPT-4o had the highest precision (97%) but missed more cases (82% sensitivity), and its explanations were rated lower by experts.
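The F1-score is the harmonic mean of sensitivity and positive predictive value. As a quick check, this sketch recomputes it from the sensitivity and PPV figures reported in the abstract; the function and variable names are our own.

```python
def f1_score(sensitivity: float, ppv: float) -> float:
    """Harmonic mean of sensitivity (recall) and PPV (precision)."""
    if sensitivity + ppv == 0:
        return 0.0
    return 2 * sensitivity * ppv / (sensitivity + ppv)


# (sensitivity, PPV) pairs as reported in the study abstract.
reported = {
    "Claude-3.5": (0.89, 0.95),
    "LLaMA-3.3-70B": (0.91, 0.89),
    "GPT-4o": (0.82, 0.97),
}

f1_by_model = {name: round(f1_score(s, p), 2) for name, (s, p) in reported.items()}
for name, f1 in f1_by_model.items():
    print(f"{name}: F1 = {f1:.2f}")
```

The recomputed values come out to 0.92, 0.90, and 0.89, matching the reported F1-scores. Because the harmonic mean penalizes imbalance, GPT-4o's very high PPV does not fully offset its lower sensitivity.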

Using the best model (Claude-3.5), the study found that only 26.4% of kids had documented PTBM recommendations at their first ADHD visit.

That’s a big care gap—and one that AI could help close.

But There’s a Catch

This doesn’t mean AI is ready to replace human chart review everywhere.

The study was done in one health network in California. The AI models were tested on notes from a specific time period and setting.

Also, the AI is only as good as the notes it reads. If a doctor doesn’t document a recommendation, the AI can’t find it.

What Experts Say

The study used a framework called QUEST to rate how well the AI explained its decisions. Claude-3.5 ranked highest for clarity and usefulness.

Researchers say this explainability is key. Doctors and health systems need to trust the AI’s output—and understand why it made a certain call.

This transparency could make AI more acceptable for real-world use in quality improvement.

If you’re a parent of a young child with ADHD, this research won’t change your next doctor’s visit.

But over time, it could help health systems ensure more kids get the right first-line treatment—like parent training—before considering medication.

If you’re a doctor or health system leader, this could be a tool to improve care quality without adding hours of paperwork.

This doesn’t mean the tool is in routine use yet.

The study only looked at one type of recommendation in one age group. It didn’t test whether AI could track if families actually followed through with training.

Also, the AI models are not perfect. They can make mistakes, and they rely on how well doctors document their plans.

Next steps include testing these AI tools in more health systems and with different types of clinical notes.

Researchers also want to see if AI can help track whether families actually complete parent training—not just whether it was recommended.

If successful, this approach could be used for other conditions and treatments, making quality measurement faster, cheaper, and more reliable.

For now, it’s a promising step toward smarter, more transparent health care.

Study Details

Study type: Cohort
Sample size: n = 542
Evidence: Level 3
Published: Apr 2026
Original Abstract

Importance: Guideline-concordant care for young children with attention-deficit/hyperactivity disorder (ADHD) includes recommending parent training in behavior management (PTBM) as first-line treatment. However, assessing guideline adherence through manual chart review is time-consuming and costly, limiting scalable and timely quality-of-care measurement.

Objective: To evaluate the accuracy and explainability of large language models (LLMs) in identifying PTBM recommendations in pediatric electronic health record (EHR) notes as a scalable alternative to manual chart review.

Design, Setting, and Participants: This retrospective cohort study was conducted in a community-based pediatric healthcare network in California consisting of 27 primary care clinics. The study cohort included children aged 4-6 years with ≥2 primary care visits between 2020-2024 and ICD-10 diagnoses of ADHD or ADHD symptoms (n=542 patients). Clinical notes from the first ADHD-related visit were included. A stratified subset of 122 notes, including all cases with model disagreement, was manually annotated to assess model performance in identifying PTBM recommendations and rank model explanations.

Exposures: Assessment and plan sections of clinical notes were analyzed using three generative large language models (Claude-3.5, GPT-4o, and LLaMA-3.3-70B) to identify the presence of PTBM recommendations and generate explanatory rationales and documentation evidence.

Main Outcomes and Measures: Model performance in identifying PTBM recommendations (measured by sensitivity, positive predictive value (PPV), and F1-score) and qualitative explainability ratings of model-generated rationales (based on the QUEST framework).

Results: All three models demonstrated high performance compared to expert chart review. Claude-3.5 showed balanced performance (sensitivity=0.89, PPV=0.95, and F1-score=0.92) and ranked highest in explainability. LLaMA-3.3-70B achieved sensitivity=0.91, PPV=0.89, and F1-score=0.90, ranking second for explainability. GPT-4o had the highest PPV (0.97) but lowest sensitivity (0.82), with an F1-score of 0.89 and the lowest explainability ranking. Based on classifications from the best-performing model, Claude-3.5, 26.4% (143/542) of patients had documented PTBM recommendations at their first ADHD-related visit.

Conclusions and Relevance: LLMs can accurately extract guideline-concordant clinician recommendations for non-pharmacological ADHD treatment from unstructured clinical notes while providing clear explanations and supporting evidence. Evaluating model explainability as part of LLM implementation for medical chart review tasks can promote transparent and scalable solutions for quality-of-care measurement.