Human-written medical histories rated higher than ChatGPT in paediatric orthopaedics
Human notes beat AI for pediatric knee care summaries

AI-generated summary of the cited source, checked by automated accuracy review.

Key Takeaway
Human-written medical histories remain superior to ChatGPT-generated ones in paediatric orthopaedics, especially for temporal consistency, spatial consistency, and accident description.

This prospective, randomized, blinded comparative study evaluated the quality of medical history summaries created by humans versus ChatGPT in a paediatric orthopaedic practice. Twenty consecutive paediatric patients presenting with knee problems (mean age 14.2 ± 2.3 years; 11 males, 9 females) were included. Each patient had two summaries of the same clinical encounter: one human-created and one generated by ChatGPT-4o from a transcribed audio recording of the consultation. Three independent orthopaedic specialists rated both versions.

The primary outcome was the overall rating on a 6-point Likert scale. Human summaries scored significantly higher (5.2 ± 0.8) than ChatGPT summaries (4.5 ± 0.8), with a large effect size (Cohen's d = 0.80, p < 0.001). After Bonferroni correction, temporal consistency, spatial consistency, accident description, and overall impression also significantly favored human documentation (all p < 0.001). For example, ChatGPT omitted relevant accident details in 21 of 60 evaluations (35%) and showed temporal inconsistencies in 14 of 60 evaluations (23%). No significant differences were found for writing style or documentation of previous interventions.
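The arithmetic behind these figures can be checked with a short sketch. It assumes two equally sized groups and the pooled-standard-deviation form of Cohen's d; the abstract does not say which formula the authors used, and the function names here are illustrative only. It also assumes the 60 evaluations are 20 patients each rated by 3 specialists.

```python
import math


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d for two equally sized groups, using the pooled SD.

    Assumption: this is the textbook pooled-SD form, not necessarily
    the variant the study authors applied.
    """
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


def rate_percent(count, total):
    """Share of evaluations showing a given problem, in percent."""
    return 100 * count / total


# Assumption: 60 evaluations = 20 patients x 3 raters.
patients, raters = 20, 3
evaluations = patients * raters

# Rounded summary statistics as reported: human 5.2 +/- 0.8, ChatGPT 4.5 +/- 0.8.
d = cohens_d(5.2, 0.8, 4.5, 0.8)
accident_rate = rate_percent(21, evaluations)
temporal_rate = rate_percent(14, evaluations)

print(f"Cohen's d from rounded values: {d:.2f}")
print(f"Accident detail omissions: {accident_rate:.0f}%")
print(f"Temporal inconsistencies: {temporal_rate:.0f}%")
```

With the rounded means and standard deviations, this form of d comes out at about 0.88 rather than the reported 0.80. The gap is plausibly explained by rounding of the published values or by a different d formulation (for example, one for paired ratings), so it should not be read as an error in the study.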

Safety and tolerability were not reported. A key limitation was moderate inter-rater reliability (ICC = 0.64). The study suggests that current large language models are not ready to replace human medical documentation in paediatric orthopaedic practice without careful oversight, supporting hybrid workflows where AI assists but does not replace human clinical judgement.

A study compared human-written notes to those generated by ChatGPT for children with knee problems. The research included 20 patients with an average age of 14.2 years. Three orthopaedic specialists rated the human-written summaries higher than the AI versions on overall quality. The difference was statistically significant in several categories, including how consistently the timeline of events was described and how accurately the accident was described.

While the AI summaries matched human writing style, they were less accurate in describing specific details, such as the order of events and how the injury happened. The study did not report on safety. The researchers found that human documentation was significantly better at capturing important clinical information.

The main takeaway is that large language models are not ready to replace human medical documentation in paediatric orthopaedic practice without careful oversight. The findings support hybrid workflows where AI assists but does not replace human clinical judgement. This small study suggests doctors should stay closely involved in writing and checking notes for these patients.

What this means for you:
Human doctors wrote better summaries than AI for children with knee problems in this small study.

Study Details

Study type: RCT
Evidence: Level 2
Follow-up: 27.6 mo
Published: May 2026
Original Abstract

PURPOSE: This study evaluated the quality of ChatGPT-generated medical history summaries compared to human-created documentation in a paediatric orthopaedic practice setting.

METHODS: A prospective, randomised, blinded comparative study was conducted involving 20 consecutive paediatric patients (mean age 14.2 ± 2.3 years; 11 males, 9 females) presenting with knee problems. Audio recordings of medical consultations were transcribed and processed by ChatGPT-4o (OpenAI) using standardised prompts. Three independent orthopaedic specialists evaluated both human-generated and AI-generated summaries using eight quality criteria: temporal consistency, spatial consistency, accident description, symptom accuracy, symptom specificity, previous interventions, writing style and overall impression. Each criterion was scored on a 6-point Likert scale.

RESULTS: Human-created summaries received significantly higher overall ratings (5.2 ± 0.8) compared to ChatGPT-generated summaries (4.5 ± 0.8, p < 0.001, Cohen's d = 0.80). After Bonferroni correction for multiple comparisons, statistically significant differences favouring human documentation were confirmed in four of eight criteria: temporal consistency (p < 0.001), spatial consistency (p < 0.001), accident description (p < 0.001) and overall impression (p < 0.001). No significant differences were observed for writing style and documentation of previous interventions. Inter-rater reliability was moderate (ICC = 0.64). ChatGPT demonstrated frequent temporal inconsistencies (14 of 60 evaluations, 23%) and omission of relevant accident details (21 of 60 evaluations, 35%).

CONCLUSION: While AI-generated summaries showed acceptable stylistic quality, human documentation significantly outperformed ChatGPT in critical clinical dimensions, including temporal consistency and accuracy of complex orthopaedic presentations. Current large language models are not ready to replace human medical documentation in paediatric orthopaedic practice without careful oversight. The findings support the implementation of hybrid workflows where AI assists but does not replace human clinical judgement.

LEVEL OF EVIDENCE: Level I.
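The Bonferroni correction mentioned in the abstract can be illustrated with a minimal sketch. The family-wise significance level of 0.05 is an assumption, since the abstract does not state it; the p-values are entered as the reported upper bound (p < 0.001), and the variable names are illustrative only.

```python
# Assumption: a conventional family-wise alpha of 0.05 (not stated in the abstract).
ALPHA = 0.05
N_CRITERIA = 8  # eight quality criteria were rated

# Bonferroni: divide alpha by the number of comparisons.
threshold = ALPHA / N_CRITERIA

# Reported as "p < 0.001", so 0.001 is used here as an upper bound.
reported_p_upper_bounds = {
    "temporal consistency": 0.001,
    "spatial consistency": 0.001,
    "accident description": 0.001,
    "overall impression": 0.001,
}

# A criterion stays significant if its p-value is at or below the corrected threshold.
significant = [
    name for name, p_upper in reported_p_upper_bounds.items()
    if p_upper <= threshold
]

print(f"Corrected threshold: {threshold}")
print(f"Criteria still significant: {significant}")
```

Under that assumption the corrected threshold is 0.00625, so all four criteria reported with p < 0.001 remain significant, which is consistent with the abstract.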