Human-written medical histories rated higher than ChatGPT in paediatric orthopaedics

Knee Surgery, Sports Traumatology, Arthroscopy: official journal of the ESSKA

Published May 1, 2026 · Medically reviewed May 1, 2026

Study authors: Camathias Carlo, Papp Kata, Betschart Patrick, Speth Berhard, Valderrabano Victor, Ammann Elias

By Dr. Lars van Dijk, PhD · Surgical, Procedural & Diagnostic

Key Takeaway

In paediatric orthopaedics, human-written medical histories were rated higher than ChatGPT-generated ones, particularly for temporal consistency, spatial consistency and accident description.

This prospective, randomised, blinded comparative study evaluated the quality of medical history summaries created by humans versus ChatGPT in a paediatric orthopaedic practice. Twenty consecutive paediatric patients presenting with knee problems (mean age 14.2 ± 2.3 years; 11 males, 9 females) were included. Each patient had two summaries of the same consultation: one human-created and one generated by ChatGPT-4o from a transcribed audio recording using standardised prompts. Three independent orthopaedic specialists rated both summaries.

The primary outcome was the overall rating on a 6-point Likert scale. Human summaries scored significantly higher (5.2 ± 0.8) than ChatGPT summaries (4.5 ± 0.8), with a large effect size (Cohen's d = 0.80, p < 0.001). After Bonferroni correction, four of the eight quality criteria also significantly favoured human documentation: temporal consistency, spatial consistency, accident description and overall impression (all p < 0.001). ChatGPT omitted relevant accident details in 21 of 60 evaluations (35%) and showed temporal inconsistencies in 14 of 60 evaluations (23%). No significant differences were found for writing style or documentation of previous interventions.
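The arithmetic behind these figures can be checked with a short sketch. It rests on assumptions not stated in the summary: that the 60 evaluations are 20 patients times 3 raters, that Cohen's d uses a pooled standard deviation with equal group sizes, and that the Bonferroni correction divides a conventional alpha of 0.05 across the eight criteria. The function name `cohens_d` is illustrative and is not the authors' analysis code.

```python
import math


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d using a pooled SD, assuming equal group sizes (an assumption)."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# Assumed breakdown of the 60 evaluations: 20 patients rated by 3 specialists.
patients, raters = 20, 3
evaluations = patients * raters

# Error rates reported for the ChatGPT summaries.
accident_rate = 21 / evaluations   # omitted accident details
temporal_rate = 14 / evaluations   # temporal inconsistencies

# Effect size recomputed from the rounded means and SDs in the text.
d_from_rounded = cohens_d(5.2, 0.8, 4.5, 0.8)

# Assumed per-criterion threshold: alpha = 0.05 split over 8 criteria.
bonferroni_alpha = 0.05 / 8

print(f"evaluations: {evaluations}")
print(f"accident description: {accident_rate:.0%}")
print(f"temporal consistency: {temporal_rate:.0%}")
print(f"Cohen's d from rounded values: {d_from_rounded:.3f}")
print(f"Bonferroni threshold: {bonferroni_alpha}")
```

Recomputing d from the rounded summary statistics gives about 0.875 rather than the reported 0.80. A gap of this size is what one would expect when the published means and standard deviations are rounded to one decimal place, so the sketch is a plausibility check and not a reproduction of the study's analysis.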

Safety and tolerability were not reported. A key limitation was moderate inter-rater reliability (ICC = 0.64). The study suggests that current large language models are not ready to replace human medical documentation in paediatric orthopaedic practice without careful oversight, supporting hybrid workflows where AI assists but does not replace human clinical judgement.

Study Details

Study type: RCT

Evidence: Level 2

Follow-up: 27.6 mo

Published: May 2026

PMID: 41762622

Original Abstract

PURPOSE: This study evaluated the quality of ChatGPT-generated medical history summaries compared to human-created documentation in a paediatric orthopaedic practice setting.

METHODS: A prospective, randomised, blinded comparative study was conducted involving 20 consecutive paediatric patients (mean age 14.2 ± 2.3 years; 11 males, 9 females) presenting with knee problems. Audio recordings of medical consultations were transcribed and processed by ChatGPT-4o (OpenAI) using standardised prompts. Three independent orthopaedic specialists evaluated both human-generated and AI-generated summaries using eight quality criteria: temporal consistency, spatial consistency, accident description, symptom accuracy, symptom specificity, previous interventions, writing style and overall impression. Each criterion was scored on a 6-point Likert scale.

RESULTS: Human-created summaries received significantly higher overall ratings (5.2 ± 0.8) compared to ChatGPT-generated summaries (4.5 ± 0.8, p < 0.001, Cohen's d = 0.80). After Bonferroni correction for multiple comparisons, statistically significant differences favouring human documentation were confirmed in four of eight criteria: temporal consistency (p < 0.001), spatial consistency (p < 0.001), accident description (p < 0.001) and overall impression (p < 0.001). No significant differences were observed for writing style and documentation of previous interventions. Inter-rater reliability was moderate (ICC = 0.64). ChatGPT demonstrated frequent temporal inconsistencies (14 of 60 evaluations, 23%) and omission of relevant accident details (21 of 60 evaluations, 35%).

CONCLUSION: While AI-generated summaries showed acceptable stylistic quality, human documentation significantly outperformed ChatGPT in critical clinical dimensions, including temporal consistency and accuracy of complex orthopaedic presentations. Current large language models are not ready to replace human medical documentation in paediatric orthopaedic practice without careful oversight. The findings support the implementation of hybrid workflows where AI assists but does not replace human clinical judgement.

LEVEL OF EVIDENCE: Level I.
