AI pipeline shows substantial agreement with dermatologists for grading atopic dermatitis erythema severity
This cross-sectional validation study evaluated an artificial intelligence pipeline for grading erythema, excoriation, and lichenification severity from clinical photographs of atopic dermatitis lesions. The AI was tested on 41 independent test images, and its grades were compared against two consensus reference standards: one established by two blinded dermatologists and one by two blinded physicians.
The AI demonstrated substantial agreement with dermatologist consensus for erythema grading (kappa=0.68, accuracy 80.7%). Agreement was lower when compared to physician consensus (kappa=0.34, accuracy 54.8%). For excoriation grading, the AI showed moderate agreement with dermatologists (kappa=0.62, accuracy 72.4%), with similar agreement for lichenification grading (kappa=0.59, accuracy 71.4%). In internal validation, the severity CNN component achieved 84% overall accuracy with 86% sensitivity and 87% specificity.
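The agreement statistics above can be illustrated with a minimal Python sketch of Cohen's kappa and raw accuracy for two sets of ordinal severity grades. The function names and the ten toy grades are hypothetical and are not data from the study. The summary does not say whether the reported kappa was weighted, so the unweighted form is assumed here.

```python
from collections import Counter


def accuracy(predicted, reference):
    """Fraction of items where the predicted grade equals the reference grade."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == r for p, r in zip(predicted, reference)) / len(predicted)


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of categorical grades."""
    n = len(rater_a)
    # Observed agreement is the same quantity as raw accuracy.
    p_observed = accuracy(rater_a, rater_b)
    # Chance agreement: for each grade, multiply the two raters' marginal
    # frequencies, then sum over grades.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_chance = sum(counts_a[g] * counts_b[g] for g in counts_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when both raters use a single grade")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy example only (assumed grades on a 0-3 scale, NOT study data).
ai_grades = [0, 1, 1, 2, 2, 3, 0, 1, 2, 1]
consensus_grades = [0, 1, 2, 2, 2, 3, 0, 1, 1, 1]

print(f"accuracy = {accuracy(ai_grades, consensus_grades):.2f}")
print(f"kappa    = {cohen_kappa(ai_grades, consensus_grades):.2f}")
```

Kappa discounts the agreement expected by chance from each rater's grade distribution, which is why it is lower than raw accuracy on the same data. For ordinal scales such as severity scores, a weighted kappa that penalizes near-misses less than distant disagreements is a common alternative.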
Safety and tolerability data were not reported. Key limitations include the small validation set of 41 images, underrepresentation of severe cases (score 3), potential rater bias because one rater participated in both training and validation, and a sample too small for robust stratification by skin tone or body site. The study suggests this AI pipeline has potential as a standardized, objective tool for AD severity assessment, but its clinical impact and generalizability require confirmation through prospective validation in larger, more diverse cohorts.