
AI pipeline shows substantial agreement with dermatologists for grading atopic dermatitis erythema severity

Key Takeaway
Consider AI grading of AD lesions as a potential objective tool, but recognize validation is early and limited.

This cross-sectional validation study evaluated an artificial intelligence pipeline for grading erythema, excoriation, and lichenification severity from clinical photographs of atopic dermatitis lesions. The AI was tested on 41 independent test images and compared against a consensus reference standard established by two blinded dermatologists and two blinded physicians.

The AI demonstrated substantial agreement with dermatologist consensus for erythema grading (κ = 0.68, accuracy 80.7%). Agreement was lower when compared to physician consensus (κ = 0.34, accuracy 54.8%). For excoriation grading, the AI showed moderate agreement with dermatologists (κ = 0.62, accuracy 72.4%), with similar agreement for lichenification grading (κ = 0.59, accuracy 71.4%). In internal validation, the severity CNN component achieved 84% overall accuracy with 86% sensitivity and 87% specificity.

Safety and tolerability data were not reported. Key limitations include the small validation set of 41 images, underrepresentation of severe cases (score 3), potential rater bias as one rater participated in both training and validation, and insufficient sample size for robust stratification by skin tone or body site. The study suggests this AI pipeline has potential as a standardized, objective tool for AD severity assessment, but its clinical impact and generalizability require confirmation through prospective validation in larger, more diverse cohorts.

Study Details

Study type: Cohort
Evidence: Level 3
Published: Mar 2026
Background

Atopic dermatitis (AD) is a prevalent chronic inflammatory skin disease associated with clinical, psychosocial, and economic burden. Accurate severity assessment is essential for guiding treatment escalation and monitoring disease activity, yet clinician-based scoring systems such as the Eczema Area and Severity Index (EASI) are limited by subjectivity and considerable inter- and intra-rater variability. Erythema, a key driver of AD severity grading, is particularly prone to inconsistent evaluation due to differences in ambient lighting, device quality, skin tone, and rater experience, underscoring the need for objective, reproducible assessment tools.

Objective

To develop and validate an artificial intelligence (AI) pipeline for grading erythema, excoriation, and lichenification severity in AD from clinical photographs. The study evaluated the level of agreement between AI severity ratings in each category and dermatologists, non-specialists, and a consensus reference standard, with erythema as the primary outcome of interest.

Methods

A two-stage AI pipeline was developed using EfficientNet B7 convolutional neural networks (CNNs). The first CNN was trained as a binary AD classifier on 451 AD and 601 non-AD images for lesion detection and segmentation. The second CNN was trained on 173 dermatologist-annotated AD images scored on a 0-3 ordinal scale for erythema, excoriation, and lichenification. This CNN was paired with downstream feature-extraction algorithms, such as red-channel contrast for erythema, Laws E5L5 texture maps for excoriation, and S5L5 texture maps for lichenification. In a cross-sectional validation study, 41 independent test images were scored by two blinded dermatologists and two blinded physicians. AI predictions were compared to individual rater groups and mode-derived consensus scores using weighted Cohen's kappa (κ), classification accuracy, confusion matrices, and error direction analyses.
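The two agreement measures used in the comparison, a mode-derived consensus across raters and weighted Cohen's kappa, can be sketched in a few lines. This is a minimal illustration, not the study's actual code: the abstract does not specify the weighting scheme, so quadratic weights are an assumption here, and the `consensus` helper is a hypothetical stand-in for however ties were actually resolved.

```python
from statistics import mode

def weighted_kappa(r1, r2, k=4):
    """Quadratic-weighted Cohen's kappa for two raters on a 0..k-1 ordinal scale.

    Disagreements between adjacent categories are penalized less than
    large (e.g. >=2-point) disagreements, matching the error-direction
    framing in the study.
    """
    n = len(r1)
    # observed rating cross-tabulation
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        obs[a][b] += 1
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = ((i - j) / (k - 1)) ** 2        # quadratic distance weight
            num += w * obs[i][j]                # observed weighted disagreement
            den += w * row[i] * col[j] / n      # chance-expected disagreement
    return 1.0 - num / den

def consensus(scores):
    """Mode-derived consensus score for one image across raters (hypothetical)."""
    return mode(scores)
```

For two categories, quadratic weighting reduces to the familiar unweighted kappa; on the study's 0-3 scale it down-weights the adjacent-category disagreements that the authors report were the predominant error type.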
Results

On internal validation, the severity CNN achieved 84% overall accuracy (averaged across all three attributes), 86% sensitivity, 87% specificity, and a macro-averaged area under the receiver operating characteristic curve (AUC) of 0.90. In the external comparison with blinded human raters, erythema agreement between the AI and dermatologist consensus was substantial (accuracy 80.7%; κ = 0.68), with no large (≥2-point) misclassifications. Physician consensus agreement was lower (accuracy 54.8%; κ = 0.34), reflecting greater variability among primary care physicians (non-specialists). For excoriation, AI-dermatologist agreement was moderate (accuracy 72.4%; κ = 0.62); for lichenification, agreement was similar (accuracy 71.4%; κ = 0.59). Across all features, disagreements were predominantly between adjacent severity categories. The AI also generated erythema severity grades for images of darker skin tones that dermatologists typically marked as "unable to assess."

Limitations

The validation set was small (41 images), severe cases (score 3) were underrepresented, one rater participated in both training annotation and validation scoring, and the sample size was insufficient for robust stratification by skin tone or body site.

Conclusion

The AI pipeline demonstrated dermatologist-level accuracy for erythema scoring, consistent moderate agreement for excoriation and lichenification, and a potential advantage in assessing erythema on darker skin tones. These findings support its potential as a standardized, objective tool for AD severity assessment. Prospective validation in larger, more diverse cohorts is warranted.