
AI pipeline shows substantial agreement with dermatologists for grading atopic dermatitis erythema severity

Key Takeaway
Consider AI grading of AD lesions as a potential objective tool, but recognize validation is early and limited.

This cross-sectional validation study evaluated an artificial intelligence pipeline for grading erythema, excoriation, and lichenification severity from clinical photographs of atopic dermatitis lesions. The AI was tested on 41 independent test images and compared against a consensus reference standard established by two blinded dermatologists and two blinded physicians.

The AI demonstrated substantial agreement with dermatologist consensus for erythema grading (κ = 0.68, accuracy 80.7%). Agreement was lower when compared to physician consensus (κ = 0.34, accuracy 54.8%). For excoriation grading, the AI showed moderate agreement with dermatologists (κ = 0.62, accuracy 72.4%), with similar agreement for lichenification grading (κ = 0.59, accuracy 71.4%). In internal validation, the severity CNN component achieved 84% overall accuracy with 86% sensitivity and 87% specificity.

Safety and tolerability data were not reported. Key limitations include the small validation set of 41 images, underrepresentation of severe cases (score 3), potential rater bias as one rater participated in both training and validation, and insufficient sample size for robust stratification by skin tone or body site. The study suggests this AI pipeline has potential as a standardized, objective tool for AD severity assessment, but its clinical impact and generalizability require confirmation through prospective validation in larger, more diverse cohorts.

Study Details

Study type: Cohort
Evidence: Level 3
Published: Mar 2026
Background

Atopic dermatitis (AD) is a prevalent chronic inflammatory skin disease associated with clinical, psychosocial, and economic burden. Accurate severity assessment is essential for guiding treatment escalation and monitoring disease activity, yet clinician-based scoring systems such as the Eczema Area and Severity Index (EASI) are limited by subjectivity and considerable inter- and intra-rater variability. Erythema, a key driver of AD severity grading, is particularly prone to inconsistent evaluation due to differences in ambient lighting, device quality, skin tone, and rater experience, underscoring the need for objective, reproducible assessment tools.

Objective

To develop and validate an artificial intelligence (AI) pipeline for grading erythema, excoriation, and lichenification severity in AD from clinical photographs. The study evaluated the level of agreement between AI severity ratings in each category and dermatologists, non-specialists, and a consensus reference standard, with erythema as the primary outcome of interest.

Methods

A two-stage AI pipeline was developed using EfficientNet B7 convolutional neural networks (CNNs). The first CNN was trained as a binary AD classifier on 451 AD and 601 non-AD images for lesion detection and segmentation. The second CNN was trained on 173 dermatologist-annotated AD images scored on a 0-3 ordinal scale for erythema, excoriation, and lichenification. This CNN was paired with downstream feature-extraction algorithms, such as red-channel contrast for erythema, Laws E5L5 texture maps for excoriation, and S5L5 texture maps for lichenification. In a cross-sectional validation study, 41 independent test images were scored by two blinded dermatologists and two blinded physicians. AI predictions were compared to individual rater groups and mode-derived consensus scores using weighted Cohen's kappa (κ), classification accuracy, confusion matrices, and error direction analyses.
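The two agreement measures used in the comparison, a mode-derived consensus across raters and weighted Cohen's kappa, can be sketched in a few lines. This is a minimal illustration, not the study's actual code: the abstract does not specify the weighting scheme, so quadratic weights are an assumption here, and the `consensus` helper is a hypothetical stand-in for however ties were actually resolved.

```python
from statistics import mode

def weighted_kappa(r1, r2, k=4):
    """Quadratic-weighted Cohen's kappa for two raters on a 0..k-1 ordinal scale.

    Disagreements between adjacent categories are penalized less than
    large (e.g. >=2-point) disagreements, matching the error-direction
    framing in the study.
    """
    n = len(r1)
    # observed rating cross-tabulation
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        obs[a][b] += 1
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = ((i - j) / (k - 1)) ** 2        # quadratic distance weight
            num += w * obs[i][j]                # observed weighted disagreement
            den += w * row[i] * col[j] / n      # chance-expected disagreement
    return 1.0 - num / den

def consensus(scores):
    """Mode-derived consensus score for one image across raters (hypothetical)."""
    return mode(scores)
```

For two categories, quadratic weighting reduces to the familiar unweighted kappa; on the study's 0-3 scale it down-weights the adjacent-category disagreements that the authors report were the predominant error type.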
Results

On internal validation, the severity CNN achieved 84% overall accuracy (averaged across all three attributes), 86% sensitivity, 87% specificity, and a macro-averaged area under the receiver operating characteristic curve (AUC) of 0.90. In the external comparison with blinded human raters, erythema agreement between the AI and dermatologist consensus was substantial (accuracy 80.7%; κ = 0.68), with no large (≥2-point) misclassifications. Physician consensus agreement was lower (accuracy 54.8%; κ = 0.34), reflecting greater variability among primary care physicians (non-specialists). For excoriation, AI-dermatologist agreement was moderate (accuracy 72.4%; κ = 0.62); for lichenification, agreement was similar (accuracy 71.4%; κ = 0.59). Across all features, disagreements were predominantly between adjacent severity categories. The AI also generated erythema severity grades for images of darker skin tones that dermatologists typically marked as "unable to assess."

Limitations

The validation set was small (41 images), severe cases (score 3) were underrepresented, one rater participated in both training annotation and validation scoring, and the sample size was insufficient for robust stratification by skin tone or body site.

Conclusion

The AI pipeline demonstrated dermatologist-level accuracy for erythema scoring, consistent moderate agreement for excoriation and lichenification, and a potential advantage in assessing erythema on darker skin tones. These findings support its potential as a standardized, objective tool for AD severity assessment. Prospective validation in larger, more diverse cohorts is warranted.