A retrospective register-based cohort study at an academic medical center analyzed 17,655 neonates requiring intensive care to determine whether longitudinal neonatal electronic health record (EHR) data from the first 90 days of life could predict major neuropsychiatric diagnoses by age seven. The study compared a time-aware transformer model (STraTS) against Random Forest, logistic regression, and XGBoost baselines; STraTS achieved the highest area under the precision-recall curve (AUPRC 0.171 ± 0.022). During follow-up, 8.0% (1,420) of the cohort received a major neuropsychiatric diagnosis.
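AUPRC is the study's primary metric, and it is worth noting that with an 8% positive rate, a random classifier scores roughly 0.08, so an AUPRC of about 0.17 approximately doubles the chance baseline. A minimal sketch of computing AUPRC with scikit-learn, using synthetic labels and scores (not the study's data or code):

```python
# Illustrative only: computing AUPRC (average precision) for binary risk
# scores. Labels and scores are synthetic; the ~8% prevalence mirrors the
# cohort's base rate of major neuropsychiatric diagnoses.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n = 10_000
y = rng.binomial(1, 0.08, size=n)           # ~8% positives, as in the cohort
scores = rng.normal(size=n) + y             # noisy scores correlated with label

auprc = average_precision_score(y, scores)
# A random classifier's AUPRC baseline equals the positive rate (~0.08),
# so any informative model should land above y.mean().
print(f"AUPRC: {auprc:.3f}  (chance level = {y.mean():.3f})")
```

The chance-level comparison is the key design point: unlike AUROC (chance 0.5 regardless of prevalence), AUPRC's baseline shifts with class balance, which is why it is preferred for rare outcomes like this one.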
Five predictors were consistently identified across multiple interpretability methods: birth weight, gender, Apgar score at 1 minute, umbilical serum thyroid stimulating hormone (uS-TSH), and treatment time in hospital. The researchers noted that combining multiple complementary interpretability methods yielded stable, clinically plausible risk signals, as no single explanation method provided a complete picture.
Key limitations include prediction accuracy that remains constrained by the complexity of these neuropsychiatric conditions, and an observational design that establishes association rather than causality. Funding and conflicts of interest were not reported. For practice, identifying high-risk infants early could potentially improve follow-up care, but clinicians should interpret these prediction models cautiously given their current limited accuracy and the observational nature of the evidence.
Original Abstract
Neonates requiring intensive care are at increased risk for long-term neuropsychiatric disorders. However, clinical adoption of risk prediction models remains limited when their predictions lack adequate interpretability for informed clinical decision-making. Here, we investigated whether longitudinal neonatal electronic health record (EHR) data from the first 90 days of life can support clinically meaningful interpretation of long-term risk signals for major neuropsychiatric diagnoses by age seven. In a retrospective register-based cohort of 17,655 at-risk children from an academic medical center, of whom 8.0% (1,420) received a major neuropsychiatric diagnosis during follow-up, we applied a time-aware transformer model (Self-supervised Transformer for Time-Series; STraTS) and thoroughly evaluated its predictions using three complementary interpretability approaches: perturbation-based variable importance, value-dependent effect analysis, and leave-one-out (LOO) feature attribution. STraTS achieved the highest area under the precision-recall curve (AUPRC 0.171 ± 0.022), compared with Random Forest (0.166 ± 0.008), logistic regression (0.151 ± 0.007), and XGBoost (0.128 ± 0.010). Across interpretability methods, five predictors were consistently identified: birth weight, gender, Apgar score at 1 minute, umbilical serum thyroid stimulating hormone (uS-TSH), and treatment time in hospital. Indicators of early clinical severity, including chromosomal abnormalities and neonatal cerebral-status disturbances, showed the largest risk-increasing effects. Furthermore, the model's learned vector representations of subject-specific EHR sequences formed clinically coherent latent embeddings that reflect population heterogeneity along established perinatal risk dimensions.
These findings demonstrate that combining multiple complementary interpretability methods yields stable, clinically plausible risk signals while revealing limitations that would remain undetected by any single approach, highlighting the importance of careful interpretability analysis of deep learning-based risk predictions.
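Of the three interpretability approaches the abstract names, perturbation-based variable importance is the most generic: shuffle one feature, remeasure performance, and read the performance drop as that feature's importance. A hedged sketch on synthetic tabular data (the authors' implementation operates on STraTS time-series inputs; this is a standard permutation-importance loop, not their code):

```python
# Generic permutation-based variable importance on synthetic data.
# Feature 0 carries almost all of the signal; features 1-2 do not.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(1)
n = 5_000
X = rng.normal(size=(n, 3))                      # 3 synthetic features
logits = 1.5 * X[:, 0] + 0.2 * X[:, 1] - 2.5     # feature 0 is informative
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

model = LogisticRegression().fit(X, y)
base = average_precision_score(y, model.predict_proba(X)[:, 1])

importances = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])         # destroy feature j's signal
    perm = average_precision_score(y, model.predict_proba(Xp)[:, 1])
    importances.append(base - perm)              # drop in AUPRC = importance

print(importances)                               # feature 0 should dominate
```

Because the score is measured after perturbing one feature at a time, this method shares the weakness the authors highlight: a feature whose information is duplicated elsewhere in the input can look unimportant even when it is clinically central.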
Author summary
Infants who require intensive care after birth are more likely to develop neuropsychiatric conditions such as cerebral palsy, epilepsy, autism, or intellectual disability later in childhood. Identifying high-risk infants early could improve follow-up care, but prediction models are difficult to trust without understanding how they reach their conclusions. We used hospital records from the first 90 days of life for nearly 17,700 children to train a machine learning model that processes clinical events over time, and we applied three different methods to explain what the model learned. The model grouped children in ways that reflected known risk factors such as prematurity and severity of illness, suggesting it captures meaningful patterns beyond any single variable. Importantly, no single explanation method told the complete story: one missed rare but serious conditions because it averaged across all patients, while another produced a misleading result for gestational age because the same information was already captured by birth weight. Only by comparing methods could we detect these issues. Our key contribution is not prediction accuracy, which remains limited by the complexity of these conditions, but demonstrating that multiple complementary explanation methods are needed to produce trustworthy insights when applying machine learning to clinical data.
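The gestational-age pitfall the authors describe can be reproduced in miniature. In the hedged toy below, two synthetic features stand in for "gestational age" and "birth weight" and are nearly duplicates of one another; leave-one-out (retrain-without-feature) attribution then assigns near-zero importance to each, even though the shared signal is strong. This is a generic demonstration of the phenomenon, not the study's analysis:

```python
# Toy demonstration: LOO attribution understates both members of a
# correlated feature pair. "ga" and "bw" are synthetic stand-ins for
# gestational age and birth weight; neither matches real clinical data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(2)
n = 5_000
ga = rng.normal(size=n)                     # "gestational age" (standardized)
bw = ga + 0.05 * rng.normal(size=n)         # "birth weight", ~same information
noise = rng.normal(size=n)
X = np.column_stack([ga, bw, noise])
y = rng.binomial(1, 1 / (1 + np.exp(-(1.5 * ga - 2.0))))

def auprc_without(drop):
    # Retrain without feature `drop` and score on the same data (toy setup).
    cols = [j for j in range(X.shape[1]) if j != drop]
    m = LogisticRegression().fit(X[:, cols], y)
    return average_precision_score(y, m.predict_proba(X[:, cols])[:, 1])

full = average_precision_score(
    y, LogisticRegression().fit(X, y).predict_proba(X)[:, 1])
loo = [full - auprc_without(j) for j in range(3)]

# Dropping ga OR bw barely hurts (the other covers for it), yet dropping
# both would remove the only real signal.
both_dropped = average_precision_score(
    y, LogisticRegression().fit(X[:, [2]], y).predict_proba(X[:, [2]])[:, 1])
print(loo, full - both_dropped)
```

This is exactly why comparing attribution methods matters: perturbation and LOO scores both collapse under redundancy, whereas a value-dependent effect analysis can still surface the shared risk dimension.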