Machine learning algorithms predicted stroke risk with high discrimination (AUC 0.97) in a large retrospective cohort.

Key Takeaway
Consider machine learning models for stroke risk prediction, but await external validation before clinical implementation.

This retrospective multicenter study analyzed data from 35,859 participants drawn from a high-quality health database to develop a stroke risk prediction model. The investigators fitted seven machine learning algorithms, including random forest, and compared the result with established models. The primary outcome was the area under the curve (AUC) for stroke risk prediction.
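As an illustration of what the AUC measures, the sketch below computes it as the probability that a randomly chosen participant who had a stroke receives a higher predicted risk than a randomly chosen participant who did not, with ties counted as half. It assumes the reported AUC refers to the receiver operating characteristic curve, which the abstract does not state explicitly. The function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """Return the AUC as the fraction of (positive, negative) pairs
    in which the positive case has the higher score; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = stroke, 0 = no stroke; scores are predicted risks.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking. This pairwise form is quadratic in the number of cases, so it suits small examples rather than a cohort of this size.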

The random forest model performed best among the evaluated algorithms, achieving an AUC of 0.97. Of the 35,859 participants, 781 (2.2%) experienced a stroke. The original dataset was incomplete and class-imbalanced, so the authors removed extreme outliers and noise, imputed missing values, and applied the Synthetic Minority Over-sampling Technique (SMOTE) to generate a balanced dataset for analysis.
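The oversampling step can be sketched as follows. This is a minimal, standard-library illustration of the general SMOTE idea, in which each synthetic minority case is placed at a random point on the line segment between a real minority case and one of its nearest minority neighbours. It is not the authors' implementation: the abstract does not report the number of neighbours, the features used, or the software, so the function `smote`, the parameter `k`, and the two toy features (age, systolic blood pressure) are all assumptions made for illustration.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical toy data: (age, systolic blood pressure) for stroke cases.
minority = [(67.0, 150.0), (72.0, 165.0), (58.0, 142.0), (80.0, 170.0)]
majority_count = 10  # hypothetical number of non-stroke cases

new_points = smote(minority, n_new=majority_count - len(minority), k=2)
balanced_minority = minority + new_points
print(len(balanced_minority), "minority cases after oversampling")
```

Because every synthetic case is built from existing stroke cases, the balanced dataset adds no new clinical information. That is one reason performance measured on such data calls for confirmation on untouched, real-world cohorts.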

Safety and tolerability data were not reported in this study. Key limitations include the incomplete nature of the dataset, the use of data imputation and oversampling techniques, and the elimination of extreme outliers. The authors note that future studies should further validate and optimize the current model to assess its generalizability across different populations.

The study suggests these tools may facilitate the application of clinical practice guidelines and shared decision-making. However, given the observational nature of the data and the specific preprocessing steps taken, clinicians should interpret these results with caution until external validation confirms the model's performance in diverse settings.

Study Details

Study type: Cohort
Evidence: Level 3
Published: Apr 2026
Original Abstract
Our study aims to develop a stroke risk prediction model by multiple machine learning algorithms and optimize the model as a stroke risk prediction tool. This retrospective multicenter study derived the original dataset from a high-quality health database. The dataset was incomplete and class imbalanced. Firstly, we eliminated extreme outliers and noises and imputed missing values by appropriate algorithms. We further used Synthetic Minority Over-sampling Technique to generate a balanced dataset. Secondly, we fitted seven algorithms to develop a machine learning-based prediction tool for clinical practice. Overall, 35,859 participants were included, of whom 781 (2.2%) experienced a stroke. The random forest model demonstrated the best performance with high predictive value and discrimination ability. For stroke risk prediction, the AUC of the best-performing model was 0.97. A new random forest algorithms-based stroke risk prediction model using easily obtainable data was developed and outperformed established models. Future studies should further validate and optimize the current model to assess its generalizability and promote the wide application. The utilization of proposed random forest algorithms as an individualized risk prediction model could facilitate the application of clinical practice guidelines and shared decision-making.