Machine learning algorithms predicted stroke risk with high accuracy in a large retrospective cohort.
This retrospective multicenter study analyzed data from 35,859 participants in a high-quality health database to assess stroke risk prediction. The investigation compared multiple machine learning algorithms, including random forest, against established models. The primary outcome was the area under the curve (AUC) for stroke risk prediction.
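For readers less familiar with the metric, the sketch below shows one common way to compute an AUC, assuming the summary refers to the area under the receiver operating characteristic curve (the source says only "area under the curve"). Under that assumption, the AUC equals the fraction of (stroke, no-stroke) pairs in which the model assigns the higher risk score to the participant who had a stroke. The function name `roc_auc` and the toy labels and scores are illustrative and are not taken from the study.

```python
def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of ROC AUC; tied scores count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # tie: half credit
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = stroke, 0 = no stroke; scores are predicted risks.
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
toy_auc = roc_auc(toy_labels, toy_scores)  # 6 of 9 pairs ranked correctly
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking. The nested loop is quadratic in cohort size, so a rank-based implementation would be preferable for a dataset as large as the one described here.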
The random forest model demonstrated the best performance among the evaluated algorithms, achieving an AUC of 0.97. During the observation period, 781 participants, representing 2.2% of the cohort, experienced a stroke. The dataset was initially incomplete and class-imbalanced; therefore, extreme outliers and noise were removed, missing values were imputed, and the Synthetic Minority Over-sampling Technique (SMOTE) was used to generate a balanced dataset for analysis.
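The oversampling step described above can be sketched in a few lines. This is a minimal illustration of the general idea behind the Synthetic Minority Over-sampling Technique, not the authors' pipeline: each synthetic case is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The function name `smote_like`, the neighbour count `k`, and the toy feature values are all assumptions made for illustration. The first lines also check the event rate reported in the summary.

```python
import math
import random

# Event rate from the reported counts: 781 strokes among 35,859 participants.
prevalence = 781 / 35859
prevalence_pct = round(100 * prevalence, 1)  # about 2.2%


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy example: 4 minority cases, 10 majority cases, 2 features.
toy_minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
toy_majority_count = 10
toy_synthetic = smote_like(
    toy_minority, toy_majority_count - len(toy_minority), k=2
)
```

Because every synthetic case is interpolated from existing minority cases, the technique adds no new clinical information; it only rebalances the class counts that the model sees.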
Safety and tolerability data were not reported in this study. Key limitations include the incomplete nature of the dataset, the use of data imputation and oversampling techniques, and the elimination of extreme outliers. The authors note that future studies should further validate and optimize the current model to assess its generalizability across different populations.
The study suggests that machine learning tools of this kind may facilitate the application of clinical practice guidelines and shared decision-making. However, given the observational nature of the data and the specific preprocessing steps taken, clinicians should interpret these results with caution until external validation confirms the model's performance in diverse settings.