This primary research article from Ontario, Canada, evaluated machine learning models using k-mer features derived from whole-genome sequencing (WGS) data to predict ciprofloxacin resistance in 1,424 Shigella isolates. The study compared models based on chromosomal determinants alone versus those incorporating both chromosomal and plasmid-mediated determinants, testing k-mer lengths of 11, 15, 21, and 31.
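To make the term "k-mer features" concrete, the following is a minimal sketch of counting overlapping k-mers in a DNA sequence. It is an illustration only, not the authors' pipeline: the function name `kmer_counts`, the toy sequence, and the choice to skip windows containing ambiguous bases are all assumptions made for this example.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer of length k in a DNA sequence.

    Hypothetical helper for illustration; the study's actual feature
    extraction is not described at this level of detail.
    """
    seq = sequence.upper()
    counts = Counter()
    # A sequence of length n has n - k + 1 overlapping windows of length k.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are skipped.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


if __name__ == "__main__":
    # Toy sequence, not a real gyrA, parC, or qnr fragment.
    toy = "ACGTACGTACGTTGCA"
    for k in (4, 11, 15):
        features = kmer_counts(toy, k)
        print(f"k={k}: {len(features)} distinct k-mers")
```

Shorter k-mers recur more often within and across sequences, while longer ones are more specific but sparser, which is one reason the choice of k can affect model performance.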
Although differences across k-mer lengths were modest, a k-mer length of 11 produced the highest area under the receiver operating characteristic curve (AUC) and the lowest Brier score. The Random Forest classifier achieved the most consistent performance across models. Furthermore, the inclusion of plasmid-mediated determinants improved predictive accuracy relative to chromosomal-only models.
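The two metrics named above can be sketched in a few lines. AUC measures how well predicted probabilities rank resistant isolates above susceptible ones (higher is better), while the Brier score is the mean squared difference between predicted probability and the true label (lower is better). The function names and the toy labels and probabilities below are assumptions for illustration and are not data from the study.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy example: 1 = resistant, 0 = susceptible (made-up values).
    labels = [0, 0, 1, 1]
    probs = [0.1, 0.4, 0.35, 0.8]
    print("AUC:", roc_auc(labels, probs))
    print("Brier:", brier_score(labels, probs))
```

The pairwise AUC computation is quadratic in the number of isolates, which is fine for a sketch; a rank-based formula or a library routine would be the usual choice at scale.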
The authors note that applications to Shigella remain limited, and the study does not report follow-up duration, effect sizes, absolute numbers, or p-values and confidence intervals for the primary outcomes. The research is confined to isolates from Ontario, Canada, and the generalizability beyond this context is uncertain.
In practice, the approach could be integrated into genomic antimicrobial resistance surveillance and digital public health frameworks. However, the evidence is observational and specific to bacterial isolates; it does not establish clinical efficacy or patient outcomes. The findings should be interpreted cautiously, and further validation is needed.
Antimicrobial resistance (AMR) is a growing global public health threat that complicates the treatment and control of bacterial infections. Shigella spp., a leading cause of bacterial diarrhea worldwide, have increasingly exhibited resistance to multiple antimicrobial agents commonly recommended as therapy for severe shigellosis. Although conventional antimicrobial susceptibility testing (AST) remains the reference standard, it is time-consuming and provides limited insight into the genetic mechanisms underlying resistance. Whole-genome sequencing (WGS) has emerged as a complementary approach for AMR detection by enabling direct identification of genetic resistance determinants encoded in bacterial genomes. Machine learning (ML) methods applied to genomic features such as k-mers have shown promise for predicting resistance phenotypes from WGS data; however, applications to Shigella remain limited. In this study, we developed and evaluated an interpretable ML framework for predicting ciprofloxacin resistance using k-mer features derived from WGS data of 1,424 Shigella isolates collected in Ontario, Canada, between 2018 and 2025. K-mers were extracted from known gene targets associated with ciprofloxacin resistance, including chromosomal quinolone resistance-determining regions (QRDRs: gyrA and parC) and plasmid-mediated determinants (qnr).
Supervised ML approaches were trained and compared. We evaluated the influence of k-mer length (k = 11, 15, 21, and 31) on predictive performance and model interpretability, and compared models based on chromosomal determinants alone with models incorporating both chromosomal and plasmid-mediated determinants. The Random Forest classifier achieved the most consistent performance across models. Inclusion of plasmid-mediated determinants improved predictive accuracy relative to chromosomal-only models. Although differences across k-mer lengths were modest, k = 11 produced the highest area under the receiver operating characteristic curve (AUC) and the lowest Brier score. SHAP analyses localized high-impact features within the QRDRs of gyrA and parC, supporting biological interpretability. These findings demonstrate that biologically informed k-mer-based ML models can accurately and transparently predict ciprofloxacin resistance in Shigella, supporting their potential integration into genomic AMR surveillance and digital public health frameworks.
Author summary
In this study, we used genome sequencing data to develop machine learning models that predict ciprofloxacin resistance in Shigella directly from bacterial DNA. We focused on small DNA fragments (k-mers) derived from known resistance genes and mutations. Among the approaches tested, a Random Forest model showed the most consistent performance. Combining chromosomal mutations with plasmid-mediated resistance genes improved prediction accuracy and helped identify key genetic regions associated with resistance. These findings demonstrate that machine learning applied to genomic data can accurately and interpretably predict antibiotic resistance, supporting its potential use in genomic surveillance and public health monitoring.