LLM ensemble shows potential for PCI decision support in retrospective analysis of 93 patients

AI models show promise for heart procedure decisions but need more testing

AI-generated summary of the cited source, checked by automated accuracy review.

Key Takeaway
Consider LLM ensembles for PCI support as early research; prospective validation needed before clinical use.

A retrospective cohort study at Ruijin Hospital evaluated 15 large language model (LLM) versions for percutaneous coronary intervention (PCI) decision support using data from 93 patients with moderate-to-severe coronary stenosis. The study assessed LLM behavioral patterns and performance metrics without a direct clinical comparator. Distinct behavioral patterns emerged across LLM families, with Llama-3.3-70B-Instruct making more aggressive recommendations and Grok-3 being more conservative.

The main finding was that advanced ensemble strategies surpassed individual models. An adaptive grouped ensemble achieved an F1 score of 0.921, compared to 0.807 for the best single model and 0.794 for a standard ensemble. Performance was significantly modulated by patient age, with Holm-adjusted analysis identifying performance gaps at age cut-points of 73, 75, and 76 years, and a likelihood ratio test confirming a significant age-score interaction (p = 0.00089).
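To make the F1 comparison concrete, here is a minimal Python sketch of how an F1 score is computed and how a simple majority-vote ensemble combines several models' recommendations. The labels, the three model outputs, and the names `f1_score` and `majority_vote` are all invented for illustration. The plain majority vote is a stand-in and is not the adaptive grouped ensemble the authors describe, whose details the summary does not give.

```python
def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for binary labels (1 = recommend PCI)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def majority_vote(predictions):
    """Combine per-model binary predictions by strict majority (ties give 0)."""
    n_models = len(predictions)
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*predictions)]


# Toy data, invented for illustration only (not from the study).
reference    = [1, 1, 1, 0, 0, 1, 0, 1]  # hypothetical reference labels
aggressive   = [1, 1, 1, 1, 1, 1, 0, 1]  # over-recommends the procedure
conservative = [1, 0, 1, 0, 0, 0, 0, 1]  # under-recommends the procedure
mixed        = [1, 1, 0, 1, 0, 1, 1, 1]  # errors in both directions

ensemble = majority_vote([aggressive, conservative, mixed])

for name, preds in [("aggressive", aggressive), ("conservative", conservative),
                    ("mixed", mixed), ("ensemble", ensemble)]:
    print(f"{name}: F1 = {f1_score(reference, preds):.3f}")
```

In this toy case the ensemble scores higher than each single model because the models' errors partly cancel. That outcome depends entirely on the made-up data and says nothing about the study's actual results.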

Safety and tolerability data were not reported. Key limitations include the retrospective, single-center design with a small sample size of 93 patients. The study does not report clinical outcomes from using LLM recommendations or compare them to actual clinical decisions or patient outcomes. The performance metrics (F1 scores) represent model classification agreement, not clinical efficacy.

For practice, this early evidence suggests tailored LLM ensembles are technically feasible for PCI decision support and may improve robustness. However, the authors note that multicenter prospective validation and multimodal integration are needed before any clinical deployment. Clinicians should interpret these findings as preliminary computational research rather than evidence supporting clinical implementation.

Doctors sometimes use a procedure called percutaneous coronary intervention (PCI) to open blocked heart arteries. Deciding which patients need it can be complex. Researchers wanted to see if artificial intelligence (AI) language models could help with these decisions. They tested 15 different AI models using past medical records from 93 patients at one hospital in China.

The study found that different AI models gave different recommendations. Some were more aggressive, suggesting the procedure more often, while others were more conservative. When the researchers combined several models using a tailored grouping method, this 'ensemble' performed better than any single model. However, the AI's performance was clearly affected by the patient's age, with gaps in how well it worked appearing at specific ages (73, 75, and 76).

It's important to understand what this study did and did not show. The research measured only how accurately the AI models classified cases (summarized by a statistic called the F1 score); it did not test whether following the AI's advice would lead to better patient health. No safety data were reported because the AI was not actually used on patients. The main reason for caution is that this was a small study at a single hospital using past records. The results are a technical first step, not proof that AI is ready to help make these medical decisions. Readers should see this as an early look at a possible future tool that requires much more testing in real-world, diverse hospital settings before it could be considered for use.

What this means for you:
Early AI research for heart procedure decisions shows potential but is not ready for real-world medical use.

Study Details

Study type: Cohort
Evidence: Level 3
Published: Apr 2026
Original Abstract

Background: Clinical decision-making for percutaneous coronary intervention (PCI) in patients with moderate-to-severe coronary stenosis is complex and sensitive to data completeness and guideline interpretation. We aimed to evaluate large language models (LLMs) for PCI support and to develop an ensemble framework for this complex decision setting.

Methods: In this retrospective study, 15 LLM versions were evaluated using data of 93 patients from Ruijin Hospital. A hierarchical framework was employed to assess performance across varying data inputs. To optimize accuracy, advanced grouped ensemble strategies were developed and validated via nested repeated stratified 5-fold cross-validation. Probabilistic reliability and clinical utility were quantified through calibration plots and Decision Curve Analysis (DCA). Statistical robustness was ensured by bootstrap ROC-AUC comparisons with Holm-Bonferroni adjustment and restricted cubic spline modeling to analyze age-performance interactions.

Results: Distinct behavioral patterns emerged across LLM families: Llama-3.3-70B-Instruct made more aggressive recommendations, whereas Grok-3 was more conservative. Holm-adjusted analysis identified significant performance gaps at age cut-points of 73, 75, and 76. A significant age-score interaction (LRT p = 0.00089) confirmed that patient age modulates model performance. The advanced ensemble strategies surpassed individual models, with an adaptive grouped ensemble achieving an F1 score of 0.921, compared to 0.807 for the best single model and 0.794 for a standard ensemble.

Conclusion: Tailored LLM ensembles are feasible for PCI decision support and can improve robustness. Further multicenter prospective validation and multimodal integration are needed before clinical deployment.
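For readers unfamiliar with the Holm-Bonferroni adjustment named in the Methods, the following Python sketch shows the standard step-down procedure applied to a list of raw p-values. The function name `holm_adjust` and the example p-values are hypothetical; the abstract does not report the raw p-values behind the age cut-point comparisons, so this illustrates the general technique only.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order.

    The smallest p-value is multiplied by m, the next by m - 1, and so on.
    A running maximum keeps the adjusted values monotone, and each value
    is capped at 1.0.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three comparisons (not from the study).
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

A comparison is declared significant only if its adjusted p-value stays below the chosen threshold (0.05 in this sketch), which controls the chance of a false positive across the whole family of comparisons.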