LLM ensemble shows potential for PCI decision support in retrospective analysis of 93 patients
A retrospective cohort study at Ruijin Hospital evaluated 15 large language model (LLM) versions for percutaneous coronary intervention (PCI) decision support using data from 93 patients with moderate-to-severe coronary stenosis. The study assessed LLM behavioral patterns and performance metrics without a direct clinical comparator. Distinct behavioral patterns emerged across LLM families, with Llama-3.3-70B-Instruct making more aggressive recommendations and Grok-3 being more conservative.
The main finding was that advanced ensemble strategies outperformed individual models. An adaptive grouped ensemble achieved an F1 score of 0.921, compared with 0.807 for the best single model and 0.794 for a standard ensemble. Performance varied significantly with patient age: Holm-adjusted analysis identified performance gaps at age cut-points of 73, 75, and 76 years, and a likelihood ratio test confirmed a significant age-score interaction (p = 0.00089).
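For readers unfamiliar with the Holm adjustment mentioned above, the following is a minimal sketch of the standard Holm step-down procedure for multiple comparisons. The function name and the raw p-values are hypothetical and chosen only for illustration; they are not taken from the study, and the authors' actual analysis code is not described in the source.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three age cut-points (illustrative only).
raw_p = [0.010, 0.040, 0.030]
adjusted_p = holm_adjust(raw_p)
print(adjusted_p)
```

A cut-point would be flagged only if its adjusted p-value stays below the chosen significance level, which is how the procedure limits false positives when several cut-points are tested at once.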
Safety and tolerability data were not reported. Key limitations include the retrospective, single-center design and the small sample of 93 patients. The study does not report clinical outcomes resulting from use of the LLM recommendations, nor does it compare the recommendations with actual clinical decisions or patient outcomes. The performance metrics (F1 scores) reflect model classification agreement, not clinical efficacy.
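To make concrete what an F1 score measures in this setting, here is a minimal sketch that computes F1 for binary recommendations against a set of reference labels and combines several models by simple majority vote. All labels and model outputs below are toy values invented for illustration. The majority vote is a stand-in for a generic ensemble: the study's adaptive grouped ensemble is not specified in this summary and is not reproduced here.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, where 1 means 'recommend PCI' and 0 means 'do not'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # No true positives: precision and recall are both zero.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def majority_vote(predictions):
    """Combine per-model prediction lists; a case is positive if most models say so."""
    n_models = len(predictions)
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*predictions)]


# Toy data (hypothetical): six cases, three models with different tendencies.
reference = [1, 1, 0, 1, 0, 0]
aggressive = [1, 1, 1, 1, 0, 0]    # over-recommends
conservative = [1, 0, 0, 0, 0, 0]  # under-recommends
balanced = [1, 1, 0, 1, 0, 1]

ensemble = majority_vote([aggressive, conservative, balanced])
for name, preds in [("aggressive", aggressive), ("conservative", conservative),
                    ("balanced", balanced), ("ensemble", ensemble)]:
    print(name, round(f1_score(reference, preds), 3))
```

In this toy case the errors of the individual models fall on different cases, so the vote cancels them out. The score still only measures agreement with the reference labels and says nothing about patient outcomes.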
For practice, this early evidence suggests tailored LLM ensembles are technically feasible for PCI decision support and may improve robustness. However, the authors note that multicenter prospective validation and multimodal integration are needed before any clinical deployment. Clinicians should interpret these findings as preliminary computational research rather than evidence supporting clinical implementation.