
LLM-based QA system shows feasibility for screening AI auto-contouring in male pelvic radiotherapy

Key Takeaway
Consider LLM-based QA as a first-pass screen for auto-contouring in the male pelvis, but treat these results as preliminary feasibility data.

This feasibility study evaluated LAQUA, a large language model (LLM)-based automated quality assurance (QA) system built on Gemini 2.5 Pro, for assessing the quality of AI-based auto-contouring in radiotherapy. The analysis used 20 male pelvic CT scans from an open dataset, contoured with three auto-contouring software packages (OncoStudio, a RatoGuide prototype, and syngo.via). LAQUA's ratings on a 5-point clinical scale were compared against ground-truth assessments by two board-certified radiation oncologists.

The LAQUA system showed moderate to strong agreement with expert judgments across all evaluated organs: Spearman's rank correlation coefficients ranged from 0.567 to 0.835 and quadratic weighted kappa coefficients from 0.639 to 0.804, with the rectum showing the highest correlation. With scores of ≥4 counted as 'Pass' for screening, the highest sensitivity was 0.976 (rectum) and the highest specificity was 0.933 (left femoral head). The LLM's rationales for its ratings aligned well with contouring quality, with an overall mean score of 1.70 ± 0.48 out of 2, and 155 of 291 outputs (about 53%) received perfect scores across all four evaluation criteria.
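To make the agreement and screening figures concrete, here is a minimal Python sketch of how a quadratic weighted kappa and a dichotomized sensitivity/specificity could be computed from paired 1-to-5 scores. The function names, the toy scores, and the choice to treat 'Pass' as the positive class are illustrative assumptions; the study does not describe its implementation.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score=1, max_score=5):
    """Quadratic weighted kappa for two lists of integer ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    k = max_score - min_score + 1

    # Observed confusion matrix: rows are rater A, columns are rater B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    # Marginal counts, used for the chance-expected matrix.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - weighted_observed / weighted_expected


def screening_metrics(llm_scores, expert_scores, cutoff=4):
    """Sensitivity and specificity after dichotomizing scores at `cutoff`.

    Assumption: 'Pass' (score >= cutoff) is the positive class and the
    expert score is the reference standard.
    """
    tp = fn = tn = fp = 0
    for llm, expert in zip(llm_scores, expert_scores):
        llm_pass = llm >= cutoff
        expert_pass = expert >= cutoff
        if expert_pass and llm_pass:
            tp += 1
        elif expert_pass and not llm_pass:
            fn += 1
        elif not expert_pass and not llm_pass:
            tn += 1
        else:
            fp += 1  # LLM passed a contour the expert failed (overestimation)
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


if __name__ == "__main__":
    # Toy scores for illustration only; these are not study data.
    llm = [5, 4, 4, 3, 2, 5]
    expert = [5, 4, 3, 3, 2, 4]
    print(quadratic_weighted_kappa(llm, expert))
    print(screening_metrics(llm, expert, cutoff=4))
```

In this framing, an LLM that overrates contours shows up as false positives, which lowers specificity while leaving sensitivity high. That is the pattern the overestimation caveat describes.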

Key limitations include potential overestimation bias with risk of missing 'Fail' cases, wide 95% confidence intervals for screening performance with a cutoff of ≥3 as 'Pass', and the small sample size of only 20 CT scans. The study did not report safety or tolerability data, as this was a technical feasibility assessment. While the system shows promise as a primary screening tool to reduce clinical workload by filtering acceptable contours, these findings are preliminary and limited to male pelvic anatomy. Further validation in larger, more diverse clinical populations is needed before considering clinical implementation.
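The width of a confidence interval depends heavily on how many cases fall into each group. As a rough illustration, the sketch below computes a Wilson score interval for a proportion. The study does not state which interval method it used, so the method and the counts here are assumptions chosen only to show how small denominators widen the interval.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical counts, not study data: same proportion, different sample sizes.
    for successes, total in [(9, 10), (90, 100)]:
        low, high = wilson_interval(successes, total)
        print(f"{successes}/{total}: {low:.3f} to {high:.3f}")
```

With only a handful of 'Fail' cases in a subgroup, a specificity estimate can look high while its interval still spans a large range, which is one reason results from 20 scans call for caution.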

Study Details

Evidence Level: 5
Published: Apr 2026
Original Abstract
Purpose: Manual verification of AI-based auto-contouring is labor-intensive and prone to fatigue-related errors. This study developed the large language model (LLM)-based automated Quality Assurance (QA) for auto-contouring (LAQUA) system using a multimodal LLM, Gemini 2.5 Pro, and evaluated its feasibility as a clinical primary screening tool to streamline the QA workflow.

Methods: Twenty male pelvic CT scans from an open dataset were utilized. Three distinct auto-contouring software packages (OncoStudio, RatoGuide prototype, and syngo.via) were evaluated. Auto-contouring results for each slice were exported as PDF images with overlaid contours and input into Gemini 2.5 Pro. The LLM was instructed to rate the contour quality on a 5-point clinical scale (5: Optimal; 4: Acceptable; 3: Suboptimal; 2: Unacceptable, redraw from scratch; 1: Unacceptable, organ not detected). Using evaluations by two board-certified radiation oncologists as ground truth, Spearman's rank correlation coefficients (ρ) and weighted kappa coefficients (κ) were calculated. Additionally, to assess screening performance, sensitivity and specificity were calculated by dichotomizing the scores into "Pass" and "Fail" using two different cutoffs (scores ≥3 and ≥4 as "Pass"). Finally, the alignment of the rationales provided by the LLM with the auto-contouring quality was evaluated by two board-certified radiation oncologists. This was conducted using a Likert scale assessing four domains (error detection, hallucination, clinical relevance, and anatomical understanding), each scored out of 2 points.

Results: The LAQUA system demonstrated moderate to strong agreement with expert judgments across all evaluated organs (ρ: 0.567 to 0.835; quadratic weighted κ: 0.639 to 0.804), with the rectum showing the highest correlation. Regarding screening performance, a cutoff of ≥3 as "Pass" achieved the highest sensitivity and specificity in specific subgroups, but with wide 95% confidence intervals (CIs). A cutoff of ≥4 as "Pass" narrowed the CIs, yielding the highest sensitivity in the rectum (0.976) and the highest specificity in the left femoral head (0.933). Qualitatively, the LLM's rationales achieved an overall mean score of 1.70 ± 0.48 (out of 2), with 155 of 291 outputs receiving perfect scores across all criteria.

Conclusions: The LAQUA system demonstrated substantial agreement with expert evaluations in AI-based auto-contouring quality assessment. While potential overestimation bias (risk of missing "Fail" cases) warrants caution, the observed sensitivity suggests its feasibility as a primary screening QA tool to efficiently filter acceptable contours, thereby reducing the clinical workload.