LLM-based QA system shows feasibility for screening AI auto-contouring in male pelvic radiotherapy
This feasibility study evaluated a large language model (LLM)-based automated quality assurance (LAQUA) system for assessing the quality of AI-based auto-contouring in radiotherapy. The analysis used 20 male pelvic CT scans from an open dataset, contoured using three different auto-contouring software packages. The LAQUA system's evaluations were compared against reference assessments made by two board-certified radiation oncologists.
The LAQUA system demonstrated moderate to strong agreement with expert judgments across all evaluated organs, with Spearman's rank correlation coefficients ranging from 0.567 to 0.835 and quadratic weighted kappa coefficients from 0.639 to 0.804. Using a cutoff of ≥4 as 'Pass' for screening, the system achieved a sensitivity of 0.976 for the rectum and specificity of 0.933 for the left femoral head. The LLM's rationales for its quality assessments showed good alignment with contouring quality, with an overall mean score of 1.70 ± 0.48 out of 2, and 155 of 291 outputs receiving perfect scores across all evaluation criteria.
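The agreement and screening metrics named above can be illustrated with a minimal Python sketch. It is not the study's code: the function names, the 1 to 5 rating scale, the choice of 'Pass' as the positive class, and the example scores are all assumptions made for illustration, and none of the numbers come from the study.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Cohen's kappa with quadratic disagreement weights for ordinal scores."""
    n = len(rater_a)
    n_cat = max_rating - min_rating + 1
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            weighted_observed += weight * observed[(i, j)]
            # Expected count if the two raters scored independently
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    return 1.0 - weighted_observed / weighted_expected


def sensitivity_specificity(expert, model, cutoff):
    """Binarise scores at `cutoff` (score >= cutoff is 'Pass').

    Assumption: 'Pass' is treated as the positive class and the expert
    score is treated as the reference standard.
    """
    pairs = [(e >= cutoff, m >= cutoff) for e, m in zip(expert, model)]
    tp = sum(1 for e, m in pairs if e and m)
    fn = sum(1 for e, m in pairs if e and not m)
    tn = sum(1 for e, m in pairs if not e and not m)
    fp = sum(1 for e, m in pairs if not e and m)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical scores on an assumed 1-5 scale (not data from the study)
expert_scores = [5, 4, 3, 2, 4, 1]
model_scores = [5, 4, 4, 2, 3, 1]

kappa = quadratic_weighted_kappa(expert_scores, model_scores, 1, 5)
sens, spec = sensitivity_specificity(expert_scores, model_scores, cutoff=4)
print(f"QWK={kappa:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

Quadratic weights penalise a two-point disagreement four times as heavily as a one-point disagreement, which suits ordinal quality scores where near misses matter less than large discrepancies.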
Key limitations include a potential overestimation bias that risks missing 'Fail' cases, wide 95% confidence intervals for screening performance when a cutoff of ≥3 is used as 'Pass', and a small sample of only 20 CT scans. Because this was a technical feasibility assessment, the study did not report safety or tolerability data. The system shows promise as a primary screening tool that could reduce clinical workload by filtering acceptable contours, but these findings are preliminary and limited to male pelvic anatomy. Further validation in larger, more diverse clinical populations is needed before clinical implementation is considered.
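To show why a small sample produces wide confidence intervals for a proportion such as sensitivity or specificity, here is a minimal Python sketch of the Wilson score interval. The summary does not state which interval method the study used, so the choice of Wilson is an assumption, and the counts below are hypothetical rather than taken from the study.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / n
    denom = 1.0 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts: the same observed proportion (0.9) at two sample sizes
small_lo, small_hi = wilson_interval(9, 10)
large_lo, large_hi = wilson_interval(90, 100)
print(f"n=10:  ({small_lo:.3f}, {small_hi:.3f})")
print(f"n=100: ({large_lo:.3f}, {large_hi:.3f})")
```

With the same observed proportion, the interval for the smaller sample is markedly wider, which is why the study's screening estimates need confirmation on more cases.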