LLM-based QA system shows feasibility for screening AI auto-contouring in male pelvic radiotherapy

Can an AI assistant help doctors check AI-drawn cancer treatment maps?

medRxiv · Published April 3, 2026 · Study authors: Tozuka, R.; Akita, T.; Matsuda, M.; Tanno, H.; Saito, M.; Nemoto, H.; Mitsuda, K.; Kadoya, N.; Jingu… · Editorial oversight: Dr. Lars van Dijk, PhD · Surgical, Procedural & Diagnostic

AI-generated summary of the cited source, checked by automated accuracy review.

Key Takeaway

Consider LLM-based QA for screening auto-contours in the male pelvis, but recognize that these are preliminary feasibility data.

This feasibility study evaluated a large language model (LLM)-based automated Quality Assurance (LAQUA) system, built on the multimodal LLM Gemini 2.5 Pro, for assessing AI-based auto-contouring quality in radiotherapy. The analysis used 20 male pelvic CT scans from an open dataset, contoured by three different auto-contouring software packages (OncoStudio, a RatoGuide prototype, and syngo.via). The LLM rated each contour on a 5-point clinical scale, and its ratings were compared against ground truth assessments by two board-certified radiation oncologists.

The LAQUA system demonstrated moderate to strong agreement with expert judgments across all evaluated organs, with Spearman's rank correlation coefficients ranging from 0.567 to 0.835 and quadratic weighted kappa coefficients from 0.639 to 0.804; the rectum showed the highest correlation. With a score of ≥4 counted as 'Pass' for screening, the highest sensitivity was 0.976 (rectum) and the highest specificity was 0.933 (left femoral head). The LLM's written rationales showed good alignment with contouring quality, with an overall mean score of 1.70 ± 0.48 out of 2, and 155 of 291 outputs received perfect scores across all four evaluation criteria (error detection, hallucination, clinical relevance, and anatomical understanding).
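To make the screening figures concrete, here is a minimal Python sketch of how 5-point scores can be dichotomized at a cutoff and turned into sensitivity and specificity. It rests on several assumptions the abstract does not spell out: that 'Pass' is the positive class, that the same cutoff is applied to the LLM and expert scores, and that the two oncologists' ratings have already been combined into one score per contour. The function name `screening_metrics` and the example scores are made up for illustration and are not study data.

```python
def screening_metrics(llm_scores, expert_scores, cutoff):
    """Dichotomize 1-5 quality scores at `cutoff`; return (sensitivity, specificity).

    Assumption: 'Pass' (score >= cutoff) is the positive class, and the
    expert score is the reference standard.
    """
    if len(llm_scores) != len(expert_scores):
        raise ValueError("score lists must have the same length")

    tp = fn = fp = tn = 0
    for llm, expert in zip(llm_scores, expert_scores):
        llm_pass = llm >= cutoff
        expert_pass = expert >= cutoff
        if expert_pass and llm_pass:
            tp += 1  # acceptable contour, correctly passed
        elif expert_pass:
            fn += 1  # acceptable contour, wrongly failed
        elif llm_pass:
            fp += 1  # poor contour, wrongly passed (a missed 'Fail')
        else:
            tn += 1  # poor contour, correctly failed

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical scores for eight contours (not taken from the study).
expert = [5, 4, 4, 3, 2, 5, 3, 1]
llm = [5, 4, 5, 4, 2, 4, 3, 2]

for cutoff in (3, 4):
    sens, spec = screening_metrics(llm, expert, cutoff)
    print(f"cutoff >= {cutoff}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

In this toy data the LLM rates one 'Suboptimal' contour as 'Acceptable', so at a cutoff of ≥4 it passes a contour the expert would fail. That is the kind of overestimation that lowers specificity.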

Key limitations include potential overestimation bias with risk of missing 'Fail' cases, wide 95% confidence intervals for screening performance with a cutoff of ≥3 as 'Pass', and the small sample size of only 20 CT scans. The study did not report safety or tolerability data, as this was a technical feasibility assessment. While the system shows promise as a primary screening tool to reduce clinical workload by filtering acceptable contours, these findings are preliminary and limited to male pelvic anatomy. Further validation in larger, more diverse clinical populations is needed before considering clinical implementation.
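The link between small samples and wide confidence intervals can be illustrated with a short calculation. The sketch below uses the Wilson score interval for a proportion; this is an assumption chosen for illustration, since the abstract does not say which interval method the authors used, and the counts are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%).

    Assumption: this method is used only for illustration; the study's
    actual CI method is not stated in the abstract.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Same observed proportion (0.90), very different sample sizes (hypothetical).
for successes, n in ((9, 10), (90, 100)):
    low, high = wilson_interval(successes, n)
    print(f"{successes}/{n}: 95% CI {low:.2f} to {high:.2f}")
```

With 9 of 10 correct the interval runs from about 0.60 to 0.98, while 90 of 100 gives about 0.83 to 0.94. A subgroup with only a handful of 'Fail' contours therefore cannot pin down specificity precisely, however good the point estimate looks.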

Imagine a doctor spending hours meticulously checking maps of a patient's organs, drawn by an AI, to prepare for precise radiation therapy. It's a critical but time-consuming task. A new study tested whether a different kind of AI, a large language model, could act as a first-line assistant to help with this quality check.

The research used 20 male pelvic CT scans. The AI assistant, called LAQUA, was asked to rate the quality of organ outlines created by three auto-contouring software packages. When compared to the judgments of two expert radiation oncologists, the AI's ratings showed moderate to strong agreement. When ratings were sorted into "pass" and "fail", it was very good at correctly identifying acceptable contours for the rectum and at spotting problematic ones for the left femoral head. The AI also provided written explanations for its ratings, which the expert reviewers found were generally well aligned with the actual contour quality.

However, this is a very early, small-scale look at the idea. The study used only 20 scans from one part of the body (the male pelvis), so we don't know if it would work for other cancers or for female patients. The researchers themselves caution that their method tends to rate contours too favorably and might miss some poor-quality ones, and some performance metrics had wide confidence intervals, meaning we need more data to be sure. No safety data were reported, because this was a technical analysis of scans, not a treatment trial. The core finding is simply that this approach is feasible enough to explore further as a potential time-saver for busy clinicians.

What this means for you:

An AI checker showed promise for helping doctors review AI-drawn radiation maps, but it's an early test.

Study Details

Evidence: Level 5

Published: Apr 2026

Original Abstract

Purpose: Manual verification of AI-based auto-contouring is labor-intensive and prone to fatigue-related errors. This study developed the large language model (LLM)-based automated Quality Assurance (QA) for auto-contouring (LAQUA) system using a multimodal LLM, Gemini 2.5 Pro, and evaluated its feasibility as a clinical primary screening tool to streamline the QA workflow.

Methods: Twenty male pelvic CT scans from an open dataset were utilized. Three distinct auto-contouring software packages (OncoStudio, RatoGuide prototype and syngo.via) were evaluated. Auto-contouring results for each slice were exported as PDF images with overlaid contours and input into Gemini 2.5 Pro. The LLM was instructed to rate the contour quality on a 5-point clinical scale (5: Optimal; 4: Acceptable; 3: Suboptimal; 2: Unacceptable, redraw from scratch; 1: Unacceptable, organ not detected). Using evaluations by two board-certified radiation oncologists as ground truth, Spearman's rank correlation coefficients (ρ) and weighted kappa coefficients (κ) were calculated. Additionally, to assess screening performance, sensitivity and specificity were calculated by dichotomizing the scores into "Pass" and "Fail" using two different cutoffs (scores ≥3 and ≥4 as "Pass"). Finally, the alignment of the rationales provided by the LLM with the auto-contouring quality was evaluated by two board-certified radiation oncologists. This was conducted using a Likert scale assessing four domains (error detection, hallucination, clinical relevance, and anatomical understanding), each scored out of 2 points.

Results: The LAQUA system demonstrated moderate to strong agreement with expert judgments across all evaluated organs (ρ: 0.567–0.835; quadratic weighted κ: 0.639–0.804), with the rectum showing the highest correlation. Regarding screening performance, a cutoff of ≥3 as "Pass" achieved the highest sensitivity and specificity in specific subgroups, but with wide 95% confidence intervals (CIs). A cutoff of ≥4 as "Pass" narrowed the CIs, yielding the highest sensitivity in the rectum (0.976) and the highest specificity in the left femoral head (0.933). Qualitatively, the LLM's rationales achieved an overall mean score of 1.70 ± 0.48 (out of 2), with 155 of 291 outputs receiving perfect scores across all criteria.

Conclusions: The LAQUA system demonstrated substantial agreement with expert evaluations in AI-based auto-contouring quality assessment. While potential overestimation bias (risk of missing "Fail" cases) warrants caution, the observed sensitivity suggests its feasibility as a primary screening QA tool to efficiently filter acceptable contours, thereby reducing the clinical workload.
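For readers unfamiliar with the agreement statistic in the abstract, the following is a minimal Python sketch of quadratic weighted kappa on a 1 to 5 scale. It is a generic textbook formulation, not the authors' code; the function name `quadratic_weighted_kappa` and the example ratings are hypothetical, and it assumes one reference score per contour.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score=1, max_score=5):
    """Quadratic weighted kappa between two lists of integer ratings.

    Disagreements are penalized by the squared distance between scores,
    so a 5-vs-1 mismatch costs far more than a 5-vs-4 mismatch.
    """
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("rating lists must be non-empty and of equal length")

    k = max_score - min_score + 1
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = ((i - j) ** 2) / ((k - 1) ** 2)
            expected = row_totals[i] * col_totals[j] / n  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        return float("nan")  # undefined when both raters use one identical score
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical ratings for ten contours (not study data).
expert = [5, 5, 4, 4, 3, 3, 2, 2, 1, 4]
llm = [5, 4, 4, 5, 4, 3, 3, 2, 1, 4]
print(f"quadratic weighted kappa: {quadratic_weighted_kappa(llm, expert):.3f}")
```

A value of 1 means perfect agreement and 0 means agreement no better than chance. Because the weights grow with the square of the score difference, an LLM that is often off by one point can still score well, which is worth keeping in mind when reading kappa alongside the Pass/Fail screening results.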
