Preclinical study tests AI pipeline for glaucoma detection using Harvard dataset
This preclinical study describes a two-tier diagnostic pipeline for glaucoma detection, applied to the Harvard Glaucoma Detection and Progression dataset. The first tier is a semi-supervised EfficientNetV2S classifier. The second tier is a multi-agent system built on MedGemma 4B, in which three specialist agents deliberate over three rounds. The comparator was the classifier alone.
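The two-tier routing described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the summary does not give the uncertainty measure, the gating threshold, or how the agents' opinions are combined, so the `margin` value, the `Agent` signature, and the final-round majority vote are all assumptions made for the example.

```python
from typing import Callable, List, Sequence

# Hypothetical agent signature: (case_id, previous-round votes) -> vote.
# True means "glaucoma". Real agents would wrap a vision language model.
Agent = Callable[[str, Sequence[bool]], bool]


def is_uncertain(prob: float, margin: float = 0.3) -> bool:
    """Assumed gate: flag a case when the classifier probability is near 0.5."""
    return abs(prob - 0.5) < margin


def deliberate(case_id: str, agents: List[Agent], rounds: int = 3) -> bool:
    """Run the agents for a fixed number of rounds.

    Each agent sees the votes from the previous round (empty in round one).
    The result is an assumed majority vote over the final round.
    """
    votes: List[bool] = []
    for _ in range(rounds):
        votes = [agent(case_id, tuple(votes)) for agent in agents]
    return sum(votes) > len(votes) / 2


def diagnose(case_id: str, prob: float, agents: List[Agent]) -> bool:
    """Tier 1 decides confident cases; uncertain cases go to tier 2."""
    if is_uncertain(prob):
        return deliberate(case_id, agents)
    return prob >= 0.5
```

With stub agents that always return a fixed vote, `diagnose("case-1", 0.55, agents)` would be routed to deliberation, while a probability of 0.95 would be decided by the classifier alone.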
The authors report that the classifier alone had an AUC of 0.84 on 150 held-out test patients. On 124 flagged cases, the agent system achieved 100% sensitivity (55 glaucoma cases detected, zero missed) and 89.5% overall accuracy (111 correct out of 124), compared to the classifier's 73.4% accuracy (91 correct out of 124). An uncertainty analysis of the classifier's predictions showed 96.3% accuracy for confident predictions (n=27) and 74.0% for uncertain predictions (n=123). The net improvement from the agent system was 20 cases (32 classifier errors corrected, 12 new errors introduced).
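The reported percentages and the net gain can be checked with simple arithmetic. The sketch below uses only the counts quoted in the summary; the variable names are illustrative. The summary does not say how the 124 flagged cases relate to the 123 uncertain predictions, so the two figures are kept separate here.

```python
# Counts as reported in the summary.
flagged = 124              # cases routed to the agent system
agent_correct = 111        # agent system correct on flagged cases
classifier_correct = 91    # classifier correct on the same flagged cases
fixed, new_errors = 32, 12
confident_n, uncertain_n, test_patients = 27, 123, 150

agent_acc = agent_correct / flagged
classifier_acc = classifier_correct / flagged
net_gain = fixed - new_errors

print(f"agent accuracy:      {agent_acc:.1%}")       # 89.5%
print(f"classifier accuracy: {classifier_acc:.1%}")  # 73.4%
print(f"net gain:            {net_gain} cases")      # 20 cases

# The net gain matches the difference in correct counts,
# and the two uncertainty groups cover all held-out patients.
assert net_gain == agent_correct - classifier_correct
assert confident_n + uncertain_n == test_patients
```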
The authors acknowledge key limitations: the results come from a single training run without variance estimates, the evidence is preliminary, and the findings are drawn from one specific dataset and may not generalize. No safety data were reported. The stated practice relevance is that uncertainty-gated routing to vision language model agents can improve diagnostic accuracy on cases where automated classifiers are least reliable. The authors make no causal claims, and the results should be interpreted as preliminary.