Frozen visual encoders compared for thyroid nodule classification in the ThyroidEffi 1.0 dataset

Will AI Make Your Thyroid Biopsy Report More Trustworthy?

Frontiers in Medicine · Published April 13, 2026 · Editorial oversight: Dr. Amelia Tan, PhD · Internal Medicine & Chronic Disease

AI-generated summary of the cited source, checked by automated accuracy review.

Key Takeaway

Select an encoder by weighing calibration and Suspicious-class (Bethesda V) sensitivity alongside aggregate accuracy, not aggregate accuracy alone.

This study used the ThyroidEffi 1.0 dataset (N = 1,804) to compare frozen visual encoders for classifying thyroid nodules. The dataset included cases categorized as Bethesda II (Benign), Bethesda V (Suspicious), and Bethesda VI (Malignant). Four encoders were compared: MedSigLIP, a medical image–text pretrained encoder, and three ImageNet-pretrained encoders (ResNet50, EfficientNet-B0, and ViT-Base). Each was evaluated with five-fold stratified cross-validation and a lightweight multilayer perceptron head. The primary outcome was Macro-F1, with secondary outcomes including balanced accuracy, Expected Calibration Error (ECE), and recall for the Suspicious class.

Regarding aggregate classification accuracy, EfficientNet-B0 achieved a Macro-F1 of 0.845 ± 0.021, outperforming ViT (0.817 ± 0.020) with a p-value less than 0.05. The difference between EfficientNet-B0 and MedSigLIP (0.836 ± 0.019) was not statistically significant after multiple comparison correction. ResNet50 scored 0.829 ± 0.015. For the recall of the Suspicious class, MedSigLIP achieved 0.808, the highest among the models. In terms of Expected Calibration Error, MedSigLIP showed the best performance at 0.025, compared to 0.044–0.082 for the general-purpose encoders.
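The original abstract names McNemar's test as the significance test and mentions a multiple comparison correction without saying which one. The sketch below shows one way such a pairwise comparison could be run: an exact two-sided McNemar test on the images where two models disagree, followed by a Bonferroni correction (an assumed stand-in for the unnamed correction). With four models there are six possible pairs. The function names and the disagreement counts are hypothetical and are not taken from the study.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b = images model A classified correctly and model B did not
    c = images model B classified correctly and model A did not
    Images where both models agree do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # the two models never disagree
    k = min(b, c)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical disagreement counts (NOT from the study).
pairs = {"model A vs model B": (30, 12), "model A vs model C": (20, 15)}
raw = {name: mcnemar_exact(b, c) for name, (b, c) in pairs.items()}
adjusted = dict(zip(raw, bonferroni(list(raw.values()))))
```

A difference that looks significant on its raw p-value can stop being significant once the correction is applied. That pattern would be consistent with the reported EfficientNet-B0 versus MedSigLIP result.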

Adverse events and discontinuations do not apply to a computational model evaluation, so no safety or tolerability data were reported. A key limitation is the lack of prospective validation in real-world triage workflows. MedSigLIP did not show a statistically significant advantage in aggregate classification accuracy over the best ImageNet-based model (EfficientNet-B0). For practice, the authors suggest that encoder selection be guided by a joint view of discrimination and safety, particularly calibration and Bethesda V sensitivity, rather than aggregate accuracy alone.

The waiting room moment

You feel a lump in your neck. Your doctor orders a fine-needle biopsy. A few days later you get the call.

"The results are indeterminate."

That word — indeterminate — means the lab sees cells that might be cancer but cannot be sure. It is stressful. It often leads to more tests or surgery.

A new study asks a hopeful question. Can AI make these cloudy answers clearer?

Thyroid nodules are common. Most are harmless, but some are cancer. To tell them apart, doctors use a needle biopsy (FNAB — a thin needle that pulls out a few cells).

The cells are sorted into six Bethesda categories. Category II is benign. Category VI is malignant. Category V — "suspicious for malignancy" — is the gray zone.

Bethesda V is the hardest call. And it is the one where getting a second opinion matters most.

AI tools have started helping pathologists read these slides. But not all AI is the same.

The old way versus the new way

For years, AI image tools have been trained on ImageNet — a huge collection of everyday photos: cats, cars, chairs. These general tools were then fine-tuned for medical images.

It worked surprisingly well. But here's the twist. A new group of AI models is trained from the start on medical images paired with medical text. One of these is MedSigLIP.

The question is simple. Does a medical-specialist AI beat a general AI at reading thyroid slides?

Think of it like hiring a reader. One reader has looked at a million pictures of all kinds. The other has only studied medical atlases.

You might guess the medical reader wins. But the general reader has seen so much more that they have picked up strong habits.

In this study, researchers "froze" each AI's core. That means they did not retrain it. They just added a small decision layer on top — like handing the reader a checklist and asking for a verdict.
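For readers who code, the "frozen core plus small decision layer" idea can be sketched in a few lines. This is a toy illustration under loud assumptions, not the study's setup: the encoder here is a tiny fixed function rather than a deep network, the head is a single linear softmax layer rather than the multilayer perceptron the abstract describes, and the data and all names are made up. The point is only that the encoder's weights never change while the head is trained.

```python
import math

# Toy stand-in for a pretrained encoder. These weights are fixed ("frozen"):
# nothing below ever updates them. A real encoder would be a deep network.
FROZEN_W = (
    (1.0, 0.0, 0.0, 0.0),
    (0.0, 1.0, 0.0, 0.0),
    (0.0, 0.0, 1.0, 0.0),
)

def frozen_encode(x):
    """Turn a 4-number toy 'image' into 3 features using the fixed weights."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in FROZEN_W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def head_probs(W, b, f):
    """Class probabilities from the trainable head for one feature vector."""
    return softmax([sum(w * v for w, v in zip(row, f)) + bk for row, bk in zip(W, b)])

def train_head(features, labels, n_classes, lr=0.5, epochs=200):
    """Fit only the head (one linear softmax layer) by gradient descent."""
    d = len(features[0])
    W = [[0.0] * d for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for f, y in zip(features, labels):
            p = head_probs(W, b, f)
            for k in range(n_classes):
                grad = p[k] - (1.0 if k == y else 0.0)  # cross-entropy gradient
                b[k] -= lr * grad
                for j in range(d):
                    W[k][j] -= lr * grad * f[j]
    return W, b

def predict(W, b, f):
    probs = head_probs(W, b, f)
    return max(range(len(probs)), key=lambda k: probs[k])

# Hypothetical toy data: 0 = benign, 1 = suspicious, 2 = malignant.
images = [
    [2.0, 0.1, 0.0, 0.3], [1.8, 0.0, 0.2, 0.1],
    [0.1, 2.0, 0.0, 0.2], [0.0, 1.9, 0.1, 0.4],
    [0.0, 0.2, 2.0, 0.1], [0.2, 0.0, 1.7, 0.3],
]
labels = [0, 0, 1, 1, 2, 2]

features = [frozen_encode(x) for x in images]  # encoder runs once, unchanged
W, b = train_head(features, labels, n_classes=3)
predictions = [predict(W, b, f) for f in features]
```

Because the encoder is never retrained, its features can be computed once and reused. That keeps the comparison between encoders cheap and puts every model on the same footing.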

The team tested four AIs on a dataset called ThyroidEffi 1.0, with 1,804 thyroid biopsy images. The images covered benign, suspicious, and malignant cases.

They compared three ImageNet-trained models (ResNet50, EfficientNet-B0, and ViT-Base) against MedSigLIP, the medical specialist. They measured accuracy, reliability, and how well each model handled the gray-zone Bethesda V slides.

On the overall score (macro-F1, an average of the per-class scores), EfficientNet-B0, a general-purpose AI, came out on top at 0.845. MedSigLIP came in a close second at 0.836. The gap was not statistically significant after correction for multiple comparisons, so in practice it is a tie.

So the medical AI did not win the overall accuracy race.

But this is where the story gets more interesting.

MedSigLIP did better on two things that matter deeply for patients. It was more reliable: when it said "70% sure this is cancer," it was right closer to 70% of the time. The general models' stated confidence was further from reality.

And it caught more of the hard Bethesda V cases. Its recall on suspicious slides was 0.808, the highest of the group.
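The scores quoted in this section (macro-F1 and recall on the Suspicious class) can be computed from a list of true and predicted labels. The sketch below is a minimal illustration with made-up labels, not the study's data or code. It also shows why a single accuracy number can hide a weak class: when one class is rare, missing half of it barely moves plain accuracy.

```python
def counts(y_true, y_pred, label):
    """True positives, false positives, false negatives for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def recall(y_true, y_pred, label):
    tp, _, fn = counts(y_true, y_pred, label)
    return tp / (tp + fn) if tp + fn else 0.0

def f1(y_true, y_pred, label):
    tp, fp, fn = counts(y_true, y_pred, label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    return sum(f1(y_true, y_pred, l) for l in labels) / len(labels)

def balanced_accuracy(y_true, y_pred, labels):
    """Unweighted mean of per-class recall."""
    return sum(recall(y_true, y_pred, l) for l in labels) / len(labels)

# Hypothetical labels (NOT from the study): one of two Bethesda V cases is missed.
classes = ["II", "V", "VI"]
y_true = ["II"] * 6 + ["V"] * 2 + ["VI"] * 2
y_pred = ["II"] * 6 + ["V", "II"] + ["VI"] * 2

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.9
suspicious_recall = recall(y_true, y_pred, "V")                        # 0.5
overall_macro_f1 = macro_f1(y_true, y_pred, classes)                   # about 0.86
overall_balanced = balanced_accuracy(y_true, y_pred, classes)          # about 0.83
```

In this toy case, plain accuracy is 0.9 even though half of the Bethesda V cases were missed. Macro-F1 and balanced accuracy both drop further, which is why per-class measures are reported alongside the headline score.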

The surprising shift

In AI, people love to compare accuracy scores. Higher is better, right?

Not always. A doctor would rather have an AI that says "I'm unsure, please check" than one that confidently gives a wrong answer.

This is called calibration (how well a model's confidence matches reality). MedSigLIP's calibration error was 0.025. The general models ranged from 0.044 to 0.082. Lower is better, and MedSigLIP's error was less than a third of the weakest competitor's.
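A common way to compute Expected Calibration Error is to group predictions into confidence bins and average the gap between stated confidence and actual accuracy, weighted by how many predictions fall in each bin. The sketch below assumes ten equal-width bins, which is a frequent default. The source does not say how many bins the study used, and the example predictions are made up.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy across bins.

    confidences: the model's top-class probability for each prediction (0 to 1)
    correct: 1 if that prediction was right, else 0
    n_bins: number of equal-width bins (assumed value, not from the study)
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        ece += len(items) / n * abs(avg_conf - accuracy)
    return ece

# Hypothetical predictions: both models get 6 of 8 right (75% accuracy).
outcomes = [1, 1, 1, 1, 1, 1, 0, 0]
well_calibrated = expected_calibration_error([0.75] * 8, outcomes)  # close to 0
overconfident = expected_calibration_error([0.95] * 8, outcomes)    # close to 0.2

# Ratios of the reported ECE values (from the article's figures).
ratio_vs_weakest = 0.082 / 0.025    # about 3.3
ratio_vs_strongest = 0.044 / 0.025  # about 1.8
```

Both toy models have the same accuracy, yet the one that claims 95% confidence has a much larger calibration error. That is the sense in which two models with similar scores can differ in how far their confidence can be trusted.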

For patients in the Bethesda V gray zone, that extra humility could mean fewer missed cancers and fewer unnecessary surgeries.

The researchers say these findings suggest a different way to judge medical AI. Accuracy alone is not enough.

In real clinics, AI will not replace pathologists. It will support them by flagging cases that need a second look. For that role, knowing when to raise a hand is more valuable than hitting a slightly higher accuracy score on the easy cases.
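The original abstract describes this supporting role as threshold-based triage with selective expert review. A minimal sketch of that idea follows. The 0.90 cut-off, the route names, and the function name are hypothetical; a real threshold would have to be chosen and validated clinically.

```python
def triage(probs, threshold=0.90):
    """Route one prediction based on the model's top confidence.

    probs: mapping from class label to predicted probability
    threshold: hypothetical cut-off (assumption, not from the study)
    Returns (route, label), where route is 'routine read' or 'second look'.
    """
    label = max(probs, key=probs.get)
    if probs[label] >= threshold:
        return "routine read", label
    # Low confidence: flag the case for closer expert review.
    return "second look", label

# Hypothetical model outputs for two slides.
clear_case = triage({"benign": 0.97, "suspicious": 0.02, "malignant": 0.01})
unsure_case = triage({"benign": 0.55, "suspicious": 0.40, "malignant": 0.05})
```

A rule like this only works if the confidence numbers mean what they say. If a model reports 0.97 on cases it gets wrong, hard cases slip past the threshold, which is why calibration matters for this use.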

They call this a "safety-first" view of AI. It matches how careful clinicians already work.

If you are facing a thyroid biopsy today, this research does not change what happens in the lab. MedSigLIP is not yet running in your hospital's pathology system.

But the direction is clear. Within a few years, AI tools like this may help pathologists triage hard cases. That could mean faster answers on easy biopsies and more careful expert review on the tricky ones.

For now, trust your care team. If your result is indeterminate, ask about a second opinion or a molecular test. Those tools are available now and already improve clarity.

Limitations to keep in mind

This was a benchmark study, not a clinical trial. The AI read stored images, not fresh patient samples.

The study used 1,804 images from a single dataset. Real-world images vary in quality, staining, and equipment.

And the AI was frozen — not fine-tuned for thyroid tissue. With training, general models might close the calibration gap, or medical models might pull further ahead.

The next step is testing in real pathology labs, with real slides, alongside real pathologists. Researchers also want to see how calibration holds up across different hospitals, scanners, and patient groups.

If MedSigLIP and similar medical-trained models keep their reliability edge, they may earn a spot as "safety checkers" in the workflow. A pathologist reads the slide. The AI quietly agrees — or politely raises a flag.

That is a small role. But for a patient in the Bethesda V waiting room, small roles change big outcomes.

Study Details

Study type: Cohort

Evidence: Level 3

Published: Apr 2026

Original Abstract

Background: Fine-needle aspiration biopsy (FNAB) cytology is central to thyroid nodule evaluation, yet reliable differentiation across Bethesda categories remains challenging, particularly for the indeterminate Bethesda V (Suspicious for Malignancy) class. While transfer learning with ImageNet-pretrained models is a standard approach, it remains unclear whether emerging domain-specific medical foundation models offer superior performance compared to general purpose baselines in this specialized domain.

Methods: We benchmarked four frozen visual encoders, namely ResNet50, EfficientNet-B0, ViT-Base (ImageNet pretrained), and MedSigLIP (medical image–text pretrained), on the ThyroidEffi 1.0 dataset (N = 1,804), comprising Bethesda II (Benign), Bethesda V (Suspicious), and Bethesda VI (Malignant) cases. A unified evaluation protocol was employed using five-fold stratified cross-validation with a lightweight multilayer perceptron head. Performance was assessed using macro-F1, balanced accuracy, Expected Calibration Error (ECE), and McNemar’s test for statistical significance.

Results: EfficientNet achieved the highest macro-F1 (0.845 ± 0.021), followed closely by MedSigLIP (0.836 ± 0.019), ResNet50 (0.829 ± 0.015), and ViT (0.817 ± 0.020). Pairwise statistical testing revealed that while EfficientNet significantly outperformed ViT (p < 0.05), the difference between EfficientNet and MedSigLIP was not statistically significant after multiple comparison correction. Notably, MedSigLIP demonstrated superior reliability attributes, achieving the highest recall for the challenging Suspicious class (0.808) and the best model calibration score (ECE = 0.025) compared to the general-purpose encoders (ECE: 0.044–0.082).

Conclusions: While domain-specific medical pretraining (MedSigLIP) did not yield a statistically significant advantage in aggregate classification accuracy compared to the best ImageNet-based model (EfficientNet), it provided superior calibration and sensitivity for borderline cases. These findings suggest that in thyroid cytology clinical workflow support, encoder selection should be guided by a joint view of discrimination and safety, particularly calibration and Bethesda V sensitivity, rather than aggregate accuracy alone, enabling threshold-based triage and selective expert review. In particular, well-calibrated models such as MedSigLIP suggest a potential benefit in reducing overconfident misclassification in borderline Bethesda V cases, pending prospective validation in real-world triage workflows.
