
Review of multimodal LLMs for ABR threshold estimation shows high correlation but limited clinical reliability

Key Takeaway
Multimodal LLMs show high correlation with expert ABR threshold estimates but limited specificity, so their use requires expert oversight.

This review summarizes an observational study evaluating two general-purpose multimodal large language model (LLM) chatbots—ChatGPT (version 5.1) and Qwen (version 3Max)—for estimating auditory brainstem response (ABR) hearing thresholds. The analysis used 500 images, each containing several ABR waveforms, with judgments from 3 expert audiologists as the comparator. The primary outcome was accuracy of ABR hearing threshold estimation, with secondary outcomes including mean errors, correlations, and sensitivity/specificity for detecting hearing loss (>20 dB nHL).

The authors synthesized key findings: both models showed high Pearson correlations with expert thresholds (ChatGPT: r = 0.954; Qwen: r = 0.958). Mean error in threshold estimation was +5.5 dB for ChatGPT and -2.7 dB for Qwen. Exact nominal agreement with expert values was low (ChatGPT: 34.6%; Qwen: 35.6%), but agreement within ±10 dB was higher (ChatGPT: 75.6%; Qwen: 80.0%). For hearing loss classification, sensitivity was high (ChatGPT: 100%; Qwen: 91.6%), but specificity was limited (ChatGPT: 20.4%; Qwen: 67.5%).

The authors note several limitations: performance insufficient for independent clinical use, systematic bias in threshold estimates, limited specificity, an imbalanced sensitivity/specificity profile, and poor latency estimation. No safety data were reported. For practice, the findings suggest that while general-purpose LLMs may have potential as assistive or preliminary tools, clinically reliable ABR interpretation will likely require specialized, domain-trained AI systems with expert oversight.

Study Details

Evidence Level: 5
Published: Apr 2026
Introduction: Auditory brainstem response (ABR) is a standard objective method for estimating hearing thresholds, especially in patients who cannot reliably participate in behavioral audiometry. However, ABR interpretation is usually performed by an expert. This study evaluated whether two general-purpose artificial intelligence (AI) multimodal large language model (LLM) chatbots, ChatGPT and Qwen, can accurately estimate ABR hearing thresholds from ABR waveform images. Accuracy was measured by comparison with the judgments of three expert audiologists.

Methods: A total of 500 images, each containing several ABR waveforms recorded at different stimulus intensities, were analyzed. Three expert audiologists established the reference auditory thresholds based on visual identification of wave V at the lowest stimulus intensity, with the most frequent judgment among the three used as the reference. Each waveform image was independently submitted to ChatGPT (version 5.1) and Qwen (version 3Max) using the same standardized prompt and without additional clinical context. Agreement with the expert thresholds was assessed as mean errors and correlations. Sensitivity and specificity for detecting hearing loss (>20 dB nHL) were also calculated. In cases where the AI and expert thresholds nominally matched, the corresponding latency measures were also compared.

Results: Auditory thresholds derived from both LLMs correlated strongly with expert opinion, with Pearson r = 0.954 for ChatGPT and r = 0.958 for Qwen. ChatGPT showed a mean error of +5.5 dB and Qwen a mean error of -2.7 dB. Exact nominal agreement with expert values was achieved in 34.6% of ChatGPT estimates and 35.6% of Qwen estimates; agreement within ±10 dB was observed in 75.6% and 80.0% of cases, respectively. For hearing-loss classification, ChatGPT achieved 100% sensitivity but low specificity (20.4%), whereas Qwen showed a more balanced profile with 91.6% sensitivity and 67.5% specificity. Notably, estimates of wave V latency were markedly poor for both LLMs, with systematic underestimation and weak correlations with the expert judgments.

Conclusion: ChatGPT and Qwen demonstrated a moderate ability to estimate ABR thresholds from waveform images, although their performance was not good enough for independent clinical use. Both models captured general patterns of hearing loss severity, but they showed systematic bias, an imbalance between sensitivity and specificity, and poor latency estimation. General-purpose multimodal LLMs may have potential as assistive or preliminary tools, but clinically reliable ABR interpretation will likely require specialized, domain-trained AI systems with expert oversight.