
Review of multimodal LLMs for ABR threshold estimation shows high correlation but limited clinical reliability

Key Takeaway
Multimodal LLMs show high correlation with expert ABR threshold estimates but limited specificity, so their use requires expert oversight.

This review summarizes an observational study evaluating two general-purpose multimodal large language model (LLM) chatbots—ChatGPT (version 5.1) and Qwen (version 3Max)—for estimating auditory brainstem response (ABR) hearing thresholds. The analysis used 500 images, each containing several ABR waveforms, with judgments from 3 expert audiologists as the comparator. The primary outcome was accuracy of ABR hearing threshold estimation, with secondary outcomes including mean errors, correlations, and sensitivity/specificity for detecting hearing loss (>20 dB nHL).

The authors synthesized key findings: both models showed high Pearson correlations with expert thresholds (ChatGPT: r = 0.954; Qwen: r = 0.958). Mean error in threshold estimation was +5.5 dB for ChatGPT and -2.7 dB for Qwen. Exact nominal agreement with expert values was low (ChatGPT: 34.6%; Qwen: 35.6%), but agreement within ±10 dB was higher (ChatGPT: 75.6%; Qwen: 80.0%). For hearing loss classification, sensitivity was high (ChatGPT: 100%; Qwen: 91.6%), but specificity was limited (ChatGPT: 20.4%; Qwen: 67.5%).

The authors note several limitations: performance insufficient for independent clinical use, systematic bias in threshold estimates, limited specificity, an imbalanced sensitivity/specificity profile, and poor latency estimation. No safety data were reported. For practice, the findings suggest that while general-purpose LLMs may have potential as assistive or preliminary tools, clinically reliable ABR interpretation will likely require specialized, domain-trained AI systems with expert oversight.

Study Details

Evidence Level: 5
Published: Apr 2026
Introduction: Auditory brainstem response (ABR) is a standard objective method for estimating hearing thresholds, especially in patients who cannot reliably participate in behavioral audiometry. However, ABR interpretation is usually performed by an expert. This study evaluated whether two general-purpose artificial intelligence (AI) multimodal large language model (LLM) chatbots, ChatGPT and Qwen, can accurately estimate ABR hearing thresholds from ABR waveform images. Accuracy was measured by comparison with the judgments of three expert audiologists.

Methods: A total of 500 images, each containing several ABR waveforms recorded at different stimulus intensities, were analyzed. Three expert audiologists established the reference auditory thresholds based on visual identification of wave V at the lowest stimulus intensity, with the most frequent judgment among the three used as the reference. Each waveform image was independently submitted to ChatGPT (version 5.1) and Qwen (version 3Max) using the same standardized prompt and without additional clinical context. Agreement with the expert thresholds was assessed as mean errors and correlations. Sensitivity and specificity for detecting hearing loss (>20 dB nHL) were also calculated. In cases where the AI and expert thresholds nominally matched, the corresponding latency measures were also compared.

Results: Auditory thresholds derived from both LLMs correlated strongly with expert opinion, with Pearson r = 0.954 for ChatGPT and r = 0.958 for Qwen. ChatGPT showed a mean error of +5.5 dB and Qwen a mean error of -2.7 dB. Exact nominal agreement with expert values was achieved in 34.6% of ChatGPT estimates and 35.6% of Qwen estimates; agreement within ±10 dB was observed in 75.6% and 80.0% of cases, respectively. For hearing-loss classification, ChatGPT achieved 100% sensitivity but low specificity (20.4%), whereas Qwen showed a more balanced profile with 91.6% sensitivity and 67.5% specificity. Notably, estimates of wave V latency were markedly poor for both LLMs, with systematic underestimation and weak correlations with the expert judgments.

Conclusion: ChatGPT and Qwen demonstrated a moderate ability to estimate ABR thresholds from waveform images, although their performance was not good enough for independent clinical use. Both models captured general patterns of hearing loss severity, but they showed systematic bias, an imbalance between sensitivity and specificity, and poor latency estimation. General-purpose multimodal LLMs may have potential as assistive or preliminary tools, but clinically reliable ABR interpretation will likely require specialized, domain-trained AI systems with expert oversight.