Methodological review assesses phenotypes for classifying respiratory viruses in English health data

How Doctors Track RSV, Flu, and COVID When Tests Aren't Always Done

AI-generated summary of the cited source, checked by automated accuracy review.

Key Takeaway
Consider phenotypes, with caution, as a tool for classifying respiratory viruses in coded data when testing information is unavailable.

This methodological review examines specific and sensitive phenotypes for classifying three respiratory viruses (respiratory syncytial virus, influenza, and COVID-19) in English coded health data, used with the approval of NHS England and covering September 2016 to August 2024. It compares the phenotypes with one another and with publicly available surveillance data, focusing on classification accuracy and seasonal patterns as primary outcomes, along with risks of misclassification for mild and severe cases as secondary outcomes.

Key findings synthesized by the authors indicate that seasonal patterns derived from both phenotypes are similar to those in surveillance data, though no effect sizes or statistical measures are reported. For mild cases, sensitive phenotypes are associated with a higher risk of misclassification than specific phenotypes. For severe cases, the risk of misclassification is higher in infants than in older adults, irrespective of the phenotype used. The review does not provide numeric data on accuracy rates, sample sizes, or confidence intervals, which limits quantitative interpretation.

The authors present the phenotypes as a solution for use in the absence of testing information, which suggests that reliance on coded data alone may have inherent gaps. They do not report funding or conflicts, and safety aspects such as adverse events are not addressed. In terms of practice relevance, the review cautiously suggests that these phenotypes could be used to classify respiratory viruses from coded health records when testing information is unavailable, but clinicians should interpret results with restraint because the work is methodological and lacks detailed validation metrics.

The Invisible Data Behind Your Doctor Visit

Every time you see a doctor, the visit gets turned into a series of codes — short labels that describe your symptoms, diagnosis, and treatments. These codes are stored in your electronic health record and are used for everything from billing to research.

What most people don't realize is that these coded records are also one of the most powerful tools public health agencies use to track disease outbreaks in real time — without waiting for lab results.

Why Lab Tests Don't Tell the Whole Story

When a respiratory illness spreads through a community, only a fraction of people ever get a lab test. Many recover at home. Others see a doctor, get treated for "respiratory infection," and never have a swab sent off for analysis.

That means official case counts for diseases like RSV (respiratory syncytial virus, a major cause of serious illness in infants and older adults), influenza, and COVID-19 can miss large portions of real-world illness. Public health responses built on incomplete data can arrive too late.

A Smarter Way to Count Cases

Researchers in England used a secure health data platform called OpenSAFELY — which has access to the anonymized health records of tens of millions of NHS patients — to design new ways of identifying these three viruses from coded health data alone.

They tested two types of classification approaches. The first was a "sensitive" phenotype (a loose definition that catches more cases but may include some false positives). The second was a "specific" phenotype (a stricter definition that may miss some real cases but is more precise).

Think of it like a fishing net: a wide net catches more fish, but also more seaweed. A narrow net is cleaner, but you might miss some fish.
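The wide-net versus narrow-net idea can be sketched in a few lines of Python. This is only an illustration: the code labels and patient records below are hypothetical placeholders, not the study's actual code lists, and the assumption that the sensitive definition is simply a superset of the specific one is made here for simplicity.

```python
# Hypothetical code labels for illustration only; not the study's codelists.
SPECIFIC_RSV_CODES = {"rsv_confirmed", "rsv_bronchiolitis"}

# Assumption: the sensitive set adds broader, less virus-specific codes.
SENSITIVE_RSV_CODES = SPECIFIC_RSV_CODES | {
    "bronchiolitis_unspecified",
    "wheeze_in_infant",
}


def classify(record_codes, phenotype_codes):
    """Return True if any code on the record belongs to the phenotype's code set."""
    return not phenotype_codes.isdisjoint(record_codes)


# Toy records: each value is the set of codes entered at one visit.
records = {
    "patient_a": {"rsv_confirmed"},
    "patient_b": {"bronchiolitis_unspecified"},
    "patient_c": {"ankle_sprain"},
}

specific_cases = {
    pid for pid, codes in records.items() if classify(codes, SPECIFIC_RSV_CODES)
}
sensitive_cases = {
    pid for pid, codes in records.items() if classify(codes, SENSITIVE_RSV_CODES)
}

print(sorted(specific_cases))   # ['patient_a']
print(sorted(sensitive_cases))  # ['patient_a', 'patient_b']
```

In this toy example, patient_b is counted only by the sensitive definition. Whether that is a true RSV case or "seaweed" cannot be known from the code alone, which is the trade-off the two definitions represent.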

Both approaches tracked the same seasonal patterns seen in official surveillance data — peaking in winter for flu and RSV, with COVID-19 surges tied to known variant waves. That agreement gives researchers confidence the methods are working.

What the Study Covered

The analysis covered coded health records in England from September 2016 through August 2024, eight years of data spanning the pre-COVID era, the pandemic itself, and the post-pandemic period. This allowed researchers to track all three viruses across a wide range of conditions.

What They Found — and Where It Gets Tricky

The good news: both classification approaches matched well-known seasonal trends for all three viruses. The patterns in the data lined up closely with what official government surveillance programs independently reported.
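The source does not say how similarity to surveillance data was measured, and it reports no statistics. As a rough sketch of one way such a comparison could be made, the Python below checks whether two monthly series peak in the same month and computes their Pearson correlation. The counts are invented for illustration and are not study data; the function names are hypothetical.

```python
from math import sqrt

# Season runs September to August, matching the study window's start and end months.
MONTHS = ["Sep", "Oct", "Nov", "Dec", "Jan", "Feb",
          "Mar", "Apr", "May", "Jun", "Jul", "Aug"]

# Invented monthly case counts for one season; NOT data from the study.
phenotype_counts = [20, 45, 110, 180, 150, 90, 50, 30, 20, 15, 10, 12]
surveillance_counts = [8, 20, 52, 95, 70, 40, 22, 12, 9, 6, 5, 5]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def peak_month(counts):
    """Name of the month with the highest count."""
    return MONTHS[counts.index(max(counts))]


r = pearson(phenotype_counts, surveillance_counts)
print(peak_month(phenotype_counts), peak_month(surveillance_counts))  # Dec Dec
print(round(r, 2))
```

A shared peak month and a high correlation would indicate that the two series rise and fall together. They would not show that individual patients were classified correctly, which is why seasonal agreement and misclassification risk are treated as separate questions.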

The more cautious finding: the risk of misclassification was not evenly spread. For severe cases, it was higher in infants than in older adults, whichever definition was used. One possible contributor is inconsistent coding: the less detailed and consistent the codes attached to an illness, the harder it is for an algorithm to be sure which virus was involved.

For mild cases, the choice of definition mattered: the looser "sensitive" definition was more prone to picking up the wrong illness than the stricter "specific" one.

This research doesn't directly change what happens at your next doctor's appointment. But it matters for how quickly and accurately health authorities can detect a surge — and respond with the right public health measures, vaccine campaigns, or hospital capacity alerts.

Better tools for identifying outbreaks from routine health records mean fewer surprises and faster responses when a new respiratory virus season ramps up.

The Limits of This Approach

The study relies entirely on how well doctors code their diagnoses — and coding practices vary between clinics, regions, and even individual clinicians. If a condition is coded vaguely, no algorithm can reliably identify it. The study also focused on England's NHS records, and the methods may need adjustment before being applied in countries with different healthcare systems.

What's Next

The research team has made their phenotype definitions publicly available for other scientists to use and improve. Future work will focus on refining accuracy for harder-to-classify groups — especially infants — and extending these methods to cover additional respiratory pathogens as they emerge.

Ultimately, the goal is to make disease surveillance faster, cheaper, and more accurate — so that the next respiratory virus season doesn't catch anyone off guard.

Study Details

Evidence: Level 5
Published: Apr 2026
Original Abstract
Electronic health records (EHRs) are a rich source of data which can be used to analyse health outcomes using computable phenotypes. With the approval of NHS England we used the OpenSAFELY secure analytics platform to design and assess phenotypes to classify three key respiratory viruses - respiratory syncytial virus (RSV), influenza, and COVID-19 - in English coded health data between September 2016 and August 2024. We compared specific and sensitive phenotypes to one another and to publicly available surveillance data. Cases from both phenotypes showed similar seasonal patterns to surveillance data. Sensitive phenotypes led to increased risk of misclassification than specific phenotypes for mild cases. For severe cases the risk of misclassification was higher in infants than for older adults, irrespective of the phenotype used. The phenotypes presented here offer a solution to classifying respiratory viruses from coded health records in the absence of testing information.