Using the Latest Advances in Data Science to Fight Infectious Diseases

Payam Etminani

The past few years have seen dramatic technology-enabled transformations in many areas of everyday life, including transportation (Uber), accommodation (Airbnb), shopping (Amazon), and communication (Facebook). Another area where dramatic advances have been taking place is the use of information technology to counter an enemy that has attacked human populations since long before recorded history: infectious diseases.

Every day, as we go about our lives, thousands of specially trained epidemiologists throughout the country at the federal, state, and local levels keep us safe by watching emerging infectious diseases and making sure that potential outbreaks are spotted and acted on as early as possible. A recent example is the effective public health response to the Ebola epidemic in West Africa, which saved countless lives and prevented the epidemic from spreading even further.

One of the most dramatic recent shifts empowering epidemiologists to be more effective at their jobs is the improvement in data technologies. In the past, the old “relational” data model dictated that data had to be highly structured and, as a result, treated in distinct silos. This made it difficult, if not impossible, to analyze data from multiple sources to find correlations. Epidemiologists would have to wait many minutes or even hours for each query to return results, which is unacceptable when you need to test dozens of hypotheses to understand and contain a fast-moving outbreak. (Imagine how you would feel if each of your Google searches took 45 minutes to return!) By contrast, using newer technologies, the same queries on the same hardware can run in seconds.

With the new paradigms, all types of data can be utilized, including structured, semi-structured, and unstructured. Systems can be easily scaled to handle any amount of data without bogging down. In addition, new data sources can be added in hours or days, instead of months as was previously the case. This enables epidemiologists to expand the scope of their investigations with new information as it becomes available.

Breaking free of silos

Working with epidemiologists, we have observed that important patterns or trends can be missed when data is looked at in silos, but when data from different sources is correlated in a unified view, critical insights occur and answers emerge.

Recently, we worked with epidemiologists at the Department of Veterans Affairs (VA) to improve the process of identifying suspected and confirmed cases of Zika virus infection. These infections are difficult to confirm retrospectively. Diagnosis is largely based on symptoms and the person’s recent history (e.g., fever and rash with a history of travel to an area where Zika virus occurs). Laboratory tests may be difficult to interpret, since there may be cross-reactivity with other circulating flaviviruses such as dengue, West Nile, and yellow fever. The timing of testing is also important, because the preferred test for Zika (PCR assay) only identifies the virus during the first 5-7 days of illness. Because Zika is so new, there are no specific ICD-10 diagnosis codes for providers to use when coding encounters where Zika virus infection was suspected or confirmed. Furthermore, the virus can cause non-specific symptoms or, frequently, no symptoms at all.

The VA's approach was to combine data from multiple sources to create a single unified view of the burden of Zika in the VA patient population. By looking at a combination of related ICD-10 diagnosis codes, clinical lab reports, hospitalization records, travel advisories, and social media feeds, epidemiologists were able to run a number of complex search queries that cross multiple data sets. This enabled them to cast a wider net to rapidly identify people who were at risk of Zika infection and to gain a better sense of the spectrum of the disease. For example, our tool can be used to help identify pregnant Veterans who may require additional testing or follow-up in areas with active Zika virus transmission (such as Puerto Rico, the US Virgin Islands, and American Samoa).
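To make the idea of a unified view concrete, here is a minimal Python sketch of a query that crosses several data sets. It is an illustration only, not the VA's or Bitscopic's actual system: the record layouts, the function names (unify, possible_cases), the test name "zika_pcr", the choice of related diagnosis codes, and the flagging rule are all assumptions made for this example.

```python
from collections import defaultdict

# Illustrative assumptions, not the VA's actual criteria:
# a few non-specific ICD-10 codes that might accompany a suspected case.
RELATED_DX_CODES = {"R50.9", "R21", "A92.8"}
# Areas the article names as having active Zika virus transmission.
TRANSMISSION_AREAS = {"Puerto Rico", "US Virgin Islands", "American Samoa"}


def unify(diagnoses, labs, travel):
    """Merge per-source records into one view keyed by patient id."""
    view = defaultdict(lambda: {"dx": set(), "labs": [], "areas": set()})
    for pid, code in diagnoses:
        view[pid]["dx"].add(code)
    for pid, test, result in labs:
        view[pid]["labs"].append((test, result))
    for pid, area in travel:
        view[pid]["areas"].add(area)
    return view


def possible_cases(view):
    """Flag patients with a positive Zika PCR, or a related
    diagnosis code combined with travel to a transmission area."""
    flagged = []
    for pid, rec in view.items():
        has_related_dx = bool(rec["dx"] & RELATED_DX_CODES)
        was_exposed = bool(rec["areas"] & TRANSMISSION_AREAS)
        positive_pcr = any(
            test == "zika_pcr" and result == "positive"
            for test, result in rec["labs"]
        )
        if positive_pcr or (has_related_dx and was_exposed):
            flagged.append(pid)
    return sorted(flagged)


# Hypothetical records from three separate "silos".
diagnoses = [("p1", "R21"), ("p2", "R50.9"), ("p3", "J45.909")]
labs = [("p3", "zika_pcr", "positive"), ("p1", "dengue_igm", "negative")]
travel = [("p1", "Puerto Rico"), ("p2", "Canada")]

flagged = possible_cases(unify(diagnoses, labs, travel))
print(flagged)
```

In this sketch no single source is enough on its own: patient p1 is flagged only because a rash code and a travel record line up, while p3 is flagged from the lab report alone.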

Another area where we have seen data dramatically assist epidemiologists is in detecting patient infections caused by contaminated medical devices, such as an improperly cleaned endoscope. This requires a look-back investigation, in which the epidemiologist looks at events from an earlier time period and analyzes multiple data sets, including patient medical records, clinical lab reports, databases of the identification numbers barcoded on each device, and surgical and CPT procedure codes. By combining all these data sources, the investigator can identify the affected devices, determine which patients were exposed and potentially infected, and take corrective action.
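A look-back investigation of this kind can be sketched in a few lines of Python. The sketch below is a simplified illustration under stated assumptions, not a description of any real investigation: the device id "SCOPE-17", the record fields, the 30-day follow-up window, and the helper names (exposed_procedures, possibly_infected) are all hypothetical.

```python
from datetime import date

# Hypothetical procedure log linking patients to barcoded device ids.
procedures = [
    {"patient": "p1", "device": "SCOPE-17", "date": date(2016, 3, 2)},
    {"patient": "p2", "device": "SCOPE-17", "date": date(2016, 3, 10)},
    {"patient": "p3", "device": "SCOPE-22", "date": date(2016, 3, 5)},
    {"patient": "p4", "device": "SCOPE-17", "date": date(2016, 2, 1)},
]

# Hypothetical clinical lab results (for example, culture reports).
labs = [
    {"patient": "p1", "result": "positive", "date": date(2016, 3, 9)},
    {"patient": "p2", "result": "negative", "date": date(2016, 3, 12)},
    {"patient": "p3", "result": "positive", "date": date(2016, 3, 8)},
    {"patient": "p4", "result": "positive", "date": date(2016, 2, 10)},
]


def exposed_procedures(procedures, device, start, end):
    """Procedures that used the suspect device during the look-back period."""
    return [
        p for p in procedures
        if p["device"] == device and start <= p["date"] <= end
    ]


def possibly_infected(exposures, labs, window_days=30):
    """Exposed patients with a positive lab within window_days after exposure."""
    hits = set()
    for proc in exposures:
        for lab in labs:
            days_after = (lab["date"] - proc["date"]).days
            if (
                lab["patient"] == proc["patient"]
                and lab["result"] == "positive"
                and 0 <= days_after <= window_days
            ):
                hits.add(proc["patient"])
    return sorted(hits)


# Assume the device was improperly cleaned between March 1 and March 15.
exposures = exposed_procedures(
    procedures, "SCOPE-17", date(2016, 3, 1), date(2016, 3, 15)
)
exposed_ids = sorted({p["patient"] for p in exposures})
infected_ids = possibly_infected(exposures, labs)
print(exposed_ids, infected_ids)
```

The join on device id and date narrows the population first, so the lab reports only need to be checked for patients who were actually exposed.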

Possibilities for the near future

As exciting as the progress described above is, we are only in the beginning stages of the public health data revolution. The first phase has been mastering the art of bringing data together and integrating it meaningfully, with the guidance of subject matter experts to make sure the context is correct. The next step is to search for the “known unknowns,” configuring the system's algorithms to watch automatically for specific trends or parameters that are indicative of a particular outbreak or other anticipated occurrence. Ultimately, as the machine learning evolves, we hope it will become good at prediction. For example, the system might detect a norovirus outbreak in a specific area. Potentially, it could alert hospitals or healthcare providers in the affected localities and advise them to order norovirus-specific testing for any patients they see with diarrhea or related symptoms. Without this type of alerting, providers may not perform this specific test. With early alerting of outbreaks, patients can potentially be treated earlier and more effectively, and the spread of the disease can be contained so that fewer people are infected.
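One simple way to watch for a "known unknown" is a threshold rule that compares the latest count in each area against that area's own recent history. The Python sketch below shows that idea only. The article does not say which algorithm is used, so the rule (baseline mean plus a multiple of the standard deviation), the parameter values, the function name areas_to_alert, and the weekly counts are all assumptions for illustration.

```python
from statistics import mean, stdev


def areas_to_alert(weekly_counts, z_threshold=3.0, min_history=4):
    """Return areas whose latest weekly count is far above their baseline.

    weekly_counts maps an area name to a list of weekly case counts,
    oldest first; the last entry is the current week.
    """
    alerts = []
    for area, counts in weekly_counts.items():
        *history, current = counts
        if len(history) < min_history:
            continue  # too little history to estimate a baseline
        baseline = mean(history)
        # Floor the spread at 1.0 so very flat histories do not
        # trigger alerts on tiny fluctuations.
        spread = max(stdev(history), 1.0)
        if current > baseline + z_threshold * spread:
            alerts.append(area)
    return sorted(alerts)


# Hypothetical weekly counts of patients presenting with
# diarrhea or related symptoms.
weekly_counts = {
    "County A": [4, 5, 3, 6, 4, 19],
    "County B": [7, 8, 6, 7, 9, 8],
    "County C": [2, 1],
}

alerts = areas_to_alert(weekly_counts)
print(alerts)
```

An alert from a rule like this could then be used to advise providers in the affected area to order norovirus-specific testing. A predictive model would replace the fixed threshold, but the surrounding flow would stay much the same.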

Our company, Bitscopic, is excited to be part of the 2016 Council of State and Territorial Epidemiologists (CSTE) Annual Conference in Anchorage, Alaska. We will be exploring these issues further in a panel discussion entitled "Mastering the Data Haystack: Unifying Data Across Silos for New Insights" on Tuesday, June 21st at 7:30 AM.  If you will be at the conference, we hope to meet you there. For those who can’t be there but are interested in exploring these issues further, stay tuned as we will have material from the panel posted later.

Attribution: Using the Latest Advances in Data Science to Fight Infectious Diseases was authored by Payam Etminani and published in the Bitscopic Blog. It is republished by Open Health News with permission. The original copy of the article can be found here.