Accelerating Identification and Tracking of Pandemic Disease Outbreaks

Payam Etminani

A national biosurveillance program requires the collaboration of multiple federal, state and local agencies to provide a comprehensive view of a health-related event. Bitscopic's Praedico™ biosurveillance platform breaks down the data barriers among organizations with an extensible architecture that can incorporate any kind of data. The platform also delivers high performance by incorporating the latest technologies such as big data, NoSQL databases, and machine learning.

Overview

Biosurveillance is a matter of national security. A recent story in USA Today described the accidental release of Burkholderia pseudomallei,¹ a bacterium that can potentially be used in bioterrorism-related attacks. Federal and state officials have not been able to identify how the bacteria were released from the lab. One of the worst-case scenarios would be that the bacteria were intentionally released from the lab to contaminate the environment outside of the facility and lead to an outbreak of a deadly disease. The outbreak would eventually be identified, but the speed of data collection, analysis and alerts would be the keys to containing such an outbreak.

Praedico™ Biosurveillance Delivers Breakthrough Capabilities

The current national biosurveillance system involves the collaboration of approximately 30 federal agencies. The process involves the gathering and sharing of data by data analysts. The work depends heavily on the experience of the analysts and subject matter experts, who must evaluate whether the data is of good quality and actionable in real time. Though effective, the process has many manual steps, and there are still limitations in combining data from different sources to provide a more holistic picture of a scenario. Furthermore, because gathering and sharing the data takes 24 hours, a fast-spreading disease may cause significant casualties before it is recognized.

Joel Mewton

Bitscopic has developed Praedico™ Biosurveillance to leverage the leading technology advancements in software architecture, Big Data, and machine learning (ML) to create a system designed to gather and process huge amounts of data. The same technology advancements are widely used in the private sector by companies like Google and Facebook whose whole business is based on gathering and analyzing hundreds of terabytes of data every day. With sophisticated machine learning algorithms, they are able to better predict the needs of a user and present relevant ads. Praedico™ Biosurveillance includes the best parts of these new technologies and applies them to the biosurveillance use case to create an application that is easy to use, performs most queries in seconds and enables analysis across single or multiple sets of data from any source.

Praedico™ Biosurveillance was designed to accelerate the speed at which data can be analyzed and used for decision-making. Its ability to perform analysis on various sources of data creates a holistic view of a health-related event or problem. With sophisticated mapping and graphing tools, the analysis can be changed on the fly or include other ad hoc data for further clarification or insights. With Praedico™ Biosurveillance, the monitoring of health-related events will be more real-time, more accurate and easier to share with key decision-makers.

Biosurveillance Background and Core Functions

With the rapid and constant movement of people, animals, plants, and food in today's world, the timely detection of biological events is an essential tool to confront catastrophic events. Biosurveillance is a concept that emerged to protect the nation from biological threats from infectious disease outbreaks and bioterrorism. The responsibility for the program is spread across an array of federal agencies, which are required to collaborate with state and local agencies to fulfill the objective.

In 2007, Congress established the National Biosurveillance Integration Center (NBIC) within the Department of Homeland Security (DHS). A few years later, in 2012, the White House issued a "National Strategy for Biosurveillance" that emphasizes coordination and collaboration across agencies. The strategy set out goals for a well-integrated national biosurveillance program that provides essential information to support decision making at all levels.

NBIC is a step towards providing a single point of collaboration versus having each agency integrate the numerous databases among numerous agencies. In 2003, the Centers for Disease Control and Prevention (CDC) identified 120 surveillance systems that lacked integration capabilities and began to create processes to gather and analyze the data to provide meaningful insights.

Improvements to inter-agency collaboration have been made with new systems such as the National Electronic Telecommunications System for Surveillance (NETSS). However, there are still significant challenges to integrating and leveraging data from other information systems that could drastically improve the CDC's data-gathering efforts. Some examples of information systems that are not incorporated into NETSS are listed in Table 1. This is far from an exhaustive list, and there are many additional systems used by various health departments, agencies, and health networks.

The sheer number of systems is a direct result of the decentralized nature of human health-related data. Each system is currently a silo, and data is not easily shared or analyzed together. Adding data related to agriculture, water quality, airborne pathogens or other relevant datasets becomes extremely cumbersome if not impossible to do in a timely manner. The main reason is that the systems involved were each created for a specific program. They were not designed to be a part of a holistic system that gathers together and leverages all the different data. The situation prohibits the implementation of a comprehensive biosurveillance strategy.

Coordination Among Agencies Presents Data Collection and Analysis Challenges

Many federal agencies have monitoring programs for their agency-specific missions. The programs were not designed to participate in a national biosurveillance program or a national goal for security encompassing pathogens causing sickness, agricultural events causing mass starvation or any other event that affects the spread of disease. Figure 1 visually shows the agencies and their scope of responsibilities. To get a true picture of the bio-health of the nation, data from each of the agencies would be required. Unfortunately, the ability to share data and information among the agencies is limited and cumbersome.

To overcome the lack of collaboration among agencies, DHS published a "National Biosurveillance Integration Center Strategic Plan." The plan laid out a set of goals that could apply to a successful biosurveillance integration center. Table 2 lists the goals to detect acute biological events that can occur across several domains: human, animal, plant, food and environmental health.

Transforming biosurveillance into a national capability will require the fulfillment of the goals outlined in Table 2. Early warning and shared situational awareness are critical to making well-informed risk management decisions. Today's systems are not compatible across agencies. A new approach is needed that provides the capabilities to integrate data across systems and enable true collaboration across agencies. The result will accelerate the identification, characterization, localization, and tracking of acute biological events.

The diverse nature of technologies used across the various agencies and organizations can often prohibit collaboration. Though standards may exist for electronic health records, many other data sources have none. An effective system will provide the ability to integrate any data.

Effective decision making requires solid data, and today's system has a significant delay from data collection to data analysis. The delay is primarily due to the conversion and normalization of the data so that they can be integrated. An effective system will provide real-time analysis with the ability to integrate ad hoc data sources.

The Veterans Health Administration Leverages Praedico™ for Biosurveillance

The Office of Public Health Surveillance Research (OPHSR) within the U.S. Department of Veterans Affairs (VA) started the Healthcare Associated Infection and Influenza Surveillance System (HAIISS) project in 2006. The project brings the Veterans Health Administration (VHA) into compliance with the White House directive issued in November 2005, "National Strategy for Pandemic Influenza." To expand its biosurveillance capabilities, the VHA selected Praedico™ Biosurveillance. HAIISS creates an infection identification, reporting, and alert system. It is essentially a targeted biosurveillance system for a specific set of diseases and outbreaks. Using Praedico™ Biosurveillance, the VHA will be at the forefront of biosurveillance capabilities and provide improved healthcare outcomes for veterans.

The VHA has already deployed the Praedico Data Platform (PDaP). PDaP extracts and imports data from VHA's advanced electronic health records (EHR) system and other sources and processes the data. Data processing includes cleansing, normalization, correlation, conflation and standardization. At this point the data can be published to applications and services within and outside the VHA. This established a comprehensive electronic surveillance data repository to monitor HAIISS events such as infectious disease outbreaks and drug resistance, healthcare associated infections, influenza, and other emerging infectious diseases or syndromes associated with natural and/or bioterrorist activity. Data is gathered and integrated multiple times throughout the day, resulting in near real-time analysis and reporting.
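
The processing stages named above (cleansing, normalization, conflation, standardization) can be illustrated with a minimal Python sketch. It is not PDaP's implementation: the record fields (`patient_id`, `visit_date`, `icd9`, `facility`), the sample data, and every function name are hypothetical, chosen only to show the shape of such a pipeline.

```python
from datetime import datetime

# Hypothetical raw records, as they might arrive from two different source systems.
RAW = [
    {"patient_id": " 001 ", "visit_date": "2009-10-03", "icd9": "487.1 ", "facility": "palo alto"},
    {"patient_id": "001", "visit_date": "10/03/2009", "icd9": "4871", "facility": "Palo Alto"},
    {"patient_id": "002", "visit_date": "2009-10-04", "icd9": "", "facility": "Reno"},
]

DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Standardize a date string to ISO format; return None if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_icd9(code):
    """Insert the decimal point after the third character when it is missing.

    Simplified: ignores E-codes, which place the decimal after four characters.
    """
    code = code.strip().upper()
    if code and "." not in code and len(code) > 3:
        code = code[:3] + "." + code[3:]
    return code


def clean(record):
    """Cleanse and standardize one record; return None if a required field is missing."""
    visit_date = parse_date(record["visit_date"])
    code = normalize_icd9(record["icd9"])
    if visit_date is None or not code:
        return None
    return {
        "patient_id": record["patient_id"].strip(),
        "visit_date": visit_date,
        "icd9": code,
        "facility": record["facility"].strip().title(),
    }


def process(records):
    """Clean every record, then conflate duplicates that describe the same visit."""
    seen = {}
    for record in records:
        cleaned = clean(record)
        if cleaned is None:
            continue
        key = (cleaned["patient_id"], cleaned["visit_date"], cleaned["icd9"])
        seen.setdefault(key, cleaned)
    return list(seen.values())


PROCESSED = process(RAW)
```

In this sketch the first two raw records collapse into one visit once their dates and codes are written the same way, and the third is dropped because it has no diagnosis code. A production pipeline would log rejected records rather than discard them silently.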

The VHA's evaluation of Praedico™ Biosurveillance compared its performance and capabilities to ESSENCE (a biosurveillance system developed in the early 2000s at Johns Hopkins University). The conclusion of the VHA evaluation was that Praedico™ Biosurveillance provides a comprehensive biosurveillance system with the same or better performance and capabilities than ESSENCE.

The comparison method first analyzed one million VHA patient records as a training set for validation of the Extract-Transform-Load (ETL) layer into a designated biomart within the VA Healthcare-Associated Infection and Influenza Surveillance System (HAIISS). Praedico™ was then assessed for utilization, processivity, searchability, display functions, timing, and accuracy.

In addition, a validation set of combined VA and Department of Defense (DoD) biosurveillance data, comprising 17 million DoD records and 25 million VHA records, was used to assess the performance of Praedico™ using ICD-9 encounter codes from known influenza-like illness (ILI) outpatient visits previously analyzed using ESSENCE alone.

Though not specifically evaluated, future uses of Praedico™ Biosurveillance could include the prediction, detection and monitoring of other bio-related health events such as:

  • Antibiotic resistance trends
  • Bioterrorist events
  • Intensive care unit devices
  • Influenza outbreaks
  • Surgical site infections
  • Infectious diseases of public health significance

Key Technology Advancements Benefiting Biosurveillance

In the past few years, significant advancements in computing, data analysis and machine learning have provided benefits for biosurveillance. These technology advancements are currently employed by leading companies such as Google, Facebook, Microsoft and Amazon.

Scale Out Computing

Leading web-based companies serve billions of requests every day. Performance demands for these companies have resulted in the scale out architecture, where application performance can be increased with the addition of computing power in additional servers. This is a departure from the traditional enterprise architecture that is designed using the scale up architecture (where more power is added to the existing servers - a method that quickly hits limits of efficiency).

With the proliferation of web applications, mobility and Internet connected devices, the velocity, variety and volume of data available for analysis have exploded. Biosurveillance can take advantage of this by incorporating unstructured data sources such as Internet web searches, retail purchase data or weather patterns.

Scale out computing is ideal for applications that scale by distributing their compute and storage needs across computers versus requiring a single, larger capacity machine. The architecture results in lower capital and operational expenditures since low cost off the shelf (OTS) servers are used versus high end customized configurations. Figure 2 displays the difference between scale out compared to scale up computing.
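
The idea of distributing work across machines rather than enlarging one machine can be sketched in a few lines of Python. This is an illustration only: worker threads stand in for separate servers, and the record layout and function names are assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_workers):
    """Split records into n_workers roughly equal shards (round-robin)."""
    return [records[i::n_workers] for i in range(n_workers)]


def count_by_code(shard):
    """Work done independently on one shard: count visits per diagnosis code."""
    return Counter(code for _, code in shard)


def scale_out_count(records, n_workers=4):
    """Fan the shards out to workers, then merge the partial counts."""
    shards = partition(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_by_code, shards))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical (patient_id, diagnosis_code) pairs.
VISITS = [(i, "487.1" if i % 5 == 0 else "460") for i in range(1000)]
TOTALS = scale_out_count(VISITS, n_workers=4)
```

Because each shard is counted independently and the partial results are simply added together, capacity grows by raising the number of workers. In a real cluster each shard would live on a different server.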

Big Data

As previously discussed, an effective biosurveillance program must integrate and analyze a combination of data from a variety of sources. Though traditional analysis relies on structured data, integrating unstructured data can improve the accuracy and reduce the time to respond to potential issues. Big data has emerged as the technology used by leading companies to analyze very large amounts of data, from multiple data sources with complex relationships.

As data sets and data sources increase, the amount of data grows exponentially. With traditional data analysis, the solution to having too much data is to take a statistically random sample and create calculations with confidence interval bands. Various statistical distributions must be compared by trial and error. Using Big Data techniques, however, allows for the analysis of every single piece of data. The final result is a complete understanding of the entire population in the data set.
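
The contrast between sampling and full-population analysis can be made concrete with a short Python sketch. The population below is synthetic, the function names are hypothetical, and the interval uses the common normal approximation for a proportion.

```python
import math
import random


def sample_estimate(population, sample_size, seed=0):
    """Estimate a proportion from a simple random sample, with a 95% interval."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    p_hat = sum(sample) / sample_size
    half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


def full_scan(population):
    """Compute the exact proportion by touching every record."""
    return sum(population) / len(population)


# Synthetic population: 1 marks a record of interest, 0 marks any other record.
POPULATION = [1] * 3_000 + [0] * 97_000
ESTIMATE, LOW, HIGH = sample_estimate(POPULATION, sample_size=1_000)
EXACT = full_scan(POPULATION)
```

The sampled result is a point estimate bracketed by an interval, while the full scan returns the exact value for the data set. The exact value still describes only the records that were collected.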

The benefits can be demonstrated with a simple example. The following example shows a fictitious scenario for demonstrative purposes only. The primary point is that using more data provides a clearer picture of the population being analyzed, resulting in a faster decision process with less noise.

Example

Base Scenario of Identifying Influenza Infections:

  • Database of 1,000,000 people who bought cold medicine
  • Historically three percent of the people who buy cold medicine have Influenza
  • Contacting a person takes one minute per person

Traditional Data Analysis

  • 1,000,000 phone calls
  • About 16,667 hours of calls, or roughly 417 weeks of work at 40 hours per week (a little over four weeks for a team of 100 people)

Big Data Analysis

  • Build a scoring profile using the complete set of 1,000,000 database records of cold medicine data
    • Include other data sets that can help identify Influenza cases
    • Other data could include doctor visits, age, geography, weather
  • Based on the scoring profile, 250,000 identified as high probability
  • Reach out to the 250,000 people and update model as new data becomes available
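
The workload in this fictitious scenario can be recomputed directly from the stated premise of one minute per call. The 40-hour working week in the sketch below is an assumption that the scenario does not state, and different assumptions about working hours or per-call overhead change the totals.

```python
MINUTES_PER_CALL = 1          # from the scenario
PEOPLE_IN_DATABASE = 1_000_000
FLU_RATE = 0.03               # historical share of buyers with influenza
HIGH_PROBABILITY = 250_000    # people flagged by the scoring profile
HOURS_PER_WEEK = 40           # assumption: not stated in the scenario


def person_weeks(calls, minutes_per_call=MINUTES_PER_CALL, hours_per_week=HOURS_PER_WEEK):
    """Convert a number of calls into person-weeks of work."""
    return calls * minutes_per_call / 60 / hours_per_week


CALL_EVERYONE = person_weeks(PEOPLE_IN_DATABASE)            # about 416.7 person-weeks
CALL_FLAGGED = person_weeks(HIGH_PROBABILITY)               # about 104.2 person-weeks
EXPECTED_CASES = PEOPLE_IN_DATABASE * FLU_RATE              # about 30,000 people
CALLS_AVOIDED = 1 - HIGH_PROBABILITY / PEOPLE_IN_DATABASE   # 0.75
```

Under these assumptions the scoring profile removes three quarters of the calls. How many of the expected 30,000 cases fall inside the flagged group depends on the quality of the profile, which the scenario does not specify.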

As shown in Figure 3 below, big data has the capability to provide clarity when analyzing a finite population. As the data set is updated, expanded or additional fields are added, big data can continue to make use of the data to improve accuracy.

Machine Learning

Machine learning is a scientific discipline that explores the study of algorithms that improve over time as the data being analyzed improves. The algorithms are mathematical optimizations rather than explicit rules-based calculations. Part of the field covers the area of pattern recognition, which is a critical biosurveillance capability.

Machine learning has gained prominence with the advancements in big data. There are three phases in data analysis: collect, analyze and predict. Big Data has traditionally focused on collecting and storing data. True insight from predictive modeling requires a system that continually iterates through the data, builds a model, tests it and includes other data if needed. This type of analysis has usually been the domain of data scientists. However, as the big data ecosystem continues to evolve, this powerful capability is being made available to everyday users.

Praedico™ Biosurveillance Provides Breakthrough Capabilities To Execute the National Strategy

Praedico™ Biosurveillance has been purpose-built to integrate biosurveillance information from various sources and enable early warning to decision makers. The product is based on the latest commercial technologies available and provides an end-to-end system to achieve the goals of the DHS National Biosurveillance Integration Center Strategic Plan.

The tests conducted by the VA's Office of Public Health demonstrated that Praedico™ Biosurveillance requires significantly less data storage and provides the same level of analytical capabilities as the ESSENCE platform. A key capability provided by Praedico™ Biosurveillance was the ability to perform analysis on integrated data from both the VA and the DoD. Queries have reduced response times, often producing results in seconds when analyzing databases with millions of records.

Praedico™ Biosurveillance

Praedico™ Biosurveillance is a state-of-the-art biosurveillance toolset designed for early detection, monitoring, and forecasting of infectious disease outbreaks. The product also has the ability to analyze and detect relevant abnormalities, effectively uncovering the unknown unknowns.

Praedico™ Biosurveillance provides the following features to accelerate the analysis of data and identify relevant events.

  • Intuitive, easy-to-use interface for analysis and visualization.
  • Rapid and efficient integration of new data from disparate sources.
  • Data migration from existing systems to an integrated data mart.
  • Multi-faceted algorithmic capabilities that are fully customizable.
  • Alerts and alarms filtered by urgency.
  • Ad hoc data gathering, analysis on demand.
  • FISMA compliance (includes role-based security and auditing).
  • FedRAMP compliance.
  • Integrated electronic disease reporting and surveillance system with electronic laboratory reporting (ELR).

Figure 4.1 shows an example of the advanced analytic capability of Praedico™ Biosurveillance. The figure shows how easily the same data can be analyzed in multiple formats. The data being displayed is the anonymized H1N1 outbreak numbers from the 2009 flu season. The map shows the geospatial relationship between VHA and DoD data, whereas the time series graph shows the total number of influenza-like-illness (ILI) cases as well as the proportion of ILI cases over the number of all cases.

The significance to an epidemiologist of being able to geographically visualize several very large data sets cannot be overemphasized. Trends and anomalies of concern in specific geographic areas can be spotted much more rapidly and intuitively. In addition, being able to visualize multiple datasets side by side (in this case VA and DoD) can be very helpful. In this example, the DoD population tends to be younger on average than the VA population, and being able to see trends that affect both, or just one, of these datasets can be very insightful for the epidemiologist.

Categorization graphs of user-selected parameters and the raw line listing of the data are also displayed for viewing on a single page. All of the data is consistently marked in blue for the VHA and green for the DoD. Generating this type of analysis using multiple data sources is built into Praedico™ Biosurveillance and can be done on demand. Variations of this data can also be displayed using multiple variables such as time period, ICD-9 codes or customized groupings. These operations can be completed in near real-time for analysis and pattern identification based on the underlying data.

Figure 4.2 shows the data depicted in a time-series format, which is critical for an epidemiologist in spotting trends and any unusual spikes. As in the case of Figure 4.1, being able to compare these trends across two or more different datasets deepens the insights gained by the user. In addition, this graph shows the numbers as a percentage of total population, which is helpful in comparing two datasets of different sizes where a comparison of raw numbers would not be as useful.
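
The value of plotting percentages rather than raw counts can be shown with a small calculation. The weekly counts below are hypothetical and are not taken from the VA or DoD data. The function name is also an assumption.

```python
def ili_percent(ili_visits, total_visits):
    """Return ILI visits as a percentage of all visits, per period."""
    return [
        100.0 * ili / total if total else 0.0
        for ili, total in zip(ili_visits, total_visits)
    ]


# Hypothetical weekly counts for two datasets of very different size.
LARGE_ILI, LARGE_TOTAL = [500, 900, 1500], [50_000, 50_000, 50_000]
SMALL_ILI, SMALL_TOTAL = [40, 90, 160], [2_000, 2_000, 2_000]

LARGE_PCT = ili_percent(LARGE_ILI, LARGE_TOTAL)   # [1.0, 1.8, 3.0]
SMALL_PCT = ili_percent(SMALL_ILI, SMALL_TOTAL)   # [2.0, 4.5, 8.0]
```

On raw counts the larger dataset dominates the chart, yet the percentages show the smaller dataset rising faster in every week of this made-up example.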

 

Figure 4.3 shows a breakdown of VA and DoD data by facility and zip code. This is helpful for the analyst in determining whether there might be an unnatural bias in the data that needs to be taken into consideration. For example, if there were an unusually high number of records in one particular location, the analyst would want to know why that was the case.

Figure 4.4 shows the ability of the system to drill down to the individual patient record level. This can be helpful in looking more closely at a specific set of patients who might, for example, exhibit a specific set of unusual symptoms or are known to have been admitted to a certain facility during a specified time period.

Praedico™ Biosurveillance is a next generation biosurveillance application that incorporates cloud computing technology, Big Data, and machine learning. Praedico™ Biosurveillance has the ability to assess and query data across organizational boundaries. The interface is intuitive and allows dynamic definition and analysis of geospatial data, multiple EHR domains and customizable social media streaming from public health-related sources.

Praedico™ Biosurveillance also includes toolsets that enable an in-depth multidisciplinary analysis to provide timely and relevant information to support decisions. User-defined reporting allows for custom or ad-hoc analysis tailored to a specific project. Data migration tools are also available to leverage data from existing biosurveillance applications, such as ESSENCE, for analysis using Praedico™ Biosurveillance.

Praedico™ Biosurveillance also has the ability to perform on-demand analysis across several user-defined data categories. Figure 5 shows an example of a query across three different data sets. Essentially the query is a join operation whose output is the intersection of the data points. The output can be displayed on a time series graph that can be dynamically changed to show the output aggregated over different time periods.
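
A join whose output is the intersection of several data sets, followed by aggregation over a time period, can be sketched as follows. The three data sets, the patient identifiers, and the function names are hypothetical. The sketch shows the operation in general, not the product's query engine.

```python
from collections import Counter
from datetime import date

# Hypothetical data sets keyed by patient identifier.
OUTPATIENT = {"p1": date(2009, 10, 5), "p2": date(2009, 10, 6), "p3": date(2009, 10, 14)}
ILI_DIAGNOSIS = {"p1", "p3", "p4"}
ANTIVIRAL_RX = {"p1", "p2", "p3"}


def intersect(visits, *criteria):
    """Keep only visits whose patient appears in every criteria set (an inner join on patient id)."""
    keys = set(visits)
    for members in criteria:
        keys &= set(members)
    return {key: visits[key] for key in keys}


def weekly_counts(visits):
    """Aggregate matched visits by ISO (year, week) to form a time series."""
    counts = Counter()
    for day in visits.values():
        iso = day.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(sorted(counts.items()))


MATCHED = intersect(OUTPATIENT, ILI_DIAGNOSIS, ANTIVIRAL_RX)
SERIES = weekly_counts(MATCHED)
```

Only patients present in all three sets survive the join. Swapping `weekly_counts` for a daily or monthly grouping changes the aggregation period without touching the join itself.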

The speed at which data can be analyzed directly impacts the time it takes to identify novel health events and communicate the information to the multiple jurisdictions concerned. The ability to integrate disparate data sources rapidly and efficiently also allows algorithms to be tested to improve the predictive capabilities of a biosurveillance program.

Praedico™ Biosurveillance has unprecedented algorithmic flexibility and provides one of the industry's most robust feature sets in this area. Rather than relying on a single statistical model, Praedico™ Biosurveillance combines multiple statistical methods, supervised classifiers and expert-designed, rule-based models to detect health-related outbreaks. Additional information such as confidence intervals, geographic spread and the spread rate can also be calculated.

Both statistical and machine learning techniques can be applied to discover data abnormalities. The advanced statistical capability is highly effective in detecting abnormalities in univariate time series data. It leverages machine learning techniques to improve the understanding of the relationship between dependent and independent events and the dynamics of phenomena.
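
As an illustration of abnormality detection in a univariate time series, the sketch below flags any day whose count sits far above a trailing baseline. This is a generic moving-window z-score and is not claimed to be the method Praedico™ uses. The window length, threshold, and sample counts are assumptions.

```python
from statistics import mean, stdev


def flag_anomalies(counts, window=7, threshold=3.0):
    """Flag indices whose count exceeds the trailing-window mean by `threshold` standard deviations."""
    flagged = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # Guard against a flat baseline, where the standard deviation is zero.
        if sigma == 0:
            sigma = 1.0
        if (counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical daily ILI visit counts with a spike on the last day.
DAILY = [10, 12, 11, 9, 10, 13, 11, 12, 10, 40]
ALERTS = flag_anomalies(DAILY)
```

Here only the final day is flagged, since ordinary day-to-day variation stays well inside three standard deviations of the preceding week. Lowering `threshold` makes the detector more sensitive at the cost of more false alarms.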

The powerful built-in algorithmic capabilities can be enhanced or modified based on the scenario and analysis context. For example, Praedico™ Biosurveillance allows the specificity of a symptom to be adjusted for an outbreak alert system. When an expert user marks a result as a false positive, the system tunes the algorithms automatically, improving the overall accuracy of the alerts.
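
One simple way expert feedback could tune an alert rule is to nudge its threshold after each review. The class below is a hypothetical sketch of that idea, with made-up parameter names and step sizes. The source does not describe how Praedico™ performs its tuning.

```python
class TunableAlert:
    """Alert rule whose threshold is nudged by expert feedback (illustrative only)."""

    def __init__(self, threshold=3.0, step=0.25, floor=1.0, ceiling=6.0):
        self.threshold = threshold
        self.step = step
        self.floor = floor
        self.ceiling = ceiling

    def should_alert(self, score):
        """Raise an alert when the anomaly score exceeds the current threshold."""
        return score > self.threshold

    def record_feedback(self, false_positive):
        """Raise the threshold after a false positive; lower it after a confirmed event."""
        delta = self.step if false_positive else -self.step
        self.threshold = min(self.ceiling, max(self.floor, self.threshold + delta))


RULE = TunableAlert()
FIRST = RULE.should_alert(3.2)        # True at the initial threshold of 3.0
RULE.record_feedback(false_positive=True)
SECOND = RULE.should_alert(3.2)       # False once the threshold has risen to 3.25
```

The floor and ceiling keep a run of similar feedback from pushing the rule to a point where it alerts on everything or on nothing.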

Figure 6 provides the high level architecture for Praedico™ Biosurveillance. The design of Praedico™ Biosurveillance leverages the technology advancements mentioned previously, resulting in unprecedented flexibility and performance for a biosurveillance application. Highlights of the product's architecture are below.

  • Big Data Technology: Integration of open-source big data technologies for scalability, flexibility and cost benefit.
  • Computing Performance: Designed to handle very large data sets without compromising data access performance. Support for real-time ad-hoc data analysis on large-scale data.
  • Scale Out Computing: Utilizing commodity hardware for clusters of servers for fast distributed processing, load balancing, high availability and failover support.
  • Machine Learning: Machine learning algorithms to support event discoveries, data mining, intelligent decision making and alerts.
  • Flexibility: Flexible plug-in architecture that can be extended with new technology stacks to cater to different use cases.
  • Support for different data types: Support for structured and unstructured data types, collection of data from various data platforms.
  • Security: Enterprise security integration with role-based data authorization, auditing and data protection.

One of the keys to collaboration across agencies is the ability to share data. As previously discussed, each agency has its own system, which was not necessarily designed for easy data sharing. Interagency collaboration is achieved by enabling data integration regardless of the system source or format. The Praedico Data Platform (PDaP) has already made available several VA EHR datasets (or domains) including:

  • Outpatient visits
  • Inpatient hospital stay
  • Inpatient bed section
  • Inpatient surgery
  • Inpatient procedure
  • Telecare calls

The integration allows VA and DoD staff to monitor bio-related data across the VA and DoD population. Furthermore, the shared data can also be used to add value to other analyses. Sharing and leveraging data across diverse information sources increases the likelihood of identifying trends that signal a threat, and makes it easier to track events and provide ongoing situational awareness.

Conclusion

Biosurveillance is a broad objective that touches federal, state and local agencies as well as private labs and hospitals. The DHS has a strategic vision to advance the safety, security, and resilience of the Nation by leading an integrated biosurveillance effort that facilitates early warning and situational awareness of biological events. The key to an effective biosurveillance system is the ability to pull data from disparate systems and sources and incorporate the data to provide a holistic and integrated view of all relevant events. However, data silos often make it difficult to achieve this.

Current ESSENCE implementations have created islands of data that cannot be easily leveraged across agencies and organizations. With the diverse nature of data required for an effective biosurveillance strategy, a solution that can unify the disparate data sources and simplify the analysis of the combined set of data is necessary. Praedico™ Biosurveillance provides these capabilities and creates a system that enables early warning and shared situational awareness of acute biological events to support better decisions through rapid identification, characterization, localization, and tracking.

Praedico™ Biosurveillance also incorporates the latest advancements in the fields of Big Data, scale out computing and machine learning to accelerate data analysis, which leads to faster detection of biological events. It has an extensible architecture that can evolve over time to ensure that it can incorporate data from any source. With its intuitive interface and high performance queries, users can spend more time analyzing data versus performing data administration tasks.

A biological event requires rapid detection. Once detected, the decision makers must have accurate information to take action to contain the event. Praedico™ Biosurveillance is the industry-leading platform that provides an end-to-end solution from prediction to detection to monitoring. It can help the nation build a state-of-the-art biosurveillance program and enable effective collaboration among federal, state and local agencies and private healthcare organizations.

1 Presidential Directive: National Strategy for Biosurveillance, July 2012

References

Biosurveillance: Efforts to Develop a National Biosurveillance Capability Need a National Strategy and a Designated Leader. U.S. Government Accountability Office, June 2010.

Holodniy, et al. Evaluation of Praedico™, a Next Generation Big Data Biosurveillance Application. Office of Public Health, Department of Veterans Affairs, Bitscopic, Inc., Armed Forces Health Surveillance Center.

National Biosurveillance Integration Center Strategic Plan. U.S. Department of Homeland Security, November 2012.

Presidential Directive: National Strategy for Biosurveillance, July 2012

Allison Young, Fifth Monkey Has Signs of Deadly Bacteria in Lab Mishap. USA Today, March 4, 2015.

The contents of this article are based on a white paper titled Praedico™ Biosurveillance, published by Bitscopic in March 2015.