The Appeal of Graph Databases for Health Care

Andy Oram

A lot of valuable data can be represented as graphs. Genealogical charts are a familiar example: they represent people as boxes, connected by lines that represent parent/child or marriage relationships. In mathematics and computer science, graphs have become a discipline all their own. Now their value for health care is emerging.

Graph computing made a significant advance this past February in the form of a Graph Data Science (GDS) library for the free and open source Neo4j graph database. Graph databases are proving their value in clinical research and public health; I wonder whether they can also boost analytics for providers. This article explains what's special about graph databases and describes some health care applications highlighted in recent webinars offered by the Neo4j company.

Graph databases and all the rest

Graphs can be incredibly complex, but like many sophisticated structures, they're based on very simple units. Two people, say England's King Henry VIII and Queen Elizabeth I, are connected by a relationship (parent/child). The people can be considered nodes or vertices in the graph, and the relationship is an edge, represented by a line. It is often the relationships or edges that are processed by graph algorithms.

You might also have seen graphs in the form of RDF and the ontologies of the Semantic Web, a pet concept of the Web's inventor, Tim Berners-Lee. Although graphs are a radical departure from traditional databases, once you understand what they represent they feel natural.

After defining a graph, the next step is to create algorithms that answer the key questions people have. For example:

  • How many degrees of separation lie between me and Dr. Atul Gawande?
  • What is the shortest path between my computer and the server hosting the web content I'm reading (a common networking problem)?
  • How dense is the graph? In other words, do the people in it all know each other well?
  • Which people resemble each other in their diagnoses and outcomes? This is an example of connectivity metrics, an important type of graph algorithm. It's a graph-oriented version of clustering, a common task for machine learning.
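To make the first and third questions concrete, here is a minimal sketch in plain Python, not Neo4j code. The tiny acquaintance graph and every name in it are invented for illustration. It counts degrees of separation with a breadth-first search and measures density as the fraction of possible connections that actually exist.

```python
from collections import deque

# Hypothetical acquaintance graph: each key is a person (node) and each
# value is the set of people they know (edges). All names are invented.
KNOWS = {
    "me": {"alice", "bob"},
    "alice": {"me", "carol"},
    "bob": {"me"},
    "carol": {"alice", "dr_g"},
    "dr_g": {"carol"},
}


def degrees_of_separation(graph, start, goal):
    """Return the number of edges on a shortest path, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def density(graph):
    """Return the fraction of possible undirected edges that are present."""
    n = len(graph)
    if n < 2:
        return 0.0
    # Every undirected edge appears in two adjacency sets, so halve the total.
    edge_count = sum(len(neighbors) for neighbors in graph.values()) / 2
    return edge_count / (n * (n - 1) / 2)
```

In this toy graph, "me" is three steps from "dr_g", and the density is 0.4 because 4 of the 10 possible connections exist. A graph database runs the same kind of traversal over stored relationships instead of an in-memory dictionary.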

Such questions can be answered by Neo4j, first released in 2007 under the GPLv3 free license. Its Java source code is on GitHub. The Neo4j company pursues the common "open core" business model, offering a proprietary Enterprise service with more help for security and administrative tasks for those willing to pay. I have enjoyed following Neo4j over many years. I spent a 5K walk with several of their staff at one of the Open Source conventions held by O'Reilly Media, and we had a great conversation about free software, diversity, and other topics.

The GDS library answers even more complicated questions. For instance, you can find the "degree centrality" of a node, which counts how many relationships connect it directly to other nodes in the graph. In a graph representing a neighborhood, this number may help determine how much influence a person has over the behavior of others.
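As a rough illustration of the idea, and not the GDS implementation, degree centrality can be computed from an adjacency dictionary in a few lines of plain Python. The neighborhood below is invented, and the score is normalized by the number of other nodes, which is one common convention.

```python
# Hypothetical neighborhood: who talks regularly with whom (invented names).
NEIGHBORS = {
    "ana": {"ben", "cy", "dee"},
    "ben": {"ana"},
    "cy": {"ana", "dee"},
    "dee": {"ana", "cy"},
}


def degree_centrality(graph):
    """Map each node to (number of neighbors) / (number of other nodes)."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


scores = degree_centrality(NEIGHBORS)
most_connected = max(scores, key=scores.get)
```

Here "ana" scores 1.0 because she is connected to everyone else, while "ben" scores one third. Whether a high score really translates into influence is a question for the analyst, not the algorithm.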

Another example of a GDS function is Louvain, a popular clustering algorithm. Clustering is familiar to those who do data science: it can show you, for instance, which people are similar along a range of dimensions (age, diagnoses, prognosis, etc.).
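Louvain itself is too long to reproduce here, but the score it tries to maximize, called modularity, is short. The sketch below is plain Python with an invented six-node graph, and it only evaluates a grouping that is already given; it does not search for the best grouping the way Louvain does. Modularity is high when most edges fall inside groups and few fall between them.

```python
# Hypothetical graph: two triangles (a-b-c and d-e-f) joined by one edge c-d.
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}


def modularity(graph, community_of):
    """Modularity of a partition of an undirected, unweighted graph.

    For each community: (edges inside / all edges)
    minus (sum of member degrees / twice all edges) squared.
    """
    total_edges = sum(len(neighbors) for neighbors in graph.values()) / 2
    if total_edges == 0:
        return 0.0
    internal = {}    # edges with both ends in the community
    degree_sum = {}  # sum of degrees of the community's members
    for node, neighbors in graph.items():
        c = community_of[node]
        degree_sum[c] = degree_sum.get(c, 0) + len(neighbors)
        for neighbor in neighbors:
            if community_of[neighbor] == c:
                # Each internal edge is visited from both ends.
                internal[c] = internal.get(c, 0) + 0.5
    return sum(
        internal.get(c, 0) / total_edges - (degree_sum[c] / (2 * total_edges)) ** 2
        for c in degree_sum
    )


two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in GRAPH}
```

Splitting the graph into its two triangles gives a modularity of 5/14, roughly 0.357, while lumping everyone into one group gives 0. An algorithm such as Louvain would prefer the first grouping.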

In one type of graph, nodes are all the same type of thing. For instance, in a genealogical graph, all the nodes are people. But a graph can also contain nodes of different types. For instance, a graph listing people, diagnoses, and medications can connect these things in complex relationships.

Most data science, which uses non-graph data stores, treats data as vectors. Thus, people's weights would be listed as a simple array of numbers. Heights would be a different vector or array. From high school math, you may remember that heights and weights can be compared as a matrix, which is a two-dimensional combination of vectors. Beyond the matrix, you can compare more variables using three-dimensional or even thousand-dimensional combinations of vectors, called tensors. This terminology shows up in the names of popular data science tools such as Google's TensorFlow.

Now contrast the vectors used by conventional data science with the structure of a graph. Vectors and tables can show simple relationships, such as who has a diagnosis or what medication they're taking. But let's suppose you want to find a connection between a medication and its side effects. You'd probably search for that medication through the table or vector of patients, pull out all the patients, then search another vector or table for the side effects those patients report. Following the relationships within a graph might be easier.
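The two-step walk from a medication to its patients to their reported side effects can be sketched in plain Python. The relationship names TAKES and REPORTS, the patients, and the drugs are all hypothetical, and a real graph database would store and index these relationships for you rather than rebuilding the index on each call.

```python
from collections import Counter, defaultdict

# Hypothetical labeled edges: (source node, relationship, target node).
EDGES = [
    ("patient_1", "TAKES", "drug_a"),
    ("patient_2", "TAKES", "drug_a"),
    ("patient_3", "TAKES", "drug_b"),
    ("patient_1", "REPORTS", "nausea"),
    ("patient_2", "REPORTS", "nausea"),
    ("patient_2", "REPORTS", "headache"),
    ("patient_3", "REPORTS", "rash"),
]


def build_index(edges):
    """Index edges so they can be followed in either direction."""
    outgoing = defaultdict(list)  # (source, relationship) -> targets
    incoming = defaultdict(list)  # (target, relationship) -> sources
    for source, relationship, target in edges:
        outgoing[(source, relationship)].append(target)
        incoming[(target, relationship)].append(source)
    return outgoing, incoming


def side_effects_for(medication, edges):
    """Count side effects reported by patients who take the medication."""
    outgoing, incoming = build_index(edges)
    counts = Counter()
    for patient in incoming[(medication, "TAKES")]:
        for effect in outgoing[(patient, "REPORTS")]:
            counts[effect] += 1
    return counts
```

Calling side_effects_for("drug_a", EDGES) counts nausea twice and headache once. The result shows only co-occurrence: the sketch does not establish that the drug caused either effect.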

Graph databases are powerful, but computationally expensive. Their use must be justified by the unique insights they deliver. So next we'll examine some ways they're being used in health care.

Graph databases in clinical research and public health

The Neo4j company is holding a series of webinars showcasing the use of its database in health care. These sessions offer a good deal of interesting technical material about possible applications for graph databases. The webinars can be found on this page. I'll cover two case studies in this section.

Patient clusters

The biopharmaceutical company AstraZeneca used a mix of a graph database and conventional machine learning to find clusters of patients who share illnesses, responses to these illnesses, and course trajectories. The sequence of research was complex, so Amy Hodler, Director of Graph Analytics and AI Programs at Neo4j, explained the sequence to me. The researchers turned up shared characteristics that might help influence treatment: for instance, whether someone who got diagnosed at an earlier stage or saw a specialist more quickly was more likely to improve their condition.

Note: I used the term "cluster" because I introduced that term earlier in the context of machine learning. However, clinical researchers actually call these comparisons patient "communities." Laypeople may find the term "community" confusing, because it does not refer here to people who know each other or live near each other, just people who share the characteristics the researchers are measuring.

AstraZeneca collected a huge amount of longitudinal patient data covering a three-year period: clinical visits, diagnoses, and tests. The date of each event helped them use a graph database to find the trajectory of each patient, and which events might have had a significant impact. The researchers used the timelines created from the graph database to create vectors that could be submitted to classic machine learning, which then detected similarities in the patient journeys. Results of the machine learning were put back into a graph to derive the communities or clusters.
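The step from timelines to vectors can be pictured with a small sketch in plain Python. This is not AstraZeneca's method, which the article does not detail. The features chosen here (a count of each event type plus the number of days the timeline spans) and the sample dates are assumptions made purely for illustration.

```python
from datetime import date

# Event categories named in the article; the encoding below is an assumption.
EVENT_TYPES = ["visit", "diagnosis", "test"]


def timeline_to_vector(events):
    """Turn a list of (date, event_type) pairs into a fixed-length vector.

    Result: [visits, diagnoses, tests, days between first and last event].
    """
    ordered = sorted(events)  # chronological order
    counts = [sum(1 for _, kind in ordered if kind == t) for t in EVENT_TYPES]
    span_days = (ordered[-1][0] - ordered[0][0]).days if ordered else 0
    return counts + [span_days]


# Hypothetical patient timeline with invented dates.
patient_events = [
    (date(2019, 1, 10), "visit"),
    (date(2019, 2, 1), "test"),
    (date(2019, 1, 25), "visit"),
    (date(2019, 3, 11), "diagnosis"),
]
vector = timeline_to_vector(patient_events)
```

This patient becomes the vector [2, 1, 1, 60]. Once every patient is encoded the same way, conventional machine learning tools can compare the vectors for similarity.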

Follow the money: questioning influences

Another research project, led by Cynthia Femano at Neo4j, created a graph database to trace the effects of payments by pharmaceutical companies to doctors prescribing opiates. This has been the subject of highly publicized lawsuits and criminal prosecutions. The researchers combined five different databases to create their graph.

The team's preliminary research makes limited use of the graph. For instance, you can look up a physician and find out what payments they received and what quantity of opiates they prescribed. This graph database should be able to reveal more general trends, such as "A payment of X dollars was correlated with an increase of Y in prescriptions." That will probably be a future project in the program.

A detailed article on the project can be found here, and a video of the webinar can be found below.

Where are graph databases useful?

The sample uses of graphs in this article fall into two categories: clinical research and public health. Neo4j has found this generally true. Graph databases are not typically used for the analytics conducted by hospitals and clinics. I know of two possible reasons for this:

  • Hodler suggested that life scientists and public health experts are already familiar with graph concepts, possibly guiding them to become early adopters.
  • Graph databases are most powerful at finding extended connections that cross two or more relationships. Conventional databases with less overhead can generally be used to find direct relationships, and this may be fine for typical clinical questions such as, "What patient loads can we expect next month?" or "Which of our doctors have the best outcomes?"

This article therefore ends with a challenge to hospital and clinical IT staff: what important trends could you uncover if you could trace complex relationships spanning multiple nodes? As we come to understand the strengths of graph databases, they may prove useful in clinical settings as well as the applications seen in this article.