Affordable COVID-19 Diagnoses for Hospitals: How Open Source Software Helps

Andy Oram

The most common COVID-19 symptoms, such as coughing, fever, and shortness of breath, are shared with many other diseases. Diagnosing a patient accurately is therefore a challenge. Although a diagnosis of COVID-19 might not affect treatment, it would help a hospital predict a patient's trajectory and anticipate the need for urgent intervention. But current tests, relying on blood or mucus samples, are not particularly accurate.

In this article, we'll see how open source software can help hospitals make better diagnoses. I'll concentrate on one specific role, and on the ways open source facilitates finding a solution and keeping it affordable. Many aspects of the problem feed into the solution discussed here. The article is based on work by researcher Trevor Grant.

Aspects of the problem

CT scans have been shown to work well in conjunction with the familiar test that involves a swab in the patient's nose, the reverse-transcription polymerase chain reaction (RT-PCR) test. Hospitals concerned about a patient at high risk can therefore benefit from administering a CT scan that reveals abnormalities in the patient's lungs.

But already, a problem appears here: a conventional CT scan exposes the patient to high levels of radiation. Grant points out that the people who most commonly suffer from COVID-19 symptoms (elderly people and those with pre-existing health conditions) are likely to have received many CT scans already, and are placed at risk by each new scan. Therefore, hospitals should resort to low-dose chest scans.

Having reached this conclusion, we encounter the next problem: low-dose chest scans produce lower-resolution images. These contain more "noise" (spots in the image that don't actually reflect lung problems) than conventional scans.

And that's where machine learning software comes in. Grant and his team have identified a machine learning model that "denoises" the scan in a cloud computing environment, allowing low-cost processing with free and open source software. Radiologists can determine from the results whether the patient has COVID-19, with accuracy approaching that of normal CT scans. As we look at this software, we'll see additional environmental constraints that Grant's team had to overcome.

Machine learning at the Apache Software Foundation

The particular software functions used for denoising the images come from Apache Mahout, a recent addition to the spectacular collection of machine learning tools under the umbrella of the Apache Software Foundation (ASF). Let me offer a bit of historical background on these achievements.

Machine learning, which overlaps with the related concepts of deep learning and neural networks, is the direction that artificial intelligence has taken over the past 15 years. When you hear of spectacular new discoveries in analyzing astronomical, population, or biological data, machine learning probably plays a role. It is also used for the routine calculations that, like it or not, direct our everyday lives, such as movie recommendations seen by people forced by the pandemic into intimate relationships with their TV screens. The widespread use of machine learning has been facilitated by faster hardware (notably GPUs) and by virtualization and cloud computing, which make it easy to carry out the algorithms by running large numbers of computers for short periods of time.

As a fast-moving research area, machine learning benefits constantly from new discoveries at universities and research facilities. Major computer companies such as Microsoft, IBM, and Google devote teams of researchers to discovering new machine learning methods, and often publish the methods so that they can be shared by others. Many top-notch machine learning tools are therefore open source.

How did the Apache Project get involved? Some readers may remember Apache as a web server, first released in 1995. The Apache Software Foundation was formed a few years later, offering legal and organizational help to new open source projects that would support the web server. The foundation institutionalized good development practices and came to be seen as a robust roosting place for promising young technologies, particularly through its incubator program.

Like some other foundations in the free software movement, such as the Linux Foundation and the Eclipse Foundation, the Apache Software Foundation has expanded its scope because it had so much to offer new projects. By now, the foundation has become the preferred haven for open source projects that provide machine learning, as well as related projects in big data and the processing of streaming data.

Machine learning for COVID-19 diagnosis

Apache Mahout, whose leadership team includes Grant, implements a number of powerful and popular machine learning algorithms. Most users run Mahout algorithms on Spark, another popular ASF tool used for processing large amounts of data in distributed fashion.

[Image: 5.0% denoised (k=244, oversample=15, power_iters=2)]

The algorithm used for denoising is abstract, and I won't try to describe it in full because that would require advanced mathematical discussion. Basically, what a hospital needs to accomplish is to eliminate spots on the CT scan image that are "less important," meaning, in this application, spots that don't reflect the actual image of the lungs. That can be done through a common algorithm called singular value decomposition (SVD).

To apply the SVD, a little preprocessing is necessary. A CT scan image is normally represented in three dimensions. One dimension is a series of "slices," each slice being a two-dimensional image. The SVD algorithm requires a two-dimensional matrix, but it's easy to reduce three dimensions to two: just string out the rows of each two-dimensional matrix, as you might unfold a fold-up walking stick or measuring stick.
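The flattening step can be sketched in a few lines of NumPy. This is only an illustration: the function names are made up for this example, and the choice to turn each slice into one row of the matrix is an assumption about the layout, not a description of the code Grant's team uses.

```python
import numpy as np

def flatten_scan(volume):
    """Turn a (slices, height, width) scan into a (slices, height*width) matrix.

    Each slice's rows are laid end to end, so one slice becomes one long row.
    """
    slices, height, width = volume.shape
    return volume.reshape(slices, height * width)

def unflatten_scan(matrix, height, width):
    """Undo flatten_scan so the slices can be viewed as images again."""
    return matrix.reshape(matrix.shape[0], height, width)

# A toy "scan": 2 slices of 3x4 pixels (a real scan is far larger).
volume = np.arange(2 * 3 * 4).reshape(2, 3, 4)
matrix = flatten_scan(volume)
print(matrix.shape)  # (2, 12)
```

Because the reshape is reversible, the denoised matrix can be folded back into slices for the radiologist to view.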

The SVD algorithm produces a new set of matrices with powerful properties. Essentially, the algorithm emits a row with all the most important or significant values, then a row with the second-most important values, and so on. The algorithm produces a reduced matrix with about 300 rows, of which the bottom 5% are discarded as noise.
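To make the idea concrete, here is a minimal single-machine sketch in NumPy. It assumes that "discarding the bottom 5%" means dropping the components with the smallest singular values and rebuilding the matrix from the rest. The function name and the drop_fraction parameter are hypothetical, and this is not the Mahout code used by Grant's team.

```python
import numpy as np

def svd_denoise(matrix, drop_fraction=0.05):
    """Rebuild a matrix after dropping its least significant SVD components."""
    # Singular values in s come back sorted from largest to smallest.
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    n_drop = int(round(len(s) * drop_fraction))
    keep = len(s) - n_drop
    # Reconstruct using only the `keep` most significant components.
    return (u[:, :keep] * s[:keep]) @ vt[:keep]

# Toy data standing in for a flattened scan: 40 "slices" of 200 pixels each.
rng = np.random.default_rng(0)
noisy = rng.normal(size=(40, 200))
denoised = svd_denoise(noisy, drop_fraction=0.05)
print(denoised.shape)  # same shape as the input
```

Raising drop_fraction removes more of the matrix. That mirrors the tradeoff described later in the article, where too much denoising starts to degrade the image.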

[Image: 10.0% denoised (k=244, oversample=15, power_iters=2)]

But one more practical consideration stands in the way of the algorithm. A traditional SVD, applied to images of the size produced by a scan, requires 512 GB of memory. But most cloud computing environments, such as Amazon's EC2, offer a maximum of 394 GB of memory. Running a traditional SVD would require a very expensive investment in specialized computing equipment.

Luckily, Mahout offers a distributed version of SVD for just such situations. By dividing the algorithmic work among many processors and running the algorithm on a computer cluster, hospitals can denoise images cheaply. Grant's team used open-source Kubeflow Pipelines for the solution, highlighted in a book from O'Reilly Media.
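The image captions in this article list the parameters k, oversample, and power_iters. Those are the usual knobs of a randomized (stochastic) SVD, an approximate decomposition that works on a much smaller projected matrix and therefore lends itself to distribution. Assuming that is the kind of algorithm in play, the following single-machine NumPy sketch shows what each parameter controls. It illustrates the general technique only; it is not Mahout's distributed implementation, and the function name is made up.

```python
import numpy as np

def randomized_svd(a, k, oversample=15, power_iters=2, seed=0):
    """Approximate the top-k SVD of `a` using random projection."""
    rng = np.random.default_rng(seed)
    m, n = a.shape
    # Sample a few more directions than needed to improve accuracy.
    ell = min(k + oversample, m, n)
    omega = rng.normal(size=(n, ell))
    # Orthonormal basis for the (approximate) range of a.
    q, _ = np.linalg.qr(a @ omega)
    # Power iterations sharpen the basis toward the dominant components.
    for _ in range(power_iters):
        z, _ = np.linalg.qr(a.T @ q)
        q, _ = np.linalg.qr(a @ z)
    # Exact SVD of the much smaller projected matrix (ell x n).
    b = q.T @ a
    ub, s, vt = np.linalg.svd(b, full_matrices=False)
    u = q @ ub
    return u[:, :k], s[:k], vt[:k]

# Toy input with exactly 5 significant components.
rng = np.random.default_rng(42)
a = rng.normal(size=(60, 5)) @ rng.normal(size=(5, 300))
u, s, vt = randomized_svd(a, k=5)
approx = (u * s) @ vt
print(np.allclose(approx, a))
```

The expensive full decomposition is replaced by matrix products and a small SVD, and matrix products are the kind of work that can be divided among the machines of a cluster.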

The three images illustrate how the denoising works. According to Grant, "in these images we see denoising increasing from 5% to 10% to 30%." He says that "the appropriate amount of denoising is the level that allows the radiologist to most clearly see features of interest (ground glass occlusions in lungs in the case of COVID patients). Among these images, 10% is the best. Further denoising causes image quality to suffer, which is starting to happen at the 30% denoised image."

Project status

Grant has written an academic paper on his team's algorithm and is offering the code to other researchers on GitHub. The team hopes that others can use the algorithm in their own research, for example to preprocess the images they work with.

[Image: 30.0% denoised (k=244, oversample=15, power_iters=2)]

So it is fortunate that the Radiological Society of North America (RSNA) announced that an open repository of CT scans of COVID-19 patients will soon be made available to researchers in collaboration with the European Imaging COVID-19 AI initiative, which is supported by the European Society of Medical Imaging Informatics. According to the official announcement, "the organizations expressed the common goal of creating a secure way to share COVID-19 imaging, in order to assess lung involvement more accurately with AI."

In addition, according to one published article, both Johns Hopkins Medicine and the Icahn School of Medicine at Mount Sinai have released reports and studies arguing that AI has the potential to enhance the role of chest imaging in the detection of COVID-19.

These efforts could accelerate the development of a strong denoising program.

Although researchers have access to the algorithm right now, Grant says that the Food and Drug Administration (FDA) must approve the algorithm before it is used for diagnosis. A significant amount of work is required to obtain FDA approval, so under normal circumstances it would need to be done by a major company as a commercial product. Given all the money that the government is spending on laboratory tests, perhaps it is time for the FDA, or another agency, to invest the money to make this tool available to the public.

However, even if the final tool given to hospitals is proprietary, open source software has made it possible to get this far and create a solution that is both high-quality and low-cost.