Why Open Source Will Rule Scientific Computing

Open Source on the Main Stage

Three years ago I had the privilege of attending the Open Source CFD International Conference in Barcelona, Spain, to deliver a software keynote presentation on the future of scientific computing (access to the proceedings requires registration). I enjoy such venues because they give me a chance to meet intriguing people with different viewpoints on open source, and to break away from my usual frenetic routine to think about some of these issues more deeply.

This conference did not disappoint; one of my highlights was the industrial applications keynote given by Dr. Moni Islam, Head of Aerodynamics & Aeroacoustics at Audi. What particularly struck me about Dr. Islam’s presentation was how closely it mirrored what I presented in my keynote, except that his perspective was far more valuable in that it came from a large manufacturing concern using open-source software in real-world applications.

This, of course, was very gratifying in that it crystallized trends that we have seen emerging at Kitware, and it spoke more generally to the future of open-source software in scientific computing. Dr. Islam’s presentation was not an isolated incident; at Kitware we have worked with many customers representing large, traditionally conservative companies who are convinced that open-source software is the way to go.

I think there are inexorable forces that will elevate open-source software to widespread acceptance, and eventually dominance, in the scientific computing market. For some time now I have harbored a secret belief that this is true, but when customers and scientific computing professionals start saying the same thing, it’s time to pay careful attention. Key forces will be:

  • Open Science
  • The search for authenticity
  • Quality-inducing, agile, collaborative software process
  • Scalability
  • Business model

My belief is that open technology and market forces will make the use and adoption of open-source software essential for effective scientific computing. Why? The market will demand it; the pace of scientific research will require it; and the size of the technology problems we’re tackling will necessitate that we practice the Way of the Source.

The way I see it, the frontiers of technology are expanding at an ever-increasing pace. To keep up with and participate in this unprecedented expansion of knowledge, and to succeed in business and as technology contributors, requires us to discard inefficiencies in our processes and seek better ways of creating and engineering knowledge. While this may be as straightforward as improving a software tool like a bug tracker, it also challenges us to reject old ways of thinking, such as treating knowledge as a finite resource that we divide and squabble over, or treating individuals outside our immediate organization as threats and competitors. I am a firm believer in growing the pie rather than fighting over the current pie; I think opportunities are exploding all around us and I’d rather ride on the leading edge of the wave than get caught up in the pettiness of turf battles.

I would like to add here that when I claim that open source will rule scientific computing, I am not implying that there will be no proprietary code or solutions; there will always be good reasons for such systems due to security, privacy, or business concerns, and we will continue to work with customers to provide such solutions, including incorporating open-source technology into proprietary systems.

While my preference is to use open source whenever possible, I believe that drawing a hard line in the sand is counterproductive to the overall FOSS movement. I think it does more harm than good to go militant on this front; many of our most important customers in the medical, geophysical, and pharmaceutical industries greatly appreciate the open-source value proposition and want to do the right thing.

This includes funding significant technical efforts that are contributed back to the community, often with no recognition for the contributing organization, usually to pare down the maintenance burden and ride the FOSS innovation wave. And this can be done while retaining key proprietary technologies as a business advantage. Spurning such customers is akin to the proverbial “cutting off the nose to spite the face.”

I have no problem with such relationships since they benefit the community and the public good, and, to co-opt a famous phrase (from a very proprietary software company), we aim to “embrace and extend” our customers: to introduce open source into their corporate world, extract useful technology that we can all use, and build exciting alliances that will let us all win together.

The remainder of this article summarizes why I think open-source is the future of scientific computing software.

Open Science

It breaks my heart to say this, but too often scientists and engineers (of all people) have forgotten the importance of openness, transparency, and reproducibility in their practice of science. For example, in the fields I am most familiar with--visualization, graphics, and biomedical research--recent years have seen the publication of many papers with impressive claims and high-level descriptions of algorithms that are extremely difficult, if not impossible, to replicate because the source code and data are not available.

Further, some researchers treat their data as proprietary and seem to make a career out of analyzing it, with no chance for others to replicate results. Other researchers publish complex algorithms but frequently leave out important implementation details or parameter settings. As a result, some algorithms may take years to reproduce, even by expert researchers, and the actual results can vary widely due to these omissions.

Many of us in the scientific computing community have come to the conclusion that the only solution to this sad state of affairs is to practice Open Science, in which open-source software, open data, and open access play key roles.

  • Open-source software enables researchers to rapidly reproduce the results of computational experiments and explore the behavior of algorithms.
  • Open data enables researchers to apply their software to pertinent test cases, and compare competing algorithms.
  • Open access enables others to read publications in order to understand the science.

Further, from the evidence I've seen, the practice of Open Science provides many other benefits, including fostering rapid innovation, enabling fair comparison of technologies, and providing an ideal resource for educating the technologists of the future.

Many others have independently come to the same conclusion. What is particularly heartening is the emergence of open access journals, such as the now well-known PLoS series of journals. 

Open data initiatives are becoming widespread, ranging from Data.gov (whose aim is to increase public access to high-value, machine-readable datasets generated by the Executive Branch of the Federal Government) to sites that aggregate data such as InfoChimps, Freebase, and Information Aesthetics. Even the US Library of Congress has started exploring ways of releasing open-source software.

Just recently, Jill P. Mesirov, a member of the Broad Institute of Massachusetts Institute of Technology and Harvard, proposed a Reproducible Research System (RRS) based on the concept of reproducible research put forward by Jon Claerbout in 1990. In Dr. Mesirov's vision, the RRS

"consists of two components. The first element is a Reproducible Research Environment (RRE) for doing the computational work. An RRE provides computational tools together with the ability to automatically track the provenance of data, analyses, and results and to package them (or pointers to persistent versions of them) for redistribution. The second element is a Reproducible Research Publisher (RRP), which is a document-preparation system, such as standard word-processing software, that provides an easy link to the RRE. The RRS thus makes it easy to perform analyses and then to embed them directly into a paper. A reader can readily reproduce the analysis and, in fact, can extend it within the document itself by changing parameters, data, filters, and so on." (By the way, I'd provide a link to the full article above, but it is only available through subscription, membership, or fee. Such is the state of scientific discourse in 2010: that an article espousing open research environments requires payment to access).

Obviously Kitware has taken the lessons of Open Science to heart by developing open-source software and providing open data; however, we feel like we've added a unique twist to the mix as demonstrated by the Insight Journal, the VTK Journal, and the many offshoots of the Midas Journal. 

What makes these journals special is that they are not simply collections of PDF documents and data; they also support source code submissions, which are compiled, executed, and tested as part of the submission process. Thus, automated evaluation of source code and data is combined with human review of the published technology to generate a final assessment of the (software) technology.

Another area where we've had some impact is our work with a mainstream publisher, the Optical Society (OSA). With Kitware's help, they have produced the Interactive Science Publishing (ISP) system, which enables the creation of active documents. The documents have special, embedded links that launch a viewer and download (open) data, enabling readers to interactively explore the data.

At the heart of what we do as technologists is the practice of science. At the heart of science is the ability to reproduce the results of others. Thus Open Science, and the practice of open source, is critical to the future of scientific computing.

Search for Authenticity

Authenticity is one of the characteristics of open-source software that matters the most to me and is, quite frankly, somewhat personal (the reader will have to excuse my indulgence on this one; I have a thing about realizing authentic experiences).

I don’t know about you, but when it comes to the usual sales process I am tired of marketing spin, half-truths, and flat-out lies; gimmicks designed to attract attention but not deliver the goods; product claims with asterisks; and improbable fantasies involving crime fighting in stiletto heels and tailored suits (James Bond anyone?).

While all of this can be fun at times, especially the crime fighting bits, these distractions get in the way of solving real problems, and often leave many of us feeling exhausted in our quest to get to the bottom of potentially deceptive claims.

Open-source software offers a better alternative: with some work, customers and users can evaluate the embodied technology, try the software on their own data, and determine what needs to be done in order to fit it into their workflow. Companies can also assess such software with their own in-house technologists, develop their own expertise, and avoid lock-in to a proprietary solution.

To me, these are the hallmarks of an authentic experience. As users, we can get down into the guts of the reality. If we don't like the reality we see, we can learn from it, modify it, or even go on to create our own preferred version. I find this wonderfully refreshing, not to mention empowering, and as a result I mostly feel invigorated when I engage with open-source products.

I’d like to take a moment to distinguish transparency from authenticity in this context. While transparency (obviously related to Open Science) is typically required to determine whether something is genuine (i.e., authentic), what I am talking about goes beyond just seeing through something to address the integrity of a technology. Looking at on-line dictionaries, the words I see used to describe integrity are incorruptibility, soundness, completeness, and honesty. These are all good words that describe processes necessary to the practice of good science as embodied by open-source software.

I am confident that in the future, the ability to present technology and products in an authentic manner will go a long way toward the success of commercial, academic, and research ventures. Organizations and communities that can speak clearly, honestly, and openly about their products will be sought after; an oasis of authenticity in a desert of hype will only become more valuable as the attention seekers of the world continue to do their vacuous thing.

Quality-Inducing, Agile, Collaborative Software Process

To be honest, there are many people at Kitware who can speak much more authoritatively than I can about software process. If you really want the details, I suggest that you visit Bill Hoffman's blog post or go directly to his Google Tech Talk video. Ultimately what I want to do here is examine the dynamics of the open-source software process and make the case for why it's better for scientific computing.

As many of you know, the Kitware software process is built around the core tools: CMake (cross-platform build), CPack (cross-platform packaging and deployment), CTest (testing client) and CDash (testing server). These tools are the culmination of more than a decade of organically creating a low-overhead, effective, and integrated software process.

Initially we created these tools because, like all good programmers, we have a lazy streak and like to meet excessive workload with automation. For example, in the early years of VTK, we got very tired of tracking down bugs every few months as we released software; as most of us have experienced, it can be very hard to uncover a bug introduced months before it is discovered.

Very early on, with partners in the open-source community (notably GE Global Research as a result of a Six Sigma quality project), we began creating what is today our software process. We initially focused on automated testing, followed by reporting to a centralized webpage or dashboard. Over the years this has expanded to include all facets of software development, including communication, version control, documentation, bug tracking, deployment, and more sophisticated testing.
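To make this concrete, here is a minimal sketch of what a dashboard-driven regression test looks like from a developer's point of view: the test is simply a small program whose exit code reports pass or fail, and the testing client runs it automatically and reports the result. The computeMean function, the baseline value, and the tolerance below are hypothetical stand-ins rather than actual Kitware code.

#include <cmath>
#include <cstdlib>
#include <vector>

// Hypothetical routine under test; stands in for any numerical algorithm.
double computeMean(const std::vector<double>& values)
{
  if (values.empty())
  {
    return 0.0;
  }
  double sum = 0.0;
  for (double v : values)
  {
    sum += v;
  }
  return sum / static_cast<double>(values.size());
}

int main()
{
  // Compare against a known baseline; a regression shows up as a failing
  // test on the dashboard soon after the offending change is committed.
  const std::vector<double> data = {1.0, 2.0, 3.0, 4.0};
  const double expected = 2.5;
  const double tolerance = 1.0e-12;

  const bool pass = std::fabs(computeMean(data) - expected) < tolerance;

  // The testing client interprets a zero exit code as "pass" and anything
  // else as "fail".
  return pass ? EXIT_SUCCESS : EXIT_FAILURE;
}

In a CMake-based project, such an executable would typically be registered with add_test() and exercised by the ctest client, which can submit its results to a CDash dashboard for the whole community to see.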

I hope you noticed that I snuck the word "organically" into an earlier paragraph, since it is key to the point I am trying to make. Many of those contributing to the creation of this software process have roots in the computational sciences, or have computer science training with a strong desire to deploy useful tools in the scientific computing community. This is inherently a world in which collaboration is as natural as breathing: computer scientists working with domain experts to exchange ideas and implement powerful computational tools.

This world is also characterized by rapid technological change; hence the roots of this process are openness (to foster collaboration), agility, and responsiveness to advances in technology. Thus, the software process that grew organically from the creation of CMake, VTK, ITK, and many other open-source tools is a microcosm of the larger scientific computing world. I believe that we are not alone: the animating spirit of collaboration and agility is a hallmark of most open-source projects, which is why open-source processes are superior for scientific computing.

You may be wondering how "quality" earned its place alongside agility and collaboration in this mantra. Partly it is because, as developers, we want to create code that the community can depend on and that we can be proud of. Part of it too, especially in the early years of the open-source movement, was to counter the belief that open-source software was somehow inferior or amateurish.

Now we know that as long as we as a community abide by our very rigorous software process, we will create outstanding software systems. However, I believe the major reason that quality is so important in the open-source world is that collaboration and innovation/agility require a firm foundation on which to grow. Only with a disciplined process can robust growth be assured.

I believe Kitware’s secret weapon is its powerful, low-overhead, and quality-inducing software process. While there are a lot of individuals and organizations producing outstanding algorithms and systems, we too often find that external code is not cross-platform, breaks easily, and is inflexible and unstable in response to new data, parameter settings, and computing platforms.

At Kitware, we do not necessarily claim that we are better programmers and therefore avoid these problems (though we are certainly among the best :-)); rather, we have a better software process that helps us identify and fix these problems faster. As a result, our toolkits and applications are known for their stability, robustness, and flexibility, which is why thousands of users and customers build their own applications based on toolkits such as VTK and ITK.

While I enjoy extolling the virtues of software process and could easily go on for several more pages on this topic, it’s important to get back to the point, namely that an agile, collaborative, quality-software process is critical to scientific computing.

Technology is moving so rapidly that users and developers need to be able to respond as a community to new developments, refactor code, and fix software issues to keep up with relentless technological change; and while doing so, they must have confidence that the technology they are developing is of high quality. Waiting for a proprietary code base to respond to changes and issues is not tenable for most organizations.

Scalability

I think we can all agree that the technological world is growing: data is getting bigger; computing systems are becoming more complex; research teams are growing in size and scope; and technical solutions require integrating multiple technologies. As a result, keeping up and making use of current technology is becoming more difficult, and even the biggest organizations are challenged by the need to maintain staff and resources.

The failure to address this scalability challenge shows up in many surprising and insidious ways. For example, as data becomes bigger, it is natural to address computational problems by using parallel computing; i.e., using multiple computing processes to operate on data in common memory. However, users typically discover that the initial benefits of shared-memory parallel computing rapidly disappear due to scalability issues. In this case, the very simple operation of writing a number to a specific location in computer memory becomes problematic as the number of computing processes grows large and traffic jams result from thousands of simultaneous write requests.
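As a rough illustration (my own sketch, not drawn from any particular toolkit), the following C++ fragment contrasts the two situations: every thread hammering a single shared counter, which forces the hardware to serialize the writes, versus each thread accumulating privately and combining the partial results once at the end.

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

int main()
{
  const unsigned numThreads = std::max(2u, std::thread::hardware_concurrency());
  const std::uint64_t perThread = 1000000;

  // Contended approach: every thread writes to the same memory location,
  // so the hardware must serialize the updates.
  std::atomic<std::uint64_t> shared(0);

  // Scalable approach: each thread accumulates privately and the partial
  // results are combined once at the end.
  std::vector<std::uint64_t> partial(numThreads, 0);

  std::vector<std::thread> workers;
  for (unsigned t = 0; t < numThreads; ++t)
  {
    workers.emplace_back([&, t]() {
      std::uint64_t local = 0;
      for (std::uint64_t i = 0; i < perThread; ++i)
      {
        shared.fetch_add(1, std::memory_order_relaxed); // contended write
        ++local;                                        // private accumulation
      }
      partial[t] = local; // one write per thread, no sustained contention
    });
  }
  for (std::thread& w : workers)
  {
    w.join();
  }

  std::uint64_t combined = 0;
  for (std::uint64_t p : partial)
  {
    combined += p;
  }

  std::cout << "shared = " << shared.load()
            << ", combined = " << combined << std::endl;
  return 0;
}

Both totals are identical, but on a machine with many cores the updates to the single shared counter tend to become the bottleneck, while the per-thread accumulation tends to scale far better as processors are added.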

Scalability issues also show up in many other less obvious, but equally challenging and interesting ways. For example, as system complexity grows, how do you develop software? Test and debug it? Create and implement intricate algorithmic solutions? Manage a large software community? If you are licensing software, how do you deal with possibly tens-of-thousands of computers scattered across an organization, many with multiple processors and frequent upgrades? While we are far from answering these and dozens more scalability questions, it does appear that open-source approaches do offer some advantages.

For example, as many have argued before me, open-source software processes scale better than proprietary models when it comes to developing and testing software. Eric Raymond famously stated in his book The Cathedral and the Bazaar that “open-source peer review is the only scalable method for achieving high reliability and quality”. The book Wikinomics argues persuasively that open-source approaches successfully pull together teams from disparate organizations and with widely ranging talents to solve difficult problems.

I particularly love the recent example of collaborative mathematics, as described in the October 2009 issue of Nature. In this project, Timothy Gowers of Cambridge ran an experiment in collaboration by describing the Polymath Project on his blog. While the goal was to solve a problem in mathematics, Dr. Gowers wanted to attack the problem using collaborative techniques inspired by open-source software development (Linux) and Wikipedia. Surprisingly, the problem was solved within six weeks, with over 800 comments and 170,000 words contributed to the project wiki page by participants ranging from high school teachers to mathematics luminaries.

To quote the article referenced above, the project was successful on many fronts:
“For the first time one can see on full display a complete account of how a serious mathematical result was discovered. It shows vividly how ideas grow, change, improve and are discarded, and how advances in understanding may come not in a single giant leap, but through the aggregation and refinement of many smaller insights. It shows the persistence required to solve a difficult problem, often in the face of considerable uncertainty, and how even the best mathematicians can make basic mistakes and pursue many failed ideas.”

Gathering a brain trust like this is nearly impossible in the usual hierarchical organization, and I believe open-source approaches are far more capable of solving difficult technology problems. I think the future of scientific computing is to learn how to grow, manage, and coordinate large, ad hoc communities (i.e., address the scalability problem). This will challenge many of us, whether we are trying to coordinate international teams of academic and industrial researchers (e.g., a National Center of Biomedical Computing such as NA-MIC), or businesses that must learn how to assemble, manage, and motivate disparate communities to provide effective technology solutions.

Business Models

It’s hard for me to believe that we are still talking about open source business models, particularly since Kitware has been profitable every year since its founding in 1998, and continues to grow at a 30% clip. But I get enough perplexed looks and questions that I can see people still don't understand how we make a living. It’s simple really: Kitware provides services and solutions.

Services may include the usual suspects such as documentation, training, and support, but we also serve as co-creators of advanced technology, frequently taking the role of software engineering lead and collaborating with our customers and partners to develop advanced research solutions or leading-edge technological applications. A very common business arrangement is for us to team with our customers--and here I am talking about big (and little) players in the pharma, medical device, simulation, and oil & gas businesses--or with academic research organizations to tackle large projects that may run for years and cost millions of dollars to execute.

Having made a successful living for almost fourteen years now, albeit (in the early years) with a chip on my shoulder while trying to answer the embarrassing question “what is Kitware’s business model?”, I have become a true believer in the Way of the Source, in particular in the way it benefits both business and the public good. Beginning with authentic technologies based on Open Science, it's easy to sell our advanced technical solutions combined with a low-cost business model built around our scalable, agile, and flexible software development environment, which supports inter-organizational collaboration.

Our customers know that our technology is real because they can evaluate it; they recognize the importance of a quality-inducing software process; and they understand the importance of being able to collaborate in an iterative, community fashion to deliver solutions that actually meet their needs.

One curious business practice that our customers have pointed out to me recently is that many of the commercial simulation businesses have gained semi-monopoly positions and are actively ratcheting up licensing costs to reflect their market power. Further, these vendors often do not provide licensing relief as the number of CPUs on computing hardware increases, resulting in burdensome software maintenance costs.

Hence, many customers have told me, confidentially of course, that they are livid over vendor abuses and are actively seeking alternatives such as open-source solutions. What we typically see is that companies begin with open source as a cost-saving measure, and once they gain experience with the collaborative, agile, and rapid technology development process that open source engenders, they realize, to their delight I might add, that what began as a cost-saving measure turns into an engine of rapid innovation and process improvement.

There is an important psychology to the business model at play here as well. In the purchase-license-based business model, companies pay up front for software solutions. Of course the vendors develop these solutions in a generic way to address a broad market; therefore the software is never optimized for a company’s workflow, and generally requires significant additional resources to integrate. Further, vendors respond slowly to bugs and feature requests, as they must balance them against the demands of the broader market.

In this environment, the whole process of paying excessive licensing fees feels like a profound betrayal to many of our customers: they pay up front to buy access to a technology; pay more to customize it (customizations which the vendor owns); and then pay yet again as licensing fees ratchet up due to monopolistic practices or CPU additions. As a last straw, these same customers, who were often instrumental in the success of the vendor, end up trapped in a proprietary cage that is too expensive to leave. Ouch!

In comparison, an equivalent open source solution may cost as much (we pay Kitware staff well), but when the dust settles the customer ends up owning what it paid for, has shouldered the financial and technical integration burden over time, and has directed most of the expended funds towards integrating the technology into a customized workflow that yields a superior end product.

With the right licensing model (avoid GPL and reciprocal licenses), a company can choose to hold back proprietary features, or open-source selected technology so that the maintenance burden is taken up by the open source community. Not only this, but these companies realize that they can change vendors, provide support themselves, and/or find alternative solutions for future development and maintenance—they are no longer trapped by decisions made by an outside vendor.

Another important psychological aspect is the freedom that companies realize when they deploy their customized, open source-based technology solutions across their enterprise--the whole onerous process of negotiating IP is gone, and as a result open-source software solutions have much wider impact and can be disseminated much faster.

The Future of Open Source

I believe that within the next decade or two there will be an open-source alternative to every proprietary software product, especially those aimed at scientific computing. Eventually the scalable nature of open-source development will swamp anything any single vendor can do. Those that do not start collaborating with their community and customers, including by providing open-source solutions, will likely see severe business impacts, even to the point of going out of business.

I’m sure that there are some who are skeptical of this claim. However, it seems clear to me that the needs of 1) open science, 2) authenticity, 3) agile, high-quality software processes, 4) scalability, and 5) collaborative, customer-friendly business models will lead the way. In the end, I am confident that open source will rule scientific computing.