Universal Language: The Pistoia Alliance Takes on Indescribable Biology

The Pistoia Alliance, founded after a meeting between members of Pfizer, AstraZeneca, Novartis and GlaxoSmithKline, has come to resemble a United Nations of the life sciences industry. Now in its fifth year, the Alliance’s membership has grown to include nearly all the largest pharma companies (Eli Lilly is the only holdout in the top ten) plus a huge assortment of publishers, IT vendors, small biotechs and academic groups. It makes for a complicated network of business partners and competitors, but they do have some basic needs in common. In particular, the Pistoia Alliance exists to build IT architectures that serve the precompetitive stages of research and development.

“The key to the Pistoia Alliance is that, as time has gone by, most companies have figured out that you can’t go it alone,” says Sergio Rotstein, the Director of Research Business Technologies at Pfizer and a member of the Alliance’s board of directors. “Even the tightest of companies has opened up its walls quite a bit to collaboration… The idea of me asking my buddy from Merck, how did you solve that problem, and by the way would you mind giving me the solution — ten years ago, that would have gotten me laughed out of the room.”

The Pistoia Alliance has previously sponsored new methods for querying databases and the scientific literature, and a more effective algorithm for compressing and sharing genetic sequencing data. Over the past year, another Pistoia project, HELM, has entered the public domain after gradual development by an assortment of Alliance members. An open source language and set of editing tools for working with large biomolecules, HELM has already become a foundational part of research in at least three large pharmaceutical companies.

At the Bio-IT World Best Practices Awards this April, the HELM project won the Pistoia Alliance a top prize in the category of Informatics. These awards recognize advances in information technology and good management strategies at all levels of the biomedical industry. While the Best Practices Awards always seek to highlight programs that could be widely replicated, Bio-IT World rarely has the opportunity to single out a project that has been adopted so quickly across so many organizations as the Pistoia Alliance’s efforts around HELM.

A Loss for Words

HELM addresses a problem at the root level of drug discovery. Pharmaceutical and biotech companies are looking at increasingly complex molecules in the search for new therapeutics, testing out RNA- and peptide-based compounds that tap directly into cellular pathways. The trouble is that these large molecules, which are often hybrids of RNA, amino acids and other chemical structures, are difficult to concisely describe, even when their structures are perfectly known. They are too large and ungainly to represent atom-by-atom, but not uniform enough to be reduced to nucleotides and peptide chains.

“There have been a number of ways to represent small molecules,” says Rotstein. “That’s been the bread and butter of a number of companies for a long time, and that’s the realm of cheminformatics. And there’s been a lot of methodology for dealing with sequence-based entities, like genes and proteins, which is the realm of bioinformatics. The issue is that the types of molecules that we are targeting fall in between these two.”

This isn’t just a semantic issue; not having a standard language for biomolecules has practical consequences. It’s hard to register these molecules in databases, and even harder to conduct searches for them or share their structures with collaborators. The problem has recently come to a head, as growing knowledge of interlocking cellular systems has led researchers to therapies that increasingly resemble the body’s own tangled biology. “It follows the natural progression of science itself,” says Rotstein. “The application of peptides with unnatural amino acids, and the area of antibody-drug conjugates, has been growing a great deal over the past few years. A lot of the companies that traditionally worked in the small molecule space, nowadays are looking for a diverse portfolio.”

In 2008, Rotstein was part of an oligonucleotide unit at Pfizer that set out to build a new language to describe the compounds it was working with. The language would be similar to the small molecule notation SMILES (the Simplified Molecular-Input Line-Entry System), which renders a chemical structure as a continuous string of characters, while using symbols from the ASCII alphabet to resolve properties like where bonds occur and how molecules branch. Instead of using atoms as the smallest units in the chain, however, much larger groups — monomers like nucleotides and amino acids — would receive short, unique IDs that could be strung together into polymers. The amino acid cysteine, for instance, could be represented simply as “C.” New monomers would be registered with new IDs in a central database, and every ID would be linked to a complete description in small molecule notation.
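The idea can be sketched in a few lines of Python. This is only an illustration of the principle described above, not the real HELM syntax or toolkit: the dictionary `MONOMER_SMILES`, the function names, and the choice of a period as the separator between IDs are all assumptions made for the example.

```python
# Hypothetical monomer registry: each short ID maps to a complete
# small-molecule description written in SMILES.
MONOMER_SMILES = {
    "G": "NCC(=O)O",           # glycine
    "A": "N[C@@H](C)C(=O)O",   # L-alanine
    "C": "N[C@@H](CS)C(=O)O",  # L-cysteine
}


def register_monomer(monomer_id, smiles):
    """Add a new monomer under a new ID; refuse to overwrite an existing ID."""
    if monomer_id in MONOMER_SMILES:
        raise ValueError(f"monomer ID already in use: {monomer_id}")
    MONOMER_SMILES[monomer_id] = smiles


def to_notation(monomer_ids):
    """String registered monomer IDs together into one polymer string."""
    for monomer_id in monomer_ids:
        if monomer_id not in MONOMER_SMILES:
            raise KeyError(f"unregistered monomer ID: {monomer_id}")
    # Assumption: IDs are joined with a period for this illustration.
    return ".".join(monomer_ids)


def expand(notation):
    """Resolve every ID in a polymer string back to its SMILES description."""
    return [(mid, MONOMER_SMILES[mid]) for mid in notation.split(".")]
```

With these definitions, `to_notation(["A", "C", "G"])` yields a three-letter polymer string, and `expand` recovers the atom-level description of each unit from the registry, which is what lets a compact string stand in for a very large structure.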

HELM notation

A complex oligonucleotide peptide conjugate, featuring amino acids, RNA, and other chemical structures. The molecule is rendered both as a monomer graph and in HELM notation. Reproduced from the Journal of Chemical Information and Modeling with permission of the author.

The language was called HELM, the Hierarchical Editing Language for Macromolecules: “hierarchical” because strings of monomers are built into simple polymers, which in turn are joined into complex polymers. HELM was easy to use and unambiguous, and was soon adopted in many more departments at Pfizer. For the first time, it was possible to quickly enter a new macromolecule in Pfizer’s registry, check for uniqueness, and receive a corporate ID to take the project forward.
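A registration workflow of this kind might look roughly like the following Python sketch. Pfizer's actual registry is not described in detail here, so the class name `MacromoleculeRegistry`, the ID format, and the uniqueness rule (an exact match on the notation string after whitespace is removed) are all hypothetical; a production system would need to canonicalize equivalent notations before comparing them.

```python
class MacromoleculeRegistry:
    """Toy registry: one corporate ID per distinct notation string."""

    def __init__(self, prefix="MOL"):
        self._prefix = prefix
        self._ids = {}  # normalized notation -> corporate ID

    def register(self, notation):
        """Return (corporate_id, is_new) for the given notation string."""
        # Assumption: two entries are "the same molecule" when their
        # strings match exactly once all whitespace is stripped.
        key = "".join(notation.split())
        if key in self._ids:
            return self._ids[key], False
        new_id = f"{self._prefix}-{len(self._ids) + 1:06d}"
        self._ids[key] = new_id
        return new_id, True
```

Registering the same string twice returns the original ID with `is_new` set to `False`, which is the uniqueness check in miniature.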

A Living Language

At the same time that Rotstein’s team was developing HELM at Pfizer, other pharma companies and informatics vendors were struggling with the same problem. The software provider Accelrys (now BIOVIA), for instance, had modified the Molfile chemical table format to deal with hybrid macromolecules, in a system the company called the Self-Contained Sequence Representation (SCSR). There was a danger of proliferating standards, which would not only create redundant work at each company writing its own language, but also threaten the ability of these organizations to share information with each other.

Meanwhile, a member survey at the Pistoia Alliance flagged the representation of complex biomolecules as one of the industry’s top three non-competitive problems. Since Pfizer had already published a paper on HELM and built a software toolkit around the system, the company volunteered to make the entire program open source and continue its development with other members of the Alliance.

“We saw an opportunity for Pfizer,” says Rotstein. “If this did indeed become a standard, and the open source tools continued to evolve through contributions of the whole community, that would help us too.” All told, 24 companies sent volunteers to work on HELM, untangling the code from Pfizer’s internal systems, making it public, and extending the tools that serve the language.

The entire HELM project is now available on GitHub, and uses the permissive MIT open source license, which gives anyone the right to download and modify the code without requiring any contribution back to the project. That should encourage vendors to build commercial software on top of HELM, helping to foster compatibility across the industry.

The basic HELM toolkit includes search functions and uniqueness checks, as well as the HELM Editor, a platform for drawing chemical structures. The HELM Editor lets users plug in or draw monomers, then move up the scale to polymers made from those building blocks. It can be used simply as a translation tool, taking existing structures and giving them names in HELM notation, but Rotstein says it would also be a preferred platform for making new molecules from scratch.

HELM Editor

A screenshot from the HELM Editor, showing an siRNA molecule under construction. Image credit: Pistoia Alliance

Since HELM was released to the public last year, development has continued at various partner organizations. Roche was one of the first adopters, and has been relentlessly adding functionality to the toolkit. “Roche created a custom antibody-drawing capability on top of the HELM Editor, and it’s truly phenomenal,” says Rotstein. “They are now putting the finishing touches on that, and as soon as they’re done, they are pushing it right back out into the open source.”

He adds that Pfizer plans to start using Roche’s antibody drawing tool itself. “That tool alone will probably return our entire investment on externalizing HELM.”

This month, the Pistoia Alliance released Exchangeable HELM, another big push for interoperability. While some basic monomers, like the natural amino acids and nucleotides, have universal IDs in HELM, most monomer IDs are unique to each user, stored in an internal database. That’s a necessary feature to make HELM flexible to the needs of every user, but it means that most molecules only make sense in the context of the databases against which they were designed.

Exchangeable HELM provides a file format that includes both the larger HELM sequence of a macromolecule, and separately, the chemical structure of each monomer inside it. That makes it easy for collaborators — say, a large pharma company and a CRO hired for a specific project — to send molecular structures back and forth. Exchangeable HELM also offers a tool to “translate” between databases, if two organizations have different internal IDs for the same monomer.
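The principle can be illustrated with a short Python sketch. It does not reproduce the real Exchangeable HELM file layout or tooling: the dictionary structure, the function names `bundle` and `translate`, the period-separated notation, and the rule of matching monomers by identical SMILES strings are assumptions chosen to keep the example small.

```python
def bundle(notation, sender_monomers):
    """Package a sequence together with the structure of each monomer it uses."""
    ids = notation.split(".")
    return {
        "notation": notation,
        "monomers": {mid: sender_monomers[mid] for mid in ids},
    }


def translate(package, receiver_monomers):
    """Rewrite the sequence using the receiver's own IDs for the same structures."""
    # Assumption: two monomers are the same if their SMILES strings are
    # identical; a real tool would compare canonicalized structures.
    id_by_structure = {smiles: mid for mid, smiles in receiver_monomers.items()}
    translated = []
    for mid in package["notation"].split("."):
        structure = package["monomers"][mid]
        if structure not in id_by_structure:
            raise KeyError(f"no local monomer matches sender ID {mid}")
        translated.append(id_by_structure[structure])
    return ".".join(translated)
```

Because the package carries every monomer's structure alongside the sequence, the receiving organization never needs access to the sender's internal database to interpret it.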

The Lingua Franca

So far, Pfizer, Roche, and Lundbeck are the largest drug companies to switch their systems over to HELM, and Rotstein says a “robust pipeline of other companies” is preparing to adopt the language. Meanwhile, vendors that serve the drug industry are preparing for a widespread change. NextMove Software and ChemAxon are both working in HELM, and even BIOVIA, which plans to continue using SCSR internally, has made its systems compatible with HELM to more easily share large molecules with clients and partners.

The adoption of HELM will be buoyed by public resources in the life sciences that are turning to the language as the obvious choice for representing complex molecules. One big supporter is the European Bioinformatics Institute, whose ubiquitous ChEMBL database of chemical compounds will include HELM notations in its next release.

Increasingly, says Rotstein, the Pistoia Alliance is speaking of a HELM ecosystem. “We want to have content providers that have structures in HELM format. We want vendors whose software can read and write HELM. We want companies that use HELM as their standard, we want CROs that can use HELM to exchange information with those companies, and next on our list are downstream things like scientific journals and regulatory agencies.” Large publishers and regulators would be especially important adopters, because they are such frequent and public ports of call for companies sending macromolecular structures outside their walls. If the FDA or Nature Publishing Group began accepting HELM structures, it could be a major convenience when applying for clinical trials and publications. “It would be much easier to just send a file that says, ‘here’s exactly what my structure is,’” says Rotstein, “rather than having to verbally explain the structure.”

Having HELM in place as a widely shared language could also benefit other Pistoia Alliance projects. For example, the Controlled Substance Compliance Services Project is currently building a database of compounds that are regulated or restricted in various countries around the world, so companies can quickly refer to the local legislation affecting compounds they want to work with. If large biomolecules are subject to regulations, HELM would be a convenient way to make those policies searchable.

Like other Pistoia Alliance initiatives, HELM is designed to run smoothly in the background. Defining the structure of macromolecules, and manipulating them in a standard format, is not a process that should offer any company an edge in drug discovery, but a basic feature at the foundation of the life sciences. In an ideal world, says Rotstein, “this should be a non-issue. The ability to represent these molecules, and get them in and out of our system so we can store them, search them, and run calculations on them, should be trivial.”

Universal Language: The Pistoia Alliance Takes on Indescribable Biology was written by Aaron Krol and originally published in Bio-IT World. It is being republished by Open Health News with permission. The original copy of the article can be found here.