February 19, 2001
Volume 79, Number 8
CENEAR 79 8 pp.26-45
ISSN 0009-2347

Making sense of information mined from the human genome is a massive undertaking for the fledgling industry


ENTERPRISEWIDE Software developers at InforMax are focusing on large-scale systems rather than desktop applications to search through genes, build proteins from the genes, and cascade them together.
Welcome to the postgenomic era. After years of sequencing, scientists have uncovered, with the help of bioinformatics, the millions of pieces of information behind the 23 pairs of human chromosomes. A wealth of genetic information is now becoming available to investigators who will mine the data for drug leads. But first, scientists need to put the genes in context: When are they turned on and off? What do they do? Which proteins do they correspond to?

As Dan Levine, director of business development for bioinformatics company Xpogen, says, "Our goal is to make sense of the Human Genome Project. We want to bring things to the next stage, finding out not just what the human genome is, but what it does." Without bioinformatics, he adds, it would be impossible to find patterns in the vast sea of data that is being generated.

The availability of the genome sequence is just the beginning, says Arnold Hagler, president and chief executive officer of Structural Proteomics, a bioinformatics subsidiary of Discovery Partners International. Scientists now want to understand genes, their function, and the role they play in the prevention, diagnosis, and treatment of disease. Ultimately, Hagler says, the goal is to identify patterns in this information that can be used to develop more effective therapeutics--drugs that work more quickly, are safer, are less toxic, and have better bioavailability.

THE POSTGENOMIC ERA additionally means that scientists will be moving more toward the study of proteins, a more complex task. Proteomics--the study of proteins and their functions and interactions--is a new field with myriad variables to be accounted for, as scientists convert gene sequence data into protein structures and investigate the genetic variations between individuals, says Anthony R. Kerlavage, senior director of product strategy at Celera Genomics. Furthermore, scientists do not yet know the rules for figuring out the relationship between gene sequence and protein function.

The body contains more than 1 million proteins, which regulate the structure of cells and tissues and trigger or deactivate genetically linked diseases such as Alzheimer's, cancer, and diabetes. Knowledge of protein structures is key to understanding these diseases. Proteins, not genes, are the endpoints of life sciences investigations, since proteins ultimately regulate metabolism and disease in the body. Drugmakers believe that an understanding of proteins will lead to new therapies that will revolutionize the way disease is diagnosed and treated.

The pharmaceutical research and consulting company Decision Resources anticipates that vast amounts of data will be produced over the next four years both in genomics and in proteomics. Celera, part of Applera, the former PE Corp., is perhaps the best known company participating in the cataloging of genomes; it spent its first two-and-one-half years cataloging human genes, and now is moving from the sequence to the function of genes to learn how they work together. There will be growth in database management as the amount of data to organize explodes, and high-throughput screening will increase as researchers search for proteomics discoveries, which will require new instruments and analyses.

According to Arthur T. Sands, president and CEO of drug discovery company Lexicon Genetics, the greatest potential growth area for companies like his is in "the seminal discoveries of therapeutic proteins and validated drug targets. The greatest bottleneck is not gene sequencing, as it was until last summer, but the discovery of gene function."

Given that there are roughly 30,000 genes in the genome versus about a million proteins circulating in the human body, the study of proteins is an order of magnitude more complex than the investigation of the genome, says Steve Gardner, vice president and chief technical officer of Viaken, an application service provider for the life sciences industry. "If dealing with the genome is a terabyte problem, dealing with the proteome will be a tens-of-terabytes problem," Gardner says. There are huge variations in the protein profiles of individuals, much more so than the variation in the human genome.

The problem of the genome is by no means only a computing problem; it also creates a nightmare for the "wet lab," where experimentation takes place. "Although great progress is being made by instrumentation makers, it will be some time before we can be as accurate and reproducible with protein expression as we can be with genetic expression," Gardner says.

The drug discovery industry depends on bioinformatics companies to help it filter the huge number of genes, not all of which are associated with disease and most of which would not be good drug targets, says Andrew DePristo, vice president of bioinformatics, information technology, and emerging technologies at drug discovery company Genome Therapeutics. There is a need to prioritize candidates to decide which to bring into screening trials. Companies like Genome Therapeutics rely on a combination of software developed in-house and purchased to help them with this task.

WHAT IS NEEDED is a common language in which disparate pieces of data are expressed and a set of computational tools that will help interpret these data. Bioinformatics aims at combining cutting-edge molecular biology with supercomputing to store, retrieve, process, analyze, and simulate biological information--that is, to provide the tools necessary to turn masses of raw information into knowledge.
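At its simplest, turning raw biological information into knowledge starts with parsing sequence text into structured records. A minimal sketch of that first step, using the standard FASTA sequence format (the function name and the example sequences here are invented for illustration, not taken from any vendor's software):

```python
# Sketch: parse FASTA-formatted sequence text (a standard plain-text
# format for DNA and protein sequences) into structured records.
def parse_fasta(text):
    """Return a dict mapping each sequence ID to its full sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):               # header line starts a new record
            current_id = line[1:].split()[0]   # ID is the first word after ">"
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)   # sequence may span many lines
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}

raw = """>geneA hypothetical example
ATGGCGTAA
>geneB
ATGTTTGGC
TAA"""
sequences = parse_fasta(raw)
```

Once sequences live in a structure like this rather than free text, the downstream storage, retrieval, and analysis tools the industry is building have something common to operate on.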

This wealth of information presents a challenge to pharmaceutical and biotechnology companies: Genomics gives many more targets than they can deal with. Currently, the targets for pharmaceuticals number in the hundreds, but with the data from the Human Genome Project, there will be up to 10,000 new targets, according to Scott Hutton, president of ChemNavigator.com, an online source of druglike chemical compounds.

TRACKING SYSTEM Tripos' ChemInfo system monitors plates of micronic tubes containing compounds for the high-throughput screening process.
The advantage of this abundance of data is that better targeted drug treatments will be possible. Small differences in genotype, or the genetic makeup of individuals, result in different phenotypes, or physical characteristics, with the consequence that drugs may help some people but end up harming others. With knowledge of how different genotypes affect the function of drugs, treatment regimes can potentially be customized based on genetic information associated with a specific patient.

The downside is that the weight of data is overwhelming to the individual researcher, who often does not know how to maximize its usefulness. But many bioinformatics companies are ready to lend a helping hand. The issue is which of these companies will survive consolidation in the industry as it begins to choose standardized tools and software. Kal Ramnarayan, vice president and chief scientific officer of Structural Bioinformatics Inc. (SBI), hasn't observed much consolidation yet since most companies still have enough capital to keep growing independently.

As the pharmaceutical industry seeks to maintain growth through development of new drugs, bioinformatics companies will play a larger and larger role. Indeed, bioinformatics is now considered essential for drug discovery. At the same time, many bioinformatics companies are no longer "pure plays": As they expand to fill the needs of the pharmaceutical and biotech industries, they are beginning to become hybrids with both pharmaceutical and bioinformatics elements.

According to Bill Ladd, director of bioinformatics at software producer Spotfire, bioinformatics developed from the need for computer-aided biological research. It got off the ground in the late 1980s as sequence analysis data required mathematical interpretation. Since then it has grown to include all of molecular biology.

Recent research by London-based consulting firm Silico Research indicates the market for bioinformatics software and services is growing at 17% annually, and it is expected to reach $110 million by 2004. Other estimates suggest the market is even bigger. Research and consulting firm Frost & Sullivan, for example, has predicted sales of $160 million in 2000, with the potential to grow to as much as $5 billion in the next five years. In comparison, the global pharmaceutical industry is worth more than $150 billion per year.

CHALLENGES ARE everywhere for the bioinformatics industry. It must be able to deal with increasingly complex data and to integrate data sources into a single system, says Felicia M. Gentile, president of consulting firm BioInsights.

Martin D. Leach, director of bioinformatics at drug discovery company CuraGen, agrees that the amount of data is staggering. For example, microarrays now allow for the generation of data points for thousands of genes instead of the single genes that were investigated in traditional experiments. This flood of data has resulted in a glut of candidates to develop into drugs. As a result, pharmaceutical companies must learn portfolio management to determine which drug candidates to pursue.

Diverse types of data must be handled simultaneously to provide a better understanding of what genes do. New analysis strategies, Leach says, are required to determine patterns in information. Once the information is in an organized state, bioinformaticists can apply data mining strategies from other fields--those used by spy satellites and astronomy, where there is a need to pick up weak signals.
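One elementary version of the pattern detection Leach describes is scoring co-expression: once expression data are organized as per-gene profiles across conditions, genes whose profiles rise and fall together can be flagged with a correlation measure. A sketch under that assumption (the gene names and expression values are hypothetical, and this is not any company's actual pipeline):

```python
# Sketch: find genes whose expression profiles track a target gene,
# using Pearson correlation as a simple co-expression score.
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical expression levels measured across five conditions.
profiles = {
    "geneA": [1.0, 2.1, 3.0, 4.2, 5.1],
    "geneB": [0.9, 2.0, 3.1, 4.0, 5.0],   # tracks geneA closely
    "geneC": [5.0, 1.2, 4.8, 0.9, 3.3],   # unrelated pattern
}
target = profiles["geneA"]
correlated = [name for name, profile in profiles.items()
              if name != "geneA" and pearson(target, profile) > 0.9]
```

Real expression mining uses far more robust statistics, but the shape of the task is the same: organize first, then look for signals that stand out from the noise.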

According to Decision Resources, the bioinformatics industry has to meet three chief informational challenges if it is to make sense of the avalanche of data coming out of genomics and proteomics research: annotating data better; filtering, visualizing, and analyzing data using better algorithms; and integrating genomic and gene expression data more effectively. These challenges share the need for data organization, Ladd adds.

STANDARDIZATION IS KEY to future success in the genomics, proteomics, and gene expression fields, Genome Therapeutics' DePristo says. Annotation is the process of attaching comments and labels to data and making connections to related data. It's a knotty problem because there has been no coordinated annotation system as databases have developed.

Teresa W. Ayers, CEO of software company Genomica, says, "Our experience is that pharmaceutical customers are storing their valuable and complex biological and genetic data in multiple databases in various locations. They have created islands of data that are not correlated." Genomica's products work from an Oracle platform to enable data integration and high-security data sharing.

In genomics, there is a need to specify how assays were run, how data were analyzed, and so forth, and to attach this information to the data. Information on a given gene must be maintained through subsequent experiments, and literature results must be available to deal with ever more complex questions.

As it stands, new data can't be correlated to old results because these "legacy data" were collected under standards that were different or unknown. Archived annotation for a particular gene is often out of date, inconsistent, and unstructured. Thus companies end up with a lot of data obtained at great cost that cannot be used beyond the initial purpose for which it was collected.

And the problem will only continue because annotation cannot be used to discover underlying relationships in data unless it is continually updated and refined. Proteome, an annotation company recently acquired by Incyte Genomics, has found that annotation in public databases is wrong or misleading up to 25% of the time. Proteome is an example of a company trying to improve this situation by manually reviewing data to create a reliable annotated database for the yeast genome.

Reclaiming these data will be a great growth area for the industry, notes Howard D. Goldstein, president and CEO of Entigen, the life sciences application service provider that resulted from the merger of eBioinformatics and data integration specialist Empatheon. If data are annotated properly, a researcher in the future can pull out and use information that was originally collected for an entirely different purpose.

When a large enough population of data has been collected, patterns and relationships can be determined, for example, between the genomes of people who have rheumatoid arthritis and those with a seemingly unrelated ailment such as migraine headaches. "No one throws away data anymore," Goldstein says. "It is not even a matter of it being commercially valuable. In the aggregate, all these data have value, even if they do not as individual pieces."

COMPANIES ARE AWARE of the annotation problem, but they vary in the degree of headway they are making in solving it. Genomics and bioinformatics firm DoubleTwist, for one, has addressed the situation with its proprietary Annotated Human Genome Database. It is "a regularly updated source of extensively annotated human genomic DNA sequences available for use on a customer's internal computer network through a product called Prophecy," according to Kyle Hart, the firm's director of bioinformatics. DoubleTwist's Prophecy ToolKit allows customers to integrate annotated human genome data with its proprietary sequence data in a common environment to visualize and mine the data.

As companies with a bioinformatics business consolidate, a related problem springs up: integrating the multiple databases that must communicate with each other. Each company has its own system of registering or numbering compounds--a system that must be made compatible with the system used by the other company. The lack of standards hurts companies that are attempting to merge data and annotations.

According to Goldstein, another major challenge for the bioinformatics industry is trying to anticipate what will be the next useful data sources and analysis tools. This situation is reflected in the wet labs, in which hands-on test tubes and spectrometers have given way to automated DNA sequencers.

Furthermore, as new technologies are created, different data types must be melded into software systems. "How to relate one type of data to the next is not as simple as recalling it from a database," Goldstein says. "You have to normalize the data content so the researchers can make comparisons from one technique or type of data to another. The challenge is to understand the new data types as they come along and react quickly so that researchers can maintain this integrated work space."
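One simple form of the normalization Goldstein describes is putting measurements from different instruments on a comparable scale. The sketch below uses z-score normalization (subtract the mean, divide by the standard deviation), one common choice among many; the platform names and readings are hypothetical:

```python
# Sketch: z-score normalization puts measurements from instruments
# with different raw scales onto one comparable scale.
from math import sqrt

def zscore(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    sd = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

# The same underlying signal reported by two hypothetical platforms
# whose raw units differ by a factor of 100.
platform_a = [120.0, 240.0, 360.0]
platform_b = [1.2, 2.4, 3.6]

normalized_a = zscore(platform_a)
normalized_b = zscore(platform_b)
```

After normalization the two series are directly comparable, which is what lets a researcher relate results from one technique or data type to another.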

The development of better algorithms or "killer applications" to analyze and visualize gene expression data and to integrate them with other information is a critical part of making such data more amenable to interpretation. NetGenics, for one, is developing a data analysis and mining application called Genotypic DataMart in a partnership with Genomics Collaborative. DataMart not only is a tool for accessing and analyzing annotated genotypic data in the search for promising drug targets, but it eventually will host the world's largest library of genotypic data, the company claims.

To coordinate its suite of bioinformatics applications, NetGenics has developed a computing framework that facilitates the drug discovery process by linking researchers, computing systems, and information enterprisewide. This framework has been adopted by pharmaceutical companies such as Abbott Laboratories. It also has been augmented by a link to Incyte Genomics' genomic database software and sequence data.

A RELATED CHALLENGE for the future is the need to integrate informatics with the R&D process. This integration has been going on for the past 10 years, Viaken's Gardner says, but it has really only gathered steam in the past few years. As this work proceeds, scientists are realizing that life sciences R&D is really an information science in which data are gathered through automation and a continual profusion of new technologies. Information from earlier stages in the R&D process must be at scientists' disposal so it can be brought to bear later in the decision-making process. The trend toward this integration "is absolutely in full flow and will continue to grow over time," Gardner says.

SBI's Ramnarayan states that one of the major challenges within bioinformatics is finding enough people who are proficient not just in computation or in biology but in both fields. A hybrid person is needed because a biologist can't be effective if he or she treats the algorithms as a black box, nor can a computational person be productive without biological knowledge.

David Smoller, senior vice president of marketing at Incyte, believes that information science expertise is the more important quality. To him, the ideal bioinformatician would have a degree in computer science with a background in biology.

SBI generates protein structural information and three-dimensional protein models. This information helps scientists to design drugs that bind to proteins--either initiating or blocking their activity--by analysis of the protein models and prediction of their interactions. A programmer cannot be very productive in performing these computational experiments without a biology background, Ramnarayan says.

Partnerships are an important part of the bioinformatics industry. SBI, for example, has a strategic partnership with IBM's Life Sciences business unit in information technology. IBM provides hardware and software that are expected to enhance SBI's ability to perform high-resolution protein modeling. IBM provides complementary expertise in managing the volume and complexity characteristic of proteomic data, notes Edward T. Maggio, chairman, president, and CEO of SBI. Through the IBM collaboration, SBI's protein structure databases are becoming more available on the Internet through a subscription service.

In another collaboration, with genetic medical testing firm Quest Diagnostics, SBI has developed a series of database modules of structurally variant proteins, called Variome, for sale to pharmaceutical and biotechnology companies looking to understand the differences in interactions between a drug and the structural mutations of the intended drug target.

In addition to SBI, companies are forming partnerships to help accelerate bioinformatics. Drugmakers are collaborating with bioinformatics companies to develop drugs tailored to an individual's genetic makeup. Perhaps the largest partnership to date is the complex $100 million deal between Bayer and Lion Bioscience to store and analyze genomic data. In October, Lion and Tripos announced an expanded partnership to provide Bayer with an integrated cheminformatics technology platform to speed Bayer's identification of drug and agricultural chemical candidates.

In November, Celera and Lion announced a strategic alliance in which they will create improved software tools and extend Lion's SRS system to organize and analyze biological data such as that produced by Celera. In addition, Celera will market Lion's automated genome annotation, comparison, and expression analysis tools bioSCOUT, genomeSCOUT, and arraySCOUT.

Another notable pharma-bioinformatics partnership is the 1998 multi-million-dollar deal between Compugen and Pfizer. Compugen's president, Eli Mintz, says his company provided Pfizer with software that recognizes the fallacy of the old model in which one gene codes for one protein. Instead, the Compugen software allows for one gene to code for several messenger RNA molecules and hence several proteins.
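The one-gene/many-proteins idea can be made concrete with a toy model of alternative splicing, in which optional "cassette" exons are either kept or skipped when the messenger RNA is assembled. This sketch is purely illustrative and is not Compugen's algorithm; the exon sequences are made up:

```python
# Sketch: enumerate the transcripts a gene can produce when some of
# its exons are optional (a toy model of alternative splicing).
from itertools import product

def splice_variants(exons, optional):
    """Yield transcripts formed by keeping or skipping each optional exon."""
    choices = [(True, False) if i in optional else (True,)
               for i in range(len(exons))]
    for keep in product(*choices):
        yield "".join(exon for exon, kept in zip(exons, keep) if kept)

exons = ["ATG", "GCC", "TAA"]    # hypothetical exon sequences
variants = sorted(set(splice_variants(exons, optional={1})))
```

Even this three-exon toy yields two distinct transcripts; with dozens of exons, a single gene can account for many messenger RNAs and hence many proteins.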

Partnerships do not often make bioinformatics firms a lot of money, BioInsights' Gentile points out, but they do provide opportunities for these companies to provide niche tools. In addition to such partnerships and agreements, bioinformatics companies are collaborating with academic partners that experimentally validate genes that the bioinformatics software has identified.

 THE HISTORY of the bioinformatics industry is reflected in today's landscape of companies. According to Entigen's Goldstein, "The industry that calls itself bioinformatics has within it a whole series of companies that used to be called software companies, service or hosting companies, custom development houses." The field is fragmented because of the differing needs of researchers in the life sciences industry.

Because researchers had to solve particular problems well before there were software packages designed for their specific needs, biologists and chemists learned how to program for their own purposes, creating programs that were useful to their colleagues and that they distributed free or for remuneration. Those who did a particularly good job often formed their own software companies.

For example, Thomas Marr developed the framework for Genomica's initial product, Discovery Manager, while at Los Alamos National Laboratory and at Cold Spring Harbor Laboratory. Scientists working in the field tested this software before it was commercially released, providing feedback about its functionality and software design.

The roots of bioinformatics companies, then, are with researchers with problems who created solutions for themselves. As a result, some bioinformatics work takes place in separate companies, while some remains where the scientist is--either in a corporate, university, government, or private laboratory.

Drug discovery firms like Lexicon Genetics often include a bioinformatics component although their real business lies in other areas. Lexicon uses its own drug discovery software in its efforts to organize and compile genomic data, and it takes bioinformatics to the next step--the association of genes with medical functions, Sands says.

For example, Lexicon has invented a novel "gene-trapping" technology to identify the genes contained within the human genome more rapidly and cost-effectively than traditional approaches. Thousands of novel genes have been identified with this technology and are stored within Lexicon's Human Gene Trap database, and 50% of the gene sequences are not represented in public databases. Pure DNA sequence information, Sands says, is becoming a commodity. To create high value, it is necessary to use bioinformatics for drug discovery or diagnostic purposes.

There are several approaches to make data accessible to researchers and to integrate these data across the spectrum from genome to expression to clinical trials. Bioinformatics encompasses tasks that span from a single desktop computer to a cluster of linked computers to a Web-based approach.

THE INTERNET is used by some informatics companies as a consumer interface. But only when companies use the Internet as a true information highway, not as a mechanism for e-commerce, can the real potential of the Internet be exploited, Goldstein believes. "Mechanisms of action can be understood more quickly because the flow of data is instantaneous," he says.

Entigen's BioNavigator website offers a home for "computer-based biological research via the Internet." Goldstein says scientists shouldn't need to understand the underlying computer science unless they are computer scientists.

BioNavigator is a "middleware" provider--that is, it provides the software infrastructure that integrates data sources with the necessary software tools. It allows scientists to control their data and analyses "in real time and reproducibly." BioNavigator goes beyond middleware, Goldstein says, in that it gives biologists a desktop in which they can pick and choose software and databases without having to worry about formats.

This type of service, though found in very large applications such as banking, is not common in biological applications because of the need for domain knowledge, which is scientific knowledge that underlies the application. In creating this technology, Entigen was motivated by the desire to bring biological resources to the masses of scientists who don't have the funds to subscribe to all the databases available or to purchase the software, Goldstein says.

In 1999, Compugen launched what it calls the first Internet life sciences research engine, LabOnWeb.com. Using this platform, which brings Web-embedded knowledge into the traditional laboratory information flow, researchers can enhance productivity and speed drug discovery. LabOnWeb takes advantage of the Internet's salient feature--speeding and facilitating information transfer--to boost the value of laboratory experiments and to broaden access to both public and proprietary knowledge.

Another example of an Internet portal is CuraGen's Genescape, which is intended to provide a foundation for genome-based drug discovery and development. Genescape allows Internet access to functional genomics technologies such as CuraGen's GeneCalling, SeqCalling, and PathCalling software, which provide tools for gene discovery, gene sequencing, and genetic relationships in biopathways, respectively.

A SECOND NICHE in the bioinformatics market is in providing software packages for data visualization, interpretation, and analysis. Information technology-oriented companies such as Spotfire, MDL Information Systems, Genomica, and Silicon Genetics prepare custom software for a diverse group of customers. Customers and partners are the "content" providers.

A tool provider like Genomica does not supply data, but rather puts its systems into a customer's facility and customizes them. Genomica is working on a standardized tool for managing the discovery process. Given the need to combine new data with legacy data and move them into a single place, Genomica is moving to an Oracle platform to address the integration and consolidation issues in the wake of mergers and to allow collaboration between groups of scientists.

Spotfire markets its decision analytics software to speed the discovery, development, manufacture, and marketing of products. It provides an environment that allows people to interact with data and determine which experiment to do next. Spotfire's Ladd notes that few bioinformatics companies remain as pure-play software companies for very long. However, he thinks the vulnerability of the pure bioinformatics company does not apply to Spotfire, since it operates with a much larger base than just biotech and biological customers.

Many bioinformatics companies are expanding beyond their traditional role. Some are even beginning to identify drug candidates themselves. Unlike firms such as NetGenics, which markets only bioinformatics software and services, these large bioinformatics companies are also genomics companies, combining their expertise in bioinformatics with genomic capabilities, often in partnership with pharmaceutical or agrochemical companies. Such "bioinformatics-plus" companies include Compugen, CuraGen, Gene Logic, Lion Bioscience, and Rosetta Inpharmatics.

CuraGen, for one, uses an assembly-line process to generate information about diseased tissues; the output is a list of genes relating to that disease state, Leach says. CuraGen was founded by engineers who used a process engineering mind-set when designing the company, so samples are followed meticulously to ensure the scientist knows when and where each bit of data was generated.

About a quarter of CuraGen employees work in bioinformatics in the traditional sense, and the rest are in the lab synthesizing proteins, generating data, screening for compounds that aid or prevent a protein from functioning, and analyzing experimental results. Bioinformatics is needed to keep up with the millions of data points, to analyze experiments to determine compound toxicity, and to package data for submission to the Food & Drug Administration. CuraGen software developers, Leach says, interact with CuraGen scientists on a daily basis, so they are more in touch with research needs than are the software developers at many other pharmaceutical companies.

Other bioinformatics companies offer data architectures and expertise to large pharmaceutical clients. This group of companies includes both dedicated bioinformatics companies and information technology companies such as IBM and Motorola, both of which are jockeying for a piece of the biotech information technology pie. This past November, for example, Motorola announced that it was entering an agreement with Compugen to develop and manufacture DNA biochips using Compugen's LEADS platform for detecting and analyzing genomic and proteomic sequence data.

DRUG DISCOVERY Genome Therapeutics' computational resources allow extensive analysis of terabytes of genomic data.
ANOTHER APPROACH to the challenge of providing bioinformatics tools is exemplified by companies like Incyte Genomics, which have developed an integrated platform of genomic technologies, including genomic databases, partnership programs, data management software, and gene expression services.

Such firms--which also include CuraGen, Celera Genomics, and application service providers Entigen and Compugen--provide content as well as comprehensive bioinformatics platforms. This hybrid platform is designed to facilitate the task of researchers working in all phases of drug discovery and development. For Incyte, the first step is to develop and create the content, or the genetic information; the second step is to do the analysis, Smoller says.

Celera Genomics is a hybrid company, having sequenced the fruit fly, human, mouse, and dog genomes. It also has created a large database of single-nucleotide polymorphisms (SNPs), which are the genetic variations between individuals. Celera is augmenting its genetic data with more functional data that indicate, for example, when a gene is expressed and what genes are expressed in a given disease state. Expression levels of a gene change due to SNPs, resulting in "a mind-boggling set of data and experiments to be done," Celera's Kerlavage says.
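Computationally, a SNP is simply a single position where two aligned individual sequences differ, which makes the basic detection step easy to sketch (the sequences below are invented, and real SNP calling must first handle alignment and sequencing error):

```python
# Sketch: report single-base differences (SNPs) between two aligned
# sequences from different individuals.
def find_snps(seq_a, seq_b):
    """Return (position, base_a, base_b) for each position that differs."""
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(seq_a, seq_b))
            if a != b]

individual_1 = "ATGGCGTACGT"   # hypothetical aligned sequences
individual_2 = "ATGGCATACGT"
snps = find_snps(individual_1, individual_2)
```

The hard part, as Kerlavage notes, is not finding the differences but connecting them to changes in gene expression and disease, which multiplies the experiments to be done.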

In the future, Celera will be cataloging proteins and their functions, identifying disease-specific genes, and discovering gene targets for drugs, vaccines, and diagnostic tests. One of the firm's goals is to become the definitive source of genetic information. Yet in many ways it is a bioinformatics company: The bulk of its personnel works on bioinformatics or software, and much of its effort is in creating software to disseminate and analyze data.

However, not everyone considers Celera and its brethren to be true bioinformatics companies. According to Goldstein, companies that own data--the Incytes, the Celeras, the Compugens--are in the business of selling data. They are trying to get researchers to subscribe to their databases, so they are not motivated to integrate their particular data with others. Even though they do provide some software, they are not interested in providing a work space that ties together data from various data owners and tools from various software companies, as Entigen does.

Another example of a bioinformatics company with a broad range of offerings is pharmacogenomic service provider PPGx. The company markets bioinformatics software to explore the connections between SNPs and clinical outcomes. It helps its clients collect genotype data, warehouse the data, augment the data with information gleaned from the public literature, attach phenotype data from patients, apply statistical analysis, and look for correlations.
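The statistical step PPGx describes can be pictured with a toy association test. The genotype and response counts below are invented, and the simple chi-square calculation is only a stand-in for the statistical methods a pharmacogenomics firm would actually apply:

```python
# Hypothetical sketch: test whether a SNP genotype is associated with a
# clinical phenotype. Counts and the plain chi-square test are illustrative.

def chi_square(table):
    """Chi-square statistic for a 2x2 contingency table [[a, b], [c, d]],
    with genotype groups as rows and clinical outcomes as columns."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    total = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Made-up counts: carriers vs. noncarriers of a SNP allele,
# responders vs. nonresponders to a drug.
counts = [[30, 10],   # carriers:    30 responders, 10 nonresponders
          [15, 25]]   # noncarriers: 15 responders, 25 nonresponders
print(round(chi_square(counts), 2))   # -> 11.43, suggesting an association
```

A statistic this large for a 2x2 table points to a genotype-outcome correlation worth following up in the lab.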

ANOTHER INTEGRATED genomic information company is Gene Logic, which was started in 1995 to create a genetic reference library. The firm relied on a low-throughput technology until the gene chip was invented; now it uses high-throughput Affymetrix GeneChip technology. In the past year, it has signed up 14 subscribers to its GeneExpress database. Gene Logic also is creating enabling software to analyze and visualize the massive amounts of information generated by genomics research, notes Robert Burrows, director of corporate communications.

Many changes have taken place in bioinformatics since C&EN reported on the subject a year ago (C&EN, Feb. 7, 2000, page 19). For one, since the human genome sequence data became available this past summer, there has been a big push to develop informatics tools to assemble and annotate these sequences and to build genomic databases. Scientists are becoming more concerned with the analysis of the data rather than the collection, observes Genome Therapeutics' DePristo.

"The genomic sequences are serving as the scaffold for the integration of biological data, including genes, proteins, genetic markers, regulatory elements, and so forth," DoubleTwist's Hart says. For instance, there has recently been a great expansion in the amount of SNP data available as companies study genetic variation between different ethnic groups and between diseased versus healthy people.

Xpogen offers what it calls Relevance Networks, which consist of "gene clustering" algorithms used to statistically analyze expression data, looking for similar expression patterns that imply a similarity in function. Relevance Networks not only identify functionally related groups of genes, they relate genotype to phenotype and annotations in a direct and intuitive way, notes Kim Seth, president and CEO of Xpogen.
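The core idea behind expression-based clustering can be sketched briefly: genes whose expression profiles track each other across conditions are linked, on the hypothesis that co-expression implies related function. The profiles and the 0.95 correlation threshold below are invented; Xpogen's actual Relevance Networks algorithm is more sophisticated than this:

```python
# Minimal sketch of co-expression linking, assuming made-up expression data.
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

profiles = {                         # expression level per condition
    "geneA": [1.0, 2.1, 3.2, 4.0],
    "geneB": [0.9, 2.0, 3.1, 4.2],   # tracks geneA closely
    "geneC": [4.1, 3.0, 2.2, 1.1],   # anticorrelated with both
}

links = [(a, b) for a in profiles for b in profiles
         if a < b and pearson(profiles[a], profiles[b]) > 0.95]
print(links)   # -> [('geneA', 'geneB')]
```

Linked genes become candidates for shared function or shared regulation, which a wet lab can then test.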

COMPLETION OF the human genome has forced the bioinformatics industry to change to keep up, says Celera's Kerlavage. Instead of sequencing single genes, companies are now sequencing entire microbial genomes. Tools are therefore required to allow annotation of entire genomes and to compare whole genomes with each other. For example, to determine the function of a human gene, researchers often want to compare it to a mouse gene with a known function.
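The human-to-mouse comparison Kerlavage describes reduces, at its simplest, to scoring a query gene against genes of known function and reporting the closest match. Real tools such as BLAST use statistically scored local alignments; the difflib similarity ratio below is only a stand-in, and all sequences and gene names are invented:

```python
# Toy cross-genome comparison, assuming fabricated sequences and names.
from difflib import SequenceMatcher

mouse_genes = {                       # mouse genes of known function
    "mKinase1":  "ATGGCCAAGGTTCTGA",
    "mChannel2": "ATGTTTCCGGAACTGA",
}
human_gene = "ATGGCCAAAGTTCTGA"       # human gene of unknown function

# Rank mouse genes by sequence similarity to the human query.
best = max(mouse_genes,
           key=lambda g: SequenceMatcher(None, human_gene,
                                         mouse_genes[g]).ratio())
print(best)   # -> mKinase1
```

If the closest mouse gene is a kinase, the human gene becomes a candidate kinase, an inference that still must be confirmed experimentally.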

There is also a metamorphosis from "discovery genomics," in which links are found between genes and functions, to "consumer genomics," in which protein differences between human individuals are detected and correlated with genetic mutations, says Alex Titomirov, founder, chairman, and CEO of bioinformatics firm InforMax. Each step of this evolution is enabled by bioinformatics.

As the biotech industry moves from discovering genes to looking for gene function and predicting biological pathways, the need for bioinformatics grows stronger. According to Compugen's Mintz, "People are beginning to understand that sequence data are just the beginning; there is a huge amount of protein expression and interaction data that must be organized to understand what genes do."

It is now clear that very little can be done with the genome without bioinformatics tools, Titomirov says. Furthermore, the bioinformatics industry in general has become much more sophisticated, he says. It has become so tailored that "the customer doesn't have to worry about what's inside the system. It allows the user not to worry about details."

For example, bioinformatics is being used to help understand toxic response and to be able to predict toxicity much better, says Structural Proteomics' Hagler.

One of the major causes of the failure of drug candidates is the discovery, late in the game, that a candidate is toxic. Being able to address this issue early in the drug discovery process is critical to developing drugs in the future, according to Hagler. Bioinformatics helps researchers analyze patterns of gene expression with computer algorithms that compare the patterns produced by materials of known toxicity in other organisms with the pattern for a DNA sequence found in the human genome.
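One way to picture this comparison is nearest-neighbor matching: the expression signature induced by a new compound is scored against reference signatures from compounds of known toxicity. The signatures, compound names, and cosine-similarity scoring below are invented for illustration and are not Structural Proteomics' method:

```python
# Toy toxicity-signature matching, assuming fabricated reference data.
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two expression signatures."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

reference = {                          # signature per reference compound
    "hepatotoxin_X": [3.0, 0.2, 2.8, 0.1],
    "benign_Y":      [0.1, 2.5, 0.3, 2.7],
}

candidate = [2.9, 0.3, 2.6, 0.2]       # new compound's expression signature
best = max(reference, key=lambda k: cosine(reference[k], candidate))
print(best)   # -> hepatotoxin_X
```

A candidate whose signature resembles a known hepatotoxin's can be flagged for closer scrutiny, or killed, long before clinical trials.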

Joseph King, vice president of R&D at Genetics Computer Group (GCG), a subsidiary of Pharmacopeia, says one of the big trends in bioinformatics is integration across the research spectrum, from bioinformatics to molecular modeling to small-molecule cheminformatics. GCG created the Wisconsin Package, the bioinformatics industry standard for sequence analysis.

GCG is just part of Pharmacopeia's software business--the rest consists of Molecular Simulations Inc., a molecular modeling organization, and the cheminformatics organizations Synopsys Scientific Systems and Oxford Molecular. The three components complement one another to form a "discovery platform" that bridges from DNA sequences through structural biology. Through its links with the other organizations, GCG hopes to deliver protein structure information and tools in a context useful to molecular biologists.

According to Paul Weber, vice president of software consulting services at Tripos, many people are shifting from the idea of moving data into a single huge warehouse to the idea of "virtual data warehouses" in which multiple databases are linked together. The virtual data warehouse provides a portal that makes these disparate systems appear to the user as a single system. Such decentralized systems are faster to develop and cheaper to install than data warehouses, he says.
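A bare-bones sketch of the virtual data warehouse idea: a portal object fans a query out to separately maintained databases and merges the results, so the user sees one system. The dict-backed sources, gene name, and sequence fragment below are fabricated for illustration:

```python
# Minimal federated-query facade, assuming toy in-memory "databases."

class VirtualWarehouse:
    def __init__(self, sources):
        self.sources = sources          # name -> queryable mapping

    def lookup(self, key):
        """Query every linked database; tag each hit with its source."""
        return {name: db[key]
                for name, db in self.sources.items() if key in db}

portal = VirtualWarehouse({
    "sequence_db":   {"geneX": "ATGGCCAAGGTT"},
    "expression_db": {"geneX": {"breast": "high", "liver": "low"}},
})

print(portal.lookup("geneX"))   # one merged answer drawn from both systems
```

Because each source stays independently owned and maintained, adding a new database means registering one more mapping rather than migrating data into a central warehouse, which is why Weber calls such systems faster and cheaper to build.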

Xpogen's Levine adds that there has been a trend toward "enterprise collaborative" capabilities--those tools that help people work together across an organization. As genomics and proteomics work becomes more complex, no one group can handle a given problem, so several groups need to work on it and share information in a user-friendly manner that nonetheless keeps the information protected. Xpogen has addressed this situation with a Web-based application.

InforMax has also devoted a major effort to developing large, enterprise-scale systems, as opposed to desktop applications for a single user. GenoMax, InforMax's enterprisewide computing platform, uses a three-tiered architecture, in which the front end--the interface with the user--uses Java programming; the back end, where data is stored and manipulated, is run by an Oracle system; the middle tier uses UNIX for analysis. GenoMax allows the user to work with genomes by searching through genes, Titomirov says, building proteins on these genes, and cascading them together.

The business is crossing a chasm between small systems and enterprisewide systems, Titomirov notes. While individual solutions were accepted in an earlier generation of bioinformatics programs, the industry has gravitated toward enterprise systems that pull together all pieces of a project and make them available to different groups of customers. Whether or not a bioinformatics company offers enterprise solutions will differentiate which companies are players and which are not, he contends.

Another area of change in the industry concerns the type of data warehouse companies have created. Over the past year, the focus was on collecting DNA samples for which the proper patient consent had been obtained. Having traceable consent is one of the most important parts of developing a collection of genetic data, says Rick Sheridan, vice president for information science at PPGx. Now that methods have been devised to ensure traceable consent and pharmaceutical companies have solved the problem of getting highly attributed samples, the issue is what will be done with this database of DNA.

The next three to five years, Sheridan says, will be spent collecting and organizing data so that the information doesn't end up in "data prisons," which allow one to transfer data in but not out. In devising ways for systems to store and retrieve data meaningfully, PPGx is providing a program for collecting and dealing with enormous amounts of data. To make data meaningful, he believes, pharmaceutical companies must invest in such tools and must standardize how they collect the data so scientists can tell how the data were generated.

Surveying the competitive landscape is important to understanding where the bioinformatics industry is and where it is going. Quite a few players are competing for the research dollars of biotech companies, Compugen's Mintz observes. In bioinformatics, a few well-funded players probably will be around for a while; the others will be bought or will go out of business.

Several companies--including Rosetta Inpharmatics, Lion Bioscience, and Compugen--have gone public. Compugen considers itself uniquely positioned with more than $85 million raised in private and public financing this past summer. In its initial public stock offering (IPO) in September, Genomica raised $122 million, which positions it well to weather the trial period.

MARKET ACCEPTANCE of bioinformatics is indicated by IPOs. Genomica's Ayers says: "Our industry is very dynamic right now, as is evident by the accelerating move toward consolidation. In the past four months, there have been five acquisitions or mergers involving scientific software companies. We see that as being driven by two primary factors: first is the recognition that use of data management tools and resources in drug discovery and development is no longer optional--it is fundamental. The other influence is the current capital market conditions, creating a culture of 'haves' and 'have-nots.' "

ORGANIZING Genetics Computer Group's OMIGA software allows a user to set colors in the sequence alignment editor to represent chemically similar residues (left) and to combine the results of multiple analyses with sequence features in a single graphic (right).
All bioinformatics-oriented companies face the same primary source of competition: in-house development of bioinformatics software within drug companies large enough to afford it, Viaken's Gardner says. Many of these companies, such as Merck and GlaxoSmithKline, prefer to provide their own bioinformatics solutions, spending literally hundreds of millions of dollars a year to support the R&D process. Even some smaller companies, such as protein therapeutics developer ZymoGenetics, use their own in-house algorithms, the company's CEO, Bruce Carter, says.

The problem is that not all drug companies can afford to create their own software. This provides an opening for independent bioinformatics companies like Viaken, which offers a mechanism for companies that can't match the investment of the Top 20 pharmaceutical players. Viaken tries to level the playing field by offering standard software products on an outsourced basis. The customer is charged a monthly fee, which reduces its risk since it does not have to invest large amounts up-front. Viaken also will go to a customer's site to work with internal information technology groups.

THE COMPETITION between in-house and outsourcing is for budget allocation--only so many bioinformatics efforts can be funded in a given period of time by a given firm. Because big pharmaceutical companies spend substantial sums of money to develop internal systems, there is a limited pool of funds available to spend on outsourcing.

But in some fields like banking, outsourcing information technology is desirable because the customer does not want to make a big investment in an effort that is not part of its core business. Yet pharmaceuticals is not like other businesses, Entigen's Goldstein comments. Rather, bioinformatics is becoming a key element of the drug discovery process because it assists laboratory research and makes it more productive and effective.

Moreover, BioInsights' Gentile contends that bioinformatics firms have not yet lived up to their clients' expectations. There are cultural issues, as well as technological and management issues, that are hard to surmount. From her discussions with bioinformatics customers, Gentile senses a dissatisfaction, a feeling that the customers could do the job better in-house. By the time a bioinformatics outsourcer comes up with a solution, she has found, the customer has often moved on to other problems.

Bioinformatics industry executives disagree, maintaining that there are advantages to outsourcing bioinformatics activities. One is that outsourcing can produce results more quickly than an in-house effort since outsourcers have frameworks that merely need to be tweaked to be customized for the client. It also is more efficient to buy software than to reinvent it, Genome Therapeutics' DePristo says.

MAKING SENSE Spotfire's Profile Search (top) allows all expression profiles to be compared with a master profile; the nonhierarchical clustering algorithm (center) is used to group objects based on their similarity; and the hierarchical clustering algorithm (bottom) is used with gene expression profiles--at left, a tree graph shows the resulting hierarchy.
The future will be ruled by competition, Xpogen's Levine notes. Many bioinformatics companies share the goal of putting their software on every biological scientist's desk. A huge array of companies representing different slices of the bioinformatics spectrum are currently getting into position, preparing for the struggle for dominance that will soon take place. There are many targets in the genome, such as large molecules that drugs can act on, but the availability of targets is a boon more to the pharmaceutical companies than to the informatics companies. Only so many software companies will be able to succeed in this environment, Levine says.

Making the business even harder is the fact that "there are only about 150,000 molecular biologists, many of those in academia, who are not used to paying lots of money for software but who mainly use tools that are dispersed freely," DePristo observes. Bioinformatics companies often are competing against something that's free, even if it might not be easy to use.

TO SURVIVE, says Phil McHale, vice president for product marketing at MDL Information Systems, bioinformatics companies must offer variety--from complete products for small companies to toolkits for large companies that build their own custom solutions. MDL, unlike some hybrid competitors, does not consider it necessary to have a wet lab to carry out synthesis and screening of compounds.

Instead, as a major player in the cheminformatics field, it places its bets on the large infrastructure it has built up in analysis, visualization, data integration, and decision support tools. With these tools, MDL's clients can decide which drug development programs to push through to in vivo screening and which to kill early because of factors such as low bioavailability.

Another factor that will determine the success of a bioinformatics company is its ability to protect its intellectual property. Algorithms, processes, computer programs, and functionality can all be patented, Levine says. Patenting can be used to control or eliminate the competition.

However, software tends to be proprietary, InforMax's Titomirov notes. It is sometimes easier to keep software secret than to patent it, since reverse-engineering binary object code is almost impossible.

Compugen's Mintz concurs that patenting is less critical for bioinformatics companies than it is for biotech firms that are discovering genes and gene functions.

Licensing is also a major feature of the bioinformatics business. A stream of announcements has been made about software companies licensing their products to pharmaceutical and biotech "content" companies. This trend is sure to grow as companies try to extract the most value from their intellectual capital. Over the past year, there also has been more interaction between academia and corporate America, with much more licensing of academia's technology, Levine says. Corporations want access to the cutting-edge technology coming out of universities.

"Bioinformatics should be judged on how well it can advance biology," Mintz comments. Computer results must always be validated in the wet laboratory setting. The problem is that in some cases the link between the computer and the lab is missing. A successful bioinformatics company must go the extra step to show that its computer results are biologically accurate, Mintz contends.

In spite of the concerted efforts of both large and small bioinformatics and biotech companies to deal with the continuing flood of genomic and proteomic data, much remains to be done. Algorithms must be created to decipher ever more complex patterns in data. Annotation must be improved and kept current. Disparate types of information must be integrated across multiple platforms in such a way that it is readily available to researchers. Enterprisewide systems must be developed to allow far-flung groups of researchers access to the same data and analysis.

And after the information is integrated and organized, it must be analyzed by ever more sophisticated statistical and analytical tools to quickly and accurately identify drug targets and lead compounds. For these reasons, bioinformatics is a booming field and is likely to continue to grow. Bioinformatics companies say they are up to the challenge.


Bioinformatics: What Is It?

Exactly what is meant by bioinformatics depends on whom you talk to. To most people, bioinformatics is the application of computer technologies to the biological sciences, particularly genomics, with the object of discovering knowledge.

This is often understood to include high-throughput screening of genes and proteins, according to Paul Weber, vice president of software consulting services for Tripos.

Anthony R. Kerlavage, senior director of product strategy at Celera Genomics, says bioinformatics is any application of computation to the field of biology, including data management, algorithm development, and data mining.

Historically, says Steve Gardner, vice president and chief technical officer of Viaken, the term bioinformatics has related particularly to the biological entities involved in the drug discovery process, covering genomics and proteomics. Some time ago, he says, it started to get used in a wider context, describing "everything in the discovery value chain. Everything upstream of genomics and proteomics through support for high-throughput screening, chemical information systems, clinical data, the activity of drugs in the body--all of that got lumped in." Gardner prefers to use "research informatics" to encompass this wider definition of bioinformatics.

From the biologist's point of view, bioinformatics is a set of tools that allows the scientist to see cause-and-effect relationships between disease and polymorphisms, or differences in the DNA sequence among individuals, according to Rick Sheridan, vice president of information science for PPGx.

From the viewpoint of a drug discovery company, says Arnold Hagler, president and CEO of Structural Proteomics, bioinformatics is the use of computers in assigning function to proteins and in comparing protein-protein interactions in different protein families--that is, bioinformatics is how the researcher transforms gene data to protein structures and correlates gene and protein function. Bioinformatics helps the researcher to "mine the data" in the gene sequences that have been discovered, he says.

Taking the broad view, bioinformatics is the application of software or information technology to biology and experimental medicine with the objective of discovering knowledge, says Alex Titomirov, chairman and CEO of InforMax.

An even broader view is supplied by Howard D. Goldstein, president and CEO of Entigen. Goldstein says bioinformatics is simply the management of biological information.

According to Eli Mintz, president of Compugen, the whole concept of bioinformatics will be gone in 10 or 20 years, as biology undergoes a transformation into a more data-oriented science. As the amount of biological data grows exponentially, biology will become a quantitative science, and biologists will have to start using different tools.

Included in the bioinformatics realm is the technology needed to bring together groups of researchers from academia and industry so they can collaborate more effectively. It is also necessary to bridge the gap between different types of data, so that scientists and managers have access to as much information as possible in making decisions on which paths of research to pursue or kill.

Of course, the definition is just the start. "The challenge in bioinformatics," says Kyle Hart, director of bioinformatics at DoubleTwist, "is knowing how to select and evaluate computational biology algorithms, knowing when to build your own, and knowing how to make them all work together in a high-throughput manner to produce high-quality results."


Looking At Small Molecules

Like bioinformatics, cheminformatics is still being defined. Whereas bioinformatics, in general, is the discipline in which computers are used to store, retrieve, and assist in understanding biological information, cheminformatics is the organization of chemical data in a logical form to facilitate the process of understanding and making inferences. Instead of a text-based retrieval system, cheminformatics uses chemical structures that researchers provide as input to identify similar compounds that might be screened for biological activity.

The boundary between cheminformatics and bioinformatics is disappearing, but in general it can be said that bioinformatics deals with large molecules such as proteins and cheminformatics deals with the small molecules that are synthesized in chemical processes, says Phil McHale, vice president for product marketing at MDL Information Systems. Bioinformatics blurs into cheminformatics with the processes of target identification, target validation, and assay development.

MDL is a cheminformatics company that enables its customers to store, retrieve, analyze, and make decisions based on chemical structures. It considers itself to be in an ideal position to close the gaps between the large-molecule bioinformatics world and the small-molecule cheminformatics world.

Tripos also considers itself a cheminformatics company, focusing mainly on chemical information and chemical properties for compounds involved in the early drug discovery stages through the early development stages.

Other companies competing in this field include ChemNavigator and Pharmacopeia. Over the past few years, Pharmacopeia has acquired a stable of software companies that include a bioinformatics arm--Genetics Computer Group--as well as Molecular Simulations, Synopsys Scientific Systems, and Oxford Molecular Group. These last two businesses form the cheminformatics part of Pharmacopeia.

ChemNavigator has developed an application that allows researchers to submit a precise structure for comparison with other compounds among the more than 1 million compounds in its library, President Scott Hutton says. "Fuzzy logic" is used to discern chemical similarities between structures. At ChemNavigator's website, commercially available compounds can be identified, purchased, and delivered to the researcher expeditiously. According to Hutton, this Web-based application, iResearch, includes compound background information such as Food & Drug Administration filings, toxicity data, and patent information, helping the researcher to avoid wasting resources on unsuitable candidates.
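"Fuzzy" structure matching of this kind is commonly implemented by comparing binary substructure fingerprints with the Tanimoto coefficient. The fingerprints, compound names, and 0.6 cutoff below are made up for illustration; ChemNavigator's actual algorithm is not described in detail:

```python
# Toy fingerprint similarity search, assuming fabricated bit sets.

def tanimoto(fp1, fp2):
    """Tanimoto coefficient between two fingerprints, each represented
    as the set of substructure bits that are switched on."""
    shared = len(fp1 & fp2)
    return shared / (len(fp1) + len(fp2) - shared)

query = {1, 4, 7, 9, 12}                 # bits set for the query structure
library = {
    "cmpd_A": {1, 4, 7, 9, 13},          # shares most substructure bits
    "cmpd_B": {2, 3, 5, 8, 10},          # unrelated scaffold
}

hits = [name for name, fp in library.items()
        if tanimoto(query, fp) >= 0.6]
print(hits)   # -> ['cmpd_A']
```

Ranking a million-compound library this way surfaces purchasable near-neighbors of a lead structure without requiring an exact match.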

Anadys Pharmaceuticals is a young drug discovery company using cheminformatics as a platform on which to bridge the gap between chemistry and biology. In drug discovery, biology supplies the targets through genomics, chemistry provides the compounds to be screened, and assays are developed using biology. Medicinal chemists take "hits" from those screens and make more compounds to be tested by biologists in animals. It is essential to tie these processes together using informatics.

Anadys is developing an integrated cheminformatics management system to inventory compound libraries, analyze the libraries for specific chemical and druglike properties, store results from high-throughput screening, and predict new chemical structures, says Kleanthis G. Xanthopoulos, president and CEO. Once these new compounds are identified, Anadys' medicinal chemists synthesize these materials. In effect, Anadys uses cheminformatics software to prescreen compounds to identify the more active druglike compounds.

The search for new drugs cannot be done randomly, given the enormous number of possibilities, so computer algorithms are critical for efficiently screening for biologically active structures, Xanthopoulos says. Once a small-molecule drug is in the lab, it can be modified as necessary to match the requirements of the drug application. Anadys' research is primarily in what it calls riboproteomics, the study of RNA-protein interactions.

Many bioinformatics companies are following courses that blur bioinformatics into cheminformatics. For example, Viaken outsources standard software products for companies in the life sciences industry. But over the past year, the company has expanded into chemical properties and structures in order to extend the range and depth of what it can do for its customers, says Steve Gardner, Viaken's vice president and chief technical officer.



Chemical & Engineering News
Copyright © 2001 American Chemical Society
