May 2001
Vol. 4, No. 5, pp 94–96.
sites and software

A river of data runs through it

New computing techniques help manage the massive flow of information.

For many years, computational methods have played a critical role in the drug discovery process. The revolution in automated laboratory methods has reshaped the research environment, demanding even greater dependence on computational systems. Robotics, multiwell systems, and new technologies such as combinatorial chemistry, high-throughput screening (HTS), and genomic sequencing are generating truly massive quantities of data—data that are overloading the conventional informatics infrastructure. Deriving value from such massive quantities of data, a process called informatics, is increasingly difficult.

Database systems have come to play a central role in informatics; however, they are also part of the problem. Today’s discovery information comes from hundreds of different sources and consists of many types of data. The integration, management, and analysis of all this valuable data present tremendous challenges.

To compound the problem, the nature of database systems requires that data be centralized and research inquiries be restricted to those properties that have been preconceived or anticipated and already stored in the system. The inflexibility of these traditional systems limits analysis and exploration of the generated data. The combination of increasing volumes and types of data with the inflexibility of legacy database systems has led to a gross underutilization of drug discovery data.

Two primary problems emerge when looking at how the wealth of data generated by HTS and other methods is being used, or, more important, not being used. First, although measured screening activity can often be related to structural information, methods for deriving predictive models are generally not effective for the large quantities of data in today's automated research environment. Second, conventional modeling technologies cannot incorporate the full breadth of useful collateral information about tested compounds, such as text annotations.

New methods of examining this wealth of data can enable better lead selection. Researchers at Agouron Pharmaceuticals, Inc. (a wholly owned subsidiary of Pfizer), are using a commercial method called data pipelining to derive activity models from their HTS data. Such an approach can be used to aid the optimal selection of possible lead compounds from third-party data collections.

How data pipelining works
In data pipelining, data points are processed independently while flowing through a branched network of components. This allows great flexibility in describing data processing sequences that run with optimal performance, and it relieves the burden of having all the data sources in the same place at the same time. Very fine control over the analysis is possible by using branches to route data points to different downstream processing steps. Data pipelining complements relational database systems by extending the scope of analyses that can be performed.
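To make the streaming idea concrete, the following minimal sketch models pipeline components as Python generators, so each record is processed and handed downstream one at a time and memory use stays independent of data set size. This is only an illustration of the general technique; the component names are ours, not Pipeline Pilot's.

    # Illustrative sketch: each component consumes and yields records one at
    # a time, so an entire data collection never has to be held in memory.

    def read_source(records):
        # Source component: emit records one by one.
        for record in records:
            yield record

    def add_property(stream, name, func):
        # Computational component: attach a calculated property to each record.
        for record in stream:
            record[name] = func(record)
            yield record

    def branch_filter(stream, predicate):
        # Branching component: pass along only records meeting a criterion.
        for record in stream:
            if predicate(record):
                yield record

    # Compose a pipeline (read -> calculate -> filter), evaluated lazily.
    compounds = [{"id": 1, "mw": 320.4}, {"id": 2, "mw": 612.8}]
    pipeline = branch_filter(
        add_property(read_source(compounds), "heavy", lambda r: r["mw"] > 500),
        lambda r: not r["heavy"],
    )
    for record in pipeline:
        print(record["id"])  # only compound 1 passes the weight filter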

In addition, advances in system performance make it possible to process entire data collections in real time. Because data are processed in real time, they are not constrained by what has been precalculated and stored in a database.

With specialized software, data processing pipelines can be composed graphically using more than 100 different configurable components. These components perform operations like calculating properties, filtering, and reading from or writing to a variety of data sources. Individual data records “flow” through these pipelines from component to component, and each component performs its designated calculation or operation.

For example, computational components can add calculated properties, divert the data records based on some specified criteria, or model incoming activity information. Data sets can be merged, compared, or processed according to the logic of the pipeline a user lays out. Once established, these computational protocols may be published for enterprise use, allowing research colleagues to apply them to their own data.
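The merge and compare operations can be sketched the same way. The hypothetical component below joins assay results from one source with structure records from another on a shared compound identifier, emitting combined records as they stream through; the field names are our assumptions.

    def merge_by_id(structures, assays):
        # Merge component: join two record streams on "id". Assumes the
        # assay side is small enough to index in memory.
        assay_index = {a["id"]: a for a in assays}
        for record in structures:
            match = assay_index.get(record["id"])
            if match is not None:
                yield {**record, **match}

    structures = [{"id": "C1", "smiles": "CCO"},
                  {"id": "C2", "smiles": "c1ccccc1"}]
    assays = [{"id": "C2", "ic50_uM": 1.3}]
    print(list(merge_by_id(structures, assays)))
    # [{'id': 'C2', 'smiles': 'c1ccccc1', 'ic50_uM': 1.3}]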

Models for predicting activity
One challenge in handling large amounts of data is the generation of useful predictive models. Traditional quantitative structure–activity relationship (QSAR) models are useful for congeneric data sets of limited size, but they are inadequate when the volume or diversity of the data increases. Further, conventional data modeling technologies such as fitting methods (regression, neural networks) and partitioning methods (decision trees) can perform poorly when given either very few samples or very large sample sets, or when large numbers of descriptors are included.

The categorization approach implemented in some data pipelining systems, like SciTegic’s Pipeline Pilot, uses a Bayesian statistical approach designed specifically to address the limitations inherent in conventional fitting methodologies, allowing it to work across all of these extremes. The streaming approach of data pipelining eliminates the typical memory and disk capacity scaling issues associated with modeling of extremely large data collections.

These statistical categorization methods extend the capabilities of data pipelining by automatically learning from data. Models are created by marking sample data that have the traits one is looking for, and the system learns to distinguish them from other background data. The application automatically determines which properties are most important and weights them accordingly.
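The article does not spell out the estimator, but a Laplacian-corrected naive Bayes over sparse structural features is one common way to build such a categorizer, and it illustrates the automatic feature weighting described above. The sketch below, including its smoothing choice, is our assumption rather than Pipeline Pilot's actual implementation; features are abstract identifiers such as fingerprint bits.

    import math
    from collections import Counter

    def train_bayes(samples):
        # samples: list of (feature_set, is_good) pairs. Learn a per-feature
        # weight from how often each feature appears in "good" samples,
        # smoothed toward the base rate so rare features are not overweighted.
        n_good = sum(1 for _, good in samples if good)
        base_rate = n_good / len(samples)
        seen, good = Counter(), Counter()
        for features, is_good in samples:
            for f in features:
                seen[f] += 1
                if is_good:
                    good[f] += 1
        weights = {}
        for f, n_f in seen.items():
            p = (good[f] + 1) / (n_f + 1 / base_rate)  # Laplacian-corrected
            weights[f] = math.log(p / base_rate)       # 0 if uninformative
        return weights

    def score(weights, features):
        # Higher totals mean the sample looks more like the marked data.
        return sum(weights.get(f, 0.0) for f in features)

    samples = [({"ring", "amide"}, True), ({"ring"}, False),
               ({"amide"}, True), ({"halogen"}, False)]
    w = train_bayes(samples)
    print(score(w, {"amide"}) > score(w, {"halogen"}))  # True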

The model is then incorporated into the system as a new component, where it can be used in pipelines like any other. These advantages make statistical categorization feasible in today's high-throughput, multisourced, data-rich environment.

Hit enrichment by computational ranking
Agouron is using SciTegic's Pipeline Pilot to derive additional value from the vast volumes of data it generates through HTS and combinatorial technology. The company's first application of the product has demonstrated its ability to produce reliable activity models from screening data.

The viability of one model was explored by prioritizing samples in a test collection of approximately 10,000 compounds and examining the hit enrichment in the ranked set. A data pipeline was developed to create a model for a new biological activity (Figure 1). The molecular descriptors used in model building were SciTegic’s extended connectivity fingerprints combined with molecular weight and the number of hydrogen bond donors and acceptors. The model was created using a structural data set of 700,333 baseline compounds and 3571 confirmed hits. An evaluation data set consisted of 9566 compounds and contained 74 known actives (0.77%). The model generated was used to rank the evaluation set on the likelihood that the compounds would exhibit the desired activity. The position of known active compounds in the ordered list is an indicator of the quality of the model. A convenient way to visualize this is by plotting the cumulative distribution of the active compounds in the ranked test library.
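The cumulative-distribution view is easy to compute once the evaluation set has been ranked. The short sketch below (names and data are illustrative) returns, for each depth in the ranked list, the fraction of known actives recovered so far, which is exactly the curve described.

    def cumulative_actives(ranked_ids, active_ids):
        # curve[k-1] = fraction of known actives found in the top k compounds
        found, curve = 0, []
        for cid in ranked_ids:
            if cid in active_ids:
                found += 1
            curve.append(found / len(active_ids))
        return curve

    ranked = ["c7", "c2", "c9", "c4", "c1"]  # model scores, best first
    print(cumulative_actives(ranked, {"c2", "c4"}))
    # [0.0, 0.5, 0.5, 1.0, 1.0]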

Another useful measure of the quality of the model is the hit enrichment in the highest scoring subsets. This is defined as the ratio of the number of actives in a subset to the number of actives in the whole set, divided by the ratio of the number of compounds in the subset to the number of compounds in the whole set. Among the top 1% of the ranked set, we found 28% of the known hits, an enrichment of nearly 30-fold.
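Plugging the reported numbers into this definition confirms the figure; the few lines below are just that arithmetic.

    def enrichment(actives_in_subset, total_actives, subset_size, total_size):
        # (fraction of actives recovered) / (fraction of library screened)
        return (actives_in_subset / total_actives) / (subset_size / total_size)

    # From the text: the top 1% of 9566 compounds held 28% of the 74 actives.
    print(enrichment(round(0.28 * 74), 74, round(0.01 * 9566), 9566))  # ~28.3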

In prioritizing new uncharacterized compounds for HTS, even selecting as much as the top 10% of high-scoring compounds yields an extremely favorable enrichment over unmodeled data. Thus, screening expenses can be greatly reduced by assaying only the highest scoring compounds, or positive results can be obtained more quickly by assaying those compounds first. These types of models have proven to be extremely valuable for the prioritization of libraries and have already resulted in considerable savings in the purchasing of new compounds.

Building the model and performing the evaluation steps can be done quickly. On a single-CPU, 600-MHz Pentium III desktop computer, building the model from the 700,333-compound set took 9 minutes, and evaluating the 9566-compound library took 10 seconds. Conventional technologies lack the scalability to address problems of this magnitude with accuracy and speed.

Data pipelining applied to other areas
Data pipelining can be used in a broad variety of informatics applications beyond predictive modeling. Standard protocols are available for cleaning, evaluating, and comparing compound libraries. Chemical databases can be mined for candidate leads, while simultaneously applying outside criteria from text collections, such as Medline. In addition, assay data can be processed and condensed before archiving in a relational database system. By streaming data through automated processing protocols, new scientific approaches become possible, and discovery is ultimately accelerated.

Conclusions
Data pipelining has emerged as a practical technology for accelerating the discovery process. Research at Agouron with Pipeline Pilot has demonstrated the value of data pipelining as a tool in drug discovery research and as an aid for managing large data collections in a flexible manner. The technology also shows promise for guiding the generation of new combinatorial libraries. Further, data pipelining has applicability beyond the cheminformatics discussed here, in areas such as bioinformatics, genomics, laboratory information management systems, and clinical informatics.

Acknowledgment
We acknowledge the help of Ton van Daelen at SciTegic, Inc., San Diego.


Jaroslaw Kostrowicki is a research scientist, Zhengwei Peng is a senior scientist, and Atsuo Kuki is a senior director of discovery chemistry at Agouron Pharmaceuticals, Inc./Pfizer Global Research & Development, La Jolla, CA. Send your comments or questions regarding this article to mdd@acs.org or the Editorial Office by fax at 202-776-8166 or by post at 1155 16th Street, NW; Washington, DC 20036.
