A river of data runs through it
For many years, computational methods have played a critical role in the drug discovery process. The revolution in automated laboratory methods has made research even more dependent on computational systems. Robotics, multiwell systems, and new technologies such as combinatorial chemistry, high-throughput screening (HTS), and genomic sequencing are generating truly massive quantities of data, data that are overloading the conventional informatics infrastructure. Deriving value from such massive quantities of data, a process called informatics, is increasingly difficult.

Database systems have come to play a central role in informatics; however, they are also part of the problem. Today's discovery information comes from hundreds of different sources and consists of many types of data, and integrating, managing, and analyzing all of this valuable data presents tremendous challenges. To compound the problem, database systems require that data be centralized and that research inquiries be restricted to properties that have been anticipated and already stored in the system. The inflexibility of these traditional systems limits analysis and exploration of the generated data, and the combination of growing volumes and types of data with inflexible legacy databases has led to a gross underutilization of drug discovery data.

Two primary problems stand out when looking at how the wealth of data generated by HTS and other methods is being used, or, more important, not being used. First, although measured screening activity can often be related to structural information, methods for deriving predictive models are generally not effective for the large quantities of data in today's automated research environment. Second, conventional modeling technologies cannot incorporate the full breadth of useful collateral information about tested compounds, such as text annotations. New methods of examining this wealth of data can enable better lead selection. Researchers at Agouron Pharmaceuticals, Inc. (a wholly owned subsidiary of Pfizer), are using a commercial method called data pipelining to derive activity models from their HTS data. Such an approach can aid the optimal selection of possible lead compounds from third-party data collections.

How data pipelining works

With specialized software, data processing pipelines can be composed graphically from more than 100 different configurable components. These components perform operations such as calculating properties, filtering, and reading from or writing to a variety of data sources. Individual data records flow through a pipeline from component to component, and each component performs its designated calculation or operation. For example, computational components can add calculated properties, divert data records based on specified criteria, or model incoming activity information. Data sets can be merged, compared, or processed according to the logic of the pipeline a user lays out. Once established, these computational protocols can be published for enterprise use, allowing research colleagues to apply them to their own data.

Advances in system performance also make it possible to process entire data collections in real time. Because the data are processed on the fly, analyses are not constrained by what has been precalculated and stored in a database.
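To make the record-streaming idea concrete, the sketch below composes a pipeline from three hypothetical components: a reader, a calculated-property component, and a filter. The component names and the record format are illustrative assumptions for this article, not SciTegic's Pipeline Pilot interface.

```python
# Minimal sketch of the record-streaming idea behind data pipelining.
# Component names and the record format are illustrative only.

def read_records(rows):
    """Source component: yield one data record (a dict) at a time."""
    for row in rows:
        yield dict(row)

def add_property(records, name, func):
    """Computational component: add a calculated property to each record."""
    for rec in records:
        rec[name] = func(rec)
        yield rec

def filter_records(records, predicate):
    """Filter component: pass through only records meeting the criterion."""
    for rec in records:
        if predicate(rec):
            yield rec

# Compose a pipeline: records flow component to component, one at a time,
# so memory use does not grow with the size of the collection.
compounds = [{"id": "C1", "mw": 342.4}, {"id": "C2", "mw": 612.8}]
pipeline = filter_records(
    add_property(read_records(compounds), "heavy", lambda r: r["mw"] > 500),
    lambda r: not r["heavy"],
)
for record in pipeline:
    print(record["id"])   # only C1 survives the molecular-weight filter
```

Because each component is a generator, adding another calculation or filter is just another wrapping call, which mirrors how graphical pipelines are extended component by component.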
Models for predicting activity

Statistical categorization methods extend the capabilities of data pipelining by automatically learning from data. Models are created by marking sample data that have the traits one is looking for, and the system learns to distinguish them from the other, background data. The application automatically determines which properties are most important and weights them accordingly. The resulting model is incorporated into the system as a new component, where it can be used in pipelines like any other component.

The categorization approach implemented in some data pipelining systems, such as SciTegic's Pipeline Pilot, uses a Bayesian statistical method designed specifically to address the limitations inherent in conventional fitting methodologies, allowing it to work across the extremes of data volume and diversity described above. The streaming nature of data pipelining also eliminates the memory and disk capacity scaling issues typically associated with modeling extremely large data collections. Together, these advantages make statistical categorization feasible in today's high-throughput, multisourced, data-rich environment.

Hit enrichment by computational ranking

The viability of one model was explored by prioritizing samples in a test collection of approximately 10,000 compounds and examining the hit enrichment in the ranked set. A data pipeline was developed to create a model for a new biological activity (Figure 1). The molecular descriptors used in model building were SciTegic's extended connectivity fingerprints combined with molecular weight and the numbers of hydrogen bond donors and acceptors. The model was built from a structural data set of 700,333 baseline compounds and 3571 confirmed hits. The evaluation data set consisted of 9566 compounds and contained 74 known actives (0.77%).

The model was used to rank the evaluation set by the likelihood that each compound would exhibit the desired activity. The position of the known active compounds in the ordered list is an indicator of the quality of the model; a convenient way to visualize this is to plot the cumulative distribution of the actives in the ranked test library. Another useful measure of model quality is the hit enrichment in the highest-scoring subsets, defined as the fraction of all actives found in a subset divided by the fraction of all compounds placed in that subset. Among the top 1% of the ranked set, we found 28% of the known hits, an enrichment of nearly 30-fold. In prioritizing new, uncharacterized compounds for HTS, even selecting as much as the top 10% of high-scoring compounds yields an extremely favorable enrichment over unmodeled data. Thus, screening expenses can be greatly reduced by assaying only the highest-scoring compounds, or positive results can be obtained more quickly by assaying those compounds first. Models of this kind have proven extremely valuable for prioritizing libraries and have already produced considerable savings in the purchase of new compounds.

Building the model and performing the evaluation are fast. On a single-CPU, 600-MHz Pentium III desktop computer, building the model from the 700,333-compound set took 9 minutes, and evaluating the 9566-compound library took 10 seconds. Conventional technologies lack the scalability to address problems of this magnitude with comparable accuracy and speed.
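As a concrete illustration of the learning-by-marking idea described under "Models for predicting activity", the sketch below trains a generic naive Bayes ranker over binary descriptor bits (for example, fingerprint bits), weighting each bit by how much more often it appears in the marked "good" samples than in the background. This is an assumed, simplified toy for illustration, not SciTegic's actual categorization algorithm.

```python
# Generic, minimal sketch of Bayesian categorization over binary descriptors.
# An ordinary naive Bayes ranker written for illustration; not SciTegic's
# actual algorithm.
from collections import defaultdict
from math import log

def train(good, baseline, n_bits):
    """Estimate per-bit log-likelihood weights from marked 'good' samples
    and background 'baseline' samples (each sample is a set of on-bits)."""
    good_counts = defaultdict(int)
    base_counts = defaultdict(int)
    for bits in good:
        for b in bits:
            good_counts[b] += 1
    for bits in baseline:
        for b in bits:
            base_counts[b] += 1
    weights = {}
    for b in range(n_bits):
        p_good = (good_counts[b] + 1) / (len(good) + 2)      # Laplace smoothing
        p_base = (base_counts[b] + 1) / (len(baseline) + 2)
        weights[b] = log(p_good / p_base)   # informative bits get larger weights
    return weights

def score(bits, weights):
    """Rank-ordering score: sum of weights for the bits a compound sets."""
    return sum(weights.get(b, 0.0) for b in bits)

# Toy usage: compounds are represented as sets of fingerprint bits.
good = [{0, 2, 5}, {0, 2, 7}]
baseline = [{1, 3}, {3, 4}, {1, 4, 6}]
w = train(good, baseline, n_bits=8)
print(sorted([{0, 2}, {3, 4}], key=lambda s: score(s, w), reverse=True))
```

The point of the sketch is the weighting step: properties (bits) that discriminate the marked samples from the background receive large weights automatically, without the user having to preselect them.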
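The enrichment arithmetic quoted above can be checked directly. The helper below implements the stated definition of hit enrichment; the example numbers come from the article (74 actives in a 9566-compound evaluation set), while the count of actives in the top 1% (about 21, i.e., 28% of 74) is inferred from the reported percentages rather than taken from the original data.

```python
# Sketch of the hit-enrichment measure defined in the text.

def hit_enrichment(actives_in_subset, actives_total, subset_size, total_size):
    """(fraction of all actives captured) / (fraction of all compounds screened)."""
    return (actives_in_subset / actives_total) / (subset_size / total_size)

total_size = 9566        # evaluation library
actives_total = 74       # known actives (0.77%)
subset_size = 96         # top 1% of the ranked list (~96 compounds)
actives_in_subset = 21   # ~28% of the known hits (inferred from the article)

print(round(hit_enrichment(actives_in_subset, actives_total,
                           subset_size, total_size), 1))  # ~28.3, "nearly 30-fold"
```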
Data pipelining applied to other areas

Conclusions

Acknowledgment

We acknowledge the help of Ton van Daelen at SciTegic, Inc., San Diego.
Jaroslaw Kostrowicki is a research scientist, Zhengwei Peng is a senior scientist, and Atsuo Kuki is a senior director of discovery chemistry at Agouron Pharmaceuticals, Inc./Pfizer Global Research & Development, La Jolla, CA.

Send your comments or questions regarding this article to mdd@acs.org or the Editorial Office by fax at 202-776-8166 or by post at 1155 16th Street, NW; Washington, DC 20036.