Cover image: Westbound Publications and Tony Fernandez
December 2001
Vol. 31, No. 12, pp 12–17.
Starting the Process

A. W. Edith Chan
Malcolm P. Weir

Using chemistry to target treatments

Protein structures bridge the gap between bioinformatics and cheminformatics, making it possible to identify promising new drug compounds earlier in the screening process.

Opening art by Scott Neitzke

The revolution that generated the current abundance of genomics and bioinformatics data is creating new problems in drug discovery. Bioinformatics gives us a wealth of information on disease-related genes. How do we choose which genes are the best potential targets for drug development? Where can R&D be directed most effectively?

The most valuable biological targets expressed from these genes are those that are at critical intervention points in a disease pathway and at the same time are likely to be “druggable” (proteins that are likely to be susceptible to orally absorbed drugs). Selecting these protein targets from among thousands of candidates requires more sophisticated methods than have been used before because ~40% of newly identified gene sequences have no known function. If bioinformatics can relate distant members of pharmaceutically relevant protein families and annotate unassigned sequences by detecting remote homologues, it will be possible to identify potentially druggable target families on the basis of structural considerations. Technology, such as Biopendium (a compilation of information on protein structures, sequences, and ligands) from Inpharmatica (London), can give us novel sequences with annotated functional and structural information (1). However, the problem of how to prioritize hundreds or thousands of the potential targets discovered is not easily solved.

Identifying and designing druglike molecules to cure diseases, usually by inhibiting the action of specific proteins, is a quest that all pharmaceutical companies pursue. Combinatorial synthetic methods combined with high-throughput screening can rapidly provide large numbers of potential lead compounds. However, the screening collections built up in this way are not the most effective source of potential drug compounds, many of which will fail in the subsequent stages of the drug discovery cycle. Most pharmaceutical companies have implemented in vitro ADME (absorption, distribution, metabolism, and elimination) screens to help discard discovery-phase compounds that are likely to fail farther down the line. To maximize the chances for generating promising leads from a given compound collection, the constituents of the collection should be as chemically diverse as possible. However, any measure of chemical diversity is dependent on the parameter being considered, and this complicates the comparison process.

The Biopendium can annotate sequences and relate them to a known protein structure, even if it is distantly related, using Inpharmatica’s proprietary technology, such as the Genome Threader algorithm. The known structure constitutes the bridge between the bioinformatics knowledge base of the Biopendium and Inpharmatica’s cheminformatics database, Chematica. Chematica provides the mechanism for going from a target protein structure to viable drug ideas (Figure 1). Some of the features of this database are

  • the absolute quantitative prediction of a protein’s potential binding sites;
  • definition of the chemical features that determine the molecules that will bind to a site;
  • generation of pharmacophores for large-scale virtual screening;
  • clear predictions of what molecular fragments are likely to bind to the active site, generating ideas for a focused chemical library; and
  • prioritization of druggable targets.

Active site identification
When considering the 3-D structure of a protein target, the first important question is, “Where in the structure is the protein’s active site?” The answer depends largely on the protein’s biological function.

Inpharmatica’s method first uses the SURFNET program to locate internal cavities and surface clefts by computationally “filling” all void spaces between the atoms in the protein structure with spheres (2). Clusters of these spheres define empty regions, one or more of which may correspond to the protein’s binding site.
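The sphere-filling idea can be illustrated with a minimal sketch. This is not the SURFNET code itself: the function name `gap_spheres`, the atom tuple layout, and the radius limits are assumptions chosen for illustration. For every pair of atoms, a trial sphere is centered midway between the two atomic surfaces and then shrunk until no other atom penetrates it; spheres that remain large enough are kept as markers of empty space.

```python
import math
from itertools import combinations

def gap_spheres(atoms, min_radius=1.0, max_radius=4.0):
    """Return (center, radius) gap spheres for a list of atoms.

    atoms: list of (x, y, z, vdw_radius) tuples, in angstroms.
    min_radius / max_radius: illustrative limits, not SURFNET defaults.
    """
    spheres = []
    for i, j in combinations(range(len(atoms)), 2):
        a, b = atoms[i], atoms[j]
        d = math.dist(a[:3], b[:3])
        gap = d - a[3] - b[3]  # distance between the two atomic surfaces
        if gap < 2 * min_radius:
            continue  # too close for even the smallest sphere
        # Center the trial sphere midway between the two surfaces.
        t = (a[3] + gap / 2) / d
        center = tuple(a[n] + t * (b[n] - a[n]) for n in range(3))
        radius = min(gap / 2, max_radius)
        # Shrink the sphere so that no third atom penetrates it.
        for k, other in enumerate(atoms):
            if k in (i, j):
                continue
            radius = min(radius, math.dist(center, other[:3]) - other[3])
            if radius < min_radius:
                break
        if radius >= min_radius:
            spheres.append((center, radius))
    return spheres
```

Grouping the surviving spheres into clusters, which this sketch leaves out, would then give the candidate cavities and clefts.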

For structures already in the Protein Data Bank (PDB) (3), or for proprietary in-house structures, the potential binding sites are precalculated and stored in the Chematica database. The binding sites are generated as 3-D density maps and can be viewed using standard molecular graphics packages.

Characterizing the binding site
Once the binding site has been identified, and its size and shape clearly delineated, the next step is to determine what molecules might bind there, with a view to designing a molecule that might act as a drug against the associated disease.

The algorithm XSITE identifies regions within the binding site where particular atom types (carboxyl oxygens, main chain nitrogens, chlorine, fluorine, sulfonyl sulfur, etc.) can make favorable interactions with the nearby atoms of the protein (4). The data on which XSITE is based are derived empirically, and the program makes no assumptions about the nature of the forces that exist between atoms. The program relies entirely on the compilation and use of a database of 3-D distributions of different atom types about different three-atom fragments. The 3-D distributions are taken from known protein and protein–ligand structures in the PDB and can be recompiled as the size of the PDB grows.

Various properties of each cleft are then compiled to help automatically identify the binding site. These properties include the cleft volume, the hydrophobic and polar contributions, and the number of hydrogen bond acceptors of all protein atoms within the predicted binding sites. In all, more than 20 of these parameters are calculated to generate a complicated 3-D mosaic of favorable interaction regions.

Pharmacophores and binding fragments
For this mosaic to be useful, it must be simplified and reduced to its key features. One way to do that is to generate a pharmacophore—a collection of the most important chemical features characterizing molecules (including any potential drugs) that might bind in the binding site. The pharmacophore provides a molecular “search template” to scan for potential drug leads in commercial or proprietary small-molecule databases and combinatorial chemical libraries. Such “virtual screening” promotes the fast discovery of potential drug candidates.

Chematica can search small-molecule databases and pick out the molecules that best match the pharmacophore, both in terms of the shape of the binding site and the chemical properties of the atoms at different points within the site. The program currently generates a search template automatically for use with the Unity platform (Tripos, St. Louis).

Inpharmatica’s approach differs from others in that it does not view the interaction from the binding molecule’s point of view, but rather from that of the target protein itself, so as not to miss any interactions required by the protein.

As an alternative to searching a database of complete molecules, Chematica also has a built-in database of more than 100 small druglike fragments and scaffolds, including carbonyl, sulfonyl, imidazole, and diazepine groups. These represent common organic functional groups and building blocks used in chemistry. Any proprietary structures can be added to this database easily for search purposes.

The fragments in the database are considered in turn with a large number of orientations. The XSITE score is calculated at each rotation. The results summarize the highest scoring fragments and how they populate the protein’s binding site.
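A minimal sketch of such an orientation scan follows. The scoring function is passed in as a plain callable and merely stands in for XSITE, whose interface is not described here; the names `rotate`, `best_orientation`, and `rank_fragments` and the regular angular grid are assumptions. Translating the fragment around the site is omitted for brevity.

```python
import math
from itertools import product

def rotate(points, ax, ay, az):
    """Rotate (x, y, z) points about the x, then y, then z axis (radians)."""
    out = []
    for x, y, z in points:
        y, z = (y * math.cos(ax) - z * math.sin(ax),
                y * math.sin(ax) + z * math.cos(ax))
        x, z = (x * math.cos(ay) + z * math.sin(ay),
                -x * math.sin(ay) + z * math.cos(ay))
        x, y = (x * math.cos(az) - y * math.sin(az),
                x * math.sin(az) + y * math.cos(az))
        out.append((x, y, z))
    return out

def best_orientation(fragment, score, steps=12):
    """Try steps**3 orientations; return (best_score, best_pose)."""
    angles = [2 * math.pi * k / steps for k in range(steps)]
    best_score, best_pose = -math.inf, None
    for ax, ay, az in product(angles, repeat=3):
        pose = rotate(fragment, ax, ay, az)
        s = score(pose)  # placeholder for the real site score
        if s > best_score:
            best_score, best_pose = s, pose
    return best_score, best_pose

def rank_fragments(fragments, score, steps=12):
    """Rank a {name: coordinates} library by each fragment's best score."""
    results = {name: best_orientation(coords, score, steps)[0]
               for name, coords in fragments.items()}
    return sorted(results.items(), key=lambda item: item[1], reverse=True)
```

The ranked list corresponds to the summary of highest scoring fragments; keeping the best pose for each fragment would also show where it sits in the site.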

The fragment distributions within the binding site are useful for providing chemists with ideas on the sorts of molecules that are likely to bind there, and they can be used for searching small-molecule databases for candidate binding molecules. They can also be the starting points for designing focused chemical libraries targeted for specific proteins.

Substructural search templates can be generated by combining different fragments and constraining the distances between them. The program generates search queries for Unity and ISISBase (MDL, San Leandro, CA). It also is possible to combine the pharmacophore search templates with these substructural searches to carry out virtual screening against commercially available compound libraries.
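The idea of a distance-constrained template can be sketched as follows. The data layout is an assumption made for illustration and is not the Unity or ISISBase query format: a template lists the required fragment types plus an allowed distance range for selected pairs, and a candidate molecule is reduced to labeled fragment positions.

```python
import math
from itertools import permutations

def matches(template, molecule):
    """Return True if the molecule satisfies the template.

    template: {"features": [type, ...],
               "distances": {(i, j): (low, high)}}  # indices into features
    molecule: list of (type, (x, y, z)) fragment positions.
    Both layouts are hypothetical, chosen only for this sketch.
    """
    features = template["features"]
    for combo in permutations(range(len(molecule)), len(features)):
        # Each template feature must map onto a fragment of the same type.
        if any(molecule[m][0] != f for m, f in zip(combo, features)):
            continue
        # Every constrained pair must fall inside its distance range.
        if all(low <= math.dist(molecule[combo[i]][1],
                                molecule[combo[j]][1]) <= high
               for (i, j), (low, high) in template["distances"].items()):
            return True
    return False
```

The exhaustive assignment is acceptable for templates of a few fragments; a production search engine would use indexed queries instead.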

Selecting drugs and targets
A drug molecule is usually designed to act on a specific protein target. So, while chemists work on designing chemically diverse drug candidates, it is the task of the biologists and bioinformaticians to discover new targets against which to test these compounds. However, with the explosion of data from the various genome projects, the numbers of potential targets are likely to explode also. It is clear that an efficient and effective process is urgently required for filtering out the druggable targets from those for which a successful drug is unlikely to be ever designed. In this way, the experimental effort can be directed more efficiently. Only the druggable targets need to be screened against the diverse molecular collections, and these are then less likely to fail the ADME stage of the drug design cycle.

A target protein must be identified before a search for a drug molecule can begin. At present, protein targets for which successful drugs have been developed include proteases, kinases, G protein-coupled receptors, and nuclear hormone receptors. These are considered “drug-friendly” protein families because some members have been successfully targeted, although there are also many failures. Thus, many pharmaceutical companies are searching the newly sequenced genomes for novel members of these families, hoping for rich pickings. For example, the total number of human proteases has been estimated to be between 500 and 1100 (5). Alternatively, entirely novel targets can be identified by taking a general approach to evaluating druggability. Identifying homologues of a selected target makes it possible to evaluate potential selectivity problems and to analyze multicomponent pharmacological systems.

The traditional drug design process is largely driven by the instincts, intuition, and experience of pharmaceutical scientists. For most drugs, oral ingestion is the preferred route of administration. Researchers have therefore sought to delineate the physicochemical properties that favor intestinal absorption and to develop computational methods for predicting these properties. The best known method is the “Rule of Five” (ROF) devised by Lipinski and co-workers at Pfizer (6). They used experimental and computational approaches to estimate solubility and permeability in known drugs. The ROF predicts that poor absorption or permeation is more likely for molecules having >5 H-bond donors, >10 H-bond acceptors, a molecular weight (MW) > 500, and ClogP > 5 (or MlogP > 4.15). ClogP and MlogP are two computed estimates of the octanol–water partition coefficient (log P), which indicates a molecule’s hydrophobicity. The ROF has been widely accepted in the pharmaceutical industry as defining the limiting properties of most orally active compounds and hence as a measure of good drug candidates.
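The ROF thresholds translate directly into a simple filter. The sketch below assumes the four descriptors have already been computed by some other tool; the `Descriptors` class and the function name are illustrative and do not refer to any particular package.

```python
from dataclasses import dataclass

@dataclass
class Descriptors:
    """Precomputed properties of one molecule (hypothetical container)."""
    h_bond_donors: int
    h_bond_acceptors: int
    molecular_weight: float
    clogp: float

def rof_violations(d):
    """List which Rule of Five thresholds a molecule exceeds."""
    violations = []
    if d.h_bond_donors > 5:
        violations.append("H-bond donors > 5")
    if d.h_bond_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    if d.molecular_weight > 500:
        violations.append("MW > 500")
    if d.clogp > 5:
        violations.append("ClogP > 5")
    return violations
```

The function reports the violated thresholds rather than a single pass or fail verdict, so how many violations to tolerate remains a choice for the screening team.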

Is it possible to relate the ROF properties of small molecules to the properties of protein–target binding sites? What physical or chemical properties of the binding sites could have direct relationships to the small molecules that are most likely to bind to these sites? Some properties may be easier to understand. For example, the MW of ROF compounds is small, implying that a suitable binding site in the protein structure might be a small pocket. Could it be that there are similar guidelines that can spot which proteins are good targets?

Rule-based filtering
The results generated by Chematica can provide insight into these three questions. The information about the binding site can be interpreted as the complementary molecular surface required of the ligand, substrate, or drug that binds to a given protein. The 3-D mosaic thus represents the minimum requirements that a binding molecule must meet.

We have empirically derived a set of rules from the Chematica database based on a set of known 3-D protein structures that have known binding sites. We used the parameters that defined the binding sites and computationally derived a decision tree that can be used to determine whether a given binding site is druggable. A test protein target can be passed through this empirically derived decision tree, and the result will predict whether the target is druggable or not (that is, whether it is likely to bind ROF compounds).
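Passing a target through a decision tree can be sketched as below. The tree structure, the feature names, and above all the thresholds in `EXAMPLE_TREE` are hypothetical placeholders invented for illustration; they are not the rules derived from the Chematica database, which are not given here.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    druggable: bool

@dataclass
class Node:
    feature: str        # name of a binding-site parameter
    threshold: float
    below: Union["Node", Leaf]        # branch taken when value < threshold
    at_or_above: Union["Node", Leaf]  # branch taken otherwise

def classify(tree, site):
    """Walk the tree using a dict of binding-site parameters."""
    while isinstance(tree, Node):
        if site[tree.feature] < tree.threshold:
            tree = tree.below
        else:
            tree = tree.at_or_above
    return tree.druggable

# HYPOTHETICAL tree: feature names and cutoffs are placeholders only.
EXAMPLE_TREE = Node(
    feature="cleft_volume", threshold=300.0,
    below=Leaf(False),
    at_or_above=Node(
        feature="hydrophobic_fraction", threshold=0.4,
        below=Leaf(False),
        at_or_above=Leaf(True),
    ),
)
```

In practice the tree would be fitted to binding sites of known druggability, using the full set of site parameters rather than the two shown.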

Applying these rules to a selection of known targets shows that they are able to differentiate between good druggable targets and poor ones. For example, the rules predict HMG-CoA reductase, the HIV-1 reverse transcriptase non-nucleoside binding site, HIV-1 protease, and the catalytic domain of tyrosine protein kinase to be viable targets. Indeed, compounds directed at these targets, which include Pravachol and Lipitor for HMG-CoA reductase, Sustiva for HIV-1 reverse transcriptase, and Viracept for HIV-1 protease, are all commercially launched or in clinical trials.

On the other hand, the rules suggest, for example, that the SH2 and SH3 domains of tyrosine kinases are likely to be poor targets, and this appears to be correct. Despite more than 15 years of research effort spent on SH2 and SH3, there are only two compounds in preclinical testing for SH2 and only one compound (with MW > 1000) for SH3.

Molecular structure diagrams make it readily apparent which proteins contain pockets with the right sizes and chemical binding features (pharmacophores) to accommodate drug molecules (Figure 2). A small molecule fits very well in the kinase domain of tyrosine kinase and HMG-CoA reductase. In contrast, the SH2 and SH3 domains of the kinase have relatively small binding sites on the protein surfaces, making them poor targets according to the druggability criteria.

Further refinement and development of these druggability rules are expected to lead to more precise discrimination between highly druggable targets and those that are suboptimal from a medicinal chemistry perspective.

Bioinformatics platforms such as the Biopendium can generate a flood of potential drug targets. By using related protein structures as stepping stones to cheminformatics analysis, it is possible to prioritize the targets. Those targets considered most likely to be susceptible to orally active drugs should be considered for further research.


References
  1. Fagan, R.; Swindells, M.; Overington, J.; Weir, M. Trends Biochem. Sci. 2001, 26, 213–214.
  2. Laskowski, R. A. J. Mol. Graphics 1995, 13, 323–330.
  3. Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. Nucleic Acids Res. 2000, 28, 235–242.
  4. Laskowski, R. A.; Thornton, J. M.; Humblet, C.; Singh, J. J. Mol. Biol. 1996, 259, 175–201.
  5. Southan, C. FEBS Lett. 2001, 498, 214–218.
  6. Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Adv. Drug Delivery Rev. 1997, 23, 3–25.

A. W. Edith Chan is head of molecular design at Inpharmatica Ltd. (60 Charlotte St., London W1T 2NU; 44-20-7074-4642). Her Ph.D. study was on theoretical, solid-state, and physical chemistry in Roald Hoffmann’s laboratory at Cornell University (Ithaca, NY). She has 10 years’ pharmaceutical experience in the areas of molecular modeling, structure-based drug design, combinatorial chemistry, and cheminformatics. She worked on peptide and protein chemistry at Italfarmaco (Milan, Italy). After that, she joined Selectide, Hoechst Marion Roussel’s combichem center, and was involved in various drug discovery programs and the design and analysis of combichem libraries. At present, she is developing Chematica, a database that bridges bioinformatics and cheminformatics.

Malcolm P. Weir is chief executive officer of Inpharmatica Ltd. He obtained his Ph.D. in chemistry at Imperial College (London). He joined Glaxo as an expert in protein structure and function and eventually became the director of the Molecular Sciences Division. At Glaxo Wellcome, he pioneered the application of structural biology to drug discovery, resulting in the advancement of several clinical drug candidates. He played a leading role in developing that company’s global scientific computing and proteomics strategies and managed the target identification portfolio at Glaxo Wellcome U.K. Research. At present, he serves as an expert adviser on structural biology for the Biotechnology and Biological Sciences Research Council and several universities. He was elected Visiting Professor of Biochemistry at Imperial College (London) in 1997 and was appointed CEO at Inpharmatica in June 2000.
