Modern Drug Discovery (ACS Publications)
FOCUS: Molecular Modeling
Feature Article

Nov/Dec 2000, Volume 3, No. 9, pp. 49–50, 53–54.

Survival of the Fittest in Drug Design

By Michael J. Felton

One of the cornerstones is the use of genetic algorithms in producing new molecules.

Public schools may still debate the teaching of evolution, but scientists are using evolutionary theory to allow computers to invent new molecules. Computing may not sound like the most likely place to use evolution, but the combination may help treat disease. Computers may not appear to be very “biological,” but they can be instructed to mutate and sexually combine virtual molecules or other data. In addition to producing new molecules from various combinations of existing molecules, the computer can apply equations that enforce an evolutionary “survival of the fittest” and remove the less-fit candidates.

One of the most promising applications of this biologically based computer technique is in computer-aided molecular design (CAMD). This field has been called many things in different disciplines, but in general, it is the design of new molecules based on desired properties. In pharmaceutical development, this effort is focused on modeling the drugs and the biological receptors that the drugs bind to so that better binding, and therefore, more potent or precise drugs, can be developed. Evolutionary techniques can help achieve the design of totally new molecules, some that were never even thought of before.

Before evolutionary techniques can be applied in CAMD, two things must be known to some extent: the properties that are desired and how they relate to a molecule’s structure. The structure–activity relationship is needed to both determine the necessary properties and build a molecule that has those properties. This can be described as a structure-to-property stage and a property-to-structure stage (see Figure 1). The first stage is determining the properties of a molecule based on its structure. The second stage is building a structure based on desired properties. It’s like telling a team of Martian engineers to build a better car. First, the aliens must figure out what a car does and how it works, and then they can build a new one based on what was learned.


Figure 1. The role of evolutionary techniques is shown in the large computer-aided molecular design scheme. Source: Reference 1.


The first stage, originally called the “forward problem” as conceived by Venkat Venkatasubramanian of Purdue University (1), is essential for designing molecules by computer. Early attempts at approximating the properties of a molecule based on structure were successful for compounds with properties that varied in a direct way with their structure. However, this relationship soon becomes nonlinear and nonintuitive as more complicated molecules are modeled.

Pharmaceutical drug development using computer design techniques inherently requires complex molecular modeling. Drugs, or ligands, vary from the simple to the complex, but the receptors are usually extremely complex. Drug development is further complicated by the fact that it is a bimolecular system composed of the biological receptor and the ligand (drug or protein). Many CAMD systems choose to predict the properties of either the ligands that operate on the receptor or the receptor itself, but not usually both.

Systems that depend on the structure of a biological receptor face extreme difficulties. Many receptors are membrane-bound, making it extremely difficult to determine their three-dimensional structure by NMR or X-ray crystallography. Only a few thousand receptors are found in the protein data banks, and the structures may be based on the receptor being in water or organic solvents, which may prove to be an incorrect assumption. To complicate matters further, receptors may change shape as they bind, a process called induced fit. If this is the mechanism of how a receptor and ligand bind, then the three-dimensional structure of the unbound receptor may be of slim relevance. Three-dimensional structures can be produced in which the receptor is bound to the ligand; however, this adds significant time and complexity to the procedure.

In addition to the difficulties facing the determination of structures, the shortcomings of computational modeling abilities are apparent for both receptors and ligands. According to a patent from Axys Pharmaceuticals, Inc. (U.S. Patent 6081766; 2000), “Existing methods for constructing predictive models are unable to model steric interactions accurately, particularly when these interactions involve large regions of the molecular surface.” Likewise, quantitative structure–activity relationship (QSAR) techniques are rather accurate on a small scale, determining properties of specific regions but failing to produce an accurate global description of the molecule. Substituent models also incur problems when used on a different base molecule than the one for which the empirical data were measured.

If functional areas that might participate in binding are identified with QSAR properties, then those regions can be weighed heavily in determining the molecule’s activity. This is a pharmacophore and can be used to characterize the whole molecule. Pharmacophore models attempt to combine some of the advantages of QSAR techniques with the idea of identifying substituents.

Although the above descriptions show many possible shortcomings of the present techniques, advances in three-dimensional QSAR have led to better characterization of molecules and better calculation of their properties. Increases in computer power allow more data points to be generated for surface properties, so that current functions estimate macroproperties more closely than ever before.

The second half of the challenge in developing a new molecule is to take the properties that you know are important and determine the structure that will have them. This has proven much more difficult than it sounds, and it depends on the first structure-to-properties part.

Various computational techniques have been used in the past to develop structures, but many have been unsuccessful. Both strictly computational and rule-based equations have encountered difficulty. These techniques “suffer from drawbacks due to combinatorial complexity of the search space, design knowledge acquisition difficulties, nonlinear structure–property correlations, and problems incorporating higher-level chemical and biological knowledge” (1). The difficulties that scientists face in using these mathematical- and reason-based methods have led to the search for better methods of developing new structures.

Evolutionary computing
The use of evolutionary techniques to create new molecules from properties largely reduces the difficulties presented by other methods of assembly. In general, two functions are used together to generate the evolutionary technique: genetic algorithms and fitness functions. Genetic algorithms produce mutations and crossover, something like sexual reproduction, to produce new species of data. The fitness function rounds out the evolutionary system by testing all the data and eliminating the poor performers, just like survival of the fittest.

Genetic algorithms
Genetic algorithms (GAs) allow computers to generate new, random data. When combined with a fitness function, only the best data are randomly exchanged and mutated until a new generation of data is produced. As with real genes, genetic algorithms manipulate so-called genetic material, but instead of DNA, this genetic material is some other linear string of symbols. In humans, the genetic material is DNA, which is thought of in terms of its base pairs. A string of three base pairs makes up a codon, which indicates the placement of a specific amino acid within a protein. By analogy, the computer replaces base pairs and codons with symbols or numbers that can represent any structural unit. For instance, a researcher who wants to alter the structure of a small molecule could use symbols to represent the atoms or functional groups of potential molecules: N could indicate a nitrogen atom, or [Carb] could stand for a carboxylic acid group. Often, to promote faster computing, larger units are used, such as functional groups in place of individual atoms, rather than carrying all the complexity of the entire structure.
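As a minimal sketch of this kind of encoding, the Python below splits a genome string into structural-unit symbols, treating a bracketed token such as [Carb] as a single gene. The symbol table and the function name tokenize are illustrative assumptions, not the format of any particular program mentioned in this article.

```python
import re

# Hypothetical symbol table: each gene names one structural unit.
SYMBOLS = {
    "C": "carbon atom",
    "N": "nitrogen atom",
    "O": "oxygen atom",
    "[Carb]": "carboxylic acid group",
}

def tokenize(genome):
    """Split a genome string into genes; a bracketed token counts as one gene."""
    tokens = re.findall(r"\[[^\]]+\]|[A-Za-z]", genome)
    unknown = [t for t in tokens if t not in SYMBOLS]
    if unknown:
        raise ValueError(f"unknown symbols: {unknown}")
    return tokens

genes = tokenize("CCN[Carb]")
for gene in genes:
    print(gene, "->", SYMBOLS[gene])
```

Using a functional group as a single gene keeps the string short, which is the speed trade-off described above.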

When GAs are used in computers, the genetic material often differs between researchers or companies. One company may use numbers to designate substituent groups, so mutations would change which substituents are used. Others may use letters to indicate local molecular properties, for example, P for a region that is partially positive and N for one that is partially negative. Different GA techniques use different methods to convert the data into suitable genetic material; however, what happens to the material is consistent.

Figure 2. Genetic algorithms work by crossing over linear strings of data, such as the letters shown here. Source: Reference 2.


Figure 3. Similar to genetic algorithms, genetic programming crosses over tree structures. Source: Reference 2.


Figure 4. Genetic graphs manipulate molecules with two atoms represented as vertices, and the bond in between as an edge. Source: Reference 2.

The GA causes either mutation, crossover, or both. Mutation is the random changing of individual genetic symbols. For example, a GA could change the R in the string HGTZREU to a Q, and the new molecule would then be measured by the fitness function. Crossover is analogous to sexual reproduction, but it is a much cruder version. The GA randomly cuts the strings of two different individuals in two, as seen in Figure 2. The half strings then switch so that there are two progeny, each with one half from each parent. The children and the parents are then evaluated by the fitness function: The successful progeny and parents go on to be manipulated again by the GA.
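The two operators can be sketched in a few lines of Python. This is an illustration only: the 26-letter alphabet, the function names, and the choice to cut both parents at the same position (one-point crossover) are assumptions made for the example.

```python
import random

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # assumed gene alphabet

def mutate(genome, rng):
    """Replace one randomly chosen symbol with a different symbol."""
    i = rng.randrange(len(genome))
    choices = [s for s in ALPHABET if s != genome[i]]
    return genome[:i] + rng.choice(choices) + genome[i + 1:]

def crossover(a, b, rng):
    """One-point crossover: cut both parents at the same point and swap tails."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("parents must have equal length of at least 2")
    cut = rng.randrange(1, len(a))  # never cut at the very start or end
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

rng = random.Random(0)  # fixed seed so the run is repeatable
print(mutate("HGTZREU", rng))
print(crossover("HGTZREU", "ABCDEFG", rng))
```

Each child keeps the head of one parent and the tail of the other, so every position in the two children still holds the same pair of symbols the parents had there.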

In addition to GAs, there are genetic programs, which operate on data in a tree format, and genetic graphs, which are based on graph theory. Figures 3 and 4 illustrate the use of genetic programs and genetic graphs on molecules. Graph data for a molecule consist of a group of atoms (the vertices) held together by bonds (the edges). Globus and colleagues at the National Aeronautics and Space Administration have tested this system against a set of already known compounds, including cholesterol, morphine, and diazepam.
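To make the graph representation concrete, here is a small Python sketch that stores a molecule as vertices (atoms) and edges (bonds). The data layout and the helper names neighbors and is_connected are assumptions chosen for illustration; they are not taken from the NASA software.

```python
# Heavy-atom graph of ethanol (CH3-CH2-OH); hydrogens are left implicit.
atoms = {0: "C", 1: "C", 2: "O"}   # vertices: atom index -> element symbol
bonds = {(0, 1): 1, (1, 2): 1}     # edges: (atom, atom) -> bond order

def neighbors(atom, bonds):
    """Return the sorted indices of atoms bonded to `atom`."""
    found = set()
    for a, b in bonds:
        if a == atom:
            found.add(b)
        elif b == atom:
            found.add(a)
    return sorted(found)

def is_connected(atoms, bonds):
    """Check that every atom is reachable from the first one (a single molecule)."""
    if not atoms:
        return True
    start = next(iter(atoms))
    seen, stack = {start}, [start]
    while stack:
        for n in neighbors(stack.pop(), bonds):
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return len(seen) == len(atoms)

print(neighbors(1, bonds))
print(is_connected(atoms, bonds))
```

A connectivity check like this is one simple way to notice when cutting and rejoining graphs has left a molecule in separate pieces.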

Survival by fitness
With the proper equations, data can be tested to determine the best-performing member of any set. Then, as in nature, the fittest remain and the weak become extinct. According to Globus, “The key to any genetic software solution is a good fitness function. . . . This function must be very robust since the randomly generated initial molecules rarely make chemical sense” (2). The first step, however, is to convert the genetic representation of the molecules into the molecular structure, similar to the way proteins are made based on DNA and RNA. Once the molecules are built, the fitness function is applied. For drug discovery, the fitness function is the same as functions that estimate the docking abilities of ligands and a receptor. The function ranks the molecules; the poorest docking compounds are removed; and the remainder are modified genetically.
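The rank, remove, and modify cycle can be sketched as follows. Note the loud assumption: the fitness function here simply counts symbols that match a made-up target string, a toy stand-in for a real docking estimate, and the names TARGET, fitness, and next_generation are hypothetical.

```python
import random

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # assumed gene alphabet
TARGET = "HGTZQEU"  # toy stand-in for "what docks well"; purely illustrative

def fitness(genome):
    """Toy score: number of positions matching TARGET (not a docking score)."""
    return sum(g == t for g, t in zip(genome, TARGET))

def next_generation(population, rng, keep=0.5):
    """Rank by fitness, drop the poorest, refill with mutated survivors."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:max(1, int(len(ranked) * keep))]
    children = []
    while len(survivors) + len(children) < len(population):
        parent = rng.choice(survivors)
        i = rng.randrange(len(parent))
        # The new symbol may occasionally equal the old one (a silent mutation).
        children.append(parent[:i] + rng.choice(ALPHABET) + parent[i + 1:])
    return survivors + children

rng = random.Random(1)  # fixed seed so the run is repeatable
population = ["".join(rng.choice(ALPHABET) for _ in TARGET) for _ in range(20)]
for _ in range(50):
    population = next_generation(population, rng)
best = max(population, key=fitness)
print(best, fitness(best))
```

Because the survivors are carried over unchanged, the best score in the population can never fall from one generation to the next in this sketch.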

Fitness functions are extremely complex and tax the computers that run them. Typically, the functions must not only build the molecules, but also calculate the properties based on what was learned in the first structure-to-properties stage. Then they determine the effect when the molecule with properties attached is placed on the receptor. This is further complicated by the fact that in the three-dimensional world, there are many ways the drug molecule could approach the receptor. Some more recent programs allow for flexibility and even the presence of water between the receptor and the molecule. Unfortunately, these enhancements often take even more computing power. In essence, the entire process is limited by the speed of the fitness function.

The key advantage of genetic algorithms is that they not only use the computer as a tool for modeling and understanding properties, but also enable the computer to develop new structures and determine whether those structures serve a specific purpose. This allows discoveries of a random, almost accidental nature. Accidents made by humans have led to some of the greatest scientific discoveries of all time. Venkatasubramanian said, “One of the advantages [of evolutionary techniques] was not so much that it may have produced an explicit structure which was an immediate winner, but it forced them [the researchers] to think outside the box.” Although Venkatasubramanian was discussing engineering materials in this case, the same could be true of pharmaceuticals. Genetic algorithms show the potential for the discovery of unthought-of structures that would normally not be considered by researchers. This advantage may increase the yield of discovery, dramatically influencing the development of pharmaceuticals in the coming decades.

Companies using genetic algorithms
Currently, most major pharmaceutical companies use rational drug design and evolutionary techniques such as genetic algorithms as part of the drug discovery process. Several small pharma companies have also made innovations in this area. Companies such as Tripos, Daylight Chemical Information Systems, Axys Pharmaceuticals, and Nanodesign, among others, market products and services to small and large pharmaceutical and biotech companies that are based on genetic algorithms.

Tripos has developed several software products and services using genetic algorithms. Within its SYBYL program, the company offers a genetic algorithm-based conformational search tool for exploring the three-dimensional shapes that molecules attain. The GASP algorithm measures similarity and overlays molecules based on matches between their hydrogen bond donors/acceptors, hydrophobes, etc. The company has also been using genetic algorithms on binding problems. Flexidock uses a genetic algorithm to fit ligands into protein binding sites and accounts for conformational flexibility. Tripos is also working with collaborators to develop new tools, such as a de novo ligand design toolkit based on evolutionary principles under development at the University of Texas–Austin.

Daylight Chemical Information Systems, Inc., owns a 1993 patent on an evolutionary technique that uses SMILES strings as genetic material and performs fitness functions using comparative molecular field analysis (CoMFA). (SMILES, or simplified molecular input line entry system, is a method of representing a molecule as a linear string of characters.) Axys Pharmaceuticals also has a patent describing a neural network-based approach to modeling activity and designing new compounds.
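A short Python sketch shows why a linear SMILES string is convenient genetic material, and why cutting one at an arbitrary point needs care. The three SMILES strings are standard notations for well-known molecules; the helper parentheses_balanced is a hypothetical check that is necessary but far from sufficient for validity, and it is not part of the patented method.

```python
# Standard SMILES strings for three familiar molecules.
SMILES = {
    "ethanol": "CCO",
    "acetic acid": "CC(=O)O",
    "benzene": "c1ccccc1",
}

def parentheses_balanced(smiles):
    """Return True if every branch opened with '(' is closed with ')'."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

# Cutting acetic acid inside its branch leaves two unbalanced halves.
parent = SMILES["acetic acid"]
head, tail = parent[:4], parent[4:]
for s in (parent, head, tail):
    print(s, parentheses_balanced(s))
```

Fragments like these illustrate the point made by Globus above: randomly generated strings rarely make chemical sense, so invalid candidates must be repaired or scored poorly.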

Nanodesign has developed technology to provide novel chemical structures as candidates for the pharmaceutical industry. It uses a computational drug design process called Evolutionary Molecular Design (EMD). This process identifies the activity that is required for the receptor and focuses on the ligand structure rather than the receptor structure. Once the required activity has been identified, a molecular assembler is used to find structures that meet those requirements. The most interesting facet of Nanodesign’s approach is the use of genetic algorithms to generate both the virtual receptor and the new ligands. Therefore, Nanodesign’s EMD uses genetic algorithms in both the forward and reverse problems.

Obviously, these companies are striving for new drug developments by using computational methods that achieve greater accuracy than could previously be obtained using older drug development tools.


  1. Venkatasubramanian, V., et al. Computer-Aided Molecular Design Using Neural Networks and Genetic Algorithms. In Genetic Algorithms in Molecular Modeling; Devillers, J., Ed.; Academic Press: New York, 1996.
  2. Globus, A., et al. Automatic molecular design using evolutionary techniques. Nanotechnology 1999, 10 (3), 290–299.

