Modern Drug Discovery (ACS Publications)
FOCUS: Molecular Modeling
Feature Article

Nov/Dec 2000, Volume 3, No. 9, pp. 41–42, 45–46.

Going For Fold In Asilomar

By Leela Holliman

Computational Biologists Test Their New Algorithms For Predicting Protein Structure.

The Olympic dozen look down from their lofty lair and smile. The shots have been put away, the hammers thrown in the closet, and the swim caps, goggles, and nose plugs locked in their lockers. The medals have been awarded, the anthems have been played, and the athletes have returned home to a hero’s welcome. The 27th quadrennial Olympiad of modern times has ended, and life slowly returns to normal in the coastal city of Sydney.

As you read this article, another aspiring panel (scientists, not gods) looks down on the idyllic Asilomar Conference Grounds and smiles. The probabilities have been calculated, the predictions logged, and the fingers crossed. The results will be announced, the algorithms will be praised, and the computational athletes will return to their labs to rejoice. The 4th biennial Critical Assessment of Techniques for Protein Structure Prediction (CASP4) is coming to a close, and life is anything but normal in the coastal city of Pacific Grove, CA.

Opening ceremonies
The mantra of the postgenome era has become “All of the sequences in the world won’t do me any good if I don’t know what they do.” Presently, the biggest challenge in managing the data generated by the genome projects is cataloging and cross-referencing it in an array of databases, to provide an efficient way to access or identify a given sequence.

However, the pharmaceutical and biotech industries are now starting to ask more pressing questions: What does this sequence do in the cell? What does its protein look like? What effects do mutations have? Can we design a drug that will affect it? This last question is the one that is driving industry to explore structure prediction.

Determining the answers to these questions and many others is a painstakingly slow process that is often limited by the calculation of the protein’s structure. Currently, protein structures are determined by X-ray crystallography or NMR spectroscopy. But the inherent limitations of these techniques can result in a period of months or even years between a laboratory obtaining the amino acid sequence and solving the protein’s structure. Researchers began looking for a better way.

As protein structures were solved, certain patterns were observed, such as the location of hydrophobic residues within the protein’s core and hydrophilic residues on its surface. Similarities were also noticed between proteins that performed the same function in different organisms. Guided by protein structures, researchers developed computer programs to analyze a sequence and apply patterns and rules to it to predict a structure.

This method, although promising, proved more difficult than originally thought. Within a few years, the first reports of solving this “protein folding problem” were published. The problem hadn’t been solved, however. A pattern emerged with papers containing similar claims that, upon investigation, were not provable. A way to verify these claims was needed.

Rosetta.

David Baker and his group at the University of Washington in Seattle take a two-step approach to structure prediction. The group initially predicts local folds using threading or multiple alignments, and then links these domains using energy functions calculated from criteria such as hydrophobic burial, backbone hydrogen bonding, and electrostatics. Baker has romantically named his method Rosetta (Simons, K. T., et al. Proteins Suppl. 1999, 3, 171–176).

Together, John Moult of the University of Maryland Biotechnology Institute (Rockville) and Jan Pedersen of Acadia Pharmaceuticals (Glostrup, Denmark) tackled this problem and came up with CASP, first held in December 1994. CASP was designed not as a competition but as a large-scale experiment. Its purpose was to identify effective methods of structure prediction, determine any bottlenecks to progress, and form collaborations with other researchers. Critical to its spirit, entrants were encouraged to focus on the problem of structure prediction rather than on competition.

The overall format of CASP has remained consistent since 1994. Between 30 and 45 amino acid sequences for soon-to-be-solved experimental structures, generously donated by NMR and X-ray labs, are listed, and researchers are invited to submit theoretical models of each protein’s structure. Independent assessors then judge the predictions by comparing them with the solved structure. The results are tallied and subsequently announced at the CASP meeting.

Although the format has not changed significantly since its inception, the number of participants and predictions certainly has, rising almost exponentially over the years—from 136 models at CASP1 to more than 11,000 at CASP4. Another change is in scoring. Some participants believed that the scoring techniques used in CASP1 heavily favored predicted folds that, although possible, were nonetheless wrong. Beginning with CASP2, scores were based more heavily on quantifiable measurements, such as how well each residue fits within the entire model and local folds, and how accurately the predicted structure contacts neighboring residues. Additional changes were implemented in CASP3, and organizers thought that these methods best determined the accuracy and fit of a predicted model to the structure.
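One of the measurements mentioned above, how well a model reproduces the contacts between neighboring residues, can be sketched in a few lines. This is only an illustration, not the scoring code used by the CASP assessors: the function names, the 8 Å cutoff, the minimum sequence separation, and the toy coordinates are all assumptions made for the example.

```python
import math

def contacts(coords, cutoff=8.0, min_separation=3):
    """Return the set of residue pairs (i, j) whose points lie within `cutoff`."""
    pairs = set()
    for i in range(len(coords)):
        # Skip near neighbors in sequence, which are always close in space.
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                pairs.add((i, j))
    return pairs

def contact_recovery(predicted, experimental, cutoff=8.0):
    """Fraction of contacts in the experimental structure that the model reproduces."""
    if len(predicted) != len(experimental):
        raise ValueError("model and structure must have the same number of residues")
    native = contacts(experimental, cutoff)
    if not native:
        return 0.0
    return len(native & contacts(predicted, cutoff)) / len(native)

# Hypothetical toy coordinates (one point per residue), not real protein data.
hairpin = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (7.6, 3.8, 0), (3.8, 3.8, 0), (0, 3.8, 0)]
extended = [(3.8 * i, 0, 0) for i in range(6)]

print(contact_recovery(hairpin, hairpin))   # a perfect model recovers every contact
print(contact_recovery(extended, hairpin))  # a fully extended chain recovers none
```

A score of this kind rewards a model for getting the right residues next to each other, even if the overall superposition is imperfect.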

The participants
At CASP, the athletes are computational algorithms that determine the probabilities of, and give weights to, residues modeled against one or more aligned sequences and structures. What follows gives a flavor of some of the questions asked at CASP and how some groups have attempted to answer them.

Blasted alignments. For many researchers, the first step in identifying a protein is comparing its sequence with other sequences in the databases. Similarly, determining a protein’s structure from its sequence requires comparison with other sequence–structure relationships. Much of the problem stems from deciding the cutoff points of sequence identity and similarity, and how many sequences to compare.

Threading model of MOb2.

Stephen Bryant’s group at the National Center for Biotechnology Information in Bethesda, MD, suggested that the mouse obese gene product was a cytokine by threading its sequence onto an interleukin-2 backbone (blue worms with yellow gaps). The tubes indicate pairwise contact energies between Cα residues, where thick magenta is most favorable (Madej, T.; Boguski, M. S.; Bryant, S. H. FEBS Letters 1995, 373, 13–18). (Courtesy of Stephen Bryant.)

The Basic Local Alignment Search Tool (BLAST) is a set of similarity search programs designed to explore all of the available sequence databases. Unlike programs that look for global alignments, BLAST uses a heuristic algorithm that seeks local alignments, detecting relationships among sequences with only isolated regions of similarity (1). Specific BLAST variations make allowances for sequence insertions and deletions (Gapped-BLAST), allow iterative cycles of searching (PSI-BLAST), and allow the predefinition of sequence motifs within local alignments (PHI-BLAST) (2, 3).
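To make the idea of a local alignment concrete, the sketch below uses the Smith-Waterman dynamic programming algorithm, the exact method that BLAST's heuristic approximates for speed. It is not BLAST itself. The scoring values (match, mismatch, and a linear gap penalty) and the two toy sequences are assumptions chosen for illustration; real searches use substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix; a cell never drops below zero, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + substitution(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# Two hypothetical sequences that share only an isolated region of similarity.
print(smith_waterman("MKTAYIAKQR", "GGAYIAKPP"))
```

The two toy sequences differ at both ends, yet the shared stretch is still found, which is the behavior a global alignment would obscure.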

To improve the accuracy of multiple alignments, Kevin Karplus and colleagues at the University of California–Santa Cruz developed SAM-T98, which used hidden Markov models to assign probabilities of amino acid position and identity in database searches (4). Each residue position in the alignment model has three states—matching, insertion, and deletion—and neighboring positions are connected by an associated transition probability, the chance of going from one state to the other. Thus, one can ask whether a given residue will score higher at the end of one sequence or the beginning of the next.
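The three-state picture can be illustrated with a toy model. The sketch below is not SAM-T98 and does not use its file formats or parameters; every probability, the three-position model, and the function name `path_log_prob` are hypothetical. It simply multiplies transition and emission probabilities (as a sum of logarithms) along one chosen path of match (M), insert (I), and delete (D) states.

```python
import math

# Toy transition probabilities between state types (hypothetical values).
# Direct insert-to-delete and delete-to-insert moves are left out of this toy model.
TRANSITIONS = {
    ("M", "M"): 0.90, ("M", "I"): 0.05, ("M", "D"): 0.05,
    ("I", "M"): 0.70, ("I", "I"): 0.30,
    ("D", "M"): 0.80, ("D", "D"): 0.20,
}

# Toy match-state emission probabilities for a three-position model (hypothetical).
MATCH_EMISSIONS = [
    {"A": 0.7, "G": 0.2, "L": 0.1},
    {"L": 0.8, "A": 0.1, "G": 0.1},
    {"G": 0.6, "A": 0.2, "L": 0.2},
]
INSERT_EMISSION = 1.0 / 20  # insert states emit any of the 20 amino acids equally

def path_log_prob(sequence, path):
    """Log probability of emitting `sequence` along a `path` of M/I/D states."""
    log_p = 0.0
    position = -1   # index of the current model position
    residue = 0     # index of the next residue to emit
    previous = "M"  # treat the begin state as a match state for simplicity
    for state in path:
        log_p += math.log(TRANSITIONS[(previous, state)])
        if state in ("M", "D"):
            position += 1  # match and delete both advance along the model
        if state == "M":
            log_p += math.log(MATCH_EMISSIONS[position][sequence[residue]])
            residue += 1
        elif state == "I":
            log_p += math.log(INSERT_EMISSION)
            residue += 1
        previous = state
    if residue != len(sequence) or position != len(MATCH_EMISSIONS) - 1:
        raise ValueError("path does not consume the whole sequence and model")
    return log_p

full_match = path_log_prob("ALG", "MMM")    # every residue fits a match state
with_deletion = path_log_prob("AG", "MDM")  # the middle position is deleted
print(full_match, with_deletion)
```

In a real search, an algorithm such as Viterbi would consider all paths and keep the most probable one; scoring a single path is enough to show how the transition probabilities penalize a deletion.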

Secondary considerations. When sequence identities fall below 25%, the jump from primary sequence to tertiary structure is difficult. Instead, attention focuses on the secondary structural elements that make up the protein of interest, in the belief that structure is conserved even when sequence isn’t. Since the 1960s, groups have developed simple secondary structure prediction methods that were based on the statistical probability of a given residue in a given context forming part of a specific fold (5). Although these methods were amazingly accurate, they looked solely at the sequence in question. Recently, however, expanding sequence databases have enhanced prediction accuracy through the alignment of multiple sequences.
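A statistical method of the early, single-sequence kind can be sketched as a sliding-window average. The propensity values below are illustrative numbers chosen for the example, not a published parameter set, and the window size, threshold, and test sequence are likewise assumptions.

```python
# Illustrative helix propensities (hypothetical); unlisted residues count as neutral (1.0).
HELIX_PROPENSITY = {
    "A": 1.4, "E": 1.5, "L": 1.2, "M": 1.4,
    "G": 0.6, "P": 0.6, "S": 0.8, "N": 0.7,
}

def predict_helix(sequence, window=5, threshold=1.05):
    """Mark a residue 'H' when the mean propensity of its window exceeds `threshold`."""
    half = window // 2
    calls = []
    for i in range(len(sequence)):
        # The window is truncated at the ends of the sequence.
        segment = sequence[max(0, i - half): i + half + 1]
        mean = sum(HELIX_PROPENSITY.get(r, 1.0) for r in segment) / len(segment)
        calls.append("H" if mean > threshold else "-")
    return "".join(calls)

print(predict_helix("GPSGAELMAELGPSG"))  # prints ----HHHHHHH----
```

Because the call for each residue depends only on its immediate neighbors in a single sequence, a method like this cannot use the evolutionary information that multiple alignments supply.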

David Jones and colleagues at Brunel University in London developed PSIPRED, a prediction method that incorporates two neural networks to analyze PSI-BLAST results (6). The same group also developed GenTHREADER, a fold recognition program that, when applied to the Mycoplasma genitalium genome, predicted that >50% of coding regions showed a significant relationship to known structures. Unlike most threading methods, however, GenTHREADER uses potential evolutionary relationships to filter out false positive predictions (7).

Tertiary period. But, of course, the name of the game is three-dimensional structure prediction. Although a drug may be designed against a small portion of a protein, it is difficult to model the drug’s effectiveness without a firm understanding of the whole target.

Andrej Šali and colleagues at Rockefeller University in New York created MODELLER, a program that generates three-dimensional structures using spatial restraints (8). Initial restraints include information from homologous structures, such that if the aligned protein has a hydrogen bond between two specific residues, then the sequence being modeled can be expected to have a similar bond between the equivalent residues. Once the initial model has been built, it is tested with other restraints, including experimental data and basic chemical bonding rules.
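The idea of carrying restraints over from a homologous structure can be sketched as follows. This is not MODELLER's interface or its objective function; the function names, the distance cutoff, the squared-deviation penalty, and the toy coordinates are assumptions made for illustration.

```python
import math

def restraints_from_template(template_coords, alignment, max_distance=6.0):
    """Build (i, j, target) distance restraints for the model from a template.

    `alignment` maps each model residue index to its aligned template residue index.
    """
    restraints = []
    model_indices = sorted(alignment)
    for a, i in enumerate(model_indices):
        for j in model_indices[a + 1:]:
            target = math.dist(template_coords[alignment[i]],
                               template_coords[alignment[j]])
            if target <= max_distance:
                restraints.append((i, j, target))
    return restraints

def violation(model_coords, restraints):
    """Sum of squared deviations between model distances and their targets."""
    return sum((math.dist(model_coords[i], model_coords[j]) - target) ** 2
               for i, j, target in restraints)

# Hypothetical template with four residues; model residue 2 aligns to template residue 3.
template = [(0, 0, 0), (4, 0, 0), (4, 4, 0), (0, 4, 0)]
alignment = {0: 0, 1: 1, 2: 3}
restraints = restraints_from_template(template, alignment)

good_model = [(0, 0, 0), (4, 0, 0), (0, 4, 0)]
bad_model = [(0, 0, 0), (4, 0, 0), (8, 0, 0)]
print(violation(good_model, restraints), violation(bad_model, restraints))
```

A model builder would then adjust the coordinates to reduce the total violation while also satisfying the chemical restraints mentioned above.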

Energy minima in protein folding.

Anfinsen’s thermodynamic hypothesis suggests that the three-dimensional structure of a native protein under physiological conditions is one in which the free energy of the system is the lowest (Anfinsen, C. B. Science 1973, 181, 223–230). Thus, while proteins may sample other local minima during folding, there is only one true minimum-energy fold.

Threading, alluded to earlier, is a mechanism to address the alignment of two sequences that have <30% identity and are typically considered nonhomologous. Essentially, one fits, or threads, the unknown sequence onto the known structure and evaluates the resulting structure’s fitness using environment- or knowledge-based potentials. This was the basis of David Jones’s THREADER method (9).
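A stripped-down version of the threading idea is sketched below. It is not THREADER: the two-class environment (buried or exposed), the energy values, the residue grouping, and the gapless sliding are all simplifying assumptions, and real methods use gapped alignments and pairwise potentials derived from known structures.

```python
HYDROPHOBIC = set("AVLIMFWC")  # simple illustrative grouping of residues

def environment_energy(residue, buried):
    """Toy potential: favor hydrophobic residues buried and polar residues exposed."""
    return -1.0 if (residue in HYDROPHOBIC) == buried else 1.0

def thread(sequence, burial_pattern):
    """Slide the sequence along the template without gaps.

    Returns (lowest energy, offset), or None if the sequence is longer than the template.
    """
    best = None
    for offset in range(len(burial_pattern) - len(sequence) + 1):
        energy = sum(environment_energy(res, burial_pattern[offset + k])
                     for k, res in enumerate(sequence))
        if best is None or energy < best[0]:
            best = (energy, offset)
    return best

# Hypothetical template: True marks a buried position, False an exposed one.
template_burial = [False, False, True, True, False, True, False, False]
print(thread("LVKI", template_burial))  # prints (-4.0, 2)
```

The lowest-energy placement puts the three hydrophobic residues in buried positions and the lysine at the surface, which is the kind of fit an environment-based potential is designed to reward.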

From scratch. As we have seen, many model builders use structural and biochemical information to enhance and test their models, but there is a subset of researchers who are adamant that all of the information required for proper folding is contained in the amino acid sequence. They argue that newly synthesized polypeptides are unlikely to require a sibling protein before knowing how to fold correctly. This intrepid group of researchers is looking at ab initio folding.

Michael Levitt and his group at Stanford University School of Medicine (Stanford, CA) use a two-step process. Initially, they generate a large number of potential folds using various algorithms, in the hope that some of them will approximate the native fold. They then discriminate among the folds chosen most often by looking for problems in parameters such as distance geometry and hydrophobic packing (10).

Harold Scheraga and his group at Cornell University in Ithaca, NY, are more daring. They are determined to derive protein structures strictly from energy functions. To simplify the process, they first represent the individual residues as single interaction sites, which simplifies the energy calculations. Upon creating an initial conformation, they return the residue’s atoms and look for a distinct backbone fold using their conformational space annealing method (11). This method allowed the group to locate the native folds of several proteins within the subsets of low-energy conformations.

The finish line
Although progress has been made between CASP1 and CASP3, success is coming at a modest pace. Fold recognition has advanced, with more participants successfully identifying an increasing number of folds and more groups producing good-quality models for moderately difficult targets. Unfortunately, false positives, in which programs incorrectly identify similarities between proteins, are an ongoing concern. In three-dimensional structure modeling, loop prediction, side-chain accuracy, and structure alignment have improved. And the problem of unambiguously identifying one true fold instead of a list of possibilities must be addressed. After CASP3, the assessors observed a limit on the ability of current methods to recognize remotely related folds. But strides were made in the ab initio category, with several CASP3 groups producing “reasonably accurate” models of fragments around 60 residues in length. How things have progressed since CASP3 will be determined by the CASP4 results, which will be announced in December at Asilomar and in a special issue of Proteins.

Net results
For more information on the methods mentioned in the article, visit the Web sites of the developers.

Opening ceremonies
CASP; http://PredictionCenter.llnl.gov.

Blasted alignments
NCBI; BLAST/Gapped-/PSI-/PHI-; www.ncbi.nlm.nih.gov/Tools/index.html.
Kevin Karplus, developer; SAM-T98; www.cse.ucsc.edu/research/compbio.

Secondary considerations
David Jones, developer; PSIPRED; http://insulin.brunel.ac.uk/psipred.
David Baker, developer; Rosetta; http://depts.washington.edu/bakerpg.

Tertiary period
David Jones, developer; Threader; http://insulin.brunel.ac.uk/threader/threader.html.
Andrej Šali, developer; MODELLER; http://guitar.rockefeller.edu.
Stephen Bryant, developer; www.ncbi.nlm.nih.gov/Structure/RESEARCH/res.shtml.

From scratch
Michael Levitt, developer; http://csb.stanford.edu/levitt.
Harold Scheraga, developer; www.chem.cornell.edu/department/Faculty/Scheraga/scheraga.html.

The cheers, tears, and jeers

CASP is a unique opportunity for a community-wide experiment. It has provided an objective test of researchers’ prediction methods and has fostered numerous collaborations. It is especially valuable to newer members of the field. Along with the positive aspects of CASP, however, there are some negative ones.

Stephen Bryant, who did not participate in CASP4, expressed his concern that too much time was being spent on method development and not enough on practical applications. “Not only are [the participants] duplicating each others’ efforts with respect to trying to predict the structures of these targets, but they are also toy problems because these are proteins where the real structure is about to be known, and so anything you predict is not going to be of any practical use,” he offers. “So it’s useful only for methodology testing; it’s not useful in terms of advancing scientific knowledge.”

Kevin Karplus disagrees. Although admitting that participation takes a lot of time and effort, he believes that the conference is tremendously valuable and adds a level of reality to the research that is not found in the literature. “It’s nice to have somebody outside saying ‘Here is a structure for blind prediction. If you’re doing prediction, participate in this or we’re not going to listen to you,’” Karplus says. “There’s been a large exchange of ideas. People look at methods that might not otherwise have been considered because they were outside the particular methods they were using.”

How close are these computational groups to producing an algorithm that will accurately and consistently predict protein structures from DNA sequences located in databases throughout the world? Don’t expect this method to replace experimental structure techniques just yet. However, the CASP organizers believe that the current prediction techniques can aid in the design of experiments and also provide some level of insight into a protein’s molecular function.

References

  1. Altschul, S. F., et al. J. Mol. Biol. 1990, 215, 403–410.
  2. Altschul, S. F., et al. Nucl. Acids Res. 1997, 25, 3389–3402.
  3. Zhang, Z., et al. Nucl. Acids Res. 1998, 26, 3986–3990.
  4. Park, J., et al. J. Mol. Biol. 1998, 284, 249–251.
  5. Schulz, G. E.; Schirmer, R. H. Prediction of Secondary Structure from the Amino Acid Sequence. In Principles of Protein Structure; Cantor, C. R., Ed.; Springer-Verlag: New York, 1979; pp 108–130.
  6. Jones, D. T. J. Mol. Biol. 1999, 292, 195–202.
  7. Jones, D. T. J. Mol. Biol. 1999, 287, 797–815.
  8. Šali, A.; Blundell, T. L. J. Mol. Biol. 1993, 234, 779–815.
  9. Threader (http://insulin.brunel.ac.uk/threader/threader.html).
  10. Samudrala, R., et al. Proteins Suppl. 1999, 3, 194–198.
  11. Lee, J.; Liwo, A.; Scheraga, H. A. Proc. Natl. Acad. Sci. 1999, 96, 2025–2030.