ANY NEW PROTEOMICS TECHNIQUES OUT THERE?
Genome project head says many technologies for a human proteome project are not yet ready
STU BORMAN, C&EN WASHINGTON
Few people would claim that the Human Genome Project, the international effort to sequence the human genome, was just a walk in the park. Nevertheless, proteomics--the study of all proteins expressed by a genome--"is the really hard part" of understanding how organisms work at a molecular level, says Francis S. Collins, director of the National Human Genome Research Institute (NHGRI) and a leader of the Human Genome Project.
"The work that has been done with genome sequencing may turn out to have been trivial by comparison with the challenge we now face trying to understand proteins on a grand scale," he says.
The precise meaning of "proteomics" is still a bit cloudy. But what's certain is that proteomics is distinctly different from genomics.
"The genome is static. It's the blueprint, and no one lives in the blueprint of their house," explains Ruth A. Van Bogelen, head of genomics and proteomics at Pfizer, Ann Arbor, Mich. "Proteomics tells us about the biology--how it feels to live in the house: Does the heat and air conditioning work? Is there enough water pressure? Will the house stand up in a tornado? And so on." Proteomics studies include "what cells and organisms do, how they respond, what disease looks like, how drugs reverse disease--all the parts of cells and how they work."
"PROTEOMICS IS many different things," says Leroy Hood of the Institute for Systems Biology, Seattle, who in the 1980s led a California Institute of Technology team that invented the automated gene sequencer. "It's being able to identify complex mixtures of proteins; it's being able to quantitate their behavior in interesting biological systems; it's looking at their modifications and how that changes their biological activity; it's looking at their interactions, their compartmentalization, their turnover times--all of these kinds of things."
Collins says he likes the proteomics definition given by genetics and medicine professor and Howard Hughes Medical Institute investigator Stanley Fields of the University of Washington, Seattle, in Science [291, 1221 (2001)]: "the analysis of complete complements of proteins. Proteomics includes not only the identification and quantification of proteins, but also the determination of their localization, modifications, interactions, activities, and, ultimately, their function." To this definition, Collins adds structural genomics, the rapid 3-D structure determination of large numbers of proteins (on a proteome scale).
But a major concern is that "we don't have nearly enough mature technologies for each of these various categories to be confident that we're really ready to tackle this for a proteome the size of a human," Collins notes. "What we really need is a very heavy emphasis on developing additional analytical tools and on providing databases to allow integration of these very complicated data sets so we can derive real wisdom" about proteomics.
"What we have to do in proteomics is attack each of these fundamental problems in a deep way with high-throughput technologies," Hood adds. "I would argue that the first two technologies that are really important to get well in hand obviously are those that deal with identification and quantitation."
Collins, Van Bogelen, and Hood spoke at the American Chemical Society ProSpectives conference "Defining the Proteomics Agenda," held last month in Leesburg, Va. The conference was chaired by Van Bogelen; pediatrics and communicable diseases professor Samir M. Hanash of the University of Michigan School of Medicine, Ann Arbor; biochemistry professor Joseph A. Loo of the University of California, Los Angeles; and Frederick C. Neidhardt, vice president for research of the University of Michigan, Ann Arbor.
THE NEED FOR new proteomics technologies is a case of history repeating itself, Collins says, in that key technologies also were not ready when the Human Genome Project began in 1991. "The focus in the first six years of the genome project was on sequencing genomes of model organisms that were smaller and more tractable and on improving the technology and driving the cost down," Collins says. If you had checked on the status of the project in March 1999, "you wouldn't have been all that impressed," because at that point "only 15% of the genome sequence had been derived by the international consortium and placed in public databases."
But that was also the point at which sequencing technology development was coming to fruition. When faster sequencers were developed, "it really was possible to make things go rapidly," Collins says. By May 2000, 90% of the sequence was in hand.
"Many people then said, 'Okay, I guess that era is over. We're now in the postgenomic era.' I don't really think that's a great term," Collins says. "We were in the pregenomic era when we didn't have the genome. Now we have it, so I think we're in the genomic era, aren't we? I don't think we're quite to the 'post' part, and we won't be to the 'post' part for a long time."
Currently, the sequence coverage exceeds 95%. "We aim to, of course, get that all finished and completed and get the gaps closed and have a sequence that will bear the test of time by spring 2003," Collins says.
A comparable human proteome project does not currently exist, but efforts are being made to organize one. For example, the Human Proteome Organization (HUPO) has been established to "consolidate national and regional proteome organizations into a worldwide organization," "encourage the spread of proteomics technologies," and "assist in the coordination of public proteome initiatives," according to its mission statement. Hanash is the first president of HUPO.
A NUMBER OF SMALLER scale but nevertheless ambitious genomics- and proteomics-related projects are also being pursued. "With regard to comparative genomics, do not believe--in case the rumor has reached your ears--that we're not going to be sequencing anymore," Collins says. "It's not true." For instance, the mouse genome will reach fivefold to sixfold coverage by the end of December. (Sequences are determined multiple times to improve their accuracy.) Other ongoing sequencing efforts include those for the rat, zebrafish, and sea squirt.
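A rough calculation shows what fivefold to sixfold coverage buys. The sketch below is not from the article: it assumes an idealized model in which reads land uniformly at random, so the number of reads covering any base follows a Poisson distribution and the expected fraction of bases read at least once is 1 - e^(-c) for coverage c. The function name is illustrative, and real projects deviate from this model because of repeats and cloning biases.

```python
import math


def expected_fraction_covered(coverage):
    """Expected fraction of bases read at least once.

    Assumes reads land uniformly at random (idealized Poisson model),
    so P(base never read) = exp(-coverage).
    """
    return 1.0 - math.exp(-coverage)


for c in (1, 5, 6):
    print(f"{c}x coverage: {expected_fraction_covered(c):.4f} of bases expected to be read")
```

Under these assumptions, going from onefold to fivefold coverage shrinks the expected unread fraction from roughly a third of the genome to under 1%, which is one reason sequences are determined several times over.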
There is also a lot of momentum for establishment of a public-private partnership to identify all the haplotypes in the human genome. Haplotypes are blocks of genomic DNA characterized by sets of specific base variations (single-nucleotide polymorphisms). They can be treated as units, potentially simplifying genomic analysis. "The notion is if one could define human variation at this level, across all the chromosomes, we would have a much more powerful way of identifying genetic contributions to things like diabetes, Alzheimer's disease, heart disease, and cancer," Collins says.
The Mammalian Gene Collection Project is trying to assemble sets of full-length DNA sequences and cDNA clones for expressed mouse and human genes. And Institute of Proteomics Director Joshua LaBaer and coworkers at Harvard Medical School have initiated a complementary effort, the FLEXGene Project, to create an expression-ready repository of DNA plasmids containing all genes from humans and other organisms.
At the same time, the Alliance for Cellular Signaling is organizing a large-scale collaboration to understand the complexities of cell signaling pathways. In addition, the Protein Structure Initiative aims to determine the three-dimensional structure of all proteins in a high-throughput manner (C&EN, Oct. 15, page 23). "And, of course, there's all the effort that's going into informatics tools and databases," Collins says.
The cell signaling and protein structure efforts are arguably subsets of a grand human proteome project similar in scope to the Human Genome Project. Should a comprehensive human proteome project be started?
Such an initiative, which would "try to understand all the human proteins and those of other complex organisms, is really considerably more challenging" than the Human Genome Project, Collins says. "It's hard to even say for sure that this is a bounded enterprise, whereas in sequencing the human genome you knew there was going to be a finite number of base pairs, and it was a matter of determining what they were."
It would therefore be challenging to define the overall goals of a human proteome project, he says. "Experimentally, maybe there are some die-hard biochemists who think proteins are easier to work with than nucleic acids, but I'm not one of those," Collins says. For example, proteins have a habit of getting phosphorylated and glycosylated all the time--endowing them with a lot more decorative elaboration than is found in DNA. "You could argue that DNA also has methylation and a few other things, but nothing like the complexity of proteins."
Furthermore, DNA sequence information is digital. "When you get it, you've got it," he says, "whereas much of the data that one wants to accumulate about proteins is going to be more of an analog sort, depending on things like dissociation constants and kinetics that are hard to measure. I think that adds a level of complexity that should not be underestimated."
CLEARLY, THE DATA integration and analysis challenge is going to be much greater in proteomics than in genomics, because proteomics data are of widely diverse types "and hooking them up together is not at all obvious," Collins adds. "And I suspect that even though intellectual property was a pretty hot topic when it came to genomics, it will be even hotter when it comes to proteomics, because you're one step closer to things that sound like they might be targets for products."
This is a pretty big list of challenges for a human proteome project to overcome. However, "I don't want to be discouraging," Collins says. "I actually think we can identify areas of this that are quite ripe for organized attack along the lines of what the genome project did. But we ought to be thoughtful about doing it."
In any case, technology developments are key to making a proteome project more feasible, he says. "We are not in the circumstance of being able to say we have a lot of mature proteomics technologies" right now.
Of course, there have been some notable successes in proteomics technology development. One example is the ICAT (isotope-coded affinity tag) strategy developed a few years ago at the University of Washington, Seattle, by molecular biotechnology professor Rudolph Aebersold (now at the Institute for Systems Biology) and coworkers [Nat. Biotechnol., 17, 994 (1999)]. ICAT uses isotopically labeled reagents and tandem mass spectrometry (MS) to quantitate and identify expressed proteins from two samples, such as normal cells and cancer cells, for proteomics analysis.
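The quantitation step in an ICAT-style experiment comes down to comparing the signal of each peptide's light-labeled and heavy-labeled forms. The following is a minimal sketch of that arithmetic, not the authors' software. It assumes the heavy tag is about 8.05 daltons heavier than the light tag per labeled cysteine (eight deuteriums in place of eight hydrogens), and the function names, protein names, and intensities are all hypothetical.

```python
from statistics import median

# Assumption: mass shift in daltons between heavy and light tags,
# per labeled cysteine (eight deuteriums replacing eight hydrogens).
HEAVY_LIGHT_SHIFT_DA = 8.05


def heavy_mz(light_mz, charge, n_cys=1):
    """m/z at which the heavy-labeled partner of a light peptide ion is expected."""
    return light_mz + HEAVY_LIGHT_SHIFT_DA * n_cys / charge


def protein_ratios(peptide_pairs):
    """Median heavy/light intensity ratio for each protein.

    peptide_pairs: iterable of (protein, light_intensity, heavy_intensity).
    """
    by_protein = {}
    for protein, light, heavy in peptide_pairs:
        if light <= 0:
            continue  # skip pairs with no usable light signal
        by_protein.setdefault(protein, []).append(heavy / light)
    return {protein: median(r) for protein, r in by_protein.items()}


# Hypothetical peak intensities: light = normal cells, heavy = cancer cells.
pairs = [
    ("protein_A", 1000.0, 2000.0),
    ("protein_A", 500.0, 1500.0),
    ("protein_A", 800.0, 1600.0),
    ("protein_B", 400.0, 100.0),
]
ratios = protein_ratios(pairs)
print(ratios)
```

Taking the median across a protein's peptides is one simple way to blunt the effect of a single noisy peptide pair; other summary statistics would work as well.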
ANOTHER TECHNIQUE, developed by analytical chemistry professor Barry L. Karger, director of the Barnett Institute at Northeastern University, Boston, couples high-resolution separation of proteins and protein digests by capillary electrophoresis (CE) or liquid chromatography (LC) with matrix-assisted laser desorption/ionization (MALDI) MS. With this method, one takes the output from a capillary or column, mixes it with a matrix (such as α-cyano-4-hydroxycinnamic acid), uses a capillary to deposit it on a ribbon or plate, and analyzes it by MALDI MS.
Normally, CE and LC effluent samples deposited on surfaces are "irreproducible and inhomogeneous blobs," Collins says, whereas the capillary-deposited samples are narrow (about 100 µm wide) and very homogeneous--so "you can imagine there might be a better likelihood of getting a reproducible signal if you zap them with your laser." The technique makes it feasible to use multiplex CE or LC--instead of slower and less highly resolved 2-D gel electrophoresis--as high-throughput front ends for MALDI analysis in proteomics studies.
NHGRI has been vigorously soliciting and funding proteomics-related technology proposals. Centers of Excellence in Genomic Science (CEGS) grants are "a new form of funding that NHGRI has announced, and which we expect over the course of the next five or 10 years may grow to occupy a substantial part of our budget," Collins says. "These will involve multiple investigators with different disciplinary expertise but focused around a common theme. The centers will allow the request for substantial amounts of funds for core facilities for expensive instrumentation, which is difficult for single investigators to obtain."
The way to set up a human proteome project would be "to try to identify components of understanding the proteome that are ripe for this kind of organized effort," Collins says. "The kind of questions one would want to ask are: Is the technology ready? Has pilot feasibility been demonstrated? Are there concrete deliverables? Is there a defined timetable? Can we afford this? Do we have quality-control standards that everybody has agreed on, and do we know how to maintain those? And are we making sure data access has been guaranteed?" When all those requirements have been defined and satisfied, "then I think it's time to go," he says.
"It's hard to predict the future, of course, and I'm not even going to try," he adds. But he says he's comforted by a quotation from French author Antoine de Saint-Exupéry: "As for the future, your task is not to foresee it, but to enable it."
So if bits and pieces of a proteome project were to begin to come together, they should be enabling, Collins says. The idea is "not necessarily to be able to look off in the distance and see where we're going, but only to agree that we should hurry up and get there."
[Figure: GLYCOFORMS. Glycoprotein biosynthesis in cells (center) generates mixtures of glycosylated proteins (glycoforms) on cell surfaces (right). The heterogeneity of such glycoforms is one factor that dramatically complicates proteomics, as compared with genomics. Credit: Carolyn Bertozzi, University of California, Berkeley]
Chemical & Engineering News
Copyright © 2001 American Chemical Society