Vol. 5, No. 3, pp 28–30, 32.
Life scientists need advanced computing skills to discover more about genomics and proteomics.
Why do biologists need to be computer savvy, or at least savvier than they already are? After all, with so many commercial and public tools for analyzing data, why would biologists need to know how to make their own programs? In the end, most biologists will not need to know more than how to operate the tools used to investigate the genome and its successors (proteome, metabolome, etc.). However, the individuals who will contribute most to the field will be those who can make tools to do exactly what they want, rather than just rely on preconceived software for looking at data.
In the past, biologists and chemists made improvements in their laboratory methods, making dramatic leaps in understanding, but also enabling other scientists to discover new information based on the improved techniques. Take, for example, the polymerase chain reaction. This discovery cobbled together existing information, but it became a tool that thousands of scientists have used to further their own study.
That was 17 years ago. With human and other genomes now stored in giant databases, who knows whether the next leap will be the development of a novel program or a unique way to couple existing programs, rather than a new wet laboratory technique. What is certain is that a significant part of the future of biology lies with computer methods.
Using computers for biological investigation causes several fundamental problems. Existing software programs do not, and cannot, cover every idea or method of conducting research. Biologists must learn programming or other computer skills to investigate hypotheses. Computer code may become part of the scientific review and lead to arguments about open source and commercial software and how results and methods can be peer-reviewed if the techniques or data are not made public.
Beyond existing software
Compatibility and interoperability have been issues ever since the second computer program was written. Often, programs carry out necessary functions, but the results are not suitable as input in the next program or for human use. These compatibility issues are tackled by using standard file formats like XML. While standard file formats help with compatibility, they are not a cure-all; patches and workarounds are often needed to allow programs to communicate (or interoperate).
For example, Kevin Durfree, director of computing at Microbia, Inc. (Cambridge, MA), a biotechnology company working on anti-infectives, says that he and his co-workers routinely develop programs to work beyond commercial software. "We use a large enterprise-class [a program used company-wide] commercial software package to store our sequence data along with copies of several publicly available sequence repositories. This system includes the capability to do automated BLAST searches against one or more targets. That in itself is a nice and important feature. Our problem is that the reports that were returned from the search were not in a format for quick analysis. We needed to be able to sort the results by best hit and quickly extract the relevant information. We're currently working on a project to write a postprocessor that will extract the relevant information and format it the way we want it. This will be done using Perl, since it handles processing of textual information well." (See box, "Perl.")
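A postprocessor of the kind Durfree describes might look roughly like the sketch below. It is written in Python rather than Perl and is not Microbia's code. It assumes the report is in the common 12-column, tab-delimited BLAST output, which the article does not specify, and it treats the highest bit score as the "best hit." The function name `best_hits` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Column names for the 12-column tabular BLAST report (an assumed input format).
COLUMNS = ["query", "subject", "pct_identity", "length", "mismatches", "gap_opens",
           "q_start", "q_end", "s_start", "s_end", "evalue", "bitscore"]


def best_hits(report_text, top_n=1):
    """Group hits by query and keep the top_n per query, highest bit score first."""
    hits = defaultdict(list)
    for row in csv.reader(io.StringIO(report_text), delimiter="\t"):
        # Skip blank lines, comment lines, and rows that lack the expected columns.
        if not row or row[0].startswith("#") or len(row) < len(COLUMNS):
            continue
        record = dict(zip(COLUMNS, row))
        record["evalue"] = float(record["evalue"])
        record["bitscore"] = float(record["bitscore"])
        hits[record["query"]].append(record)
    return {
        query: sorted(records, key=lambda r: r["bitscore"], reverse=True)[:top_n]
        for query, records in hits.items()
    }


# Made-up rows, for illustration only.
EXAMPLE = (
    "geneA\tsubj1\t91.2\t100\t8\t1\t1\t100\t5\t104\t1e-30\t120.0\n"
    "geneA\tsubj2\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t185.0\n"
    "geneB\tsubj3\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-05\t45.5\n"
)

if __name__ == "__main__":
    for query, records in best_hits(EXAMPLE).items():
        top = records[0]
        print(query, top["subject"], top["bitscore"], top["evalue"])
```

Sorting on E-value instead of bit score, or keeping several hits per query through `top_n`, would be equally reasonable. The choice depends on what the researcher counts as the best hit.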
Of course, existing programs are created by people who use their logic and perspective. A developer views the problem that the program solves in one way, and not everyone using the software will view the problem with the same logic. In other words, "Well, I don't visualize it that way; I want to look at it like this." These differences in understanding lead individuals to choose different products in all aspects of life, not just in discovery tools. This individuality limits the extent to which one software package will satisfy everyone.
How a program functions may be poorly understood by the users, which in science can be troublesome. The extremely popular BLAST genetic alignment program, which is publicly available, is a prime example (see "Sites and Software"). Christopher Dwan, a research programmer at the University of Minnesota's Center for Computational Genomics and Bioinformatics, comments, "In spite of its wild popularity, the algorithmic details of BLAST are poorly understood by its community of users. Of particular importance is the fact that BLAST uses a heuristic approximation algorithm that gains speed at the expense of accuracy of results" (www.oreillynet.com/pub/a/network/2001/11/30/speedup.html).
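The trade-off Dwan describes can be illustrated with a deliberately simplified sketch. This is not BLAST's actual algorithm. It contrasts an exact local alignment score, computed by Smith-Waterman dynamic programming, with a toy heuristic that scores only the diagonals containing an exact shared word of length `k`. The function names, the scoring values, and the two sequences are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Exact best local alignment score; examines every cell of the len(a) x len(b) table."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + step, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def seeded_ungapped_score(a, b, k=4, match=1, mismatch=-1):
    """Toy heuristic: score only diagonals that contain an exact shared word of length k."""
    words = {}
    for j in range(len(b) - k + 1):
        words.setdefault(b[j:j + k], []).append(j)
    best = 0
    scored = set()
    for i in range(len(a) - k + 1):
        for j in words.get(a[i:i + k], []):
            diagonal = i - j
            if diagonal in scored:
                continue
            scored.add(diagonal)
            # Best ungapped segment along this diagonal (running sum floored at zero).
            x, y = max(0, diagonal), max(0, -diagonal)
            running = 0
            while x < len(a) and y < len(b):
                running = max(0, running + (match if a[x] == b[y] else mismatch))
                best = max(best, running)
                x += 1
                y += 1
    return best


if __name__ == "__main__":
    # Made-up sequences that are similar but share no exact 4-letter word.
    first, second = "AAATAAATAAAT", "AAACAAACAAAC"
    print("exact:", smith_waterman_score(first, second))
    print("heuristic:", seeded_ungapped_score(first, second))
```

Because the two example sequences never share four consecutive identical letters, the heuristic never examines them and reports no similarity, while the exact method finds an alignment. A user who does not know this about the tool could mistake "no hit" for "no similarity."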
For these reasons, software that is either packaged or in the public domain often needs to be modified. Durfree adds, "You end up having to customize almost any software package that you buy in some manner or another. This is not to say that the commercial software is faulty or incomplete, just that it ends up being used in ways other than originally intended."
To develop these kinds of programs, whether to automate routine tasks or to implement new methods of analysis, biologists are learning computer programming. Knowledge of programming languages will only become more important as more experimentation is done with computers.
What programming languages do biologists learn out of the hundreds available? They choose a language by considering several factors, including what other researchers are already using in a field, the availability of experts to help them, and the language in which an existing program is written. All in all, several languages are widely used in programming for bioinformatics, including Perl, Java, C, and C++ (see boxes, "Perl" and "C and C++").
The debate in science, software, and publishing is whether the greatest progress will be made with information that is publicly available (open source) or controlled by private organizations. The argument in each field is extremely complex. On the science end, many scientists agree that discoveries made with the aid of, or at least based significantly on, a computer program must include a disclosure of the program and what was done. For example, if the data for a cell growth experiment were manipulated with statistical software, the title and version of the software, and exactly what operations were conducted, should be described or referenced. This procedure is similar to publishing how a wet experiment was accomplished: what columns, buffer, cell growth medium, and temperature were used. However, if a new program or algorithm was developed for this experiment, fellow researchers would not be able to reproduce the results unless the code or algorithm is provided, and reproducibility is the foundation of science.
But where does that leave the developers of a program, not only the individual or team but also the institution they work for and the organization that funded the research? The right to apply for a patent expires one year after publication of an invention, but software code is also protected by copyrights. Copyrights allow the author to retain rights to the code or writing while allowing it to be publicly viewed. Licenses controlling the copyright are becoming widely used to allow access to software code and control future use of the code.
Licensing for open source
An extensive list of software licenses has been compiled at the Free Software Foundation (FSF) website (www.fsf.org/licenses/license-list.html).
The FSF discusses many types of licenses, but its discussions are skewed toward the extremely open copyleft license. The term "copyleft" is used to indicate how opposite it is from copyright. A program that is copyleft is free to be used, studied (by viewing the source code), improved, changed, and distributed in its original or improved versions, but only in the same free manner that the original was distributed in. This allows improvements to be made, but no one may make the software or its derivatives proprietary. The FSF says, "Anyone who redistributes the software, with or without changes, must pass along the freedom to further copy and change it."
Of course, on the other hand, if a researcher does not publish the code at all, then verification of results is impossible. There are licenses that protect a researcher's future interest but allow additional scientific verification, as well as improvements, to be made in the programs. HMMer (http://hmmer.wustl.edu/), a program that searches databases using profile hidden Markov models, is available from Washington University in St. Louis with a GNU General Public License, which states that all derivations must be made public. However, the webpage indicates that the university is open to licensing the HMMer program for use in software that is to be sold. The university still holds the copyright, while the additional developers must keep their developments public.
With the debate raging, Jason Stewart, Harry Mangalam, and Jiaye Zhou have taken a lead in the science realm with a petition. Their extensive arguments for open source licensing of scientific software code developed using public money can be viewed at the website they founded, www.openinformatics.org. The NIH and the NSF currently encourage programs developed using grants to be shared publicly.
On the other side of the debate, companies that produce software or provide it to researchers wish to keep control of source code, making only the working program available publicly. Although commercial software is often more polished and user-friendly than open source programs, it does introduce concerns. Unlike laboratory equipment, which can be disassembled and examined, working programs are difficult to disassemble and examine.
If commercial packages are available, then it is just like a researcher buying equipment and reagents to reproduce a colleague's findings. However, the question of errors in a program remains unsolved because they are reproduced by each successive use of the program.
A new twist makes the commercial availability of software or data an issue. Data is now being stockpiled in databases such as Celera Genomics' human genome database, and access is offered as a subscription. Is this to be considered like any other piece of equipment that researchers need to obtain to reproduce a finding? Or is a subscription system contrary to the scientific principle of making the methods and data public along with the findings?
In one case, research conducted using Celera's data was not compared with its public counterpart, which led to misunderstood findings. A study of Celera's database led to an estimate of about 30,000 genes in the human genome, and a similar study using the public Human Genome Project database led to a similar finding. However, when the Genomics Institute of the Novartis Research Foundation examined the previous studies, with access to Celera's data through its parent company, it found little overlap in genes from the two studies. This suggests that there are many more than 30,000 genes, because if there were only 30,000, then the two sets should have overlapped almost perfectly.
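The reasoning behind that conclusion is simple set arithmetic, sketched below with toy identifiers. The gene names and set sizes are invented for illustration and are not data from either study, and the function name `overlap_and_combined` is hypothetical. If two gene lists of similar size share few members, the number of distinct genes across both lists is much larger than either list alone.

```python
def overlap_and_combined(genes_a, genes_b):
    """Return (number of shared genes, number of distinct genes across both lists)."""
    set_a, set_b = set(genes_a), set(genes_b)
    return len(set_a & set_b), len(set_a | set_b)


if __name__ == "__main__":
    # Toy lists of ten genes each that share only three (invented identifiers).
    private_list = [f"gene{i}" for i in range(0, 10)]
    public_list = [f"gene{i}" for i in range(7, 17)]
    shared, combined = overlap_and_combined(private_list, public_list)
    print(f"shared: {shared}, distinct genes across both lists: {combined}")
```

With perfect overlap the combined count would equal the size of either list. The smaller the overlap, the closer the combined count comes to the sum of the two.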
Michael J. Felton is an assistant editor of Modern Drug Discovery. Send your comments or questions regarding this article to email@example.com or the Editorial Office by fax at 202-776-8166 or by post at 1155 16th Street, NW; Washington, DC 20036.