<summary>Metadata may allow computers to understand what scientists are talking about.</summary>
Proficiency at collecting biological data has outpaced the capacity to analyze it. The avalanche of data requires standardization of storage, sharing, and publishing techniques. Many current standards for biological data communications are not meant for both human and computer reading, and thus bottlenecks have resulted in sharing and publishing data and in automating data analysis. One method of relieving these bottlenecks uses metadata, expressed in a format such as the extensible markup language (XML), to help manage the avalanche of biological data.
Metadata is data about data. While this may sound confusing, metadata provides context to data. For example, when you look at this page, you know that the department title is at the top right, the article title is in bold starting at the left, and a brief description of the article is below the title. You, as a human who has seen magazines before, do not need tags to appear in this article, such as <title> and <summary>. However, the computer used to publish this magazine has no such method of understanding even though it has seen hundreds of magazines. But if the title and head were labeled, almost any computer could understand the significance of the information.
XML is a computer language that is similar to hypertext markup language (HTML), which is used to build Web pages. However, XML is used to tag data and thus provide the metadata, or context, for different pieces of information. It is a general language for telling computers what the data is. The exciting aspect of XML is the flexibility and robustness that it brings to the Web and to databases. Just as HTML provided the method for presenting documents, XML provides the method for defining the meaning or semantics of data.
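The magazine example above can be sketched in a few lines of Python. This is only an illustration: the <article> wrapper, the placeholder title, and the variable names are assumptions made for the sketch, and the standard library module xml.etree.ElementTree stands in for whatever software a publisher might actually use.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for a magazine page. The <article> wrapper and the
# placeholder title are assumptions; <title> and <summary> are the tags
# mentioned in the text.
ARTICLE_XML = """
<article>
  <title>An example article title</title>
  <summary>Metadata may allow computers to understand what scientists are talking about.</summary>
</article>
"""

root = ET.fromstring(ARTICLE_XML)

# Because each piece of text is labeled, the program can ask for it by name
# instead of guessing from its position on the page.
title = root.findtext("title")
summary = root.findtext("summary")

print("Title:", title)
print("Summary:", summary)
```

The program never inspects layout or typography; the tags alone tell it which string is the title and which is the summary.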
XML and biotechnology
BIOML. The BIOpolymer Markup Language (BIOML) is an XML language that is used to describe experimental information about proteins, genes, and other biopolymers. A BIOML document uses data tags such as <protein> and <homologs> to describe a specific biopolymer and all of its associated experimental information in a logical and meaningful way. The advantage of using a markup language is that it employs an XML-based tree structure that maps easily into the biopolymer information that is hierarchical and nested at different levels of complexity.
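A rough sketch of how nested tags mirror nested biological information is given here in Python. Only the <protein> and <homologs> tag names come from the description above; the <sequence> and <homolog> elements, the label attribute, and the sequence itself are hypothetical and should not be read as the actual BIOML definition.

```python
import xml.etree.ElementTree as ET

# Illustrative only: <protein> and <homologs> are named in the article, but
# every other element, attribute, and value here is hypothetical.
DOC = """
<protein label="example protein">
  <sequence>MKTAYIAKQR</sequence>
  <homologs>
    <homolog label="homolog A"/>
    <homolog label="homolog B"/>
  </homologs>
</protein>
"""


def outline(element, depth=0):
    """Return one indented line per element, mirroring the XML nesting."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOC)
tree_lines = outline(root)
print("\n".join(tree_lines))

# Any level of the hierarchy can be reached directly by tag name.
homolog_labels = [h.get("label") for h in root.iter("homolog")]
print(homolog_labels)
```

The indented outline shows the same hierarchy a biologist would draw by hand: a protein that has a sequence and a set of homologs, each of which could carry further nested detail.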
Although BIOML's goal is the transfer of information between computers, the stylesheet information available when using an XML-based approach simplifies the task of displaying that information on various types of browsers. BIOML was designed and written by Ron Beavis, David Fenyö (ProteoMetrics, LLC), Brian Chait (Rockefeller University), and David States (Washington University in St. Louis). [www.proteometrics.com/BIOML]
PSDML. The Protein Sequence Database Markup Language is an open-standard markup language used to store protein information in the Protein Information Resource (PIR) database. PIR is an annotated, public-domain protein sequence database containing 195,891 annotated and classified entries. The PIR Web site allows sequence similarity and text searching of the database. PIR is a collaboration between the National Biomedical Research Foundation at Georgetown University Medical Center, the Munich Information Center for Protein Sequences, and the Japan International Protein Information Database. [http://pir.georgetown.edu/]
BSML. The Bioinformatic Sequence Markup Language is an open-standard protocol for encoding and graphically displaying DNA, RNA, and protein sequence information. All information underlying the graphical representation is contained in the BSML document, allowing the user to drill down through the data to any level of resolution, from chromosome to base pair. Dynamic, interactive data mining is facilitated by an intuitive point-and-click visual display. The Web-based Basic Browser displays BSML documents and imports gene sequences from local or remote repositories such as GenBank. The Basic Browser and the BSML documentation are free when downloaded from LabBook's Web site. [www.labbook.com]
BSML is a TopoGEN project developed by Visualgenomics, Inc. (now LabBook), and is funded by a Small Business Innovation Research (SBIR) grant from the National Center for Human Genome Research to develop the public-domain protocol. The principal investigator of the SBIR grant with which this project is associated is Joseph Spitzner, who is TopoGEN's software director. [www.labbook.com/products/browser.asp]
GAME. Genome Annotation Markup Elements is a markup language used in molecular biology for annotation of a biosequence. GAME is part of BIOXML's overall goal of providing a set of orthogonal XML vocabularies for molecular biology. These vocabularies define different facets of biotechnology and can be combined to create more expressive capabilities. [www.bioxml.org/Projects/game]
MSAML. The Multiple Sequence Alignments Markup Language was developed to make manipulation and extraction of multiple sequence alignment information easier by logically defining the parts of an alignment for use in an XML-based application.
MSAML is a lightweight definition of an MSA rather than a perfect representation of one. An MSAML document may pass a validating XML parser test and still contain some MSA errors. It also does not capture all facets of an MSA, which requires two-dimensional data formatting. [http://maggie.cbr.nrc.ca/~gordonp/xml/MSAML]
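The gap between "the XML is acceptable" and "the alignment is correct" can be shown with a small Python sketch. The <alignment> and <row> elements below are hypothetical stand-ins, not the real MSAML element names, and xml.etree.ElementTree checks only that a document is well formed rather than validating it against a schema. The same point should hold for a validating parser, because a typical document definition constrains which tags may appear, not how long each row of text is.

```python
import xml.etree.ElementTree as ET

# Hypothetical alignment markup; these are not the real MSAML element names.
# The third row is one column short, which is an alignment error but not an
# XML error.
ALIGNMENT_XML = """
<alignment>
  <row id="seq1">ACG-TTA</row>
  <row id="seq2">ACGATTA</row>
  <row id="seq3">ACG-TT</row>
</alignment>
"""


def alignment_errors(xml_text):
    """Parse the XML, then apply a check the XML parser cannot make."""
    root = ET.fromstring(xml_text)  # raises ParseError only for malformed XML
    rows = [(row.get("id"), row.text or "") for row in root.iter("row")]
    if not rows:
        return ["alignment has no rows"]
    expected = len(rows[0][1])  # every row should match the first row's length
    return [
        f"{row_id}: length {len(seq)}, expected {expected}"
        for row_id, seq in rows[1:]
        if len(seq) != expected
    ]


errors = alignment_errors(ALIGNMENT_XML)
print(errors)
```

The document parses without complaint, yet the application-level check reports that seq3 does not line up with the other rows. Checks of this kind have to live in the software that consumes the markup.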
The GEML format is designed to be broadly applicable and to support easy exchange of data among a variety of gene expression systems, including Web-based genome databases. GEML separates data reporting and collection from methodology and therefore stores information about data collection methodology without evaluating the measurement. This enables normalization, integration, and comparison of data across methodologies. GEML is also independent of any particular database schema. GEML handles profile data independent of pattern reference or image data format. [www.ncgr.org/research/genex/genexml.html]
MAML. The Microarray Markup Language is an XML-based vocabulary for describing and communicating information about DNA array experiments. Like GEML, MAML is independent of specific platforms and provides a method of describing experiments performed on all types of DNA arrays, including spotted and synthesized arrays, as well as oligonucleotide and cDNA arrays. It is also independent of image analysis and data normalization methods. This XML vocabulary represents raw and processed microarray data. It is compatible with the definition of the minimum information about a microarray experiment (MIAME) proposed by the MGED (Microarray Gene Expression Database) group. MAML is part of the MGED initiative and is being developed by a community of developers, including the Lawrence Berkeley National Laboratory, National Center for Biotechnology Information, European Bioinformatics Institute, National Center for Genome Resources, and Stanford University. [http://beamish.lbl.gov]
CellML is developed by Physiome Sciences, Inc., in conjunction with the bioengineering research group at the University of Auckland's department of engineering science and affiliated research groups. [www.cellml.org]
SBML. The Systems Biology Markup Language uses XML and UML (unified modeling language) for representing and modeling the information in systems biology simulation software. The goal is to enable simulation software to communicate and exchange models across multiple software packages. The SBML representation language is organized around five categories of information: model, compartment, geometry, specie, and reaction.
The Caltech ERATO Kitano Systems Biology Project has developed SBML by merging the most obvious modeling-language features of BioSpice, DBSolve, E-Cell, Gepasi, Jarnac, StochSim, and Virtual Cell. [www.cds.caltech.edu/erato]
CML and JUMBO
The Java Universal Molecular Browser for Objects (JUMBO) was the first XML browser, developed to support CML while XML was only a working draft. Although JUMBO was designed for CML, it may very well be of interest to anyone working with XML. It is excellent as a learning tool and XML creation tool. Newer versions, called JUMBO2 and JUMBO3, have increased functionality. [www.xml-cml.org]
These examples are just a few of the available XML vocabularies used as metadata for representing various aspects of biotech research. The different vocabularies may be combined as needed to address unique problems.
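One common way to combine XML vocabularies in a single document is XML namespaces, sketched below in Python. The article does not say which mechanism the vocabularies listed here rely on, so this is an assumption; the namespace addresses, element names, and attributes are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaces standing in for two independent vocabularies:
# one for sequences and one for annotations.
NS = {
    "seq": "http://example.org/sequence",
    "ann": "http://example.org/annotation",
}

DOC = """
<record xmlns:seq="http://example.org/sequence"
        xmlns:ann="http://example.org/annotation">
  <seq:sequence id="s1">ATGGCGTAA</seq:sequence>
  <ann:feature ref="s1" type="gene" start="1" end="9"/>
</record>
"""

root = ET.fromstring(DOC)

# Index the sequences from the first vocabulary by their id attribute.
sequences = {s.get("id"): s.text for s in root.findall("seq:sequence", NS)}

# Resolve each annotation from the second vocabulary against those sequences.
annotated = []
for feature in root.findall("ann:feature", NS):
    residues = sequences[feature.get("ref")]
    start, end = int(feature.get("start")), int(feature.get("end"))
    annotated.append((feature.get("type"), residues[start - 1:end]))  # 1-based, inclusive

print(annotated)
```

Each vocabulary keeps its own tag names, and the prefixes prevent collisions, so a program that understands only the sequence vocabulary can still read the document and ignore the annotation elements.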
In the future, new XML-based metadata standards may arise, at which time it may make sense to migrate to these new standards. One last data element that would greatly ease the standardization of genomic data is an agreed-upon standard for a unique biological sequence locator, which would simplify many of the different attempts to reference a biological sequence.