March 2001
Vol. 4, No. 3, pp 69–70
sites and software
<title>XML: Data about data</title>
<summary>Metadata may allow computers to understand what scientists are talking about.</summary>

Proficiency at collecting biological data has outpaced the capacity for analyzing it. The avalanche of data requires standardization of storage, sharing, and publishing techniques. Many current standards for biological data communications are not designed to be read by both humans and computers, and the result has been bottlenecks in sharing and publishing data and in automating data analysis. One method of relieving these bottlenecks is to use metadata, expressed in a language such as the Extensible Markup Language (XML), to help manage the flood of biological data.

Metadata is data about data. While this may sound confusing, metadata simply provides context for data. For example, when you look at this page, you know that the department title is at the top right, the article title is in bold starting at the left, and a brief description of the article is below the title. You, as a human who has seen magazines before, do not need tags such as <title> and <summary> to appear in this article. However, the computer used to publish this magazine has no such understanding, even though it has processed hundreds of magazines. But if the title and summary were labeled, almost any computer could recognize the significance of the information.

XML is a computer language that is similar to hypertext markup language (HTML), which is used to build Web pages. However, XML is used to tag data and thus provide the metadata, or context, for different pieces of information. It is a general language for telling computers what the data is. The exciting aspect of XML is the flexibility and robustness that it brings to the Web and to databases. Just as HTML provided the method for presenting documents, XML provides the method for defining the meaning or semantics of data.
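As a minimal sketch of this idea, the Python fragment below tags this article's own title and summary and then retrieves them by name. The <article> wrapper element and the function name are assumptions made for illustration; only <title> and <summary> come from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for this article. Only <title> and <summary> are
# taken from the text; the <article> wrapper is an assumption.
ARTICLE_XML = """
<article>
  <title>XML: Data about data</title>
  <summary>Metadata may allow computers to understand what scientists are talking about.</summary>
</article>
"""

def read_article(xml_text):
    """Return the title and summary by tag name, not by position on the page."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("title"),
        "summary": root.findtext("summary"),
    }

article = read_article(ARTICLE_XML)
print(article["title"])
```

The program never needs to know where the title sits on the page or how it is styled; the tag alone supplies the context.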

XML and biotechnology
In the drug discovery and biotechnology industries, XML has become the foundation of several markup languages used for storing biological data. Many related research efforts have defined metadata markup languages based on XML to support biological and genetic data collection, management, retrieval, and analysis. These XML-based markup languages (also called dialects or vocabularies) can help control the explosive growth of data produced by life sciences research, the Human Genome Project, and other genetic and proteomic databases. Some of the vocabularies are described below. Because all of these vocabularies share XML as a foundation, it is not difficult to transform data from one vocabulary to another by using various tools, such as Extensible Stylesheet Language Transformations (XSLT). (The specification for XSLT can be found at www.w3c.org/TR/xslt.)
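The following sketch shows what moving a record from one vocabulary to another can look like. It is not XSLT: the Python standard library has no XSLT processor, so the mapping is written out by hand as a stand-in for what a stylesheet would declare. Both vocabularies (<protein>/<sequence> and <entry>/<id>/<seq>) and the sample values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Both vocabularies below are hypothetical; neither is a real standard.
SOURCE_XML = '<protein name="example-1"><sequence>MKTAYIAK</sequence></protein>'

def convert(source_text):
    """Rebuild a <protein> record as an <entry> record in a second vocabulary."""
    src = ET.fromstring(source_text)
    entry = ET.Element("entry")
    # Map the source attribute and child element onto the target's elements.
    ET.SubElement(entry, "id").text = src.get("name")
    ET.SubElement(entry, "seq").text = src.findtext("sequence")
    return ET.tostring(entry, encoding="unicode")

converted = convert(SOURCE_XML)
print(converted)
```

Because both sides are XML, the conversion is only a matter of relabeling and rearranging parts of a tree; no custom file parser is required.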

BIOML. The BIOpolymer Markup Language (BIOML) is an XML language that is used to describe experimental information about proteins, genes, and other biopolymers. A BIOML document uses data tags such as <protein> and <homologs> to describe a specific biopolymer and all of its associated experimental information in a logical and meaningful way. The advantage of this approach is that XML’s tree structure maps easily onto biopolymer information, which is hierarchical and nested at different levels of complexity.

Although BIOML’s goal is the transfer of information between computers, the stylesheet information available when using an XML-based approach simplifies the task of displaying that information on various types of browsers. BIOML was designed and written by Ron Beavis, David Fenyö (ProteoMetrics, LLC), Brian Chait (Rockefeller University), and David States (Washington University in St. Louis). [www.proteometrics.com/BIOML]
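To illustrate how a tree of tags can mirror nested biopolymer information, the sketch below walks a small document and reports each element with its depth. The document is not valid BIOML: only the tag names <protein> and <homologs> appear in the description above, while the <gene> wrapper, the label attribute, and the sample values are assumptions.

```python
import xml.etree.ElementTree as ET

# Illustrative nesting only; this is not the actual BIOML schema.
DOC = """
<gene label="example gene">
  <protein label="example protein">
    <homologs>
      <protein label="homolog A"/>
      <protein label="homolog B"/>
    </homologs>
  </protein>
</gene>
"""

def outline(element, depth=0):
    """Return (depth, tag, label) rows for an element and all its descendants."""
    rows = [(depth, element.tag, element.get("label", ""))]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

rows = outline(ET.fromstring(DOC))
for depth, tag, label in rows:
    # Indent each element by its depth to show the hierarchy.
    print("  " * depth + tag, label)
```

The same recursive walk works at any level of nesting, which is what makes a tree-structured format a natural fit for hierarchical data.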

PSDML. The Protein Sequence Database Markup Language is an open-standard markup language used to store protein information in the Protein Information Resource (PIR) database. PIR is an annotated, public-domain protein sequence database containing 195,891 annotated and classified entries. The PIR Web site allows sequence similarity and text searching of the database. PIR is a collaboration between the National Biomedical Research Foundation at Georgetown University Medical Center, the Munich Information Center for Protein Sequences, and the Japan International Protein Information Database. [http://pir.georgetown.edu/]

BSML. The Bioinformatic Sequence Markup Language is an open-standard protocol for encoding and graphically displaying DNA, RNA, and protein sequence information. All information underlying the graphical representation is contained in the BSML document, allowing the user to drill down through the data to any level of resolution, from chromosome to base pair. Dynamic, interactive data mining is facilitated by an intuitive point-and-click visual display. The Web-based Basic Browser displays BSML documents and imports gene sequences from local or remote repositories such as GenBank. The Basic Browser and the BSML documentation are free when downloaded from LabBook’s Web site. [www.labbook.com]

BSML is a TopoGEN project developed by Visualgenomics, Inc. (now LabBook), and is funded by a Small Business Innovation Research (SBIR) grant from the National Center for Human Genome Research to develop the public-domain protocol. The principal investigator on the SBIR grant is Joseph Spitzner, TopoGEN’s software director. [www.labbook.com/products/browser.asp]

GAME. Genome Annotation Markup Elements is a markup language used in molecular biology for annotation of a biosequence. GAME is part of BIOXML’s overall goal of providing a set of orthogonal XML vocabularies for molecular biology. These vocabularies define different facets of biotechnology and can be combined to create more expressive capabilities. [www.bioxml.org/Projects/game]

MSAML. The Multiple Sequence Alignments Markup Language was developed to make manipulation and extraction of multiple sequence alignment information easier by logically defining the parts of an alignment for use in an XML-based application.

MSAML is a lightweight definition of a multiple sequence alignment (MSA) rather than a complete MSA representation. An MSAML document may pass a validating XML parser test and still contain some MSA errors. It also does not capture all facets of an MSA, which requires two-dimensional data formatting. [http://maggie.cbr.nrc.ca/~gordonp/xml/MSAML]
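The gap between an acceptable XML document and a correct alignment can be sketched as follows. The element names (<alignment>, <row>) and sequences are hypothetical and are not taken from the MSAML definition. Python’s standard parser checks only well-formedness rather than validating against a DTD, but the point carries over: a rule such as "all aligned rows must have the same length" is not something a DTD can express, so it needs a separate check.

```python
import xml.etree.ElementTree as ET

# Hypothetical alignment markup; well-formed XML, but the third row is
# one column short, so it is not a consistent alignment.
ALIGNMENT_XML = """
<alignment>
  <row id="seq1">MK-TAYIAK</row>
  <row id="seq2">MKQTAYI-K</row>
  <row id="seq3">MKTAYIAK</row>
</alignment>
"""

def find_length_errors(xml_text):
    """Return ids of rows whose aligned length differs from the first row's."""
    root = ET.fromstring(xml_text)  # parsing succeeds: the XML is well-formed
    rows = [(r.get("id"), (r.text or "").strip()) for r in root.findall("row")]
    if not rows:
        return []
    expected = len(rows[0][1])
    return [row_id for row_id, seq in rows if len(seq) != expected]

bad_rows = find_length_errors(ALIGNMENT_XML)
print(bad_rows)
```

An application that consumes alignment documents would therefore still need domain checks of this kind on top of whatever the XML parser verifies.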

Microarray data
GEML. The Gene Expression Markup Language is an open-standard markup language for DNA microarray and gene expression data for chip patterns and chip scan profiles.

The GEML format is designed to be broadly applicable and to support easy exchange of data among a variety of gene expression systems, including Web-based genome databases. GEML separates data reporting and collection from methodology and therefore stores information about data collection methodology without evaluating the measurement. This enables normalization, integration, and comparison of data across methodologies. GEML is also independent of any particular database schema. GEML handles profile data independent of pattern reference or image data format. [www.ncgr.org/research/genex/genexml.html]

MAML. The Microarray Markup Language is an XML-based vocabulary for describing and communicating information about DNA array experiments. Like GEML, MAML is independent of specific platforms and provides a method of describing experiments performed on all types of DNA arrays, including spotted and synthesized arrays, as well as oligonucleotide and cDNA arrays. It is also independent of image analysis and data normalization methods. This XML vocabulary represents raw and processed microarray data. It is compatible with the definition of the minimum information about a microarray experiment (MIAME) proposed by the MGED (Microarray Gene Expression Database) group. MAML is part of the MGED initiative and is being developed by a community of developers, including the Lawrence Berkeley National Laboratory, National Center for Biotechnology Information, European Bioinformatics Institute, National Center for Genome Resources, and Stanford University. [http://beamish.lbl.gov]

CellML. This markup language is used to store and exchange computer-based biological models. It allows scientists to share models even if they use different model-building software. It also enables them to reuse components from one model to another, which accelerates model building. CellML includes information about model structure (how the parts of a model are organized and related to one another) and metadata (information about the model that allows scientists to search for specific models or model components in a database or other repository).

CellML is developed by Physiome Sciences, Inc., in conjunction with the bioengineering research group at the University of Auckland’s department of engineering science and affiliated research groups. [www.cellml.org]

SBML. The Systems Biology Markup Language uses XML and UML (unified modeling language) for representing and modeling the information in systems biology simulation software. The goal is to enable simulation software to communicate and exchange models across multiple software packages. The SBML representation language is organized around five categories of information: model, compartment, geometry, specie, and reaction.

The Caltech ERATO Kitano Systems Biology Project has developed SBML by merging the most obvious modeling-language features of BioSpice, DBSolve, E-Cell, Gepasi, Jarnac, StochSim, and Virtual Cell. [www.cds.caltech.edu/erato]

CML. The Chemical Markup Language (CML) was one of the first applications developed using XML. It supports the conversion of chemical information files without semantic loss, the creation of structured documents including chemical publications, and the precise location of information in files. In simple terms, it is “HTML with molecules”. CML has been designed so that it is easy for the average chemist to understand, although it helps to know something about HTML. It is not magic; it is simple chemical common sense that hides much of the informational detail that chemists have to deal with.

The Java Universal Molecular Browser for Objects (JUMBO) was the first XML browser, developed to support CML while XML was only a working draft. Although JUMBO was designed for CML, it may very well be of interest to anyone working with XML. It is excellent as a learning tool and XML creation tool. Newer versions, called JUMBO2 and JUMBO3, have increased functionality. [www.xml-cml.org]

XML can help researchers analyze, manipulate, share, and publish biotechnology data easily and efficiently. Further development of XML standards will allow researchers to concentrate on their strengths and leave the details to developers. Hopefully, these developments will also lead to standards built around a small set of data formats that are simple to understand and parse but that can become the basis for vocabularies and metadata of any complexity.

These examples are just a few of the available XML vocabularies used as metadata for representing various aspects of biotech research. The different vocabularies may be combined as needed to address unique problems.

In the future, new XML-based metadata standards may arise, at which time it may make sense to migrate to these new standards. One last data element that would greatly ease the standardization of genomic data is an agreed-upon standard for a “unique biological sequence locator”, which would simplify many of the different attempts to reference a biological sequence.

Hank Simon has worked in information technology (IT), IT architectures, and XML technologies for 25 years.

Send your comments or questions regarding this article to mdd@acs.org or the Editorial Office by fax at 202-776-8166 or by post at 1155 16th Street, NW; Washington, DC 20036.
