C&EN logo The Newsmagazine of the Chemical World
  Science & Technology  
  August 22, 2005
Volume 83, Number 34
pp. 39-40

Free IUPAC software converts structures to computer-readable representations


After five years of labor, the International Union of Pure & Applied Chemistry has unveiled a method to create unique, computer-readable identifiers for chemical compounds. The source code and software needed to convert chemical structures into these labels--dubbed International Chemical Identifiers (InChIs)--are available for free on IUPAC's website. With the new method, the organization hopes to facilitate the exchange of chemical information, particularly through digital applications.

IUPAC initiated the InChI project in 2000 to "establish a unique, machine-generated label for any chemical structure, which would serve as a nonproprietary identifier for use in printed and electronic data sources, thus enabling easy linking of diverse data compilations," according to Alan D. McNaught, president of IUPAC's Chemical Nomenclature & Structure Representation Division. "There were two main reasons for wishing to do this," he adds. "First, the increasing complexity of molecular structures that chemists were dealing with routinely, making conventional naming procedures inconveniently cumbersome; and second, the lack of a suitable, openly available electronic format for exchanging chemical structure information over the Internet."

McNaught headed the InChI project team. The other key players included Stephen R. Heller, Stephen E. Stein, and Dmitrii Tchekhovskoi, who are all associated with the Physical & Chemical Properties Division of the National Institute of Standards & Technology (NIST). Several other volunteers contributed their time to the project.

The InChI algorithm converts a chemical structure drawn with software into an alphanumeric string of characters. The program can also convert an InChI label back into a molecular structure.

The different types of structural information--atomic connectivity, isotopes, stereochemistry, electronic charge, and so on--are represented separately within the InChI string and are divided by slash marks.

The string for naphthalene, for instance, is InChI=1/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H. The first "1" refers to the version of the InChI software. The next segment of the string, C10H8, provides the molecular formula. The third segment is the connection table, which indicates how the atoms are connected. The last segment in this example provides information about the placement of hydrogen atoms.
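The segment layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the IUPAC software: the function name `split_inchi` is hypothetical, and the code assumes that the version and molecular formula are always the first two segments and that each later segment starts with a one-letter tag (`c` for connections, `h` for hydrogens), as in the naphthalene example.

```python
def split_inchi(inchi):
    """Split an InChI string into version, formula, and tagged layers.

    Hypothetical helper for illustration only; it does no chemical validation.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("expected a string starting with 'InChI='")
    segments = inchi[len(prefix):].split("/")
    if len(segments) < 2:
        raise ValueError("expected at least a version and a formula segment")
    version, formula = segments[0], segments[1]
    # Keep the layers as an ordered list of (tag, body) pairs so that the
    # original order of the slash-separated segments is preserved.
    layers = [(seg[0], seg[1:]) for seg in segments[2:] if seg]
    return {"version": version, "formula": formula, "layers": layers}


naphthalene = "InChI=1/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H"
parsed = split_inchi(naphthalene)
print(parsed["version"])   # version segment
print(parsed["formula"])   # molecular formula segment
for tag, body in parsed["layers"]:
    print(tag, body)       # connection table, then hydrogen placement
```

Splitting on the slash is enough to recover the separate kinds of structural information because each kind is stored in its own segment.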

For now, the InChI algorithm can handle neutral and ionic organic molecules; radicals; and inorganic, organometallic, and coordination compounds. The program lacks a method to describe excited states, though this capability will be added. Future versions may also be able to process polymers and Markush structures, which rely on a single structure drawing and a list of possible substituents--usually designated in the drawing with an "R"--to represent a set of similar chemical compounds.

CAS, Elsevier Employ Registry Numbers To Identify Substances

There's more than one way to unambiguously specify the identity of a chemical. The International Union of Pure & Applied Chemistry's new International Chemical Identifiers (InChIs) method specifies a compound by a computer-readable alphanumeric string that describes its structure. Other systems, such as those used by Chemical Abstracts Service (CAS) and Elsevier, rely on assigned numbers to specify particular chemicals. Unlike InChIs, however, these numbers contain no structural information.

As of this August, CAS has assigned registry numbers to 26 million chemical compounds and 57 million biosequences. Registered substances run the gamut from organic and inorganic compounds to alloys, polymers, and nucleic acids.

The CAS Registry even provides numbers for substances whose chemical structures can't be specified, including mixtures such as petroleum distillate.

"No InChI standard can describe this substance, nor can any connection table," notes Jeff Wilson, manager of the CAS Registry Authority Database. CAS's ability to uniquely identify this type of complex material "is an important contribution to the world in that these are the substances being manufactured and shipped around our environment and they are easily identified by the CAS registry number."

Each number serves as a link to an extensive substance record in the CAS databases. The record contains the compound's structure, chemical names, molecular formula, properties, and other data.

Searches for compounds identified by their registry numbers can be conducted in CAS databases for a fee or in databases such as the Environmental Protection Agency's Substance Registry System for free. References can also be found with search engines such as Google.

Elsevier also generates registry numbers for chemicals listed in its Crossfire Beilstein and Gmelin databases, but these are different from CAS's.

The Beilstein database contains registry numbers for more than 9 million small organic molecules, while the Gmelin database contains registry numbers for more than 2 million inorganic and organometallic compounds, according to Product Manager Jochen Tannemann. References for a particular compound can be retrieved from the databases through its registry number; a license is required for the search.

DREAM TEAM NIST's Heller (from left), Tchekhovskoi, and Stein were instrumental in developing the InChI method.

IUPAC hopes that developers of commercial chemical software as well as database compilers and publishers of chemical information will incorporate the InChI algorithm in their products to enable sharing of molecular information among chemical scientists. Already, the InChI software has been included as a component of Chemical Markup Language. CML is an extension of XML--Extensible Markup Language, which serves as a standardized method to format documents on the Web--that can handle chemical information.

Nature Chemical Biology is using InChIs. Several organizations have also adopted the practice for their databases, including NIST, the National Institutes of Health, and Thomson ISI, a provider of scientific information services. The U.S. and European patent offices are also considering InChIs.

On its website, IUPAC lists several potential future applications for InChIs: "communication between databases, merging data collections developed using different systems/protocols, maintaining a laboratory chemical inventory, and passing the 'identity' of a substance to a colleague for use in any of the above."
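As a rough sketch of the "merging data collections" use case, the Python snippet below combines two small record sets by using the InChI string as the shared key. Everything here is illustrative: the function `merge_by_inchi` and the collections `inventory` and `reference` are hypothetical, and each collection is assumed to be a dictionary mapping an InChI string to a dictionary of fields.

```python
# The naphthalene InChI quoted earlier in the article serves as the shared key.
NAPHTHALENE = "InChI=1/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H"


def merge_by_inchi(*collections):
    """Merge record collections whose keys are InChI strings.

    Hypothetical helper: later collections overwrite earlier ones
    when the same field appears twice for one compound.
    """
    merged = {}
    for collection in collections:
        for inchi, fields in collection.items():
            merged.setdefault(inchi, {}).update(fields)
    return merged


# Two made-up sources that describe the same compound in different ways.
inventory = {NAPHTHALENE: {"name": "naphthalene"}}
reference = {NAPHTHALENE: {"formula": "C10H8"}}

combined = merge_by_inchi(inventory, reference)
print(combined[NAPHTHALENE])
```

Because both sources would derive the same string from the same structure, no lookup table between naming systems is needed; the identifier itself does the linking.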

The program could also be useful for chemical suppliers. "If your catalog is InChIfied, then it can be easily indexed automatically by search engines, giving you a greatly increased exposure," according to a website that includes answers to frequently asked questions (FAQs) about InChIs (wwmm.ch.cam.ac.uk/inchifaq).

The research group of Cambridge University molecular informaticist Peter Murray-Rust prepared the FAQ site. It provides a link to another site where users can draw a chemical structure, convert it to its InChI, and search for that InChI on the Web by using Google. Many InChIs have already been created. For instance, Murray-Rust's group has already published InChIs for more than 250,000 compounds. And the number of InChIs in NIH's PubChem database is expected to top 4 million this month.

InChIs are taking hold in the chemical arena, but they aren't the only game in town. Other identification methods for chemicals include the Simplified Molecular Input Line Entry Specification (SMILES) language. David Weininger, now president of the cheminformatics firm Daylight Chemical Information Systems, began developing this system in 1983 while at the Environmental Protection Agency. Daylight continues to refine and develop the technology.

Like the InChI program, the Daylight SMILES language yields a computer-readable representation of a molecular structure. In fact, SMILES notation is "one of the most widely used formats for representing molecules in a computer," notes Terry Brunck, vice president of R&D at Daylight. SMILES can also be used to represent a reaction.

A SMILES representation can be generated by hand by someone who learns the language; it can also be created with software purchased or licensed from Daylight or other firms. Daylight's software uses an algorithm to produce definitive SMILES that "are relied upon by many pharma companies as the primary unique identifier in corporate databases containing several million structures," according to Brunck. SMILES are also used in publicly accessible databases such as Harvard University's Chembank and NIH's PubChem.

SMILES notation can be used to describe well-defined organic, inorganic, isotopic, and chiral molecules. For example, Daylight's unique SMILES for caffeine is Cn1cnc2n(C)c(=O)n(C)c(=O)c12. Here the equal sign represents a double bond; the parentheses indicate branching; C's are aliphatic carbons; c's and n's are aromatic carbons and nitrogens, respectively; and the digits represent ring closures.
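The reading rules just listed can be illustrated with a small Python sketch that tallies each kind of symbol in the caffeine string, written here with its final ring-closing aromatic carbon spelled out as Cn1cnc2n(C)c(=O)n(C)c(=O)c12. This is not a SMILES parser and not Daylight's software: the function `classify_smiles` is hypothetical, and it assumes single-letter atom symbols and single-digit ring closures, so it would fail on strings containing bracket atoms, charges, or other bond symbols.

```python
from collections import Counter


def classify_smiles(smiles):
    """Count symbol categories in a simple SMILES string.

    Hypothetical helper for illustration; it handles only single-letter
    atoms, '=', parentheses, and single-digit ring closures.
    """
    counts = Counter()
    for ch in smiles:
        if ch == "=":
            counts["double_bond"] += 1
        elif ch == "(":
            counts["branch"] += 1
        elif ch == ")":
            continue  # closes a branch that was already counted
        elif ch.isdigit():
            counts["ring_closure_digit"] += 1
        elif ch.islower():
            counts["aromatic_atom"] += 1   # e.g. c, n
        elif ch.isupper():
            counts["aliphatic_atom"] += 1  # e.g. C, O
        else:
            raise ValueError(f"unsupported symbol: {ch!r}")
    return counts


caffeine = "Cn1cnc2n(C)c(=O)n(C)c(=O)c12"
counts = classify_smiles(caffeine)
for category, number in sorted(counts.items()):
    print(category, number)
```

Each ring closure uses a matching pair of digits, so the four digits in the string correspond to caffeine's two fused rings, and the aromatic and aliphatic atoms together account for the 14 non-hydrogen atoms of C8H10N4O2.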

The SMILES language isn't suited for some applications, however. "It is not designed for describing noncovalent chemistry, including hydrogen bonding, van der Waals interactions, hydrophobic interactions, polymers, and complex mixtures such as beeswax," Brunck notes.

SMILES has some additional limitations. For one, the SMILES algorithms offered by firms other than Daylight can produce different identifiers for the same compound, Brunck says. Furthermore, SMILES algorithms are proprietary, unlike the InChI algorithm.

The open nature of the InChI system makes it more appealing to those who support the concept of open-access chemical information, including Henry S. Rzepa, a computational chemistry professor at Imperial College London. Rzepa is doing his part to promote the use of InChIs. He says he and his research team "now routinely add InChI identifiers to all molecules reported as part of our own research, making them part of the supporting information submitted to many ACS journals."

PICKY, PICKY InChIs can be as general or specific as desired. For instance, additional "layers" can be added to the basic InChI for tetrahydro-2-(methylthio)furan (top) to specify a particular isotope and stereoisomer.
  Chemical & Engineering News
ISSN 0009-2347
Copyright © 2005