C&EN logo The Newsmagazine of the Chemical World
Science & Technology
April 25, 2005
Volume 83, Number 17
pp. 24-29


Modelers in need of proprietary compounds seek ways to share information, but not structure
MASKING CAFFEINE Different strategies for masking the true structure of the test molecule caffeine (stick, top left, and highest occupied molecular orbitals, top right) include representing its lipophilic potential and electrostatic potential (bottom left and right).

  It's ironic that one of the things a drug company is least likely to do--give out information about its jealously guarded lead compounds--could slash the formidable amount of money and time it spends to develop new drugs.

Tudor Oprea, a biocomputing professor at the University of New Mexico School of Medicine, Albuquerque, and founder of Sunset Molecular Discovery, would like to see drug companies get over the fear that the unfettered exchange of compound data would mean the certain pirating of their intellectual property.

TALKING IT OVER Lipinski (left) replies to a point made by Oprea.


What's more important, he says, is that the woefully sparse and disparate biochemical toxicology and metabolic information that currently exists be pooled and made available to data-starved modelers in the chemoinformatics arena. Then they could finally put together decent models to predict potential drugs' safety and success--an imperative goal highlighted by the recent recalls of the blockbuster drugs Vioxx and Baycol.

But try telling that to the drug companies, who keep their compound libraries under lock and key and for whom data sharing is a rare, tightly controlled practice. "The data represent a significant financial and intellectual asset. You're asking us to potentially give up very large, high-valued information," said Michael D. Miller, senior director of computational chemistry and molecular properties at Pfizer Global R&D in Groton, Conn.

Most people on both sides of the issue understand the dilemma.

"IF YOU SPEND 15 years and a billion dollars to make a compound, you're not going to give away your most important asset," noted Osman F. Güner, senior director of lead identification and optimization at Accelrys in San Diego.

Oprea's mission, then, is to try to ease worried minds. What would make such cooperative data-exchange ventures possible, he said, would be to devise computational ways to mask the structures of chemical compounds while still providing modelers with enough about the compound's properties to develop accurate predictive methods for absorption, distribution, metabolism, excretion, and toxicology (ADME/Tox) studies.

Is an "undeducible" structure, much like an unbreakable security code, possible? And would knowing only part of a compound's story, such as its general shape or molecular weight, be enough to develop a good model?

These questions have generated a lively debate. Oprea teamed up with prominent medicinal chemist Christopher A. Lipinski, an adjunct senior research fellow at Pfizer Global R&D, to organize a symposium on the possibility of the "safe exchange" of chemical information at the American Chemical Society national meeting in San Diego last month. The symposium was cosponsored by the Divisions of Chemical Information and of Computers in Chemistry.

The topic generated interest not just among chemists. Biologists, statisticians, and software designers also attended the symposium, suggesting strategies for making data "safe," ranging from fuzzying up data to carefully selecting the types of properties that can be disclosed to representing molecules as general shapes, such as topomers.

Running through the symposium was the basic argument of whether such safe sharing is an idealistic pipe dream, akin to unilateral nuclear disarmament, or the inevitable face of progress in chemoinformatics and drug discovery. Some speakers said that--given time, powerful computers, and luck--a determined hacker could ultimately deduce even the most fuzzily masked chemical structure. The flip side is the fear that if you make the data too fuzzy, then the models might be fuzzy as well. Others, though, firmly believe that structures can be kept safe enough, much as most trust that their online bank password won't be hacked.

Oprea's hope "is to get experts to agree that, within certain limits, you can devise methods to make it safe to extract chemical information without the possibility of reverse-engineering the structure," he said.

Many of the strategies presented at the meeting revolved around the use of molecular descriptors: properties such as molecular weight, number of hydrogen-bond donors and acceptors, or number of rotatable bonds that can be grouped together to help represent biological or pharmacological features of a molecule. A lot of work also goes into developing specialized descriptors that represent collections of properties such as connectivity and polarizability. The computation of these descriptors is no small task. The descriptors can then be used to model the behavior of compounds in biological settings.

Scientists debated the types of descriptors that can best convey the most relevant information and how many and what kinds could allow someone to reverse-engineer the molecule's structure. For example, certain combinations of descriptors tend to point toward unique compounds.

For Robert S. Pearlman, director of the Laboratory for Development of Computer-Assisted Drug Discovery Software at the College of Pharmacy of the University of Texas, Austin, the picture is quite simple: One need only lop off a few significant figures from a descriptor such as a partition coefficient, like log P, and then it's basically impossible to determine the structure.

Depending on its precision, molecular weight, for example, could be all that's needed to deduce isomers of a structure: There aren't many structural possibilities for a compound having molecular weight of, say, 424.972. But if the number is given as a range from 424.5 to 425.5, many more compounds could fit the bill.

"There's a huge number of compounds with molecular weight between 424.5 and 425.5," Pearlman said. "But we've got all the information we really need from a scientific perspective."

He points out that the accuracy of actual data can vary more than the accuracy of the descriptor, and so rounding off descriptors shouldn't interfere with the predictability of the models. "It doesn't make sense to worry about the uncertainty of a descriptor that might be part of the equation, any more than we worry about the uncertainty with which we were able to measure the property we're trying to predict with the equation," he said.
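Pearlman's molecular-weight argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `coarsen` and `candidates` are made up here, and the toy library holds invented weights rather than real compounds. It shows how reporting 424.972 as the range 424.5 to 425.5 widens the pool of compounds that fit.

```python
def coarsen(value, width=1.0):
    """Return the (low, high) bin of the given width centered on the
    multiple of `width` nearest to `value`."""
    center = round(value / width) * width
    return (center - width / 2, center + width / 2)


def candidates(library, low, high):
    """Return the library weights that fall inside [low, high)."""
    return [mw for mw in library if low <= mw < high]


# Hypothetical molecular weights, invented for illustration only.
toy_library = [418.90, 424.61, 424.972, 425.12, 425.49, 430.20]

# A precise value singles out one entry in this toy library...
exact = candidates(toy_library, 424.9715, 424.9725)

# ...while the coarsened range admits several.
low, high = coarsen(424.972)
binned = candidates(toy_library, low, high)

print(len(exact), "match at full precision;", len(binned), "in the range", (low, high))
```

In a real compound collection the widened range would admit far more structures than this six-entry list suggests, which is the point Pearlman was making.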

Ruben Abagyan, molecular biology professor at Scripps Research Institute, San Diego, introduced the idea of adding artificial noise to descriptors. Though pertinent information about the compounds would be retained, "it would be practically impossible to guess the real molecule," he said. And, he pointed out, some compounds have enough tautomers to "confuse the enemy."
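A rough sketch of the noise idea follows. The article does not say what kind of noise Abagyan had in mind, so the bounded uniform noise used here is an assumption, as are the function name `add_noise`, the descriptor names, and the log P value.

```python
import random


def add_noise(descriptors, scale, seed=None):
    """Return a copy of `descriptors` with bounded uniform noise added.

    `scale[name]` is the largest shift allowed for that descriptor.
    Uniform noise is an illustrative choice, not the speaker's method.
    """
    rng = random.Random(seed)
    return {
        name: value + rng.uniform(-scale[name], scale[name])
        for name, value in descriptors.items()
    }


# Hypothetical descriptor values for a single compound.
true_values = {"mol_weight": 424.972, "log_p": 2.31}
noise_scale = {"mol_weight": 0.5, "log_p": 0.2}

masked = add_noise(true_values, noise_scale, seed=0)
print(masked)
```

How large the noise can be before models degrade is exactly what the symposium speakers disagreed about, so the scales above are placeholders.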

Other types of descriptors have potential to serve as safe exchange media, such as Screens (sets of molecular fragments), described by Nikolay Osadchiy of ChemDiv, San Diego, and SIBAR (similarity-based structure-activity relationship) descriptors described by Gerhard F. Ecker, pharmaceutical chemistry professor at the University of Vienna.

Oprea's group has tested a number of descriptor systems on structures from Sunset Molecular Discovery's WOMBAT (World of Molecular Bioactivity), a database of molecules and their associated biological properties collected from medicinal chemistry literature. They found that structures can't always be deduced from the descriptors--and that implies, they said, that it would be possible to create a safe descriptor set. Oprea has submitted a proposal to the National Institutes of Health, as part of NIH's Roadmap program, to develop just such a set of uncrackable descriptors.

Vladimir V. Poroikov, head of the Laboratory for Structure-Function-Based Drug Design at the Russian Academy of Medical Science's Institute of Biomedical Chemistry, is more pessimistic, however. "Any meaningful information about chemical structures can be used to search for either a particular compound itself or its close analogs," he said. 


If you don't believe it's possible to conceal a molecule's structure in a collection of chemical information, Tudor Oprea would like you to step up to the plate.

Oprea, biocomputing professor at the University of New Mexico School of Medicine, Albuquerque, and founder of Sunset Molecular Discovery, is challenging computational chemists to reverse-engineer chemical structures from a set of molecular descriptors of their choosing.

As part of the ChemMask project, Oprea's group will generate the contestant's descriptors for structures associated with known properties, in most cases taken from Sunset's WOMBAT database. They'll send the descriptors to the contestants, who will then try to deduce the molecules' structures.

Oprea announced the contest at a symposium on the possibility of the safe exchange of chemical information at the American Chemical Society national meeting in San Diego last month. The symposium was cosponsored by the Divisions of Chemical Information and of Computers in Chemistry.

Interested modelers can go to the ChemMask website (pimento.health.unm.edu/index.html) or contact Oprea directly.

"I hope we might generate enough buzz that some in the industry might be willing to look into this," Oprea said.

THE ABILITY to deduce the structure depends only on the number of descriptors and database coverage, Poroikov said. He cited his group's experiments using the National Cancer Institute's (NCI) database of compounds and biological activities, which "clearly demonstrated that using only such simple descriptors as molecular weight and log P makes it possible to identify either the molecule itself or at least the chemical class of interest."
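The kind of lookup Poroikov describes might look something like the sketch below. It is not his group's method: the database entries, class labels, scaling factors, and the function name `nearest_entry` are all hypothetical, and a simple nearest-neighbor search stands in for whatever was actually run against the NCI data.

```python
import math

# Hypothetical reference entries: (name, chemical class, mol. weight, log P).
toy_database = [
    ("compound_A", "class_1", 194.2, -0.1),
    ("compound_B", "class_1", 180.2, 0.0),
    ("compound_C", "class_2", 206.3, 3.9),
    ("compound_D", "class_3", 424.9, 2.3),
]


def nearest_entry(mol_weight, log_p, database, weight_scale=1.0, log_p_scale=0.1):
    """Return the database entry closest to the query descriptors.

    Each descriptor difference is divided by an assumed scale so that
    molecular weight does not swamp log P in the distance.
    """
    def distance(entry):
        _, _, mw, lp = entry
        return math.hypot((mw - mol_weight) / weight_scale,
                          (lp - log_p) / log_p_scale)

    return min(database, key=distance)


# Slightly imprecise descriptors for an undisclosed compound still land
# on the nearest entry and its chemical class in this toy database.
name, chem_class, _, _ = nearest_entry(194.0, -0.2, toy_database)
print(name, chem_class)
```

The broader the database coverage, the more likely it is that some entry sits close to the masked compound, which is why Poroikov singles out coverage as a deciding factor.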

Poroikov is also concerned that methods that rely on decreasing the accuracy of descriptors will yield "a dramatic decrease in the accuracy of models," he said.

Adding more complexity to the picture is the fact that drug companies aren't the only ones with secrets to hide. Software developers also want to guard their descriptors, lest their modeling programs also be reverse-engineered. "When we first introduced predictive models as black boxes, we were very concerned about revealing which descriptors we used in building models," Güner said. He proposed that both could keep their information secret in the following scenario: Companies could evaluate models by doing calculations on a large number of public descriptors, then sending those files to the modelers. The modelers would use the descriptor combination they'd developed to run their model, then send those results back to the company.
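Güner's two-way exchange can be sketched as a pair of functions, one for each party. Everything named here is an assumption made for illustration: the descriptor names, the compound labels, the weights, and the use of a simple linear model in place of a vendor's actual software.

```python
def company_export(compound_descriptors):
    """Company side: drop compound identities and keep only the rows of
    public descriptor values, in a fixed order."""
    return [dict(row) for row in compound_descriptors.values()]


def modeler_predict(rows, secret_weights, intercept=0.0):
    """Modeler side: score each row using only the descriptors the
    modeler has privately chosen. A linear model is an assumption."""
    return [
        intercept + sum(weight * row[name] for name, weight in secret_weights.items())
        for row in rows
    ]


# Hypothetical in-house compounds with a broad set of public descriptors.
in_house = {
    "lead_001": {"mol_weight": 424.972, "h_bond_donors": 2, "rotatable_bonds": 5},
    "lead_002": {"mol_weight": 310.4, "h_bond_donors": 1, "rotatable_bonds": 2},
}

# Step 1: the company sends descriptor rows, not structures or names.
shared_rows = company_export(in_house)

# Step 2: the modeler applies a private descriptor combination.
secret_weights = {"h_bond_donors": 0.5, "rotatable_bonds": -0.25}
predictions = modeler_predict(shared_rows, secret_weights)

# Step 3: the company matches the returned scores to its own compounds.
results = dict(zip(in_house, predictions))
print(results)
```

The company never learns which descriptors mattered, and the modeler never sees a structure. Whether the descriptor rows themselves could be reverse-engineered is the separate question the rest of the symposium addressed.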

A LOT OF IMPORTANT information about a molecule's biological activity is contained very generally in its shape. A long-used molecular representation known as a topomer could be a useful tool, said Richard D. Cramer III, chief scientific officer of Tripos, St. Louis. Cramer originally developed topomers to address shape similarity for the very large virtual data sets used in combinatorial chemistry. Though topomer representations are proprietary, they would work well for safe exchange, Cramer said. "Shape similarity is not enough of a description to get back chemical structure," he said.

Anthony Nicholls, president and chief executive officer of OpenEye Scientific Software, Santa Fe, N.M., also leans toward shape and electrostatic profile as safe encoders of chemical information.

Meeting-goers noted that, while they argue over data sharing with companies, a wealth of public chemical information is available for the tapping, from public databases such as NIH Roadmap's PubChem and NCI's database to the Science IP service from Chemical Abstracts Service, which archives literature on some 25 million compounds.

But proprietary data are still the most coveted, and many scientists are trying to convince companies that sharing would be a benefit for them, too. For example, Lipinski explained, company scientists might want to try some software that could give them useful predictions about a compound, but they can't send the structure out over the Internet, so maybe they send known compounds from the literature. "But that's not terribly satisfying," he said. "You never know how it will work with the compounds you really care about."

Companies like Pfizer, with large groups of in-house modelers, don't think that helps much. "As a scientist, I have my doubts as to whether you could ever really guarantee that something could be unentangleable," Miller pointed out.

Some meeting participants observed that the debate still reflects an inevitable shift toward the acceptance of data exchange. Despite his doubts that data sharing can ever be made truly safe, Poroikov believes that companies will eventually relax their policies, "taking into account the increasing safety requirements of society and the fact that 'pharma people' can also become patients."

Some researchers believe the paradigm is already starting to shift. For example, Lhasa Ltd., in Leeds, England, is a nonprofit organization that curates toxicology and metabolic information. Lhasa's Derek and Meteor databases draw on a wide variety of resources, including proprietary data from collaborating companies. Without revealing confidential information, Lhasa still provides scientists with possible links between toxicity and metabolism and structure. "Our business is to try to be as open as we can," said Philip N. Judson, chief scientist at Lhasa.

"I'm hoping that, even though there's a very small chance that the structure can be figured out, pharmaceutical companies won't hesitate to share data in exchange for value they'll be getting back: better predictive models and the savings of millions of dollars," Güner said. He likened the risk to the probability that conference-goers would be hit by lightning when they left the room. The possibility, though remote, does exist. Yet, "we are not staying inside the house all the time," he said.

Lipinski pointed out that the ultimate predictor of safety may be cost. "If it costs more to crack the code of descriptors than the [actual] value of the compound, then it will probably not be done," he said. "It's true in the real world, but will it carry weight with the lawyers?"

Ultimately, though, an even more serious problem confronts modelers: the variability of data from different sources, not only from lab to lab, but from day to day within the same lab. "Any biologist in a pharmaceutical company will tell you it's a challenge for a scientist to measure the affinity of a given compound for a particular target on two different days in their own lab and get the same value," Pearlman said.

The only way to surmount that problem would be to completely standardize the industry. Though such a task will be enormous, it does have precedent: The methods for determining melting points, for example, are relatively consistent, and results vary little from lab to lab. "But clearly, standardization of data will be the next bottleneck," Güner said.

  Chemical & Engineering News
ISSN 0009-2347
Copyright © 2005
