About MDD - Subscription Info
March 2002
Vol. 5, No. 3, pp 41, 43–45.
Table of Contents
MDD Home
sites and software

Picking out the pattern

Software such as the popular BLAST program is becoming a basic tool in bioinformatics.

opening artIn essence, bioinformatics uses computers and applies techniques of information technology to biotechnology. Fast computers have enabled much of the gene identification by providing a method of flexible searching through boatloads of data.

Flexible searching allows users to relax constraints on their search strings. By analogy, let’s say that you have a database of animals, and you want to search for pictures of dogs. But your search term is a picture of a collie. With an exact match, you would not retrieve pictures of many dogs, only pictures of collies. That would be fine for Lassie aficionados, but it would be disappointing for those who love Rin Tin Tin. However, with a more flexible search scheme, you could put in a general picture of a dog and retrieve all dog examples from the animal database.

Pattern matching
Most people who use computers understand and use pattern matching in some form. Search engines on the Web use pattern matching to locate information of interest. Patterns can be specific or quite general, using various wildcards that match multiple endings, words, or strings. Many databases have a similar capability, in which a character such as an asterisk is used as a wildcard.

This concept can be extended in many directions. For example, if you wanted to search for information related to “molecules”, you might want to include terms such as “molecule” and “molecular”. But rather than typing all possibilities, you could add an asterisk and search for “molecul*” to retrieve all three possibilities at once. In the Chemical Abstracts Service Molecular Database, researchers can use Markush structures attached to drawings of molecules to retrieve a base molecule, such as benzene, with any string attached, such as “OH”, “COOH”, and “Cl”. Most bioinformatics databases have similar pattern-matching capabilities. In bioinformatics, flexible pattern matching is called similarity searching.

Similarity is a percentage sequence match between nucleotide or protein sequences. The hypothesis is that similarity relates to functionality—if two sequences are similar, they will have related functionalities. Homology is a special type of similarity that results from having a common ancestry. Homology may provide no new information about functionality.

The genes for various functions, such as vision, may have similar sequences across different species. However, because evolution and mutation occur at different rates in different species, the exact sequences will vary, and so a perfect match is neither possible nor desirable. Only a similarity match makes sense for searching among nucleotides, genes, or proteins of similar functions. The probability of a match provided by similarity searches yields a more useful quantity for comparing sequences than an exact match.

There are many ways to achieve the pattern matching used in similarity searching. The basic local alignment search tool (BLAST) uses a maximal segment pair (MSP) measure. FASTA approximates results between BLAST and a Smith–Waterman algorithm. Other systems use methods such as hidden Markov models (HMMs) and neural networks (NNs) for pattern matching. Let’s review these pattern-matching methods in more detail and then look at some of the biotechnology systems that use them.

Various methods
The MSP calculation is a local similarity algorithm that seeks relatively constrained subsequence alignments. An MSP is the highest-scoring pair of segments of identical length chosen from two aligned sequences. BLAST heuristically calculates the MSP to provide a measure of local similarity for any pair of sequences.

Figure 1. (Top) BLAST searches to match three units in the query string
Figure 1. (Top) BLAST searches to match three units in the query string. If it finds a match, then (bottom) it fills in more detail from the query string to test for further matching.
BLAST. BLAST uses a general algorithm for pattern matching. It employs a simple rule-based comparison that looks for an exact match of three units in the query to the sequences in the database. When a match is found within some predefined threshold of fidelity, then the match is tested bidirectionally to locate the longest matching string. Matches that score better than the threshold result in a locally optimal ungapped alignment that BLAST reports as a valid retrieval.

Smith–Waterman algorithm. The Smith–Waterman algorithm, a complex matrix optimization relationship, generates an amino acid sequence alignment for sequence similarity.

This algorithm is used to search databases for sequences similar to a query sequence. It employs dynamic programming to determine an optimal alignment between the query sequence and a database sequence. This alignment is obtained by determining what transformations the query sequence would need to undergo to match the database sequence.

Transformations include substitution, insertion, and deletion of elements. Scores are assigned for comparisons, using positive numbers for exact matches and negative numbers for some transformations. Scores are obtained from statistically derived scoring matrices. The combination of transformations that results in the highest score is used to generate an alignment between the query sequence and database sequence (www.paracel.com/).

HMM. An HMM is used to statistically describe a protein family’s foundation sequence. It is used to compare protein sequences of known families by scoring the probability of a match being followed by insertions/deletions or the reverse. In addition, HMMs allow insertion-to-deletion transitions (and vice versa) and scoring of begin and end states to control whether a search is run globally or locally. HMM searches are similar to standard profile searches in that they perform position-dependent gap scoring. This statistical description can be used for sensitive and selective database searching. HMMs are well-studied algorithms used in fields as diverse as electronics for filtering and computer science for speech processing. They are fast and easy to train, and they can use position-specific scoring (www.paracel.com/).

NNs. A NN is an information-processing algorithm inspired by older models of how the brain works to process information. A NN is implemented as a collection of interconnected processing elements that are analogous to biological neurons and are tied together with weighted connections that are analogous to synapses. The weights of the connections are changed by training the NN on known patterns. The NN learns the desired patterns and stores them in its memory in the form of the weighted connections.

When the NN is presented with an unknown pattern, it compares the resulting connection weights to those of its trained configuration. When the difference between the known and unknown weights falls within a predefined threshold, the NN establishes a match. The calculations for this threshold can be quite sophisticated, allowing for robust pattern matching.

There are many other types of pattern-matching algorithms, but these examples and the following tools provide a good sample of pattern matching in bioinformatics.

Bioinformatics tools
BLAST is not one tool, but a collection of flexible pattern-matching search routines that compare queries with data in sequence databases. BLAST uses a general set of rules to compare specific regions of similarity for a given search string. BLAST is more than a precision matching tool. It provides a method for comparing the structures and functions of samples to genetic sequences and proteins found in the National Center for Biotechnology Information databases.

Rather than providing a global match, BLAST detects sequence similarities at a smaller, more local region of interest (www.ncbi.nlm.nih.gov/Education/blasttutorial.html).

Different versions of BLAST will search for various string characteristics. For example, Gapped-BLAST searches for insertions and deletions in a genetic sequence. PSI-BLAST allows iterative cycles of searching. PHI-BLAST lets the user define mandatory local patterns for a given sequence.

BLAST options. The collection of BLAST tools supports many different search options. Blastp takes an amino acid query sequence from the user and compares it with a protein sequence database. Blastn uses a nucleotide query sequence to retrieve information from a nucleotide sequence database. Tblastn uses an (unknown) protein query sequence to retrieve information from a nucleotide sequence database. Blastx compares an (unknown) nucleotide sequence with a protein sequence database. Tblastx uses a computationally intensive, flexible pattern match scheme to compare a general nucleotide query sequence with generalizations of a nucleotide sequence database. The last three tools allow for general template matching, which is used when the query sequence is not well defined.

A general template match allows users to search when they have incomplete information about a protein or nucleotide query sequence. For example, if the user knows the end points of a sequence but not the details in between, he or she can use these tools for searching. By analogy, imagine looking for a three-letter word when you know the first and last letters but not the middle, such as P*N. You can submit a search and retrieve several hits, such as PAN, PEN, PIN, PON, PUN, and PYN, and use other information to narrow down the choices. The BLAST tools provide this kind of capability and more. For example, with a relaxed constraint, P*N could retrieve PAIN, PLANTAIN, and so on.

Another kind of template match is called gap matching, which allows a researcher to match sequences with possible insertions or deletions. Using gap matching, P*N might retrieve PENCIL or PN. One reason that gaps are important in sequence matching is that they may indicate the presence of a mutation, opening a door to the discovery of new functionality. Therefore, the existence of a gap may be more important than its length.

Gap analysis. Gap matching is more important than first glance might reveal. Most pattern matching compares two things that are alike, such as two mammals, two birds, or even two aldehydes. But what if I am interested in similar functions? I might wish to compare the evolution of wings in birds, bugs, bats, pterodactyls, and manta rays.

An aerodynamicist might compare flow, a biologist might examine structure, and a geneticist might want to locate and compare genetic structure. However, the genes for wings vary greatly from species to species. Because of the different evolutionary paths and locations along the path (especially if we could extract pterodactyl DNA from amber), the variation in the genes for flight can be significant. Similarity searching using gap matching comes to the rescue.

For a given function, there is a good possibility of local matches in the genes. A bat may not match a bug, but there should be some analogue in each species. Fast computers and programs such as BLAST allow researchers the freedom to explore the potential for these common patterns. At the same time, gaps or insertions can be matched as a null, blank, or wildcard.

It is like matching two decks of cards. If you lay out all 52 cards, you can map another deck to them. If some cards are missing, the gaps are revealed. If instead you lay out an incomplete deck of 40 cards and map a full deck to it, then you will have to insert spaces to account for the 12 missing cards. Regardless of the gaps or insertions, the map is created. The same idea is behind gap matching.

FASTA. FASTA offers many of the same functionalities as BLAST. Although BLAST tools are faster, FASTA provides more accurate sequence alignments. BLAST uses a different algorithm, but its results are similar to those found by FASTA.

FASTA is superior to BLAST for translated DNA–protein comparison and DNA database searches because it calculates a single alignment that allows frameshifts. In contrast, BLAST performs forward-frame searches separately. By treating forward-reading frames as a single sequence, FASTA makes it much easier to produce high-quality alignments that extend the length of the protein sequence, resulting in improved sensitivity.

FASTA is a little more flexible than BLAST for DNA sequence searches. It provides small word sizes to accommodate polymerase chain reaction primers having short sequences. And FASTA uses several different scoring matrixes to help identify sequences of varying lengths.

PSIPRED. Position Specific Iterated Predict Secondary Structure (PSIPRED) is a simple and reliable secondary structure prediction method that uses neural networks to perform an analysis on output obtained from PSI-BLAST. The algorithm can average the output from up to four separate neural networks in the prediction process to further increase structure prediction accuracy (http://bioinf.cs.ucl.ac.uk/psipred/).

(For more on protein homology, see "Going for Fold in Asilomar" in the Nov/Dec 2000 Modern Drug Discovery.)

Hank Simon has worked with artificial intelligence and knowledge discovery systems for 22 years. He writes and teaches in northern California. Send your comments or questions regarding this article to mdd@acs.org or the Editorial Office by fax at 202-776-8166 or by post at 1155 16th Street, NW; Washington, DC 20036.

Return to Top || Table of Contents