Stockholm Bioinformatics Center, SBC
Lecture notes: Structural biochemistry and bioinformatics 2001

Lecture 13 Nov 2001, Per Kraulis

Hidden Markov models

1. Regular expressions

A multiple alignment is very useful for a lot of different analysis, such as identifying patterns in the sequence that have some connection with the structure and/or the function of the protein domain. However, a large multiple alignment is difficult to handle, and we may want to condense the information in it down to the bare minimum.

One way of doing that is to extract an explicit description of the pattern of conserved residues that we can identify in a multiple alignment. How can we represent a pattern of residues as found in a multiple alignment? And how can we use such a pattern to search for it in other protein sequences?

Computer sciencists have devised a formalism to describe the kind of patterns we need: regular expressions (svenska: "reguljära uttryck"). Sometimes the term is abbreviated to regexp. Regular expressions can be used to describe languages of a particular, restricted kind. Ordinary human languages do not fall into this category; they are too complex.

In our case, let us view the sequence of a protein (or DNA) as a sentence in a specific, small language. We can then define a particular regular expression (also called a grammar) that fits the given protein sequence. This regexp can then be used to test other sequences, to see whether they fit the pattern or not.

It is essential to understand that either a string fits a regular expression, or it doesn't. There is no in between. This is both a strength and a weakness: It makes it possible to implement regular expression searches in a very efficient manner (using algorithms that behave as so-called finite state automata). On the other hand, if the pattern is wrong about just one single character, then it will fail to match strings that ought to be found. There is no such thing as 'nearly matched' for a regular expression.

It is of course not necessary to derive a regular expression from a given real sequence. One can make it up. For example, maybe we want to know if there are any sequences in SWISS-PROT that contain the tetrapeptide Cys-Xxx-Cys-His, where Xxx is any amino-acid residue. Then we can design a regular expression for this pattern, and test all sequences against it.

Regular expressions are very useful in solving many different tasks involving text searching and pattern recognition. A standard notation for describing regular expressions has evolved in the UNIX world, for example in standard system programs such as grep ('General REgular exPression'; for searching in text files). This notation employs a number of special characters to specify particular patterns. The ordinary alphanumerical characters just specify themselves, without any special meaning.

Here are a number of UNIX-type regular expressions for different patterns, and the strings that were searched. This is not a complete list; check out the documentation for the modules that you are using. There are even entire books describing all the intricate details of regular expressions. Although the basic regular expressions look simple enough, it can be a quite difficult task to design a regexp that accurately implements a complicated pattern, particularly if the pattern contains complex repetitive elements.

Regular expression Description String Result Comment
bc Strings of ordinary alphanumerical characters match literally. abcde Matches! Proper substring
ac No match Not a proper substring
b.d Dot '.' matches any single character. abcde Matches! Substring bcd matches
a.d No match No substring having characters 'a-any-d'.
^abc Caret (or hat) '^' matches the beginning of a string. abcde Matches! The substring is at the beginning of the string.
^bcd No match The substring is not at the beginning.
abc$ Dollar '$' matches the end of a string. abcde No match The substring is not at the end of the string.
cde$ Matches! The substring is at the end.
a[bcd]e Characters within brackets '[]' are choices: match one of them. ade Matches! 'd' is one of the possible choices.
aee No match 'e' is not among the choice characters.
ab?c A question mark '?' matches the preceding pattern zero or one times. abcde Matches! 'b?' matched one time.
acde Matches! 'b?' matches zero times.
abbcde No match More than one 'b'.
ab*d Multiplication sign '*' matches zero or more repetitions of the preceding pattern. abd Matches! One b matches 'b*'
abbbd Matches! Three b's match 'b*'
ad Matches! No 'b' is necessary.
acd No match The 'c' makes a match impossible.
ab+d Plus sign '+' matches one or more repetitions of the preceding pattern. abbbd Matches! Three 'b' fits the pattern.
ad No match There must be at least one 'b'.
ab{2,4}c Curly brackets '{}' match the preceding pattern the between the specified minimum or maximum number of times. abcd No match Just one 'b'; not enough.
abbcd Matches! Two 'b' fits the pattern.
abbbbbcd No match Five 'b' are too many.

There are many software modules available for performing searches in text data using a regular expressions. For example, the script language Perl (often used in bioinformatics) has regular expression handling as a sophisticated built-in feature, and the object-oriented script language Python has a module re for regular expression handling.

The technical details of using regular expressions in these languages can be very complex (depending on what you want to do). It is possible to create regexps that are terribly comples, and are very difficult to analysis 'by hand'. If you need some complex pattern recognition the you need to be very careful, and test the regexp extensively.

Copyright © 2001 Per Kraulis $Date: 2001/11/09 15:30:28 $