Lecture notes: Molecular Bioinformatics 2001, Uppsala University

Lecture 22 Jan 2001 Per Kraulis

Sequence patterns based on regular expressions (such as those in
PROSITE) have **a problem with large multiple alignments** of divergent
families: as more sequences are added, the probability that even a few
invariant or strongly conserved sites remain will
diminish. **There will always be an exception to the
rule**. To avoid missing a known member of a family,
the regexp has to be made more general, but then the danger of
including false positives ("garbage") increases. This is the typical
sensitivity-specificity problem.
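
The trade-off can be illustrated with a small sketch. The sequences and both patterns below are invented for illustration (they are not real PROSITE entries): a strict pattern misses one known member, while the generalized pattern recovers it but also accepts a non-member.

```python
import re

# Hypothetical toy data: three known family members and two non-members.
members = ["GDSGGP", "GDSGAP", "GNSGGP"]
non_members = ["GQSGAP", "AAAAAA"]

# Strict pattern: position 2 must be D.
strict = re.compile(r"GDSG[GA]P")
# Generalized pattern: position 2 may be any residue.
general = re.compile(r"G.SG[GA]P")

def hits(pattern, seqs):
    """Return the sequences in which the pattern occurs."""
    return [s for s in seqs if pattern.search(s)]

# The strict pattern misses the divergent member GNSGGP (lower sensitivity).
strict_members = hits(strict, members)
strict_false = hits(strict, non_members)

# The general pattern finds all members but also hits GQSGAP
# (lower specificity).
general_members = hits(general, members)
general_false = hits(general, non_members)

print("strict :", strict_members, strict_false)
print("general:", general_members, general_false)
```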

There is another approach. **Sequence profiles**
(Gribskov et al. 1987) are essentially patterns where **each position
in the sequence of the segment (or motif) has been assigned a
probability** value for each possible amino-acid residue
type. Instead of requiring a yes/no response to the question "does the
amino acid in the sequence fit the pattern?", we now get a response
"it fits at a level of 0.9", or "it fits at a level of 0.1". The idea is
to make the process softer: add the soft responses together into an
overall sum, and only then make a decision. Don't make the decision at
each comparison step.
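
A minimal sketch of this idea, under stated assumptions: the alignment, the threshold, and the function names `build_profile` and `score` are all hypothetical, and the per-position values are plain observed frequencies that are simply summed. Practical profile methods usually work with log-odds scores instead; this only shows the "sum first, decide last" principle.

```python
from collections import Counter

# Hypothetical toy alignment of a short motif (one sequence per row).
alignment = ["GDSGGP", "GDSGAP", "GNSGGP", "GDSGGP"]

def build_profile(seqs):
    """For each column, map residue -> observed frequency in that column."""
    profile = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        profile.append({aa: n / len(seqs) for aa, n in counts.items()})
    return profile

def score(profile, seq):
    """Sum the per-position 'fit levels'; unseen residues contribute 0."""
    return sum(column.get(aa, 0.0) for column, aa in zip(profile, seq))

profile = build_profile(alignment)
THRESHOLD = 4.5  # assumed cutoff, chosen only for this toy example

for candidate in ["GDSGGP", "GQSGGP", "AAAAAA"]:
    total = score(profile, candidate)
    print(candidate, total, "member" if total >= THRESHOLD else "not a member")
```

Here `GQSGGP` has a residue at position 2 that never occurs in the alignment, yet it still passes because the other positions fit well. A regular expression built from the same alignment would have rejected it outright.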

One can use **an analogy**: An exam for students can be
designed so that a correct answer is required for each and every
question in the exam, although each question may be fairly
simple. This corresponds to the regular expression approach. Another
type of exam gives points for each correct answer, sums up all points
at the end, and decides whether the student has passed or not based on
the sum. This corresponds to profiles. A student may be unable to
answer one particular question, but can make up for it by answering
other questions correctly.

This approach works as if a substitution matrix had been defined for
each position in the sequence. Building such a profile **requires that
the alignment contains many sequences**, which should be as varied
as the family really is. Good statistics are necessary. If some parts
of the family tree are missing, then the profile will not give high
scores to members from that part of the tree. It is therefore common to
add in information from a Dayhoff-type substitution matrix (or similar);
this is like mixing a pure position-dependent matrix with a pure
general substitution matrix.
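
One way such mixing could look is sketched below. Everything here is an assumption made for illustration: the four-letter alphabet, the numbers in `GENERAL` (made-up substitution probabilities, not Dayhoff values), the linear mixing formula, and the `weight` parameter. Actual profile methods differ in how they combine the two sources.

```python
ALPHABET = "DENK"  # toy alphabet, for brevity

# Hypothetical general substitution probabilities P(b | a); each row sums to 1.
# These numbers are invented and are NOT taken from a Dayhoff matrix.
GENERAL = {
    "D": {"D": 0.60, "E": 0.20, "N": 0.15, "K": 0.05},
    "E": {"D": 0.20, "E": 0.60, "N": 0.10, "K": 0.10},
    "N": {"D": 0.15, "E": 0.10, "N": 0.65, "K": 0.10},
    "K": {"D": 0.05, "E": 0.10, "N": 0.10, "K": 0.75},
}

def mix_column(observed, weight):
    """Blend observed column frequencies with what the general matrix
    predicts from them. weight=0 keeps the pure position-specific column;
    weight=1 uses only the general substitution matrix."""
    mixed = {}
    for b in ALPHABET:
        expected = sum(observed.get(a, 0.0) * GENERAL[a][b] for a in ALPHABET)
        mixed[b] = (1 - weight) * observed.get(b, 0.0) + weight * expected
    return mixed

# A column where the alignment happens to contain only D.
observed = {"D": 1.0}
mixed = mix_column(observed, weight=0.5)
print(mixed)
```

With the pure position-specific column, a sequence carrying E at this site would score zero. After mixing, E receives a small non-zero value, and a larger one than K, because the general matrix treats D to E as a more likely substitution than D to K.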

Copyright © 2001 Per Kraulis $Date: 2001/01/22 12:14:20 $