From Suffix Trees to Approximate String Matching

Introduction to Bioinformatics: Lecture IV
Sequence Similarity and Dynamic Programming
Jarek Meller
Division of Biomedical Informatics,
Children’s Hospital Research Foundation
& Department of Biomedical Engineering, UC
JM - http://folding.chmcc.org
Outline of the lecture





Wrapping up the previous lecture: a quick look at the NCBI Map Viewer and suffix trees by way of an example
Inexact string matching: from generalizations of suffix trees to dynamic programming
The dynamic programming algorithm for sequence alignment: how it works
The dynamic programming algorithm for sequence alignment: why it works
Limitations and faster heuristic approaches
Web watch: NCBI Map Viewer
With the knowledge about STSs and physical maps
(hopefully) acquired last week we can have another look
at the NCBI Map Viewer:
http://www.ncbi.nlm.nih.gov/mapview
http://www.ncbi.nlm.nih.gov/genome/guide/human/
Computationally efficient and elegant solutions for
the exact string matching problem:
http://www-igm.univ-mlv.fr/~lecroq/string/index.html
Christian Charras and Thierry Lecroq
The idea of the suffix tree method
Phase I: Preprocessing of the “text”
A string with m characters has m suffixes, which can be represented
as m leaves of a rooted directed tree. Consider for example T=cabca
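[Figure: suffix tree for T = cabca with the terminal character $ appended.
Leaves are numbered by the starting position of the suffix they spell out:
1 = cabca$, 2 = abca$, 3 = bca$, 4 = ca$, 5 = a$.]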
For simplicity, the leaf for the suffix that consists of the terminal character $ alone is not shown.
Problem What is the reason for adding the terminal character?
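The preprocessing phase can be sketched in Python. This is a minimal sketch under simplifying assumptions, not the linear-time construction: it builds an uncompressed suffix trie by inserting every suffix character by character, which takes O(m^2) time and space. The names build_suffix_trie, leaves and LEAF are illustrative choices, not part of the lecture.

```python
LEAF = "leaf"  # hypothetical key under which a leaf stores its suffix start position


def build_suffix_trie(text, terminal="$"):
    """Naive O(m^2) suffix trie of text + terminal, stored as nested dicts."""
    if terminal in text:
        raise ValueError("the terminal character must not occur in the text")
    padded = text + terminal
    root = {}
    # Insert the m suffixes that start inside the text; as on the slide,
    # the suffix consisting of the terminal character alone is left out.
    for start in range(len(text)):
        node = root
        for ch in padded[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start + 1  # 1-based position, matching the leaf labels
    return root


def leaves(node, prefix=""):
    """Yield (path label, start position) for every leaf below node."""
    for key, child in node.items():
        if key == LEAF:
            yield prefix, child
        else:
            yield from leaves(child, prefix + key)


trie = build_suffix_trie("cabca")
for label, position in sorted(leaves(trie), key=lambda pair: pair[1]):
    print(position, label)  # 1 cabca$, 2 abca$, 3 bca$, 4 ca$, 5 a$
```

A true suffix tree additionally merges every chain of single-child nodes into one edge, which is what brings the size down to O(m).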
Suffix tree based matching: why does it work?
Phase II: Search
A substring of a string is a prefix of a suffix in that string. For example,
a substring P=ab is a prefix of the suffix abca in T=cabca. Thus, if P
occurs in T there is a leaf in the suffix tree that has a label starting with P.
[Figure: the suffix tree for T = cabca$ from the previous slide, with leaves 1 to 5.]
Problem Does the size of the alphabet matter (and if so, how)?
Hint: how many edges may originate in a node, given that the label of each
edge out of a node has to start with a different character?
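The search phase can be sketched as a walk from the root that consumes P one character at a time. The sketch below assumes the same naive, uncompressed trie as a stand-in for a real suffix tree; each node additionally records the start positions of all suffixes passing through it, so the occurrences of P can be read off the node where the walk ends. The names build_suffix_trie and find_occurrences are hypothetical.

```python
def build_suffix_trie(text, terminal="$"):
    """Naive suffix trie; every node lists the suffixes that pass through it."""
    padded = text + terminal
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in padded[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)  # 1-based suffix start position
    return root


def find_occurrences(root, pattern):
    """Return the 1-based start positions of pattern in the indexed text."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:  # the walk fell off the tree: no occurrence
            return []
    return sorted(node["starts"])


root = build_suffix_trie("cabca")
print(find_occurrences(root, "ab"))  # [2]
print(find_occurrences(root, "ca"))  # [1, 4]
print(find_occurrences(root, "cc"))  # []
```

The walk itself takes time proportional to the length of P. The dictionary lookup at each node hides the dependence on the alphabet size that the problem above asks about.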
Generalized suffix tree for a set of strings and the
longest common substring problem
Consider for example two strings: T=cabca and U=bbcb.
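[Figure: generalized suffix tree for T = cabca and U = bbcb, each terminated by $.
Leaves are labeled by the string and the starting position of the suffix: T1 to T5 and U1 to U4.]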
Remark By building the generalized suffix tree for a set of k strings of total
length m, one can find the longest suffix-prefix match for all pairs of strings in
O(m + k^2) time (an additional trick is required for that).
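For the two example strings, the longest common substring can be checked with a few lines of Python. Note that this sketch swaps in a simple quadratic dynamic programming table rather than a generalized suffix tree, which would solve the problem in linear time. The function name longest_common_substring is an illustrative choice.

```python
def longest_common_substring(t, u):
    """O(|t| * |u|) table method; returns one longest common substring."""
    best_len, best_end = 0, 0
    prev = [0] * (len(u) + 1)
    for i in range(1, len(t) + 1):
        curr = [0] * (len(u) + 1)
        for j in range(1, len(u) + 1):
            if t[i - 1] == u[j - 1]:
                # Extend the common substring that ends at t[i-2] and u[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return t[best_end - best_len:best_end]


print(longest_common_substring("cabca", "bbcb"))  # bc
```

In the generalized suffix tree the same answer corresponds to the deepest internal node that has leaves from both T and U below it.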
Assembling DNA from fragments and the suffix-prefix matching problem
Hierarchical sequencing: physical maps, clone libraries and shotgun
(see Chapter 2 in “A Primer on Genome Science” by Gibson and Muse)
Definition The algorithmic problem of shotgun sequence assembly
is to deduce the sequence of the DNA string from a set of sequenced
and partially overlapping short substrings derived from that string.
Analogy to physical map assembly: DNA sequence of a substring may
be viewed as a precise ordered fingerprint (in analogy to STSs) and the
suffix-prefix match determines if two substrings would be assembled
together.
In general, the shortest superstring problem (find the shortest string
that contains each string from a certain set of strings as its substring)
is NP-hard and heuristics are being developed to address the problem.
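One simple heuristic of this kind is greedy merging: repeatedly join the two fragments with the longest suffix-prefix match. The sketch below is an illustration under stated assumptions, not the method of any particular assembler. It uses a naive quadratic overlap computation instead of a generalized suffix tree, the function names overlap and greedy_superstring are hypothetical, and the three fragments are made up for the example.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Greedy heuristic: merge the pair with the largest overlap until one string is left."""
    unique = list(dict.fromkeys(fragments))  # drop duplicates, keep order
    # Fragments contained in another fragment add nothing to the superstring.
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


reads = ["ACGTAC", "TACGGA", "GGATTC"]  # made-up fragments
print(overlap("ACGTAC", "TACGGA"))  # 3
print(greedy_superstring(reads))    # ACGTACGGATTC
```

The greedy result always contains every fragment, but it is not guaranteed to be the shortest superstring.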
Inexact or approximate string matching
Two major reasons for the importance of approximate matching in
computational molecular biology are:
i) Measurement (e.g. sequencing) errors and the fuzzy nature of the underlying
molecular processes (e.g. hybridization may occur despite some mismatches)
ii) Redundancy in biology, with evolutionary processes resulting in closely
related yet different sequences that require approximate matching in
order to detect their relatedness and to identify variable as well as
conserved features that may reveal fingerprints of structure and function
Either generalizations of exact string matching methods, such as suffix trees,
or dynamic programming (or their heuristic combinations) are being used to
solve this problem.
Redundancy in biological systems
An example: two globin-like sequences:
--MSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE
M LS+GEWQLVL+VW KVEAD+ GHGQ++LIRLFK HPETLEKFD+FKHLK+E EMKASE
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE
DLKKHGVTVLTALGAILKKKGHHEAELKPFAQSHATKHKIPIKYLEFI--AIIHVLHSRH
DLKKHG TVLTALG ILKKKGHHEAE KP AQSHATKHKIP+KYLEFI    I VL S+H
DLKKHGATVLTALGGILKKKGHHEAE-KPLAQSHATKHKIPVKYLEFISEC-IQVLQSKH
PGNFGADAQGAMNKALELFRKDIAAKYKELGYQG
PG+FGADAQGAMNKALELFRKD+A+ YKE
PGDFGADAQGAMNKALELFRKDMASNYKE-----
Note that there are two types of mismatches:
i) Due to point mutations
ii) Due to insertions and deletions (gaps)
Gap penalties: evolutionary and
computational considerations

Linear gap penalties:
γ(g) = -g d
for a gap of length g and a constant d

Affine gap penalties:
γ(g) = -[d + (g - 1) e]
where d is the gap-opening penalty and e is the gap-extension penalty.
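The two penalty functions can be compared directly. In the sketch below the function names linear_gap and affine_gap and the parameter values d = 8 (linear) and d = 12, e = 2 (affine) are illustrative assumptions, not values taken from the lecture.

```python
def linear_gap(g, d):
    """Linear gap penalty: every gap position costs d."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -g * d


def affine_gap(g, d, e):
    """Affine gap penalty: d for opening the gap, e for each further position."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -(d + (g - 1) * e)


# Illustrative parameters only.
for g in (1, 2, 5, 10):
    print(g, linear_gap(g, d=8), affine_gap(g, d=12, e=2))
```

With these values a single long gap is much cheaper under the affine scheme than under the linear one (for g = 5: -20 versus -40), which favors one long insertion or deletion over several scattered ones.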
Dynamic programming algorithm for string alignment

Our goal is to find an optimal matching for two strings S1 =
a1a2…an and S2 = b1b2…bm over a certain alphabet Σ, given a
scoring matrix s(a,b) for each a and b in Σ and (for simplicity) a
linear gap penalty

Relation to the minimal edit distance problem (the number of
insertions, deletions and substitutions required to transform one
string into the other)

The similarity measure (scoring matrix) should represent
biological relatedness and separate true matches from random
alignments (find more in Chapter 2 of “Biological Sequence
Analysis” by Durbin et al.)
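The setting described above (two strings, a scoring function s(a,b) and a linear gap penalty d) can be sketched as a global alignment in the style of Needleman and Wunsch. This is a minimal sketch under stated assumptions: the function name global_alignment, the toy scoring functions and the example strings are illustrative, and whether the lecture goes on to use a global or a local variant is not stated here.

```python
def global_alignment(s1, s2, score, d):
    """Optimal global alignment score and one optimal alignment (linear gap penalty d)."""
    n, m = len(s1), len(s2)
    # F[i][j] is the best score for aligning s1[:i] with s2[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),  # match or mismatch
                F[i - 1][j] - d,                                # gap in s2
                F[i][j - 1] - d,                                # gap in s1
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]):
            top.append(s1[i - 1])
            bottom.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            top.append(s1[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(s2[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def identity_score(a, b):
    """Toy scoring: +1 for a match, -1 for a mismatch (an assumption for this example)."""
    return 1 if a == b else -1


print(global_alignment("cabca", "cbca", identity_score, d=1))  # (3, 'cabca', 'c-bca')
```

With a score of 0 for a match, -1 for a mismatch and d = 1, the negated optimal score is the minimal edit distance between the two strings. The table has (n+1)(m+1) cells and each cell is filled in constant time, so the running time is O(nm).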
How many alignments are there?
All the possible alignments (with gaps) may be represented in the
form of a DP graph (DP table). Consider an example with two
strings of length 2:
[Figure: DP graph for S1 = a1a2 (columns) and S2 = b1b2 (rows), with the
following values in the cells:

        a1   a2
    0    1    1
b1  1    3    5
b2  1    5   13

The paths through the graph are grouped under six intercalated strings:
a1b1a2b2, b1a1b2a2, a1b1b2a2, a1a2b1b2, b1a1a2b2 and b1b2a1a2.]
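The values 3, 5 and 13 in the table can be reproduced by counting paths through the DP graph, where each step moves right, down or diagonally. The sketch below assumes that reading of the figure. It also counts the empty path in the top-left corner as 1 (the slide shows 0 there), which is an assumption needed for the recurrence to produce the other values. The function name count_alignment_paths is hypothetical.

```python
def count_alignment_paths(n, m):
    """Number of paths from the top-left corner to each cell of an (n+1) x (m+1) DP graph."""
    # Along the first row and column there is exactly one path (gaps only).
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # A path enters cell (i, j) from above, from the left or diagonally.
            table[i][j] = table[i - 1][j] + table[i][j - 1] + table[i - 1][j - 1]
    return table


for row in count_alignment_paths(2, 2):
    print(row)
# [1, 1, 1]
# [1, 3, 5]
# [1, 5, 13]
```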
Computing the number of alignments with gaps
Definition A string of length n+m, obtained by intercalating two strings
S1 = a1a2…an and S2 = b1b2…bm , while preserving the order of the
symbols in S1 and S2, will be referred to as an intercalated string and
denoted by S1/2. Note that S1 and S2 are subsequences of S1/2 but in
general they are not substrings of S1/2.
Definition Two alignments are called redundant if their score is identical.
The relationship of “having the same score” may be used to define
equivalence classes of non-redundant alignments.
For example, the class a1b1b2a2:
a1b1b2a2 
a1-a2
b1b2- ;
a1a2b1-b2
Computing the number of alignments with gaps
Lemma There is one-to-one correspondence (bijection) between the set of
the non-redundant gapped alignments of two strings S1 and S2 and the set
of the intercalated strings {S1/2}.
Corollary The number of non-redundant gapped alignments of two strings,
of length n and m, respectively, is equal to (n+m)!/[m!n!].
Proof Since the order of each of the sequences is preserved when intercalating
them, we have in fact n+m positions to put m elements of the second sequence
(once this is done the position of each of the elements of the first sequence is fixed
unambiguously). Hence, the total number of intercalated sequences S1/2 is given by
the number of m-element combinations of n+m elements and the corollary is a
simple consequence of the one-to-one correspondence between alignments and
intercalated sequences stated in the lemma. QED
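The counting argument in the proof can be checked by brute force for small n and m: choose the m positions for the symbols of S2 among n+m slots and fill the remaining slots with S1 in order. The function name intercalated_strings is an illustrative choice.

```python
from itertools import combinations
from math import comb


def intercalated_strings(s1, s2):
    """All interleavings of s1 and s2 that preserve the order within each sequence."""
    n, m = len(s1), len(s2)
    results = []
    for positions in combinations(range(n + m), m):
        chosen = set(positions)  # slots taken by the symbols of s2
        it1, it2 = iter(s1), iter(s2)
        results.append(
            [next(it2) if k in chosen else next(it1) for k in range(n + m)]
        )
    return results


s1, s2 = ["a1", "a2"], ["b1", "b2"]
strings = ["".join(x) for x in intercalated_strings(s1, s2)]
print(len(strings), comb(len(s1) + len(s2), len(s2)))  # 6 6
print(sorted(strings))
```

For n = m = 2 this yields the six intercalated strings of the DP graph example, in agreement with (2+2)!/[2!2!] = 6.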
Computing the number of alignments with gaps
Problem Consider for simplicity two strings of the same length n. Using
the Stirling formula, x! ~ (2π)^(1/2) x^(x+1/2) e^(-x), show that:
(n+n)!/[n!n!] ~ 2^(2n) / (πn)^(1/2)
Note that even for a sequence of length n = 50, very short by biological
standards, there are about 10^29 such alignments, so an exhaustive search
would take at least that many basic operations, making the naïve approach infeasible.
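The asymptotic estimate can be checked numerically against the exact binomial coefficient. The sketch below compares (2n)!/[n!n!] with 2^(2n)/(πn)^(1/2), the value that the Stirling formula gives. The helper name central_binomial_estimate is hypothetical.

```python
from math import comb, pi, sqrt


def central_binomial_estimate(n):
    """Stirling-based estimate of (2n)! / (n! n!)."""
    return 4.0 ** n / sqrt(pi * n)


for n in (5, 10, 50):
    exact = comb(2 * n, n)
    approx = central_binomial_estimate(n)
    # Print n, the exact count, the estimate and their ratio.
    print(n, exact, f"{approx:.4e}", f"{approx / exact:.4f}")
```

For n = 50 the exact count is about 1.01 x 10^29, and the estimate is within one percent of it.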
Dynamic programming finds an optimal solution in polynomial time
for a class of optimization problems whose search space grows
exponentially, including approximate string matching.