Introduction to Bioinformatics: Lecture IV Sequence Similarity and Dynamic Programming. Jarek Meller Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC. Outline of the lecture.
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
Jarek Meller
Division of Biomedical Informatics,
Children’s Hospital Research Foundation
& Department of Biomedical Engineering, UC
JM  http://folding.chmcc.org
JM  http://folding.chmcc.org
With the knowledge about STSs and physical maps
(hopefully) acquired last week we can have another look
at the NCBI Map Viewer:
http://www.ncbi.nlm.nih.gov/mapview
http://www.ncbi.nlm.nih.gov/genome/guide/human/
JM  http://folding.chmcc.org
http://wwwigm.univmlv.fr/~lecroq/string/index.html
Christian Charras and Thierry Lecroq
JM  http://folding.chmcc.org
Phase 1: Preprocessing of the “text”
A string with m characters has m suffixes, which can be represented
as m leaves of a rooted directed tree. Consider for example T=cabca
c
b
$
a
4
c
a
b
a
c
$
b
$
a
5
c
$
3
1
a
$
2
For simplicity one leaf, due to the terminal character $, is not included.
Problem What is the reason for adding the terminal character?
JM  http://folding.chmcc.org
Phase II: Search
A substring of a string is a prefix of a suffix in that string. For example,
a substring P=ab is a prefix of the suffix abca in T=cabca. Thus, if P
occurs in T there is a leaf in the suffix tree that has a label starting with P.
c
b
$
a
4
c
a
b
a
c
$
b
$
a
5
c
$
3
1
a
$
2
Problem Does the size of the alphabet matter (and if so, how)?
Hint: how many edges may originate in a node, given that label of each
edge out of a node has to start with a different character?
JM  http://folding.chmcc.org
Consider for example two strings: T=cabca and U=bbcb.
U4
$
U3
$
$
c
b
b
U2
b
a
$
T4
c
a
a
b
$
c
b
b
$
a
T5
c
$
c
T3
b
T1
a
$
$
U1
T2
Remark By building the generalized suffix tree for a set of k strings of the total
length m one can find the longest prefixsuffix match for all pairs of strings in
O(m+k2) time (an additional trick is required for that).
JM  http://folding.chmcc.org
Hierarchical sequencing: physical maps, clone libraries and shotgun
(see Chapter 2 in “A Primer on Genome Science” by Gibson and Muse)
Definition The algorithmic problem of shotgun sequence assembly
is to deduce the sequence of the DNA string from a set of sequenced
and partially overlapping short substrings derived from that string.
Analogy to physical map assembly: DNA sequence of a substring may
be viewed as a precise ordered fingerprint (in analogy to STSs) and the
suffixprefix match determines if two substrings would be assembled
together.
In general, the shortest superstring problem (find the shortest string
that contains each string from a certain set of strings as its substring)
is NPhard and heuristics are being developed to address the problem.
JM  http://folding.chmcc.org
JM  http://folding.chmcc.org
An example: two globinlike sequences:
MSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE
MLS+GEWQLVL+VW KVEAD+ GHGQ++LIRLFK HPETLEKFD+FKHLK+E EMKASE
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE
DLKKHGVTVLTALGAILKKKGHHEAELKPFAQSHATKHKIPIKYLEFIAIIHVLHSRH
DLKKHG TVLTALG ILKKKGHHEAE KP AQSHATKHKIP+KYLEFI I VL S+H
DLKKHGATVLTALGGILKKKGHHEAEKPLAQSHATKHKIPVKYLEFISECIQVLQSKH
PGNFGADAQGAMNKALELFRKDIAAKYKELGYQG
PG+FGADAQGAMNKALELFRKD+A+ YKE
PGDFGADAQGAMNKALELFRKDMASNYKE
JM  http://folding.chmcc.org
g(g) =  g d
for a gap of length g and constant d
g(g) =  [ d + (g 1) e ]
where d is opening gap penalty and e an extension gap penalty.
JM  http://folding.chmcc.org
JM  http://folding.chmcc.org
All the possible alignments (with gaps) may be represented in the
Form of a DP graph (DP table). Consider an example with two
strings of length 2:
\ a1b1a2b2
\
 b1a1b2a2
\_
0
1
1
\ \ _
_  a1b1b2a2
_ a1a2b1b2
\

1
3
5
_ _
\  b1a1a2b2
\
1
5
13
 _ _ _ _ _ _ _
_ _  _ _   b1b2a1a2
  _
JM  http://folding.chmcc.org
Definition A string of length n+m, obtained by intercalating two strings S1 = a1a2…anandS2 = b1b2…bm , while preserving the order of the symbols in S1andS2, will be referred to as an intercalated string and denoted by S1/2. Note that S1 and S2are subsequences of S1/2 but in general they are not substrings of S1/2.
Definition Two alignments are called redundant if their score is identical. The relationship of “having the same score” may be used to define equivalence classes of nonredundant alignments.
For example, the class a1b1b2a2:
a1b1b2a2 a1a2 a1a2
b1b2 ; b1b2
JM  http://folding.chmcc.org
Lemma There is onetoone correspondence (bijection) between the set of the nonredundant gapped alignments of two strings S1andS2and the set of the intercalated strings {S1/2}.
Corollary The number of nonredundant gapped alignments of two strings, of length n and m, respectively, is equal to (n+m)!/[m!n!].
Proof Since the order of each of the sequences is preserved when intercalating them, we have in fact n+m positions to put m elements of the second sequence (once this is done the position of each of the elements of the first sequence is fixed unambiguously). Hence, the total number of intercalated sequences S1/2is given by the number of melement combinations of n+m elements and the corollary is a simple consequence of the onetoone correspondence between alignments and intercalated sequences stated in the lemma. QED
JM  http://folding.chmcc.org
Problem Consider for simplicity two strings of the same length and using the Stirling formula (x! ~ (2p)1/2 xx+1/2 ex ) show that:
(n+n)!/[n!n!] ~ 22n / (2pn)1/2
Note that for a very short by biology standards sequence of length n=50 one needs to perform about 1030 basic operations for an exhaustive search, making the naïve approach infeasible.
Dynamic programming provides in polynomial time an optimal solution for a class of optimization problems with exponentially scaling search space, including the approximate string matching.
JM  http://folding.chmcc.org