Introduction to bioinformatics lecture iv sequence similarity and dynamic programming
This presentation is the property of its rightful owner.
Sponsored Links
1 / 16

Introduction to Bioinformatics: Lecture IV Sequence Similarity and Dynamic Programming PowerPoint PPT Presentation


  • 143 Views
  • Uploaded on
  • Presentation posted in: General

Introduction to Bioinformatics: Lecture IV Sequence Similarity and Dynamic Programming. Jarek Meller Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC. Outline of the lecture.

Download Presentation

Introduction to Bioinformatics: Lecture IV Sequence Similarity and Dynamic Programming

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Introduction to bioinformatics lecture iv sequence similarity and dynamic programming

Introduction to Bioinformatics: Lecture IVSequence Similarity and Dynamic Programming

Jarek Meller

Division of Biomedical Informatics,

Children’s Hospital Research Foundation

& Department of Biomedical Engineering, UC

JM - http://folding.chmcc.org


Outline of the lecture

Outline of the lecture

  • Wrapping up the previous lecture: a quick look at the NCBI Map Viewer and suffix trees by way of an example

  • Inexact string matching: from generalizations of suffix trees to dynamic programming

  • The dynamic programming algorithm for sequence alignment: how it works

  • The dynamic programming algorithm for sequence alignment: why it works

  • Limitations and faster heuristic approaches

JM - http://folding.chmcc.org


Web watch ncbi map viewer

Web watch: NCBI Map Viewer

With the knowledge about STSs and physical maps

(hopefully) acquired last week we can have another look

at the NCBI Map Viewer:

http://www.ncbi.nlm.nih.gov/mapview

http://www.ncbi.nlm.nih.gov/genome/guide/human/

JM - http://folding.chmcc.org


Computationally efficient and elegant solutions for the exact string matching problem

Computationally efficient and elegant solutions for the exact string matching problem:

http://www-igm.univ-mlv.fr/~lecroq/string/index.html

Christian Charras and Thierry Lecroq

JM - http://folding.chmcc.org


The idea of the suffix tree method

The idea of the suffix tree method

Phase 1: Preprocessing of the “text”

A string with m characters has m suffixes, which can be represented

as m leaves of a rooted directed tree. Consider for example T=cabca

c

b

$

a

4

c

a

b

a

c

$

b

$

a

5

c

$

3

1

a

$

2

For simplicity one leaf, due to the terminal character $, is not included.

Problem What is the reason for adding the terminal character?

JM - http://folding.chmcc.org


Suffix tree based matching why does it work

Suffix tree based matching: why does it work?

Phase II: Search

A substring of a string is a prefix of a suffix in that string. For example,

a substring P=ab is a prefix of the suffix abca in T=cabca. Thus, if P

occurs in T there is a leaf in the suffix tree that has a label starting with P.

c

b

$

a

4

c

a

b

a

c

$

b

$

a

5

c

$

3

1

a

$

2

Problem Does the size of the alphabet matter (and if so, how)?

Hint: how many edges may originate in a node, given that label of each

edge out of a node has to start with a different character?

JM - http://folding.chmcc.org


Generalized suffix tree for a set of strings and the longest common substring problem

Generalized suffix tree for a set of strings and the longest common substring problem

Consider for example two strings: T=cabca and U=bbcb.

U4

$

U3

$

$

c

b

b

U2

b

a

$

T4

c

a

a

b

$

c

b

b

$

a

T5

c

$

c

T3

b

T1

a

$

$

U1

T2

Remark By building the generalized suffix tree for a set of k strings of the total

length m one can find the longest prefix-suffix match for all pairs of strings in

O(m+k2) time (an additional trick is required for that).

JM - http://folding.chmcc.org


Assembling dna from fragment and the suffix prefix matching problem

Assembling DNA from fragment and the suffix-prefix matching problem

Hierarchical sequencing: physical maps, clone libraries and shotgun

(see Chapter 2 in “A Primer on Genome Science” by Gibson and Muse)

Definition The algorithmic problem of shotgun sequence assembly

is to deduce the sequence of the DNA string from a set of sequenced

and partially overlapping short substrings derived from that string.

Analogy to physical map assembly: DNA sequence of a substring may

be viewed as a precise ordered fingerprint (in analogy to STSs) and the

suffix-prefix match determines if two substrings would be assembled

together.

In general, the shortest superstring problem (find the shortest string

that contains each string from a certain set of strings as its substring)

is NP-hard and heuristics are being developed to address the problem.

JM - http://folding.chmcc.org


Inexact or approximate string matching

Inexact or approximate string matching

  • Two major reasons for the importance of approximate matching in

  • computational molecular biology are:

  • Measurement (e.g. sequencing) errors and fuzzy nature of underlying

  • molecular processes (e.g. hybridization may occur despite some

  • mismatches)

  • ii) Redundancy in biology with evolutionary processes resulting in closely

  • related, yet, different sequences that require approximate matching in

  • order to detect their relatedness and identify variable as well as

  • conserved features that may reveal fingerprints of structure and function

  • Either generalizations of exact string matching methods, such as suffix trees,

  • or dynamic programming (or their heuristic combinations) are being used to

  • solve this problem.

JM - http://folding.chmcc.org


Redundancy in biological systems

Redundancy in biological systems

An example: two globin-like sequences:

--MSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE

MLS+GEWQLVL+VW KVEAD+ GHGQ++LIRLFK HPETLEKFD+FKHLK+E EMKASE

MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE

DLKKHGVTVLTALGAILKKKGHHEAELKPFAQSHATKHKIPIKYLEFI--AIIHVLHSRH

DLKKHG TVLTALG ILKKKGHHEAE KP AQSHATKHKIP+KYLEFI I VL S+H

DLKKHGATVLTALGGILKKKGHHEAE-KPLAQSHATKHKIPVKYLEFISEC-IQVLQSKH

PGNFGADAQGAMNKALELFRKDIAAKYKELGYQG

PG+FGADAQGAMNKALELFRKD+A+ YKE

PGDFGADAQGAMNKALELFRKDMASNYKE-----

  • Note that there are two types of mismatches:

  • Due to point mutations

  • Due to insertions and deletions (gaps)

JM - http://folding.chmcc.org


Gap penalties evolutionary and computational considerations

Gap penalties: evolutionary and computational considerations

  • Linear gap penalties:

    g(g) = - g d

    for a gap of length g and constant d

  • Affine gap penalties:

    g(g) = - [ d + (g -1) e ]

    where d is opening gap penalty and e an extension gap penalty.

JM - http://folding.chmcc.org


Dynamic programming algorithm for string alignment

Dynamic programming algorithm for string alignment

  • Our goal is to find an optimal matching for two strings S1 = a1a2…an and S2 = b1b2…bmover a certain alphabet S, given a scoring matrix s(a,b) for each a and b in Sand (for simplicity) a linear gap penalty

  • Relation to minimal edit distance (number of insertions, deletions and substitutions required to transform one string into the other) problem

  • The similarity measure (scoring matrix) should represent biological relatedness and separate true matches from random alignments (find more in Chapter 2 of “Biological Sequence Analysis” by Durbin et. al.)

JM - http://folding.chmcc.org


How many alignments are there

How many alignments are there?

All the possible alignments (with gaps) may be represented in the

Form of a DP graph (DP table). Consider an example with two

strings of length 2:

\  a1b1a2b2

\

| b1a1b2a2

\_

0

1

1

\ \ _

|_ |  a1b1b2a2

_ a1a2b1b2

\

|

1

3

5

|_ _

\ |  b1a1a2b2

\

1

5

13

| _ _ |_ _ |_ _ _

|_ _ | |_ |_ | | b1b2a1a2

| | |_

JM - http://folding.chmcc.org


Computing the number of alignments with gaps

Computing the number of alignments with gaps

Definition A string of length n+m, obtained by intercalating two strings S1 = a1a2…anandS2 = b1b2…bm , while preserving the order of the symbols in S1andS2, will be referred to as an intercalated string and denoted by S1/2. Note that S1 and S2are subsequences of S1/2 but in general they are not substrings of S1/2.

Definition Two alignments are called redundant if their score is identical. The relationship of “having the same score” may be used to define equivalence classes of non-redundant alignments.

For example, the class a1b1b2a2:

a1b1b2a2  a1-a2 a1a2-

b1b2- ; b1-b2

JM - http://folding.chmcc.org


Computing the number of alignments with gaps1

Computing the number of alignments with gaps

Lemma There is one-to-one correspondence (bijection) between the set of the non-redundant gapped alignments of two strings S1andS2and the set of the intercalated strings {S1/2}.

Corollary The number of non-redundant gapped alignments of two strings, of length n and m, respectively, is equal to (n+m)!/[m!n!].

Proof Since the order of each of the sequences is preserved when intercalating them, we have in fact n+m positions to put m elements of the second sequence (once this is done the position of each of the elements of the first sequence is fixed unambiguously). Hence, the total number of intercalated sequences S1/2is given by the number of m-element combinations of n+m elements and the corollary is a simple consequence of the one-to-one correspondence between alignments and intercalated sequences stated in the lemma. QED

JM - http://folding.chmcc.org


Computing the number of alignments with gaps2

Computing the number of alignments with gaps

Problem Consider for simplicity two strings of the same length and using the Stirling formula (x! ~ (2p)1/2 xx+1/2 e-x ) show that:

(n+n)!/[n!n!] ~ 22n / (2pn)1/2

Note that for a very short by biology standards sequence of length n=50 one needs to perform about 1030 basic operations for an exhaustive search, making the naïve approach infeasible.

Dynamic programming provides in polynomial time an optimal solution for a class of optimization problems with exponentially scaling search space, including the approximate string matching.

JM - http://folding.chmcc.org


  • Login