Algorithms in bioinformatics
Algorithms in Bioinformatics (生物資訊相關演算法)

Hsueh-I Lu (呂學一), Institute of Information Science, Academia Sinica (中央研究院 資訊科學所)

http://www.iis.sinica.edu.tw/~hil/

Algorithms in Bioinformatics, Lecture 6


Today: adding wings to a tiger (如虎添翼)

  • A fundamental query that significantly strengthens the suffix tree

    • Range Minima Query (RMQ)

      • Front wing (前翼): RMQ for ±sequences.

      • Rear wing (後翼): RMQ for general sequences.

  • Intermission – 小巨’s magic show

    • “Amazing!”


Road map

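[Diagram: the topics of this lecture, each building on the one named in parentheses.]

Document listing (built on RMQ)

Wildcard matching and fuzzy matching (built on LCE)

LCE (built on LCA)

RMQ for general sequences (built on LCA)

LCA (built on ±RMQ)

±RMQ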


RMQ: Range Minima Query

S: a sequence of numbers.

小(S, i, j) = k if

i ≤ k ≤ j, and

S[k] = min(S[i], S[i+1], …, S[j]).

Example:

i = 1 2 3 4 5 6 7 8 9

S = 3 4 0 1 4 1 9 3 2

小(S, 2, 6) = 3

小(S, 4, 9) = 4 (or 6).


The RMQ challenge

  • Input: a sequence S of numbers

  • Output: a data structure D for S

  • Time complexity

    • Constant query time

      • Each query 小(S, i, j) for S can be answered from D and S in O(1) time.

    • Linear preprocessing time

      • D can be computed in O(|S|) time.


Naïve approach

  • Store the answer of 小(S, i, j) in a table for all index pairs i and j with 1 ≤ i ≤ j ≤ |S|.

  • Query time = O(1).

  • Preprocessing time = Ω(|S|^2), since the table has |S|(|S|+1)/2 entries.


Faster Preprocessing

  • Assumption (without loss of generality)

    • |S| = 2^k for some positive integer k.

  • Idea:

    • Precompute the values of 小(S, i, j) only for those indices i and j with j – i + 1 = 1, 2, 4, 8, …, 2^k = |S|.

  • Preprocessing time

    • O(|S| log |S|).


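小(S, i, j) still in O(1) time

Let k be the (unique) integer that satisfies 2^k ≤ j – i + 1 < 2^(k+1).

Then 小(S, i, j) is whichever of the following two precomputed answers has the smaller value in S:

x = 小(S, i, i + 2^k – 1), or

y = 小(S, j – 2^k + 1, j).

[Figure: two overlapping windows of length 2^k, one from i to i + 2^k – 1 and one from j – 2^k + 1 to j, together covering S[i…j].]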
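The power-of-two table and the two-window query can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the names `build_sparse_table` and `rmq` are made up here, indices are 0-based (the slides are 1-based), and ties go to the leftmost minimum.

```python
def build_sparse_table(S):
    """table[p][i] is the index of a minimum of S[i : i + 2**p] (0-based)."""
    n = len(S)
    table = [list(range(n))]                 # windows of length 1
    p = 1
    while (1 << p) <= n:
        half = 1 << (p - 1)
        prev = table[-1]
        row = []
        for i in range(n - (1 << p) + 1):
            a, b = prev[i], prev[i + half]   # minima of the two half windows
            row.append(a if S[a] <= S[b] else b)
        table.append(row)
        p += 1
    return table                             # O(|S| log |S|) entries in total


def rmq(S, table, i, j):
    """Index of a minimum of S[i..j], inclusive, 0-based, assuming i <= j."""
    k = (j - i + 1).bit_length() - 1         # 2**k <= j - i + 1 < 2**(k + 1)
    x = table[k][i]                          # window starting at i
    y = table[k][j - (1 << k) + 1]           # window ending at j
    return x if S[x] <= S[y] else y


S = [3, 4, 0, 1, 4, 1, 9, 3, 2]
T = build_sparse_table(S)
print(rmq(S, T, 1, 5) + 1)                   # the query 小(S, 2, 6), printed 1-based: 3
```

Unlike the slide, the sketch does not need |S| to be a power of two, because each row simply stops where a window of that length no longer fits.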


As a result

  • RMQ

    • Input: O(n) numbers

    • Preprocessing: O(n log n) time

    • Query: O(1) time

  • RMQ

    • Input: O(n/log n) numbers

    • Preprocessing: O(n) time

    • Query: O(1) time


Front wing (前翼)

The RMQ challenge for ±sequences


±sequences

  • S is a ±sequence if S[i] – S[i – 1] = ±1 for each index i with 2 ≤ i ≤ |S|.

  • For example,

    • S = 5 6 5 4 3 2 3 2 3 4 5 6 5 6 7

      steps: + - - - - + - + + + + - + +

    • S = 3 4 3 2 1 0 -1 -2 -1 0 1 2 1

      steps: + - - - - - - + + + + -


Front wing (前翼): The RMQ challenge for ±sequences

  • Input: a ±sequence S of numbers

  • Output: a data structure D for S

  • Time complexity

    • Constant query time

      • Each query 小(S, i, j) for S can be answered from D and S in O(1) time.

    • Linear preprocessing time

      • D can be computed in O(|S|) time under the unit-cost RAM model.


Unit-Cost RAM model

  • Operations such as addition, subtraction, and comparison on O(log n) consecutive bits can be performed in O(1) time.


Idea: compression

  • Break S into blocks of length L = ½ log |S|. (Any constant c < 1 in place of ½ is OK.)

    • There are B = 2|S|/log |S| blocks.

  • Let 縮[t] be the minimum of the t-th block of S.

    • 縮[t] = min {S[j] | (t – 1)L < j ≤ tL} for t = 1, 2, …, B.

    • Computable in O(|S|) time.

  • RMQ on 縮: 小(縮, x, y)

    • O(1) query time.

    • O(|S|) preprocessing time. (Why?)


小(S, i, j) via 小(縮, s, t)

Suppose S[i] is in the α-th block of S, i.e., (α – 1)L < i ≤ αL.

Suppose S[j] is in the γ-th block of S, i.e., (γ – 1)L < j ≤ γL.

Let β = 小(縮, α + 1, γ – 1).

Then 小(S, i, j) is one of

小(S, i, αL),

小(S, (γ – 1)L + 1, j), and

小(S, (β – 1)L + 1, βL).

Note that each of these three is a query within a length-L block.


Illustration

[Figure: S divided into length-L blocks, with i in block α, j in block γ, and block β between them.]


小(S, i, j) within a block

  • It remains to show how, with the help of some linear-time preprocessing, to answer 小(S, i, j) in O(1) time for any indices i and j such that (t – 1)L < i ≤ j ≤ tL for some positive integer t.


Difference sequence

  • The difference sequence 差 of S is defined as follows: 差[i] = S[i+1] – S[i].

    • Since S is a ±sequence, each 差[i] = ±1.

    • 小(S, i, j) can be determined from 差[i…j–1].

    • The number of distinct patterns of a length-L difference sequence is exactly 2^L = |S|^½.


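Preprocessing all patterns

  • Build one lookup table with a row for each difference pattern and a column for each index pair (i, j) within a block.

    • #row = |S|^½

    • #col = ¼ log^2 |S|

  • Each entry is computable in O(log |S|) time.

  • Total preprocessing: o(|S|) time.

  • Answering each 小(S, i, j) then takes O(1) time.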
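The whole ±RMQ scheme (block minima, an RMQ over the block minima, and one lookup table per difference pattern) can be put together roughly as below. This is a sketch under stated assumptions rather than the lecture's implementation: the class and helper names are invented, indices are 0-based, the pattern tables live in a Python dict keyed by the tuple of ±1 steps instead of a bit-packed machine word, and the block length is floor(log2 |S|) / 2 rounded down.

```python
def sparse_table(keys):
    """rows[p][i] = index of a minimum of keys[i : i + 2**p]."""
    rows = [list(range(len(keys)))]
    p = 1
    while (1 << p) <= len(keys):
        prev, half = rows[-1], 1 << (p - 1)
        rows.append([min(prev[i], prev[i + half], key=keys.__getitem__)
                     for i in range(len(keys) - (1 << p) + 1)])
        p += 1
    return rows


def sparse_query(keys, rows, i, j):
    k = (j - i + 1).bit_length() - 1
    return min(rows[k][i], rows[k][j - (1 << k) + 1], key=keys.__getitem__)


class PlusMinusRMQ:
    def __init__(self, S):
        self.S, n = S, len(S)
        self.L = L = max(1, (n.bit_length() - 1) // 2)   # about (1/2) log2 |S|
        starts = range(0, n, L)
        # 縮: position in S of each block's minimum, with a sparse table on top.
        self.block_min = [min(range(s, min(s + L, n)), key=S.__getitem__)
                          for s in starts]
        self.block_key = [S[m] for m in self.block_min]
        self.rows = sparse_table(self.block_key)
        # One in-block table per distinct difference pattern (差).
        self.pattern, self.tables = [], {}
        for s in starts:
            block = S[s:s + L]
            sig = tuple(b - a for a, b in zip(block, block[1:]))
            self.pattern.append(sig)
            if sig not in self.tables:
                self.tables[sig] = self._in_block_table(sig)

    @staticmethod
    def _in_block_table(sig):
        """table[a][b] = offset of a minimum among offsets a..b of the pattern."""
        h = [0]
        for d in sig:
            h.append(h[-1] + d)               # heights relative to the block start
        table = [[0] * len(h) for _ in h]
        for a in range(len(h)):
            best = a
            for b in range(a, len(h)):
                if h[b] < h[best]:
                    best = b
                table[a][b] = best
        return table

    def _in_block(self, t, i, j):
        base = t * self.L
        return base + self.tables[self.pattern[t]][i - base][j - base]

    def query(self, i, j):
        """Index of a minimum of S[i..j], inclusive, 0-based, assuming i <= j."""
        L = self.L
        a, g = i // L, j // L                 # blocks α and γ
        if a == g:
            return self._in_block(a, i, j)
        cands = [self._in_block(a, i, (a + 1) * L - 1),
                 self._in_block(g, g * L, j)]
        if a + 1 <= g - 1:                    # block β exists only if there is a gap
            b = sparse_query(self.block_key, self.rows, a + 1, g - 1)
            cands.append(self.block_min[b])
        return min(cands, key=self.S.__getitem__)


S = [5, 6, 5, 4, 3, 2, 3, 2, 3, 4, 5, 6, 5, 6, 7]
print(S[PlusMinusRMQ(S).query(0, 14)])        # minimum value of the whole sequence: 2
```

The dict keyed by step tuples is chosen for readability; packing each pattern into an integer of L bits is what makes the table lookup a single O(1) word operation in the unit-cost RAM model.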


LCA: Lowest Common Ancestor

An application of RMQ for ±sequences


Lowest Common Ancestor

  • T is a rooted tree.

  • 祖(x, y) is the lowest (i.e., deepest) node of T that is an ancestor of both node x and node y.


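For example, …

[Figure: a rooted tree with root 1; node 1 has children 2 and 7, node 2 has children 3 and 4, and node 4 has children 5 and 6.]

  • 祖(5, 7) = 1

  • 祖(3, 6) = 2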


The challenge for 祖(x, y)

  • Input: an n-node rooted tree T.

  • Output: a data structure D for T.

  • Requirement:

    • D can be computed in O(n) time.

    • Each query 祖(x, y) for T can be answered from D in O(1) time.


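Idea: depth-first traversal

Record the node V[i] and its level L[i] at every step i of a depth-first traversal of T:

i = 1 2 3 4 5 6 7 8 9 10 11 12 13

V = 1 2 3 2 4 5 4 6 4 2 1 7 1

L = 1 2 3 2 3 4 3 4 3 2 1 2 1

If V[i] = x and V[j] = y, then 祖(x, y) = V[小(L, i, j)].

[Figure: the example tree with root 1; node 1 has children 2 and 7, node 2 has children 3 and 4, and node 4 has children 5 and 6.]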


Idea: depth-first traversal

i = 1 2 3 4 5 6 7 8 9 10 11 12 13

V = 1 2 3 2 4 5 4 6 4 2 1 7 1

L = 1 2 3 2 3 4 3 4 3 2 1 2 1

x = 1 2 3 4 5 6 7

I = 1 2 3 5 6 8 12

祖(x, y) = V[小(L, I[x], I[y])].

O(n)-time preprocessing:

  • Compute V and L.

  • Preprocess L for queries 小(L, i, j). Note that L is a ±sequence.

  • Precompute an array I such that V[I[x]] = x for each node x.
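The reduction from 祖(x, y) to a range-minimum query can be sketched in Python on the example tree. The assumptions here are: the tree is given as a hypothetical `children` dictionary, indices are 0-based, and a linear scan stands in for the O(1) ±RMQ structure on L, so this version does not achieve constant query time.

```python
def euler_tour(children, root):
    """Return (V, L, I): visited nodes, their levels, first-visit index per node."""
    V, L, I = [], [], {}

    def visit(node, level):
        I.setdefault(node, len(V))           # first position where node appears
        V.append(node)
        L.append(level)
        for child in children.get(node, []):
            visit(child, level + 1)
            V.append(node)                   # come back to node after each child
            L.append(level)

    visit(root, 1)
    return V, L, I


def lca(V, L, I, x, y):
    i, j = sorted((I[x], I[y]))
    k = min(range(i, j + 1), key=L.__getitem__)   # stand-in for 小(L, i, j)
    return V[k]


children = {1: [2, 7], 2: [3, 4], 4: [5, 6]}      # the tree from the slides
V, L, I = euler_tour(children, 1)
print(V)                                          # [1, 2, 3, 2, 4, 5, 4, 6, 4, 2, 1, 7, 1]
print(lca(V, L, I, 5, 7), lca(V, L, I, 3, 6))     # 1 2
```

The recursive traversal is fine for small examples; for very deep trees an explicit stack would avoid Python's recursion limit.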


Idea: depth-first traversal

祖(x, y) = V[小(L, I[x], I[y])].

Query time is clearly O(1).


Example

祖(x, y) = V[小(L, I[x], I[y])], with V, L, and I as above.

  • 祖(5, 7) = V[小(L, 6, 12)] = V[11] = 1

  • 祖(3, 6) = V[小(L, 3, 8)] = V[4] = 2


LCE: Longest Common Extension

An application of LCA queries 祖(i, j).


Longest Common Extension

  • Suppose A and B are two strings.

  • Let 延(i, j) be the largest number d such that A[i…i+d–1] = B[j…j+d–1], i.e., the length of the longest common prefix of the suffixes A[i…] and B[j…].

  • Example

    • A = a b a b b a

    • B = b b a a b b b

    • 延(1,1) = 0, 延(2,1) = 1,

    • 延(2,2) = 2, 延(3,4) = 3.
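As a reference for the definition, here is a naive character-by-character version of 延(i, j) in Python. It is only a checking aid with a made-up name, `lce_naive`; it takes time proportional to the answer, not the O(1) per query that the lecture aims for.

```python
def lce_naive(A, B, i, j):
    """延(i, j) with 1-based i and j: longest common prefix of A[i..] and B[j..]."""
    d = 0
    while (i - 1 + d < len(A) and j - 1 + d < len(B)
           and A[i - 1 + d] == B[j - 1 + d]):
        d += 1
    return d


A, B = "ababba", "bbaabbb"
print(lce_naive(A, B, 1, 1), lce_naive(A, B, 2, 1),
      lce_naive(A, B, 2, 2), lce_naive(A, B, 3, 4))   # 0 1 2 3
```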


The challenge for 延(i, j)

  • Input: two strings A and B.

  • Objective: output a data structure D for A and B in O(|A|+|B|) time such that each query 延(i, j) can be answered from D in O(1) time.


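Idea: Suffix Tree for A#B$

  • Build the suffix tree of the concatenation A#B$.

  • Let x be the leaf of the suffix starting at position i of A#B$ (an A-suffix).

  • Let y be the leaf of the suffix starting at position j + |A| + 1 of A#B$ (a B-suffix).

  • The depth of 祖(x, y), measured in characters, is exactly 延(i, j).

[Figure: the suffix tree of A#B$, with leaves x and y hanging below their lowest common ancestor 祖(x, y).]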


Wildcard Matching

An application of longest common extension 延(i, j)


Wildcard Matching

  • Input: two strings P and S,

    • where P has k wildcard characters ‘?’, each of which can match any character of S.

  • Output: all occurrences of P in S.


Naïve algorithm

  • Suppose S has t distinct characters.

  • Naïve algorithm:

    Construct the suffix tree of S;

    For each of the t^k possibilities of P do

      Output the occurrences of P in S;

  • Time complexity = Ω(|S| + t^k |P|).


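Wildcard Matching via longest common extension

Suppose j_1 < j_2 < … < j_k are the indices such that

P[j_1] = P[j_2] = … = P[j_k] = ‘?’.

Here 延 is the longest common extension between S and P. P matches S[i…i+|P|–1] if and only if

延(i, 1) ≥ j_1 – 1;

延(i + j_1, j_1 + 1) ≥ j_2 – j_1 – 1;

延(i + j_2, j_2 + 1) ≥ j_3 – j_2 – 1;

…;

延(i + j_(k-1), j_(k-1) + 1) ≥ j_k – j_(k-1) – 1; and

延(i + j_k, j_k + 1) ≥ |P| – j_k.

[Figure: P aligned under S starting at position i, with the wildcard positions j_1, j_2, …, j_k marked between 1 and |P|.]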
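The segment-by-segment test can be sketched in Python as follows. The names `lce` and `wildcard_occurrences` are illustrative, positions are 1-based as on the slide, and `lce` is a naive stand-in for the O(1) suffix-tree query, so the sketch shows the logic but not the O(k|S|) running time.

```python
def lce(S, P, i, j):
    """Naive 延(i, j) between S and P, 1-based; 0 if either index is past the end."""
    d = 0
    while (i - 1 + d < len(S) and j - 1 + d < len(P)
           and S[i - 1 + d] == P[j - 1 + d]):
        d += 1
    return d


def wildcard_occurrences(S, P, wildcard="?"):
    """1-based start positions i such that P matches S[i .. i+|P|-1]."""
    wild = [j for j, c in enumerate(P, start=1) if c == wildcard]
    bounds = [0] + wild + [len(P) + 1]
    # Solid pieces of P between consecutive wildcards: (start in P, length).
    segments = [(lo + 1, hi - lo - 1) for lo, hi in zip(bounds, bounds[1:])]
    hits = []
    for i in range(1, len(S) - len(P) + 2):
        # One 延 query per segment, i.e. k + 1 queries per alignment.
        if all(lce(S, P, i + start - 1, start) >= length
               for start, length in segments):
            hits.append(i)
    return hits


print(wildcard_occurrences("acbaabxab", "a?b"))   # [1, 4]
```

Each alignment i is accepted exactly when every solid piece of P is covered by a common extension of at least its own length, which is the list of inequalities above with the last bound equal to the number of characters after the final wildcard.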


O(k|S|) time

  • O(|P|+|S|) = O(|S|) time: preprocessing for supporting each 延(i, j) query in O(1) time.

  • O(|S|) iterations, each takes time O(k).


Fuzzy Matching

Another application of longest common extension 延(i, j).


Fuzzy Matching

  • Input: an integer k and two strings P and S

  • Output: all “fuzzy occurrences” of P in S, where each “fuzzy occurrence” allows at most k mismatched characters.


Fuzzy occurrences

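  • Whether P occurs in S[i…i+|P|–1] with k or fewer errors can be determined as follows, where j counts the characters of P handled so far and 延 is taken between S and P (and is 0 past the end of either string):

    • j = 延(i, 1); error = 0;

    • while (j < |P|)

      • if (++error > k) then return “no”;

      • j += 1 + 延(i + j + 1, j + 2);

    • return “yes”.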
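A direct Python transcription of this loop, applied to every alignment i, might look like the sketch below. As assumptions for illustration, `lce` is again a naive stand-in for the O(1) query, positions are 1-based, and the function names are invented.

```python
def lce(S, P, i, j):
    """Naive 延(i, j) between S and P, 1-based; 0 if either index is past the end."""
    d = 0
    while (i - 1 + d < len(S) and j - 1 + d < len(P)
           and S[i - 1 + d] == P[j - 1 + d]):
        d += 1
    return d


def fuzzy_occurrences(S, P, k):
    """1-based positions i where P matches S[i .. i+|P|-1] with at most k mismatches."""
    hits = []
    for i in range(1, len(S) - len(P) + 2):
        j = lce(S, P, i, 1)                 # characters of P handled so far
        errors = 0
        while j < len(P):
            errors += 1                     # P[j+1] mismatches S[i+j]
            if errors > k:
                break
            j += 1 + lce(S, P, i + j + 1, j + 2)   # skip the mismatch, extend again
        else:
            hits.append(i)                  # loop ended without exceeding k errors
    return hits


print(fuzzy_occurrences("abcabd", "abd", 1))   # [1, 4]
```

Each alignment makes at most k + 1 extension queries, which is where the O(k) cost per position comes from once 延 is answered in constant time.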


O(k|S|) time

  • O(|P|+|S|) = O(|S|) time: preprocessing for supporting each 延(i, j) query in O(1) time.

  • O(|S|) iterations, each takes time O(k).


小巨’s magic show

  • Amazing


Rear wing (後翼): The RMQ (i.e., 小(S, i, j)) challenge for general sequences

Another application of lowest common ancestor


The RMQ challenge

  • Input: a sequence S of numbers

  • Output: a data structure D for S

  • Time complexity

    • Constant query time

      • Each query 小(S, i, j) for S can be answered from D and S in O(1) time.

    • Linear preprocessing time

      • D can be computed in O(|S|) time.


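Idea: Minima Tree

i = 1 2 3 4 5 6 7 8 9

S = 4 3 2 4 1 7 3 6 3

  • 小(S, i, j) = 祖(i, j).

[Figure: the minima tree of S, whose nodes are the indices 1 to 9. The root is 5; node 5 has children 3 and 7, node 3 has children 2 and 4, node 2 has child 1, node 7 has children 6 and 9, and node 9 has child 8.]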
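To see why 小(S, i, j) = 祖(i, j) holds on this example, the minima tree can be built straight from its recursive definition: the root is a minimum of the whole range, and its subtrees are the minima trees of the parts to its left and right. The Python sketch below does exactly that. It is deliberately naive (quadratic in the worst case), is not a linear-time construction, uses 0-based indices, breaks ties toward the leftmost minimum, and uses invented function names.

```python
def minima_tree(S):
    """parent[k] for each index k (0-based); the root has parent -1."""
    parent = [-1] * len(S)

    def build(lo, hi, par):
        if lo > hi:
            return
        k = min(range(lo, hi + 1), key=S.__getitem__)   # leftmost minimum of S[lo..hi]
        parent[k] = par
        build(lo, k - 1, k)
        build(k + 1, hi, k)

    build(0, len(S) - 1, -1)
    return parent


def lca(parent, x, y):
    """Lowest common ancestor by walking up parent pointers (slow, for checking only)."""
    seen = set()
    while x != -1:
        seen.add(x)
        x = parent[x]
    while y not in seen:
        y = parent[y]
    return y


S = [4, 3, 2, 4, 1, 7, 3, 6, 3]
parent = minima_tree(S)
print([p + 1 for p in parent])        # 1-based parents: [2, 3, 5, 3, 0, 7, 5, 9, 7]
print(lca(parent, 5, 8) + 1)          # 祖(6, 9) in 1-based terms: 7
```

The printed parents match the figure (0 marks the root), and S[7] = 3 is indeed the minimum of S[6…9].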


Homework 3

  • Problem 1: Grow the suffix tree for “b a a b c b b a b c”. Draw the intermediate tree with growing point and suffix links for each step of Ukkonen’s algorithm, as we did in the class. (You may turn in a ppt file with animation for this problem. )

  • Problem 2: Show how to construct a minima tree for any sequence S of numbers in O(|S|) time.

  • Due

    • 100%: 11:59pm, Nov 4, 2003

    • 50%: 1:10pm, Nov 11, 2003


Reminder about HWs

  • Don’t turn in code for homework unless you are explicitly asked to do so.

  • An extra-credit implementation has to be demo-able on the Web, so plain code does not count.


Listing source strings that contain a pattern string [Muthukrishnan, SODA’02]

An application of RMQ for general sequences


The problem

  • Input:

    • Strings S1, S2, …, Sm, which can be preprocessed in linear time.

    • A string P.

  • Output:

    • The index j of each Sj that contains P.


Preliminary attempts

  • Obtaining the suffix tree for S1#S2#…#Sm$.

    • Find all occurrences of P.

      • I.e., exact string matching for S1#S2#…#Sm$ and P.

      • Time = O(|P| + total number of occurrences of P).

  • Obtaining the suffix tree for each Si.

    • Determining whether P occurs in Si.

      • I.e., substring problem for each pair Si and P.

      • Time = O(|P|m).


The challenge

  • Input:

    • Strings S1, S2, …, Sm, which can be preprocessed in linear time.

    • A string P.

  • Output:

    • The index j of each Sj that contains P.

  • Objective

    • O(|P| + 現(P)) time, where 現(P) is the number of output indices.


The second attempt

  • Construct the suffix tree for S1#S2#…#Sm$.

  • Keep the distinct descendant leaf colors for each internal node, where the color of a leaf is the index j of the string Sj in which its suffix starts.

  • Query time?

  • Preprocessing time?


The second attempt

  • Each query takes O(|P| + 現(P)) time. (Why?)

  • The preprocessing may need Ω(m |S1#S2#…#Sm$|) time. (Why?)

  • Q: Any suggestions for resolving this problem?


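Compact Representation

  • Keep the list 彩 of leaf colors from left to right.

  • Each internal node keeps the indices of its leftmost and rightmost descendant leaves.

[Figure: a tree over leaves 1 to 8 in which every internal node is labeled with its leaf interval; the root is labeled 1,8, and the other internal nodes are labeled 1,4; 5,8; 2,4; 6,8; 3,4; and 6,7.]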


The challenge of listing distinct colors

  • Input: a sequence 彩 of colors.

  • Output: a data structure D for 彩 such that

    • D is computable in O(|彩|) time.

    • Each 顏(i, j) = {彩[i], …, 彩[j]} query can be answered from D in O(|顏(i, j)|) time.


An auxiliary index array

  • Let 前[i] = 0 if 彩[j] ≠ 彩[i] for all j < i.

  • Otherwise, let 前[i] be the largest index j with j < i such that 彩[i] = 彩[j].

  • For example:

    i  = 1 2 3 4 5 6 7 8

    前 = 0 0 0 2 3 1 5 6


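An observation

  • A color c is in 顏(i, j) if and only if there is an index k in [i, j] such that

    • 彩[k] = c and 前[k] < i.

  • Such an index k is the first occurrence of c in 彩[i…j], so each color in 顏(i, j) has exactly one such k.

  • Example array:

    i  = 1 2 3 4 5 6 7 8

    前 = 0 0 0 2 3 1 5 6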


The algorithm 解(i, j)

  • Just call the recursive subroutine 破(i, j, i);

  • Subroutine 破(p, q, 左界):

    • If (p > q) then return;

    • Let k = 小(前, p, q);

    • If (前[k] ≥ 左界) then return;

    • Output 彩[k];

    • Call 破(p, k – 1, 左界);

    • Call 破(k + 1, q, 左界);
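The recursion can be tried out in Python as below. Several things are assumptions made for illustration: the color list `"abcbcaca"` is simply one sequence that yields the 前 array 0 0 0 2 3 1 5 6 used in the slides, the function names are invented, the stopping test compares the value 前[k] (not the index k) with the left boundary, and a linear scan stands in for the O(1) query 小(前, p, q).

```python
def build_prev(colors):
    """前 as a 1-based list: prev[i] = largest j < i with the same color, else 0."""
    last = {}
    prev = [0] * (len(colors) + 1)          # index 0 is unused
    for i, c in enumerate(colors, start=1):
        prev[i] = last.get(c, 0)
        last[c] = i
    return prev


def distinct_colors(colors, prev, i, j):
    """List each color of colors[i..j] (1-based, inclusive) exactly once."""
    out = []

    def solve(p, q):
        if p > q:
            return
        k = min(range(p, q + 1), key=prev.__getitem__)   # stand-in for 小(前, p, q)
        if prev[k] >= i:                    # no first occurrence left in [p, q]
            return
        out.append(colors[k - 1])
        solve(p, k - 1)
        solve(k + 1, q)

    solve(i, j)
    return out


colors = list("abcbcaca")
prev = build_prev(colors)
print(prev[1:])                             # [0, 0, 0, 2, 3, 1, 5, 6]
print(distinct_colors(colors, prev, 3, 7))  # ['c', 'a', 'b']
```

Every call either outputs a new color or returns immediately, and each output spawns at most two further calls, which is the intuition behind the O(|顏(i, j)|) bound once the range-minimum query is constant time.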


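解(3,7) = 破(3, 7, 3)…

i  = 1 2 3 4 5 6 7 8

前 = 0 0 0 2 3 1 5 6

  • 破(3, 7, 3): k = 3 and 前[3] = 0 < 3, so output 彩[3].

  • 破(3, 2, 3): empty range, return.

  • 破(4, 7, 3): k = 6 and 前[6] = 1 < 3, so output 彩[6].

  • 破(4, 5, 3): k = 4 and 前[4] = 2 < 3, so output 彩[4].

  • 破(4, 3, 3): empty range, return.

  • 破(5, 5, 3): k = 5 and 前[5] = 3 ≥ 3, return.

  • 破(7, 7, 3): k = 7 and 前[7] = 5 ≥ 3, return.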


Time = O(|顏(i, j)|)

  • Why?
