
Algorithms in Bioinformatics


Hsueh-I Lu 呂學一 (Institute of Information Science, Academia Sinica)

http://www.iis.sinica.edu.tw/~hil/

Algorithms in Bioinformatics, Lecture 6

- A fundamental query that significantly strengthens suffix trees
  - Range Minima Query (RMQ)
    - Part I: RMQ for ±sequences.
    - Part II: RMQ for general sequences.
- Intermission: 小巨’s magic show
  - “Amazing!”

[Figure: roadmap of this lecture. ±RMQ supports LCA; LCA supports both LCE and general RMQ; LCE supports wildcard matching and fuzzy matching; general RMQ supports document listing.]

S: a sequence of numbers.

小(S, i, j) = k if i ≤ k ≤ j and S[k] = min(S[i], S[i+1], …, S[j]). (小 means "minimum"; the query returns an index, not a value.)

index: 1 2 3 4 5 6 7 8 9
S    = 3 4 0 1 4 1 9 3 2

小(S, 2, 6) = 3

小(S, 4, 9) = 4 (or 6).
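The definition can be checked with a direct scan. The following is a minimal sketch, not part of the lecture: the name `rmq_naive` is an illustrative assumption, indices are 1-based as on the slides, and ties are resolved toward the leftmost index.

```python
def rmq_naive(S, i, j):
    """Return a 1-based index k in [i, j] with S[k] minimal (leftmost on ties)."""
    if not (1 <= i <= j <= len(S)):
        raise ValueError("need 1 <= i <= j <= |S|")
    # Python lists are 0-based, so S[k] on the slides is S[k - 1] here.
    return min(range(i, j + 1), key=lambda k: S[k - 1])


S = [3, 4, 0, 1, 4, 1, 9, 3, 2]
print(rmq_naive(S, 2, 6))  # 3
print(rmq_naive(S, 4, 9))  # 4 (index 6 holds the same minimum)
```

Each call takes O(j – i + 1) time, which is the cost the data structures in this lecture are designed to avoid.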

- Input: a sequence S of numbers
- Output: a data structure D for S
- Time complexity
  - Constant query time
    - Each query 小(S, i, j) for S can be answered from D and S in O(1) time.
  - Linear preprocessing time
    - D can be computed in O(|S|) time.

- Storing the answer of 小(S, i, j) in a table for all index pairs i and j with 1 ≤ i ≤ j ≤ |S|.
- Query time = O(1).
- Preprocessing time = Ω(|S|^2).

- Assumption (without loss of generality)
  - |S| = 2^k for some positive integer k.

- Idea:
  - Precomputing the values of 小(S, i, j) only for those indices i and j with j – i + 1 = 1, 2, 4, 8, …, 2^k = |S|.

- Preprocessing time
  - O(|S| log |S|).

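Let k be the (unique) integer that satisfies 2^k ≤ j – i + 1 < 2^(k+1).

Then, 小(S, i, j) is

x = 小(S, i, i + 2^k – 1) or

y = 小(S, j – 2^k + 1, j),

whichever of S[x] and S[y] is smaller. Both x and y are precomputed, because each of the two ranges has length exactly 2^k.

[Figure: the two length-2^k windows [i, i + 2^k – 1] and [j – 2^k + 1, j] overlap and together cover [i, j].]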
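The power-of-two idea and the two-window query can be sketched as follows. This is an illustrative sketch, not the lecture's code: the names `build_sparse_table` and `rmq` are assumptions, the table stores 0-based indices internally, and the sketch does not need |S| to be a power of two.

```python
def build_sparse_table(S):
    """table[t][i] = 0-based index of the leftmost minimum of S[i : i + 2**t]."""
    n = len(S)
    table = [list(range(n))]               # windows of length 1
    t = 1
    while (1 << t) <= n:
        half = 1 << (t - 1)
        prev = table[-1]
        row = []
        for i in range(n - (1 << t) + 1):
            # A window of length 2**t is two adjacent windows of length 2**(t-1).
            a, b = prev[i], prev[i + half]
            row.append(a if S[a] <= S[b] else b)
        table.append(row)
        t += 1
    return table


def rmq(S, table, i, j):
    """Answer 小(S, i, j) for 1-based inclusive indices i <= j."""
    lo, hi = i - 1, j - 1
    k = (hi - lo + 1).bit_length() - 1     # 2**k <= length < 2**(k+1)
    x = table[k][lo]                       # window starting at i
    y = table[k][hi - (1 << k) + 1]        # window ending at j
    return (x if S[x] <= S[y] else y) + 1


S = [3, 4, 0, 1, 4, 1, 9, 3, 2]
table = build_sparse_table(S)
print(rmq(S, table, 2, 6))  # 3
```

The table has O(log |S|) rows of at most |S| entries each, which matches the O(|S| log |S|) preprocessing bound.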

- RMQ
  - Input: O(n) numbers
  - Preprocessing: O(n log n) time
  - Query: O(1) time

- RMQ
  - Input: O(n/log n) numbers
  - Preprocessing: O(n) time, since (n/log n) · log n = n
  - Query: O(1) time

Part I

The RMQ Challenge for ±sequences

- S is a ±sequence if S[i] – S[i – 1] = ±1 for each index i with 2 ≤ i ≤ |S|.
- For example,
  - S = 5 6 5 4 3 2 3 2 3 4 5 6 5 6 7
    signs: + - - - - + - + + + + - + +
  - S = 3 4 3 2 1 0 -1 -2 -1 0 1 2 1
    signs: + - - - - - - + + + + -

- Input: a ±sequence S of numbers
- Output: a data structure D for S
- Time complexity
  - Constant query time
    - Each query 小(S, i, j) for S can be answered from D and S in O(1) time.
  - Linear preprocessing time
    - D can be computed in O(|S|) time under the unit-cost RAM model.

- Unit-cost RAM model: operations such as addition, subtraction, and comparison on O(log n) consecutive bits can be performed in O(1) time.

- Breaking S into blocks of length L = ½ log |S|. (Any constant c < 1 in place of ½ is OK.)
  - There are B = 2|S|/log |S| blocks.

- Let 縮[t] be the minimum of the t-th block of S. (縮 means "contracted".)
  - 縮[t] = min {S[j] | (t – 1)L < j ≤ tL} for t = 1, 2, …, B.
  - Computable in O(|S|) time.

- RMQ on 縮: 小(縮, x, y)
  - O(1) query time.
  - O(|S|) preprocessing time. (Why?)

Suppose S[i] is in the α-th block of S, i.e., (α–1)L < i ≤ αL.

Suppose S[j] is in the γ-th block of S, i.e., (γ–1)L < j ≤ γL.

Let β = 小(縮, α+1, γ–1).

If α < γ, then 小(S, i, j) is one of

小(S, i, αL)

小(S, (γ–1)L + 1, j)

小(S, (β–1)L + 1, βL)

(the third candidate exists only when α + 1 ≤ γ – 1). Note that each of these three is a query within a length-L block.

[Figure: S partitioned into blocks, with i in block α, j in block γ, and β the block strictly between them whose entry in 縮 is smallest.]
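The three-candidate decomposition can be sketched as below. This is a sketch under stated assumptions rather than the lecture's implementation: `block_rmq` is an illustrative name, and both the in-block queries and the query on 縮 are answered by plain scans, which are stand-ins for the O(1) structures the lecture builds.

```python
def block_rmq(S, i, j, L):
    """小(S, i, j), 1-based, via the three-candidate block decomposition."""

    def scan(lo, hi):
        # Stand-in for an O(1) in-block query: leftmost minimum of S[lo..hi].
        return min(range(lo, hi + 1), key=lambda k: S[k - 1])

    alpha = (i + L - 1) // L        # (alpha - 1) * L < i <= alpha * L
    gamma = (j + L - 1) // L        # (gamma - 1) * L < j <= gamma * L
    if alpha == gamma:
        return scan(i, j)           # a single in-block query

    candidates = [scan(i, alpha * L), scan((gamma - 1) * L + 1, j)]
    if alpha + 1 <= gamma - 1:
        def block_min(t):
            # The value 縮[t]: minimum of the t-th block.
            return min(S[(t - 1) * L : t * L])
        # Stand-in for 小(縮, alpha + 1, gamma - 1).
        beta = min(range(alpha + 1, gamma), key=block_min)
        candidates.append(scan((beta - 1) * L + 1, beta * L))
    # Smallest value wins; ties go to the leftmost index.
    return min(candidates, key=lambda k: (S[k - 1], k))


S = [5, 6, 5, 4, 3, 2, 3, 2, 3, 4, 5, 6, 5, 6, 7]
print(block_rmq(S, 2, 14, 3))  # 6
```

The decomposition itself does not rely on S being a ±sequence; that property is only needed to make the in-block queries fast.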

- It remains to show how, with the help of some linear-time preprocessing, to answer 小(S, i, j) in O(1) time for any indices i and j such that (t–1)L < i ≤ j ≤ tL for some positive integer t.

- The difference sequence 差 of S is defined as follows: 差[i] = S[i+1] – S[i]. (差 means "difference".)
- Since S is a ±sequence, each 差[i] = ±1.
- 小(S, i, j) can be determined from 差[i…j–1].
- The number of distinct patterns of a length-L difference sequence is exactly 2^L = |S|^(1/2).

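A lookup table indexed by difference pattern and by in-block index pair:

- #rows = |S|^(1/2) (one row per difference pattern).
- #cols = ¼ log^2 |S| (one column per index pair within a block).
- Each entry is computable in O(log |S|) time.
- The whole table takes O(|S|^(1/2) log^3 |S|) = o(|S|) time.
- Answering each 小(S, i, j) takes O(1) time.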
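A small sketch of the pattern-indexed table follows. The names `build_inblock_table` and `block_pattern` and the bit encoding are assumptions made for illustration. The sketch encodes the L – 1 differences inside a length-L block, so it has 2^(L–1) rows, and it fills each row incrementally instead of spending O(log |S|) per entry.

```python
def build_inblock_table(L):
    """table[pattern][a][b] = 0-based offset of the leftmost minimum among
    offsets a..b of any length-L ±block whose differences encode `pattern`.
    Bit d of `pattern` is 1 if the difference at offset d is +1, else 0."""
    table = []
    for pattern in range(1 << (L - 1)):
        # Rebuild the block relative to a starting value of 0.
        rel = [0]
        for d in range(L - 1):
            rel.append(rel[-1] + (1 if (pattern >> d) & 1 else -1))
        rows = []
        for a in range(L):
            best, row = a, [None] * L      # entries with b < a stay None
            for b in range(a, L):
                if rel[b] < rel[best]:
                    best = b
                row[b] = best
            rows.append(row)
        table.append(rows)
    return table


def block_pattern(S, start, L):
    """Encode the L - 1 differences of the block S[start : start + L] (0-based)."""
    pattern = 0
    for d in range(L - 1):
        if S[start + d + 1] - S[start + d] == 1:
            pattern |= 1 << d
    return pattern


S = [5, 6, 5, 4, 3, 2, 3, 2, 3, 4, 5, 6, 5, 6, 7]
L = 3
table = build_inblock_table(L)
p = block_pattern(S, 3, L)        # second block: 4 3 2
print(table[p][0][2])             # 2
```

In a full structure the pattern of every block would be computed once during preprocessing, so that a query only performs table lookups.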

LCA: Lowest Common Ancestor

An application of RMQ for ±sequences


- T is a rooted tree.
- 祖(x, y) is the lowest (i.e., deepest) node of T that is an ancestor of both node x and node y. (祖 means "ancestor".)

[Figure: example tree with root 1. Node 1 has children 2 and 7, node 2 has children 3 and 4, and node 4 has children 5 and 6. 祖(5, 7) = 1 and 祖(3, 6) = 2.]

- Input: an n-node rooted tree T.
- Output: a data structure D for T.
- Requirement:
  - D can be computed in O(n) time.
  - Each query 祖(x, y) for T can be answered from D in O(1) time.

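V lists the nodes visited by a depth-first traversal of T (an Euler tour), and L lists their levels:

index:  1  2  3  4  5  6  7  8  9 10 11 12 13
V    =  1  2  3  2  4  5  4  6  4  2  1  7  1
L    =  1  2  3  2  3  4  3  4  3  2  1  2  1

If V[i] = x and V[j] = y with i ≤ j, then 祖(x, y) = V[小(L, i, j)].

Note that L is a ±sequence.

[Figure: the example tree with nodes 1 through 7.]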

node x: 1 2 3 4 5 6 7
I[x]  = 1 2 3 5 6 8 12

祖(x, y) = V[小(L, I[x], I[y])], swapping x and y if I[x] > I[y].

O(n)-time preprocessing:

- Computing V and L.
- Preprocessing L for queries 小(L, i, j).
- Precomputing an array I such that V[I[x]] = x for each node x.
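The preprocessing and the query can be sketched as follows. This is an illustrative sketch: the names `euler_tour` and `lca`, the child-list dictionary, and the 0-based storage of V, L, and I are assumptions, and the range minimum on L is found by a linear scan standing in for the O(1) ±RMQ structure. The example tree is the one on the slides.

```python
def euler_tour(children, root):
    """Return (V, L, I): visited nodes, their levels, and first-visit indices.
    All three use 0-based positions; add 1 to compare with the slides."""
    V, L, I = [root], [1], {root: 0}
    stack = [(root, 1, iter(children.get(root, [])))]
    while stack:
        node, level, it = stack[-1]
        child = next(it, None)
        if child is None:
            stack.pop()
            if stack:                       # step back up to the parent
                V.append(stack[-1][0])
                L.append(stack[-1][1])
        else:
            I[child] = len(V)               # first time this child is visited
            V.append(child)
            L.append(level + 1)
            stack.append((child, level + 1, iter(children.get(child, []))))
    return V, L, I


def lca(V, L, I, x, y):
    """祖(x, y) = V[小(L, I[x], I[y])]."""
    i, j = sorted((I[x], I[y]))
    # Stand-in for an O(1) ±RMQ on L: a linear scan.
    k = min(range(i, j + 1), key=lambda t: L[t])
    return V[k]


children = {1: [2, 7], 2: [3, 4], 4: [5, 6]}
V, L, I = euler_tour(children, 1)
print(V)                      # [1, 2, 3, 2, 4, 5, 4, 6, 4, 2, 1, 7, 1]
print(lca(V, L, I, 5, 7))     # 1
print(lca(V, L, I, 3, 6))     # 2
```

The traversal appends one entry per edge in each direction, so V and L have length 2n – 1 and are built in O(n) time.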

Query time is clearly O(1).


Examples, using V = 1 2 3 2 4 5 4 6 4 2 1 7 1, L = 1 2 3 2 3 4 3 4 3 2 1 2 1, and I = 1, 2, 3, 5, 6, 8, 12:

- 祖(5, 7) = V[小(L, I[5], I[7])] = V[小(L, 6, 12)] = V[11] = 1.
- 祖(3, 6) = V[小(L, I[3], I[6])] = V[小(L, 3, 8)] = V[4] = 2.

LCE: Longest Common Extension

An application of LCA queries 祖(i, j).


- Suppose A and B are two strings.
- Let 延(i, j) be the largest number d + 1 such that A[i…i+d] = B[j…j+d], i.e., the length of the longest common prefix of A[i…] and B[j…]; 延(i, j) = 0 if A[i] ≠ B[j]. (延 means "extension".)
- Example
  - A = a b a b b a
  - B = b b a a b b b
  - 延(1,1) = 0, 延(2,1) = 1,
  - 延(2,2) = 2, 延(3,4) = 3.
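The definition and the four example values can be checked with a direct character-by-character scan. This is only a sketch of the definition (the name `lce_naive` is an assumption); it takes time proportional to the answer, not O(1).

```python
def lce_naive(A, B, i, j):
    """延(i, j): length of the longest common prefix of A[i…] and B[j…] (1-based)."""
    d = 0
    while (i - 1 + d < len(A) and j - 1 + d < len(B)
           and A[i - 1 + d] == B[j - 1 + d]):
        d += 1
    return d


A, B = "ababba", "bbaabbb"
print(lce_naive(A, B, 1, 1))  # 0
print(lce_naive(A, B, 2, 1))  # 1
print(lce_naive(A, B, 2, 2))  # 2
print(lce_naive(A, B, 3, 4))  # 3
```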

- Input: two strings A and B.
- Objective: output a data structure D for A and B in O(|A|+|B|) time such that each query 延(i, j) can be answered from D in O(1) time.


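In the suffix tree of A#B$:

x is the i-th leaf, i.e., the leaf of the suffix that starts at A[i].

y is the (j+|A|+1)-st leaf, i.e., the leaf of the suffix that starts at B[j].

The depth of 祖(x, y), measured in characters along the path from the root, is exactly 延(i, j).

[Figure: the string A#B$ with an A-suffix starting at x and a B-suffix starting at y; their leaves meet at 祖(x, y) in the suffix tree.]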

Wildcard Matching

An application of longest common extension 延(i, j)


- Input: two strings P and S, where P has k wildcard characters ‘?’, each of which matches any character of S.
- Output: all occurrences of P in S.

- Suppose S has t distinct characters.
- Naïve algorithm:
  - Construct the suffix tree of S;
  - For each of the t^k possible instantiations of P do
    - Output the occurrences of P in S;
- Time complexity = Ω(|S| + t^k |P|).

Suppose j_1 < j_2 < … < j_k are the indices such that

P[j_1] = P[j_2] = … = P[j_k] = ‘?’.

Here 延(a, b) compares S[a…] with P[b…]. P matches S[i…i+|P|–1] if and only if

延(i, 1) ≥ j_1 – 1;

延(i + j_1, j_1 + 1) ≥ j_2 – j_1 – 1;

延(i + j_2, j_2 + 1) ≥ j_3 – j_2 – 1;

…

延(i + j_(k–1), j_(k–1) + 1) ≥ j_k – j_(k–1) – 1; and

延(i + j_k, j_k + 1) ≥ |P| – j_k.

[Figure: P aligned with S at position i; the wildcard positions j_1, j_2, …, j_k split P[1…|P|] into k + 1 wildcard-free segments.]

- O(|P|+|S|) = O(|S|) time: preprocessing for supporting each 延(i, j) query in O(1) time.
- O(|S|) iterations, each taking O(k) time.
- Total: O(k|S|) time.
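The segment-by-segment test can be sketched as below. The names `lce` and `wildcard_occurrences` are illustrative assumptions, and `lce` is a plain scan standing in for the O(1) suffix-tree-based 延 query, so this sketch shows the logic rather than the O(k|S|) running time.

```python
def lce(S, P, a, b):
    """延(a, b) between S[a…] and P[b…], 1-based; a plain scan as a stand-in."""
    d = 0
    while (a - 1 + d < len(S) and b - 1 + d < len(P)
           and S[a - 1 + d] == P[b - 1 + d]):
        d += 1
    return d


def wildcard_occurrences(P, S, wildcard="?"):
    """Return all 1-based i such that P matches S[i…i+|P|-1]."""
    js = [j for j, c in enumerate(P, start=1) if c == wildcard]
    # Wildcard-free segments of P: P[start .. end - 1] for each pair below.
    starts = [1] + [j + 1 for j in js]
    ends = js + [len(P) + 1]
    hits = []
    for i in range(1, len(S) - len(P) + 2):
        # One 延 query per segment: k + 1 queries per alignment.
        if all(lce(S, P, i + s - 1, s) >= e - s for s, e in zip(starts, ends)):
            hits.append(i)
    return hits


print(wildcard_occurrences("a?b", "aabacbab"))  # [1, 4]
```

Each segment's required length `e - s` equals the right-hand side of the corresponding inequality, for example j_1 – 1 for the first segment and |P| – j_k for the last.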

Fuzzy Matching

Another application of longest common extension 延(i, j).


- Input: an integer k and two strings P and S
- Output: all “fuzzy occurrences” of P in S, where each “fuzzy occurrence” allows at most k mismatched characters.

- Whether P occurs in S[i…i+|P|–1] with k or fewer errors can be determined as follows, where j counts the characters of P handled so far and 延(a, b) compares S[a…] with P[b…]:
  - j = 延(i, 1); error = 0;
  - while (j < |P|)
    - if (++error > k) then return “no”;
    - j += 1 + 延(i + j + 1, j + 2);
  - return “yes”.

- O(|P|+|S|) = O(|S|) time: preprocessing for supporting each 延(i, j) query in O(1) time.
- O(|S|) iterations, each taking O(k) time.
- Total: O(k|S|) time.
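The mismatch-skipping loop translates directly into code. This is a sketch under assumptions: the function names are illustrative, and `lce` is a plain scan standing in for the O(1) 延 query, so only the control flow mirrors the lecture.

```python
def lce(S, P, a, b):
    """延(a, b) between S[a…] and P[b…], 1-based; a plain scan as a stand-in."""
    d = 0
    while (a - 1 + d < len(S) and b - 1 + d < len(P)
           and S[a - 1 + d] == P[b - 1 + d]):
        d += 1
    return d


def occurs_with_mismatches(P, S, i, k):
    """True if P matches S[i…i+|P|-1] with at most k mismatched characters."""
    j = lce(S, P, i, 1)            # characters of P handled so far
    errors = 0
    while j < len(P):
        errors += 1                # P[j+1] and S[i+j] differ
        if errors > k:
            return False
        # Skip the mismatch, then extend the match as far as possible.
        j += 1 + lce(S, P, i + j + 1, j + 2)
    return True


def fuzzy_occurrences(P, S, k):
    return [i for i in range(1, len(S) - len(P) + 2)
            if occurs_with_mismatches(P, S, i, k)]


print(fuzzy_occurrences("abc", "abdabcaxc", 1))  # [1, 4, 7]
```

The loop body runs at most k + 1 times per alignment, which is where the O(k) cost per iteration comes from.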

- Amazing


Part II: The RMQ (i.e., 小(S, i, j)) challenge for general sequences

Another application of lowest common ancestor


- Input: a sequence S of numbers
- Output: a data structure D for S
- Time complexity
  - Constant query time
    - Each query 小(S, i, j) for S can be answered from D and S in O(1) time.
  - Linear preprocessing time
    - D can be computed in O(|S|) time.

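index: 1 2 3 4 5 6 7 8 9
S    = 4 3 2 4 1 7 3 6 3

- 小(S, i, j) = 祖(i, j), where 祖 is taken in the minima tree of S: the root is the index of the minimum of S, and its left and right subtrees are the minima trees of the parts of S to its left and right.

[Figure: the minima tree of S. The root is 5; node 5 has children 3 and 7; node 3 has children 2 and 4; node 7 has children 6 and 9; node 2 has child 1; node 9 has child 8.]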
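The identity 小(S, i, j) = 祖(i, j) can be checked on the example with a small sketch. The code below builds the minima tree straight from its recursive definition and finds ancestors by climbing parent pointers; both are deliberately simple stand-ins (the names `minima_tree` and `lca` are assumptions), not the linear-time construction or the O(1) query.

```python
def minima_tree(S):
    """Return parent[1..n] for the minima tree of S (the root's parent is 0)."""
    parent = [0] * (len(S) + 1)

    def build(lo, hi, par):                  # 1-based inclusive range
        if lo > hi:
            return
        # The leftmost minimum of S[lo..hi] becomes the root of this subtree.
        k = min(range(lo, hi + 1), key=lambda t: S[t - 1])
        parent[k] = par
        build(lo, k - 1, k)
        build(k + 1, hi, k)

    build(1, len(S), 0)
    return parent


def lca(parent, x, y):
    """Naive 祖(x, y): collect the ancestors of x, then climb from y."""
    ancestors = set()
    while x:
        ancestors.add(x)
        x = parent[x]
    while y not in ancestors:
        y = parent[y]
    return y


S = [4, 3, 2, 4, 1, 7, 3, 6, 3]
parent = minima_tree(S)
print(parent[1:])          # [2, 3, 5, 3, 0, 7, 5, 9, 7]
print(lca(parent, 2, 4))   # 3, and S[3] = 2 is the minimum of S[2..4]
```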

- Problem 1: Grow the suffix tree for “b a a b c b b a b c”. Draw the intermediate tree with growing point and suffix links for each step of Ukkonen’s algorithm, as we did in class. (You may turn in a ppt file with animation for this problem.)
- Problem 2: Show how to construct a minima tree for any sequence S of numbers in O(|S|) time.
- Due
  - 100%: 11:59pm, Nov 4, 2003
  - 50%: 1:10pm, Nov 11, 2003

- Don’t turn in code for homework unless you are explicitly asked to do so.
- Extra-credit implementations have to be demo-able on the Web, so plain code does not count.

Listing source strings that contain a pattern string [Muthukrishnan, SODA’02]

An application of RMQ for general sequences


- Input:
  - Strings S_1, S_2, …, S_m, which can be preprocessed in linear time.
  - A string P.
- Output:
  - The index j of each S_j that contains P.

- Obtaining the suffix tree for S_1#S_2#…#S_m$.
  - Find all occurrences of P.
    - I.e., exact string matching for S_1#S_2#…#S_m$ and P.
    - Time = O(|P| + total number of occurrences of P).

- Obtaining the suffix tree for each S_i.
  - Determining whether P occurs in S_i.
    - I.e., substring problem for each pair S_i and P.
    - Time = O(|P|m).

- Input:
  - Strings S_1, S_2, …, S_m, which can be preprocessed in linear time.
  - A string P.
- Output:
  - The index j of each S_j that contains P.
- Objective
  - O(|P| + 現(P)) time, where 現(P) is the number of output indices.

Constructing the suffix tree for S_1#S_2#…#S_m$.

Keeping the distinct descendant leaf colors for each internal node, where the color of a leaf is the index j of the string S_j in which its suffix starts.

- Query time? Each query takes O(|P| + 現(P)) time. (Why?)
- Preprocessing time? The preprocessing may need Ω(m |S_1#S_2#…#S_m$|) time. (Why?)

Q: Any suggestions for resolving this problem?

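Keeping the list 彩 of leaf colors from left to right. (彩 means "colors".)

Each internal node keeps the indices of its leftmost and rightmost descendant leaves.

[Figure: example suffix tree with leaves 1 through 8. The internal nodes are labeled with their leaf ranges (1,8), (1,4), (5,8), (2,4), (6,8), (3,4), and (6,7), and the array 彩 below the tree lists the leaf colors at positions 1 through 8.]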

- Input: a sequence 彩 of colors.
- Output: a data structure D for 彩 such that
  - D is computable in O(|彩|) time.
  - Each 顏(i, j) = {彩[i], …, 彩[j]} query can be answered from D in O(|顏(i, j)|) time.

- Let 前[i] = 0 if 彩[j] ≠ 彩[i] for all j < i.
- Otherwise, let 前[i] be the largest index j with j < i such that 彩[i] = 彩[j]. (前 means "previous".)

index: 1 2 3 4 5 6 7 8
前[i] = 0 0 0 2 3 1 5 6

[Figure: the array 彩 drawn as eight colored cells, with the array 前 below it.]

- A color c is in 顏(i, j) if and only if there is an index k in [i, j] such that
  - 彩[k] = c and 前[k] < i.

- Just call the recursive subroutine 破(i, j, i). (破 means "split"; 左界 means "left bound".)
- Subroutine 破(p, q, 左界):
  - If (p > q) then return;
  - Let k = 小(前, p, q);
  - If (前[k] ≥ 左界) then return;
  - Output 彩[k];
  - Call 破(p, k – 1, 左界);
  - Call 破(k + 1, q, 左界);
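The recursion can be sketched as follows. Several things here are assumptions for illustration: the names `build_prev`, `list_colors`, and `split`; the color sequence, which is hypothetical but chosen to be consistent with the 前 array shown on the slides; and the linear scan that stands in for the O(1) query 小(前, p, q). The test compares 前[k], the value at the minimizing index, against the left bound.

```python
def build_prev(colors):
    """Return 前 as a list with a dummy entry at index 0, so prev[i] is 1-based."""
    prev = [0] * (len(colors) + 1)
    last_seen = {}
    for i, c in enumerate(colors, start=1):
        prev[i] = last_seen.get(c, 0)       # 0 if the color has not appeared yet
        last_seen[c] = i
    return prev


def list_colors(colors, prev, i, j):
    """Return the distinct colors among 彩[i..j] (1-based, inclusive)."""
    out = []

    def split(p, q, left_bound):            # the subroutine 破
        if p > q:
            return
        # Stand-in for an O(1) RMQ on 前: a linear scan.
        k = min(range(p, q + 1), key=lambda t: prev[t])
        if prev[k] >= left_bound:
            return                          # no first occurrence left in [p, q]
        out.append(colors[k - 1])
        split(p, k - 1, left_bound)
        split(k + 1, q, left_bound)

    split(i, j, i)
    return out


# Hypothetical colors, consistent with 前 = 0 0 0 2 3 1 5 6 on the slides.
colors = ["a", "b", "c", "b", "c", "a", "c", "a"]
prev = build_prev(colors)
print(prev[1:])                             # [0, 0, 0, 2, 3, 1, 5, 6]
print(sorted(list_colors(colors, prev, 4, 7)))  # ['a', 'b', 'c']
```

Every call either outputs a new color or returns immediately, so the number of calls is proportional to the number of colors reported.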


- Why?
