
# 生物資訊相關演算法 Algorithms in Bioinformatics


### 生物資訊相關演算法 Algorithms in Bioinformatics

http://www.iis.sinica.edu.tw/~hil/

Algorithms in Bioinformatics, Lecture 6

Today: 如虎添翼 (adding wings to a tiger)
• A fundamental query that significantly strengthens suffix trees
• Range Minima Query (RMQ)
• 前翼 (front wing): RMQ for ±sequences.
• 後翼 (rear wing): RMQ for general sequences.
• Intermission: 小巨’s magic show
• “Amazing!”

[Figure: roadmap of today's topics. ±RMQ supports LCA; LCA supports LCE and general RMQ; LCE supports wildcard matching and fuzzy matching; general RMQ supports document listing.]

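RMQ: Range Minima Query
• S: a sequence of numbers.
• 小(S, i, j) is an index k such that i ≤ k ≤ j and S[k] = min(S[i], S[i+1], …, S[j]).
• Example:
• index = 1 2 3 4 5 6 7 8 9
• S = 3 4 0 1 4 1 9 3 2
• Here 小(S, 1, 9) = 3, because S[3] = 0 is the smallest entry.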

The RMQ challenge
• Input: a sequence S of numbers
• Output: a data structure D for S
• Time complexity
• Constant query time
• Each query 小(S, i, j) for S can be answered from D and S in O(1) time.
• Linear preprocessing time
• D can be computed in O(|S|) time.

Naïve approach
• Storing the answer of 小(S, i, j) in a table for all index pairs i and j with 1 ≤ i ≤ j ≤ |S|.
• Query time = O(1).
• Preprocessing time = $\Omega(|S|^2)$.

Faster Preprocessing
• Assumption (without loss of generality)
• |S| = $2^k$ for some positive integer k.
• Idea:
• Precomputing the values of 小(S, i, j) only for those indices i and j with j – i + 1 = 1, 2, 4, 8, …, $2^k$ = |S|.
• Preprocessing time
• O(|S| log |S|).

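For a query, let k now denote the largest integer with $2^k$ ≤ j – i + 1. Then, 小(S, i, j) is either

x = 小(S, i, i + $2^k$ – 1) or

y = 小(S, j – $2^k$ + 1, j),

whichever of S[x] and S[y] is smaller. Both values were precomputed, so the query takes O(1) time.

[Figure: the two ranges [i, i + $2^k$ – 1] and [j – $2^k$ + 1, j] overlap and together cover [i, j].]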
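The doubling idea can be sketched in Python as follows. This is a minimal illustration rather than the lecture's own code: the names `build_sparse_table` and `rmq` are made up here, `rmq` takes 1-based indices as on the slides, and ties are broken toward the leftmost minimum (an assumption, since the slides do not fix a tie rule). The sketch does not need |S| to be a power of two.

```python
def build_sparse_table(S):
    """table[p][i] = 0-based index of the leftmost minimum of S[i .. i + 2**p - 1]."""
    n = len(S)
    table = [list(range(n))]          # ranges of length 1
    p = 1
    while (1 << p) <= n:
        half = 1 << (p - 1)
        prev = table[-1]
        row = []
        for i in range(n - (1 << p) + 1):
            a, b = prev[i], prev[i + half]   # minima of the two halves
            row.append(a if S[a] <= S[b] else b)
        table.append(row)
        p += 1
    return table


def rmq(S, table, i, j):
    """Index (1-based) of a minimum of S[i..j], with 1 <= i <= j <= len(S)."""
    i -= 1
    j -= 1
    p = (j - i + 1).bit_length() - 1         # largest p with 2**p <= j - i + 1
    a = table[p][i]                          # covers [i, i + 2**p - 1]
    b = table[p][j - (1 << p) + 1]           # covers [j - 2**p + 1, j]
    return (a if S[a] <= S[b] else b) + 1


S = [3, 4, 0, 1, 4, 1, 9, 3, 2]
table = build_sparse_table(S)
print(rmq(S, table, 1, 9))
```

The table has O(log |S|) rows of at most |S| entries each, which matches the O(|S| log |S|) preprocessing bound stated above.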

As a result
• RMQ
• Input: O(n) numbers
• Preprocessing: O(n log n) time
• Query: O(1) time
• RMQ
• Input: O(n/log n) numbers
• Preprocessing: O(n) time
• Query: O(1) time

### 前翼 (Front Wing)

The RMQ Challenge for ±sequences

±sequences
• S is a ±sequence if S[i] – S[i – 1] = ±1 for each index i with 2 ≤ i ≤ |S|.
• For example,
• S = 5 6 5 4 3 2 3 2 3 4 5 6 5 6 7
• differences: + - - - - + - + + + + - + +
• S = 3 4 3 2 1 0 -1 -2 -1 0 1 2 1
• differences: + - - - - - - + + + + -

• Input: a ±sequence S of numbers
• Output: a data structure D for S
• Time complexity
• Constant query time
• Each query 小(S, i, j) for S can be answered from D and S in O(1) time.
• Linear preprocessing time
• D can be computed in O(|S|) time under the unit-cost RAM model.

Unit-Cost RAM model
• Operations such as addition, subtraction, and comparison on O(log n) consecutive bits can be performed in O(1) time.

Idea: compression
• Breaking S into blocks of length L = ½ log |S|. (Any constant c < 1 in place of ½ is OK.)
• There are B = 2|S|/log |S| blocks.
• Let 縮[t] be the minimum of the t-th block of S.
• 縮[t] = min {S[j] | (t – 1)L < j ≤ tL} for t = 1, 2, …, B.
• Computable in O(|S|) time.
• RMQ on 縮: 小(縮, x, y)
• O(1) query time.
• O(|S|) preprocessing time. (Why?)

• Suppose S[i] is in the α-th block of S, i.e., (α – 1)L < i ≤ αL.
• Suppose S[j] is in the γ-th block of S, i.e., (γ – 1)L < j ≤ γL.
• Let β = 小(縮, α + 1, γ – 1).
• Then 小(S, i, j) is the best of three candidates: 小(S, i, αL), 小(S, (β – 1)L + 1, βL), and 小(S, (γ – 1)L + 1, j).
• Note that each of these three is a query within a length-L block.
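The three-candidate decomposition can be sketched as below. This is a hedged illustration, not the lecture's implementation: `build_blocks`, `scan_min`, and `rmq_blocks` are hypothetical names, indices are 0-based, and the block length is passed in explicitly so that small examples have non-trivial blocks. Two parts are deliberately simplified and are therefore not O(1): in-block queries use a plain scan (standing in for the table lookup developed next), and the middle blocks are scanned directly (standing in for the RMQ structure on 縮).

```python
import math


def build_blocks(S, L=None):
    """Split a non-empty list S into blocks of length L and record, for each
    block, the index in S of its leftmost minimum (the role played by 縮)."""
    n = len(S)
    if L is None:
        L = max(1, int(math.log2(n) / 2))    # L = (1/2) log |S|, as on the slide
    block_min = []
    for start in range(0, n, L):
        end = min(start + L, n)
        block_min.append(min(range(start, end), key=lambda t: S[t]))
    return L, block_min


def scan_min(S, lo, hi):
    """Stand-in for an O(1) in-block lookup: leftmost minimum index of S[lo..hi]."""
    return min(range(lo, hi + 1), key=lambda t: S[t])


def rmq_blocks(S, L, block_min, i, j):
    """Leftmost index of a minimum of S[i..j], 0-based and inclusive."""
    alpha, gamma = i // L, j // L
    if alpha == gamma:                        # i and j share a block
        return scan_min(S, i, j)
    candidates = [scan_min(S, i, (alpha + 1) * L - 1)]   # tail of block alpha
    if alpha + 1 <= gamma - 1:
        # Stand-in for beta = 小(縮, alpha + 1, gamma - 1).
        beta = min(range(alpha + 1, gamma), key=lambda t: S[block_min[t]])
        candidates.append(block_min[beta])
    candidates.append(scan_min(S, gamma * L, j))         # head of block gamma
    return min(candidates, key=lambda t: S[t])
```

Because the candidates are listed from left to right and `min` returns the first smallest element, the result is the leftmost minimum of the whole range.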

Illustration

[Figure: S divided into blocks, with i in block α, j in block γ, and β a block between them.]

• It remains to show how to answer 小(S, i, j) in O(1) time, with the help of some linear-time preprocessing, for any indices i and j that lie in the same block, i.e., (t – 1)L < i ≤ j ≤ tL for some positive integer t.

Difference sequence
• The difference sequence 差 of S is defined as follows: 差[i] = S[i+1] – S[i].
• Since S is a ±sequence, each 差[i] = ±1.
• 小(S, i, j) can be determined from 差[i…j–1].
• The number of distinct patterns of a length-L difference sequence is exactly $2^L = |S|^{1/2}$.

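Preprocessing all patterns
• Build a lookup table with one row per difference pattern and one column per index pair within a block.
• #row = $|S|^{1/2}$
• #col = $\frac{1}{4}\log^2 |S|$
• Each entry is computable in O(log |S|) time.
• The whole table takes o(|S|) time.
• Answering each 小(S, i, j) takes O(1) time.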
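A small sketch of such a pattern table follows. The encoding is an assumption made for illustration: a block of length L is described by its L – 1 internal differences, packed into an integer whose bit t is 1 when the step from offset t to offset t + 1 is +1. The names `build_pattern_table` and `encode_block` are hypothetical, and offsets inside a block are 0-based.

```python
def build_pattern_table(L):
    """table[pattern][a][b] = offset of the leftmost minimum among offsets a..b
    of any ±block whose L - 1 differences are encoded by `pattern`."""
    table = []
    for pattern in range(1 << (L - 1)):
        # Rebuild the block relative to its first element, taken as 0.
        heights = [0]
        for t in range(L - 1):
            step = 1 if (pattern >> t) & 1 else -1
            heights.append(heights[-1] + step)
        rows = []
        for a in range(L):
            row = [None] * L              # entries with b < a stay unused
            best = a
            for b in range(a, L):
                if heights[b] < heights[best]:
                    best = b
                row[b] = best
            rows.append(row)
        table.append(rows)
    return table


def encode_block(block):
    """Bitmask of a ±block: bit t is 1 when block[t + 1] - block[t] == +1."""
    pattern = 0
    for t in range(len(block) - 1):
        if block[t + 1] - block[t] == 1:
            pattern |= 1 << t
    return pattern


table = build_pattern_table(4)
pattern = encode_block([5, 6, 5, 4])
print(table[pattern][0][3])   # offset of the minimum of the whole block
```

Only the shape of the block matters, not its absolute values, which is why all blocks with the same pattern can share one row group.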

### LCA: Lowest Common Ancestor

An application of RMQ for ±sequences


Lowest Common Ancestor
• T is a rooted tree.
• 祖(x, y) is the lowest (i.e., deepest) node of T that is an ancestor of both node x and node y.

For example, …

[Figure: a rooted tree T on nodes 1 to 7.]

The challenge for 祖(x, y)
• Input: an n-node rooted tree T.
• Output: a data structure D for T.
• Requirement:
• D can be computed in O(n) time.
• Each query 祖(x, y) for T can be answered from D in O(1) time.

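Idea: depth-first traversal
• V lists the nodes in the order a depth-first traversal visits them (a node is listed again each time the traversal returns to it), and L[i] is the depth of V[i], with the root at depth 1.
• index = 1 2 3 4 5 6 7 8 9 10 11 12 13
• V = 1 2 3 2 4 5 4 6 4 2 1 7 1
• L = 1 2 3 2 3 4 3 4 3 2 1 2 1
• If V[i] = x and V[j] = y, then 祖(x, y) = V[小(L, i, j)].
• L is a ±sequence, so the RMQ structure for ±sequences applies.

[Figure: the example tree: root 1 with children 2 and 7; node 2 with children 3 and 4; node 4 with children 5 and 6.]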

O(n)-time Preprocessing
• Computing V and L.
• Preprocessing L for queries 小(L, i, j).
• Precomputing an array I such that V[I[x]] = x for each node x.
• x = 1 2 3 4 5 6 7
• I = 1 2 3 5 6 8 12

Query
• To answer 祖(x, y), let i = I[x] and j = I[y] (swapping them if i > j) and return V[小(L, i, j)].
• Query time is clearly O(1).
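The traversal and the query can be sketched as follows. This is an illustrative sketch under stated assumptions: the tree is given as a hypothetical `children` dictionary, `euler_tour` and `lca` are made-up names, the arrays carry a dummy entry at index 0 so that positions are 1-based as on the slides, and the range minimum is found by a plain scan, standing in for the O(1) ±RMQ structure.

```python
def euler_tour(children, root):
    """Return (V, L, I): visit order, depths, and first-visit positions (1-based)."""
    V, L, I = [None], [None], {}

    def visit(node, depth):
        I.setdefault(node, len(V))        # position of the first visit
        V.append(node)
        L.append(depth)
        for child in children.get(node, []):
            visit(child, depth + 1)
            V.append(node)                # back at `node` after the subtree
            L.append(depth)

    visit(root, 1)
    return V, L, I


def lca(V, L, I, x, y):
    """祖(x, y) = V[小(L, I[x], I[y])], with a linear scan in place of 小."""
    i, j = sorted((I[x], I[y]))
    k = min(range(i, j + 1), key=lambda t: L[t])
    return V[k]


# The example tree from the slides.
children = {1: [2, 7], 2: [3, 4], 4: [5, 6]}
V, L, I = euler_tour(children, 1)
print(V[1:])
print(L[1:])
print(lca(V, L, I, 5, 6), lca(V, L, I, 3, 6), lca(V, L, I, 5, 7))
```

The recursion is fine for small examples; a very deep tree would need an explicit stack in Python.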

Example

[Figure: the example tree together with its arrays V, L, and I.]

### LCE: Longest Common Extension

An application of LCA queries 祖(i, j).


Longest Common Extension
• Suppose A and B are two strings.
• Let 延(i, j) be the largest number d + 1 such that A[i…i+d] = B[j…j+d], or 0 if A[i] ≠ B[j].
• Example
• A = a b a b b a
• B = b b a a b b b
• 延(1,1) = 0, 延(2,1) = 1,
• 延(2,2) = 2, 延(3,4) = 3.
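As a quick check of the definition, here is a naive reference version of 延 in Python. It is only an executable restatement of the definition, scanning character by character; it is not the O(1)-query structure that the challenge asks for. The function name `lce` is chosen here for illustration, and positions are 1-based as on the slide.

```python
def lce(A, B, i, j):
    """延(i, j): length of the longest common prefix of A[i..] and B[j..] (1-based)."""
    d = 0
    while i + d <= len(A) and j + d <= len(B) and A[i + d - 1] == B[j + d - 1]:
        d += 1
    return d


A = "ababba"
B = "bbaabbb"
print(lce(A, B, 1, 1), lce(A, B, 2, 1), lce(A, B, 2, 2), lce(A, B, 3, 4))
```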


The challenge for 延(i, j)
• Input: two strings A and B.
• Objective: output a data structure D for A and B in O(|A|+|B|) time such that each query 延(i, j) can be answered from D in O(1) time.

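Idea: Suffix Tree for A#B\$
• x is the i-th leaf, i.e., the leaf for the suffix of A#B\$ that starts at A[i] (an A-suffix).
• y is the (j+|A|+1)-st leaf, i.e., the leaf for the suffix that starts at B[j] (a B-suffix).
• The string depth of 祖(x, y), i.e., the length of the path label from the root to 祖(x, y), is exactly 延(i, j).

[Figure: the suffix tree of A#B\$ with an A-suffix leaf x and a B-suffix leaf y.]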

### Wildcard Matching

An application of longest common extension 延(i, j)


Wildcard Matching
• Input: two strings P and S, where P has k wildcard characters ‘?’, each of which matches any single character of S.
• Output: all occurrences of P in S.

Naïve algorithm
• Suppose S has t distinct characters.
• Naïve algorithm:
• Construct the suffix tree of S;
• For each of the $t^k$ possibilities of P do
• Output the occurrences of P in S;
• Time complexity = $\Omega(|S| + t^k|P|)$.

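Wildcard Matching via longest common extension
• Suppose j1 < j2 < … < jk are the indices such that P[j1] = P[j2] = … = P[jk] = ‘?’.
• P matches S[i…i+|P|–1] if and only if every wildcard-free segment of P, namely P[1…j1–1], P[j1+1…j2–1], …, P[jk+1…|P|], equals the aligned segment of S.
• Each segment is checked with one 延 query, so position i needs at most k + 1 queries.

[Figure: P aligned with S at position i, with the wildcard positions j1, j2, …, jk marked along P.]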

O(k|S|) time
• O(|P|+|S|) = O(|S|) time: preprocessing for supporting each 延(i, j) query in O(1) time.
• O(|S|) iterations, each takes time O(k).
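The per-position check can be sketched as follows. The sketch makes several assumptions: `wildcard_occurrences` is a hypothetical name, positions are 1-based, and `lce` is a naive character-by-character stand-in for 延, so this version does not reach the O(k|S|) bound; with an O(1) 延 structure in its place, each position would cost O(k).

```python
def lce(A, B, i, j):
    """Naive stand-in for 延(i, j) between A and B (1-based positions)."""
    d = 0
    while i + d <= len(A) and j + d <= len(B) and A[i + d - 1] == B[j + d - 1]:
        d += 1
    return d


def wildcard_occurrences(P, S, wildcard="?"):
    """All 1-based positions i such that P matches S[i .. i + |P| - 1]."""
    m, n = len(P), len(S)
    occurrences = []
    for i in range(1, n - m + 2):
        p = 1                                  # next position of P to verify
        while p <= m:
            p += lce(S, P, i + p - 1, p)       # jump over a matching stretch
            if p <= m:
                if P[p - 1] != wildcard:
                    break                      # genuine mismatch
                p += 1                         # wildcard: skip one character
        else:
            occurrences.append(i)              # loop ended without a mismatch
    return occurrences


print(wildcard_occurrences("a?b", "aabacb"))
```

Each pass through the loop either stops at a mismatch or consumes one wildcard, so a position triggers at most k + 1 calls to `lce`.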


### Fuzzy Matching

Another application of longest common extension 延(i, j).

Fuzzy Matching
• Input: an integer k and two strings P and S
• Output: all “fuzzy occurrences” of P in S, where each “fuzzy occurrence” allows at most k mismatched characters.

Fuzzy occurrences
• Whether P occurs in S[i…i+|P|-1] with k or fewer errors can be determined by the following procedure, where 延 compares S (first argument) with P (second argument) and j counts the positions of P handled so far:
• j = 延(i, 1); error = 0;
• while (j < |P|)
• if (++error > k) then return “no”;
• j += 1 + 延(i + j + 1, j + 2);
• return “yes”.
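A direct Python transcription of this procedure is given below as a sketch. The names `fuzzy_occurrences` and `lce` are illustrative, positions are 1-based, and `lce` is again a naive stand-in for 延 that returns 0 when either position runs past the end of its string, so the running time here is not the O(k|S|) of the slides.

```python
def lce(A, B, i, j):
    """Naive stand-in for 延(i, j) between A and B (1-based positions)."""
    d = 0
    while i + d <= len(A) and j + d <= len(B) and A[i + d - 1] == B[j + d - 1]:
        d += 1
    return d


def fuzzy_occurrences(P, S, k):
    """All 1-based positions i where P matches S[i .. i + |P| - 1]
    with at most k mismatched characters."""
    m, n = len(P), len(S)
    result = []
    for i in range(1, n - m + 2):
        j = lce(S, P, i, 1)            # positions of P handled so far
        errors = 0
        ok = True
        while j < m:
            errors += 1                # P[j + 1] mismatches S[i + j]
            if errors > k:
                ok = False
                break
            # Skip the mismatch and extend from the next pair of positions.
            j += 1 + lce(S, P, i + j + 1, j + 2)
        if ok:
            result.append(i)
    return result


print(fuzzy_occurrences("abc", "abdabc", 1))
```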


O(k|S|) time
• O(|P|+|S|) = O(|S|) time: preprocessing for supporting each 延(i, j) query in O(1) time.
• O(|S|) iterations, each takes time O(k).


• Amazing


### 後翼 (Rear Wing): The RMQ (i.e., 小(S, i, j)) challenge for general sequences

Another application of lowest common ancestor


The RMQ challenge
• Input: a sequence S of numbers
• Output: a data structure D for S
• Time complexity
• Constant query time
• Each query 小(S, i, j) for S can be answered from D and S in O(1) time.
• Linear preprocessing time
• D can be computed in O(|S|) time.

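Idea: Minima Tree
• The root is the position of the minimum of S; its left and right subtrees are the minima trees of the parts of S to its left and to its right.
• index = 1 2 3 4 5 6 7 8 9
• S = 4 3 2 4 1 7 3 6 3
• 小(S, i, j) = 祖(i, j).

[Figure: the minima tree of S: root 5 with children 3 and 7; node 3 with children 2 and 4; node 2 with child 1; node 7 with children 6 and 9; node 9 with child 8.]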
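The identity 小(S, i, j) = 祖(i, j) can be checked on the example with the sketch below. It builds the tree by direct recursion on the definition, which takes quadratic time in the worst case and is therefore not the linear-time construction; the lowest common ancestor is found by walking parent pointers rather than by the O(1) structure. The names `minima_tree` and `lca_by_parents` are made up for illustration, and ties are broken toward the leftmost minimum, which is an assumption consistent with the example figure.

```python
def minima_tree(S):
    """parent[i] for 1-based positions i of S; the root has parent 0."""
    n = len(S)
    parent = [0] * (n + 1)

    def build(lo, hi, above):
        if lo > hi:
            return
        k = min(range(lo, hi + 1), key=lambda t: S[t - 1])   # leftmost minimum
        parent[k] = above
        build(lo, k - 1, k)
        build(k + 1, hi, k)

    build(1, n, 0)
    return parent


def lca_by_parents(parent, x, y):
    """Lowest common ancestor found by walking up the parent pointers."""
    ancestors = set()
    while x:
        ancestors.add(x)
        x = parent[x]
    while y not in ancestors:
        y = parent[y]
    return y


S = [4, 3, 2, 4, 1, 7, 3, 6, 3]
parent = minima_tree(S)
print(parent[1:])
print(lca_by_parents(parent, 6, 9), lca_by_parents(parent, 1, 4))
```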

Homework 3
• Problem 1: Grow the suffix tree for “b a a b c b b a b c”. Draw the intermediate tree with growing point and suffix links for each step of Ukkonen’s algorithm, as we did in the class. (You may turn in a ppt file with animation for this problem. )
• Problem 2: Show how to construct a minima tree for any sequence S of numbers in O(|S|) time.
• Due
• 100%: 11:59pm, Nov 4, 2003
• 50%: 1:10pm, Nov 11, 2003

• Don’t turn in code for homework unless you are explicitly asked to do so.
• As for extra-credit implementation, it has to be demonstrable on the Web, so plain code does not count.


### Listing source strings that contain a pattern string [Muthukrishnan, SODA’02]

An application of RMQ for general sequences


The problem
• Input:
• Strings S1, S2, …, Sm, which can be preprocessed in linear time.
• A string P.
• Output:
• The index j of each Sj that contains P.

Preliminary attempts
• Obtaining the suffix tree for S1#S2#…#Sm\$.
• Find all occurrences of P.
• I.e., exact string matching for S1#S2#…#Sm\$ and P.
• Time = O(|P| + total number of occurrences of P).
• Obtaining the suffix tree for each Si.
• Determining whether P occurs in Si.
• I.e., substring problem for each pair Si and P.
• Time = O(|P|m).


The challenge
• Input:
• Strings S1, S2, …, Sm, which can be preprocessed in linear time.
• A string P.
• Output:
• The index j of each Sj that contains P.
• Objective
• O(|P| + 現(P)) time, where 現(P) is the number of output indices.

The second attempt
• Constructing the suffix tree for S1#S2#…#Sm\$.
• Keeping the distinct descendant leaf colors for each internal node, where the color of a leaf is the index j of the string Sj in which its suffix starts.
• Query time?
• Preprocessing time?

The second attempt
• Each query takes O(|P|+現(P)) time. (Why?)
• The preprocessing may need Ω(m |S1#S2#…#Sm\$|) time. (Why?)
• Q: Any suggestions for resolving this problem?

Compact Representation
• Keeping the list 彩 of leaf colors from left to right.
• Each internal node keeps the indices of its leftmost and rightmost descendant leaves.

[Figure: a suffix tree whose leaves are numbered 1 to 8 from left to right; the internal nodes are labeled with their leaf ranges 1,8; 1,4; 5,8; 2,4; 6,8; 3,4; and 6,7.]

The challenge of listing distinct colors
• Input: a sequence 彩 of colors.
• Output: a data structure D for 彩 such that
• D is computable in O(|彩|) time.
• Each 顏(i, j) = {彩[i], …, 彩[j]} query can be answered from D in O(|顏(i, j)|) time.

An auxiliary index array
• Let 前[i] = 0 if 彩[j] ≠ 彩[i] for all j < i.
• Otherwise, let 前[i] be the largest index j with j < i such that 彩[i] = 彩[j].
• Example:
• i = 1 2 3 4 5 6 7 8
• 前[i] = 0 0 0 2 3 1 5 6

An observation
• A color c is in 顏(i, j) if and only if there is an index k in [i, j] such that
• 彩[k] = c and 前[k] < i.

The algorithm 解(i, j)
• Just call the recursive subroutine 破(i, j, i);
• Subroutine 破(p, q, 左界), where 左界 is the left bound i of the original query:
• If (p > q) then return;
• Let k = 小(前, p, q);
• If (前[k] ≥ 左界) then return;
• Output 彩[k];
• Call 破(p, k – 1, 左界);
• Call 破(k + 1, q, 左界);
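The recursion can be sketched in Python as follows. Several things here are assumptions made for illustration: the names `build_prev`, `list_colors`, and `split` (for 破); the color labels a, b, c, which are hypothetical but chosen so that 前 comes out as 0 0 0 2 3 1 5 6, matching the example; and the range minimum on 前, which is computed by a plain scan instead of the O(1) RMQ structure, so the running time of this sketch is not O(|顏(i, j)|).

```python
def build_prev(colors):
    """前[i] for 1-based i: the previous position with the same color, or 0."""
    prev = [0] * (len(colors) + 1)        # index 0 is unused
    last_seen = {}
    for i, c in enumerate(colors, start=1):
        prev[i] = last_seen.get(c, 0)
        last_seen[c] = i
    return prev


def list_colors(colors, prev, i, j):
    """顏(i, j): the distinct colors among positions i..j (1-based), each once."""
    out = []

    def split(p, q, left_bound):          # 破(p, q, 左界)
        if p > q:
            return
        k = min(range(p, q + 1), key=lambda t: prev[t])   # stand-in for 小(前, p, q)
        if prev[k] >= left_bound:
            return                        # every color in p..q was already seen in i..p-1
        out.append(colors[k - 1])
        split(p, k - 1, left_bound)
        split(k + 1, q, left_bound)

    split(i, j, i)
    return out


colors = list("abcbcaca")
prev = build_prev(colors)
print(prev[1:])
print(list_colors(colors, prev, 4, 7))
```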


Time = O(|顏(i, j)|)
• Why?
