# Algorithms in Bioinformatics

### Algorithms in Bioinformatics

http://www.iis.sinica.edu.tw/~hil/

Algorithms in Bioinformatics, Lecture 6

Today: adding wings to a tiger

• A fundamental query that significantly strengthens the suffix tree

• Range Minima Query (RMQ)

• Front wing: RMQ for ±sequences.

• Rear wing: RMQ for general sequences.

• Intermission: 小巨’s magic show

• “Amazing!”

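[Figure: roadmap of today's topics. +/-RMQ supports LCA; LCA supports LCE and RMQ for general sequences; LCE supports wildcard matching and fuzzy matching; RMQ supports document listing.]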

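RMQ: Range Minima Query

• S: a sequence of numbers.

• 小(S, i, j) (小 means “minimum”) is an index k such that i ≤ k ≤ j, and S[k] = min(S[i], S[i+1], …, S[j]).

• Example:

index = 1 2 3 4 5 6 7 8 9

S = 3 4 0 1 4 1 9 3 2

• Here 小(S, 1, 9) = 3, because S[3] = 0 is the minimum.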
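The definition of 小(S, i, j) can be checked directly with a brute-force scan. The sketch below is only a reference for the definition, not the O(1) structure the lecture builds; the function name `rmq_naive` and the leftmost tie-breaking rule are assumptions made for illustration.

```python
def rmq_naive(S, i, j):
    """Return a 1-based index k with i <= k <= j that minimizes S[k].

    Ties are broken toward the leftmost index (an assumption; the
    definition allows any minimizing index).
    """
    # S is a Python list, so position k on the slide is S[k - 1] here.
    return min(range(i, j + 1), key=lambda k: S[k - 1])


S = [3, 4, 0, 1, 4, 1, 9, 3, 2]
print(rmq_naive(S, 1, 9))
print(rmq_naive(S, 4, 8))
```

Each call scans the whole range, so a query costs O(j – i + 1) time.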

• Input: a sequence S of numbers

• Output: a data structure D for S

• Time complexity

• Constant query time

• Each query 小(S, i, j) for S can be answered from D and S in O(1) time.

• Linear preprocessing time

• D can be computed in O(|S|) time.


• Storing the answer of 小(S, i, j) in a table for all index pairs i and j with 1 ≤ i ≤ j ≤ |S|.

• Query time = O(1).

• Preprocessing time = Ω(|S|²).

• Assumption (without loss of generality)

• |S| = 2^k for some positive integer k.

• Idea:

• Precomputing the values of 小(S, i, j) only for those indices i and j with j – i + 1 = 1, 2, 4, 8, …, 2^k = |S|.

• Preprocessing time

• O(|S| log |S|).

Let k be the (unique) integer that satisfies 2^k ≤ j – i + 1 < 2^(k+1).

Then, 小(S, i, j) is

x = 小(S, i, i + 2^k – 1) or

y = 小(S, j – 2^k + 1, j),

whichever of S[x] and S[y] is smaller.

[Figure: the windows [i, i + 2^k – 1] and [j – 2^k + 1, j] overlap and together cover [i, j].]
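The power-of-two idea can be sketched as a table of window minima plus a two-window query. This is a minimal sketch under stated assumptions: the names `build_sparse_table` and `query` are hypothetical, indices inside the table are 0-based while the query interface is 1-based as on the slides, and |S| does not need to be a power of two here.

```python
def build_sparse_table(S):
    """table[k][i] = 0-based index of a minimum of S[i : i + 2**k]."""
    n = len(S)
    table = [list(range(n))]                 # windows of length 1
    k = 1
    while (1 << k) <= n:
        prev, half = table[-1], 1 << (k - 1)
        row = []
        for i in range(n - (1 << k) + 1):
            a, b = prev[i], prev[i + half]   # minima of the two half windows
            row.append(a if S[a] <= S[b] else b)
        table.append(row)
        k += 1
    return table


def query(S, table, i, j):
    """Answer the range-minimum query for 1-based indices i <= j."""
    i0, j0 = i - 1, j - 1
    k = (j0 - i0 + 1).bit_length() - 1       # 2**k <= length < 2**(k+1)
    x = table[k][i0]                         # window starting at i
    y = table[k][j0 - (1 << k) + 1]          # window ending at j
    return (x if S[x] <= S[y] else y) + 1


S = [3, 4, 0, 1, 4, 1, 9, 3, 2]
table = build_sparse_table(S)
print(query(S, table, 1, 9))
```

The table has O(log |S|) rows of at most |S| entries each, which matches the O(|S| log |S|) preprocessing bound; each query reads two entries.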

• RMQ

• Input: O(n) numbers

• Preprocessing: O(n log n) time

• Query: O(1) time

• RMQ on a shorter input

• Input: O(n/log n) numbers

• Preprocessing: O((n/log n) · log n) = O(n) time

• Query: O(1) time

### Front wing

The RMQ Challenge for ±sequences

• S is a ±sequence if S[i] – S[i – 1] = ±1 for each index i with 2 ≤ i ≤ |S|.

• For example,

• S = 5 6 5 4 3 2 3 2 3 4 5 6 5 6 7, with differences + - - - - + - + + + + - + +

• S = 3 4 3 2 1 0 -1 -2 -1 0 1 2 1, with differences + - - - - - - + + + + -
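The ±sequence condition is easy to test mechanically. The helper names below (`is_pm_sequence`, `signs`) are hypothetical and only illustrate the definition on the first example sequence.

```python
def is_pm_sequence(S):
    """True if every pair of consecutive elements differs by exactly 1."""
    return all(abs(S[i] - S[i - 1]) == 1 for i in range(1, len(S)))


def signs(S):
    """The +/- pattern of consecutive differences, as a string."""
    return "".join("+" if S[i] > S[i - 1] else "-" for i in range(1, len(S)))


S = [5, 6, 5, 4, 3, 2, 3, 2, 3, 4, 5, 6, 5, 6, 7]
print(is_pm_sequence(S), signs(S))
```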

• Input: a ±sequence S of numbers

• Output: a data structure D for S

• Time complexity

• Constant query time

• Each query 小(S, i, j) for S can be answered from D and S in O(1) time.

• Linear preprocessing time

• D can be computed in O(|S|) time under the unit-cost RAM model.


• Unit-cost RAM model: operations such as addition, subtraction, and comparison on O(log n) consecutive bits can be performed in O(1) time.

• Breaking S into blocks of length L = ½ log |S|. (Any constant c < 1 in place of ½ is OK.)

• There are B = 2|S|/log |S| blocks.

• Let 縮[t] (縮 means “shrunk”) be the minimum of the t-th block of S.

• 縮[t] = min {S[j] | (t – 1)L < j ≤ tL} for t = 1, 2, …, B.

• Computable in O(|S|) time.

• RMQ on 縮: 小(縮, x, y)

• O(1) query time.

• O(|S|) preprocessing time. (Why?)

Suppose S[i] is in the α-th block of S.

(α – 1)L < i ≤ αL.

Suppose S[j] is in the γ-th block of S.

(γ – 1)L < j ≤ γL.

β = 小(縮, α+1, γ–1).

Then 小(S, i, j) is the best of 小(S, i, αL), 小(S, (β – 1)L + 1, βL), and 小(S, (γ – 1)L + 1, j).

Note that each of these three is a query within a length-L block.

[Figure: S divided into length-L blocks, with i in block α, j in block γ, and block β between them.]

小(S, i, j) within a block

• It remains to show how to answer 小(S, i, j) in O(1) time for any indices i and j such that (t – 1)L < i ≤ j ≤ tL for some positive integer t, with the help of some linear-time preprocessing.

• The difference sequence 差 (差 means “difference”) of S is defined as follows: 差[i] = S[i+1] – S[i].

• Since S is a ±sequence, each 差[i] = ±1.

• 小(S, i, j) can be determined from 差[i…j].

• The number of distinct patterns of a length-L difference sequence is exactly 2^L = |S|^(1/2).

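Preprocessing all patterns

• Build a lookup table with one row per difference pattern and one column per index pair within a block.

• #row = |S|^(1/2)

• #col = ¼ log² |S|

• Each entry is computable in O(log |S|) time.

• Total preprocessing: o(|S|) time.

• Answering each 小(S, i, j) takes O(1) time.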
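The block decomposition, the pattern lookup table, and the three-way query can be combined into one sketch. This is an illustration under assumptions rather than the lecture's exact construction: `build_pm_rmq` is a hypothetical name, logarithms are taken base 2, each block's differences are encoded as a bitmask, a power-of-two table over the block minima stands in for the RMQ on 縮, and the table entries are filled by a direct scan.

```python
import math


def build_pm_rmq(S):
    """Preprocess a +-1 sequence S; return a query function on 1-based indices."""
    n = len(S)
    L = max(1, int(math.log2(n)) // 2) if n > 1 else 1   # block length, about (1/2) log2 n
    B = (n + L - 1) // L                                 # number of blocks

    # in_block[p][a][b]: offset of the leftmost minimum among offsets a..b
    # of a block whose L - 1 differences are encoded by bitmask p.
    in_block = []
    for p in range(1 << (L - 1)):
        h = [0]
        for d in range(L - 1):
            h.append(h[-1] + (1 if (p >> d) & 1 else -1))
        rows = []
        for a in range(L):
            best, row = a, [0] * L
            for b in range(a, L):
                if h[b] < h[best]:
                    best = b
                row[b] = best
            rows.append(row)
        in_block.append(rows)

    # Difference pattern of each block (a short last block leaves high bits at 0;
    # those offsets are never queried).
    pattern = []
    for t in range(B):
        p = 0
        for d in range(L - 1):
            k = t * L + d
            if k + 1 < n and S[k + 1] > S[k]:
                p |= 1 << d
        pattern.append(p)

    def block_query(t, a, b):
        """0-based global index of the minimum of block t, offsets a..b."""
        return t * L + in_block[pattern[t]][a][b]

    # Power-of-two table over the block minima (the sequence called 縮 on the slides).
    sparse = [[block_query(t, 0, min(L, n - t * L) - 1) for t in range(B)]]
    k = 1
    while (1 << k) <= B:
        prev, half = sparse[-1], 1 << (k - 1)
        sparse.append([min(prev[t], prev[t + half], key=lambda x: S[x])
                       for t in range(B - (1 << k) + 1)])
        k += 1

    def query(i, j):
        i0, j0 = i - 1, j - 1
        bi, bj = i0 // L, j0 // L
        if bi == bj:                                     # both ends in one block
            return block_query(bi, i0 % L, j0 % L) + 1
        cands = [block_query(bi, i0 % L, L - 1)]         # tail of the first block
        if bj - bi > 1:                                  # whole blocks in between
            lo, hi = bi + 1, bj - 1
            w = (hi - lo + 1).bit_length() - 1
            cands += [sparse[w][lo], sparse[w][hi - (1 << w) + 1]]
        cands.append(block_query(bj, 0, j0 % L))         # head of the last block
        return min(cands, key=lambda x: S[x]) + 1

    return query


S = [5, 6, 5, 4, 3, 2, 3, 2, 3, 4, 5, 6, 5, 6, 7]
query = build_pm_rmq(S)
print(query(1, 15))
```

For very short inputs the block length degenerates to 1 and the power-of-two table does all the work; the block table only matters once log |S| is large enough.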

### LCA: Lowest Common Ancestor

An application of RMQ for ±sequences


• T is a rooted tree.

• 祖(x, y) (祖 means “ancestor”) is the lowest (i.e., deepest) node of T that is an ancestor of both node x and node y.

[Figure: an example rooted tree on nodes 1 to 7.]

The challenge for 祖(x, y)

• Input: an n-node rooted tree T.

• Output: a data structure D for T.

• Requirement:

• D can be computed in O(n) time.

• Each query 祖(x, y) for T can be answered from D in O(1) time.


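Idea: depth-first traversal

• V = 1 2 3 2 4 5 4 6 4 2 1 7 1 (the nodes in the order the traversal visits them)

• L = 1 2 3 2 3 4 3 4 3 2 1 2 1 (the level of each visited node)

• If V[i] = x and V[j] = y, then 祖(x, y) = V[小(L, i, j)].

[Figure: the example tree on nodes 1 to 7.]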

O(n)-time Preprocessing

• Computing V and L by a depth-first traversal.

• Preprocessing L for queries 小(L, i, j). Consecutive levels differ by exactly 1, so L is a ±sequence.

• Precomputing an array I such that V[I[x]] = x for each node x.

• For V = 1 2 3 2 4 5 4 6 4 2 1 7 1 and x = 1, 2, …, 7: I = 1, 2, 3, 5, 6, 8, 12.

• Query time is clearly O(1): 祖(x, y) = V[小(L, I[x], I[y])], assuming I[x] ≤ I[y].

Example

• V = 1 2 3 2 4 5 4 6 4 2 1 7 1

• L = 1 2 3 2 3 4 3 4 3 2 1 2 1

• I = 1, 2, 3, 5, 6, 8, 12

• 祖(3, 6): I[3] = 3 and I[6] = 8; the minimum of L[3…8] = 3 2 3 4 3 4 is at index 4, so 祖(3, 6) = V[4] = 2.
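The depth-first reduction from LCA to RMQ can be sketched as follows. The tree shape (root 1 with children 2 and 7, node 2 with children 3 and 4, node 4 with children 5 and 6) is inferred from the sequence V and is an assumption, as are the names `euler_tour` and `lca`. The range-minimum step uses a plain scan as a stand-in for the O(1) ±RMQ structure.

```python
def euler_tour(children, root):
    """Return V (visited nodes), L (their levels) and I (1-based first visit of each node)."""
    V, L, I = [], [], {}

    def visit(x, depth):
        I.setdefault(x, len(V) + 1)        # 1-based position of the first visit
        V.append(x)
        L.append(depth)
        for c in children.get(x, []):
            visit(c, depth + 1)
            V.append(x)                    # back at x after finishing child c
            L.append(depth)

    visit(root, 1)
    return V, L, I


def lca(V, L, I, x, y):
    """Lowest common ancestor via a range minimum over the level sequence."""
    i, j = sorted((I[x], I[y]))
    # Naive scan; the lecture replaces this with an O(1) +-RMQ query.
    k = min(range(i, j + 1), key=lambda t: L[t - 1])
    return V[k - 1]


children = {1: [2, 7], 2: [3, 4], 4: [5, 6]}   # assumed shape of the example tree
V, L, I = euler_tour(children, 1)
print(V)
print(L)
print(lca(V, L, I, 3, 6))
```

The recursion is fine for a small example; a deep tree would need an explicit stack to avoid Python's recursion limit.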

### LCE: Longest Common Extension

An application of LCA queries 祖(i, j).


• Suppose A and B are two strings.

• Let 延(i, j) (延 means “extension”) be the largest number d + 1 such that A[i…i+d] = B[j…j+d], or 0 if A[i] ≠ B[j].

• Example

• A = a b a b b a

• B = b b a a b b b

• 延(1,1) = 0, 延(2,1) = 1,

• 延(2,2) = 2, 延(3,4) = 3.

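The definition of 延(i, j) can be checked against the example with a direct character-by-character scan. The name `lce_naive` is hypothetical, and the scan takes time proportional to the answer rather than the O(1) the lecture aims for.

```python
def lce_naive(A, B, i, j):
    """Length of the longest common prefix of A[i...] and B[j...] (1-based i, j)."""
    d = 0
    while i + d <= len(A) and j + d <= len(B) and A[i + d - 1] == B[j + d - 1]:
        d += 1
    return d


A = "ababba"
B = "bbaabbb"
print(lce_naive(A, B, 1, 1), lce_naive(A, B, 2, 1),
      lce_naive(A, B, 2, 2), lce_naive(A, B, 3, 4))
```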

The challenge for 延(i, j)

• Input: two strings A and B.

• Objective: output a data structure D for A and B in O(|A|+|B|) time such that each query 延(i, j) can be answered from D in O(1) time.


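Idea: Suffix Tree for A#B\$

• x is the i-th leaf, i.e., the leaf of the suffix starting at A[i].

• y is the (j+|A|+1)-st leaf, i.e., the leaf of the suffix starting at B[j].

• The depth of 祖(x, y), measured as the length of the string spelled from the root, is exactly 延(i, j).

[Figure: the string A#B\$ with an A-suffix ending at leaf x and a B-suffix ending at leaf y.]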

### Wildcard Matching

An application of longest common extension 延(i, j)


• Input: two strings P and S,

• where P has k wildcard characters ‘?’, each of which can match any character of S.

• Output: all occurrences of P in S.

• Suppose S has t distinct characters.

• Naïve algorithm:

Construct the suffix tree of S;

For each of the t^k possibilities of P do

Output the occurrences of P in S;

• Time complexity = Ω(|S| + t^k |P|).

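Wildcard Matching via longest common extension

Suppose j_1 < j_2 < … < j_k are the indices such that

P[j_1] = P[j_2] = … = P[j_k] = ‘?’.

Let j_0 = 0 and j_(k+1) = |P| + 1, and let 延(a, b) compare S[a…] with P[b…].

P matches S[i…i+|P|–1] if and only if

延(i + j_t, j_t + 1) ≥ j_(t+1) – j_t – 1 for each t = 0, 1, …, k.

[Figure: P aligned with S at position i; the wildcards at j_1, j_2, …, j_k split P[1…|P|] into wildcard-free segments.]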

• O(|P|+|S|) = O(|S|) time: preprocessing for supporting each 延(i, j) query in O(1) time.

• O(|S|) iterations, each takes time O(k).

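The O(k) work per alignment comes from checking each wildcard-free segment of P with one longest-common-extension query. The sketch below is one way to realize that; the names `lce` and `wildcard_occurrences` are hypothetical, the naive `lce` stands in for the O(1) query, and the wildcard character is assumed not to occur in S.

```python
def lce(S, P, a, b):
    """Naive stand-in for the O(1) query: common prefix length of S[a...] and P[b...] (1-based)."""
    d = 0
    while a + d <= len(S) and b + d <= len(P) and S[a + d - 1] == P[b + d - 1]:
        d += 1
    return d


def wildcard_occurrences(P, S, wildcard="?"):
    """All 1-based positions i where P matches S[i..i+|P|-1]."""
    m = len(P)
    # cuts = [0, j_1, ..., j_k, m + 1]: consecutive entries delimit a wildcard-free segment.
    cuts = [0] + [j for j in range(1, m + 1) if P[j - 1] == wildcard] + [m + 1]
    hits = []
    for i in range(1, len(S) - m + 2):
        # One query per segment: the segment after cuts[t] has length cuts[t+1] - cuts[t] - 1.
        if all(lce(S, P, i + cuts[t], cuts[t] + 1) >= cuts[t + 1] - cuts[t] - 1
               for t in range(len(cuts) - 1)):
            hits.append(i)
    return hits


print(wildcard_occurrences("a?b", "aabacbab"))
```

With an O(1) extension query, each alignment costs k + 1 queries, which gives the O(k) bound per iteration.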

### Fuzzy Matching

Another application of longest common extension 延(i, j).


• Input: an integer k and two strings P and S

• Output: all “fuzzy occurrences” of P in S, where each “fuzzy occurrence” allows at most k mismatched characters.

• Whether P occurs in S[i…i+|P|–1] with k or fewer errors can be determined by the following, where 延(a, b) compares S[a…] with P[b…] and j counts the characters of P already covered:

• j = 延(i, 1); error = 0;

• while (j < |P|)

• if (++error > k) then return “no”;

• j += 1 + 延(i + j + 1, j + 2);

• return “yes”.
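The loop above translates almost line by line into runnable code. In this sketch the names `lce`, `occurs_with_mismatches`, and `fuzzy_occurrences` are hypothetical, the naive `lce` stands in for the O(1) query, and the explicit bounds check on the window is an added assumption.

```python
def lce(S, P, a, b):
    """Naive stand-in for the O(1) query: common prefix length of S[a...] and P[b...] (1-based)."""
    d = 0
    while a + d <= len(S) and b + d <= len(P) and S[a + d - 1] == P[b + d - 1]:
        d += 1
    return d


def occurs_with_mismatches(P, S, i, k):
    """True if P matches S[i..i+|P|-1] with at most k mismatched characters."""
    if i + len(P) - 1 > len(S):
        return False                      # the window does not fit inside S
    j = lce(S, P, i, 1)                   # j = number of characters of P covered so far
    errors = 0
    while j < len(P):
        errors += 1                       # P[j+1] mismatches S[i+j]
        if errors > k:
            return False
        # Skip the mismatch and jump over the next matching stretch.
        j += 1 + lce(S, P, i + j + 1, j + 2)
    return True


def fuzzy_occurrences(P, S, k):
    return [i for i in range(1, len(S) - len(P) + 2)
            if occurs_with_mismatches(P, S, i, k)]


print(fuzzy_occurrences("abc", "xabdabc", 1))
```

Each alignment performs at most k + 1 extension queries before answering.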

• O(|P|+|S|) = O(|S|) time: preprocessing for supporting each 延(i, j) query in O(1) time.

• O(|S|) iterations, each takes time O(k).


• Amazing


### Rear wing: The RMQ (i.e., 小(S, i, j)) challenge for general sequences

Another application of lowest common ancestor


• Input: a sequence S of numbers

• Output: a data structure D for S

• Time complexity

• Constant query time

• Each query 小(S, i, j) for S can be answered from D and S in O(1) time.

• Linear preprocessing time

• D can be computed in O(|S|) time.


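Idea: Minima Tree

• index = 1 2 3 4 5 6 7 8 9

• S = 4 3 2 4 1 7 3 6 3

• The root is the index of a minimum of S (here 5); its left and right subtrees are the minima trees of the parts of S to its left and to its right.

• 小(S, i, j) = 祖(i, j).

[Figure: the minima tree of S, with nodes 1 to 9.]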
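One standard way to build such a tree in a single left-to-right pass keeps the rightmost path on a stack. This is a sketch of that technique, not necessarily the construction intended in the course; the name `minima_tree`, the parent-array representation, and the leftmost tie-breaking are assumptions, and S is assumed to be non-empty.

```python
def minima_tree(S):
    """Return (root, parent) with 1-based node indices; parent[root] is 0."""
    n = len(S)
    parent = [0] * (n + 1)               # parent[0] is unused
    stack = []                           # indices on the rightmost path, root first
    for i in range(1, n + 1):
        last = 0
        # Pop every node whose value is larger than S[i]; they end up below i.
        while stack and S[stack[-1] - 1] > S[i - 1]:
            last = stack.pop()
        if last:
            parent[last] = i             # the popped chain becomes the left subtree of i
        if stack:
            parent[i] = stack[-1]        # i becomes the right child of the stack top
        stack.append(i)
    return stack[0], parent


S = [4, 3, 2, 4, 1, 7, 3, 6, 3]
root, parent = minima_tree(S)
print(root, parent[1:])
```

Every index is pushed once and popped at most once, so the pass takes O(|S|) time.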

• Problem 1: Grow the suffix tree for “b a a b c b b a b c”. Draw the intermediate tree with growing point and suffix links for each step of Ukkonen’s algorithm, as we did in the class. (You may turn in a ppt file with animation for this problem. )

• Problem 2: Show how to construct a minima tree for any sequence S of numbers in O(|S|) time.

• Due

• 100%: 11:59pm, Nov 4, 2003

• 50%: 1:10pm, Nov 11, 2003


• Don’t turn in code for homework unless you are explicitly asked to do so.

• As for extra-credit implementations, they have to be demo-able on the Web, so plain code does not count.

### Listing source strings that contain a pattern string [Muthukrishnan, SODA’02]

An application of RMQ for general sequences


The problem

• Input:

• Strings S1, S2, …, Sm, which can be preprocessed in linear time.

• A string P.

• Output:

• The index j of each Sj that contains P.


Preliminary attempts

• Obtaining the suffix tree for S1#S2#…#Sm\$.

• Find all occurrences of P.

• I.e., exact string matching for S1#S2#…#Sm\$ and P.

• Time = O(|P| + total number of occurrences of P).

• Obtaining the suffix tree for each Si.

• Determining whether P occurs in Si.

• I.e., substring problem for each pair Si and P.

• Time = O(|P|m).

The challenge

• Input:

• Strings S1, S2, …, Sm, which can be preprocessed in linear time.

• A string P.

• Output:

• The index j of each Sj that contains P.

• Objective

• O(|P| + 現(P)) time, where 現(P) is the number of output indices.


The second attempt

• Constructing the suffix tree for S1#S2#…#Sm\$.

• Keeping the distinct descendant leaf colors for each internal node (a leaf's color is the index of the string Sj in which its suffix starts).

• Query time?

• Preprocessing time?

The second attempt

• Each query takes O(|P| + 現(P)) time. (Why?)

• The preprocessing may need Ω(m |S1#S2#…#Sm\$|) time. (Why?)

• Q: Any suggestions for resolving this problem?

Compact Representation

• Keeping the list 彩 (彩 means “color”) of leaf colors from left to right.

• Each internal node keeps the indices of its leftmost and rightmost descendant leaves.

[Figure: an example tree with leaves numbered 1 to 8; each internal node is labeled with its leaf range, such as (1,8), (1,4), (5,8), (2,4), (6,8), (3,4), and (6,7).]

The challenge of listing distinct colors

• Input: a sequence 彩 of colors.

• Output: a data structure D for 彩 such that

• D is computable in O(|彩|) time.

• Each 顏(i, j) = {彩[i], …, 彩[j]} query can be answered from D in O(|顏(i, j)|) time.

An auxiliary index array

• Let 前[i] (前 means “previous”) = 0 if 彩[j] ≠ 彩[i] for all j < i.

• Otherwise, let 前[i] be the largest index j with j < i such that 彩[i] = 彩[j].

• Example:

i = 1 2 3 4 5 6 7 8

前[i] = 0 0 0 2 3 1 5 6

An observation

• A color c is in 顏(i, j) if and only if there is an index k in [i, j] such that

• 彩[k] = c and 前[k] < i.

The algorithm 解(i, j)

• Just call the recursive subroutine 破(i, j, i);

• Subroutine 破(p, q, 左界), where 左界 is the left boundary i of the query:

• If (p > q) then return;

• Let k = 小(前, p, q);

• If (前[k] ≥ 左界) then return;

• Output 彩[k];

• Call 破(p, k – 1, 左界);

• Call 破(k + 1, q, 左界);
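The auxiliary array and the recursive listing procedure can be sketched together. The names `build_prev`, `list_colors`, and `split` are hypothetical, and so is the color sequence, which was chosen so that its previous-occurrence array equals the example values 0 0 0 2 3 1 5 6. The range minimum is a naive scan standing in for the O(1) query.

```python
def build_prev(colors):
    """prev[i] = largest j < i with colors[j] == colors[i], or 0 if none (1-based)."""
    seen, prev = {}, [0] * (len(colors) + 1)      # prev[0] is unused
    for i, c in enumerate(colors, start=1):
        prev[i] = seen.get(c, 0)
        seen[c] = i
    return prev


def list_colors(colors, prev, i, j):
    """Each distinct color among positions i..j, reported exactly once."""
    out = []

    def split(p, q, left):
        if p > q:
            return
        # Naive scan; the lecture uses an O(1) range-minimum query on prev here.
        k = min(range(p, q + 1), key=lambda t: prev[t])
        if prev[k] >= left:
            return                                # every color in p..q was already seen in [left, p)
        out.append(colors[k - 1])                 # first occurrence of this color within [left, j]
        split(p, k - 1, left)
        split(k + 1, q, left)

    split(i, j, i)
    return out


colors = list("abcbcaca")                         # hypothetical colors matching the example array
prev = build_prev(colors)
print(prev[1:])
print(list_colors(colors, prev, 4, 7))
```

Every call to `split` either reports a new color or returns immediately, and each reported color spawns at most two further calls, which is the intuition behind the O(|顏(i, j)|) bound.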


Time = O(|顏(i, j)|)

• Why?