
# Algorithms in Bioinformatics (生物資訊相關演算法)


### Algorithms in Bioinformatics (生物資訊相關演算法)

http://www.iis.sinica.edu.tw/~hil/

• If
• If your presentation is finished m weeks ahead of the week of the final exam (m may be negative), then your presentation grade will be multiplied by (1 + mπ)(1 – ke).

Bioinformatics Algorithms

Today: from “same or different” to “how similar” (一樣不一樣 → 像不像)
• Alignment
  • The standard dynamic-programming algorithm
  • Reducing the space complexity
• Two applications
  • Longest common subsequence
  • Edit distance of two strings
• Two variants
  • End-space-free alignment
  • Local alignment


Aligning two strings
• A = attgatcctag
• B = acttagtccttcgc
• A → a-ttga-tcc-tag-
• B → actt-agtccttcgc

Each “-” in the aligned strings is a gap.


Measuring an alignment

Scoring matrix


BLAST matrix

Transition/Transversion matrix

Other scoring matrices


Scoring matrix is an art
• Log-odds matrix
  • score[i, j] = log(q(i, j) / (p(i) p(j))).
• PAM matrix
  • Point accepted mutations
• BLOSUM matrix
  • Blocks substitution matrix
  • Steven Henikoff and Jorja G. Henikoff (1992).
• Other specialized scoring matrices
  • Domenico Bordo and Patrick Argos (1991).
  • Jean-Michel Claverie (JCB 1993).
  • Lee F. Kolakowski and Kenneth A. Rice (Nature 1994).


Scoring an alignment
• a - t t g a - t c c - t a g -
• c c t t - a g t c c t t c g c
• Column scores (match +2, mismatch -2, gap -1): -2 -1 +2 +2 -1 +2 -1 +2 +2 +2 -1 +2 -2 +2 -1
• score = 7
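The column-by-column scoring above can be sketched in a few lines of Python. The function name `score_alignment` is hypothetical, and the default scores (match +2, mismatch -2, gap -1) are an assumption read off the column scores on this slide, since the scoring matrix itself did not survive in the transcript.

```python
def score_alignment(row_a, row_b, match=2, mismatch=-2, gap=-1):
    """Sum the column scores of two equal-length aligned rows ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap        # a character aligned with a gap
        elif x == y:
            total += match      # identical characters
        else:
            total += mismatch   # two different characters
    return total


print(score_alignment("a-ttga-tcc-tag-", "cctt-agtccttcgc"))  # 7
```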


String alignment problem
• Input:
  • two strings A and B; and
  • a scoring table 分 (分 means “score”).
• Output:
  • an alignment of A and B that has the maximum score with respect to 分.


Q: Any naïve methods?
• A = attgatcctag
• B = ccttagtccttcgc


### Formulating the alignment problem as a graph problem

Solving the graph problem using standard dynamic programming

Alignment graph

[Figure: alignment graph with columns labeled c c t t a g t c (from B) and rows labeled a t t g a (from A).]


Each alignment corresponds to a maximal path on the alignment graph.

The score of an alignment is the score of its corresponding maximal path.

Observations

[Figure: alignment graph with columns labeled c c t t a g t c and rows labeled a t t g a.]

c c t t - a g t c

a - t t g a - - -


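Score of edges

[Figure: the edges around one node of the alignment graph, with the row labeled A[i] and the column labeled B[j].]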


### The graph problem

Finding a maximal path with maximum score on the alignment graph (a directed acyclic graph)

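Idea

For each i = 0, 1, …, |A| and each j = 0, 1, …, |B|, let 點[i, j] (點 means “node”) keep the maximum score of aligning A[1…i] and B[1…j].

[Figure: the alignment graph as a grid, with rows indexed by i = 0, 1, …, |A| and columns by j = 0, 1, …, |B|; row i is labeled A[i] and column j is labeled B[j].]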
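A minimal sketch of the standard dynamic program over this grid follows. The names `alignment_table` and `traceback` are hypothetical, `node` plays the role of 點, and the scores (match +2, mismatch -2, gap -1) are an assumption that is consistent with the worked example in these slides. Each node takes the best of its three incoming edges; the traceback re-derives which edge was used instead of storing pointers.

```python
def alignment_table(a, b, match=2, mismatch=-2, gap=-1):
    """Fill node[i][j] = best score of aligning a[:i] with b[:j]."""
    rows, cols = len(a) + 1, len(b) + 1
    node = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        node[i][0] = node[i - 1][0] + gap
    for j in range(1, cols):
        node[0][j] = node[0][j - 1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            node[i][j] = max(node[i - 1][j - 1] + pair,  # diagonal edge
                             node[i - 1][j] + gap,       # a[i] against a gap
                             node[i][j - 1] + gap)       # b[j] against a gap
    return node


def traceback(a, b, node, match=2, mismatch=-2, gap=-1):
    """Walk back from node[|a|][|b|] and rebuild one optimal alignment."""
    i, j = len(a), len(b)
    row_a, row_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if node[i][j] == node[i - 1][j - 1] + pair:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and node[i][j] == node[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))


table = alignment_table("attga", "ccttagtc")
print(table[-1][-1])                          # -1
print(traceback("attga", "ccttagtc", table))
```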


An observation


For example

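|   |   | c | c | t | t | a | g | t | c |
|---|---|---|---|---|---|---|---|---|---|
|   | 0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 |
| a | -1 | -2 | -3 | -4 | -5 | -2 | -3 | -4 | -5 |
| t | -2 | -3 | -4 | -1 | -2 | -3 | -4 | -1 | -2 |
| t | -3 | -4 | -5 | -2 | 1 | 0 | -1 | -2 | -3 |
| g | -4 | -5 | -6 | -3 | 0 | -1 | 2 | 1 | 0 |
| a | -5 | -6 | -7 | -4 | -1 | 2 | 1 | 0 | -1 |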


Complexity
• Space = O(|A|×|B|).
  • Each node keeps a score and a pointer, and thus requires only O(1) space.
• Time = O(|A|×|B|).
  • The content of each node can be obtained from those of at most three nodes in O(1) time.


### Challenge

Reducing the space complexity

First attempt

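|   |   | c | c | t | t | a | g | t | c |
|---|---|---|---|---|---|---|---|---|---|
|   | 0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 |
| a | -1 | -2 | -3 | -4 | -5 | -2 | -3 | -4 | -5 |
| t | -2 | -3 | -4 | -1 | -2 | -3 | -4 | -1 | -2 |
| t | -3 | -4 | -5 | -2 | 1 | 0 | -1 | -2 | -3 |
| g | -4 | -5 | -6 | -3 | 0 | -1 | 2 | 1 | 0 |
| a | -5 | -6 | -7 | -4 | -1 | 2 | 1 | 0 | -1 |

What is the problem?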


### Knowing the maximum score, but …

Not knowing the corresponding alignment
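The situation can be illustrated with a score-only sketch that keeps just two rows of the table. The name `alignment_score` and the scores (match +2, mismatch -2, gap -1) are assumptions for illustration. The function returns the optimal score, but by the time it finishes the earlier rows are gone, so there is nothing left to trace back through.

```python
def alignment_score(a, b, match=2, mismatch=-2, gap=-1):
    """Optimal global alignment score, keeping only the previous and current row."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + pair, prev[j] + gap, curr[j - 1] + gap)
        prev = curr  # the older row is discarded here
    return prev[len(b)]


print(alignment_score("attga", "ccttagtc"))  # -1
```

This version stores a row of length |B| + 1; sweeping column by column instead would store a column of length |A| + 1, which is the O(|A|) bound used in the slides.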

### Q: Can we deduce an optimal alignment from the optimal score?

A key observation

[Figure: alignment graph with columns labeled c c t t a g t c and rows labeled a t t g a.]


Finding an index i …

Time = O(|A||B|)?

Space = O(|A|)?

[Figure: alignment graph with columns labeled c c t t a g t c and rows labeled a t t g a; columns 0, |B|/2, and |B| are marked.]


Trick

The following two scores are the same

[Figure: alignment graph with columns labeled c c t t a g t c and rows labeled a t t g a; columns 0, |B|/2, and |B| are marked.]


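After locating the index i

|A1| |B1| + |A2| |B2| = |A| |B| / 2, where A1 = A[1…i], A2 = A[i+1…|A|], and B1 and B2 are the two halves of B.

[Figure: alignment graph with columns labeled c c t t a g t c and rows labeled a t t g a; columns 0, |B|/2, and |B| are marked.]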


Overall complexity
• Time = O(|A||B|).
  • Why?
  • O(|A||B| + |A||B|/2 + |A||B|/4 + |A||B|/8 + …) = O(|A||B|).
• Space = O(|A|).
  • Why?
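The divide-and-conquer scheme of the preceding slides (split B at |B|/2, find the row i where an optimal path crosses that column, then recurse on the two halves) is commonly known as Hirschberg's algorithm. Below is a rough sketch under assumed scores (match +2, mismatch -2, gap -1); the names `last_column` and `align` are hypothetical. For brevity it copies substrings with slices, so its working space is linear in |A| + |B| rather than strictly O(|A|).

```python
MATCH, MISMATCH, GAP = 2, -2, -1  # assumed scores, consistent with the example table


def last_column(a, b):
    """Score of aligning a[:i] with all of b, for every i, in O(len(a)) space."""
    col = [i * GAP for i in range(len(a) + 1)]
    for ch in b:
        new = [col[0] + GAP] + [0] * len(a)
        for i in range(1, len(a) + 1):
            pair = MATCH if a[i - 1] == ch else MISMATCH
            new[i] = max(col[i - 1] + pair, col[i] + GAP, new[i - 1] + GAP)
        col = new
    return col


def align(a, b):
    """Return one optimal alignment of a and b as two gapped strings."""
    if not a:
        return "-" * len(b), b
    if not b:
        return a, "-" * len(a)
    if len(b) == 1:
        k = a.find(b)                      # a matching position, if any
        pair = MATCH if k >= 0 else MISMATCH
        if pair >= 2 * GAP:                # pairing b is at least as good as gapping it
            k = max(k, 0)
            return a, "-" * k + b + "-" * (len(a) - k - 1)
        return a + "-", "-" * len(a) + b
    mid = len(b) // 2
    left = last_column(a, b[:mid])                 # prefixes of a vs first half of b
    right = last_column(a[::-1], b[mid:][::-1])    # suffixes of a vs second half of b
    # Row i where an optimal path crosses column |B|/2.
    i = max(range(len(a) + 1), key=lambda r: left[r] + right[len(a) - r])
    a1, b1 = align(a[:i], b[:mid])
    a2, b2 = align(a[i:], b[mid:])
    return a1 + a2, b1 + b2


print(align("attga", "ccttagtc"))
```

The two recursive calls together cover half of the table area, which is where the geometric series in the time bound comes from.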


### Application 1

Longest common subsequence

Subsequence
• For any indices 1 ≤ i1 < i2 < … < ik ≤ |A|, A[i1] A[i2] A[i3] … A[ik] is a subsequence of A.
• For example, A = 0 1 1 0 1 0 1
  • 0 1 1 1, 0 0 0, and 1 0 1 0 1 are subsequences of A.
  • 0 1 0 1 1 0 is not a subsequence of A.


Longest Common Subsequence
• Input: two strings A and B
• Output: a longest string C that is a subsequence of both A and B.

Any naïve algorithm?


It’s an alignment problem…
• …with respect to the following scoring matrix:


Why?
• Each alignment with score k corresponds to a common subsequence of length k.

0 1 1 - 1 0 - - 0 1 1 -

- 1 0 1 1 0 1 0 - 1 1 0

Matched columns: 1 1 0 1 1
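The scoring matrix for this slide did not survive in the transcript. A common choice, assumed here, is +1 for a pair of identical characters and 0 for everything else (mismatches and gaps). Under that assumption the alignment recurrence reduces to the sketch below; the function name `lcs` is hypothetical.

```python
def lcs(a, b):
    """Return one longest common subsequence of a and b."""
    node = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            pair = 1 if a[i - 1] == b[j - 1] else 0      # assumed scoring
            node[i][j] = max(node[i - 1][j - 1] + pair,
                             node[i - 1][j],             # gap, score 0
                             node[i][j - 1])             # gap, score 0
    # Trace back, collecting the characters of the matched columns.
    out = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1] and node[i][j] == node[i - 1][j - 1] + 1:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif node[i][j] == node[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


# The two strings of the example alignment on this slide.
print(len(lcs("01110011", "1011010110")))  # 7
```

The alignment shown on the slide has score 5, so it is a valid but not an optimal one: the two strings share a common subsequence of length 7.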


### Application 2

Edit distance between two strings

Edit operations
• Inserting a character at position i
• Deleting a character at position i
• Replacing a character at position i by a new character


Edit distance
• The edit distance between two strings A and B is the minimum number of edit operations required to turn A into B.


The edit distance problem
• Input: two strings A and B
• Output: the edit distance of A and B.

Any naïve algorithm?


It’s an alignment problem…
• …with respect to the following scoring matrix:


Why?
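• Each alignment with score -k corresponds to k edit operations that turn A into B.

0 1 1 - 1 0 - - 0 1 1 -

- 1 0 1 1 0 1 0 - 1 1 0

Columns 1, 3, 4, 7, 8, 9, and 12 each score -1. The corresponding edit operations are d, r, i, i, i, d, i (d = delete, r = replace, i = insert), so this alignment has score -7.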
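The scoring matrix for this slide is also missing from the transcript. The sketch below assumes the usual choice: 0 for identical characters and -1 for a mismatch or a gap, so that the optimal alignment score is the negated edit distance. The function name `edit_distance` is hypothetical.

```python
def edit_distance(a, b):
    """Edit distance as the negated optimal alignment score."""
    match, mismatch, gap = 0, -1, -1          # assumed scoring matrix
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + pair,   # keep or replace
                          prev[j] + gap,        # delete a[i]
                          curr[j - 1] + gap)    # insert b[j]
        prev = curr
    return -prev[len(b)]


print(edit_distance("kitten", "sitting"))  # 3
```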


### A challenge

Speeding up the edit-distance algorithm

The challenge
• Input: two strings A and B with |A| ≤ |B|.
• Output: the edit distance k between A and B.
• Objective:
  • Time: O(k|A|).
  • Note that k is not known in advance; if it were, there would be nothing left to compute.


Observation 1
• Although we do not know k, we still know |B| – |A| ≤ k. Why?

[Figure: alignment graph with columns labeled c c t t a g t c and rows labeled a t t g a.]


Observation 2

Why?

[Figure: alignment graph with columns labeled c c t t a g t c and rows labeled a t t g a.]


Just computing 點[i, j] for 2k+1 diagonals

[Figure: alignment graph with columns labeled c c t t a g t c and rows labeled a t t g a.]


Some thoughts
• Idea:
  • It suffices to evaluate 點[i, j] for all indices i and j with |i – j| ≤ k.
  • But we don’t know k…
• Modified idea:
  • It suffices to evaluate 點[i, j] for all indices i and j with |i – j| ≤ t for some number t ≥ k.
  • Q: How to find such a t?


A key lemma
• Suppose we evaluate only those 點[i, j] with |i – j| ≤ t.
• If t ≥ k, then 點[|A|,|B|] = – k ≥ – t.
• If t < k, then 點[|A|,|B|] ≤ – k < – t.
• Therefore, we can determine whether t ≥ k by whether 點[|A|,|B|] ≥ – t after evaluating those 2t + 1 diagonals.

Why?


Algorithm
• For s = 1, 2, 4, 8, …
  • Let t = s · max(1, |B| – |A|); the max with 1 keeps t growing when |A| = |B|.
  • Evaluate those 點[i, j] with |i – j| ≤ t.
  • If 點[|A|,|B|] ≥ – t, then return – 點[|A|,|B|].
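The doubling loop can be sketched as follows. Two liberties are taken, both assumptions of this sketch: it works with costs (the negation of 點, so the test 點[|A|,|B|] ≥ –t becomes d ≤ t), and it starts from t = max(1, |B| – |A|) so that t still doubles when the two strings have equal length. The names `banded_distance` and `edit_distance` are hypothetical.

```python
def banded_distance(a, b, t):
    """Edit cost at (len(a), len(b)) when only nodes with |i - j| <= t are evaluated."""
    inf = float("inf")
    n, m = len(a), len(b)
    prev = {j: j for j in range(min(m, t) + 1)}        # row 0 inside the band
    for i in range(1, n + 1):
        curr = {}
        for j in range(max(0, i - t), min(m, i + t) + 1):
            if j == 0:
                curr[j] = i                            # i deletions
                continue
            sub = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev.get(j - 1, inf) + sub,  # keep or replace
                          prev.get(j, inf) + 1,        # delete a[i]
                          curr.get(j - 1, inf) + 1)    # insert b[j]
        prev = curr
    return prev.get(m, inf)


def edit_distance(a, b):
    """Edit distance k, evaluating O(t) nodes per row with t doubling each round."""
    if len(a) > len(b):
        a, b = b, a                                    # ensure |A| <= |B|
    t = max(1, len(b) - len(a))
    while True:
        d = banded_distance(a, b, t)
        if d <= t:                                     # key lemma: this holds iff t >= k
            return d
        t *= 2


print(edit_distance("kitten", "sitting"))  # 3
```

Each row touches at most 2t + 1 nodes, so one round costs O(t|A|), matching the per-iteration bound used in the analysis.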


Time complexity: O(k|A|)
• Each iteration takes time O(t|A|).
• The last iteration dominates the time complexity, since at the beginning of each iteration the value of t is increased by a factor of 2.
• In the last iteration, we have t < 2k. Why?
• The last iteration takes time O(k|A|).


### A variant

End-space-free alignment

End space free
• - - 0 1 1 0 - 0 1 1 0
• 1 0 0 - 1 0 0 0 1 - -
• score = 2 - 1 + 2 + 2 - 1 + 2 + 2 = 8 (the gaps at the two ends are free)


Same process of evaluating 點[i, j].

What’s new?

Output the maximum among all 點[|A|, j] and 點[i, |B|].

How?


Another approach
• Assign zero weights to all the edges on the boundaries, and then run the original global alignment algorithm.
• Great idea!
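Both descriptions lead to the same computation: the first row and first column cost nothing, and the answer is the best value found on the last row or the last column. A score-only sketch, with the hypothetical name `end_space_free_score` and assumed scores (match +2, mismatch -2, gap -1):

```python
def end_space_free_score(a, b, match=2, mismatch=-2, gap=-1):
    """Best alignment score when gaps at either end of either string are free."""
    m = len(b)
    prev = [0] * (m + 1)            # first row: leading gaps are free
    best = prev[m]                  # node[0][|B|] lies on the last column
    for i in range(1, len(a) + 1):
        curr = [0] * (m + 1)        # curr[0] = 0: first column is free too
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + pair, prev[j] + gap, curr[j - 1] + gap)
        best = max(best, curr[m])   # last column: trailing gaps in B are free
        prev = curr
    return max(best, max(prev))     # last row: trailing gaps in A are free


# A suffix of the first string overlaps a prefix of the second.
print(end_space_free_score("aaattt", "tttccc"))  # 6
```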


### Another variant

Local alignment

Input:
• two strings A and B; and
• a scoring matrix 分.

Output:
• indices i1, j1, i2, and j2 that maximize 分(A[i1, j1], B[i2, j2]), the best alignment score of the substrings A[i1…j1] and B[i2…j2].


Naïve approach
• Try all possible i1, j1, i2, and j2.
• The time complexity is terribly high.


Smarter approach
• Evaluate 點[i, j] as in global alignment, but replace any negative value by 0.
• j1 and j2 are the i and j with maximum 點[i, j].
• Traverse backward from that node to find i1 and i2.


For example

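|   |   | c | c | t | t | a | g | t | c |
|---|---|---|---|---|---|---|---|---|---|
|   | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| a | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 |
| t | 0 | 0 | 0 | 2 | 2 | 1 | 0 | 3 | 2 |
| t | 0 | 0 | 0 | 2 | 4 | 3 | 2 | 2 | 1 |
| g | 0 | 0 | 0 | 1 | 3 | 2 | 5 | 4 | 3 |
| a | 0 | 0 | 0 | 0 | 2 | 5 | 4 | 3 | 2 |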
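The example table can be reproduced with the sketch below, which fills the full table, never lets a score drop below 0, and traces back from the maximum until it reaches a 0. The function name `local_alignment` and the scores (match +2, mismatch -2, gap -1) are assumptions; they happen to reproduce every entry of the table above. Ties for the maximum are broken by taking the first one in row-major order.

```python
def local_alignment(a, b, match=2, mismatch=-2, gap=-1):
    """Return (best score, (i1, j1), (i2, j2), table) with 1-based inclusive spans."""
    n, m = len(a), len(b)
    node = [[0] * (m + 1) for _ in range(n + 1)]
    best, end = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            node[i][j] = max(0,
                             node[i - 1][j - 1] + pair,
                             node[i - 1][j] + gap,
                             node[i][j - 1] + gap)
            if node[i][j] > best:
                best, end = node[i][j], (i, j)
    # Walk backward from the maximum until the score drops to 0.
    i, j = end
    while node[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if node[i][j] == node[i - 1][j - 1] + pair:
            i, j = i - 1, j - 1
        elif node[i][j] == node[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i + 1, end[0]), (j + 1, end[1]), node


score, span_a, span_b, table = local_alignment("attga", "ccttagtc")
print(score, span_a, span_b)  # 5 (2, 4) (3, 6)
```

Here A[2…4] = ttg and B[3…6] = ttag, aligned as tt-g over ttag for a score of 2 + 2 - 1 + 2 = 5.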


Complexity
• Time and space = O(|A| |B|).
• Q: Can we reduce the required space complexity to O(|A|)?


Some thoughts
• The difficulty is that we cannot afford the space for keeping the “backward pointers” needed to retrace the path we came by.
• Do we really need them?


Time: O(|A||B|), Space: O(|A|)
• We first find j1 and j2 that maximize 點[j1, j2].
• In order to find i1 and i2, we can
  • Let A’ = the reverse string of A[1…j1].
  • Let B’ = the reverse string of B[1…j2].
  • Solve the local alignment problem for A’ and B’ and obtain k1 and k2.
  • i1 = j1 – k1 + 1.
  • i2 = j2 – k2 + 1.
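A sketch of this two-pass idea with two rows of storage follows. One detail is an assumption added here rather than something stated on the slide: the reverse pass is anchored at the first characters of A’ and B’ (no reset to 0), so that the start it recovers belongs to an alignment that really ends at (j1, j2). The names `best_end` and `local_alignment_span` are hypothetical, as are the scores (match +2, mismatch -2, gap -1). The slices copy the prefixes, which keeps the sketch short at the cost of linear extra space.

```python
def best_end(a, b, anchored, match=2, mismatch=-2, gap=-1):
    """Return (score, i, j) of the best-scoring node, keeping two rows.

    anchored=False: local alignment, scores are floored at 0.
    anchored=True:  every alignment must start at the beginning of a and b.
    """
    m = len(b)
    prev = [j * gap for j in range(m + 1)] if anchored else [0] * (m + 1)
    best = (0, 0, 0)
    for i in range(1, len(a) + 1):
        curr = [i * gap if anchored else 0] + [0] * m
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            s = max(prev[j - 1] + pair, prev[j] + gap, curr[j - 1] + gap)
            if not anchored:
                s = max(s, 0)
            curr[j] = s
            if s > best[0]:
                best = (s, i, j)
        prev = curr
    return best


def local_alignment_span(a, b):
    """Return (score, (i1, j1), (i2, j2)) using 1-based inclusive indices."""
    score, j1, j2 = best_end(a, b, anchored=False)
    if score == 0:
        return 0, None, None
    # Reverse the two prefixes and look for the best alignment anchored at their start.
    _, k1, k2 = best_end(a[:j1][::-1], b[:j2][::-1], anchored=True)
    return score, (j1 - k1 + 1, j1), (j2 - k2 + 1, j2)


print(local_alignment_span("attga", "ccttagtc"))  # (5, (2, 4), (3, 6))
```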


Another approach
• Each 點[j1, j2] also keeps the index pair i1 and i2 such that 分(A[i1, j1], B[i2, j2]) is equal to the score kept in 點[j1, j2].
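One way to realize this is to store a (score, start) pair in every node and let the start travel along with the score, so a single pass over two rows yields both ends of the alignment. In the sketch below, the function name and the scores (match +2, mismatch -2, gap -1) are assumptions; a node with score 0 records the position just after itself as the start of whatever alignment begins next.

```python
def local_alignment_with_starts(a, b, match=2, mismatch=-2, gap=-1):
    """Return (score, (i1, i2), (j1, j2)): best score, its start pair, its end pair."""
    m = len(b)
    # Each node holds (score, start); indices are 1-based.
    prev = [(0, (1, j + 1)) for j in range(m + 1)]
    best = (0, None, None)
    for i in range(1, len(a) + 1):
        curr = [(0, (i + 1, 1))]
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (prev[j - 1][0] + pair, prev[j - 1][1]),   # diagonal edge
                (prev[j][0] + gap, prev[j][1]),            # a[i] against a gap
                (curr[j - 1][0] + gap, curr[j - 1][1]),    # b[j] against a gap
            ]
            score, start = max(candidates, key=lambda c: c[0])
            if score <= 0:
                score, start = 0, (i + 1, j + 1)           # restart after this node
            curr.append((score, start))
            if score > best[0]:
                best = (score, start, (i, j))
        prev = curr
    return best


print(local_alignment_with_starts("attga", "ccttagtc"))  # (5, (2, 3), (4, 6))
```

The result says that A[2…4] and B[3…6] give the best local score of 5, the same substrings as in the example table.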
