Algorithms in Bioinformatics (生物資訊相關演算法)

呂學一 (Hsueh-I Lu), Institute of Information Science, Academia Sinica

http://www.iis.sinica.edu.tw/~hil/

About your presentations
  • If
    • your initial decision (including your team members and your paper) is delayed by k weeks, and
    • your presentation is finished m weeks ahead of the week of the final exam (m could be negative),
  • then your grade for the presentation will be multiplied by (1 + mπ)(1 – ke).

Bioinformatics Algorithms

Today: from “identical or not” (一樣不一樣) to “similar or not” (像不像)
  • Alignment
    • The standard dynamic-programming algorithm
    • Reducing the space complexity
  • Two applications
    • Longest common subsequence
    • Edit distance of two strings
  • Two variants
    • End-space free alignment
    • Local alignment

Aligning two strings
  • A = attgatcctag
  • B = acttagtccttcgc
  • A → a-ttga-tcc-tag-
  • B → actt-agtccttcgc

Each “-” in the two aligned strings marks a gap.

Measuring an alignment

Scoring matrix

Other scoring matrices
  • BLAST matrix
  • Transition/Transversion matrix

Scoring matrix is an art
  • Log odds matrix
    • score[i, j] = log (q(i, j) / (p(i) p(j))), where q(i, j) is the observed frequency of the aligned pair (i, j) and p(i) is the background frequency of i.
  • PAM matrix
    • Point accepted mutation
  • BLOSUM matrix
    • Blocks substitution matrix
    • Steven Henikoff and Jorja G. Henikoff (1992).
  • Other specialized scoring matrices
    • Domenico Bordo and Patrick Argos (1991).
    • Jean-Michel Claverie (JCB 1993).
    • Lee F. Kolakowski and Kenneth A. Rice (Nature 1994).

Scoring an alignment
  • a - t t g a - t c c - t a g -
  • c c t t - a g t c c t t c g c
  • column scores: -2 -1 +2 +2 -1 +2 -1 +2 +2 +2 -1 +2 -2 +2 -1
  • score = 7
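The column-by-column scoring can be reproduced with a short sketch. The values match = +2, mismatch = -2, and gap = -1 are assumptions inferred from the numbers on this slide, and the function names are illustrative only.

```python
def pair_score(x, y, match=2, mismatch=-2, gap=-1):
    """Score one alignment column; '-' denotes a gap (assumed scores)."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def alignment_score(row_a, row_b):
    """Sum the column scores of two gapped rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(pair_score(x, y) for x, y in zip(row_a, row_b))


print(alignment_score("a-ttga-tcc-tag-", "cctt-agtccttcgc"))  # 7
```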

String alignment problem
  • Input:
    • two strings A and B; and
    • a scoring table 分 (the character 分 means “score”).
  • Output:
    • an alignment of A and B that has the maximum score with respect to 分.

Q: Any naïve methods?
  • A = attgatcctag
  • B = ccttagtccttcgc


Formulating the alignment problem as a graph problem

Solving the graph problem using standard dynamic programming

Alignment graph

[Figure: the alignment graph, a grid whose columns are labeled by the characters of B = ccttagtc and whose rows are labeled by the characters of A = attga.]

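Observations
  • Each alignment corresponds to a maximal path on the alignment graph.
  • The score of an alignment is the score of its corresponding maximal path.
  • A maximal path has no predecessor and no successor (前無古人, 後無來者): it cannot be extended at either end.

[Figure: the alignment graph for A = attga (rows) and B = ccttagtc (columns), shown with the following alignment.]

c c t t - a g t c
a - t t g a - - -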

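Score of edges

[Figure: one cell of the alignment graph, with column label B[j] and row label A[i]. The horizontal edge has score 分[-, B[j]], the vertical edge has score 分[A[i], -], and the diagonal edge has score 分[A[i], B[j]].]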

The graph problem

Finding a maximal path with maximum score on the alignment graph (a directed acyclic graph)

Idea
  • For each i = 0, 1, …, |A| and each j = 0, 1, …, |B|, let 點[i, j] (the character 點 means “node”) keep the maximum score of aligning A[1…i] and B[1…j].

[Figure: the grid of nodes 點[i, j], with rows i = 0, 1, …, |A| labeled by A[i] and columns j = 0, 1, …, |B| labeled by B[j].]

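An observation

點[i, j] = the maximum of
  • 點[i-1, j-1] + 分[A[i], B[j]]
  • 點[i-1, j] + 分[A[i], -]
  • 點[i, j-1] + 分[-, B[j]]

with 點[0, 0] = 0. On the boundary (i = 0 or j = 0), only the terms with valid indices apply.

[Figure: node 點[i, j] with its three incoming edges: from 點[i-1, j-1] with score 分[A[i], B[j]], from 點[i-1, j] with score 分[A[i], -], and from 點[i, j-1] with score 分[-, B[j]].]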
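A minimal sketch of this recurrence as a table fill. The names `node` and `global_alignment_table` are illustrative, and the scores (match +2, mismatch -2, gap -1) are assumptions chosen to agree with the worked scoring example, not values confirmed by the slides' scoring matrix.

```python
def global_alignment_table(a, b, match=2, mismatch=-2, gap=-1):
    """Fill node[i][j] = best score of aligning a[:i] with b[:j]."""
    n, m = len(a), len(b)
    node = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: a[:i] against gaps only
        node[i][0] = node[i - 1][0] + gap
    for j in range(1, m + 1):          # first row: b[:j] against gaps only
        node[0][j] = node[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            node[i][j] = max(node[i - 1][j - 1] + sub,  # diagonal edge
                             node[i - 1][j] + gap,      # vertical edge
                             node[i][j - 1] + gap)      # horizontal edge
    return node


table = global_alignment_table("attga", "ccttagtc")
print(table[-1][-1])  # score of an optimal global alignment
```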

For example

        c   c   t   t   a   g   t   c
    0  -1  -2  -3  -4  -5  -6  -7  -8
a  -1  -2  -3  -4  -5  -2  -3  -4  -5
t  -2  -3  -4  -1  -2  -3  -4  -1  -2
t  -3  -4  -5  -2   1   0  -1  -2  -3
g  -4  -5  -6  -3   0  -1   2   1   0
a  -5  -6  -7  -4  -1   2   1   0  -1

Trace back along the path taken (回顧來時徑) from the last node to recover an optimal alignment.
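The traceback step can be sketched as follows. This is one possible implementation under assumed scores (match +2, mismatch -2, gap -1): instead of storing pointers, it re-derives at each node which incoming edge attains the maximum. When several edges tie, it prefers the diagonal one, so it returns one optimal alignment among possibly many.

```python
def global_align(a, b, match=2, mismatch=-2, gap=-1):
    """Return (score, top_row, bottom_row) of one optimal global alignment."""
    n, m = len(a), len(b)
    node = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        node[i][0] = node[i - 1][0] + gap
    for j in range(1, m + 1):
        node[0][j] = node[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            node[i][j] = max(node[i - 1][j - 1] + sub,
                             node[i - 1][j] + gap,
                             node[i][j - 1] + gap)
    # Walk back from node[n][m] to node[0][0], emitting columns in reverse.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if node[i][j] == node[i - 1][j - 1] + sub:
                top.append(a[i - 1]); bottom.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and node[i][j] == node[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return node[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = global_align("attga", "ccttagtc")
print(score)
print(top)
print(bottom)
```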

Complexity
  • Space = O(|A|×|B|).
    • Each node keeps a score and a pointer, and thus requires only O(1) space.
  • Time = O(|A|×|B|).
    • The content of each node can be obtained from those of at most three nodes in O(1) time.

Challenge

Reducing the space complexity

First attempt

[Figure: the score table of the example above for A = attga and B = ccttagtc, filled in while keeping only the most recently computed part of the table.]

What is the problem?


Knowing the maximum score, but …

Not knowing the corresponding alignment
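A sketch of this score-only computation, assuming the same illustrative scores (match +2, mismatch -2, gap -1). It sweeps over B while keeping a single column of |A| + 1 values, so it returns the maximum score but cannot reconstruct the alignment.

```python
def global_score_linear_space(a, b, match=2, mismatch=-2, gap=-1):
    """Best global alignment score using O(len(a)) working space."""
    col = [i * gap for i in range(len(a) + 1)]      # column j = 0
    for ch in b:                                    # advance one column per character of b
        new = [col[0] + gap] + [0] * len(a)
        for i in range(1, len(a) + 1):
            sub = match if a[i - 1] == ch else mismatch
            new[i] = max(col[i - 1] + sub,          # diagonal edge
                         col[i] + gap,              # horizontal edge
                         new[i - 1] + gap)          # vertical edge
        col = new                                   # the previous column is discarded
    return col[len(a)]


print(global_score_linear_space("attga", "ccttagtc"))  # -1
```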

A key observation
  • Some optimal path passes through 點[i, j] if and only if 分(A, B) is the sum of
    • 分(A[1…i], B[1…j]) and
    • 分(A[i+1…|A|], B[j+1…|B|]),
  • where 分(X, Y) denotes the maximum score of aligning X and Y.

[Figure: the alignment graph for A = attga and B = ccttagtc.]

Finding an index i …
  • … such that an optimal path passes through 點[i, |B|/2], by computing, for all i,
    • 分(A[1…i], B[1…|B|/2]) and
    • 分(A[i+1…|A|], B[|B|/2+1…|B|]).
  • Time = O(|A||B|)?
  • Space = O(|A|)?

[Figure: the alignment graph with the columns 0, |B|/2, and |B| marked.]

Trick
  • The following two scores are the same:
    • 分(A[i+1…|A|], B[|B|/2+1…|B|])
    • 分(A^R[1…|A| – i], B^R[1…|B|/2]), where A^R and B^R denote the reverses of A and B.

[Figure: the alignment graph with the columns 0, |B|/2, and |B| marked.]
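The reversal trick can be sketched as below: one forward sweep over the first half of B and one sweep over the reversed strings give, for every i, the two partial scores whose sum is compared. The scores (match +2, mismatch -2, gap -1) and the names `last_column` and `find_split` are assumptions for illustration.

```python
def last_column(a, b, match=2, mismatch=-2, gap=-1):
    """col[i] = best score of aligning a[:i] with all of b, in O(len(a)) space."""
    col = [i * gap for i in range(len(a) + 1)]
    for ch in b:
        new = [col[0] + gap] + [0] * len(a)
        for i in range(1, len(a) + 1):
            sub = match if a[i - 1] == ch else mismatch
            new[i] = max(col[i - 1] + sub, col[i] + gap, new[i - 1] + gap)
        col = new
    return col


def find_split(a, b):
    """Return (i, total): a row i where an optimal path crosses column len(b)//2."""
    mid = len(b) // 2
    fwd = last_column(a, b[:mid])                  # scores of a[:i] vs first half of b
    bwd = last_column(a[::-1], b[mid:][::-1])      # scores of reversed suffixes
    best_i = max(range(len(a) + 1), key=lambda i: fwd[i] + bwd[len(a) - i])
    return best_i, fwd[best_i] + bwd[len(a) - best_i]


i, total = find_split("attga", "ccttagtc")
print(i, total)  # total equals the optimal global score
```

The reversed sweep is valid here because the assumed scoring is symmetric under reversal: aligning two suffixes scores the same as aligning their reverses.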

After locating the index i
  • Recursively solve two subproblems, for (A1, B1) and (A2, B2), where A1 = A[1…i], B1 = B[1…|B|/2], A2 = A[i+1…|A|], and B2 = B[|B|/2+1…|B|], so that
    • |A1| |B1| + |A2| |B2| = |A| |B| / 2.

[Figure: the alignment graph split at row i and column |B|/2 into the two subproblems.]

Overall complexity
  • Time = O(|A||B|).
    • Why?
    • O(|A||B| + |A||B|/2 + |A||B|/4 + |A||B|/8 + …) = O(|A||B|).
  • Space = O(|A|).
    • Why?
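Putting the pieces together gives a divide-and-conquer aligner, a technique commonly attributed to Hirschberg. The following is a sketch under assumed scores (match +2, mismatch -2, gap -1), with illustrative names. String slicing is used for brevity, so the working space is linear in |A| + |B| rather than exactly O(|A|).

```python
MATCH, MISMATCH, GAP = 2, -2, -1  # assumed scores, not taken from the slides' matrix


def pair(x, y):
    """Score of aligning two characters."""
    return MATCH if x == y else MISMATCH


def last_column(a, b):
    """col[i] = best score of aligning a[:i] with all of b, in O(len(a)) space."""
    col = [i * GAP for i in range(len(a) + 1)]
    for ch in b:
        new = [col[0] + GAP] + [0] * len(a)
        for i in range(1, len(a) + 1):
            new[i] = max(col[i - 1] + pair(a[i - 1], ch),
                         col[i] + GAP, new[i - 1] + GAP)
        col = new
    return col


def hirschberg(a, b):
    """Return the two rows of one optimal global alignment of a and b."""
    if not b:
        return a, "-" * len(a)
    if not a:
        return "-" * len(b), b
    if len(b) == 1:
        # Base case: pair b with its best partner in a, or leave it unpaired.
        k = max(range(len(a)), key=lambda i: pair(a[i], b))
        if pair(a[k], b) >= 2 * GAP:
            return a, "-" * k + b + "-" * (len(a) - k - 1)
        return a + "-", "-" * len(a) + b
    mid = len(b) // 2
    fwd = last_column(a, b[:mid])
    bwd = last_column(a[::-1], b[mid:][::-1])      # the reversal trick
    i = max(range(len(a) + 1), key=lambda r: fwd[r] + bwd[len(a) - r])
    top1, bot1 = hirschberg(a[:i], b[:mid])
    top2, bot2 = hirschberg(a[i:], b[mid:])
    return top1 + top2, bot1 + bot2


def alignment_score(top, bottom):
    """Score two equal-length gapped rows column by column."""
    return sum(GAP if "-" in (x, y) else pair(x, y) for x, y in zip(top, bottom))


top, bottom = hirschberg("attga", "ccttagtc")
print(top)
print(bottom)
print(alignment_score(top, bottom))  # -1
```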


Application 1

Longest common subsequence

Subsequence
  • For any indices 1 ≤ i1 < i2 < … < ik ≤ |A|, the string A[i1] A[i2] … A[ik] is a subsequence of A.
  • For example, A = 0 1 1 0 1 0 1
    • 0 1 1 1, 0 0 0, and 1 0 1 0 1 are subsequences of A.
    • 0 1 0 1 1 0 is not a subsequence of A.
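The definition can be checked mechanically with a greedy left-to-right scan. This small sketch (the function name is illustrative) verifies the four examples above.

```python
def is_subsequence(c, a):
    """True if c can be obtained from a by deleting zero or more characters."""
    remaining = iter(a)
    # 'ch in remaining' consumes the iterator up to and including the first match,
    # so later characters of c are only searched for to the right of earlier ones.
    return all(ch in remaining for ch in c)


a = "0110101"
print(is_subsequence("0111", a))    # True
print(is_subsequence("000", a))     # True
print(is_subsequence("10101", a))   # True
print(is_subsequence("010110", a))  # False
```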

Longest Common Subsequence
  • Input: two strings A and B
  • Output: a longest string C that is a subsequence of both A and B.

Any naïve algorithm?

It’s an alignment problem…
  • …with respect to the following scoring matrix:

[Table: the scoring matrix shown on the slide is not preserved in this transcript.]

Why?
  • Each alignment with score k corresponds to a common subsequence of length k.

0 1 1 - 1 0 - - 0 1 1 -
- 1 0 1 1 0 1 0 - 1 1 0

Common subsequence read off the matching columns: 1 1 0 1 1.
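A sketch of the reduction, assuming the scoring is match = 1 and everything else (mismatch or gap) = 0. This is a guess at the lost scoring matrix that makes the claim above hold, not a value confirmed by the slides. The function name `lcs` is illustrative.

```python
def lcs(a, b):
    """Return one longest common subsequence via alignment with 1/0 scores."""
    n, m = len(a), len(b)
    node = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 1 if a[i - 1] == b[j - 1] else 0   # assumed scores: match 1, else 0
            node[i][j] = max(node[i - 1][j - 1] + sub,
                             node[i - 1][j],          # gap costs 0
                             node[i][j - 1])
    # Trace back, collecting the characters of the matching columns.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1] and node[i][j] == node[i - 1][j - 1] + 1:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif node[i][j] == node[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("01110011", "1011010110"))
```

The alignment drawn on the slide illustrates the correspondence between score and subsequence length; it is not necessarily an optimal alignment of those two strings.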


Application 2

Edit distance between two strings

Edit operations
  • Inserting a character at position i
  • Deleting a character at position i
  • Replacing a character at position i by a new character

Edit distance
  • The edit distance between two strings A and B is the minimum number of edit operations required to turn A into B.

The edit distance problem
  • Input: two strings A and B
  • Output: the edit distance of A and B.

Any naïve algorithm?

It’s an alignment problem…
  • …with respect to the following scoring matrix:

[Table: the scoring matrix shown on the slide is not preserved in this transcript.]

Why?
  • Each alignment with score -k corresponds to a sequence of k edit operations that turn A into B.

0 1 1 - 1 0 - - 0 1 1 -
- 1 0 1 1 0 1 0 - 1 1 0
d . r i . . i i d . . i

Here d = delete, r = replace, i = insert, and “.” marks a matching column. Each of the 7 operations scores -1, so this alignment has score -7.
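A sketch of the reduction, assuming the scoring is match = 0 and mismatch = gap = -1. This is an assumption consistent with the -1 annotations above, since the scoring matrix itself is not available. The edit distance is the negated maximum alignment score.

```python
def edit_distance(a, b):
    """Edit distance as the negated best alignment score (assumed 0 / -1 scoring)."""
    prev = [-j for j in range(len(b) + 1)]           # row 0: insert b[:j]
    for i in range(1, len(a) + 1):
        cur = [-i] + [0] * len(b)                    # column 0: delete a[:i]
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else -1  # match 0, replace -1
            cur[j] = max(prev[j - 1] + sub,          # match or replace
                         prev[j] - 1,                # delete a[i-1]
                         cur[j - 1] - 1)             # insert b[j-1]
        prev = cur
    return -prev[len(b)]


print(edit_distance("kitten", "sitting"))  # 3
```

The alignment drawn on the slide uses 7 operations, which is an upper bound on the edit distance of those two strings rather than necessarily the minimum.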


A challenge

Speeding up the edit-distance algorithm

The challenge
  • Input: two strings A and B with |A| ≤ |B|.
  • Output: the edit distance k between A and B.
  • Objective:
    • Time: O(k|A|).
  • Note that we do not know k in advance; otherwise there would be no problem to solve.

Observation 1
  • Although we do not know k, we still know that |B| – |A| ≤ k. Why?

[Figure: the alignment graph for A = attga and B = ccttagtc.]

Observation 2
  • The optimal path does not pass through any 點[i, j] with |i – j| > k. Why?

[Figure: the alignment graph for A = attga and B = ccttagtc.]

Just computing 點[i, j] for 2k+1 diagonals

[Figure: the alignment graph with the band of 2k+1 diagonals around the main diagonal marked.]

Some thoughts
  • Idea:
    • It suffices to evaluate 點[i, j] for all indices i and j with |i – j| ≤ k.
  • But we don’t know k…
  • Modified idea:
    • It suffices to evaluate 點[i, j] for all indices i and j with |i – j| ≤ t for some number t ≥ k.
  • Q: How to find such a t?

A key lemma
  • Suppose we evaluate only those 點[i, j] with |i – j| ≤ t.
    • If t ≥ k, then 點[|A|,|B|] = – k ≥ – t.
    • If t < k, then 點[|A|,|B|] ≤ – k < – t.
  • Therefore, after evaluating those 2t + 1 diagonals, we can determine whether t ≥ k by checking whether 點[|A|,|B|] ≥ – t.
  • Why?

Algorithm
  • For s = 1, 2, 4, 8, …
    • Let t = s (|B| – |A|);
    • Evaluate those 點[i, j] with |i – j| ≤ t.
    • If 點[|A|,|B|] ≥ – t, then return – 點[|A|,|B|];
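The doubling loop above can be sketched as follows, using the assumed scoring match = 0, mismatch = gap = -1. One deliberate deviation: the sketch starts from t = max(1, |B| – |A|) rather than |B| – |A|, because s(|B| – |A|) would stay 0 forever when the two strings have equal length. The function names are illustrative.

```python
NEG_INF = float("-inf")


def banded_score(a, b, t):
    """Best alignment score using only nodes with |i - j| <= t (may be -inf)."""
    n, m = len(a), len(b)
    prev = {j: -j for j in range(min(m, t) + 1)}         # row i = 0 inside the band
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - t), min(m, i + t) + 1):
            best = NEG_INF
            if j in prev:                                # delete a[i-1]
                best = max(best, prev[j] - 1)
            if j - 1 in prev:                            # match or replace
                sub = 0 if a[i - 1] == b[j - 1] else -1
                best = max(best, prev[j - 1] + sub)
            if j - 1 in cur:                             # insert b[j-1]
                best = max(best, cur[j - 1] - 1)
            cur[j] = best
        prev = cur
    return prev.get(m, NEG_INF)


def edit_distance_doubling(a, b):
    """Edit distance in O(k * min(len(a), len(b))) time, k being the answer."""
    if len(a) > len(b):
        a, b = b, a                    # edit distance is symmetric
    t = max(1, len(b) - len(a))        # added guard for equal-length strings
    while True:
        score = banded_score(a, b, t)
        if score >= -t:                # by the key lemma, t >= k, so score = -k
            return -score
        t *= 2


print(edit_distance_doubling("kitten", "sitting"))  # 3
```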

Time complexity: O(k|A|)
  • Each iteration takes time O(t|A|).
  • The last iteration dominates the time complexity, since at the beginning of each iteration the value of t is increased by a factor of 2.
  • In the last iteration, we have t < 2k. Why?
  • The last iteration takes time O(k|A|).


A variant

End-space-free alignment

End space free
  • - - 0 1 1 0 - 0 1 1 0
  • 1 0 0 - 1 0 0 0 1 - -
  • score = 2 – 1 + 2 + 2 – 1 + 2 + 2 = 8; the gaps at the two ends are not charged.

Same process of evaluating 點[i, j].
  • What’s new?
    • 點[0, j] = 點[i, 0] = 0 for all i and j.
    • Output the maximum among all 點[|A|, j] and 點[i, |B|].
  • How?
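A sketch of the two changes, under the assumed scores match +2, mismatch -2, gap -1: the first row and column are initialized to 0, and the answer is the maximum over the last row and the last column. The function name is illustrative.

```python
def end_space_free_score(a, b, match=2, mismatch=-2, gap=-1):
    """Best alignment score when leading and trailing gaps are free."""
    n, m = len(a), len(b)
    # Row 0 and column 0 stay 0: leading gaps are not charged.
    node = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            node[i][j] = max(node[i - 1][j - 1] + sub,
                             node[i - 1][j] + gap,
                             node[i][j - 1] + gap)
    # Trailing gaps are not charged: stop anywhere on the last row or column.
    return max(max(node[n]), max(row[m] for row in node))


print(end_space_free_score("abc", "xxabcxx"))  # 6
```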

Another approach
  • Assign zero weights to all the edges on the boundaries, and then run the original global alignment algorithm.
  • Great idea!


Another variant

Local alignment

Local alignment
  • Input:
    • two strings A and B; and
    • a scoring matrix 分.
  • Output:
    • indices i1, j1, i2, and j2 such that 分(A[i1…j1], B[i2…j2]) is maximized.

Naïve approach
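  • Try all possible i1, j1, i2, and j2.
  • The time complexity is terribly high: there are O(|A|²|B|²) index choices, and each one requires a global alignment of the two substrings.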

Smarter approach

點[i, j] = the maximum of
  • 點[i-1, j-1] + 分[A[i], B[j]]
  • 點[i-1, j] + 分[A[i], -]
  • 點[i, j-1] + 分[-, B[j]]
  • 0.

The end indices j1 and j2 are the i and j with maximum 點[i, j].

Traverse backward from that node to find the start indices i1 and i2.

[Figure: node 點[i, j] with its three incoming edges, as in the global alignment recurrence.]
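This recurrence with the extra 0 term is commonly known as the Smith-Waterman algorithm. A sketch under the assumed scores match +2, mismatch -2, gap -1 follows. It keeps the full table and traces back until a node with value 0 is reached. The function name is illustrative.

```python
def local_alignment(a, b, match=2, mismatch=-2, gap=-1):
    """Return (score, substring of a, substring of b) of one best local alignment."""
    n, m = len(a), len(b)
    node = [[0] * (m + 1) for _ in range(n + 1)]
    best, end = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            node[i][j] = max(0,                          # start a fresh alignment here
                             node[i - 1][j - 1] + sub,
                             node[i - 1][j] + gap,
                             node[i][j - 1] + gap)
            if node[i][j] > best:                        # keep the first maximum found
                best, end = node[i][j], (i, j)
    # Trace back from the best node until the score drops to 0.
    i, j = end
    while i > 0 and j > 0 and node[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if node[i][j] == node[i - 1][j - 1] + sub:
            i -= 1
            j -= 1
        elif node[i][j] == node[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, a[i:end[0]], b[j:end[1]]


print(local_alignment("attga", "ccttagtc"))  # (5, 'ttg', 'ttag')
```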

For example

        c   c   t   t   a   g   t   c
    0   0   0   0   0   0   0   0   0
a   0   0   0   0   0   2   1   0   0
t   0   0   0   2   2   1   0   3   2
t   0   0   0   2   4   3   2   2   1
g   0   0   0   1   3   2   5   4   3
a   0   0   0   0   2   5   4   3   2

Complexity
  • Time and space = O(|A| |B|).
  • Q: Can we reduce the required space complexity to O(|A|)?

Some thoughts
  • The difficulty is that we cannot afford the space for keeping the backward pointers needed to trace back the path taken (回顧來時徑).
  • Do we really need them?

Time: O(|A||B|), space: O(|A|)
  • We first find the j1 and j2 that maximize 點[j1, j2].
  • In order to find i1 and i2, we can
    • let A’ = the reverse of A[1…j1];
    • let B’ = the reverse of B[1…j2];
    • solve the local alignment problem for A’ and B’, obtaining the end indices k1 and k2;
    • set i1 = j1 – k1 + 1 and
    • set i2 = j2 – k2 + 1.
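A sketch of this two-pass idea under the assumed scores match +2, mismatch -2, gap -1. One deliberate tightening, which goes beyond the slide's wording: the second pass is anchored, meaning the alignment of A’ and B’ is forced to start at their first characters (no reset to 0). This guarantees that the start found belongs to the same alignment as the end (j1, j2), even when several local alignments tie. Names are illustrative, and the rows kept have length |B| + 1; swapping the roles of the strings gives O(|A|).

```python
def best_end(a, b, anchored=False, match=2, mismatch=-2, gap=-1):
    """Return (score, i, j) of the best node, keeping only two rows.

    With anchored=True the alignment must start at the beginning of both
    strings, so the reset-to-0 option is disabled.
    """
    m = len(b)
    prev = [j * gap if anchored else 0 for j in range(m + 1)]
    best = (0, 0, 0)
    for i in range(1, len(a) + 1):
        cur = [i * gap if anchored else 0] + [0] * m
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            value = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            if not anchored:
                value = max(value, 0)
            cur[j] = value
            if value > best[0]:
                best = (value, i, j)
        prev = cur
    return best


def local_alignment_linear_space(a, b):
    """Return (score, (i1, j1), (i2, j2)) with 1-based inclusive index ranges."""
    score, j1, j2 = best_end(a, b)
    if score == 0:
        return 0, (1, 0), (1, 0)       # empty alignment
    # Second pass on the reversed prefixes locates the start of the alignment.
    _, k1, k2 = best_end(a[:j1][::-1], b[:j2][::-1], anchored=True)
    return score, (j1 - k1 + 1, j1), (j2 - k2 + 1, j2)


print(local_alignment_linear_space("attga", "ccttagtc"))  # (5, (2, 4), (3, 6))
```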

Another approach
  • Each 點[j1, j2] also keeps the index pair i1 and i2 such that 分(A[i1…j1], B[i2…j2]) is equal to the score kept in 點[j1, j2].
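A sketch of this bookkeeping under the assumed scores match +2, mismatch -2, gap -1. Each node stores a triple (score, start offset in A, start offset in B), and a node inherits the start offsets of whichever predecessor gives its score, so two rows suffice and no traceback is needed. The function name and the tuple layout are illustrative choices.

```python
def local_alignment_with_origins(a, b, match=2, mismatch=-2, gap=-1):
    """Return (score, (i1, j1), (i2, j2)) with 1-based inclusive index ranges."""
    m = len(b)
    # Each cell is (score, si, sj): the alignment covers a[si:i] and b[sj:j].
    prev = [(0, 0, j) for j in range(m + 1)]
    best = (0, 1, 0, 1, 0)                        # score, i1, j1, i2, j2
    for i in range(1, len(a) + 1):
        cur = [(0, i, 0)]
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (prev[j - 1][0] + sub, prev[j - 1][1], prev[j - 1][2]),  # diagonal
                (prev[j][0] + gap, prev[j][1], prev[j][2]),              # vertical
                (cur[j - 1][0] + gap, cur[j - 1][1], cur[j - 1][2]),     # horizontal
                (0, i, j),                                               # start afresh
            ]
            cell = max(candidates, key=lambda c: c[0])
            cur.append(cell)
            if cell[0] > best[0]:
                best = (cell[0], cell[1] + 1, i, cell[2] + 1, j)
        prev = cur
    score, i1, j1, i2, j2 = best
    return score, (i1, j1), (i2, j2)


print(local_alignment_with_origins("attga", "ccttagtc"))  # (5, (2, 4), (3, 6))
```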
