cs 5263 bioinformatics
Download
Skip this Video
Download Presentation
CS 5263 Bioinformatics

Loading in 2 Seconds...

play fullscreen
1 / 87

CS 5263 Bioinformatics - PowerPoint PPT Presentation


  • 143 Views
  • Uploaded on

CS 5263 Bioinformatics. Lecture 3: Dynamic Programming and Sequence Alignment. Roadmap. Review of last lecture Biology Dynamic programming Sequence alignment. R. R. R. R. R. R. …. H2N. COOH. C-terminal. N-terminal. Carboxyl group. Amino group. Protein zoom-in.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' CS 5263 Bioinformatics' - caron


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
cs 5263 bioinformatics

CS 5263 Bioinformatics

Lecture 3: Dynamic Programming and Sequence Alignment

roadmap
Roadmap
  • Review of last lecture
    • Biology
    • Dynamic programming
  • Sequence alignment
protein zoom in

R

R

R

R

R

R

H2N

COOH

C-terminal

N-terminal

Carboxyl group

Amino group

Protein zoom-in
  • Composed of a chain of amino acids.

R

|

H2N--C--COOH

|

H

Side chain

dna replication
DNA Replication
  • The process of copying a double-stranded DNA molecule
    • Semi-conservative

5’-ACATGATAA-3’

3’-TGTACTAT-5’

5’-ACATGATAA-3’ 5’-ACATGATAA-3’

3’-TGTACTATT-5’ 3’-TGTACTATT-5’

transcription
Transcription

(where genetic information is stored)

  • DNA-RNA pair:
  • A=U, C=G
  • T=A, G=C

(for making mRNA)

Coding strand: 5’-ACGTAGACGTATAGAGCCTAG-3’

Template strand: 3’-TGCATCTGCATATCTCGGATC-5’

mRNA: 5’-ACGUAGACGUAUAGAGCCUAG-3’

Coding strand and mRNA have the same sequence, except that T’s in DNA are replaced by U’s in mRNA.

the genetic code
The Genetic Code

Third

letter

translation
Translation
  • The sequence of codons is translated to a sequence of amino acids
  • Gene: -GCT TGT TTA CGA ATT-
  • mRNA: -GCUUGUUUACGAAUU -
  • Peptide: - Alu - Cys - Leu - Arg - Ile –
  • Start codon: AUG
    • Also code Met
    • Stop codon: UGA, UAA, UAA
dynamic programming
Dynamic programming
  • What is dynamic programming?
    • Solve an optimization problem by tabulating sub-problem solutions (memorization) rather than re-computing them
elements of dynamic programming
Elements of dynamic programming
  • Optimal sub-structures
    • Optimal solutions to the original problem contains optimal solutions to sub-problems
    • Solutions to sub-problems are independent
  • Overlapping sub-problems
    • Some sub-problems appear in many solutions
    • We should not solve each sub-problem for more than once
  • Memorization and reuse
    • Carefully choose the order that sub-problems are solved
    • Tabulate the solutions
    • Bottom-up
example
Example
  • Find the shortest path in a grid

2

3

1

(0,0)

s

1

5

1

1

3

3

2

3

3

2

2

2

1

1

2

1

2

3

4

g

(3,3)

optimal substructure
Optimal substructure
  • If a path P(s, g) is optimal, any sub-path, P(s,x), where x is on P(s,g), is also optimal
  • Proof by contradiction
    • If the path between P(s,x) is not the shortest, i.e., P’(s,x) < P(s,x)
    • Construct a new path P’(s,g) = P’(s,x) + P(x, g)
    • P’(s,g) < P(s,g) => P(s,g) is not the shortest
    • Contradiction
overlapping sub problems
Overlapping sub-problems
  • Some sub-problems are used by many paths

(0,0) -> (2,0) used by 3 paths

memorization and reuse
Memorization and reuse
  • Easy to tabulate and reuse
    • Number of sub-problems ~ number of nodes
    • P(s, x), for x in all nodes except s and g
  • Find an order such that no sub-problems need to be recomputed
    • First compute the smallest sub-problems
    • Use solutions of small sub-problems to solve large sub-problems
example shortest path
Example: shortest path

2

3

1

0

1

5

1

1

3

3

2

3

3

2

2

2

1

1

2

1

2

3

4

example shortest path1
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

3

2

3

3

2

2

2

4

1

1

2

1

2

3

4

5

example shortest path2
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

2

3

3

2

2

2

4

1

1

2

1

2

3

4

5

example shortest path3
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

3

2

3

3

2

2

2

4

1

1

2

1

2

3

4

5

example shortest path4
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

1

1

2

1

2

3

4

5

example shortest path5
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

1

1

2

1

2

3

4

5

example shortest path6
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

6

1

1

2

1

2

3

4

5

example shortest path7
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

6

8

1

1

2

1

2

3

4

5

example shortest path8
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

6

8

1

1

2

1

2

3

4

5

5

example shortest path9
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

6

8

1

1

2

1

2

3

4

5

5

7

example shortest path10
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

6

8

1

1

2

1

2

3

4

5

5

7

10

example shortest path11
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

6

8

1

1

2

1

2

3

4

5

5

7

10

analysis
Analysis
  • For a nxn grid
  • Enumeration:
    • number of paths = (2n!)/(n!)^2
    • Each path has 2n steps
    • Total operation: 2n * (2n!) / (n!)^2 = O(2^(2n))
  • Recursive call: O(2^(2n))
  • DP: O(n^2)
example fibonacci seq
Example: Fibonacci Seq
  • F(n) = F(n-1) + F(n-2), F(0) = F(1) = 1

Function fib(n)

if (n == 0 or n == 1)

return 1;

else

return fib(n-1) + fib(n-2);

example fibonacci seq1
Example: Fibonacci Seq

function fib(n)

F[0] = 1; F[1] = 1;

For i = 2 to n

F[n] = F[n-1] + F[n-2];

End

Return F[n];

today s lecture
Today’s lecture
  • Sequence alignment
    • Global alignment
why seq alignment
Why seq alignment?
  • Similar sequences often have similar origin or function
    • Two genes are said to be homologous if they share a common evolutionary history.
    • Evolutionary history can tell us a lot about properties of a given gene
    • Homology can be inferred from similarity between the genes
  • New protein sequences are always compared to sequence databases to search for proteins with same or similar functions
  • Most widely used computational tools in biology
evolution at the dna level
Evolution at the DNA level

C

…ACGGTGCAGTCACCA…

…ACGTTGC-GTCCACCA…

Sequence edits:

Mutation, deletion, insertion

evolutionary rates
Evolutionary Rates

next generation

OK

OK

OK

X

X

Still OK?

sequence alignment
Sequence Alignment

AGGCTATCACCTGACCTCCAGGCCGATGCCC

TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

  • Definition
  • An alignment of two string S, T is a pair of strings S’, T’ (with spaces) s.t.
  • |S’| = |T’|, and (|S| = “length of S”)
  • removing all spaces in S’, T’ leaves S, T
what is a good alignment
What is a good alignment?

Alignment:

The “best” way to match the letters of one sequence with those of the other

How do we define “best”?

slide42

S’: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

T’: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

  • The scoreof aligning (characters or spaces) x & y is σ (x,y).
  • Scoreof an alignment:
  • An optimal alignment: one with max score
scoring function
Scoring Function
  • Sequence edits:

AGGCCTC

    • Mutations AGGACTC
    • Insertions AGGGCCTC
    • Deletions AGG-CTC

Scoring Function:

Match: +m ~~~AAC~~~

Mismatch: -s ~~~A-A~~~

Gap (indel): -d

more complex scoring function
More complex scoring function
  • Substitution matrix
    • Similarity score of matching two letters a, b should reflect the probability of a, b derived from same ancestor
    • It is usually defined by log likelihood ratio (Durbin book)
    • Active research area. Especially for proteins.
    • Commonly used: PAM, BLOSUM
slide46
Match = 2, mismatch = -1, gap = -1
  • Score = 3 x 2 – 2 x 1 – 1 x 1 = 3
how to find it
How to find it?
  • A naïve algorithm:

for all subseqs A of S, B of T s.t. |A| = |B| do

align A[i] with B[i], 1 ≤i ≤|A|

align all other chars to spaces

compute its value

retain the max

end

output the retained alignment

S = abcd A = cd

T = wxyz B = xz

-abc-d a-bc-d

w--xyz -w-xyz

analysis1
Analysis
  • Assume |S| = |T| = n
  • Cost of evaluating one alignment: ≥n
  • How many alignments are there:
    • pick n chars of S,T together
    • say k of them are in S
    • match these k to the k unpicked chars of T
  • Total time:
  • E.g., for n = 20, time is > 240 >1012 operations
dynamic programming1
Dynamic Programming
  • We will now describe a dynamic programming algorithm

Suppose we wish to align

x1……xM

y1……yN

Let F(i,j) = optimal score of aligning

x1……xi

y1……yj

dynamic programming cont d
Dynamic Programming (cont’d)

Notice three possible cases:

  • xM aligns to yN

~~~~~~~ xM

~~~~~~~ yN

2. xM aligns to a gap

~~~~~~~ xM

~~~~~~~ -

  • yN aligns to a gap

~~~~~~~ -

~~~~~~~ yN

m, if xM = yN

F(M,N) = F(M-1, N-1) +

-s, if not

F(M,N) = F(M-1, N) - d

F(M,N) = F(M, N-1) - d

slide51
Therefore:

F(M-1, N-1) + (XM,YN)

  • F(M,N) = max F(M-1, N) – d

F(M, N-1) – d

(XM,YN) = m if XM = YN, and –s otherwise

  • Each sub-problem can be solved recursively
slide52
Generalize:

F(i-1, j-1) + (Xi,Yj)

  • F(i,j) = max F(i-1, j) – d

F(i, j-1) – d

  • Be careful with the boundary conditions
slide53
Remember:
    • The recursive formula is for understanding the relationship between sub-problems
    • We cannot afford to really solve them recursively
  • Number of sub-problems:
    • Each corresponds to calculating an F(i, j)
    • O(MN) of them
    • Solve all of them
slide55
F(i-1, j-1) + (Xi,Yj)
  • F(i, j) = max F(i-1, j) – d

F(i, j-1) – d

[case 1]

[case 2]

[case 3]

F(i-1, j-1)

F(i-1, j)

1

2

F(i, j-1)

F(i, j)

3

example1
Example

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3

example2
Example

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3

example3
Example

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3

example4
Example

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3

example5
Example

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

Optimal Alignment:

F(4,3) = 2

j = 0

1

2

3

example6
Example

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

Optimal Alignment:

F(4,3) = 2

This only tells us the best score

j = 0

1

2

3

trace back
Trace-back

x = AGTA m = 1

y = ATA s = -1

d = -1

    • F(i-1, j-1) + (Xi,Yj)
  • F(i,j) = max F(i-1, j) – d
  • F(i, j-1) – d

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3

trace back1
Trace-back

x = AGTA m = 1

y = ATA s = -1

d = -1

    • F(i-1, j-1) + (Xi,Yj)
  • F(i,j) = max F(i-1, j) – d
  • F(i, j-1) – d

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3

trace back2
Trace-back

x = AGTA m = 1

y = ATA s = -1

d = -1

    • F(i-1, j-1) + (Xi,Yj)
  • F(i,j) = max F(i-1, j) – d
  • F(i, j-1) – d

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3

trace back3
Trace-back

x = AGTA m = 1

y = ATA s = -1

d = -1

    • F(i-1, j-1) + (Xi,Yj)
  • F(i,j) = max F(i-1, j) – d
  • F(i, j-1) – d

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3

trace back4
Trace-back

x = AGTA m = 1

y = ATA s = -1

d = -1

    • F(i-1, j-1) + (Xi,Yj)
  • F(i,j) = max F(i-1, j) – d
  • F(i, j-1) – d

F(i,j) i = 0 1 2 3 4

Optimal Alignment:

F(4,3) = 2

AGTA

ATA

j = 0

1

2

3

slide68
In some cases, trace-back may be very time consuming
  • Alternative solution: remember where you come from!
    • Trade-off: more memory
using trace back pointers
Using trace-back pointers

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3

using trace back pointers1
Using trace-back pointers

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3

using trace back pointers2
Using trace-back pointers

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3

using trace back pointers3
Using trace-back pointers

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3

using trace back pointers4
Using trace-back pointers

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3

using trace back pointers5
Using trace-back pointers

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3

using trace back pointers6
Using trace-back pointers

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3

using trace back pointers7
Using trace-back pointers

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

Optimal Alignment:

F(4,3) = 2

AGTA

ATA

j = 0

1

2

3

the needleman wunsch algorithm
The Needleman-Wunsch Algorithm
  • Initialization.
    • F(0, 0) = 0
    • F(0, j) = - j  d
    • F(i, 0) = - i  d
  • Main Iteration. Filling in scores
    • For each i = 1……M

For each j = 1……N

F(i-1,j) – d [case 1]

F(i, j) = max F(i, j-1) – d [case 2]

F(i-1, j-1) + σ(xi, yj) [case 3]

UP, if [case 1]

Ptr(i,j) = LEFT if [case 2]

DIAG if [case 3]

  • Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment
performance
Performance
  • Time:

O(NM)

  • Space:

O(NM)

  • Later we will cover more efficient methods
a variant of the basic algorithm
A variant of the basic algorithm:
  • Maybe it is OK to have an unlimited # of gaps in the beginning and end:

----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC

GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------

  • Then, we don’t want to penalize gaps in the ends
the overlap detection variant
The Overlap Detection variant

Changes:

  • Initialization

For all i, j,

F(i, 0) = 0

F(0, j) = 0

  • Termination

maxi F(i, N)

FOPT = max maxj F(M, j)

x1……………………………… xM

yN……………………………… y1

a non bio variant
A non-bio variant
  • Shell command “diff” in unix
    • Given file1 and file2
    • Find the difference between file1 and file2
    • Similar to sequence alignment
    • How to score?
      • Longest common subsequence (LCS)
      • Match has score 1
      • No mismatch penalty
      • No gap penalty
slide84

$ diff file1 file2

1c1

< A

---

> G

4c4

< D

---

> -

LCS = 4

the lcs variant
The LCS variant

Changes:

  • Initialization

For all i, j, F(i, 0) = F(0, j) = 0

  • Filling in table

F(i-1,j)

F(i, j) = max F(i, j-1)

F(i-1, j-1) + σ(xi, yj)

where σ(xi, yj) = 1 if xi = yj and 0 otherwise.

  • Termination

maxi F(i, N)

FOPT = max

maxj F(M, j)

slide86
What happens if you have 1 million lines of text in each file?
  • Slow
    • What if the majority of the two files are the same? (e.g., two versions of a software)
    • Bounded DP
  • Memory inefficient
    • At least 1000 GB memory
    • Linear-space algorithm, same time complexity
ad