Cs 5263 bioinformatics
This presentation is the property of its rightful owner.
Sponsored Links
1 / 87

CS 5263 Bioinformatics PowerPoint PPT Presentation


  • 101 Views
  • Uploaded on
  • Presentation posted in: General

CS 5263 Bioinformatics. Lecture 3: Dynamic Programming and Sequence Alignment. Roadmap. Review of last lecture Biology Dynamic programming Sequence alignment. R. R. R. R. R. R. …. H2N. COOH. C-terminal. N-terminal. Carboxyl group. Amino group. Protein zoom-in.

Download Presentation

CS 5263 Bioinformatics

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Cs 5263 bioinformatics

CS 5263 Bioinformatics

Lecture 3: Dynamic Programming and Sequence Alignment


Roadmap

Roadmap

  • Review of last lecture

    • Biology

    • Dynamic programming

  • Sequence alignment


Protein zoom in

R

R

R

R

R

R

H2N

COOH

C-terminal

N-terminal

Carboxyl group

Amino group

Protein zoom-in

  • Composed of a chain of amino acids.

    R

    |

    H2N--C--COOH

    |

    H

Side chain


Genome chromosome gene

Genome, Chromosome, Gene


Dna replication

DNA Replication

  • The process of copying a double-stranded DNA molecule

    • Semi-conservative

      5’-ACATGATAA-3’

      3’-TGTACTAT-5’

      5’-ACATGATAA-3’ 5’-ACATGATAA-3’

      3’-TGTACTATT-5’ 3’-TGTACTATT-5’


Transcription

Transcription

(where genetic information is stored)

  • DNA-RNA pair:

  • A=U, C=G

  • T=A, G=C

(for making mRNA)

Coding strand: 5’-ACGTAGACGTATAGAGCCTAG-3’

Template strand: 3’-TGCATCTGCATATCTCGGATC-5’

mRNA: 5’-ACGUAGACGUAUAGAGCCUAG-3’

Coding strand and mRNA have the same sequence, except that T’s in DNA are replaced by U’s in mRNA.


The genetic code

The Genetic Code

Third

letter


Translation

Translation

  • The sequence of codons is translated to a sequence of amino acids

  • Gene: -GCT TGT TTA CGA ATT-

  • mRNA: -GCUUGUUUACGAAUU -

  • Peptide: - Alu - Cys - Leu - Arg - Ile –

  • Start codon: AUG

    • Also code Met

    • Stop codon: UGA, UAA, UAA


Dynamic programming

Dynamic programming

  • What is dynamic programming?

    • Solve an optimization problem by tabulating sub-problem solutions (memorization) rather than re-computing them


Elements of dynamic programming

Elements of dynamic programming

  • Optimal sub-structures

    • Optimal solutions to the original problem contains optimal solutions to sub-problems

    • Solutions to sub-problems are independent

  • Overlapping sub-problems

    • Some sub-problems appear in many solutions

    • We should not solve each sub-problem for more than once

  • Memorization and reuse

    • Carefully choose the order that sub-problems are solved

    • Tabulate the solutions

    • Bottom-up


Example

Example

  • Find the shortest path in a grid

2

3

1

(0,0)

s

1

5

1

1

3

3

2

3

3

2

2

2

1

1

2

1

2

3

4

g

(3,3)


Optimal substructure

Optimal substructure

  • If a path P(s, g) is optimal, any sub-path, P(s,x), where x is on P(s,g), is also optimal

  • Proof by contradiction

    • If the path between P(s,x) is not the shortest, i.e., P’(s,x) < P(s,x)

    • Construct a new path P’(s,g) = P’(s,x) + P(x, g)

    • P’(s,g) < P(s,g) => P(s,g) is not the shortest

    • Contradiction


Overlapping sub problems

Overlapping sub-problems

  • Some sub-problems are used by many paths

(0,0) -> (2,0) used by 3 paths


Memorization and reuse

Memorization and reuse

  • Easy to tabulate and reuse

    • Number of sub-problems ~ number of nodes

    • P(s, x), for x in all nodes except s and g

  • Find an order such that no sub-problems need to be recomputed

    • First compute the smallest sub-problems

    • Use solutions of small sub-problems to solve large sub-problems


Example shortest path

Example: shortest path

2

3

1

0

1

5

1

1

3

3

2

3

3

2

2

2

1

1

2

1

2

3

4


Example shortest path1

Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

3

2

3

3

2

2

2

4

1

1

2

1

2

3

4

5


Example shortest path2

Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

2

3

3

2

2

2

4

1

1

2

1

2

3

4

5


Example shortest path3

Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

3

2

3

3

2

2

2

4

1

1

2

1

2

3

4

5


Example shortest path4

Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

1

1

2

1

2

3

4

5


Example shortest path5

Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

1

1

2

1

2

3

4

5


Example shortest path6

Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

6

1

1

2

1

2

3

4

5


Example shortest path7

Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

6

8

1

1

2

1

2

3

4

5


Example shortest path8

Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

6

8

1

1

2

1

2

3

4

5

5


Example shortest path9

Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

6

8

1

1

2

1

2

3

4

5

5

7


Example shortest path10

Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

6

8

1

1

2

1

2

3

4

5

5

7

10


Example shortest path11

Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

6

8

1

1

2

1

2

3

4

5

5

7

10


Analysis

Analysis

  • For a nxn grid

  • Enumeration:

    • number of paths = (2n!)/(n!)^2

    • Each path has 2n steps

    • Total operation: 2n * (2n!) / (n!)^2 = O(2^(2n))

  • Recursive call: O(2^(2n))

  • DP: O(n^2)


Example fibonacci seq

Example: Fibonacci Seq

  • F(n) = F(n-1) + F(n-2), F(0) = F(1) = 1

    Function fib(n)

    if (n == 0 or n == 1)

    return 1;

    else

    return fib(n-1) + fib(n-2);


Cs 5263 bioinformatics

Time complexity: O(1.62^n)


Example fibonacci seq1

Example: Fibonacci Seq

function fib(n)

F[0] = 1;F[1] = 1;

For i = 2 to n

F[n] = F[n-1] + F[n-2];

End

Return F[n];


Cs 5263 bioinformatics

Time: O(n), space: O(n)


Cs 5263 bioinformatics

  • What if it is not so easy to figure out an order to fill in the table?

  • Exercise


Today s lecture

Today’s lecture

  • Sequence alignment

    • Global alignment


Why seq alignment

Why seq alignment?

  • Similar sequences often have similar origin or function

    • Two genes are said to be homologous if they share a common evolutionary history.

    • Evolutionary history can tell us a lot about properties of a given gene

    • Homology can be inferred from similarity between the genes

  • New protein sequences are always compared to sequence databases to search for proteins with same or similar functions

  • Most widely used computational tools in biology


Evolution at the dna level

Evolution at the DNA level

C

…ACGGTGCAGTCACCA…

…ACGTTGC-GTCCACCA…

Sequence edits:

Mutation, deletion, insertion


Evolutionary rates

Evolutionary Rates

next generation

OK

OK

OK

X

X

Still OK?


Sequence conservation implies function

Sequence conservation implies function


Sequence alignment

Sequence Alignment

AGGCTATCACCTGACCTCCAGGCCGATGCCC

TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

  • Definition

  • An alignment of two string S, T is a pair of strings S’, T’ (with spaces) s.t.

  • |S’| = |T’|, and (|S| = “length of S”)

  • removing all spaces in S’, T’ leaves S, T


What is a good alignment

What is a good alignment?

Alignment:

The “best” way to match the letters of one sequence with those of the other

How do we define “best”?


Cs 5263 bioinformatics

S’: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

T’: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

  • The scoreof aligning (characters or spaces) x & y is σ (x,y).

  • Scoreof an alignment:

  • An optimal alignment: one with max score


Scoring function

Scoring Function

  • Sequence edits:

    AGGCCTC

    • Mutations AGGACTC

    • InsertionsAGGGCCTC

    • DeletionsAGG-CTC

      Scoring Function:

      Match: +m~~~AAC~~~

      Mismatch: -s~~~A-A~~~

      Gap (indel):-d


More complex scoring function

More complex scoring function

  • Substitution matrix

    • Similarity score of matching two letters a, b should reflect the probability of a, b derived from same ancestor

    • It is usually defined by log likelihood ratio (Durbin book)

    • Active research area. Especially for proteins.

    • Commonly used: PAM, BLOSUM


An example substitution matrix

An example substitution matrix


Cs 5263 bioinformatics

  • Match = 2, mismatch = -1, gap = -1

  • Score = 3 x 2 – 2 x 1 – 1 x 1 = 3


How to find it

How to find it?

  • A naïve algorithm:

    for all subseqs A of S, B of T s.t. |A| = |B| do

    align A[i] with B[i], 1 ≤i ≤|A|

    align all other chars to spaces

    compute its value

    retain the max

    end

    output the retained alignment

S = abcd A = cd

T = wxyz B = xz

-abc-d a-bc-d

w--xyz -w-xyz


Analysis1

Analysis

  • Assume |S| = |T| = n

  • Cost of evaluating one alignment: ≥n

  • How many alignments are there:

    • pick n chars of S,T together

    • say k of them are in S

    • match these k to the k unpicked chars of T

  • Total time:

  • E.g., for n = 20, time is > 240 >1012 operations


Dynamic programming1

Dynamic Programming

  • We will now describe a dynamic programming algorithm

    Suppose we wish to align

    x1……xM

    y1……yN

    Let F(i,j) = optimal score of aligning

    x1……xi

    y1……yj


Dynamic programming cont d

Dynamic Programming (cont’d)

Notice three possible cases:

  • xM aligns to yN

    ~~~~~~~ xM

    ~~~~~~~ yN

    2.xM aligns to a gap

    ~~~~~~~ xM

    ~~~~~~~ -

  • yN aligns to a gap

    ~~~~~~~ -

    ~~~~~~~ yN

m, if xM = yN

F(M,N) = F(M-1, N-1) +

-s, if not

F(M,N) = F(M-1, N) - d

F(M,N) = F(M, N-1) - d


Cs 5263 bioinformatics

  • Therefore:

    F(M-1, N-1) + (XM,YN)

  • F(M,N) = max F(M-1, N) – d

    F(M, N-1) – d

    (XM,YN) = m if XM = YN, and –s otherwise

  • Each sub-problem can be solved recursively


Cs 5263 bioinformatics

  • Generalize:

    F(i-1, j-1) + (Xi,Yj)

  • F(i,j) = max F(i-1, j) – d

    F(i, j-1) – d

  • Be careful with the boundary conditions


Cs 5263 bioinformatics

  • Remember:

    • The recursive formula is for understanding the relationship between sub-problems

    • We cannot afford to really solve them recursively

  • Number of sub-problems:

    • Each corresponds to calculating an F(i, j)

    • O(MN) of them

    • Solve all of them


What order to fill

What order to fill?


Cs 5263 bioinformatics

F(i-1, j-1) + (Xi,Yj)

  • F(i, j) = max F(i-1, j) – d

    F(i, j-1) – d

[case 1]

[case 2]

[case 3]

F(i-1, j-1)

F(i-1, j)

1

2

F(i, j-1)

F(i, j)

3


What order to fill1

What order to fill?


Example1

Example

x = AGTAm = 1

y = ATAs = -1

d = -1

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3


Example2

Example

x = AGTAm = 1

y = ATAs = -1

d = -1

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3


Example3

Example

x = AGTAm = 1

y = ATAs = -1

d = -1

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3


Example4

Example

x = AGTAm = 1

y = ATAs = -1

d = -1

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3


Example5

Example

x = AGTAm = 1

y = ATAs = -1

d = -1

F(i,j) i = 0 1 2 3 4

Optimal Alignment:

F(4,3) = 2

j = 0

1

2

3


Example6

Example

x = AGTAm = 1

y = ATAs = -1

d = -1

F(i,j) i = 0 1 2 3 4

Optimal Alignment:

F(4,3) = 2

This only tells us the best score

j = 0

1

2

3


Trace back

Trace-back

x = AGTAm = 1

y = ATAs = -1

d = -1

  • F(i-1, j-1) + (Xi,Yj)

  • F(i,j) = max F(i-1, j) – d

  • F(i, j-1) – d

  • F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Trace back1

    Trace-back

    x = AGTAm = 1

    y = ATAs = -1

    d = -1

    • F(i-1, j-1) + (Xi,Yj)

  • F(i,j) = max F(i-1, j) – d

  • F(i, j-1) – d

  • F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Trace back2

    Trace-back

    x = AGTAm = 1

    y = ATAs = -1

    d = -1

    • F(i-1, j-1) + (Xi,Yj)

  • F(i,j) = max F(i-1, j) – d

  • F(i, j-1) – d

  • F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Trace back3

    Trace-back

    x = AGTAm = 1

    y = ATAs = -1

    d = -1

    • F(i-1, j-1) + (Xi,Yj)

  • F(i,j) = max F(i-1, j) – d

  • F(i, j-1) – d

  • F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Trace back4

    Trace-back

    x = AGTAm = 1

    y = ATAs = -1

    d = -1

    • F(i-1, j-1) + (Xi,Yj)

  • F(i,j) = max F(i-1, j) – d

  • F(i, j-1) – d

  • F(i,j) i = 0 1 2 3 4

    Optimal Alignment:

    F(4,3) = 2

    AGTA

    ATA

    j = 0

    1

    2

    3


    Cs 5263 bioinformatics

    • In some cases, trace-back may be very time consuming

    • Alternative solution: remember where you come from!

      • Trade-off: more memory


    Using trace back pointers

    Using trace-back pointers

    x = AGTAm = 1

    y = ATAs = -1

    d = -1

    F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Using trace back pointers1

    Using trace-back pointers

    x = AGTAm = 1

    y = ATAs = -1

    d = -1

    F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Using trace back pointers2

    Using trace-back pointers

    x = AGTAm = 1

    y = ATAs = -1

    d = -1

    F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Using trace back pointers3

    Using trace-back pointers

    x = AGTAm = 1

    y = ATAs = -1

    d = -1

    F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Using trace back pointers4

    Using trace-back pointers

    x = AGTAm = 1

    y = ATAs = -1

    d = -1

    F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Using trace back pointers5

    Using trace-back pointers

    x = AGTAm = 1

    y = ATAs = -1

    d = -1

    F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Using trace back pointers6

    Using trace-back pointers

    x = AGTAm = 1

    y = ATAs = -1

    d = -1

    F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Using trace back pointers7

    Using trace-back pointers

    x = AGTAm = 1

    y = ATAs = -1

    d = -1

    F(i,j) i = 0 1 2 3 4

    Optimal Alignment:

    F(4,3) = 2

    AGTA

    ATA

    j = 0

    1

    2

    3


    The needleman wunsch algorithm

    The Needleman-Wunsch Algorithm

    • Initialization.

      • F(0, 0) = 0

      • F(0, j) = - j  d

      • F(i, 0)= - i  d

    • Main Iteration. Filling in scores

      • For each i = 1……M

        For each j = 1……N

        F(i-1,j) – d [case 1]

        F(i, j) = max F(i, j-1) – d [case 2]

        F(i-1, j-1) + σ(xi, yj) [case 3]

        UP, if [case 1]

        Ptr(i,j)= LEFTif [case 2]

        DIAGif [case 3]

    • Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment


    Performance

    Performance

    • Time:

      O(NM)

    • Space:

      O(NM)

    • Later we will cover more efficient methods


    A variant of the basic algorithm

    A variant of the basic algorithm:

    • Maybe it is OK to have an unlimited # of gaps in the beginning and end:

    ----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC

    GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------

    • Then, we don’t want to penalize gaps in the ends


    The overlap detection variant

    The Overlap Detection variant

    Changes:

    • Initialization

      For all i, j,

      F(i, 0) = 0

      F(0, j) = 0

    • Termination

      maxi F(i, N)

      FOPT = max maxj F(M, j)

    x1……………………………… xM

    yN……………………………… y1


    Different types of overlaps

    Different types of overlaps

    x

    x

    y

    y


    A non bio variant

    A non-bio variant

    • Shell command “diff” in unix

      • Given file1 and file2

      • Find the difference between file1 and file2

      • Similar to sequence alignment

      • How to score?

        • Longest common subsequence (LCS)

        • Match has score 1

        • No mismatch penalty

        • No gap penalty


    Cs 5263 bioinformatics

    $ diff file1 file2

    1c1

    < A

    ---

    > G

    4c4

    < D

    ---

    > -

    LCS = 4


    The lcs variant

    The LCS variant

    Changes:

    • Initialization

      For all i, j, F(i, 0) = F(0, j) = 0

    • Filling in table

      F(i-1,j)

      F(i, j) = max F(i, j-1)

      F(i-1, j-1) + σ(xi, yj)

      where σ(xi, yj) = 1 if xi = yj and 0 otherwise.

    • Termination

      maxi F(i, N)

      FOPT = max

      maxj F(M, j)


    Cs 5263 bioinformatics

    • What happens if you have 1 million lines of text in each file?

    • Slow

      • What if the majority of the two files are the same? (e.g., two versions of a software)

      • Bounded DP

    • Memory inefficient

      • At least 1000 GB memory

      • Linear-space algorithm, same time complexity


    Cs 5263 bioinformatics

    See you next week


  • Login