Cs 5263 bioinformatics
Download
1 / 87

CS 5263 Bioinformatics - PowerPoint PPT Presentation


  • 142 Views
  • Uploaded on

CS 5263 Bioinformatics. Lecture 3: Dynamic Programming and Sequence Alignment. Roadmap. Review of last lecture Biology Dynamic programming Sequence alignment. R. R. R. R. R. R. …. H2N. COOH. C-terminal. N-terminal. Carboxyl group. Amino group. Protein zoom-in.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' CS 5263 Bioinformatics' - caron


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Cs 5263 bioinformatics

CS 5263 Bioinformatics

Lecture 3: Dynamic Programming and Sequence Alignment


Roadmap
Roadmap

  • Review of last lecture

    • Biology

    • Dynamic programming

  • Sequence alignment


Protein zoom in

R

R

R

R

R

R

H2N

COOH

C-terminal

N-terminal

Carboxyl group

Amino group

Protein zoom-in

  • Composed of a chain of amino acids.

    R

    |

    H2N--C--COOH

    |

    H

Side chain



Dna replication
DNA Replication

  • The process of copying a double-stranded DNA molecule

    • Semi-conservative

      5’-ACATGATAA-3’

      3’-TGTACTAT-5’

      5’-ACATGATAA-3’ 5’-ACATGATAA-3’

      3’-TGTACTATT-5’ 3’-TGTACTATT-5’


Transcription
Transcription

(where genetic information is stored)

  • DNA-RNA pair:

  • A=U, C=G

  • T=A, G=C

(for making mRNA)

Coding strand: 5’-ACGTAGACGTATAGAGCCTAG-3’

Template strand: 3’-TGCATCTGCATATCTCGGATC-5’

mRNA: 5’-ACGUAGACGUAUAGAGCCUAG-3’

Coding strand and mRNA have the same sequence, except that T’s in DNA are replaced by U’s in mRNA.


The genetic code
The Genetic Code

Third

letter


Translation
Translation

  • The sequence of codons is translated to a sequence of amino acids

  • Gene: -GCT TGT TTA CGA ATT-

  • mRNA: -GCUUGUUUACGAAUU -

  • Peptide: - Alu - Cys - Leu - Arg - Ile –

  • Start codon: AUG

    • Also code Met

    • Stop codon: UGA, UAA, UAA


Dynamic programming
Dynamic programming

  • What is dynamic programming?

    • Solve an optimization problem by tabulating sub-problem solutions (memorization) rather than re-computing them


Elements of dynamic programming
Elements of dynamic programming

  • Optimal sub-structures

    • Optimal solutions to the original problem contains optimal solutions to sub-problems

    • Solutions to sub-problems are independent

  • Overlapping sub-problems

    • Some sub-problems appear in many solutions

    • We should not solve each sub-problem for more than once

  • Memorization and reuse

    • Carefully choose the order that sub-problems are solved

    • Tabulate the solutions

    • Bottom-up


Example
Example

  • Find the shortest path in a grid

2

3

1

(0,0)

s

1

5

1

1

3

3

2

3

3

2

2

2

1

1

2

1

2

3

4

g

(3,3)


Optimal substructure
Optimal substructure

  • If a path P(s, g) is optimal, any sub-path, P(s,x), where x is on P(s,g), is also optimal

  • Proof by contradiction

    • If the path between P(s,x) is not the shortest, i.e., P’(s,x) < P(s,x)

    • Construct a new path P’(s,g) = P’(s,x) + P(x, g)

    • P’(s,g) < P(s,g) => P(s,g) is not the shortest

    • Contradiction


Overlapping sub problems
Overlapping sub-problems

  • Some sub-problems are used by many paths

(0,0) -> (2,0) used by 3 paths


Memorization and reuse
Memorization and reuse

  • Easy to tabulate and reuse

    • Number of sub-problems ~ number of nodes

    • P(s, x), for x in all nodes except s and g

  • Find an order such that no sub-problems need to be recomputed

    • First compute the smallest sub-problems

    • Use solutions of small sub-problems to solve large sub-problems


Example shortest path
Example: shortest path

2

3

1

0

1

5

1

1

3

3

2

3

3

2

2

2

1

1

2

1

2

3

4


Example shortest path1
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

3

2

3

3

2

2

2

4

1

1

2

1

2

3

4

5


Example shortest path2
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

2

3

3

2

2

2

4

1

1

2

1

2

3

4

5


Example shortest path3
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

3

2

3

3

2

2

2

4

1

1

2

1

2

3

4

5


Example shortest path4
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

1

1

2

1

2

3

4

5


Example shortest path5
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

1

1

2

1

2

3

4

5


Example shortest path6
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

6

1

1

2

1

2

3

4

5


Example shortest path7
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

6

8

1

1

2

1

2

3

4

5


Example shortest path8
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

6

8

1

1

2

1

2

3

4

5

5


Example shortest path9
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

6

8

1

1

2

1

2

3

4

5

5

7


Example shortest path10
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

6

8

1

1

2

1

2

3

4

5

5

7

10


Example shortest path11
Example: shortest path

2

3

1

0

2

5

6

1

5

1

1

3

1

2

3

6

3

2

3

3

2

2

2

4

4

6

8

1

1

2

1

2

3

4

5

5

7

10


Analysis
Analysis

  • For a nxn grid

  • Enumeration:

    • number of paths = (2n!)/(n!)^2

    • Each path has 2n steps

    • Total operation: 2n * (2n!) / (n!)^2 = O(2^(2n))

  • Recursive call: O(2^(2n))

  • DP: O(n^2)


Example fibonacci seq
Example: Fibonacci Seq

  • F(n) = F(n-1) + F(n-2), F(0) = F(1) = 1

    Function fib(n)

    if (n == 0 or n == 1)

    return 1;

    else

    return fib(n-1) + fib(n-2);



Example fibonacci seq1
Example: Fibonacci Seq

function fib(n)

F[0] = 1; F[1] = 1;

For i = 2 to n

F[n] = F[n-1] + F[n-2];

End

Return F[n];




Today s lecture
Today’s lecture the table?

  • Sequence alignment

    • Global alignment


Why seq alignment
Why seq alignment? the table?

  • Similar sequences often have similar origin or function

    • Two genes are said to be homologous if they share a common evolutionary history.

    • Evolutionary history can tell us a lot about properties of a given gene

    • Homology can be inferred from similarity between the genes

  • New protein sequences are always compared to sequence databases to search for proteins with same or similar functions

  • Most widely used computational tools in biology


Evolution at the dna level
Evolution at the DNA level the table?

C

…ACGGTGCAGTCACCA…

…ACGTTGC-GTCCACCA…

Sequence edits:

Mutation, deletion, insertion


Evolutionary rates
Evolutionary Rates the table?

next generation

OK

OK

OK

X

X

Still OK?



Sequence alignment
Sequence Alignment the table?

AGGCTATCACCTGACCTCCAGGCCGATGCCC

TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

  • Definition

  • An alignment of two string S, T is a pair of strings S’, T’ (with spaces) s.t.

  • |S’| = |T’|, and (|S| = “length of S”)

  • removing all spaces in S’, T’ leaves S, T


What is a good alignment
What is a good alignment? the table?

Alignment:

The “best” way to match the letters of one sequence with those of the other

How do we define “best”?


S’: - the table?AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

T’: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

  • The scoreof aligning (characters or spaces) x & y is σ (x,y).

  • Scoreof an alignment:

  • An optimal alignment: one with max score


Scoring function
Scoring Function the table?

  • Sequence edits:

    AGGCCTC

    • Mutations AGGACTC

    • Insertions AGGGCCTC

    • Deletions AGG-CTC

      Scoring Function:

      Match: +m ~~~AAC~~~

      Mismatch: -s ~~~A-A~~~

      Gap (indel): -d


More complex scoring function
More complex scoring function the table?

  • Substitution matrix

    • Similarity score of matching two letters a, b should reflect the probability of a, b derived from same ancestor

    • It is usually defined by log likelihood ratio (Durbin book)

    • Active research area. Especially for proteins.

    • Commonly used: PAM, BLOSUM




How to find it
How to find it? the table?

  • A naïve algorithm:

    for all subseqs A of S, B of T s.t. |A| = |B| do

    align A[i] with B[i], 1 ≤i ≤|A|

    align all other chars to spaces

    compute its value

    retain the max

    end

    output the retained alignment

S = abcd A = cd

T = wxyz B = xz

-abc-d a-bc-d

w--xyz -w-xyz


Analysis1
Analysis the table?

  • Assume |S| = |T| = n

  • Cost of evaluating one alignment: ≥n

  • How many alignments are there:

    • pick n chars of S,T together

    • say k of them are in S

    • match these k to the k unpicked chars of T

  • Total time:

  • E.g., for n = 20, time is > 240 >1012 operations


Dynamic programming1
Dynamic Programming the table?

  • We will now describe a dynamic programming algorithm

    Suppose we wish to align

    x1……xM

    y1……yN

    Let F(i,j) = optimal score of aligning

    x1……xi

    y1……yj


Dynamic programming cont d
Dynamic Programming (cont the table?’d)

Notice three possible cases:

  • xM aligns to yN

    ~~~~~~~ xM

    ~~~~~~~ yN

    2. xM aligns to a gap

    ~~~~~~~ xM

    ~~~~~~~ -

  • yN aligns to a gap

    ~~~~~~~ -

    ~~~~~~~ yN

m, if xM = yN

F(M,N) = F(M-1, N-1) +

-s, if not

F(M,N) = F(M-1, N) - d

F(M,N) = F(M, N-1) - d


  • Therefore: the table?

    F(M-1, N-1) + (XM,YN)

  • F(M,N) = max F(M-1, N) – d

    F(M, N-1) – d

    (XM,YN) = m if XM = YN, and –s otherwise

  • Each sub-problem can be solved recursively


  • Generalize: the table?

    F(i-1, j-1) + (Xi,Yj)

  • F(i,j) = max F(i-1, j) – d

    F(i, j-1) – d

  • Be careful with the boundary conditions


  • Remember: the table?

    • The recursive formula is for understanding the relationship between sub-problems

    • We cannot afford to really solve them recursively

  • Number of sub-problems:

    • Each corresponds to calculating an F(i, j)

    • O(MN) of them

    • Solve all of them



F(i-1, j-1) + the table?(Xi,Yj)

  • F(i, j) = max F(i-1, j) – d

    F(i, j-1) – d

[case 1]

[case 2]

[case 3]

F(i-1, j-1)

F(i-1, j)

1

2

F(i, j-1)

F(i, j)

3



Example1
Example the table?

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3


Example2
Example the table?

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3


Example3
Example the table?

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3


Example4
Example the table?

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

j = 0

1

2

3


Example5
Example the table?

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

Optimal Alignment:

F(4,3) = 2

j = 0

1

2

3


Example6
Example the table?

x = AGTA m = 1

y = ATA s = -1

d = -1

F(i,j) i = 0 1 2 3 4

Optimal Alignment:

F(4,3) = 2

This only tells us the best score

j = 0

1

2

3


Trace back
Trace-back the table?

x = AGTA m = 1

y = ATA s = -1

d = -1

  • F(i-1, j-1) + (Xi,Yj)

  • F(i,j) = max F(i-1, j) – d

  • F(i, j-1) – d

  • F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Trace back1
    Trace-back the table?

    x = AGTA m = 1

    y = ATA s = -1

    d = -1

    • F(i-1, j-1) + (Xi,Yj)

  • F(i,j) = max F(i-1, j) – d

  • F(i, j-1) – d

  • F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Trace back2
    Trace-back the table?

    x = AGTA m = 1

    y = ATA s = -1

    d = -1

    • F(i-1, j-1) + (Xi,Yj)

  • F(i,j) = max F(i-1, j) – d

  • F(i, j-1) – d

  • F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Trace back3
    Trace-back the table?

    x = AGTA m = 1

    y = ATA s = -1

    d = -1

    • F(i-1, j-1) + (Xi,Yj)

  • F(i,j) = max F(i-1, j) – d

  • F(i, j-1) – d

  • F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Trace back4
    Trace-back the table?

    x = AGTA m = 1

    y = ATA s = -1

    d = -1

    • F(i-1, j-1) + (Xi,Yj)

  • F(i,j) = max F(i-1, j) – d

  • F(i, j-1) – d

  • F(i,j) i = 0 1 2 3 4

    Optimal Alignment:

    F(4,3) = 2

    AGTA

    ATA

    j = 0

    1

    2

    3



    Using trace back pointers
    Using trace-back pointers the table?

    x = AGTA m = 1

    y = ATA s = -1

    d = -1

    F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Using trace back pointers1
    Using trace-back pointers the table?

    x = AGTA m = 1

    y = ATA s = -1

    d = -1

    F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Using trace back pointers2
    Using trace-back pointers the table?

    x = AGTA m = 1

    y = ATA s = -1

    d = -1

    F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Using trace back pointers3
    Using trace-back pointers the table?

    x = AGTA m = 1

    y = ATA s = -1

    d = -1

    F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Using trace back pointers4
    Using trace-back pointers the table?

    x = AGTA m = 1

    y = ATA s = -1

    d = -1

    F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Using trace back pointers5
    Using trace-back pointers the table?

    x = AGTA m = 1

    y = ATA s = -1

    d = -1

    F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Using trace back pointers6
    Using trace-back pointers the table?

    x = AGTA m = 1

    y = ATA s = -1

    d = -1

    F(i,j) i = 0 1 2 3 4

    j = 0

    1

    2

    3


    Using trace back pointers7
    Using trace-back pointers the table?

    x = AGTA m = 1

    y = ATA s = -1

    d = -1

    F(i,j) i = 0 1 2 3 4

    Optimal Alignment:

    F(4,3) = 2

    AGTA

    ATA

    j = 0

    1

    2

    3


    The needleman wunsch algorithm
    The Needleman-Wunsch Algorithm the table?

    • Initialization.

      • F(0, 0) = 0

      • F(0, j) = - j  d

      • F(i, 0) = - i  d

    • Main Iteration. Filling in scores

      • For each i = 1……M

        For each j = 1……N

        F(i-1,j) – d [case 1]

        F(i, j) = max F(i, j-1) – d [case 2]

        F(i-1, j-1) + σ(xi, yj) [case 3]

        UP, if [case 1]

        Ptr(i,j) = LEFT if [case 2]

        DIAG if [case 3]

    • Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment


    Performance
    Performance the table?

    • Time:

      O(NM)

    • Space:

      O(NM)

    • Later we will cover more efficient methods


    A variant of the basic algorithm
    A variant of the basic algorithm: the table?

    • Maybe it is OK to have an unlimited # of gaps in the beginning and end:

    ----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC

    GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------

    • Then, we don’t want to penalize gaps in the ends


    The overlap detection variant
    The the table?Overlap Detection variant

    Changes:

    • Initialization

      For all i, j,

      F(i, 0) = 0

      F(0, j) = 0

    • Termination

      maxi F(i, N)

      FOPT = max maxj F(M, j)

    x1……………………………… xM

    yN……………………………… y1


    Different types of overlaps
    Different types of overlaps the table?

    x

    x

    y

    y


    A non bio variant
    A non-bio variant the table?

    • Shell command “diff” in unix

      • Given file1 and file2

      • Find the difference between file1 and file2

      • Similar to sequence alignment

      • How to score?

        • Longest common subsequence (LCS)

        • Match has score 1

        • No mismatch penalty

        • No gap penalty


    $ diff file1 file2 the table?

    1c1

    < A

    ---

    > G

    4c4

    < D

    ---

    > -

    LCS = 4


    The lcs variant
    The the table?LCS variant

    Changes:

    • Initialization

      For all i, j, F(i, 0) = F(0, j) = 0

    • Filling in table

      F(i-1,j)

      F(i, j) = max F(i, j-1)

      F(i-1, j-1) + σ(xi, yj)

      where σ(xi, yj) = 1 if xi = yj and 0 otherwise.

    • Termination

      maxi F(i, N)

      FOPT = max

      maxj F(M, j)




    ad