CS 5263 Bioinformatics

CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Sequence Alignment

Roadmap • Review of last lecture • Biology • Dynamic programming • Sequence alignment

Protein zoom-in • Composed of a chain of amino acids. • Each amino acid: H2N--CH(R)--COOH, with an amino group (H2N), a carboxyl group (COOH), and a side chain (R) • The chain runs from the N-terminal (amino end) to the C-terminal (carboxyl end)

Genome, Chromosome, Gene

DNA Replication • The process of copying a double-stranded DNA molecule • Semi-conservative: each daughter molecule keeps one parental strand
Parent: 5’-ACATGATAA-3’ / 3’-TGTACTATT-5’
Daughter 1: 5’-ACATGATAA-3’ / 3’-TGTACTATT-5’
Daughter 2: 5’-ACATGATAA-3’ / 3’-TGTACTATT-5’

Transcription (where genetic information is stored) • DNA-RNA pair: • A=U, C=G • T=A, G=C (for making mRNA) Coding strand: 5’-ACGTAGACGTATAGAGCCTAG-3’ Template strand: 3’-TGCATCTGCATATCTCGGATC-5’ mRNA: 5’-ACGUAGACGUAUAGAGCCUAG-3’ Coding strand and mRNA have the same sequence, except that T’s in DNA are replaced by U’s in mRNA.
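The pairing rules on this slide can be checked with a few lines of Python. This is only a minimal sketch: the function name `transcribe` and the dictionary `DNA_TO_RNA` are made up for illustration, and the template strand is assumed to be written 3'->5' as on the slide.

```python
# Pairing rules from the slide: template base -> mRNA base.
DNA_TO_RNA = {"A": "U", "C": "G", "T": "A", "G": "C"}

def transcribe(template_3to5: str) -> str:
    """Build the mRNA (read 5'->3') from a template strand written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3to5)

coding = "ACGTAGACGTATAGAGCCTAG"    # 5'->3', copied from the slide
template = "TGCATCTGCATATCTCGGATC"  # 3'->5', copied from the slide
mrna = transcribe(template)
print(mrna)
```

The result should equal the coding strand with every T replaced by U, which is the statement made at the end of the slide.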

The Genetic Code [figure: codon table]

Translation • The sequence of codons is translated to a sequence of amino acids • Gene: -GCT TGT TTA CGA ATT- • mRNA: -GCU UGU UUA CGA AUU- • Peptide: - Ala - Cys - Leu - Arg - Ile - • Start codon: AUG • Also codes for Met • Stop codons: UGA, UAA, UAG
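As a rough illustration of codon-by-codon translation, the sketch below uses a deliberately partial codon table that covers only the codons mentioned on this slide. The names `CODON_TABLE` and `translate` are hypothetical, and a real table has 64 entries.

```python
# Partial codon table (assumption: only the codons used on this slide).
# None marks a stop codon.
CODON_TABLE = {
    "GCU": "Ala", "UGU": "Cys", "UUA": "Leu", "CGA": "Arg", "AUU": "Ile",
    "AUG": "Met",
    "UGA": None, "UAA": None, "UAG": None,
}

def translate(mrna: str) -> list:
    """Translate an mRNA string codon by codon until a stop codon or the end."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid is None:  # stop codon reached
            break
        peptide.append(amino_acid)
    return peptide

gene = "GCTTGTTTACGAATT"         # coding strand from the slide
mrna = gene.replace("T", "U")    # same sequence with U instead of T
print("-".join(translate(mrna)))
```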

Dynamic programming • What is dynamic programming? • Solve an optimization problem by tabulating sub-problem solutions (memoization) rather than re-computing them

Elements of dynamic programming • Optimal sub-structures • Optimal solutions to the original problem contain optimal solutions to sub-problems • Solutions to sub-problems are independent • Overlapping sub-problems • Some sub-problems appear in many solutions • We should not solve any sub-problem more than once • Memoization and reuse • Carefully choose the order in which sub-problems are solved • Tabulate the solutions • Bottom-up

Example • Find the shortest path in a grid from s = (0,0) to g = (3,3) [figure: grid of nodes with weighted edges]

Optimal substructure • If a path P(s,g) is optimal, then any sub-path P(s,x), where x is a node on P(s,g), is also optimal • Proof by contradiction • Suppose P(s,x) is not the shortest path from s to x, i.e., there is a path P’(s,x) with length(P’(s,x)) < length(P(s,x)) • Construct a new path P’(s,g) = P’(s,x) + P(x,g) • Then length(P’(s,g)) < length(P(s,g)), so P(s,g) is not the shortest • Contradiction

Overlapping sub-problems • Some sub-problems are used by many paths (0,0) -> (2,0) used by 3 paths

Memoization and reuse • Easy to tabulate and reuse • Number of sub-problems ~ number of nodes • P(s, x), for x in all nodes except s and g • Find an order such that no sub-problem needs to be recomputed • First compute the smallest sub-problems • Use solutions of small sub-problems to solve large sub-problems
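The tabulation order described above can be sketched in Python as follows. Several things here are assumptions rather than slide content: paths are taken to move only right or down, the edge weights are invented for a small 3x3 grid (the weights in the slide figure are not reproduced), and the names `grid_shortest_path`, `right`, and `down` are hypothetical.

```python
def grid_shortest_path(right, down):
    """Fill a table of shortest distances from node (0,0) to every node.

    right[i][j] = cost of edge (i,j) -> (i,j+1)
    down[i][j]  = cost of edge (i,j) -> (i+1,j)
    Assumes moves go only right or down.
    """
    rows = len(right)
    cols = len(right[0]) + 1
    inf = float("inf")
    dist = [[inf] * cols for _ in range(rows)]
    dist[0][0] = 0
    # Row-by-row order guarantees the upper and left neighbours are done first.
    for i in range(rows):
        for j in range(cols):
            if i > 0:
                dist[i][j] = min(dist[i][j], dist[i - 1][j] + down[i - 1][j])
            if j > 0:
                dist[i][j] = min(dist[i][j], dist[i][j - 1] + right[i][j - 1])
    return dist

# Made-up weights for a 3x3 grid of nodes.
right = [[1, 3], [2, 1], [4, 1]]
down = [[2, 1, 5], [1, 2, 1]]
dist = grid_shortest_path(right, down)
print(dist[-1][-1])  # shortest distance to the bottom-right node
```

Each node is visited once and looks at no more than two neighbours, so the work is proportional to the number of nodes.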

Example: shortest path [figure sequence: the shortest distance from s is filled in node by node, starting with 0 at s and ending with 10 at g]

Analysis • For an n x n grid • Enumeration: • Number of paths = (2n)! / (n!)^2 • Each path has 2n steps • Total operations: 2n * (2n)! / (n!)^2 = O(sqrt(n) * 2^(2n)) • Recursive calls: O(2^(2n)) • DP: O(n^2)

Example: Fibonacci Seq • F(n) = F(n-1) + F(n-2), F(0) = F(1) = 1
function fib(n)
  if (n == 0 or n == 1) return 1;
  else return fib(n-1) + fib(n-2);

Time complexity: O(1.62^n)

Example: Fibonacci Seq
function fib(n)
  F[0] = 1; F[1] = 1;
  for i = 2 to n
    F[i] = F[i-1] + F[i-2];
  end
  return F[n];

Time: O(n), space: O(n)
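For comparison, here is a small Python sketch of both versions, using the slide's convention F(0) = F(1) = 1. The function names `fib_naive` and `fib_table` are made up for illustration.

```python
def fib_naive(n: int) -> int:
    """Direct recursion: recomputes the same sub-problems many times."""
    if n == 0 or n == 1:
        return 1
    return fib_naive(n - 1) + fib_naive(n - 2)

def fib_table(n: int) -> int:
    """Bottom-up tabulation: each F[i] is computed exactly once."""
    F = [1] * (n + 1)  # F[0] = F[1] = 1
    for i in range(2, n + 1):
        F[i] = F[i - 1] + F[i - 2]
    return F[n]

print(fib_table(10), fib_naive(10))
```

Both functions return the same values; only the amount of repeated work differs.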

What if it is not so easy to figure out an order to fill in the table? • Exercise

Today’s lecture • Sequence alignment • Global alignment

Why seq alignment? • Similar sequences often have similar origin or function • Two genes are said to be homologous if they share a common evolutionary history. • Evolutionary history can tell us a lot about properties of a given gene • Homology can be inferred from similarity between the genes • New protein sequences are routinely compared to sequence databases to search for proteins with the same or similar functions • Alignment tools are among the most widely used computational tools in biology

Evolution at the DNA level …ACGGTGCAGTCACCA… …ACGTTGC-GTCCACCA… Sequence edits: mutation (G to T), deletion (A), insertion (C)

Evolutionary Rates [figure: sequence changes passed to the next generation, labeled "OK", "X", or "Still OK?"]

Sequence conservation implies function

Sequence Alignment
S: AGGCTATCACCTGACCTCCAGGCCGATGCCC
T: TAGCTATCACGACCGCGGTCGATTTGCCCGAC
S’: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
T’: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
• Definition • An alignment of two strings S, T is a pair of strings S’, T’ (with spaces) s.t. • |S’| = |T’| (where |S| denotes the length of S), and • removing all spaces from S’, T’ leaves S, T
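The two conditions in this definition translate directly into a check. The sketch below is illustrative only: the name `is_alignment` is hypothetical, and spaces are assumed to be written as "-" as in the example above.

```python
def is_alignment(s: str, t: str, s_aligned: str, t_aligned: str, gap: str = "-") -> bool:
    """Check the two conditions of the definition: equal lengths, and
    removing the gap characters recovers the original strings."""
    return (
        len(s_aligned) == len(t_aligned)
        and s_aligned.replace(gap, "") == s
        and t_aligned.replace(gap, "") == t
    )

S = "AGGCTATCACCTGACCTCCAGGCCGATGCCC"
T = "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"
S_ALIGNED = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
T_ALIGNED = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"
print(is_alignment(S, T, S_ALIGNED, T_ALIGNED))
```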

What is a good alignment? Alignment: The “best” way to match the letters of one sequence with those of the other How do we define “best”?

S’: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
T’: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
• The score of aligning (characters or spaces) x & y is σ(x,y). • Score of an alignment: the sum of σ(S’[i], T’[i]) over all positions i = 1, ..., |S’| • An optimal alignment: one with max score

Scoring Function • Sequence edits, starting from AGGCCTC: • Mutations: AGGACTC • Insertions: AGGGCCTC • Deletions: AGG-CTC • Scoring function: • Match: +m • Mismatch: -s • Gap (indel): -d • Example columns: ~~~AAC~~~ over ~~~A-A~~~ (a match, a gap, and a mismatch)

More complex scoring function • Substitution matrix • The similarity score of matching two letters a, b should reflect the probability that a and b are derived from the same ancestor • It is usually defined by a log-likelihood ratio (Durbin book) • Active research area, especially for proteins • Commonly used: PAM, BLOSUM

An example substitution matrix

Match = 2, mismatch = -1, gap = -1 • Score = 3 x 2 – 2 x 1 – 1 x 1 = 3
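The arithmetic on this slide can be reproduced with a short scoring function. This is a sketch under assumptions: the name `alignment_score` is made up, gaps are written as "-", and the example pair is invented so that it has 3 matches, 2 mismatches, and 1 gap (the alignment shown on the original slide is not reproduced here).

```python
def alignment_score(s_aligned: str, t_aligned: str,
                    match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Sum the per-column scores of an alignment (gaps written as '-')."""
    if len(s_aligned) != len(t_aligned):
        raise ValueError("aligned strings must have the same length")
    score = 0
    for a, b in zip(s_aligned, t_aligned):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

# Invented example: columns are match, match, match, mismatch, mismatch, gap.
print(alignment_score("ACGTAC", "ACGAT-"))  # 3*2 - 2*1 - 1*1
```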

How to find it? • A naïve algorithm:
for all subseqs A of S, B of T s.t. |A| = |B| do
  align A[i] with B[i], 1 ≤ i ≤ |A|
  align all other chars to spaces
  compute its value
  retain the max
end
output the retained alignment
Example: S = abcd, A = cd; T = wxyz, B = xz
  -abc-d      a-bc-d
  w--xyz      -w-xyz

Analysis • Assume |S| = |T| = n • Cost of evaluating one alignment: ≥ n • How many alignments are there: C(2n, n) • pick n chars of S,T together • say k of them are in S • match these k to the k unpicked chars of T • Total time: ≥ n * C(2n, n) > 2^(2n), for n > 3 • E.g., for n = 20, time is > 2^40 > 10^12 operations
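The counting argument can be checked numerically. The sketch below assumes the count implied by the bullets, namely that picking n characters out of the 2n characters of S and T gives C(2n, n) alignments, and that each costs at least n to evaluate. The function name `naive_cost_lower_bound` is hypothetical.

```python
from math import comb

def naive_cost_lower_bound(n: int) -> int:
    """Lower bound n * C(2n, n) on the work done by the naive algorithm."""
    return n * comb(2 * n, n)

n = 20
# Choosing k of the n picked characters from S (and n - k from T), summed
# over k, gives the same count as choosing n out of 2n characters directly.
by_k = sum(comb(n, k) * comb(n, n - k) for k in range(n + 1))
print(by_k == comb(2 * n, n))
print(naive_cost_lower_bound(n), 2 ** 40, 10 ** 12)
```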

Dynamic Programming • We will now describe a dynamic programming algorithm • Suppose we wish to align x1…xM with y1…yN • Let F(i,j) = optimal score of aligning x1…xi with y1…yj

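Dynamic Programming (cont’d) Notice three possible cases for the last column of the alignment:
1. xM aligns to yN (~~~~~~~ xM over ~~~~~~~ yN): F(M,N) = F(M-1, N-1) + m if xM = yN, or F(M-1, N-1) - s if not
2. xM aligns to a gap (~~~~~~~ xM over ~~~~~~~ -): F(M,N) = F(M-1, N) - d
3. yN aligns to a gap (~~~~~~~ - over ~~~~~~~ yN): F(M,N) = F(M, N-1) - d
An optimal alignment must end in one of these cases, so F(M,N) is the maximum of the three values.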
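A table-filling version of this recurrence might look like the following sketch. Two things are assumptions not stated on the slide: the boundary values F(i,0) = -i*d and F(0,j) = -j*d (a prefix aligned entirely against gaps), and the default parameter values m = 2, s = 1, d = 1. The function name `global_alignment_score` is made up for illustration.

```python
def global_alignment_score(x: str, y: str, m: int = 2, s: int = 1, d: int = 1) -> int:
    """Optimal global alignment score with match +m, mismatch -s, gap -d."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    # Assumed boundary: a prefix aligned against nothing is all gaps.
    for i in range(1, M + 1):
        F[i][0] = -i * d
    for j in range(1, N + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            diagonal = F[i - 1][j - 1] + (m if x[i - 1] == y[j - 1] else -s)
            x_to_gap = F[i - 1][j] - d   # x_i aligned to a gap
            y_to_gap = F[i][j - 1] - d   # y_j aligned to a gap
            F[i][j] = max(diagonal, x_to_gap, y_to_gap)
    return F[M][N]

print(global_alignment_score("ACGT", "AGT"))
```

The table has (M+1) x (N+1) entries and each entry takes constant time, so the running time is O(MN), in contrast to the exponential cost of the naive enumeration.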
