
BIC I, Week 4 lectures

BIC I, Week 4 lectures. Rhys Price Jones and Anne Haake Rochester Institute of Technology rpjavp@rit.edu , arh@it.rit.edu. Overview of the need for Dynamic Programming. Consider Fibonacci


Presentation Transcript


  1. BIC I, Week 4 lectures Rhys Price Jones and Anne Haake Rochester Institute of Technology rpjavp@rit.edu, arh@it.rit.edu

  2. Overview of the need for Dynamic Programming • Consider Fibonacci • The obvious algorithm is elegant, easily derived from the definition, and clearly correct:
      (define fib
        (lambda (n)
          (if (<= n 1)
              1
              (+ (fib (- n 2)) (fib (- n 1))))))
      • But it’s hopelessly inefficient • Why? • Because it makes repeated recursive calls with the same argument, so the number of calls grows exponentially with n
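To see the repeated calls concretely, here is a minimal sketch in Python (not the course's Scheme) that keeps the slide's base case fib(0) = fib(1) = 1 and simply counts how often the function is entered. The name `fib_counting` and the one-element list used as a counter are illustrative choices, not part of the lecture code.

```python
def fib_counting(n, counter):
    """Naive Fibonacci with the slide's base case fib(0) = fib(1) = 1.

    counter is a one-element list used to tally how many calls are made.
    """
    counter[0] += 1
    if n <= 1:
        return 1
    return fib_counting(n - 2, counter) + fib_counting(n - 1, counter)


# Print n, fib(n) and the number of calls needed to compute it.
for n in (10, 20, 25):
    calls = [0]
    value = fib_counting(n, calls)
    print(n, value, calls[0])
```

For n = 20 the value is 10946 and the tally is 21891 calls, roughly twice the result itself, which is why the naive version slows down so quickly.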

  3. The Traditional Solution • Change the order in which the computations are performed • Change the logic of the program • So that it works “bottom up” instead of “top down” • Fill an array with calculated values starting with (fib 0), then (fib 1) then (fib 2), etc. • You can do it manually, as in fib.ss • That is dynamic programming! • The main problem is that it requires thought and programming and hence may introduce error.
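The bottom-up idea can be sketched as follows, in Python rather than the course's Scheme. This is not the contents of fib.ss, only an illustration under the same base case fib(0) = fib(1) = 1; the name `fib_bottom_up` is hypothetical.

```python
def fib_bottom_up(n):
    """Fill a table from fib(0) upwards so each value is computed once."""
    # Every entry starts at 1, which already covers fib(0) and fib(1).
    table = [1] * (n + 1)
    for i in range(2, n + 1):
        # Both smaller values are already in the table.
        table[i] = table[i - 2] + table[i - 1]
    return table[n]


print(fib_bottom_up(30))
```

The loop does n - 1 additions instead of an exponential number of calls, at the price of having rewritten the logic by hand.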

  4. It’s not just Fibonacci • Many programs “write themselves” from the specification of the problem. • When that happens, we are extremely pleased • Sadly, the resulting program is often inefficient • But dynamic programming is a technique to make it efficient again.

  5. Memo-izing • Redefine the function calling mechanism so that: • We first check to see if we’ve made that calculation before • If no, go ahead and compute it but store the result in a hash table • If yes, look up the previously computed value in the hash table • Do it once • Inefficient code becomes efficient automatically with no re-programming
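The mechanism described above can be sketched in Python with a dictionary as the hash table. The wrapper name `memoize` is hypothetical and this is not the course's own implementation; it only shows the check, compute, store, and look-up steps from the slide.

```python
def memoize(f):
    """Wrap f so that each distinct argument tuple is computed only once."""
    table = {}  # hash table: arguments -> previously computed result

    def wrapper(*args):
        if args not in table:        # not seen before: compute and store
            table[args] = f(*args)
        return table[args]           # seen before: plain look-up

    return wrapper


@memoize
def fib(n):
    # Same top-down definition as the naive version; only the wrapper is new.
    return 1 if n <= 1 else fib(n - 2) + fib(n - 1)


print(fib(30))
```

Because the name `fib` is rebound to the wrapper, the recursive calls inside the body also go through the table, which is what makes the unchanged definition efficient.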

  6. Another Example • Pascal’s triangle • Each entry is the sum of its parents • $C_{n,k} = C_{n-1,k-1} + C_{n-1,k}$ • $C_{n,0} = C_{n,n} = 1$ • Leading to program • Runs really slowly • Replace lambda by memolambda
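A hedged Python sketch of the same idea, assuming the usual boundary cases C(n,0) = C(n,n) = 1 and 0 <= k <= n. It uses the standard library's `functools.lru_cache` as a stand-in for the course's memolambda; the function name `c` is illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # remember every (n, k) pair already computed
def c(n, k):
    """Entry k of row n of Pascal's triangle, for 0 <= k <= n."""
    if k == 0 or k == n:
        return 1
    # Each entry is the sum of its two parents in the row above.
    return c(n - 1, k - 1) + c(n - 1, k)


print([c(4, k) for k in range(5)])
print(c(30, 15))
```

Without the decorator the recursion recomputes the same entries many times over; with it, each of the roughly n*k distinct entries is computed once.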

  7. Review of Pattern Matching • Does CGGA appear within the sequence ATCGCGTAACGGAGATAGGCTTA? • More generally, where does pattern p (length n) appear within text t (length m)? • Boyer-Moore or Knuth-Morris-Pratt gives O(m+n) search • If p is going to change a lot and t stays the same, a suffix tree can be built in O(m), and each search is then O(n) • If p is stable and there are lots of different t, a virtual machine can be built in O(n), and then each search is O(m)
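For reference, the question on this slide can be answered with the naive algorithm, which tries every starting position and can take O(mn) time in the worst case. A minimal Python sketch (the name `naive_find` is hypothetical):

```python
def naive_find(p, t):
    """Return every 0-based position where pattern p starts in text t."""
    n = len(p)
    # Compare p against each length-n window of t in turn.
    return [i for i in range(len(t) - n + 1) if t[i:i + n] == p]


print(naive_find("CGGA", "ATCGCGTAACGGAGATAGGCTTA"))
```

Here CGGA occurs once, starting at 0-based position 9.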

  8. Build a Virtual Machine • Pattern: CGGA • Text: ATCGCGTAACGGAGATAGGCTTA • [Diagram: state machine for the pattern CGGA; edge labels ACGT, C, AGT, C, G, G, A, C, C, AT, AT, GT]

  9–25. First Step through 23rd Step • [Animation frames: the same diagram and text are repeated for each of the 23 steps over the 23-character text]
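One way such a machine could be built is sketched below in Python. This is the standard automaton construction for a single pattern, not necessarily the one drawn on the slides; the names `build_matcher` and `find_all` are hypothetical. It assumes a non-empty pattern and a text over the alphabet ACGT. Unlike a machine that simply stays in its final state, this sketch keeps searching after a hit so that overlapping occurrences are reported.

```python
def build_matcher(pattern, alphabet="ACGT"):
    """Build a transition table: delta[state][ch] -> next state.

    A state is the number of pattern characters matched so far.
    """
    n = len(pattern)
    delta = [{ch: 0 for ch in alphabet} for _ in range(n + 1)]
    delta[0][pattern[0]] = 1
    x = 0  # state the machine would be in had it started one character later
    for state in range(1, n + 1):
        for ch in alphabet:
            delta[state][ch] = delta[x][ch]          # on a mismatch, fall back
        if state < n:
            delta[state][pattern[state]] = state + 1  # on a match, advance
            x = delta[x][pattern[state]]
    return delta


def find_all(pattern, text):
    """Run the machine over text once; return 0-based start positions."""
    delta = build_matcher(pattern)
    state, hits = 0, []
    for i, ch in enumerate(text):
        state = delta[state][ch]
        if state == len(pattern):
            hits.append(i - len(pattern) + 1)
    return hits


print(find_all("CGGA", "ATCGCGTAACGGAGATAGGCTTA"))
```

Building the table takes time proportional to n times the alphabet size (4 here), and each search then reads every character of the text exactly once.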

  26. Pattern Matching – Conclusion • Exact pattern matching is easy • Often the naive algorithm is good enough • Fast algorithms are readily available • Sadly, not much use for biological tasks

  27. Why not? • What’s the difference? • Mutation • Insertion/deletion gaps • We need an inexact way to compare two (or more) biological sequences

  28. Pattern Matching vs. Sequence Alignment • In the CS world, we talk of comparing strings, or matching patterns of characters within strings • For biological applications, we talk of comparing sequences, or aligning sequences of nucleotides (or amino acids) to each other

  29. Evolutionary Relatedness • Consider ACCGT and CACGT • How likely is it that they are “related”? • Possible alignments:
        ACCGT      AC-CGT
        XX|||      -|-|||
        CACGT      -CACGT
      • Which is better?

  30. It Depends
        ACCGT      AC-CGT
        XX|||      -|-|||
        CACGT      -CACGT
      • Scoring 2 for a match, -2 for a mismatch, and -1 for a gap: 2 versus 6 • Scoring 2 for a match, 0 for a mismatch, and -2 for a gap: 6 versus 4 • And we haven’t even begun to consider experimental evidence that might cause us to rank some mutations better than others!

  31. Distance measure • Score 0 for a match • 1 for a mismatch or gap • Low score best!
        ACCGT      AC-CGT
        XX|||      -|-|||
        CACGT      -CACGT
      • Now it’s 2 versus 2
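The three scoring schemes above can be checked mechanically. The following Python sketch scores an alignment that has already been written out, with "-" marking a gap; the function name `score_alignment` and its parameter names are illustrative.

```python
def score_alignment(top, bottom, match, mismatch, gap):
    """Sum the column scores of two equal-length aligned strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # one side is a gap
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # different characters
    return total


alignments = [("ACCGT", "CACGT"), ("AC-CGT", "-CACGT")]
schemes = [(2, -2, -1), (2, 0, -2), (0, 1, 1)]  # (match, mismatch, gap)
for scheme in schemes:
    print(scheme, [score_alignment(a, b, *scheme) for a, b in alignments])
```

The three schemes give 2 versus 6, 6 versus 4, and 2 versus 2, so which alignment looks "better" really does depend on the scoring.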

  32. Global alignment • For two sequences, ACCACC (columns) and ACACA (rows):
           -  A  C  C  A  C  C
        -
        A
        C
        A
        C
        A
      • Use the scoring scheme to fill in the table, starting with first row and first column

  33. First entries • Using the distance measure:
           -  A  C  C  A  C  C
        -  0  1  2  3  4  5  6
        A  1
        C  2
        A  3
        C  4
        A  5
      • Each nucleotide<->gap costs 1 point

  34. Extending inwards • Extending the distance measure:
           -  A  C  C  A  C  C
        -  0  1  2  3  4  5  6
        A  1  0  1  2  3  4  5
        C  2  1
        A  3  2
        C  4  3
        A  5  4
      • Extending from North or West costs 1 point, from NW costs 0 (match) or 1 (mismatch) • Pick cheapest of the three

  35. More extension
           -  A  C  C  A  C  C
        -  0  1  2  3  4  5  6
        A  1  0  1  2  3  4  5
        C  2  1  0  1  2  3  4
        A  3  2  1  1
        C  4  3  2  1
        A  5  4  3  2
      • $m_{i,j} = \min(m_{i,j-1} + g,\; m_{i-1,j} + g,\; m_{i-1,j-1} + c_{ij})$ • where $g$ is the gap cost and $c_{ij}$ = 0 for a match, 1 for a mismatch
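As a worked instance of the recurrence, take the cell in row 3 (A) and column 3 (C) with gap cost g = 1. Its West and North neighbours both hold 1, its NW neighbour holds 0, and A against C is a mismatch, so:

```latex
\begin{aligned}
m_{3,3} &= \min\bigl(m_{3,2} + g,\; m_{2,3} + g,\; m_{2,2} + c_{3,3}\bigr) \\
        &= \min(1 + 1,\; 1 + 1,\; 0 + 1) = 1
\end{aligned}
```

This agrees with the value 1 shown in that cell of the table.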

  36. Getting there...
           -  A  C  C  A  C  C
        -  0  1  2  3  4  5  6
        A  1  0  1  2  3  4  5
        C  2  1  0  1  2  3  4
        A  3  2  1  1  1  2  3
        C  4  3  2  1  2
        A  5  4  3  2  1
      • $m_{i,j} = \min(m_{i,j-1} + 1,\; m_{i-1,j} + 1,\; m_{i-1,j-1} + c_{ij})$ • where $c_{ij}$ = 0 for a match, 1 for a mismatch

  37. Almost done...
           -  A  C  C  A  C  C
        -  0  1  2  3  4  5  6
        A  1  0  1  2  3  4  5
        C  2  1  0  1  2  3  4
        A  3  2  1  1  1  2  3
        C  4  3  2  1  2  1  2
        A  5  4  3  2  1  2
      • $m_{i,j} = \min(m_{i,j-1} + 1,\; m_{i-1,j} + 1,\; m_{i-1,j-1} + c_{ij})$ • where $c_{ij}$ = 0 for a match, 1 for a mismatch

  38. Finally, we can get a Global alignment • One of the least-cost routes:
           -  A  C  C  A  C  C
        -  0  1  2  3  4  5  6
        A  1  0  1  2  3  4  5
        C  2  1  0  1  2  3  4
        A  3  2  1  1  1  2  3
        C  4  3  2  1  2  1  2
        A  5  4  3  2  1  2  2
      • Can you see how this path leads to the alignment below?
        ACCACC
        AC-ACA
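The table-filling procedure of the last few slides can be sketched in Python as below. This is an illustration under the distance measure (0 for a match, 1 for a mismatch or gap), not the course's program; the name `global_distance_table` is hypothetical, and the backtracking step is deliberately left out.

```python
def global_distance_table(rows_seq, cols_seq, gap=1):
    """Fill the global-alignment table; m[i][j] is the cheapest cost of
    aligning the first i letters of rows_seq with the first j of cols_seq."""
    n_rows, n_cols = len(rows_seq) + 1, len(cols_seq) + 1
    m = [[0] * n_cols for _ in range(n_rows)]
    for i in range(1, n_rows):
        m[i][0] = i * gap               # first column: letters against gaps
    for j in range(1, n_cols):
        m[0][j] = j * gap               # first row: gaps against letters
    for i in range(1, n_rows):
        for j in range(1, n_cols):
            c = 0 if rows_seq[i - 1] == cols_seq[j - 1] else 1
            m[i][j] = min(m[i][j - 1] + gap,    # from the West
                          m[i - 1][j] + gap,    # from the North
                          m[i - 1][j - 1] + c)  # from the NW
    return m


table = global_distance_table("ACACA", "ACCACC")
for label, row in zip("-ACACA", table):
    print(label, *row)
```

The bottom-right entry, 2, is the cost of the best global alignment; filling the table takes one constant-time step per cell.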

  39. Global alignment program • Distance measure • Runnable program • Dynamic Programming version

  40. Global vs Local Alignment • Global alignment seeks the best alignment between one complete sequence and the other complete sequence. A global alignment between GATCCACCA and GTAACACA might be
        G-ATCCACCA
        |-|X|-||-|
        GTAAC-AC-A
      • A local alignment is the best alignment between subsequences. A local alignment between GATCCACCA and GTAACACA might be
         gATCCACca
          |X|-||
        gtAAC-ACa
      • Best local alignment depends on scoring scheme

  41. Local Alignment • For this demo, we will use a different measure • 2 for a match • -1 for a mismatch, -2 for a gap • Find best match within:
        G C T C T G C G A A T A G
        C G T T G A G A T A C T C

  42. The solution
           -  G  C  T  C  T  G  C  G  A  A  T  A  G
        -  0  0  0  0  0  0  0  0  0  0  0  0  0  0
        C  0  0  2  0  2  0  0  2  0  0  0  0  0  0
        G  0  2  0  1  0  1  2  0  4  2  0  0  0  2
        T  0  0  1  2  0  2  0  1  2  3  1  2  0  0
        T  0  0  0  3  1  2  1  0  0  1  2  3  1  0
        G  0  2  0  1  2  0  4  2  2  0  0  1  2  3
        A  0  0  1  0  0  1  2  3  1  4  2  0  3  1
        G  0  2  0  0  0  0  3  1  5  3  3  1  1  5
        A  0  0  1  0  0  0  1  2  3  7  5  3  3  3
        T  0  0  0  3  1  2  0  0  1  5  6  7  5  3
        A  0  0  0  1  2  0  1  0  0  3  7  5  9  7
        C  0  0  2  0  3  1  0  3  1  1  5  6  7  8
        T  0  0  0  4  2  5  3  1  2  0  3  7  5  6
        C  0  0  2  2  6  4  4  5  3  1  1  5  6  4
      • The resulting local alignment:
        G C T C T G C G A A T A G
                | | x | | X | |
          C G T T G A G A - T A C T C

  43. The Program • Uses dynamic programming to make it fast! • This is basically Smith-Waterman • Work has been done on different scoring schemes, gap penalties, etc. • Runs in time O(mn), where m and n are the lengths of the two sequences
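A minimal Python sketch of the table-filling part of such a program, assuming the demo's scores (2 for a match, -1 for a mismatch, -2 for a gap) and the usual local-alignment rule that no cell drops below 0. It is an illustration, not the course's program; the name `local_alignment_table` is hypothetical, and the backtracking is again omitted.

```python
def local_alignment_table(rows_seq, cols_seq, match=2, mismatch=-1, gap=-2):
    """Fill the local-alignment table.

    Returns (table, best score, (row, column) of the best score).
    """
    n_rows, n_cols = len(rows_seq) + 1, len(cols_seq) + 1
    h = [[0] * n_cols for _ in range(n_rows)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n_rows):
        for j in range(1, n_cols):
            same = rows_seq[i - 1] == cols_seq[j - 1]
            diag = h[i - 1][j - 1] + (match if same else mismatch)
            # A local alignment may start anywhere, so never go below 0.
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    return h, best, best_pos


h, best, pos = local_alignment_table("CGTTGAGATACTC", "GCTCTGCGAATAG")
print(best, pos)
```

For the demo sequences the best score is 9, found at row 10, column 12. The two nested loops visit each of the (m+1)(n+1) cells once, which is where the O(mn) running time comes from.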

  44. Exercises • that we will attempt in class: • amend global alignment program to do the “backtracking” needed for the alignment • that will be homework • amend local alignment program to do the “backtracking” needed for the alignment
