“Multiple indexes and multiple alignments” Presenting: Siddharth Jonathan Scribing: Susan Tang


  1. 10/19: “Multiple indexes and multiple alignments”
     Presenting: Siddharth Jonathan
     Scribing: Susan Tang
     DFLW: Neda Nategh
     Upcoming:
     10/24: “Evolution of Multidomain Proteins” (Wissam Kazan); “Human Migrations” (Anjalee Sujanani)
     10/26: “Comparison of Networks Across Species” (Chuan Sheng Foo); “Repetitive DNA Detection and Classification” (Vijay Krishnan)

  2. CS374 Algorithms in Biology
     Searching Biological Sequence Databases
     Siddharth Jonathan
     CS374 Presentation - Searching Biological Sequence Databases

  3. Outline
     • Background
     • Problem
     • Typhon Overview
     • Typhon Components
     • Results

  4. Background
     • Sequence Alignment
     • Multiple Alignment Databases
     • Probabilistic Profile
     • Phylogenetic Tree

  5. Sequence Alignment
     • Identifying regions of similarity in genomes, proteins, etc.
     • Types
         • Global
         • Local
         • Seeded
         • Non-seeded
     • Why is it important?
         • Comparative analysis of genomes
         • Producing phylogenetic trees
         • Understanding newly sequenced genomes

  6. Seeds – A Review
     A seed P is an ordered list of w positions, i.e. P = {x1, x2, …, xw}
     w = weight of P = |P|
     s = span of P = xw – x1 + 1
     Ex: P = {0, 1, 4, 5}, w = 4, s = 5 – 0 + 1 = 6
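
The two definitions on this slide can be checked with a few lines of Python. This is only a sketch: the function names `seed_weight` and `seed_span` are made up for illustration, and a seed is assumed to be given as a collection of integer positions.

```python
def seed_weight(seed):
    """Weight w = |P|: the number of positions the seed samples."""
    return len(seed)


def seed_span(seed):
    """Span s = xw - x1 + 1: distance covered from first to last position."""
    return max(seed) - min(seed) + 1


# The example seed from the slide.
P = [0, 1, 4, 5]
print(seed_weight(P), seed_span(P))
```

A contiguous seed has span equal to its weight; a spaced seed such as {0, 1, 4, 5} has span larger than its weight.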

  7. Indexing in Seeded Local Alignment Algorithms
     Gene sequence S: …GATTACCAGATTACCAGATTA…
     Seed A = {0,1,2,3}
     Index entries: GATT → (S,0), ATTA → (S,1)
     The average number of seeds indexed per position is called the Budget.
     The same idea holds for non-contiguous seeds as well!
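
To make the indexing idea concrete, here is a minimal sketch in Python. It assumes the index is a plain dictionary from the sampled characters to a list of (sequence name, position) pairs. The function name `build_index` and that data layout are illustrative assumptions, not the layout used by BLAST, BLAT or Typhon.

```python
from collections import defaultdict


def build_index(name, sequence, seed):
    """Index every position of `sequence` under one seed.

    `seed` is a sorted tuple of offsets starting at 0, e.g. (0, 1, 2, 3)
    for a contiguous seed or (0, 1, 4, 5) for a spaced seed.
    """
    span = seed[-1] - seed[0] + 1
    index = defaultdict(list)
    for pos in range(len(sequence) - span + 1):
        # Read only the characters at the seed's offsets.
        key = "".join(sequence[pos + offset] for offset in seed)
        index[key].append((name, pos))
    return index


S = "GATTACCAGATTACCAGATTA"
contiguous = build_index("S", S, (0, 1, 2, 3))
spaced = build_index("S", S, (0, 1, 4, 5))
print(contiguous["GATT"])
print(spaced["GAAC"])
```

Indexing one seed at every position, as here, corresponds to a budget of 1 seed per position. Indexing both seeds everywhere would double it.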

  8. Seeded Local Alignment Algorithms
     • BLAST
     • BLAT
     • BLASTZ
     • Exonerate
     • Usage of multiple seeds, spaced seeds
     • What do they have in common?
     • Indexing!

  9. Multiple Alignment
     [Figure labels: Species 1, Species 2]

  10. Phylogenetic Tree

  11. Probabilistic Profile
      Each cell corresponds to one position in the alignment… We’ll learn what information it carries very shortly!

  12. Regions

  13. The Problem
      Say we have a database of multiple alignments.
      [Figure label: Candidate seeds]
      Find local alignments for the query.
      So what’s the challenge?

  14. The Problem Statement
      [Figure label: Budget]
      Can we do better?
      Make use of the information implicit in the multiple alignment to select which seeds to index at a given position.

  15. The Problem Statement - Typhon
      Given: Budget, Candidate Seeds, Probabilistic Profile
      Indexing scheme that indexes only a subset of the candidate seeds at each position

  16. Overall Architecture of Typhon

  17. Step 1: Probabilistic Profile Construction
      • A 6-tuple for each position in the multiple alignment:
      • PPresent: existence probability
      • PA, PC, PT, PG: conditional probability that the homologous position contains A, C, T or G, given that a homologous position exists. The nucleotide with the highest such value is called the consensus character.
      • Pid: probability that the corresponding query position has the consensus character

  18. Calculation of Probabilistic Profile
      [Figure: phylogenetic tree over Human, Chimp, Rat and Pig; alignment rows as extracted from the slide]
      Human  1 A T C
      Chimp  _ A 1 C
      Rat    1 A T C
      Pig    1 C T C
      For the column A, A, A, C: PPresent=100% PA=75% PC=25% PG=0% PT=0%
      Propagation of values up the tree to the root is a tricky problem!
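
The percentages on this slide match plain column counts, which is easy to sketch in Python. This is a simplification that ignores the tree entirely: the function name `column_profile` and the use of "-" as the gap symbol are assumptions made for illustration.

```python
def column_profile(column, gap="-"):
    """Naive profile for one alignment column, ignoring the phylogeny.

    Returns (p_present, probs, consensus) where probs maps each
    nucleotide to its frequency among the non-gap characters.
    """
    present = [c for c in column if c != gap]
    p_present = len(present) / len(column)
    probs = {
        n: (present.count(n) / len(present) if present else 0.0)
        for n in "ACGT"
    }
    consensus = max(probs, key=probs.get)
    return p_present, probs, consensus


# Human, Chimp, Rat, Pig at the column discussed on the slide.
print(column_profile(["A", "A", "A", "C"]))
```

Counting treats all four species as equally informative. The tree-based calculation weights them by evolutionary distance instead.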

  19. Calculating the Probabilistic Profile
      • PPresent and PN are calculated independently
      • PPresent: weighted average of the children’s PPresent values
      • Weights are proportional to the inverse of the branch length
      • PN is calculated through Felsenstein’s algorithm with a Kimura matrix
      • Pid = max(PN) (this is calculated at the root)
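
The PPresent rule can be sketched as a short recursion. The tree encoding below (nested dictionaries with `children` and `present` keys), the function name `p_present`, and the branch lengths are all hypothetical. Leaves are assumed to have PPresent 1 when the species has a character at this position and 0 when it has a gap. The PN part (Felsenstein’s algorithm) is not covered here.

```python
def p_present(node):
    """PPresent of a node: inverse-branch-length weighted average of children.

    A leaf is {"present": bool}; an internal node is
    {"children": [(child_node, branch_length), ...]}.
    """
    if "children" not in node:
        return 1.0 if node["present"] else 0.0
    weights = [1.0 / length for _, length in node["children"]]
    values = [p_present(child) for child, _ in node["children"]]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical two-leaf tree: a close species with a character,
# and a species three times farther away with a gap.
root = {"children": [({"present": True}, 1.0), ({"present": False}, 3.0)]}
print(p_present(root))
```

Because the weights are inverse branch lengths, the closer child dominates the average.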

  20. Overall Architecture of Typhon

  21. Region Decomposition
      ATTGGAACCCAGGCCA----AATT-GCGCC-----AA-TT------G----C-----ATGG-G-----ATGCCCAAAAAAT
      ATTGGAACTCAGGCCA----AATT--CGCC-----AA-T-------G----C-----AT--G------ATGCCCATAAAAT
      ATTGGAACCCAGGCCA----AATT-CG--C-----A-TT-------G----T-----A-GGG------ATGCCCAAAAAAT
      ATTGGAACCCAGGCCA----A-TTGC-G-C-----AAT-T------G-----C----ATGGGG-----ATGCCCATAAAAT
      Region classes along the alignment: 1 2 3 2 1
      Each region is characterized by a PPresent and a Pid.
      How do we come up with these regions?

  22. Hidden Markov Models (HMM)
      Given an observation sequence, predict the sequence of hidden states.

  23. Region Decomposition – Simple Method
      • Come up with a set of region classes (states)
      • Construct an HMM
      • Looking at the observation sequence, try to determine the most likely parse
      • Viterbi algorithm
      • Problem: the classes need to be determined at the beginning
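
For reference, a textbook Viterbi decoder fits in a few lines of Python. The sketch below is generic, not Typhon’s code. The two states ("conserved", "gappy"), the "hi"/"lo" observations (imagine a thresholded PPresent) and every probability are invented for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path (log space; all probabilities must be > 0)."""
    # scores[s] = best log-probability of any path ending in state s.
    scores = {
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }
    back = []  # back[t][s] = best predecessor of s at step t + 1
    for obs in observations[1:]:
        prev, scores, pointers = scores, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            scores[s] = (prev[best] + math.log(trans_p[best][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


states = ["conserved", "gappy"]
start_p = {"conserved": 0.5, "gappy": 0.5}
trans_p = {"conserved": {"conserved": 0.9, "gappy": 0.1},
           "gappy": {"conserved": 0.1, "gappy": 0.9}}
emit_p = {"conserved": {"hi": 0.9, "lo": 0.1},
          "gappy": {"hi": 0.2, "lo": 0.8}}
print(viterbi(["hi", "hi", "lo", "lo", "lo", "hi", "hi"],
              states, start_p, trans_p, emit_p))
```

Consecutive positions assigned the same state form one region, which is how a parse turns into a region decomposition.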

  24. Alternative
      • Split the profile into 2 classes at a time
      • Use a 2 stage HMM
      • Stop when the bound on the number of region classes is reached

  25. Region Decomposition with HMM

  26. Overall Architecture of Typhon

  27. Step 3: Seed Indexing
      What are we trying to do?
      [Figure: regions of classes 1, 2, 1, 3, each labelled with seeds drawn from the candidate seeds A, B, C, D, E]

  28. The Goal
      • Maximize the expected number of regions matched to a homologue

  29. Seed Assignment
      • 2 approaches:
      • General Method
      • Greedy Approximation

  30. General Method - Terminology
      [Table axes: size of the candidate set and region classes; each cell is Object[i][j]]

  31. Calculation of the number of matching regions (done for each cell in the previous table)
      Expected number of regions matched = |C| × PPresent × Phit
      • |C|: number of regions
      • PPresent: probability that a region matches a homologue
      • Phit: conditional probability that the seeds match the region and its homologue, given that it exists

  32. General Method - Explained
      [Table: rows are Region Classes 1 to 4; columns are the number of candidate seeds, 1 to 5]

  33. Some Terminology
      • Weight
          • Total length of all regions in a region class * # of seeds indexed at each position
          • Sort of like the Budget for a region
      • Value
          • Expected number of regions matched (previous calculation)
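
As a small worked example, the weight and value of one table cell can be computed as follows. The function names and all input numbers are hypothetical. The value formula is the product of the number of regions, PPresent and Phit described on the earlier slide.

```python
def object_weight(total_region_length, seeds_per_position):
    """Budget consumed: every position of the class indexes this many seeds."""
    return total_region_length * seeds_per_position


def object_value(num_regions, p_present, p_hit):
    """Expected number of regions matched: |C| * PPresent * Phit."""
    return num_regions * p_present * p_hit


# Hypothetical region class: 10 regions covering 200 positions in total,
# indexed with 2 seeds per position.
weight = object_weight(200, 2)
value = object_value(10, 0.8, 0.5)
print(weight, value, value / weight)
```

The last printed number is the value per unit of budget for this cell.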

  34. Solving the Seed Assignment Problem
      [Table: rows are Region Classes 1 to 4; columns are the number of candidate seeds, 1 to 5]

  35. Solving the Seed Assignment Problem
      [Table: rows are Region Classes 1 to 4; columns are the number of candidate seeds, 1 to 5]

  36. Solving the Seed Assignment Problem (Budget = 112)
      [Table: rows are Region Classes 1 to 4; columns are the number of candidate seeds, 1 to 5]

  37. Looks Familiar?
      • Closely related to the Knapsack Problem, a well-studied problem in Computer Science
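
One way to read the Knapsack connection is as a multiple-choice knapsack: pick at most one object per region class so that total weight stays within the budget and total value is maximal. The dynamic program below is a sketch under that reading, with integer weights assumed. It is not taken from the slides, and the name `solve_seed_assignment` and the example numbers are made up.

```python
def solve_seed_assignment(options, budget):
    """Multiple-choice knapsack by dynamic programming.

    options[c] is a list of (weight, value) objects for region class c;
    at most one object per class may be chosen. Weights are non-negative
    integers. Returns (best_value, picks) where picks[c] is the chosen
    option index for class c, or None if the class gets nothing.
    """
    # best[b] = best value using the classes seen so far with weight <= b.
    best = [0.0] * (budget + 1)
    picks = [[None] * len(options) for _ in range(budget + 1)]
    for c, class_options in enumerate(options):
        new_best = best[:]                      # default: skip class c
        new_picks = [p[:] for p in picks]
        for k, (weight, value) in enumerate(class_options):
            for b in range(weight, budget + 1):
                candidate = best[b - weight] + value
                if candidate > new_best[b]:
                    new_best[b] = candidate
                    new_picks[b] = picks[b - weight][:]
                    new_picks[b][c] = k
        best, picks = new_best, new_picks
    return best[budget], picks[budget]


# Two hypothetical region classes, two objects each, budget of 7.
options = [[(2, 3.0), (4, 5.0)], [(3, 4.0), (6, 7.0)]]
print(solve_seed_assignment(options, 7))
```

The table has one entry per budget unit, so running time and memory grow with the budget, which is the usual motivation for a faster approximation.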

  38. Approximate Solution
      • Faster
      • Space efficient
      • New terminology:
      • Density of an object = Value/Weight

  39. Approximate Solution – General Intuition
      • Select objects in order of decreasing density
      • Disallow more than one object per row

  40. Approximate Method in Action
      Candidate set:
      • Object[1,1]: Density = V/W = 3
      • Object[2,1]: Density = V/W = 2
      • Object[3,1]: Density = V/W = 5
      • Object[3,2]: Density = V/W = 6
      • Object[4,1]: Density = V/W = 4
      What are the new values of Weight, Value and Density?
      • Value = additional number of regions matched
      • Weight = amount of budget used by this one seed
      And keep track of the Budget!
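
A possible reading of the greedy method, sketched in Python: keep one candidate object per region class (its next seed), repeatedly take the densest candidate that still fits in the budget, and replace it with that class’s next seed, whose value and weight are incremental. That replacement rule is an interpretation of the slide, not something it states outright. The function name and the weights of 10 are invented; the densities 3, 2, 5, 6, 4 are the ones on the slide.

```python
import heapq


def greedy_seed_assignment(increments, budget):
    """Greedy approximation by decreasing density (value / weight).

    increments[c] lists (weight, value) for the 1st, 2nd, ... seed added
    to region class c; weights must be positive. Returns the number of
    seeds given to each class and the budget actually used.
    """
    seeds = [0] * len(increments)
    heap = []  # max-heap on density via negated keys; one entry per class
    for c, incs in enumerate(increments):
        if incs:
            weight, value = incs[0]
            heapq.heappush(heap, (-value / weight, c))
    remaining = budget
    while heap:
        _, c = heapq.heappop(heap)
        weight, value = increments[c][seeds[c]]
        if weight > remaining:
            continue  # this class cannot afford its next seed; drop it
        remaining -= weight
        seeds[c] += 1
        if seeds[c] < len(increments[c]):
            next_weight, next_value = increments[c][seeds[c]]
            heapq.heappush(heap, (-next_value / next_weight, c))
    return seeds, budget - remaining


# Densities 3, 2, (5 then 6), 4 as on the slide; every weight is 10.
increments = [[(10, 30.0)], [(10, 20.0)], [(10, 50.0), (10, 60.0)], [(10, 40.0)]]
print(greedy_seed_assignment(increments, 30))
```

With a budget of 30 the third class receives two seeds and the fourth class one, after which no remaining candidate fits.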

  41. Results
      • Considerations
      • Sensitivity
      • Speed
      • Space

  42. Sensitivity Results
      • Experimental setup
      • Detection of Hypothetical Homologous Alignments (HHA)
      • Typhon vs. Standard

  43. Sensitivity Comparison

  44. Effect of Multiple Alignment on Sensitivity

  45. Running Time Comparison
      • Time spent building the index: Typhon takes longer
      • Time spent scanning the index: Typhon is 3-4 times slower at run time, which is reasonable

  46. Scanning Time

  47. Conclusion
      • Information implicit in multiple alignments helps search sensitivity
      • Variable allocation of seeds by region class helps (Typhon)
      • Space and time complexities of Typhon are comparable to STANDARD
      • Most effective for queries far from each species in the alignment

  48. Questions?

  49. Acknowledgements
      • Serafim Batzoglou, George Asimenos, Jason Flannick
