Indexed Alignment

Indexed Alignment: Tricks of the Trade. Ross David Bayer, 18th October, 2005. Note: many diagrams taken from Serafim’s CS 262 class.

Presentation Transcript


  1. Indexed Alignment Tricks of the Trade Ross David Bayer 18th October, 2005 Note: many diagrams taken from Serafim’s CS 262 class

  2. Roadmap • Background Recap • Simple Tricks of the Trade • Wildcards • Multiple Words • State of the Art • Seed Patterns • Optimizing Seeds • Multiple Simultaneous Seeds

  3. Status Check • Background Recap • Simple Tricks of the Trade • Wildcards • Multiple Words • State of the Art • Seed Patterns • Optimizing Seeds • Multiple Simultaneous Seeds

  4. Motivation • We have a newly discovered gene: • Does it occur in other species? • How fast does it evolve? • We want to “find” this gene in other species • But there will be mutations

  5. Sequence Alignment
     Sequences:
     AGGCTATCACCTGACCTCCAGGCCGATGCCC
     TAGCTATCACGACCGCGGTCGATTTGCCCGAC
     Alignment:
     -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
     TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

  6. Global Alignment • Needleman-Wunsch (Dynamic Programming) • Sequence of length M: AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA • Sequence of length N: AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC • Running Time: O(MN)
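
A minimal sketch of the Needleman-Wunsch recurrence, for illustration only. The scoring values (match +1, mismatch -1, linear gap -1) and the function name are assumptions; the slides do not state a scoring scheme. It returns only the optimal score, filling an (M+1) × (N+1) table, which is where the O(MN) running time comes from.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    m, n = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, n + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    return score[m][n]


if __name__ == "__main__":
    print(needleman_wunsch("AGTGCCCTGG", "AGTGACCTGGG"))
```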

  7. Local Alignment • Smith-Waterman • Sequence of length M: AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA • Sequence of length N: AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC • Modifications: store 0 instead of negative values; search entire table for maximum • Running Time: O(MN)
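
The two Smith-Waterman modifications can be sketched as follows. This is an illustrative sketch with assumed scoring values (match +1, mismatch -1, linear gap -1) and an assumed function name; it clamps every cell at 0 and tracks the maximum over the whole table instead of reading the bottom-right corner.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best local alignment score between a and b."""
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]   # first row/column stay 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # Modification 1: store 0 instead of a negative value
            score[i][j] = max(0, diag, up, left)
            # Modification 2: the answer is the maximum over the entire table
            best = max(best, score[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTTT", "GGACGTGG"))
```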

  8. Alignment Applications • We have our newly discovered gene: • Does it occur in other species? • How fast does it evolve?

  9. Complete genomes today • About 300 complete genomes have been sequenced

  10. GenBank Growth • Exponential growth in total sequence data • Recently exceeded 100 Gbp ($10^{11}$ base pairs)

  11. More DNA is coming …

  12. Alignment Applications • We have our newly discovered gene: • Does it occur in other species? • How fast does it evolve? • Assume we try Smith-Waterman: • The entire genomic database: $10^{11}$ bp • Our new gene: $10^{4}$ bp • Dynamic programming table: $10^{11} \times 10^{4} = 10^{15}$ cells

  13. Indexed Alignment (BLAST: Basic Local Alignment Search Tool) • Main idea: • Construct a dictionary of all words in the query • Initiate a local alignment for each word match between query and DB • Running Time: O(MN) in worst case • However, in practice orders of magnitude faster than Smith-Waterman

  14. BLAST Step 1 (Basic): Construct dictionary of query words • Query indexed by all words of size k (k = 3 in our examples; k ≈ 11 is typical) • Query: AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCG… • INDEX: AGG, GGC, GCT, CTA, TAT, ATC, TCA, CAC, ACC, CCT, …
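
A minimal sketch of the dictionary construction, assuming a plain hash map from each k-letter word to the list of query positions where it starts. The name `build_index` is hypothetical, not taken from BLAST itself.

```python
from collections import defaultdict


def build_index(query, k):
    """Map every word of size k in the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


if __name__ == "__main__":
    index = build_index("AGGCTATCACCTGACCTCCAGGCCG", 3)
    for word, positions in index.items():
        print(word, positions)
```

A word that occurs more than once in the query simply gets several positions in its list, so a later lookup returns all occurrences at once.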

  15. BLAST Step 1 (Advanced): Relative Generation • For each query word, generate all relatives • A relative is a word with alignment score ≥ T • Index entries for all relatives are updated to point to the query word's location • Query: AGGCTATCACCTGACCTCCAGGCCG… • Query word: GGC • Threshold: T = 28 • Relatives: GGC 30, AGC 28, GAC 28, AAC 26, GGT 25, GGA 24, ...

  16. BLAST Step 2: Searching • Search through database linearly, one word at a time • Initiate alignment with all occurrences of that word in query • Genomic database: AGCTAGCTGCTAGTCAGTCGATGCATGCTACTAGCTGCGATCGTCGTC… • [Figure: database words AGC and GCT looked up in the query INDEX]
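
The linear scan can be sketched as below, assuming exact word matches only (no relatives) and a dictionary index of the query. The function name `find_word_hits` is hypothetical. Each returned pair is a (query position, database position) at which an alignment would be initiated.

```python
from collections import defaultdict


def find_word_hits(query, db, k):
    """Scan db one word at a time and report all matching query positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    hits = []
    for d in range(len(db) - k + 1):
        word = db[d:d + k]
        # .get avoids inserting empty lists for words absent from the query
        for q in index.get(word, ()):
            hits.append((q, d))
    return hits


if __name__ == "__main__":
    print(find_word_hits("AGCT", "TTAGCTT", 3))
```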

  17. BLAST Alignment Extension • Example: the matching word GGT initiates an alignment • Extension to the left and right with no gaps until the alignment score falls a certain threshold S below the best score so far • Output: GTAAGGTCC / GTTAGGTCC • [Figure: alignment matrix of ACGAAGTAAGGTCCAGT against AGCGTTAGGTCCTTCCC]
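
A sketch of ungapped extension with a drop-off threshold. The scoring (match +1, mismatch -1), the default threshold `drop=2`, and the function name are assumptions chosen for illustration; the slide does not give numeric values. The second sequence in the usage example is the slide's vertical axis read in reverse, which is also an assumption.

```python
def extend_ungapped(query, db, q_pos, d_pos, k, match=1, mismatch=-1, drop=2):
    """Extend a word hit left and right without gaps.

    Each direction stops once the running score falls more than `drop`
    below the best score seen so far; the extension is then trimmed back
    to the point where that best score was reached.
    """
    def walk(q, d, step):
        score = best = best_len = length = 0
        while 0 <= q < len(query) and 0 <= d < len(db):
            score += match if query[q] == db[d] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > drop:
                break
            q += step
            d += step
        return best, best_len

    seed = sum(match if query[q_pos + i] == db[d_pos + i] else mismatch
               for i in range(k))
    right_gain, right_len = walk(q_pos + k, d_pos + k, +1)
    left_gain, left_len = walk(q_pos - 1, d_pos - 1, -1)
    q_start, q_end = q_pos - left_len, q_pos + k + right_len
    d_start, d_end = d_pos - left_len, d_pos + k + right_len
    return seed + left_gain + right_gain, query[q_start:q_end], db[d_start:d_end]


if __name__ == "__main__":
    # GGT starts at position 9 in the query and position 7 in the database.
    print(extend_ungapped("ACGAAGTAAGGTCCAGT", "AGCGTTAGGTCCTTCCC", 9, 7, 3))
    # prints (7, 'GTAAGGTCC', 'GTTAGGTCC')
```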

  18. BLAST Algorithm Variations • BLAT: BLAST-Like Alignment Tool • Builds index (dictionary) for database, scans linearly through query • Alignment extensions allow for gaps as well

  19. BLAT Gapped Extensions • Extensions with gaps in a band around anchor • Extension to the left and right with no gaps until the alignment score falls a certain threshold S below the best score so far • Output: GTAAGGTCC-AG / GTTAGGTCCTAG • [Figure: alignment matrix of ACGAAGTAAGGTCCAGT against AGCGTTAGGTCCTAGTC]

  20. Perfect Match Results • Perfect Match: no relatives generated

  21. Perfect Match Results

  22. Interpreting Results Word size k

  23. Interpreting Results Conservation rate Conservation rate: 81% Mutation rate: 19%

  24. Interpreting Results Sensitivity • Probability of a particular homologous area being identified • Larger k decreases probability (exact match less likely) • Straightforward mathematics

  25. Sensitivity Calculation • Suppose k = 7 • Homologous area between query and database (genome): conservation rate 81%, mutation rate 19% • Probability whole word is conserved: $0.81^{7} \approx 23\%$

  26. Sensitivity Calculation • Suppose k = 7 • Words in the homologous area: 10 • Probability a particular word is conserved: 23% • Probability at least one word is conserved: $1 - 0.77^{10} \approx 93\%$
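
The arithmetic on these two slides can be reproduced with a few lines. This sketch treats the words as independent, as the slide's calculation does; the function names are hypothetical.

```python
def word_conserved(conservation, k):
    """Probability that all k positions of one word are conserved."""
    return conservation ** k


def at_least_one_word(p_word, n_words):
    """Probability that at least one of n_words independent words is conserved."""
    return 1 - (1 - p_word) ** n_words


if __name__ == "__main__":
    p = word_conserved(0.81, 7)
    print(f"whole word conserved: {p:.2f}")                           # 0.23
    print(f"at least one of 10:   {at_least_one_word(0.23, 10):.2f}")  # 0.93
```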

  27. Interpreting Results Specificity • Expected number of alignments initiated by chance • Based on 500 bp query and 3 Gbp database • This is essentially an indication of SPEED

  28. Interpreting Results SPEED • Expected number of alignments initiated by chance • Based on 500 bp query and 3 Gbp database • This is essentially an indication of SPEED
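
One way to estimate the number of chance initiations is sketched below. It assumes uniformly random, independent bases, so that any two k-letter words match with probability $4^{-k}$; this model and the function name are assumptions, and the slide's own figures may have been computed differently.

```python
def expected_chance_hits(query_len, db_len, k):
    """Expected number of exact k-word matches between random sequences."""
    query_words = query_len - k + 1
    db_words = db_len - k + 1
    # every (query word, database word) pair matches with probability 4**-k
    return query_words * db_words / 4 ** k


if __name__ == "__main__":
    # 500 bp query against a 3 Gbp database, as on the slide
    for k in (7, 9, 11, 13, 15):
        print(k, round(expected_chance_hits(500, 3_000_000_000, k)))
```

Each extra letter in the word divides the expected count by 4, which is why larger k makes the search faster.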

  29. The Classic BLAST Tradeoff As we increase k … • Sensitivity gets worse • Speed gets better

  30. Status Check • Background Recap • Simple Tricks of the Trade • Wildcards • Multiple Words • State of the Art • Seed Patterns • Optimizing Seeds • Multiple Simultaneous Seeds

  31. Wildcards • Relative Generation: • Any match: 1 • Any mismatch: 0 • Threshold: T = k – 1 • Exact matches unlikely for larger values of k • Include variants with one “wildcard” placed in each position • GTA: *TA, G*A, GT*
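
A small sketch of wildcard variant generation. With match = 1, mismatch = 0 and T = k – 1, a word is a relative exactly when it differs from the query word in at most one position, which is what matching one of these variants means. The function names are hypothetical.

```python
def wildcard_variants(word):
    """Return the word itself plus every variant with one '*' wildcard."""
    variants = [word]
    for i in range(len(word)):
        variants.append(word[:i] + "*" + word[i + 1:])
    return variants


def matches(pattern, word):
    """True if word agrees with pattern at every non-wildcard position."""
    return len(pattern) == len(word) and all(
        p in ("*", c) for p, c in zip(pattern, word)
    )


if __name__ == "__main__":
    print(wildcard_variants("GTA"))   # ['GTA', '*TA', 'G*A', 'GT*']
    print(matches("G*A", "GCA"))      # True
```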

  32. Wildcard Results Better?

  33. Wildcard Results • For the same sensitivity, wildcard variant is about 440 times faster • [Tables: perfect match vs. wildcards]

  34. Wildcard Results • For the same sensitivity, wildcard variant is about 40 times faster • [Tables: perfect match vs. wildcards]

  35. Wildcard Results • Better • Sensitivity/speed tradeoff consistently improved

  36. Status Check • Background Recap • Simple Tricks of the Trade • Wildcards • Multiple Words • State of the Art • Seed Patterns • Optimizing Seeds • Multiple Simultaneous Seeds

  37. Multiple Words • N perfect matches • Same separation in query and database • All separations less than distance W
      Database: TGCTAGCTACGATCTGCAGTGCGTAATCT…
      Query: TCATTACATCGTGACTTGCAGTCGTCCAG…
      • TAC and TGC 7 bp apart in the database but 12 bp apart in the query: NO INITIATION • TAC and TGC 12 bp apart in both database and query: INITIATE ALIGNMENT
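
A sketch of the N = 2 case. Two word hits have the same separation in query and database exactly when they lie on the same diagonal (database position minus query position), so the code remembers the last hit per diagonal. The function name, the choice to skip overlapping hits, and the use of a strict `< w` test are assumptions made for illustration.

```python
from collections import defaultdict


def two_hit_seeds(query, db, k, w):
    """Return (query_pos, db_pos) of second hits that trigger an alignment."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    last_hit = {}   # diagonal -> database position of the most recent hit
    seeds = []
    for d in range(len(db) - k + 1):
        for q in index.get(db[d:d + k], ()):
            diag = d - q
            prev = last_hit.get(diag)
            if prev is not None and d - prev < k:
                continue                    # overlaps the previous hit; ignore
            if prev is not None and d - prev < w:
                seeds.append((q, d))        # two separate hits within W
            last_hit[diag] = d
    return seeds


if __name__ == "__main__":
    db = "TGCTAGCTACGATCTGCAGTGCGTAATCT"
    query = "TCATTACATCGTGACTTGCAGTCGTCCAG"
    print(two_hit_seeds(query, db, k=3, w=15))
```

In the slide's example, TAC (query 4, database 7) and the second TGC (query 16, database 19) share a diagonal and are 12 bp apart, so the pair (16, 19) is reported for W = 15 but not for W = 10.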

  38. Intuition Behind Multiple Words • If we use a single word of size k = 16: • Homologous area between query and database (genome): conservation rate 81%, mutation rate 19% • Probability whole word is conserved: $0.81^{16} \approx 3\%$

  39. Intuition Behind Multiple Words • If we use a single word of size k = 16: • Words in the homologous area: 10 • Probability a particular word is conserved: 3% • Probability at least one word is conserved: $1 - 0.97^{10} \approx 29\%$

  40. Intuition Behind Multiple Words • If we use a single word of size k = 16: Probability of a match = 29% • If we use N = 2 words of size k = 8: • Words in the homologous area: 20 • Probability a particular word is conserved: $0.81^{8} \approx 19\%$ • Probability at least two words are conserved: $1 - 0.81^{20} - 20 \times 0.19 \times 0.81^{19} \approx 91\%$
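
The comparison above is a binomial tail probability, sketched here under the same independence assumption the slides use. The function name is hypothetical. The sketch carries the unrounded word probabilities $0.81^{16}$ and $0.81^{8}$ through the calculation.

```python
from math import comb


def at_least_n_words(p_word, n_words, n_required):
    """Probability that at least n_required of n_words independent words are conserved."""
    fewer = sum(
        comb(n_words, i) * p_word ** i * (1 - p_word) ** (n_words - i)
        for i in range(n_required)
    )
    return 1 - fewer


if __name__ == "__main__":
    single = at_least_n_words(0.81 ** 16, 10, 1)   # one word of size 16
    double = at_least_n_words(0.81 ** 8, 20, 2)    # two words of size 8
    print(f"single k=16 word: {single:.2f}")   # 0.29
    print(f"two k=8 words:    {double:.2f}")   # 0.91
```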

  41. Intuition Behind Multiple Words • If we use a single word of size k = 16: Probability of a match = 29% • If we use N = 2 words of size k = 8: Probability of a match = 91%

  42. Multiple Words Results

  43. Multiple Words Results • For the same sensitivity, multiple words variant is about 1,200 times faster • [Tables: single perfect match vs. multiple perfect matches]

  44. Multiple Words Results • For the same sensitivity, multiple words variant is about 75,000 times faster • [Tables: single perfect match vs. multiple perfect matches]

  45. Multiple Words Results • Much better than single matches • Bigger improvement even than wildcards

  46. Multiple Words Results • Why not combine them: Multiple Wildcard Matches?

  47. Status Check • Background Recap • Simple Tricks of the Trade • Wildcards • Multiple Words • State of the Art • Seed Patterns • Optimizing Seeds • Multiple Simultaneous Seeds

  48. Seed Patterns • Contiguous word (k = 10), giving GTCAGTACGT:
      GTCAGTACGTCAGTCGTGCGTCGTCTAG
      ××××××××××
      • Seed pattern, giving GTATTAGGCG:
      GTCAGTACGTCAGTCGTGCGTCGTCTAG
      ××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙×
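
A sketch of how a seed pattern selects positions, with the slide's × written as `1` and ∙ as `0` (an encoding chosen here for convenience). The function names are hypothetical. Indexing then works as for contiguous words, except that the key is the sampled string rather than a substring.

```python
from collections import defaultdict


def apply_seed(pattern, window):
    """Keep the letters of window at positions where pattern has a '1'."""
    return "".join(c for p, c in zip(pattern, window) if p == "1")


def build_seed_index(seq, pattern):
    """Index every window of seq by the letters its seed pattern samples."""
    span = len(pattern)
    index = defaultdict(list)
    for i in range(len(seq) - span + 1):
        index[apply_seed(pattern, seq[i:i + span])].append(i)
    return index


if __name__ == "__main__":
    seq = "GTCAGTACGTCAGTCGTGCGTCGTCTAG"
    contiguous = "1111111111"
    spaced = "1101010001010001010001000001"
    print(apply_seed(contiguous, seq))   # GTCAGTACGT
    print(apply_seed(spaced, seq))       # GTATTAGGCG
```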

  49. Intuition Behind Seed Patterns • Patterns increase the likelihood of at least one match within a long conserved region • [Figure: consecutive vs. non-consecutive positions, with overlaps labeled 6 common, 5 common, 7 common, 3 common] • On a 100-long 70% conserved region:
      Consecutive: expected # hits 1.07, Prob[at least one hit] 0.30
      Non-consecutive: expected # hits 0.97, Prob[at least one hit] 0.47

  50. Advantage of Patterns • [Figure labels: 11 positions, 11 positions, 10 positions]
