1 / 50

Tools For HTS Analysis

Tools For HTS Analysis. Michael Brudno and Marc Fiume Department of Computer Science University of Toronto. Outline. Lab focus Our tools SHRiMP : read mapper VARiD : SNP and indel finder Savant : genome browser Discussion. Our Tools. READ MAPPING ( SHRiMP ). ASSEMBLY (UNNAMED).

candra
Download Presentation

Tools For HTS Analysis

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Tools For HTS Analysis Michael Brudno and Marc Fiume Department of Computer Science University of Toronto

  2. Outline • Lab focus • Our tools • SHRiMP: read mapper • VARiD: SNP and indel finder • Savant: genome browser • Discussion

  3. Our Tools READ MAPPING (SHRiMP) ASSEMBLY (UNNAMED) VISUALIZATION (SAVANT) SNP DETECTION (VARiD) INDEL DETECTION (MODiL) CNV DETECTION (CNVer)

  4. SHRiMP – Short read mapping package Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

  5. Key SHRiMP Features • High Sensitivity • Support for common formats (SAM, FASTQ, etc) • Flexible seeding framework • Multi-threading • Full support for SOLiD and Illumina (and 454) reads Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

  6. Sensitivity/Specificity Comparison Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

  7. Runtime comparison Mapping 6 million reads to C. Savignyi (180 Mb) Unpaired 50bp Reads Paired 75bp Reads Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

  8. Varid – snp and indel detection Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

  9. Variation detection from NGS reads • Determine differences (variation) between reference and donorusing NGS reads of the donor Reference: TCAGCATCGGCATCGACTGCACAGGACCAGTCGATCGAC Donor: ??????????????????????????????????????? GCATCGACTGCA CGGGATCGACTG Aligned reads: ATCCATTGCA GATCCACTGCAC motivation | methods | results | summary

  10. Motivation • Color-spaceand Letter-space platforms • bring them together Methods Results Summary

  11. Sequencing Platforms > NC_005109.2 | BRCA1 SX3 TCAGCATCGGCATCGACTGCACAGG • letter-space • Sanger, 454, Illumina, etc > NC_005109.2 | BRCA1 AF3 T212313230313232121311120 • color-space • AB SOLiD • less software tools available • many differences -> useful to combine this information • sequencing biases • inherent errors • advantages motivation | methods | results | summary

  12. Color Space Translation Matrix Translation Automata motivation | methods | results | summary

  13. Color Space Translating > T212313230313232121311120 > T C A G C A T C G G C A T C G A C T G C A C A G G G A Sequencing Error vs SNP Sequencing Error > T212313230313232121311120 > T212313230310232121311120 > TCAGCATCGGCAAGCTGACGTGTCC SNP > TCAGCATCGGCATCGACTGCACAGG> TCAGCATCGGCAGCGACTGCACAGG > T212313230312332121311120 C T motivation | methods | results | summary

  14. Color Space • clear distinction between a sequencing error and a SNP • can this help us in SNP detection? sounds like it! • single color change  error, • 2 colors changed  (likely) SNP. Difficult Case Well covered bases Easy snp call reference in color-space reads position motivation | methods | results | summary

  15. Motivation • Motivation • variation caller to handle both letter-space & color-space reads • Detection • Heterozygous SNPs • Homozygous SNPs • Tri-allelic SNPs • small indels • account for various errors, quality values & misalignments • VARiD • system to make inferences on the donor bases • variation detection motivation | methods | results | summary

  16. Motivation • Methods • Simple HMM Model • states, emissions, transitions, FB • Extended HMM Model • gaps, diploids, exceptions Results Summary

  17. Hidden Markov Model (HMM) Statistical model for a system - states Assume that system is a Markov process with state unobserved. Markov Process: next state depends only on current state We can observe the state’s emission (output) each state has a probability distribution over outputs S1 S2 S3 e1 e1 e1 e2 e2 e2 motivation | methods | results | summary

  18. Hidden Markov Model (HMM) • Apply HMM to variation detection: • we don’t know the state (donor), but • we can observe some output determined by the state (reads) motivation | methods | results | summary

  19. Hidden Markov Model (HMM) unknowndonor . . . . . B6 B7 B8 B9 . . . . . B6 B7 B7 B8 B8 B9 AA AA AA AC AC AC color 0 color 0 color 0 color 1 color 1 color 1 motivation | methods | results | summary

  20. States B6 B7 • Why pairs of letters? Handle colors. • AA and TT gives the same colors. Can’t just model colors • The donor could be: • letters: AA color 0 • letters: AC color 1 • : • letters: TT color 0 • 16 combinations motivation | methods | results | summary

  21. States and Transitions • States • 16 possible states • only look at second letter . . . B6 B7 B8 . . . B6 B7 B7 B8 AA AA : . . . . . . . . . AT • Transitions • only certain transitions allowed • when allowed, p(Xt|Xt-1) = freq(Xt) • each state depends only on the previous states (Markov Process) CA Possible States : CT GA : : TT TT motivation | methods | results | summary

  22. Emissions Unknown genome B7 B8 ..... B6B7B8B9 ..... color 0 T01020100311223 T1030101311223 T20100311223 Color reads color 1 color 0 ATTGCGCAATGCG TTGGGCAATGCGA GCGCACTGCGAC Letter reads AA AA AC motivation | methods | results | summary

  23. Emission Probabilities probability p(em|AA) AA emission color 0 1 – 3ε TT color 1 ε Same color emission distribution color 2 ε color 3 ε letters A 1-3ξ letters C ξ TT letters G ξ Different letter emission distribution letters T ξ motivation | methods | results | summary

  24. Emission Probabilities ..... B6B7B8B9 ..... • Combining emission probabilities • probability that this state emitted these reads. T01020100311223 T1030101311223 T20100311223 E.g. For state CC: ATTGCGCAATGCG TTGGGCAATGCGA GCGCACTGCGAC motivation | methods | results | summary

  25. Simple HMM • Summary • unknown state • donor pair at location • transitions • transition probabilities • emissions • reads at location • emission probabilities B6 B7 AA AC color 0 color 1 motivation | methods | results | summary

  26. Forward-Backward Algorithm • Have set-up a form of an HMM • run Forward-Backward algorithm • get probability distribution over states at some position AA : likely state AT CA : CT • Variation Detection: • compare most likely state with reference: • ref: GCTATCCA • don: ...AT... GA : : TT motivation | methods | results | summary

  27. Motivation • Methods • Simple HMM Model • states, emissions, transitions, FB • Extended HMM Model • gaps, diploids, exceptions Results Summary

  28. Extended HMM • Simple HMM • only detects homozygousSNPs • Extended HMM: • short indels • heterozygousSNPs • complex error profiles & quality values motivation | methods | results | summary

  29. Expansion: Gaps and het. SNPs • Expand states • Have states that include gaps • emit: gap or color A- -- -G AGTG • Have larger states, for diploids T-T- • Transitions built in similar fashion as before • Same algorithm, but in all we have 1600 states with very sparse transitions motivation | methods | results | summary

  30. Expansion • Emission probabilities • Support quality values • Use variable error rates for emissions Donor: ACAGCATCGGCATCGACTGC1123132303123321213read: >T2123132303123321213 > C123132303123321213 • Translate through the first letter • first color is incorrect • letter-space signal • Post-process putative SNPs • uncorrelated adjacent errors may support het SNPs • check putative SNPs motivation | methods | results | summary

  31. Summary blue: varid steps motivation | methods | results | summary

  32. Motivation Methods Results Summary

  33. Results • Human dataset from Harismendy et al, 2009. (NA17156,17275,17460,17773) • Color-space dataset: • Compare random subsets: • Corona (with AB mapper) • VARiD (with SHRiMP) • VARiD (with AB mapper) • Conclusions: • Using F-measure, the three • pipelines perform very similarly. • High-coverage results is as good as can be achieved motivation | methods | results | summary

  34. Results • Human dataset from Harismendy et al, 2009. (NA17156,17275,17460,17773) • Letter-space dataset: • Compare random subsets : • GigaBayes (with Mosaik) • VARiD (with SHRiMP) • VARiD (with Mosaik) • Conclusion: • Using F-measure the three • pipelines perform very similarly. • High-coverage results is as good as can be achieved motivation | methods | results | summary

  35. Results VARiD: Combining Letter-space and Color-space Datato achieve increased accuracy in at-cost comparison motivation | methods | results | summary

  36. Motivation Methods Results Summary

  37. Summary • Summary of VARiD • HMM modeling underlying donor • Treats color-space and letter-space together in the same framework • no translation – take advantage of each technology’s properties • accurately calls short SNPs, indels in both color- and letter-space • improved results with hybrid data. • Website: http://compbio.cs.utoronto.ca/varid (VARiD freely available) • Contact: varid@cs.utoronto.ca motivation | methods | results | summary

  38. SAVANT GENOME BROWSER Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

  39. Challenge in Genomic Data Analysis • genomic data is generated in high volumes • interpretation and analysis challenge • typical pipeline employs many separate tools for computation and visualization Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

  40. Tools for HTS data analysis • substantial disconnect between the processes of computational analysis and visualization Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

  41. Tools for HTS data analysis • substantial disconnect between the processes of computational analysis and visualization Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

  42. Savant Genome Browser • platform for integratedvisual analysis of genomic data • feature-rich genome browser • computationally extensible via plugin framework Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

  43. (Very) Short List of Features Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

  44. FEATURE demonstration • INTERFACE • HTS READ ALIGNMENTS • EXAMPLE PLUGINS: SNP FINDER Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

  45. Power of visual analytics • task: find the correct parameter for command-line tool Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

  46. Plugin Framework • unlocks the potential for performing visual analytics • mutually beneficial for both users and tool developers • for users: perform complex data analyses on-the-fly within a visual environment • for programmers: platform for simple development and deployment of various programs Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

  47. ConclusionS Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

  48. Conclusions • Savant is a platform for integrated visualization and analysis of genomic data • stand-alone genome browser • novel features: e.g. table view, visualization modes, data selection, etc. • computationally extensible through plugin framework • makes interpretation and analysis of genomic data easier and more efficient Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

  49. Acknowledgements Orion Vanessa Joe Nilgun Paul Vera Recep Andrew Vlad Mike Brudno Yue Marc Misko Yoni

  50. Questions? • SHRiMP • http://compbio.cs.toronto.edu/shrimp • VARiD • http://compbio.cs.toronto.edu/varid • Savant Genome Browser • http://compbio.cs.toronto.edu/savant

More Related