ISB

ISB. Ravi Pandya | Bill Bolosky Microsoft June 28 2012. Genomics project. Collaboration with UC Berkeley AMP Lab Dave Patterson, Armando Fox, Michael Jordan (ML), Taylor Sittler (UCSF Med), students, … Long term: Cancer genomics


Presentation Transcript


  1. ISB
     Ravi Pandya | Bill Bolosky
     Microsoft
     June 28 2012

  2. Genomics project
     Collaboration with UC Berkeley AMP Lab: Dave Patterson, Armando Fox, Michael Jordan (ML), Taylor Sittler (UCSF Med), students, …
     Long term: Cancer genomics
       David Haussler (UC Santa Cruz): Cancer Genomics Hub / Cancer Genome Atlas (TCGA)
       500 TB (growing to 20 PB) of tumor/normal genomes at San Diego Supercomputer Center
     Near term: Genome sequencing pipeline
       Motivated by Archon Genomics X-Prize (September 2013)
       100 samples of DNA from centenarians (>105 years old)
       Sequence with best coverage, accuracy, and cost in 1 month
       Goal: 98% coverage, 99.9999% accuracy, $1000/genome
       Current tools (GATK, CLC) are not sufficient to meet the goal
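The goal line on the Genomics project slide implies an error budget per genome. A rough calculation, assuming a 3 billion bp genome (the size quoted on later slides) and reading "99.9999% accuracy" as at most one error per million covered bases. The function and parameter names are illustrative only and do not come from the presentation:

```python
GENOME_BP = 3_000_000_000   # "3B bp Genome", as quoted on later slides


def error_budget(genome_bp=GENOME_BP, coverage_pct=98, errors_per_million=1):
    """Return (covered bases, allowed errors) for one genome.

    Integer arithmetic is used so the result is exact.
    """
    covered = genome_bp * coverage_pct // 100
    allowed_errors = covered * errors_per_million // 1_000_000
    return covered, allowed_errors


covered, allowed = error_budget()
print(f"{covered:,} bases covered, at most {allowed:,} errors per genome")
```

Under these assumptions the target leaves room for only a few thousand wrong base calls in an entire genome.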

  3. Genomics pipeline
     Fast, accurate, scalable
       Apply state-of-the-art computer science to sequencing problem
       Machine learning, distributed systems, high-performance computing
       Open source for Windows+Linux | Windows Azure cloud service
     SNAP (available now)
       Fast aligner using hash-based index of entire genome
       10-40x faster than BWA
     FLASH (in progress)
       Comprehensive probabilistic model
       Reference-based alignment + targeted de novo contig assembly + scaffold assembly

  4. Genomics pipeline
     [Diagram labels: SNAP; Unaligned reads; Aligned reads; Hash clustering; FLASH; Optimization; De novo assembly; Scaffold assembly; Call SNPs, indels, SVs]

  5. SNAP
     Reference genome: CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCA...GTTTAGCTCAAAGAG...
     Hash index of seed → {locations}: AGCTCAAA, GAAAGAA
       1. Lookup seeds
       2. Map locations
       3. Score matches
     Read sequence: CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCAG
     ~15 core-hours for 30x coverage
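The three steps on the SNAP slide (lookup seeds, map locations, score matches) can be sketched in a few lines. This is a toy illustration and not SNAP's actual code: the function names, the non-overlapping seed choice, and the Hamming-distance scoring with a `max_mismatches` cutoff are all assumptions made here for brevity.

```python
from collections import defaultdict


def build_index(reference, seed_len):
    """Hash index: seed string -> list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def align(read, reference, index, seed_len, max_mismatches=3):
    """Return (start, mismatches) of the best candidate, or None."""
    candidates = set()
    # 1. Lookup seeds: non-overlapping seeds taken from the read.
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        # 2. Map locations: each seed hit implies a start position for the read.
        for hit in index.get(seed, ()):
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # 3. Score matches: count mismatches at each candidate (assumed scoring).
    best = None
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best


reference = "CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCA"  # prefix shown on the slide
index = build_index(reference, seed_len=8)
print(align("GCTGCAGCACGCTTTA", reference, index, seed_len=8))  # (12, 0)
print(align("GCAGCAGCACGCTTTA", reference, index, seed_len=8))  # (12, 1)
```

The second read has one substituted base, so its first seed misses the index, but the second seed still hits and recovers the same location with one mismatch. That is the point of using several seeds per read.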

  6. FLASH
     [Diagram labels: SNAP aligner; Genomic prior knowledge; Machine learning models; SNAP; Sparse Matrices; Alignment; Candidate assembly; Depth; Likelihood; Separation; Coverage; Pair distance; Overlap; Optimize]

  7. Read alignment
     [Diagram labels: SNAP alignment; Sequencing error; Mutation frequency; Variant databases; Candidate Assembly]
     [Matrix dimensions: 1B Reads; Strands; 3B bp Genome]
     RGS = Read-Genome-Strand candidate assembly
     LRG = Likelihood of Read-Genome alignment

  8. Coverage distribution
     [Diagram labels: Assembly; Sequencer characteristics; Alignment data; Assembly RGS]
     [Matrix dimensions: Strands; 3B bp Genome; Coverage]
     GSC = Genome-Strand Coverage
     LC = Likelihood of Coverage

  9. Hash clustering
     Cluster unaligned reads with overlapping bases
     Starting point for assembling contigs
       1. Count seeds
       2. Bucket reads by seed
       3. Connect overlapping reads
       4. Cluster connected components
     Example reads:
       CGCAGCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAAC
       GCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAACTGGA
       CCGATCGTTTGAATTAGATGTATTAGAGGTTAGTACCCTAGCCTAGTCGTAAGA
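The four hash clustering steps map naturally onto a counter, a bucket table, and a union-find structure. The sketch below is one possible reading of the slide, not the FLASH implementation: treating "shares a seed" as a proxy for "overlaps", the `seed_len` and `max_seed_count` parameters, and all function names are assumptions.

```python
from collections import Counter, defaultdict


def seeds_of(read, seed_len):
    """All distinct substrings of length seed_len in the read."""
    return {read[i:i + seed_len] for i in range(len(read) - seed_len + 1)}


def cluster_reads(reads, seed_len=12, max_seed_count=1000):
    # 1. Count seeds: number of reads that contain each seed.
    counts = Counter(s for read in reads for s in seeds_of(read, seed_len))

    # 2. Bucket reads by seed, skipping unique and overly common seeds.
    buckets = defaultdict(list)
    for idx, read in enumerate(reads):
        for s in seeds_of(read, seed_len):
            if 1 < counts[s] <= max_seed_count:
                buckets[s].append(idx)

    # 3. Connect reads that share a bucket (assumed proxy for overlap).
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for members in buckets.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    # 4. Cluster connected components.
    clusters = defaultdict(list)
    for idx in range(len(reads)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


reads = [
    "CGCAGCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAAC",
    "GCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAACTGGA",
    "CCGATCGTTTGAATTAGATGTATTAGAGGTTAGTACCCTAGCCTAGTCGTAAGA",
]
print(cluster_reads(reads))  # [[0, 1], [2]]
```

With the three reads shown on the slide, the first two share most of their sequence and land in one cluster, while the third stays on its own. A production version would presumably verify the overlap before connecting two reads.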

  10. Targeted de novo assembly
      [Diagram labels: Contig “genome”; Genomic prior knowledge; Machine learning models; calc; infer; Update; Candidate assembly; Alignment; Depth; Likelihood; Coverage; Separation; Pair distance; Overlap; hash clusters; Optimize]

  11. Scaffold assembly
      Maximum likelihood model
      Optimized reference contigs + de novo unaligned contigs
      Explore space of possible arrangements into a sample genome
      Optimize P(observed reads | candidate genome) = sequencing error + coverage depth + pair distance
      Incremental calculation using sparse matrix model
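One way to read the "+" in the scaffold assembly objective is as a sum of log-likelihood terms, one per evidence source. The sketch below scores a candidate genome that way. The specific distributions (independent per-base errors, Poisson depth, normally distributed pair distance) and every parameter value are assumptions chosen for illustration; the slides do not say which models FLASH uses.

```python
import math


def log_poisson(k, lam):
    """Log of the Poisson probability of observing count k with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def log_normal(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def log_likelihood(mismatches, aligned_bases, depths, pair_distances,
                   error_rate=0.01, mean_depth=30.0,
                   insert_mean=300.0, insert_sd=30.0):
    """log P(observed reads | candidate genome) as a sum of three terms."""
    # Sequencing error: each aligned base mismatches independently (assumed).
    matches = aligned_bases - mismatches
    seq_term = (mismatches * math.log(error_rate)
                + matches * math.log(1 - error_rate))
    # Coverage depth: per-position depth modeled as Poisson (assumed).
    cov_term = sum(log_poisson(d, mean_depth) for d in depths)
    # Pair distance: paired-read separation modeled as normal (assumed).
    pair_term = sum(log_normal(x, insert_mean, insert_sd) for x in pair_distances)
    return seq_term + cov_term + pair_term


good = log_likelihood(2, 1000, depths=[30, 29, 31], pair_distances=[300, 310])
bad = log_likelihood(20, 1000, depths=[5, 60, 30], pair_distances=[300, 600])
print(good > bad)
```

Because the score is a sum over reads and positions, changing one part of a candidate arrangement only changes the terms it touches, which fits the slide's note about incremental calculation.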

  12. Next steps? …
      SNAP
        Apply to more datasets / platforms / organisms
        Validate accuracy / coverage
      FLASH
        Use Kaviar for population priors
        Different approaches to assembly / structural variation
      Biology
        What interesting research could this enable – scale, speed, accuracy, analysis?
