
SNAP: Fast, accurate sequence alignment enabling biological applications

SNAP: Fast, accurate sequence alignment enabling biological applications. Ravi Pandya, Microsoft Research. ASHG, 10/19/2014.


Presentation Transcript


  1. SNAP: Fast, accurate sequence alignment enabling biological applications. Ravi Pandya, Microsoft Research. ASHG, 10/19/2014.

  2. SNAP. SNAP is fast*: align a 50x genome in 1.2 hours (BWA-MEM = 11.75 hours); sort + index + markdup BAM in 2 hours (samtools + sambamba = 4.25 hours). SNAP is as accurate as BWA-MEM, Bowtie2, etc.: ROC on simulated data; % aligned on real data; variant calls on real data. * NA12878: ERR194147, Azure D14 (16 cores, 112GB RAM, 800GB SSD).

  3. Sequence alignment. The problem: given a read R and a reference genome G, find the position p in G that minimizes EditDistance(R, G[p .. p + |R|]). SNAP solves this quickly and accurately because of: efficient system architecture; reducing the number of comparisons; reducing the cost of comparisons.
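The problem statement on this slide can be written down directly as a brute-force search. The sketch below is only an illustration of the definition, not SNAP's method: it scores every position with a full quadratic edit-distance computation, which is exactly the cost the rest of the talk works to avoid. The function names `edit_distance` and `brute_force_align` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def brute_force_align(read: str, genome: str) -> tuple[int, int]:
    """Return (p, distance) minimizing EditDistance(R, G[p .. p + |R|]).

    Ties are broken in favor of the smallest p.
    """
    best_pos, best_dist = -1, len(read) + 1
    for p in range(len(genome) - len(read) + 1):
        d = edit_distance(read, genome[p:p + len(read)])
        if d < best_dist:
            best_pos, best_dist = p, d
    return best_pos, best_dist


if __name__ == "__main__":
    genome = "ACGTACGTTTGACCA"
    print(brute_force_align("TTGAC", genome))  # exact occurrence
    print(brute_force_align("TTGTC", genome))  # one substitution
```

For a genome of length |G| and reads of length n this costs roughly O(|G| · n²) per read, which is why the later slides focus on reducing both the number and the cost of comparisons.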

  4. System architecture. [Pipeline diagram; labels: temp file, empty, full, align, sort, mergesort, index, mark duplicates, compress, async read, async write.]

  5. The sequence alignment problem (Bill Bolosky, MSR). [Figure: CDF of per-read/pair alignment time, NA18705, 169M pairs, using deeper search parameters than current defaults.] The easy part: 97% of 20-mers in the human genome occur only once, but at only 75% of locations. The hard part: the other 3% of 20-mers and 25% of locations. 10% of reads take 95% of the time.

  6. Hash table lookup (Bill Bolosky, MSR). Build a multi-valued map (~30GB for hg19) from all seeds S in G → all locations of S in G. For all seeds in the read and all locations of each seed in the genome, score the implied alignment of the read and keep the best: 330 reads/s. Ignore frequent seeds (>300 occurrences) and only use a few seeds/read: 42x, 14k reads/s.
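The seed lookup described on this slide can be sketched with an ordinary dictionary. This is a toy illustration under stated assumptions, not SNAP's index: the names `build_seed_index` and `candidate_locations`, the use of non-overlapping seeds, and the vote counting are all choices made for the example. Only the frequent-seed cutoff (300) comes from the slide.

```python
from collections import Counter, defaultdict


def build_seed_index(genome: str, seed_len: int) -> dict[str, list[int]]:
    """Multi-valued map: seed -> every position where it occurs in the genome."""
    index = defaultdict(list)
    for pos in range(len(genome) - seed_len + 1):
        index[genome[pos:pos + seed_len]].append(pos)
    return index


def candidate_locations(read: str, index: dict[str, list[int]],
                        seed_len: int, max_hits: int = 300) -> Counter:
    """Count, for each implied read start, how many seeds point to it.

    Seeds are taken at non-overlapping offsets (an assumption of this sketch).
    Seeds with more than max_hits occurrences are ignored.
    """
    votes = Counter()
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        hits = index.get(read[offset:offset + seed_len], ())
        if len(hits) > max_hits:
            continue  # too frequent to be informative
        for pos in hits:
            start = pos - offset  # where the read would begin in the genome
            if start >= 0:
                votes[start] += 1
    return votes


if __name__ == "__main__":
    genome = "ACGTACGTTTGACCAGGT"
    index = build_seed_index(genome, seed_len=4)
    print(candidate_locations("GTTTGACC", index, seed_len=4).most_common())
```

Each candidate start would then be scored with an edit-distance computation; the vote counts give a natural order in which to try them.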

  7. Fast scoring (Bill Bolosky, MSR). O(n²) → Ukkonen O(nd), n = len, d = min(limit, actual); use limit = best score so far + 2 (for MAPQ): 6.6x, 92k reads/s. Sort candidates by # of seed hits: 1.2x, 113k reads/s. Skip locations with # seed misses > limit: 1.4x, 154k reads/s (470x overall).
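The idea of bounding the edit-distance computation by a limit can be illustrated with a banded dynamic program. The sketch below is a simplified variant in the spirit of Ukkonen's cutoff, not necessarily the formulation SNAP uses: it only fills cells within `limit` of the diagonal and gives up as soon as a whole row exceeds the limit. The names `bounded_edit_distance` and `score_candidates` and the `max_dist` parameter are hypothetical; the "best score so far + 2" rule is from the slide.

```python
def bounded_edit_distance(a: str, b: str, limit: int):
    """Return the edit distance if it is <= limit, otherwise None.

    Only cells with |i - j| <= limit are computed, so the scoring loop does
    O(len(a) * limit) work. For clarity each row is allocated at full width;
    a tuned version would store only the band.
    """
    if abs(len(a) - len(b)) > limit:
        return None
    inf = limit + 1  # any value above the limit behaves like infinity
    prev = [j if j <= limit else inf for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        lo = max(1, i - limit)
        hi = min(len(b), i + limit)
        curr = [inf] * (len(b) + 1)
        if i <= limit:
            curr[0] = i
        for j in range(lo, hi + 1):
            cost = min(
                prev[j - 1] + (a[i - 1] != b[j - 1]),  # match or substitute
                prev[j] + 1,                           # delete from a
                curr[j - 1] + 1,                       # insert into a
            )
            curr[j] = min(cost, inf)
        if min(curr[lo - 1:hi + 1]) > limit:
            return None  # every path already exceeds the limit
        prev = curr
    return prev[len(b)] if prev[len(b)] <= limit else None


def score_candidates(read: str, genome: str, candidates, max_dist: int):
    """Score candidate starts, tightening the limit to best score so far + 2."""
    best = None
    limit = max_dist
    for start in candidates:
        d = bounded_edit_distance(read, genome[start:start + len(read)], limit)
        if d is not None and (best is None or d < best[1]):
            best = (start, d)
            limit = min(limit, d + 2)
    return best


if __name__ == "__main__":
    genome = "ACGTACGTTTGACCAGGT"
    print(score_candidates("GTTTGACC", genome, [0, 6, 10], max_dist=3))
```

Passing the candidates in decreasing order of seed hits tends to find a good alignment early, which shrinks the limit and makes the remaining candidates cheap to reject.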

  8. Paired-end alignment (Bill Bolosky, MSR). Find and score candidate location pairs: C(R1:R2) = C(R1) ∩ C(R2) {± insert size}. Enumerate in O(h log n), where h = |C(R1) ∩ C(R2)| and n = |C(R1)| + |C(R2)|. Increases accuracy by allowing a much higher limit on seed occurrences (e.g. 4k vs 300).
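The windowed intersection C(R1) ∩ C(R2) {± insert size} can be sketched with binary search over sorted candidate lists. This is a simple illustration, not the enumeration from the slide: its cost is O(|C(R1)| log |C(R2)| + h) rather than O(h log n). The function name `candidate_pairs` and the `min_gap`/`max_gap` parameters (the allowed distance from the first read's start to the second's) are assumptions of the example.

```python
from bisect import bisect_left, bisect_right


def candidate_pairs(c1, c2, min_gap: int, max_gap: int):
    """Return all (p1, p2) with min_gap <= p2 - p1 <= max_gap.

    c1 and c2 are candidate start positions for the two reads of a pair.
    """
    c2 = sorted(c2)
    pairs = []
    for p1 in sorted(c1):
        lo = bisect_left(c2, p1 + min_gap)   # first p2 inside the window
        hi = bisect_right(c2, p1 + max_gap)  # one past the last p2 inside it
        pairs.extend((p1, p2) for p2 in c2[lo:hi])
    return pairs


if __name__ == "__main__":
    print(candidate_pairs([100, 5000, 9000], [350, 5300, 20000], 100, 500))
```

Because only locations that pair up within the insert-size window survive, each read can afford to look up far more frequent seeds than in the single-end case, which is the accuracy gain the slide points to.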

  9. Results: simulated data. Mason-generated paired-end 100bp reads.

  10. Results: real data. NA18507 (Illumina HiSeq 50x). * AWS cr1.8xlarge (32 cores, 244GB RAM, 2x120GB SSD).

  11. Results: GATK variant calls. Broad GATK pipeline, curated NA12878 variant calls.

  12. Results: NIST Genome-in-a-Bottle. [Chart; value label: 11.75.] Appistry GATK pipeline, GIAB highly confident calls. Longer seeds are much faster, with similar precision/recall. ERR194147*.fastq.gz, Azure D14 (16 cores, 112GB RAM, 800GB SSD).

  13. Results: NIST Genome-in-a-Bottle. Lower confidence calls (qual>20, 2 platforms).

  14. Pathogen ID: SURPI (Charles Chiu, UCSF) “This analysis of DNA sequences required just 96 minutes. A similar analysis conducted with the use of previous generations of computational software on the same hardware platform would have taken 24 hours or more to complete, Chiu said.”

  15. SURPI (Charles Chiu, UCSF). SNAP enables SURPI with: fast filtering mode; 64-bit index for >40GB ntDB; secondary mapping output.

  16. Acknowledgements. UC Berkeley AMPLab: Matei Zaharia, Kristal Curtis, Armando Fox, Scott Shenker, Ion Stoica, David Patterson. Microsoft Research: Bill Bolosky, Ravi Pandya. UC San Francisco: Taylor Sittler. Broad Institute: Christopher Hartl. Binaries, source, documentation (Apache 2.0 licensed): http://snap.cs.berkeley.edu
