
Accelerating Read Mapping with FastHASH

Accelerating Read Mapping with FastHASH. Hongyi Xin†, Donghyuk Lee†, Farhad Hormozdiari‡, Samihan Yedkar†, Can Alkan§, Onur Mutlu†. †Carnegie Mellon University, §University of Washington, ‡University of California Los Angeles.


Presentation Transcript


  1. Accelerating Read Mapping with FastHASH Hongyi Xin†, Donghyuk Lee†, Farhad Hormozdiari‡, Samihan Yedkar†, Can Alkan§, Onur Mutlu† †Carnegie Mellon University §University of Washington ‡University of California Los Angeles

  2. Outline • Read Mapping and its Challenges • Hash Table-Based Mappers • Problem and Goal • Key Observations • Mechanisms • Results • Conclusion

  3. Outline • Read Mapping and its Challenges • Hash Table-Based Mappers • Problem and Goal • Key Observations • Mechanisms • Results • Conclusion

  4. Read Mapping • A post-processing procedure after DNA sequencing • Map many short DNA fragments (reads) to a known reference genome, with some minor differences allowed • Mapping short reads to the reference genome is challenging (billions of 50-300 base pair reads) [Figure: reads aligned against the reference genome; DNA shown physically and logically]

  5. Challenges • Need to find many mappings of each read • A short read may map to many locations, especially with Next Generation DNA Sequencing • How can we find all mappings efficiently? • Need to tolerate small variances/errors in each read • Each individual is different: Subject’s DNA may slightly differ from the reference (mismatches, insertions, deletions) • How can we efficiently map each read with up to e errors present? • Need to map each read very fast (i.e., performance is important) • Human DNA is 3.2 billion base pairs long → millions to billions of reads (state-of-the-art mappers take weeks to map a human’s DNA) • How can we design a much higher performance read mapper?

  6. Outline • Read Mapping and its Challenges • Hash Table-Based Mappers • Preprocess the reference into a Hash Table • Use Hash Table to map reads • Problem and Goal • Key Observations • Mechanisms • Results

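  7. Hash Table-Based Mappers [Alkan+ NG’09] • Each hash table entry pairs a k-mer (here a 12-mer) with its location list: where the k-mer occurs in the reference genome • The hash table is built once for a reference [Figure: the reference genome feeding a hash table whose keys run AAAAAAAAAAAA, AAAAAAAAAAAC (NULL location list), AAAAAAAAAAAT, ..., CCCCCCCCCCCC, ..., TTTTTTTTTTTT]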
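The preprocessing step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the mrFAST implementation: the function name `build_kmer_index`, the toy reference string and the choice of k = 4 (instead of the 12-mers on the slide) are all assumptions made for brevity.

```python
from collections import defaultdict


def build_kmer_index(reference, k=12):
    """Map every k-mer of the reference to the sorted list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    # Return a plain dict so that looking up an absent k-mer does not create an entry.
    return dict(index)


# Toy reference (hypothetical); real references are billions of bases long.
reference = "ACGTACGTTTACGT"
index = build_kmer_index(reference, k=4)

print(index["ACGT"])      # [0, 4, 10]
print(index.get("AAAA"))  # None, playing the role of the NULL entry on the slide
```

Positions are appended in increasing order, so each location list is already sorted. That property is convenient later, for example when searching a list with binary search.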

  8. Outline • Read Mapping and its Challenges • Hash Table-Based Mappers • Preprocess the reference into a Hash Table • Use Hash Table to map reads • Problem and Goal • Key Observations • Mechanisms • Results

  9. Hash Table-Based Mappers [Alkan+ NG’09] [Figure: the read AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTTT is split into the k-mers AAAAAAAAAAAA, CCCCCCCCCCCC and TTTTTTTTTTTT; each k-mer is looked up in the Hash Table (HT) to obtain candidate locations in the reference genome (12, 324, ...); Verification/Local Alignment then compares the read with the reference at each candidate location. The reference segment …AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTTT… is a valid mapping (✔); the segment ..AAAAAAAAAAAAAACGCTTCCACCTTAATCTGGTTG.. is an invalid mapping]
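The lookup-then-verify flow in this figure can be sketched as follows. This is a simplified illustration under stated assumptions: verification here counts mismatches only (Hamming distance), whereas the slides also allow insertions and deletions. The names `map_read` and `hamming`, the toy reference and k = 4 are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, index, k, e):
    """Return sorted start positions where the read matches with at most e mismatches."""
    candidates = set()
    # Seed step: look up each non-overlapping k-mer of the read.
    for offset in range(0, len(read) - k + 1, k):
        for loc in index.get(read[offset:offset + k], []):
            start = loc - offset  # where the whole read would begin
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # Verification step: compare the read with the reference at every candidate.
    return sorted(s for s in candidates
                  if hamming(read, reference[s:s + len(read)]) <= e)


# Toy data: an exact copy of the read at position 2, a 2-mismatch copy at 16.
reference = "GG" + "AAAACCCCTTTT" + "GG" + "AAAACGCGTTTT" + "GG"
read = "AAAACCCCTTTT"
index = build_kmer_index(reference, k=4)

print(map_read(read, reference, index, k=4, e=1))  # [2]
print(map_read(read, reference, index, k=4, e=2))  # [2, 16]
```

Every candidate costs one reference access plus a base-by-base comparison, which is the verification cost the following slides set out to reduce.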

  10. Advantages of Hash Table-Based Mappers • + Guaranteed to find all mappings • + Tolerate up to e errors

  11. Outline • Read Mapping and its Challenges • Hash Table-Based Mappers • Problem and Goal • Key Observations • Mechanisms • Results • Conclusion

  12. Problem and Goal • Poor performance of existing read mappers: Very slow • Verification/alignment takes too long to execute • Verification requires a memory access for the reference genome + many base-pair-wise comparisons between the reference and the read • Goal: Speed up the mapper by reducing the cost of verification [Figure label: 95%]

  13. Reducing the Cost of Verification • We observe that most verification calculations are unnecessary • 1 out of 1000 potential locations passes the verification process • We also observe that we can get rid of unnecessary verification calculations by • Detecting and rejecting invalid mappings early • Reducing the number of potential mappings

  14. Outline • Read Mapping and its Challenges • Hash Table-Based Mappers • Problem and Goal • Key Observations • Mechanisms • Results • Conclusion

  15. Key Observations • Observation 1 • Adjacent k-mers in the read should also be adjacent in the reference genome • Hence, the mapper can quickly reject mappings that do not satisfy this property • Observation 2 • Some k-mers are cheaper to verify than others because they have shorter location lists (they occur less frequently in the reference genome) • The mapper needs to examine only e+1 k-mers’ locations to tolerate e errors • Hence, the mapper can choose the cheapest e+1 k-mers and verify their locations

  16. Outline • Read Mapping and its Challenges • Hash Table-Based Mappers • Problem and Goal • Key Observations • Mechanisms • Results • Conclusion

  17. FastHASH Mechanisms • Adjacency Filtering (AF): Rejects obviously invalid mapping locations at early stage to avoid unnecessary verifications • Cheap K-mer Selection (CKS): Reduces the absolute number of potential mapping locations

  18. Adjacency Filtering (AF) • Goal: Detect invalid mappings at an early stage • Key Insight: For a valid mapping, adjacent k-mers in the read are also adjacent in the reference genome • Key Idea: Search for adjacent locations in the k-mers’ location lists • If more than e k-mers fail, there must be more than e errors, so the mapping is invalid [Figure: the read AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTTT against the reference genome, showing one valid and one invalid mapping]

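  19. Adjacency Filtering (AF) [Figure: the read AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTTT is split into the k-mers AAAAAAAAAAAA (offset 0), CCCCCCCCCCCC (+12) and TTTTTTTTTTTT (+24). The Hash Table (HT) location list shown holds 12, 324, 557, 940, .... For each location, the mapper asks whether the location +12 and +24 appear in the location lists of the following k-mers (24? 36? 336? 569? 952?); candidates whose adjacent locations are missing are rejected (✗) before verification]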
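The adjacency check in this figure can be sketched as below. This is a minimal illustration under several assumptions: location lists are sorted and searched with binary search, adjacency is tested at exact offsets (so indel-induced shifts are ignored), and the location lists for the second and third k-mers are made-up values chosen to be consistent with the numbers on the slide. The function names are hypothetical.

```python
from bisect import bisect_left


def contains(sorted_locs, value):
    """Binary search for value in a sorted location list."""
    i = bisect_left(sorted_locs, value)
    return i < len(sorted_locs) and sorted_locs[i] == value


def passes_adjacency_filter(start, kmer_location_lists, k, e):
    """Check a candidate start position before running verification.

    kmer_location_lists[j] is the sorted location list of the read's j-th
    non-overlapping k-mer. The candidate is rejected as soon as more than e
    k-mers are missing from their expected location start + j * k.
    """
    failures = 0
    for j, locs in enumerate(kmer_location_lists):
        if not contains(locs, start + j * k):
            failures += 1
            if failures > e:
                return False
    return True


# Hypothetical location lists for the three 12-mers of the read.
lists = [
    [12, 324, 557, 940],  # AAAAAAAAAAAA
    [24, 700],            # CCCCCCCCCCCC
    [36, 348],            # TTTTTTTTTTTT
]

survivors_e0 = [s for s in lists[0] if passes_adjacency_filter(s, lists, k=12, e=0)]
survivors_e1 = [s for s in lists[0] if passes_adjacency_filter(s, lists, k=12, e=1)]
print(survivors_e0)  # [12]
print(survivors_e1)  # [12, 324]
```

Only the surviving candidates go on to full verification; each rejected candidate costs a few binary searches instead of a reference access and a base-by-base comparison.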

  20. FastHASH Mechanisms • Adjacency Filtering (AF): Rejects obviously invalid mapping locations at early stage to avoid unnecessary verifications • Cheap K-mer Selection (CKS): Reduces the absolute number of potential mapping locations

  21. Cheap K-mer Selection (CKS) • Goal: Reduce the number of potential mappings • Key insight: • K-mers have different costs to examine: Some k-mers are cheaper as they have fewer locations than others (they occur less frequently in the reference genome) • Key idea: • Sort the k-mers based on their number of locations • Select the k-mers with the fewest locations to verify

  22. Cheap K-mer Selection • e = 2 (examine 3 k-mers) [Figure: the k-mers of a read (AAGCTCAATTTCCCTCCTTAATTTTCCTCTTAAGAA, GGGTATGGCTAG, AAGGTTGAGAGCCTTAGGCTTACC) with their locations and number of locations, contrasting an expensive choice of 3 k-mers with the cheapest 3 k-mers] • Previous work needs to verify: 3004 locations • FastHASH verifies only: 8 locations
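The selection step can be sketched as a sort by location-list length followed by taking the first e+1 entries. The reasoning behind e+1 is the pigeonhole principle: e errors can corrupt at most e of the non-overlapping k-mers, so any e+1 of them include at least one error-free k-mer whose location list contains the true mapping. The code below is a toy sketch; the function name, the 4-mers and the location counts are invented for illustration and are not the values behind the 3004 and 8 on the slide.

```python
def cheapest_kmers(read, index, k, e):
    """Return (offset, kmer, locations) for the e+1 k-mers with the fewest locations."""
    kmers = []
    for offset in range(0, len(read) - k + 1, k):  # non-overlapping k-mers
        kmer = read[offset:offset + k]
        kmers.append((offset, kmer, index.get(kmer, [])))
    kmers.sort(key=lambda item: len(item[2]))      # cheapest first
    return kmers[:e + 1]


# Hypothetical index: k-mer -> location list (only the list lengths matter here).
index = {
    "AAAA": list(range(1000)),
    "ACGT": [5, 90],
    "CCCC": list(range(500)),
    "GATC": [17],
    "TTTT": list(range(300)),
    "GGCA": [40, 41, 42],
}
read = "AAAA" + "ACGT" + "CCCC" + "GATC" + "TTTT" + "GGCA"

chosen = cheapest_kmers(read, index, k=4, e=2)
print([kmer for _, kmer, _ in chosen])          # ['GATC', 'ACGT', 'GGCA']
print(sum(len(locs) for _, _, locs in chosen))  # 6

# Taking the first three k-mers of the read instead would cost 1000 + 2 + 500 locations.
first_three = sum(len(index[read[o:o + 4]]) for o in (0, 4, 8))
print(first_three)                              # 1502
```

The offsets are kept alongside each chosen k-mer because a location must be shifted back by the k-mer's offset to obtain the candidate start of the whole read.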

  23. Outline • Read Mapping and its Challenges • Hash Table-Based Mappers • Problem and Goal • Key Observations • Mechanisms • Results • Conclusion

  24. Methodology • Implemented FastHASH on top of state-of-the-art mapper: mrFAST • New version mrFAST-2.5.0.0 over mrFAST-2.1.0.6 • Tested with real read sets generated from Illumina platform • 1M reads of a human (160 base pairs) • 500K reads of a chimpanzee (101 base pairs) • 500K reads of an orangutan (70 base pairs) • Tested with simulated reads generated from reference genome • 1M simulated reads of human (180 base pairs) • Evaluation system • Intel Core i7 Sandy Bridge machine • 16 GB of main memory

  25. FastHASH Speedup [Figure: speedup for the human, chimpanzee, orangutan and simulated read sets; peak label 19x] • With FastHASH, the new mrFAST obtains up to 19x speedup over the previous version, without losing valid mappings

  26. Analysis • Reduction of potential mappings with FastHASH [Figure: bars each labeled 99%] • FastHASH filters out over 99% of the potential mappings without sacrificing any valid mappings

  27. Other Key Results (In the paper) • FastHASH finds all possible valid mappings • Correctly mapped all simulated reads (with fewer than e artificially added errors)

  28. Outline • Read Mapping and its Challenges • Hash Table-Based Mappers • Problem and Goal • Key Observations • Mechanisms • Results • Conclusion

  29. Conclusion • Problem: Existing read mappers perform poorly in mapping billions of short reads to the reference genome, in the presence of errors • Observation: Most of the verification calculations are unnecessary • Key Idea: To reduce the cost of unnecessary verification • Reject invalid mappings early (Adjacency Filtering) • Reduce the number of possible mappings to examine (Cheap K-mer Selection) • Key Result: FastHASH obtains up to 19x speedup over the state-of-the-art mapper without losing valid mappings

  30. Acknowledgements • Carnegie Mellon University (Hongyi Xin, Donghyuk Lee, Samihan Yedkar and Onur Mutlu, co-authors) • Bilkent University (Can Alkan, co-author) • University of Washington (Evan Eichler and Can Alkan) • UCLA (Farhad Hormozdiari, co-author) • NIH (National Institutes of Health) for financial support

  31. Thank you!  • Questions? • Download link to FastHASH • You can find the slides on SAFARI group website: • http://www.ece.cmu.edu/~safari

  32. Accelerating Read Mapping with FastHASH Hongyi Xin†, Donghyuk Lee†, Farhad Hormozdiari‡, Samihan Yedkar†, Can Alkan§, Onur Mutlu† †Carnegie Mellon University §University of Washington ‡University of California Los Angeles

  33. Mapper Comparison: Number of Valid Mappings • Bowtie does not support an error threshold larger than 3 • FastHASH is able to find many more valid mappings than Bowtie and BWA

  34. Mapper Comparison: Execution Time • Bowtie does not support an error threshold larger than 3 • FastHASH is slower for e <= 3, but is much more comprehensive (can find many more valid mappings)
