
Biosequence Similarity Search on the Mercury System



Presentation Transcript


  1. Biosequence Similarity Search on the Mercury System
     Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster
     Department of Computer Science and Engineering, Washington University in Saint Louis, MO
     Supported by an NIH STTR Grant & NSF Grants DBI-0237902, ITR-0313203, CCR-0217334

  2. Outline
     • Overview of BLAST
     • Overview of the Mercury system
     • Description of BLASTN algorithm
     • Algorithmic changes to BLASTN
     • Improvement in performance
     • Related work
     • Conclusion
     Washington University in St. Louis

  3. Basic Local Alignment Search Tool
     • Biosequence comparison software
       • Compares a query sequence (new genome) to a large database of known biosequences
       • Looks for similar regions
     • Exponential growth of genomic databases
       • Longer time for searches to complete
     • Solutions
       • Perform comparison over multiple machines
       • Specialized hardware (our approach)

  4. The Mercury System

  5. The Mercury System
     • Proximity to disk
       • Simple operations performed close to disk
       • Avoids CPU use
       • 400 Mbytes/s throughput from the disk
     • Concurrent, independent operation
       • Does not use processor cache cycles, memory or I/O buses
     • Reconfigurable logic
       • Logic can be tuned to the particular needs of the application

  6. BLASTN
     • Both the query and the database are long DNA strings
     • Strings consist of {A, C, T, G} and some unknowns
     • Each successive stage processes less data
     • The stages become more computationally expensive

  7. BLASTN - Terminology
     Query: …ACTGTGTTTCACTGACGGGTGT…
     Database: …CTGTGTCCCCAACACTGCTGACGTAGAATCGTGTAG…
     • A ‘w-mer’ is a sequence of ‘w’ consecutive bases

  8. BLASTN - Pipeline - Stage 1
     • Matches each ‘11-mer’ in the query to the database
     • Exact string matching
     • 83% of overall time is spent in this stage
     • Filters out 92% of the data entering this stage
     • Only 8% of the data proceeds to the next stage
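As a rough illustration of the exact word matching described for stage 1, the following is a minimal sketch in Python. The function names, the dictionary-based table, and the (query offset, database offset) output format are assumptions made for illustration; they are not taken from the slides or from NCBI BLASTN.

```python
from collections import defaultdict


def index_query(query, w=11):
    """Map every w-mer in the query to the list of offsets where it occurs."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table


def find_word_matches(query, database, w=11):
    """Return (query_offset, database_offset) pairs for every exact w-mer match."""
    table = index_query(query, w)
    matches = []
    for j in range(len(database) - w + 1):
        # .get() avoids inserting empty entries into the defaultdict.
        for i in table.get(database[j:j + w], ()):
            matches.append((i, j))
    return matches
```

The work in this sketch is dominated by one table probe per database position, which is consistent with stage 1 touching all of the data while later stages see only a small fraction of it.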

  9. BLASTN - Pipeline - Stage 2
     • Extends the matches from stage 1
     …ACTGTGTTTCACTGACGGGTGT…
     …GTGTCCCCAACATTTCACTGACGAGAATCGTGTAG…

  10. BLASTN - Pipeline - Stage 2
     • Extends the matches from stage 1
     • Allows mismatches of individual bases
     • Does not allow gaps in either the query or the database
     • Match score must exceed a threshold to proceed
     • 16% of pipeline time is spent in this stage
     • Only 2/100,000 of the data entering this stage proceeds to the next stage
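The ungapped extension described for stage 2 can be sketched as follows. This is only an illustration: the scoring values (`match=1`, `mismatch=-3`), the `x_drop` cutoff used to stop extending, and the `threshold` value are all assumed parameters, and the slides do not say how the extension terminates.

```python
def _extend(query, database, qi, di, step, match, mismatch, x_drop):
    """Extend in one direction without gaps; return the best score gained."""
    best = score = 0
    while 0 <= qi < len(query) and 0 <= di < len(database):
        score += match if query[qi] == database[di] else mismatch
        if score > best:
            best = score
        elif best - score > x_drop:
            break  # assumed stopping rule: score fell too far below the best
        qi += step
        di += step
    return best


def ungapped_score(query, database, q_off, d_off, w=11,
                   match=1, mismatch=-3, x_drop=10):
    """Score a stage 1 seed (an exact w-mer match) after extending both ways."""
    seed = w * match
    left = _extend(query, database, q_off - 1, d_off - 1, -1,
                   match, mismatch, x_drop)
    right = _extend(query, database, q_off + w, d_off + w, 1,
                    match, mismatch, x_drop)
    return seed + left + right


def passes_stage2(query, database, q_off, d_off, threshold, **kwargs):
    """Only seeds whose extended score exceeds the threshold go on to stage 3."""
    return ungapped_score(query, database, q_off, d_off, **kwargs) > threshold
```

Because both sequences advance in lockstep, a mismatch only costs score and never shifts the alignment, which is what "no gaps" means here.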

  11. BLASTN - Pipeline - Stage 3
     • Extends the matches from stage 2
     …ACCACTGTTTCACTGACG_GA_T_GT…
     …CTGTGTCCCCAC_GTTTCACTGACGAGAATCGTGTAG…

  12. BLASTN - Pipeline - Stage 3
     • Extends the matches from stage 2
     • Scores matches with gaps inserted in both sequences
     • Smith-Waterman dynamic programming algorithm
     • <1% of pipeline time is spent in this stage
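For readers unfamiliar with Smith-Waterman, a minimal score-only sketch follows. It uses a linear gap penalty and assumed scoring values (`match=1`, `mismatch=-3`, `gap=-2`); BLASTN's actual gapped stage uses its own scoring parameters and restricts the search around the stage 2 hit, neither of which is modeled here.

```python
def smith_waterman_score(a, b, match=1, mismatch=-3, gap=-2):
    """Best local alignment score of a and b, allowing gaps in either one."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # 0 restarts the alignment; the other terms extend it with a
            # match/mismatch, a gap in b, or a gap in a.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

The cost is proportional to the product of the two sequence lengths, which is why this stage is only affordable on the small fraction of data that survives stages 1 and 2.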

  13. NCBI - BLASTN
     • Stage 1 (word matching) is implemented as a lookup table
     • Efficient only for certain word lengths (w = 11)
     • Performance degrades dramatically for larger query sizes
     [Figure label: Pentium-4, 2.6 GHz, 1 Gbyte RAM]

  14. Firmware implementation - Stage 1
     • Bloom filters: match ‘11-mers’ to the query, but generate false positives
     • Hash lookup: eliminates false positives from the Bloom filters and obtains the offset in the query
     • Redundancy eliminator: discards matches that are close to one another
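The slide does not define what "close to one another" means for the redundancy eliminator. One plausible reading, shown here purely as an assumption, is that hits on the same diagonal (the same database offset minus query offset) within a fixed distance of an already kept hit are dropped, since extending either one would cover the same region. The function name and the `min_gap` parameter are hypothetical.

```python
def eliminate_redundant(hits, min_gap=11):
    """Keep one hit per run of nearby hits on the same diagonal.

    hits is an iterable of (query_offset, database_offset) pairs.
    """
    last_kept = {}  # diagonal -> database offset of the last hit kept
    kept = []
    for q_off, d_off in sorted(hits, key=lambda h: h[1]):
        diag = d_off - q_off
        prev = last_kept.get(diag)
        if prev is None or d_off - prev >= min_gap:
            kept.append((q_off, d_off))
            last_kept[diag] = d_off
    return kept
```

Under this reading, a long exact match that produces many overlapping word hits is forwarded to stage 2 only once.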

  15. Bloom filter operation: programming the query into the Bloom filter (processing the query)
     [Figure: each query ‘11-mer’ is passed through K hash functions, and the selected bits of the ‘m-bit’ vector are set to 1.]

  16. Bloom filter operation: finding matches in the database
     [Figure: each database ‘11-mer’ is passed through the K hash functions, and the selected bits of the ‘m-bit’ vector are checked. 1: potential match; 0: not a match.]

  17. Bloom filter operation: finding matches in the database
     [Figure: as on the previous slide, with the note: * False positives are eliminated using a hash table.]
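The two phases on these slides (programming the query, then probing with database words) can be sketched in software as follows. This is not the firmware design: the SHA-256 based hash functions, the default sizes `m_bits` and `k_hashes`, and all names are assumptions chosen for a short, runnable example, and a Python dictionary stands in for the hash table that removes false positives and supplies the query offset.

```python
import hashlib


class BloomFilter:
    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits)  # one byte per bit, for clarity

    def _positions(self, word):
        # Derive k bit positions by salting one hash function (an assumption).
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{word}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, word):
        for p in self._positions(word):
            self.bits[p] = 1

    def might_contain(self, word):
        # All k bits set: potential match. Any bit clear: definitely not present.
        return all(self.bits[p] for p in self._positions(word))


def bloom_stage1(query, database, w=11, m_bits=1 << 16, k_hashes=4):
    """Return (query_offset, database_offset) pairs for exact w-mer matches."""
    bloom = BloomFilter(m_bits, k_hashes)
    offsets = {}  # exact table used to confirm Bloom filter hits
    for i in range(len(query) - w + 1):
        word = query[i:i + w]
        bloom.add(word)
        offsets.setdefault(word, []).append(i)
    hits = []
    for j in range(len(database) - w + 1):
        word = database[j:j + w]
        if bloom.might_contain(word):          # cheap filter, may be wrong
            for i in offsets.get(word, ()):    # exact check drops false positives
                hits.append((i, j))
    return hits
```

The point of the arrangement is that most database words are rejected by the bit-vector test alone, so the more expensive exact lookup runs only on the small fraction that passes.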

  18. Bloom filter performance
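The figure for this slide is not preserved in the transcript, so it is unclear which performance measure it plotted. As general background only, the standard approximation for the false-positive rate of a Bloom filter with an m-bit vector, K hash functions, and n stored words is (1 - e^(-Kn/m))^K. The helper below (a hypothetical name) evaluates it.

```python
import math


def bloom_false_positive_rate(n_items, m_bits, k_hashes):
    """Standard approximation of a Bloom filter's false-positive probability."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes
```

For a fixed vector size the rate rises as more query words are programmed in, which is one reason a longer query needs a larger vector or more filters.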

  19. Performance analysis: firmware vs. software, Stage 1

  20. Overall system throughput: $\mathrm{Tput}_{\mathrm{overall}} = \min(\mathrm{Tput}_{1}, \mathrm{Tput}_{2\&3})$
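The throughput relation on this slide says that the pipeline runs at the rate of its slowest part. A one-line sketch, with a hypothetical function name and with both throughputs assumed to be expressed in the same units:

```python
def overall_throughput(tput_stage1, tput_stages_2_and_3):
    """The pipeline is limited by whichever part is slower."""
    return min(tput_stage1, tput_stages_2_and_3)
```

A consequence is that speeding up stage 1 alone helps only until stages 2 and 3 become the bottleneck, which motivates the later slides on moving stage 2 into firmware.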

  21. Stage 2 in firmware - Throughput

  22. Stage 2 in firmware - Speedup

  23. Related work
     • Hardware-based commercial systems
       • Paracel GeneMatcher™ uses an ASIC, and hence is inflexible
       • RDisk, an FPGA-based system with a throughput of 60 Mbases/s for stage 1
     • High-end commercial systems
       • Paracel BLASTMachine2™, a 32-CPU Linux cluster
         • 2.93 Mbases/s for a 2.8 Mbase query
         • 2 times faster than 1-node Mercury BLASTN
       • TimeLogic DeCypherBLAST™, FPGA-based
         • 213 Kbases/s for a 16 Mbase query
         • Comparable to 1-node Mercury BLASTN

  24. Conclusion
     • BLASTN on the Mercury system
       • Bloom filters to improve performance of stage 1
       • Efficient hash functions in hardware
       • 7x improvement in speed with only stage 1 in firmware
       • >50x speedup with stage 2 implemented in firmware
     • Future work
       • Algorithmic changes to stage 2
       • Efficient use of hardware capabilities
     • Other applications
       • BLASTP, BLASTX, etc.

  25. Thank you
