1 / 14

Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler

Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler. Ben Langmead Based on work by Juha K ä rkk ä inen. Motivation. Burrows-Wheeler Transformation (BWT) of a large text allows: Fast exact matching Compact representation (compared to suffix tree/array)

Download Presentation

Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Blockwise Suffix Sorting forSpace-Efficient Burrows-Wheeler Ben Langmead Based on work by Juha Kärkkäinen

  2. Motivation • Burrows-Wheeler Transformation (BWT) of a large text allows: • Fast exact matching • Compact representation (compared to suffix tree/array) • More readily compressible (basis of bzip) • The FM Index exploits an indexed and compressed BWT to allow: • Exact matching in time linear in the size of the pattern • Memory footprint as much as 50% smaller than original string • FM Index and related techniques may allow us to “map reads” (match a large set of small patterns) in a single pass over the reads on a typical workstation without spilling onto the hard disk

  3. Background • Recall that BWT is derived from the Burrows-Wheeler matrix, which is related to the Suffix array a c a a c g $ g c $ a a a c BWT Text Burrows Wheeler Matrix Suffix array Last column

  4. Problem • Memory footprint of building and storing suffix array is much larger than the BWT itself • Human genome: SA: ~12 GB, BWT: ~0.8 GB • Attempt to build BWT over whole human genome on a 32 GB server exhausts memory and crashes (I tried)

  5. Solution • Kärkkäinen: “Fast BWT in Small Space by Blockwise Suffix Sorting” • Theoretical Computer Science, 387 (3), pp. 249-257, Sept. 2007 • Observation: • BWT[i] depends only on SA[i], not on any other element of SA • Corollary: • No need to keep all of SA in memory at once! • Solution: • Build SA and BWT a small “chunk” or “block” at a time • Greatly reduces the memory overhead • By something like a factor of B, where B = # of blocks

  6. Solution • Typical suffix sort:

  7. Solution • Blockwise suffix sort:

  8. Solution • Calculate and sort a random sample of the suffixes

  9. Solution • Samples are used as “bookends” for “buckets” ? $ B1 B2 B3 B4

  10. Solution • In B linear-time passes over the text (B = # buckets), sort all suffixes into buckets, one bucket at a time, then sort the bucket $ Pass 1 B1 B2 B3 B4

  11. Solution • After a bucket has been sorted and turned into a BWT segment, it is discarded $ Pass B B1 B2 B3 B4

  12. Solution • Good time bounds in the presence of long repeats require use of a difference cover sample • Acts like an oracle that determines relative lexicographical order of two suffixes that share a prefix of some length v

  13. Project Goals • Basic goal: • Write a correct, usable library implementing blockwise SA sort and BWT building • Characterize performance and time/space tradeoffs • Stretch goals: • Fine-tune for performance and memory usage • Implement difference cover sample • Question: is this necessary for good performance on real-life inputs?

  14. Concluding Remarks • BWT is one application of Blockwise Suffix Sort, but any information derived locally from SA rows (e.g. LCP information) can be made more space-efficient this way

More Related