Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler

Ben Langmead

Based on work by Juha Kärkkäinen

Motivation
  • Burrows-Wheeler Transformation (BWT) of a large text allows:
    • Fast exact matching
    • Compact representation (compared to suffix tree/array)
    • Better compressibility (basis of bzip)
  • The FM Index exploits an indexed and compressed BWT to allow:
    • Exact matching in time linear in the size of the pattern
    • Memory footprint as much as 50% smaller than original string
  • The FM Index and related techniques may allow us to “map reads” (match a large set of short patterns) in a single pass over the reads on a typical workstation, without spilling to disk
Background
  • Recall that the BWT is the last column of the Burrows-Wheeler matrix, whose rows appear in the same order as the entries of the suffix array

Figure: the text a c a a c g $, its Burrows-Wheeler matrix and suffix array, and the last column of the matrix, which is the BWT g c $ a a a c.
Problem
  • Memory footprint of building and storing suffix array is much larger than the BWT itself
    • Human genome: SA: ~12 GB, BWT: ~0.8 GB
    • Attempt to build BWT over whole human genome on a 32 GB server exhausts memory and crashes (I tried)
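The two sizes quoted above can be reproduced with back-of-the-envelope arithmetic. The constants below are assumptions made for illustration and are not stated on the slide: a genome of about 3.1 billion bases, one 32-bit offset per suffix array entry, and 2 bits per base for the BWT.

```python
GENOME_BASES = 3_100_000_000   # assumed approximate human genome length
SA_BYTES_PER_ENTRY = 4         # assumed: one 32-bit offset per suffix
BWT_BITS_PER_BASE = 2          # assumed: A, C, G, T packed at 2 bits each

sa_gb = GENOME_BASES * SA_BYTES_PER_ENTRY / 1e9
bwt_gb = GENOME_BASES * BWT_BITS_PER_BASE / 8 / 1e9

print(f"suffix array: about {sa_gb:.1f} GB")
print(f"BWT:          about {bwt_gb:.2f} GB")
```

Under these assumptions the suffix array is roughly 16 times larger than the BWT, before counting any working memory used by the sort itself.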
Solution
  • Kärkkäinen: “Fast BWT in Small Space by Blockwise Suffix Sorting”
    • Theoretical Computer Science, 387 (3), pp. 249-257, Sept. 2007
  • Observation:
    • BWT[i] depends only on SA[i], not on any other element of SA
  • Corollary:
    • No need to keep all of SA in memory at once!
  • Solution:
    • Build SA and BWT a small “chunk” or “block” at a time
    • Greatly reduces the memory overhead
      • Roughly by a factor of B, where B = # of blocks
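The observation that BWT[i] depends only on SA[i] can be illustrated directly. In this sketch (function names are hypothetical) the full suffix array is still built up front, so it saves no memory; it only shows that emitting the BWT one block of SA entries at a time gives the same result as using the whole array at once.

```python
def bwt_char(text, sa_entry):
    """The BWT character for one suffix array entry; no other entry is consulted."""
    return text[sa_entry - 1]


def bwt_in_blocks(text, sa, block_size):
    """Emit the BWT from consecutive slices of the suffix array."""
    pieces = []
    for start in range(0, len(sa), block_size):
        block = sa[start:start + block_size]   # only this slice is needed now
        pieces.append("".join(bwt_char(text, e) for e in block))
    return "".join(pieces)


text = "acaacg$"
full_sa = sorted(range(len(text)), key=lambda i: text[i:])  # stand-in for a real sort
print(bwt_in_blocks(text, full_sa, block_size=3))
```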
Solution
  • Typical suffix sort:
Solution
  • Blockwise suffix sort:
Solution
  • Calculate and sort a random sample of the suffixes
Solution
  • Samples are used as “bookends” for “buckets”

Figure: buckets B1, B2, B3, B4 delimited by the sorted samples (diagram labels: “?”, “$”).
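Sampling and bucket assignment might look like the following sketch. The helper names, the fixed random seed, and the convention that a sample belongs to the bucket on its right are assumptions made for illustration; suffixes are compared as Python string slices for clarity, which a real implementation would avoid.

```python
import bisect
import random


def sample_splitters(text, num_buckets, seed=0):
    """Pick num_buckets - 1 random suffix offsets and sort them by suffix."""
    rng = random.Random(seed)
    offsets = rng.sample(range(len(text)), num_buckets - 1)
    return sorted(offsets, key=lambda i: text[i:])


def bucket_of(text, splitters, i):
    """Index of the bucket holding suffix i: the number of samples <= that suffix."""
    keys = [text[s:] for s in splitters]   # materialized only to keep the sketch short
    return bisect.bisect_right(keys, text[i:])


text = "acaacg$"
splitters = sample_splitters(text, num_buckets=4)
for i in sorted(range(len(text)), key=lambda i: text[i:]):
    print(bucket_of(text, splitters, i), text[i:])
```

Listing the suffixes in sorted order shows the bucket index never decreasing, which is what lets the buckets be processed independently and concatenated afterwards.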

Solution
  • In B linear-time passes over the text (B = # buckets), collect the suffixes that belong to one bucket per pass, then sort that bucket

Figure: Pass 1 over the text, with buckets B1, B2, B3, B4.

Solution
  • After a bucket has been sorted and turned into a BWT segment, it is discarded

Figure: Pass B over the text, with buckets B1, B2, B3, B4.
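Putting the passes together, a toy version of the whole procedure could be sketched as below. This is an illustration under simplifying assumptions, not Kärkkäinen's algorithm as published: `blockwise_bwt` is a hypothetical name, splitters are chosen with a fixed seed, and suffixes are compared as string slices, so the passes are not linear-time and the splitter strings themselves take extra memory.

```python
import random


def blockwise_bwt(text, num_buckets, seed=0):
    """Build the BWT one bucket at a time, keeping one bucket of offsets in memory."""
    rng = random.Random(seed)
    n = len(text)
    splitters = sorted(rng.sample(range(n), num_buckets - 1), key=lambda i: text[i:])
    # Bucket b holds suffixes s with bounds[b] <= s < bounds[b + 1]; None = unbounded.
    bounds = [None] + [text[s:] for s in splitters] + [None]
    pieces = []
    for b in range(num_buckets):
        lo, hi = bounds[b], bounds[b + 1]
        # One pass over the text: keep only the suffixes that fall in bucket b.
        bucket = [i for i in range(n)
                  if (lo is None or lo <= text[i:]) and (hi is None or text[i:] < hi)]
        bucket.sort(key=lambda i: text[i:])
        # Turn the sorted bucket into a BWT segment; the bucket is then dropped.
        pieces.append("".join(text[i - 1] for i in bucket))
    return "".join(pieces)


print(blockwise_bwt("acaacg$", num_buckets=4))  # gc$aaac, as on the Background slide
```

Because the buckets partition the sorted suffix order, the concatenated segments do not depend on which suffixes were sampled; only the bucket sizes, and therefore peak memory, vary.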

Solution
  • Good time bounds in the presence of long repeats require use of a difference cover sample
    • Acts like an oracle that determines relative lexicographical order of two suffixes that share a prefix of some length v
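The property that makes a difference cover usable as such an oracle can be shown on a small case. The sketch below assumes the period v = 7 and the cover {0, 1, 3}; both are illustrative choices, not values from the talk. For any two offsets i and j there is a shift k < v that moves both onto cover positions, so if the suffixes at cover positions have been ranked in advance, comparing suffixes i and j needs at most k character comparisons followed by one rank lookup.

```python
def is_difference_cover(cover, v):
    """True if every residue mod v is a difference of two cover elements."""
    diffs = {(a - b) % v for a in cover for b in cover}
    return diffs == set(range(v))


def shift_into_cover(i, j, cover, v):
    """Smallest k in [0, v) such that i + k and j + k both fall on cover residues."""
    for k in range(v):
        if (i + k) % v in cover and (j + k) % v in cover:
            return k
    raise ValueError("cover is not a difference cover modulo v")


V = 7                # assumed period, chosen for illustration
COVER = {0, 1, 3}    # assumed cover: its pairwise differences hit every residue mod 7

print(is_difference_cover(COVER, V))
print(shift_into_cover(5, 9, COVER, V))
```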
Project Goals
  • Basic goal:
    • Write a correct, usable library implementing blockwise SA sort and BWT building
    • Characterize performance and time/space tradeoffs
  • Stretch goals:
    • Fine-tune for performance and memory usage
    • Implement difference cover sample
      • Question: is this necessary for good performance on real-life inputs?
Concluding Remarks
  • BWT is one application of Blockwise Suffix Sort, but any information derived locally from SA rows (e.g. LCP information) can be made more space-efficient this way
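As an example of another locally derived quantity, LCP values can also be produced block by block. This sketch assumes the convention that LCP[i] is the longest common prefix of suffixes SA[i-1] and SA[i], with LCP[0] = 0, so each block needs only one entry carried over from the previous block; the function names are hypothetical and the character-by-character comparison is the naive one.

```python
def lcp(text, i, j):
    """Length of the longest common prefix of the suffixes starting at i and j."""
    n = 0
    while i + n < len(text) and j + n < len(text) and text[i + n] == text[j + n]:
        n += 1
    return n


def lcp_for_block(text, block, prev_entry):
    """LCP values for one sorted block, given the last entry of the previous block."""
    out = []
    for entry in block:
        out.append(0 if prev_entry is None else lcp(text, prev_entry, entry))
        prev_entry = entry
    return out, prev_entry


text = "acaacg$"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # stand-in for sorted blocks
values, carry = [], None
for start in range(0, len(sa), 3):
    block_values, carry = lcp_for_block(text, sa[start:start + 3], carry)
    values.extend(block_values)
print(values)
```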