
### BWT-Based Compression Algorithms compress better than you have thought

Haim Kaplan, Shir Landau and Elad Verbin

Talk Outline

• History and Results

• 10-minute introduction to Burrows-Wheeler based Compression.

• Our results:

• Part I: The Local Entropy

• Part II: Compression of integer strings

History

• SODA ’99: Manzini presents the first worst-case upper bounds on BWT-based compression:

• And: a bound relative to the modified entropy H*_k (his ≤ 5·nH*_k theorem, discussed below)

History

• CPM ’03 + SODA ‘04: Ferragina, Giancarlo, Manzini and Sciortino give Algorithm LeafCover, which comes close to nH_k

• However, experimental evidence shows that it doesn’t perform as well as BW0 (20–40% worse). Why?

Our Results

• We analyze only BW0.

• We get: for every μ > 1 (simultaneously), a bound of roughly μ·nH_k(s) + (log ζ(μ) + o(1))·n

Our Results

• Sample values:

Our Results

• We actually prove this bound vs. a stronger statistic, \hat{LE}:

• (seems to be quite accurate)

• Analysis through results which are of independent interest

### Preliminaries

• Want to define: order-0 entropy, order-k entropy

• Present the BW0 algorithm

order-0 entropy

= Lower bound for compression without context information

order-k entropy

= Lower bound for compression with order-k contexts

order-k entropy

mississippi:

Context for i: “mssp”

Context for s: “isis”

Context for p: “ip”
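The context groups above translate directly into the empirical entropies. Below is a minimal sketch of computing H_0 and H_k (function names are mine, not from the paper); as on the slide, the context of a character is taken to be the k characters preceding it.

```python
import math
from collections import Counter, defaultdict

def h0(s):
    """Empirical order-0 entropy in bits per symbol."""
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

def hk(s, k):
    """Empirical order-k entropy: weighted average of H_0 over the
    groups of characters sharing the same k preceding characters."""
    if k == 0:
        return h0(s)
    groups = defaultdict(list)
    for i in range(k, len(s)):
        groups[s[i - k:i]].append(s[i])   # context = preceding k chars
    return sum(len(g) * h0(g) for g in groups.values()) / len(s)

s = "mississippi"
# The slide's contexts: the character preceding each occurrence.
contexts = {c: "".join(s[i - 1] for i in range(1, len(s)) if s[i] == c)
            for c in set(s) if c in s[1:]}
print(contexts["i"], contexts["s"], contexts["p"])  # mssp isis ip
print(h0(s), hk(s, 1))
```

Grouping by context can only sharpen the statistics, so hk(s, 1) ≤ h0(s), which is why order-k entropy is the stronger lower bound.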

Algorithm BW0

BW0 = BWT + MTF + Arithmetic coding

BW0: Burrows-Wheeler Compression

Text in English (similar contexts → similar characters): mississippi

↓ BWT

Text with spikes (close repetitions): ipssmpissii

↓ Move-to-front

Integer string with small numbers: 0,0,0,0,0,2,4,3,0,1,0

↓ Arithmetic coding

Compressed bits: 01000101010100010101010
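The whole pipeline fits in a few lines. This naive sketch (all names are illustrative, not from the paper) appends an end-of-string sentinel `$` before the BWT, so the slide's string ipssmpissii shows up with an extra `$`, and it replaces the arithmetic coder by its idealized output size of n·H₀ bits.

```python
import math
from collections import Counter

def bwt(s, sentinel="$"):
    """Naive BWT: last column of the sorted cyclic-shift matrix."""
    s += sentinel
    shifts = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in shifts)

def mtf(s):
    """Move-to-front over the sorted alphabet of s."""
    table = sorted(set(s))
    out = []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))   # move accessed symbol to front
    return out

def ideal_arith_bits(seq):
    """Idealized arithmetic-coder size: n * H_0 of the sequence."""
    n = len(seq)
    return -sum(c * math.log2(c / n) for c in Counter(seq).values())

b = bwt("mississippi")                  # 'ipssm$pissii'
m = mtf(b)                              # small numbers, many repeats
print(b, m, round(ideal_arith_bits(m), 1))
```

The exact MTF numbers depend on the initial table and the sentinel, so they differ from the slide's illustrative 0,0,0,0,0,2,4,3,0,1,0; the shape (dominated by small values) is the point.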

The BWT

• Invented by Burrows and Wheeler (’94)

• Analogous to the Fourier Transform (smooth!)

[Fenwick]

string with context-regularity: mississippi

↓ BWT

string with spikes (close repetitions): ipssmpissii

The BWT

• Take all cyclic shifts of the text, sorted lexicographically; the output of the BWT is the last column of the sorted matrix.

BWT sorts the characters by their context

The BWT

mississippi:

Context for i: “mssp”

Context for s: “isis”

Context for p: “ip”

(figure: in the BWT output of mississippi, the chars with context ‘i’ and the chars with context ‘p’ form contiguous blocks)

BWT sorts the characters by their context

Manzini’s Observation

• (Critical for Algorithm LeafCover)

• BWT of the string = concatenation of order-k contexts

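The observation can be checked mechanically: in the sorted cyclic-shift matrix, rows whose first k characters agree form one order-k context block, and the BWT output restricted to such a block contains exactly the characters occurring in that context. A small sketch (here a character's context is the k characters that follow it; to match the slide's preceding-character contexts one would run the BWT on the reversed string):

```python
from itertools import groupby

def bwt_blocks(s, k, sentinel="$"):
    """Split the BWT output of s into its order-k context blocks."""
    s += sentinel
    shifts = sorted(s[i:] + s[:i] for i in range(len(s)))
    # Group rows by their first k characters (the context); the last
    # column of each group is one block of the BWT output.
    return [(ctx, "".join(row[-1] for row in rows))
            for ctx, rows in groupby(shifts, key=lambda r: r[:k])]

blocks = bwt_blocks("mississippi", 1)
print(blocks)
# The concatenation of the blocks is the full BWT output:
assert "".join(b for _, b in blocks) == "ipssm$pissii"
```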

Move To Front

• By Bentley, Sleator, Tarjan and Wei (’86)

string with spikes (close repetitions): ipssmpissii

↓ move-to-front

integer string with small numbers: 0,0,0,0,0,2,4,3,0,1,0
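MTF is lossless: the decoder maintains the same list and replays the moves. A minimal sketch, assuming the initial list is the sorted alphabet (the slide's exact numbers depend on this choice, so they may differ):

```python
def mtf_encode(s, alphabet):
    table = list(alphabet)
    out = []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))   # move accessed symbol to front
    return out

def mtf_decode(codes, alphabet):
    table = list(alphabet)
    out = []
    for i in codes:
        c = table[i]
        out.append(c)
        table.insert(0, table.pop(i))   # mirror the encoder's move
    return "".join(out)

s = "ipssmpissii"
alpha = sorted(set(s))                  # ['i', 'm', 'p', 's']
codes = mtf_encode(s, alpha)
print(codes)                            # repeats become 0s
assert mtf_decode(codes, alpha) == s    # round-trips exactly
```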

Move to Front

• Example: Sara shara shir sameach (Hebrew: “Sara sang a happy song”)

After MTF

• Now we have a string with small numbers: lots of 0s, many 1s, …

• Skewed frequencies: Run Arithmetic!

(Chart: character frequencies after MTF.)
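The payoff of the skew is visible in the order-0 entropy: a fixed-length code for 5 distinct symbols needs log₂ 5 ≈ 2.32 bits per symbol, while an (ideal) arithmetic coder approaches H₀ of the skewed distribution. Using the slide's integer string:

```python
import math
from collections import Counter

seq = [0, 0, 0, 0, 0, 2, 4, 3, 0, 1, 0]   # MTF output from the slide
n = len(seq)
freq = Counter(seq)                        # 0 dominates heavily
h0 = -sum(c / n * math.log2(c / n) for c in freq.values())
print(freq)
print(round(h0, 3), "<", round(math.log2(len(freq)), 3))
```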

Summary of BW0

BW0 = BWT + MTF + Arithmetic coding

Summary of BW0

• bzip2: achieves compression close to statistical coders, with running time close to gzip.

• Here we analyze only BW0, which is about 15% worse than bzip2.

• Our upper bounds (roughly) apply to the better algorithms as well.

### Our Results, Part I: The Local Entropy

The Local Entropy

• We have the string (the MTF output) -- small numbers

• In a dream world, we could compress it down to its sum of logs.

• Define

The Local Entropy

• Let’s explore LE:

• LE is a measure of local similarity:

• LE(aaaaabbbbbbcccccddddd)=0

The Local Entropy

• Example: LE = 3·log 2 + 3·log 3 + 1·log 4 + 3·log 5
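One way to realize LE as code is to charge each character log₂(1 + d) bits, where d is the number of distinct symbols seen since its previous occurrence (its MTF distance), with a character's first occurrence free. This matches the slide's LE(aaaaabbbbbbcccccddddd) = 0; the paper's exact definition may differ in details, and the names are mine:

```python
import math

def le(s):
    """Local entropy: sum of log2(1 + #distinct symbols since the
    previous occurrence); a character's first occurrence is free."""
    total, last = 0.0, {}
    for i, c in enumerate(s):
        if c in last:
            distinct = len(set(s[last[c] + 1:i]))
            total += math.log2(1 + distinct)
        last[c] = i
    return total

print(le("aaaaabbbbbbcccccddddd"))   # runs cost nothing
print(le("abababab"))                # tight repetitions: 1 bit each
print(le("abcdabcdabcd"))            # wider spread: 2 bits each
```

Local similarity is exactly what the statistic rewards: the farther apart the repetitions, the more distinct symbols intervene, and the larger the log terms.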

The Local Entropy

• We use two properties of LE:

• BSTW: LE(s) ≤ n·H_0(s)

• Convexity: LE(s₁) + LE(s₂) ≤ LE(s₁·s₂) + O(1)

Observation: Locality of MTF

• MTF is sort of local:

• Cutting the input string into parts doesn’t influence much: only O(|Σ|) positions per part change

• So:

(example pieces: “a b”, “a a a b a b”)
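The locality claim is easy to check empirically. Encoding the concatenation versus encoding each part from a fresh table changes the output in at most |Σ| positions per part: within a part, the two tables agree on every symbol after its first occurrence there. A small sketch with hypothetical example strings:

```python
def mtf(s, table):
    """MTF-encode s, starting from (and mutating) the given table."""
    out = []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

s1, s2 = "aababab", "bbabaab"
alpha = sorted(set(s1 + s2))            # ['a', 'b']
whole = mtf(s1 + s2, list(alpha))       # one continuous encoding
parts = mtf(s1, list(alpha)) + mtf(s2, list(alpha))  # cut, then encode
diffs = sum(x != y for x, y in zip(whole, parts))
print(whole, parts, diffs)
assert diffs <= len(alpha)              # at most |alphabet| per cut
```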


### Our Results, Part II: Compression of integer strings

• Problem: compress an integer sequence s’ to close to its sum of logs.

• Universal encodings of integers: integer x → an encoding of O(log x) bits

• Doing some math, it turns out that arithmetic coding is good.

• Not only good: it is best!

• Theorem: For any string s’ and any μ > 1:

• This inequality is tight (there are s’ that achieve equality)

• There are many such s’

• Conclusion: no algorithm can get a bound better than this, for any value of μ

• (also result for small alphabet size)
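For concreteness, a standard example of a universal integer code is the Elias gamma code, which spends 2⌊log₂ x⌋ + 1 bits on x: ⌊log₂ x⌋ zeros followed by x in binary. It is not claimed to be the encoding the slides had in mind, only an illustration of the "integer x → roughly log x bits" idea:

```python
import math

def elias_gamma(x):
    """Elias gamma code of an integer x >= 1: (len-1) zeros, then x
    in binary; total length 2*floor(log2 x) + 1 bits."""
    assert x >= 1
    b = bin(x)[2:]                      # binary form, leading bit is 1
    return "0" * (len(b) - 1) + b

for x in (1, 2, 5, 1000):
    code = elias_gamma(x)
    assert len(code) == 2 * int(math.log2(x)) + 1
    print(x, code)
```

The leading zeros tell the decoder how many bits the binary part has, which is what makes a stream of such codes uniquely decodable without separators.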

### Conclusion

• Theorem: for every μ > 1 (simultaneously), BW0’s output is at most μ·nH_k(s) + (log ζ(μ) + o(1))·n

• One can get better results vs. H*_k.

• However, here we show optimal results vs. LE. Experimentally, this is indeed a stronger result: the upper bounds we get are almost tight.

• Get the real (tight) bound vs. H_k

• Try to do it in an elegant manner, through an LE-like statistic

• Why is there a logarithm in LE? Is there an interesting generalization?

• Research compression of integer sequences

• Right now we have

• But sometimes

• Find an algorithm A such that

• And

• And furthermore,

• Results hold also for BW_RL

• Elegant Analysis (we actually know better, but the analysis here is elegant and strong)

• The definition can have many variants, influenced by the variants of MTF. They all lead to the same result on H_k, but vary somewhat in performance. We have not measured this variation, but it is an interesting question to find an algorithm that minimizes all of these variants simultaneously.

• Some alternatives to MTF are detailed in the paper “Second Phase Algorithms…” by “…”.

• All that we require from LE to get the results on H_k are two properties:

• LE<=n*H_0

• LE is convex: LE(s_1)+LE(s_2)<=LE(s_1*s_2)+O(1)

• For LE we have a tight tradeoff: (\mu,\log(\zeta(\mu))+o(1)). For H_0 we have a tight result (1,o(1)).

• Is there a connection between the two? We conceptually imagine that LE is kind of a rotated version of H_0.

• Let us look at a sequence of k integers ≥ 1 with sum S. Transform it to a string with k runs of ‘a’s separated by ‘b’s; you get a string with LE equal to k whose length is approximately S. By running the standard algorithm (MTF + order-0 coding) on it, the representation’s size is upper bounded by μ·k + log(ζ(μ))·S + o(S).

• It can be shown that this is also a lower bound using the reverse transformation. In fact, the two questions are equivalent (under the specific tradeoff parameters)

• You are given a sequence of k integers >=1 with sum =S. Several solutions exist:

• You are given a sequence of k integers ≥ 1 with sum S. Of course you can’t get an algorithm A with a bound of the type A ≤ μ·k (for any μ), or with a bound of the type A ≤ C·S for C < 1 (C = 1 is achievable by simply writing a bit-vector of length S marking where the numbers end).

• So, we have a tradeoff (mu,log(zeta(mu))) under parameters (k,S). What about if we are given two functions f,g from the integers to the integers, and we want to be effective vs. (f(s),g(s)), where f(s) denotes the sum of f over all elements of the sequence s?

• For example, in the tradeoff that we have presented, f=1, g=id.

• We actually had another tradeoff (but only for small numbers?): (mu,log(zeta(mu))) under f=log, g=1

• We use the fact that LE is convex to get that \dot{LE}<=H_k and therefore Q.E.D.

• Our results through LE aren’t optimal, as Manzini’s ≤ 5·H*_k theorem shows.

• Can we get ≤ 2·H_k?

• Just skim this. Simply say that BW0 beats LeafCover, and why the statistics show this. Say why you think that LeafCover is an algorithm tailored to beat H_k, and not as effective as BWT+MTF-based algorithms. (Like Manzini says in his conclusion, the beauty of BWT followed by LE is that it uses the similarity that is present in adjacent contexts: it “flows” through the contexts.)

• It is clear that one can have an exponential-time algorithm that comes within O(1) of H_k (for any fixed k).

• Indeed, LeafCover gets something like that

• However, LeafCover does not compete well with \hat{LE}

• Problem: Find an algorithm that gets within O(1) of H_k, and gets within mu,log(zeta(mu)) of \hat{LE}

• Idea: after BWT, performing MTF + arithmetic coding is good vs. LE, but not good vs. the sum of entropies of an optimal partition (and not even good vs. nH_0). Find a replacement: an algorithm A that is good vs. LE, vs. H_0, and even vs. the sum of entropies of an optimal partition. Moreover, we want the algorithm to exhibit smooth behaviour: it should be good vs. all “rotations” of H_0.
