BWT-Based Compression Algorithms compress better than you have thought

Haim Kaplan, Shir Landau and Elad Verbin


Talk Outline

  • History and Results

  • 10-minute introduction to Burrows-Wheeler based Compression.

  • Our results:

  • Part I: The Local Entropy

  • Part II: Compression of integer strings


History

  • SODA ’99: Manzini presents the first worst-case upper bounds on BWT-based compression, of the form |BW0(s)| ≤ 8·nH_k(s) + (lower-order terms)

  • And: |BW0(s)| ≤ 5·nH*_k(s) + (lower-order terms)


History

  • CPM ’03 + SODA ‘04: Ferragina, Giancarlo, Manzini and Sciortino give Algorithm LeafCover, which comes close to nH_k

  • However, experimental evidence shows that it performs 20–40% worse than BW0 in practice. Why?


Our Results

  • We analyze only BW0.

  • We get: for every μ > 1, |BW0(s)| ≤ μ·nH_k(s) + log(ζ(μ))·n + (lower-order terms), where ζ is the Riemann zeta function


Our Results

  • Sample values:


Our Results

  • We actually prove this bound vs. a stronger statistic: \hat{LE}

  • (seems to be quite accurate)

  • The analysis goes through results which are of independent interest


Preliminaries

  • Want to define: order-0 and order-k entropy

  • Present the BW0 algorithm


Order-0 Entropy

nH_0(s) = Σ_c n_c·log(n/n_c): a lower bound for compression without context information
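As a sanity check on this definition, the empirical order-0 entropy can be computed directly from character counts. A minimal sketch in Python (the function name `order0_entropy` is my own):

```python
import math
from collections import Counter

def order0_entropy(s):
    """Empirical order-0 entropy H_0(s) in bits per character:
    H_0 = sum over characters c of (n_c/n) * log2(n/n_c)."""
    n = len(s)
    return sum((nc / n) * math.log2(n / nc) for nc in Counter(s).values())

print(round(order0_entropy("mississippi"), 3))  # about 1.823 bits/char
```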


Order-k Entropy

A lower bound for compression with order-k contexts


Order-k Entropy

mississippi:

Context for i: “mssp”

Context for s: “isis”

Context for p: “ip”


Algorithm BW0

BW0 = BWT + MTF + Arithmetic


BW0: Burrows-Wheeler Compression

Text in English (similar contexts → similar characters): mississippi

↓ BWT

Text with spikes (close repetitions): ipssmpissii

↓ Move-to-front

Integer string with small numbers: 0,0,0,0,0,2,4,3,0,1,0

↓ Arithmetic

Compressed text: 01000101010100010101010


The BWT

  • Invented by Burrows and Wheeler (’94)

  • Analogous to the Fourier Transform (smooth!) [Fenwick]

String with context-regularity: mississippi

↓ BWT

String with spikes (close repetitions): ipssmpissii


The BWT

Take all cyclic shifts of the text and sort them lexicographically; the output of BWT is the last column.

BWT sorts the characters by their context.
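The sorted-rotations construction just described can be sketched directly. This is a naive O(n² log n) version that appends an end-of-string sentinel ‘$’ so the transform is invertible; the slides drop the sentinel and show ipssmpissii:

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform via sorted cyclic rotations:
    append a sentinel, sort all rotations lexicographically,
    return the last column of the sorted rotation matrix."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("mississippi"))  # ipssm$pissii
```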


The BWT

mississippi:

Context for i: “mssp”

Context for s: “isis”

Context for p: “ip”

In the BWT output of mississippi, the chars with context ‘i’ are grouped together, as are the chars with context ‘p’.

BWT sorts the characters by their context.


Manzini’s Observation

  • (Critical for Algorithm LeafCover)

  • BWT of the string = concatenation of order-k contexts


Move To Front

  • By Bentley, Sleator, Tarjan and Wei (’86)

String with spikes (close repetitions): ipssmpissii

↓ move-to-front

Integer string with small numbers: 0,0,0,0,0,2,4,3,0,1,0
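A minimal move-to-front sketch, with the recency list initialized to the sorted alphabet of the input. The slide’s numbers (0,0,0,0,0,2,4,3,0,1,0) evidently assume a different initial list, so the output below differs:

```python
def mtf_encode(s):
    """Move-to-front: emit each character's position in a recency
    list, then move that character to the front of the list."""
    table = sorted(set(s))  # initial list: sorted alphabet
    out = []
    for ch in s:
        rank = table.index(ch)
        out.append(rank)
        table.insert(0, table.pop(rank))  # move to front
    return out

print(mtf_encode("ipssmpissii"))  # [0, 2, 3, 0, 3, 2, 3, 3, 0, 1, 0]
```

Close repetitions in the BWT output turn into small ranks; runs of equal characters turn into runs of zeros.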


Move to Front

Example: Sara Shara Shir Samech (Hebrew: שרה שרה שיר שמח, “Sara sang a happy song”)



After MTF

  • Now we have a string with small numbers: lots of 0s, many 1s, …

  • Skewed frequencies: Run Arithmetic!

(Chart: character frequencies after MTF are heavily skewed toward 0.)


Summary of BW0

BW0 = BWT + MTF + Arithmetic


Summary of BW0

  • bzip2: Achieves compression close to statistical coders, with running-time close to gzip.

  • Here we analyze only BW0.

  • BW0 compresses about 15% worse than bzip2.

  • Our upper bounds (roughly) apply to the better algorithms as well.


Our Results, Part I: The Local Entropy


The Local Entropy

  • After BWT and MTF we have a string of small numbers

  • In a dream world, we could compress it to its sum of logs.

  • Define LE(s) = Σ_i log(MTF rank of s_i + 1)


The Local Entropy

  • Let’s explore LE:

  • LE is a measure of local similarity:

  • LE(aaaaabbbbbbcccccddddd)=0


The Local Entropy

  • Example:

LE(abracadabra) = 3·log 2 + 3·log 3 + 1·log 4 + 3·log 5
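The example’s numbers can be reproduced by summing log2(rank + 1) over MTF ranks, under the convention (inferred from the example; the function name `local_entropy` is my own) that a symbol seen for the first time enters with rank equal to the number of distinct symbols seen so far:

```python
import math

def local_entropy(s):
    """LE(s): sum of log2(rank + 1) over MTF ranks, where a symbol
    seen for the first time gets rank = number of distinct symbols
    seen so far, and is then moved to the front."""
    table = []
    total = 0.0
    for ch in s:
        rank = table.index(ch) if ch in table else len(table)
        total += math.log2(rank + 1)
        if ch in table:
            table.remove(ch)
        table.insert(0, ch)
    return total

# 3*log 2 + 3*log 3 + 1*log 4 + 3*log 5, as on the slide:
print(round(local_entropy("abracadabra"), 3))  # 16.721
```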


The Local Entropy

  • We use two properties of LE:

  • BSTW: LE(s) ≤ n·H_0(s)

  • Convexity: LE(s_1) + LE(s_2) ≤ LE(s_1·s_2) + O(1)


Observation: Locality of MTF

  • MTF is sort of local:

  • Cutting the input string into parts doesn’t influence much: only |Σ| positions per part are affected

  • So: LE(s_1) + LE(s_2) ≤ LE(s_1·s_2) + O(1)




Our Results, Part II: Compression of Integer Strings



Compressing an Integer Sequence

  • Problem:

    Compress an integer sequence s’

    close to its sum of logs, SL(s’) = Σ_i log(s’_i + 1)


First Solution

  • Universal encodings of integers (e.g. Elias codes)

  • Integer x → a self-delimiting code

  • Encoding takes roughly 2·log x bits (Elias gamma), or log x + O(log log x) bits (Elias delta)
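A sketch of one such universal code, Elias gamma: ⌊log2 x⌋ zeros followed by x in binary, spending 2·⌊log2 x⌋ + 1 bits on integer x ≥ 1:

```python
def elias_gamma(x):
    """Elias gamma code for x >= 1: floor(log2 x) zeros followed
    by the binary representation of x (2*floor(log2 x) + 1 bits)."""
    assert x >= 1
    binary = bin(x)[2:]                    # e.g. 5 -> '101'
    return "0" * (len(binary) - 1) + binary

print(elias_gamma(1), elias_gamma(2), elias_gamma(5))  # 1 010 00101
```

The leading zeros tell the decoder how many bits of the binary part to read, so the codes can be concatenated without separators.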


Better Solution

  • Doing some math, it turns out that arithmetic (order-0) compression is good.

  • Not only good: it is best!


The Order-0 Math

  • Theorem: For any string s’ and any μ > 1: n·H_0(s’) ≤ μ·SL(s’) + log(ζ(μ))·n



Notes

  • This inequality is tight (there are s’ that produce equality)

  • There are a lot of such s’

  • Conclusion: no algorithm can get a bound better than this – for any value of μ

  • (also result for small alphabet size)



Summary

  • Theorem: For every μ > 1 (simultaneously): |BW0(s)| ≤ μ·nH_k(s) + log(ζ(μ))·n + (lower-order terms)


Notes

  • One can get better results vs. H_k.

  • However, here we show optimal results vs. \hat{LE}. Experimentally, this is indeed a stronger result: the upper bounds we get are almost tight.


Minor Open Problems

  • Get the real bound vs. H_k

  • Try to do it in an elegant manner, through an LE-like statistic

  • Why is there a logarithm in LE? Is there an interesting generalization?

  • Research compression of integer sequences


Main Open Problem

  • Right now we have

  • But sometimes


Main Open Problem

  • Find an algorithm A such that

  • And

  • And furthermore,





History and Results

  • Results hold also for BW_RL

  • Elegant Analysis (we actually know better, but the analysis here is elegant and strong)


Local Entropy: Note

  • The definition can have many variants, influenced by the variants of MTF. They all lead to the same result on H_k, but show some variation in performance. We have not measured this variation; it is an interesting question to find an algorithm that minimizes all of these variants simultaneously.


Local Entropy Variants

  • Some alternatives to MTF are detailed in the paper “Second Phase Algorithms…” by “…”

  • All that we require from LE to get the results on H_k are two properties:

  • LE ≤ n·H_0

  • LE is convex: LE(s_1) + LE(s_2) ≤ LE(s_1·s_2) + O(1)


LE: Relation to H_0

  • For LE we have a tight tradeoff: (μ, log ζ(μ) + o(1)). For H_0 we have a tight result: (1, o(1)).

  • Is there a connection between the two? We conceptually imagine LE as a kind of rotated version of H_0.


Information-Theoretic Consequences of the LE Algorithm

  • Consider a sequence of k integers ≥ 1 with sum S. Transform it into a string with k runs of ‘a’s separated by ‘b’s: the result is a string of length approximately S with LE equal to k. Running the standard algorithm (MTF + order-0 coding) on it, the representation’s size is upper-bounded by μ·k + log(ζ(μ))·S + o(S).

  • Using the reverse transformation, one can show this is also a lower bound. In fact, the two questions are equivalent (under the specific tradeoff parameters).
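The transformation described above can be sketched as follows (the helper names are my own; k integers ≥ 1 with sum S become k runs of ‘a’ separated by ‘b’, a string of length S + k − 1, and the inverse recovers the sequence):

```python
def to_run_string(seq):
    """Map k integers >= 1 (with sum S) to a string with k runs
    of 'a' separated by 'b' (length S + k - 1)."""
    assert all(x >= 1 for x in seq)
    return "b".join("a" * x for x in seq)

def from_run_string(s):
    """Inverse transformation: the run lengths of 'a' between 'b's."""
    return [len(run) for run in s.split("b")]

print(to_run_string([3, 1, 2]))  # aaababaa
```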


Representing a Sequence: General Question: History

  • You are given a sequence of k integers ≥ 1 with sum S. Several solutions exist:


Representing a Sequence: General Question

  • You are given a sequence of k integers ≥ 1 with sum S. Of course you can’t get an algorithm A with a bound of the type |A| ≤ μ·k (for any μ), or with a bound of the type |A| ≤ C·S for C < 1 (C = 1 is achievable by simply writing a bit-vector of length S marking where the numbers end).


Representing a Sequence: General Question

  • So, we have a tradeoff (μ, log ζ(μ)) under parameters (k, S). What if we are given two functions f, g from the integers to the integers, and we want to be effective vs. (f(s), g(s)), where f(s) denotes the sum of f over all elements of the sequence s?

  • For example, in the tradeoff we have presented, f = 1, g = id.

  • We actually had another tradeoff (but only for small numbers?): (μ, log ζ(μ)) under f = log, g = 1

  • Other known tradeoffs


Hierarchy

  • We use the fact that LE is convex to get that \dot{LE}<=H_k and therefore Q.E.D.



Battling H_k Directly

  • Our results through LE aren’t optimal, as Manzini’s ≤ 5·H*_k theorem shows.

  • Can we get ≤ 2·H_k?


Experiments

  • Just skim this. Simply say that BW0 beats LeafCover, and why the statistics show this. Say why you think that LeafCover is an algorithm tailored to beat H_k, and not as effective as BWT+MTF-based algorithms. (As Manzini says in his conclusion, the beauty of BWT followed by LE is that it uses the similarity present in adjacent contexts: it “flows” through the contexts.)


Interesting Open Problem

  • It is clear that one can have an exponential-time algorithm that comes within O(1) of H_k (for any fixed k).

  • Indeed, LeafCover gets something like that

  • However, LeafCover does not compete well with \hat{LE}

  • Problem: Find an algorithm that gets within O(1) of H_k, and gets within (μ, log ζ(μ)) of \hat{LE}

  • Idea: After BWT, performing MTF+Arithmetic is good vs. LE, but not good vs. the sum of entropies of an optimal partition (and not even good vs. n·H_0). Find a replacement: an algorithm A that is good vs. LE, vs. H_0, and even vs. the sum of entropies of an optimal partition. Moreover, we want the algorithm to exhibit smooth behaviour: it should be good vs. all “rotations” of H_0





