BWT-Based Compression Algorithms compress better than you have thought

Haim Kaplan, Shir Landau and Elad Verbin

Talk Outline
  • History and Results
  • 10-minute introduction to Burrows-Wheeler based Compression.
  • Our results:
  • Part I: The Local Entropy
  • Part II: Compression of integer strings
History
  • SODA '99: Manzini gives the first worst-case upper bounds on BWT-based compression, roughly: |BW0(s)| ≤ 8·nH_k(s) + (2/25)·n + lower-order terms
  • And, for the run-length variant: |BW_RL(s)| ≤ (5 + ε)·nH*_k(s) + lower-order terms
History
  • CPM '03 + SODA '04: Ferragina, Giancarlo, Manzini and Sciortino give algorithm LeafCover, which comes close to nH_k(s).
  • However, experimental evidence shows that it doesn't perform as well as BW0 (20-40% worse). Why?
Our Results
  • We analyze only BW0.
  • We get: for every μ > 1, |BW0(s)| ≤ μ·nH_k(s) + log ζ(μ)·n + lower-order terms (ζ is the Riemann zeta function)
Our Results
  • Sample values of the tradeoff (log₂ ζ(μ), in bits per character): μ = 1.5 → ≈ 1.39; μ = 2 → ≈ 0.72; μ = 3 → ≈ 0.27; μ = 4 → ≈ 0.11
Our Results
  • We actually prove this bound vs. a stronger statistic: \hat{LE}(s) = LE(BWT(s)), which satisfies \hat{LE}(s) ≤ nH_k(s) + lower-order terms
  • (\hat{LE} seems to be a quite accurate predictor of the actual compressed size)
  • The analysis goes through results which are of independent interest
Preliminaries
  • Want to define: the order-0 entropy H_0 and the order-k entropy H_k
  • Present the BW0 algorithm
Order-0 Entropy

H_0(s) = Σ_{c∈Σ} (n_c/n)·log(n/n_c), where n_c is the number of occurrences of character c in s

nH_0(s) = lower bound for compression without context information

Order-k Entropy

H_k(s) = (1/n)·Σ_{w∈Σᵏ} |s_w|·H_0(s_w), where s_w is the string of characters appearing in context w

nH_k(s) = lower bound for compression with order-k contexts

Order-k Entropy

Example (k = 1), where the context of a character is the character preceding it:

mississippi:

Context for i: “mssp”

Context for s: “isis”

Context for p: “ip”
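
To make the definitions concrete, here is a small Python sketch (our illustration; the slides define these quantities only by example) computing the empirical H_0 and H_k of a string:

```python
import math
from collections import Counter, defaultdict

def h0(s):
    """Empirical order-0 entropy of s, in bits per character."""
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())

def hk(s, k):
    """Empirical order-k entropy: H_0 within each context class, weighted
    by class size. Following the slides, a character's context is the k
    characters that precede it."""
    if k == 0:
        return h0(s)
    groups = defaultdict(list)
    for i in range(k, len(s)):
        groups[s[i - k:i]].append(s[i])
    return sum(len(g) * h0(g) for g in groups.values()) / len(s)

print(h0("mississippi"))     # ~1.82
print(hk("mississippi", 1))  # ~0.80 -- order-1 contexts are already predictive
```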

Algorithm BW0

BW0 = BWT + MTF + Arithmetic coding

BW0: Burrows-Wheeler Compression

Text in English (similar contexts → similar characters): mississippi
  ↓ BWT
Text with spikes (close repetitions): ipssmpissii
  ↓ Move-to-front
Integer string with small numbers: 0,0,0,0,0,2,4,3,0,1,0
  ↓ Arithmetic coding
Compressed text: 01000101010100010101010
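
To make the pipeline concrete, here is a minimal Python sketch of the first two stages: a naive sorted-rotations BWT (with an added '$' sentinel; the slide's output "ipssmpissii" omits it) and MTF. The exact MTF codes depend on the initial-list convention, so the numbers below differ slightly from the slide's.

```python
# Naive BWT (via sorted cyclic rotations) and MTF, for illustration only;
# real implementations build the BWT from a suffix array.

def bwt(s):
    s += "$"  # sentinel, lexicographically smallest, makes BWT invertible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def mtf(s):
    table = sorted(set(s))  # initial recency list: sorted alphabet of s
    out = []
    for c in s:
        i = table.index(c)
        out.append(i)                   # emit position in the recency list
        table.insert(0, table.pop(i))   # move the character to the front
    return out

b = bwt("mississippi")
print(b)        # ipssm$pissii
print(mtf(b))   # [1, 3, 4, 0, 4, 4, 3, 4, 4, 0, 1, 0]
```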

The BWT
  • Invented by Burrows and Wheeler ('94)
  • Analogous to the Fourier Transform (it smooths!) [Fenwick]

string with context-regularity: mississippi
  ↓ BWT
string with spikes (close repetitions): ipssmpissii

The BWT

Take all cyclic shifts of the text and sort them lexicographically; the output of BWT is the last column of the sorted matrix.

BWT sorts the characters by their context.

The BWT

mississippi:
Context for i: “mssp”
Context for s: “isis”
Context for p: “ip”

In the sorted rotations of mississippi, the characters with context ‘i’ appear consecutively, as do those with context ‘p’: BWT sorts the characters by their context.

Manzini’s Observation
  • (Critical for algorithm LeafCover)
  • BWT of the string = concatenation of its order-k contexts: BWT(s) = s_{w_1} · s_{w_2} · ... , where w_1, w_2, ... are the order-k contexts in sorted order

Move To Front
  • By Bentley, Sleator, Tarjan and Wei (’86)
  • Encode each character by its position in a recency list, then move it to the front of the list.

ipssmpissii (string with spikes, close repetitions)
  ↓ move-to-front
0,0,0,0,0,2,4,3,0,1,0 (integer string with small numbers)

Move to Front

Example: Sara Shara Shir Samech (Hebrew: שרה שרה שיר שמח, “Sarah sang a happy song”)

After MTF
  • Now we have a string of small numbers: lots of 0s, many 1s, …
  • The character frequencies are heavily skewed: run arithmetic coding!

Summary of BW0

BW0 = BWT + MTF + Arithmetic coding

Summary of BW0
  • bzip2 achieves compression close to statistical coders, with running time close to gzip.
  • Here we analyze only BW0, which is about 15% worse than bzip2.
  • Our upper bounds (roughly) apply to the better algorithms as well.
The Local Entropy
  • After BWT + MTF we have a string of small numbers a_1, ..., a_n.
  • In a dream world, we could compress it to its sum of logs.
  • Define LE(s) = Σ_i log(a_i + 1), where a_i is the MTF code of the i-th character of s.
The Local Entropy
  • Let’s explore LE:
  • LE is a measure of local similarity:
  • LE(aaaaabbbbbbcccccddddd) = 0 (up to the cost of each character’s first occurrence)
The Local Entropy
  • Example: LE(abracadabra) = 3·log 2 + 3·log 3 + 1·log 4 + 3·log 5
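
A small Python sketch of LE under one concrete convention (our assumption: MTF codes with the initial list ordered by first appearance in the string, which reproduces the slide's example):

```python
import math

def local_entropy(s):
    """LE(s) = sum of log2(code + 1) over the MTF codes of s, where the
    initial MTF list holds the symbols of s in order of first appearance."""
    table = list(dict.fromkeys(s))  # symbols in order of first appearance
    total = 0.0
    for c in s:
        i = table.index(c)
        total += math.log2(i + 1)
        table.insert(0, table.pop(i))  # move to front
    return total

# 3*log2 + 3*log3 + 1*log4 + 3*log5 = 16.72 bits
print(local_entropy("abracadabra"))  # 16.72...
```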

The Local Entropy
  • We use two properties of LE:
  • BSTW: LE(s) ≤ nH_0(s)
  • Convexity: LE(s_1 · s_2) ≤ LE(s_1) + LE(s_2) + O(σ·log σ)
Observation: Locality of MTF
  • MTF is local, in a sense:
  • Cutting the input string into parts barely influences the output: only O(σ) positions per part change. E.g., cutting “aaabab” into “aaab” and “ab” changes the MTF codes only where a character’s most recent occurrence falls in the other part.
  • So: LE(s_1 · s_2) ≤ LE(s_1) + LE(s_2) + O(σ·log σ)

Worst case string for LE:

ab…z ab…z ab…z

  • MTF: after the first block, every character’s MTF code is σ - 1 (each character is always the least recently used)
  • LE: ≈ n·log σ
  • Entropy: H_0 = log σ (all characters equally frequent), so here LE ≈ nH_0
  • The BSTW bound LE(s) ≤ nH_0(s) *always* holds; this string shows it is essentially tight
So, in the dream world we could compress up to \hat{LE}(s) = LE(BWT(s)) bits.
  • What about reality? How close can we come to \hat{LE}?
What about reality? How close can we come to \hat{LE}?
  • Problem: compress an integer sequence to (close to) its sum of logs, Σ_i log(s'_i + 1)
Compressing an Integer Sequence
  • Problem: compress an integer sequence s' to (close to) its sum of logs, Σ_i log(s'_i + 1)

First Solution
  • Universal encodings of integers
  • Integer x → a self-delimiting binary code for x
  • Encoding length: roughly 2·log x bits (e.g., the Elias γ code; Elias δ gets log x + O(log log x))
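
For concreteness, a sketch of one standard universal code (Elias γ, our choice of example): x ≥ 1 is written as a unary length prefix followed by its binary representation, 2⌊log₂ x⌋ + 1 bits in total.

```python
def elias_gamma(x: int) -> str:
    """Elias gamma code: floor(log2 x) zeros, then x in binary."""
    assert x >= 1
    b = bin(x)[2:]                   # binary representation of x
    return "0" * (len(b) - 1) + b    # unary length prefix + binary

print(elias_gamma(1))   # '1'
print(elias_gamma(5))   # '00101' -- 2*floor(log2(5)) + 1 = 5 bits
```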
Better Solution
  • Doing some math, it turns out that arithmetic (order-0) coding is good.
  • Not only good: it is the best possible!
The Order-0 Math
  • Theorem: for any integer string s' and any μ > 1, nH_0(s') ≤ μ·Σ_i log(s'_i + 1) + log ζ(μ)·n
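
The slide's formula image is lost; the bound above is our reconstruction, consistent with the (μ, log ζ(μ)) tradeoff stated throughout the talk. The proof is a short coding argument, sketched here:

```latex
% Assign the integer j \ge 0 the power-law probability
%   p_j = (j+1)^{-\mu} / \zeta(\mu), \qquad \zeta(\mu) = \sum_{j \ge 1} j^{-\mu},
% a valid distribution for \mu > 1. The empirical entropy lower-bounds
% the cost of coding s' against any fixed distribution, so
\[
  n H_0(s') \;\le\; \sum_{i=1}^{n} \log \frac{1}{p_{s'_i}}
            \;=\; \mu \sum_{i=1}^{n} \log\bigl(s'_i + 1\bigr) \;+\; n \log \zeta(\mu).
\]
```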
Notes
  • This inequality is tight (there are strings s' that achieve equality)
  • There are a lot of such s'
  • Conclusion: no algorithm can get a better bound of this form, for any value of μ
  • (there is also a corresponding result for small alphabet sizes)
Summary
  • Theorem: for every μ > 1 (simultaneously), |BW0(s)| ≤ μ·nH_k(s) + log ζ(μ)·n + lower-order terms
Notes
  • One can get better results vs. nH_k.
  • However, here we show optimal results vs. \hat{LE}. Experimentally, this is indeed the stronger guarantee: the upper bounds we get are almost tight.
Minor Open Problems
  • Get the true optimal bound vs. nH_k
  • Try to do it in an elegant manner, through an LE-like statistic
  • Why is there a logarithm in LE? Is there an interesting generalization?
  • Research compression of integer sequences
Main Open Problem
  • Right now we have: |BW0(s)| ≤ μ·\hat{LE}(s) + log ζ(μ)·n + lower-order terms
  • But sometimes nH_k(s) is much smaller than \hat{LE}(s)
Main Open Problem
  • Find an algorithm A such that |A(s)| ≤ O(1)·nH_k(s) + lower-order terms
  • And |A(s)| ≤ μ·\hat{LE}(s) + log ζ(μ)·n + lower-order terms
  • And furthermore, A should be efficient (an exponential-time solution is easy; see the backup slides below)
History and Results
  • The results hold also for BW_RL (the run-length variant)
  • Elegant analysis (we actually know better bounds, but the analysis here is elegant and strong)
Local Entropy: Note
  • The definition has many variants, influenced by the variants of MTF. They all lead to the same result on H_k, but vary somewhat in performance. We have not measured this variation; it is an interesting question to find an algorithm that minimizes all of these variants simultaneously.
Local Entropy Variants
  • Some alternatives to MTF are detailed in the paper “Second Phase Algorithms…” by “…”.
  • All that we require from LE to get the results on H_k are two properties:
  • LE(s) ≤ nH_0(s)
  • LE is convex: LE(s_1 · s_2) ≤ LE(s_1) + LE(s_2) + O(σ·log σ)
LE: Relation to H_0
  • For LE we have a tight tradeoff: (μ, log ζ(μ) + o(1)). For H_0 we have a tight result: (1, o(1)).
  • Is there a connection between the two? Conceptually, we imagine LE as a kind of rotated version of H_0.
Information-Theoretic Consequences of the LE Algorithm
  • Consider a sequence of k integers ≥ 1 with sum S. Transform it into a string with k runs of ‘a’s separated by ‘b’s: a string with LE roughly k and length approximately S. Running the standard algorithm (MTF + order-0 coding) on it, the representation’s size is upper-bounded by μ·k + log ζ(μ)·S + o(S).
  • It can be shown that this is also a lower bound, using the reverse transformation. In fact, the two questions are equivalent (under the specific tradeoff parameters).
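
A minimal sketch (our own illustration) of the correspondence used above, between sequences of k positive integers summing to S and strings of k runs of 'a' separated by 'b's:

```python
def ints_to_runs(seq):
    """k integers >= 1 with sum S  ->  k runs of 'a' separated by 'b's
    (a string of length S + k - 1)."""
    assert all(x >= 1 for x in seq)
    return "b".join("a" * x for x in seq)

def runs_to_ints(s):
    """The reverse transformation: run lengths of 'a' between the 'b's."""
    return [len(run) for run in s.split("b")]

seq = [3, 1, 2]                  # k = 3, S = 6
print(ints_to_runs(seq))         # aaababaa
print(runs_to_ints("aaababaa"))  # [3, 1, 2]
```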
Representing a Sequence: General Question: History
  • You are given a sequence of k integers ≥ 1 with sum S. Several solutions exist:
Representing a Sequence: General Question
  • You are given a sequence of k integers ≥ 1 with sum S. Of course, you cannot get an algorithm A with a bound of the type |A| ≤ μ·k (for any μ), or with a bound of the type |A| ≤ C·S for C < 1. (C = 1 is achievable by simply writing a bit-vector of length S marking where the numbers end.)
Representing a Sequence: General Question
  • So, we have a tradeoff (μ, log ζ(μ)) under the parameters (k, S). What if we are given two functions f, g from the integers to the integers, and want to be effective vs. (f(s), g(s)), where f(s) denotes the sum of f over all elements of the sequence s?
  • For example, in the tradeoff presented here, f = 1 and g = id.
  • We actually had another tradeoff (but only for small numbers?): (μ, log ζ(μ)) under f = log, g = 1
  • Other known tradeoffs?
Hierarchy
  • We use the fact that LE is convex to get that \hat{LE}(s) ≤ nH_k(s) + lower-order terms; the bound vs. \hat{LE} therefore implies the bound vs. nH_k. Q.E.D.
Battling H_k Directly
  • Our results through LE aren’t optimal, as Manzini’s ≤ 5·nH*_k theorem shows.
  • Can we get ≤ 2·nH_k?
Experiments
  • Just skim this. Simply say that BW0 beats LeafCover, and why the statistics show this. Say why you think that LeafCover is an algorithm tailored to beat H_k, and not as effective as BWT+MTF-based algorithms. (As Manzini says in his conclusion, the beauty of BWT followed by LE is that it uses the similarity present in adjacent contexts: it “flows” through the contexts.)
Interesting Open Problem
  • It is clear that one can have an exponential-time algorithm that comes within O(1) of nH_k (for any fixed k).
  • Indeed, LeafCover gets something like that.
  • However, LeafCover does not compete well with \hat{LE}.
  • Problem: find an algorithm that gets within O(1) of nH_k, and within (μ, log ζ(μ)) of \hat{LE}.
  • Idea: after BWT, performing MTF + arithmetic coding is good vs. LE but not good vs. the sum of entropies of an optimal partition (and not even good vs. nH_0). Find a replacement: an algorithm A that is good vs. LE, vs. H_0, and even vs. the sum of entropies of an optimal partition. Moreover, we want the algorithm to exhibit smooth behaviour: it should be good vs. all “rotations” of H_0.