
BWT-Based Compression Algorithms compress better than you have thought



### BWT-Based Compression Algorithms compress better than you have thought


Haim Kaplan, Shir Landau and Elad Verbin

Talk Outline

- History and Results
- 10-minute introduction to Burrows-Wheeler based Compression.
- Our results:
- Part I: The Local Entropy
- Part II: Compression of integer strings

History

- SODA ’99: Manzini presents the first worst-case upper bounds on BWT-based compression: |BW0(s)| <= 8*nH_k(s) + (2/25)*n + g_k
- And: |BW_RL(s)| <= 5*nH^*_k(s) + g_k

History

- CPM ’03 + SODA ‘04: Ferragina, Giancarlo, Manzini and Sciortino give Algorithm LeafCover, which comes close to nH_k(s)
- However, experimental evidence shows that it doesn’t perform as well as BW0 (20-40% worse). Why?

Our Results

- We analyze only BW0.
- We get: for every mu > 1, |BW0(s)| <= mu*nH_k(s) + log(zeta(mu))*n + g_k, where zeta is the Riemann zeta function and g_k is a constant depending only on k and the alphabet size.

Our Results

- Sample values: e.g., for mu = 2, log(zeta(2)) = log(pi^2/6) ≈ 0.72 bits, giving |BW0(s)| <= 2*nH_k(s) + 0.72*n + g_k.

Our Results

- We actually prove this bound vs. a stronger statistic: \hat{LE}(s) <= nH_k(s)
- (seems to be quite accurate)
- Analysis through results which are of independent interest

Preliminaries

- Want to define:
- Present the BW0 algorithm

order-0 entropy

nH_0(s) = sum over characters c of n_c * log(n/n_c), where n_c is the number of occurrences of c in s

= Lower bound for compression without context information

order-k entropy

nH_k(s) = sum over contexts w of length k of |s_w| * H_0(s_w), where s_w is the string of characters appearing in context w

= Lower bound for compression with order-k contexts

order-k entropy

mississippi:

Context for i: “mssp”

Context for s: “isis”

Context for p: “ip”
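The context strings above can be recomputed with a short sketch (not from the slides): for order-1 contexts, collect the character preceding each occurrence of a symbol.

```python
def order1_contexts(s):
    """Map each character to the string of characters preceding its occurrences."""
    ctx = {}
    for i in range(1, len(s)):
        ctx.setdefault(s[i], []).append(s[i - 1])
    return {c: "".join(p) for c, p in ctx.items()}

contexts = order1_contexts("mississippi")
print(contexts["i"])  # "mssp"
print(contexts["s"])  # "isis"
print(contexts["p"])  # "ip"
```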

BW0: Burrows-Wheeler Compression

Text in English (similar contexts -> similar character)

mississippi

BWT

Text with spikes (close repetitions)

ipssmpissii

Move-to-front

Integer string with small numbers

0,0,0,0,0,2,4,3,0,1,0

Arithmetic

Compressed text

01000101010100010101010

The BWT

- Invented by Burrows and Wheeler (‘94)
- Analogue to the Fourier Transform (smooth!)

[Fenwick]

string with context-regularity

mississippi

BWT

ipssmpissii

string with spikes (close repetitions)

The BWT

cyclic shifts of the text

sorted lexicographically

output of BWT

BWT sorts the characters by their context

The BWT

mississippi:

Context for i: “mssp”

Context for s: “isis”

Context for p: “ip”

chars with context ‘i’

chars with context ‘p’

mississippi

BWT sorts the characters by their context
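The sorted-shifts construction can be sketched directly in a few lines (a naive, quadratic textbook version, not a production implementation). Note that the slides omit the usual end-of-string sentinel `$`, which this sketch adds, so its output is `ipssm$pissii` rather than the slides' `ipssmpissii`.

```python
def bwt(s, sentinel="$"):
    """Burrows-Wheeler Transform: sort all cyclic shifts of s + sentinel
    lexicographically and output the last column."""
    s = s + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(bwt("mississippi"))  # "ipssm$pissii"
```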

Manzini’s Observation

- (Critical for Algorithm LeafCover)
- BWT of the string = concatenation of order-k contexts


Move To Front

- By Bentley, Sleator, Tarjan and Wei (’86)

ipssmpissii

string with spikes (close repetitions)

move-to-front

integer string with small numbers

0,0,0,0,0,2,4,3,0,1,0

Move to Front

Sara Shara Shir Samech (Hebrew: שרה שרה שיר שמח – “Sara sang a happy song”)


After MTF

- Now we have a string with small numbers: lots of 0s, many 1s, …
- Skewed frequencies: Run Arithmetic!

(chart: character frequencies after MTF, heavily skewed toward small values)
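The MTF step can be sketched as below. One assumption not stated in the slides: this sketch initializes the table to the sorted alphabet, so its output on the BWT string differs from the slides' `0,0,0,0,0,2,4,3,0,1,0` (the talk likely uses a different initial-table convention); the point — small values dominate — is the same.

```python
from collections import Counter

def mtf(s):
    """Move-to-front: emit each symbol's current table index, then move it to front.
    Assumption: table starts as the sorted alphabet of s."""
    table = sorted(set(s))
    out = []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))  # move the symbol to the front
    return out

codes = mtf("ipssmpissii")
print(codes)           # [0, 2, 3, 0, 3, 2, 3, 3, 0, 1, 0]
print(Counter(codes))  # count how often each value appears
```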

Summary of BW0

- bzip2 achieves compression close to statistical coders, with running time close to gzip.
- Here we analyze only BW0, which is about 15% worse than bzip2.
- Our upper bounds (roughly) apply to the better algorithms as well.

The Local Entropy

- We have the string s’ = MTF(BWT(s)) – small numbers.
- In a dream world, we could compress it to its sum of logs.
- Define LE(s’) = sum_i log(s’_i + 1).

The Local Entropy

- Let’s explore LE:
- LE is a measure of local similarity:
- LE(aaaaabbbbbbcccccddddd)=0
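Assuming the sum-of-logs definition LE(s') = sum_i log2(s'_i + 1) (consistent with the "dream world" bullet above, and ignoring the O(1) cost of first occurrences), a minimal sketch:

```python
import math

def local_entropy(codes):
    """LE of an integer string: sum of log2(v + 1) over the values."""
    return sum(math.log2(v + 1) for v in codes)

# Long runs MTF to all zeros, so LE is 0 -- local similarity is free.
print(local_entropy([0, 0, 0, 0, 0]))     # 0.0
# Jumpy strings cost more per position.
print(local_entropy([0, 2, 3, 0, 3, 2]))
```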

The Local Entropy

- We use two properties of LE:
- BSTW: LE(s) <= nH_0(s)
- Convexity: LE(s_1) + LE(s_2) <= LE(s_1*s_2) + O(1)

Observation: Locality of MTF

- MTF is sort of local:
- Cutting the input string into parts doesn’t influence much: only the first occurrence of each symbol in a part can change its MTF value, so only O(1) positions per part are affected (for a constant-size alphabet)
- So: LE(s_1) + LE(s_2) <= LE(s_1*s_2) + O(1)


- Worst case string for LE: ab…z ab…z … ab…z
- MTF: after the first pass, every output value is 25
- LE: n*log(26), up to lower-order terms
- Entropy: nH_0 = n*log(26)
- This *always* holds: LE(s) <= nH_0(s) (BSTW)



- So, in the dream world we could compress up to LE(s’) bits
- What about reality? – How close can we come to LE?

Compressing an integer sequence

- Problem: compress an integer sequence s’ close to its sum of logs, sum_i log(s’_i + 1)

First Solution

- Universal encodings of integers (e.g., Elias codes)
- Integer x -> a self-delimiting code of O(log x) bits
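The slides do not say which universal code is meant; as an illustration (my choice, not necessarily the one in the talk), here is the Elias gamma code, which spends 2*floor(log2 x) + 1 bits on a positive integer x. Since MTF values start at 0, one would encode v + 1.

```python
def elias_gamma(x):
    """Elias gamma code for x >= 1: floor(log2 x) zeros, then x in binary.
    Total length: 2*floor(log2 x) + 1 bits."""
    assert x >= 1
    b = bin(x)[2:]                    # binary representation, no '0b' prefix
    return "0" * (len(b) - 1) + b     # unary length prefix, then the bits

print(elias_gamma(1))  # "1"
print(elias_gamma(2))  # "010"
print(elias_gamma(9))  # "0001001"
```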

Better Solution

- Doing some math, it turns out that arithmetic (order-0) compression is good.
- Not only good: it is best!

The order-0 math

- Theorem: For any string s’ and any mu > 1, nH_0(s’) <= mu*LE(s’) + log(zeta(mu))*n

- Proof: Denote q_x = (x+1)^(-mu) / zeta(mu), a probability distribution on the non-negative integers.
- Then, by Gibbs’ inequality, nH_0(s’) <= sum_i log(1/q_{s’_i}) = mu * sum_i log(s’_i + 1) + n*log(zeta(mu)) = mu*LE(s’) + log(zeta(mu))*n.

Notes

- This inequality is tight (there are strings s’ that produce equality)
- There are a lot of such s’
- Conclusion: no algorithm can get a bound better than this – for any value of mu
- (also a result for small alphabet sizes)

Summary

- Theorem: For every mu > 1 (simultaneously), |BW0(s)| <= mu*nH_k(s) + log(zeta(mu))*n + g_k

Notes

- One can get better results vs. nH_k.
- However, here we show optimal results vs. \hat{LE}. Experimentally, this is indeed a stronger result: the upper bounds we get are almost tight.

Minor open problems

- Get the real (optimal) bound vs. nH_k
- Try to do it in an elegant manner, through an LE-like statistic
- Why is there a logarithm in LE? Is there an interesting generalization?
- Research compression of integer sequences

Main Open Problem

- Right now we have
- But sometimes

Main Open Problem

- Find an algorithm A such that
- And
- And furthermore,

The End.

History and Results

- Results hold also for BW_RL
- Elegant Analysis (we actually know better, but the analysis here is elegant and strong)

Local Entropy: Note

- The definition can have many variants, influenced by the variants of MTF. They all lead to the same result on H_k, but show some variation in performance. We have not measured this variation, but it is an interesting question to find an algorithm that minimizes all of these variants simultaneously.

Local Entropy Variants

- Detailed in the paper “Second Phase Algorithms…” by “…” are some alternatives to MTF.
- All that we require from LE to get the results on H_k are two properties:
- LE(s) <= n*H_0(s)
- LE is convex: LE(s_1) + LE(s_2) <= LE(s_1*s_2) + O(1)

LE: Relation to H_0

- For LE we have a tight tradeoff: (mu, log(zeta(mu)) + o(1)). For H_0 we have a tight result: (1, o(1)).
- Is there a connection between the two? We conceptually imagine that LE is a kind of rotated version of H_0.

Information-Theoretic Consequences of the LE Algorithm

- Look at a sequence of k integers >= 1 with sum S. Transform it into a string with k runs of ‘a’s separated by ‘b’s; this gives a string with LE equal to k and length approximately S. Running the standard algorithm (MTF + order-0) on it, the representation’s size is upper bounded by mu*k + log(zeta(mu))*S + o(S).
- It can be shown that this is also a lower bound, using the reverse transformation. In fact, the two questions are equivalent (under the specific tradeoff parameters).

Representing a Sequence: General Question: History

- You are given a sequence of k integers >= 1 with sum S. Several solutions exist:
- …

Representing a Sequence: General Question

- You are given a sequence of k integers >= 1 with sum S. Of course, you can’t get an algorithm A with a bound of the type |A| <= mu*k (for any mu), or with a bound of the type |A| <= C*S for C < 1. (C = 1 is achievable by simply writing a bit-vector marking where the numbers end, in a long string of length S.)

Representing a Sequence: General Question

- So, we have a tradeoff (mu, log(zeta(mu))) under parameters (k, S). What if we are given two functions f, g from the integers to the integers, and we want to be effective vs. (f(s), g(s)), where f(s) denotes the sum of f over all elements of the sequence s?
- For example, in the tradeoff we have presented, f = 1 and g = id.
- We actually had another tradeoff (but only for small numbers?): (mu, log(zeta(mu))) under f = log, g = 1.
- Other known tradeoffs

Hierarchy

- We use the fact that LE is convex to get that \hat{LE} <= nH_k, and therefore Q.E.D.

Battling H_k Directly

- Our results through LE aren’t optimal, as Manzini’s <= 5*H^*_k theorem shows.
- Can we get <= 2*H_k?

Experiments

- Just skim this. Simply say that BW0 beats LeafCover, and why the statistics show this. Say why you think that LeafCover is an algorithm tailored to beat H_k, and not as effective as BWT+MTF-based algorithms. (As Manzini says in his conclusion, the beauty of BWT followed by LE is that it uses the similarity present in adjacent contexts: it “flows” through the contexts.)

Interesting Open Problem

- It is clear that one can have an exponential-time algorithm that comes within O(1) of H_k (for any fixed k).
- Indeed, LeafCover gets something like that.
- However, LeafCover does not compete well with \hat{LE}.
- Problem: find an algorithm that gets within O(1) of H_k, and within (mu, log(zeta(mu))) of \hat{LE}.
- Idea: after BWT, performing MTF + arithmetic coding is good vs. LE, but not good vs. the sum of entropies of an optimal partition (and not even good vs. nH_0). Find a replacement: an algorithm A that is good vs. LE, vs. H_0, and even vs. the sum of entropies of an optimal partition. Moreover, we want the algorithm to exhibit smooth behaviour: it should be good vs. all “rotations” of H_0.

