# BWT-Based Compression Algorithms compress better than you have thought

##### Presentation Transcript

1. BWT-Based Compression Algorithms compress better than you have thought Haim Kaplan, Shir Landau and Elad Verbin

2. Talk Outline • History and Results • 10-minute introduction to Burrows-Wheeler based Compression. • Our results: • Part I: The Local Entropy • Part II: Compression of integer strings

3. History • SODA ’99: Manzini presents the first worst-case upper bounds on BWT-based compression: |BW0(s)| ≤ 8·nH_k(s) + lower-order terms • And: |BW0(s)| ≤ (5+ε)·nH*_k(s) + lower-order terms (H*_k is the modified empirical entropy)

4. History • CPM ’03 + SODA ’04: Ferragina, Giancarlo, Manzini and Sciortino give Algorithm LeafCover, which comes close to the optimal nH_k(s) • However, experimental evidence shows that it doesn’t perform as well as BW0 (20-40% worse). Why?

5. Our Results • We analyze only BW0. • We get: for every μ > 1, |BW0(s)| ≤ μ·nH_k(s) + (log ζ(μ))·n + lower-order terms, where ζ is the Riemann zeta function

6. Our Results • Sample values of the per-symbol overhead log ζ(μ): μ = 1.5 → ≈ 1.39 bits; μ = 2 → ≈ 0.72 bits; μ = 3 → ≈ 0.27 bits

7. Our Results • We actually prove this bound against a stronger (smaller) statistic than nH_k • (it seems to track the actual output size quite accurately) • The analysis goes through results which are of independent interest

8. Preliminaries

9. Preliminaries • Want to define: the entropies H_0 and H_k • Present the BW0 algorithm

10. order-0 entropy H_0(s) = Σ_{a∈Σ} (n_a/n)·log(n/n_a), where n_a is the number of occurrences of a in s. Lower bound for compression without context information

11. order-k entropy nH_k(s) = Σ_{w∈Σ^k} |s_w|·H_0(s_w), where s_w is the string of characters appearing in context w. Lower bound for compression with order-k contexts

12. order-k entropy, for mississippi with k = 1: • the characters appearing in context “i” are “mssp” • in context “s”: “isis” • in context “p”: “ip”
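The two entropy definitions and the context example above can be checked with a short sketch. The grouping convention here (each character keyed by the single character that follows it) is an assumption, chosen because it reproduces the slide’s mssp/isis/ip grouping:

```python
from collections import Counter
from math import log2

def h0(s):
    """Empirical order-0 entropy (bits per character) of a string."""
    n = len(s)
    return sum((c / n) * log2(n / c) for c in Counter(s).values())

def order1_contexts(s):
    """Group each character by its order-1 context, i.e. by the
    character that immediately follows it (one common convention)."""
    ctx = {}
    for prev, nxt in zip(s, s[1:]):
        ctx.setdefault(nxt, []).append(prev)
    return {k: "".join(v) for k, v in ctx.items()}

print(round(h0("mississippi"), 3))     # about 1.823 bits/char
print(order1_contexts("mississippi"))  # {'i': 'mssp', 's': 'isis', 'p': 'ip'}
```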

13. Algorithm BW0 BW0= BWT + MTF + Arithmetic

14. BW0: Burrows-Wheeler Compression Text in English (similar contexts → similar character) mississippi →BWT→ text with spikes (close repetitions) ipssmpissii →Move-to-front→ integer string with small numbers 0,0,0,0,0,2,4,3,0,1,0 →Arithmetic→ compressed text 01000101010100010101010

15. The BWT • Invented by Burrows and Wheeler (’94) • Analogous to the Fourier Transform (smooth!) [Fenwick]: string with context-regularity mississippi →BWT→ ipssmpissii, a string with spikes (close repetitions)

16. The BWT • Take all cyclic shifts of the text, sorted lexicographically; the output of BWT is the last column of this sorted matrix • BWT sorts the characters by their context

17. The BWT mississippi: context “i”: “mssp”; context “s”: “isis”; context “p”: “ip” • In the sorted matrix, the chars with context ‘i’ are grouped together, then the chars with context ‘p’, and so on • BWT sorts the characters by their context
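The sorted-rotations construction can be sketched directly. This is a naive O(n² log n) version for illustration (real implementations use suffix arrays); appending a sentinel “$” is an assumption of this sketch that makes the transform well defined and invertible:

```python
def bwt(s, sentinel="$"):
    """Naive Burrows-Wheeler transform: sort all cyclic rotations of
    s + sentinel, then read off the last column."""
    t = s + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)

# The slide's "ipssmpissii" is this output with the sentinel shown:
print(bwt("mississippi"))  # ipssm$pissii
```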

18. Manzini’s Observation • (Critical for Algorithm LeafCover) • BWT of the string = concatenation of its order-k context blocks

19. Move To Front • By Bentley, Sleator, Tarjan and Wei (’86): ipssmpissii, a string with spikes (close repetitions) →move-to-front→ integer string with small numbers 0,0,0,0,0,2,4,3,0,1,0
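A minimal MTF encoder looks like this. Initializing the list with the sorted alphabet of the input is an assumption of this sketch; the slide’s figure evidently uses a different initial list, so its exact numbers differ, but the qualitative effect (a string dominated by small values) is the same:

```python
def mtf_encode(s, alphabet=None):
    """Move-to-front: emit each character's current position in the
    list, then move that character to the front."""
    table = sorted(set(s)) if alphabet is None else list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))  # move accessed char to front
    return out

print(mtf_encode("ipssmpissii"))  # [0, 2, 3, 0, 3, 2, 3, 3, 0, 1, 0]
```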

20. Move to Front • Running example: “Sara Shara Shir Samech” (Hebrew: שרה שרה שיר שמח, “Sara sang a happy song”)


28. After MTF • Now we have a string with small numbers: lots of 0s, many 1s, … • Skewed frequencies: run arithmetic coding! (figure: character-frequency histogram)
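To see how skewed the MTF output is, and roughly how many bits an ideal order-0 (arithmetic) coder would spend on it, here is a quick sketch using the slide’s example string:

```python
from collections import Counter
from math import log2

def h0_bits(seq):
    """Total order-0 bits: an ideal arithmetic coder spends about
    log2(n / count(x)) bits on each occurrence of value x."""
    n = len(seq)
    return sum(c * log2(n / c) for c in Counter(seq).values())

mtf_out = [0, 0, 0, 0, 0, 2, 4, 3, 0, 1, 0]  # the slide's example
print(Counter(mtf_out))          # heavily skewed toward 0
print(round(h0_bits(mtf_out), 2))  # about 18.4 bits for 11 symbols
```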

29. Summary of BW0 BW0= BWT + MTF + Arithmetic

30. Summary of BW0 • bzip2 achieves compression close to statistical coders, with running time close to gzip. • Here we analyze only BW0, which is about 15% worse than bzip2. • Our upper bounds (roughly) apply to the better algorithms as well.

31. Our Results. Part I: The Local Entropy

32. The Local Entropy • We have the string s′ = MTF(BWT(s)) -- small numbers • In a dream world, we could compress it to its sum of logs • Define LE(s) = Σ_i log(s′_i + 1)

33. The Local Entropy • Let’s explore LE: • LE is a measure of local similarity: • LE(aaaaabbbbbbcccccddddd) ≈ 0 (only a constant cost at the start of each run)

34. The Local Entropy • Example: LE(abracadabra)= 3*log2+3*log3+1*log4+3*log5
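The abracadabra computation can be reproduced in code. The convention used here (each occurrence costs log of 1 + the number of distinct characters since its previous occurrence; a first occurrence counts distinct characters seen so far) is an assumption of this sketch, but it yields exactly the slide’s sum 3·log2 + 3·log3 + 1·log4 + 3·log5:

```python
from math import log2

def local_entropy(s):
    """LE(s): sum over positions of log2(1 + d), where d is the number
    of distinct characters since that character's previous occurrence
    (for a first occurrence: distinct characters seen so far)."""
    total = 0.0
    last = {}  # char -> index of its last occurrence
    for i, ch in enumerate(s):
        if ch in last:
            d = len(set(s[last[ch] + 1:i]))
        else:
            d = len(set(s[:i]))
        total += log2(1 + d)
        last[ch] = i
    return total

print(round(local_entropy("abracadabra"), 3))        # ~16.721
print(round(local_entropy("aaaaabbbbbcccccddddd"), 3))  # small constant
```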

35. The Local Entropy • We use two properties of LE: • BSTW: LE(s) ≤ nH_0(s) + lower-order terms • Convexity: cutting s into parts s_1,…,s_t changes LE by little: Σ_i LE(s_i) ≤ LE(s) + O(t·|Σ|·log|Σ|)

36. Observation: Locality of MTF • MTF is sort of local: • Cutting the input string into parts doesn’t influence much: only the first occurrence of each character within a part can get more expensive, i.e., at most |Σ| positions per part • So: a b a a a b a b

38. Worst case string for LE: ab…z ab…z ab…z • MTF: after the first block, every character sits at position 25 of the list, so every MTF output value is 25 • LE: ≈ n·log 26 • Entropy: nH_0 = n·log 26 as well • This *always* holds: LE(s) ≤ nH_0(s) + lower-order terms

43. (figure: the BWT as a concatenation of order-k context blocks)

44. So, in the dream world we could compress s′ down to about LE(s) bits • What about reality? – How close can we come to LE(s)?

45. Our Results. Part II: Compression of integer strings

46. What about reality? – How close can we come to the sum of logs? • Problem: compress an integer sequence close to its sum of logs

47. Compressing an integer sequence • Problem: compress an integer sequence s′ close to its sum of logs, Σ_i log(s′_i + 1)

48. First Solution • Universal encodings of integers • Integer x → a self-delimiting codeword of O(log x) bits (Elias γ: ≈ 2·log x + 1 bits; Elias δ: ≈ log x + 2·log log x bits)
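As a concrete universal code, here is the classic Elias gamma code: write ⌊log₂ x⌋ zeros, then x in binary, for a total of 2⌊log₂ x⌋ + 1 bits (Elias delta refines this to ≈ log x + 2 log log x). This is offered as one standard instance of the “universal encoding” idea on the slide, not necessarily the one the talk had in mind:

```python
def elias_gamma(x):
    """Elias gamma code for a positive integer x:
    floor(log2 x) zeros followed by x in binary."""
    if x < 1:
        raise ValueError("gamma codes positive integers; shift 0-based values by +1")
    b = bin(x)[2:]                    # binary representation, no leading zeros
    return "0" * (len(b) - 1) + b     # length prefix in unary, then the bits

print(elias_gamma(1), elias_gamma(4), elias_gamma(9))  # 1 00100 0001001
```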

49. Better Solution • Doing some math, it turns out that Arithmetic compression is good. • Not only good: It is best!

50. The order-0 math • Theorem: For any integer string s′ of length n and any μ > 1, nH_0(s′) ≤ μ·Σ_i log(s′_i + 1) + n·log ζ(μ), where ζ is the Riemann zeta function
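The theorem follows by coding value x against the distribution q(x) = 1/((x+1)^μ · ζ(μ)) and bounding entropy by cross-entropy. A quick numeric sanity check of the inequality — the random test string and the truncated zeta sum are both assumptions of this sketch:

```python
from collections import Counter
from math import log2
import random

def zeta(mu, terms=100_000):
    """Truncated Riemann zeta: sum_{i>=1} 1/i^mu, for mu > 1."""
    return sum(1 / i**mu for i in range(1, terms + 1))

def nh0(seq):
    """Total order-0 entropy of an integer sequence, in bits."""
    n = len(seq)
    return sum(c * log2(n / c) for c in Counter(seq).values())

def sum_of_logs(seq):
    return sum(log2(x + 1) for x in seq)

random.seed(0)
s = [random.randrange(8) for _ in range(1000)]  # arbitrary integer string
for mu in (1.5, 2.0, 3.0):
    bound = mu * sum_of_logs(s) + len(s) * log2(zeta(mu))
    assert nh0(s) <= bound   # the theorem's inequality
    print(mu, round(nh0(s)), "<=", round(bound))
```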