
Document Compression

This article discusses the principles and implementation of arithmetic coding in document compression. It compares the ideal performance with practical applications and explores its use in JPEG/MPEG and bzip. The algorithm, symbol interval assignment, sequence interval coding, decoding examples, and LZW and LZ77 algorithms are explained. The article also touches on dictionary-based compressors and optimization techniques.


Presentation Transcript


  1. Document Compression Arithmetic coding

  2. Introduction Arithmetic coding uses "fractional" parts of bits! It achieves < nH(T) + 2 bits versus < nH(T) + n bits for Huffman; this is the ideal performance, and in practice the overhead is about 0.02·n bits. It is used in JPEG/MPEG (as an option) and in bzip. It is more time-costly than Huffman, but an integer implementation is not too bad.

  3. Symbol interval Assign each symbol an interval in the range from 0 (inclusive) to 1 (exclusive), of width equal to its probability. E.g., with p(a) = .2, p(b) = .5, p(c) = .3: cum[a] = 0, cum[b] = p(a) = .2, cum[c] = p(a) + p(b) = .7, so a → [0, .2), b → [.2, .7), c → [.7, 1.0). The interval for a particular symbol is called the symbol interval (e.g. for b it is [.2, .7)).

  4. Sequence interval Coding the message sequence bac: start from [0, 1); after b the interval is [0.2, 0.7); after a it is [0.2, 0.3) (lower bound 0.2 + (0.7 − 0.2)·cum[a] = 0.2, width (0.7 − 0.2)·0.2 = 0.1); after c it is [0.27, 0.3) (lower bound 0.2 + (0.3 − 0.2)·0.7 = 0.27, width (0.3 − 0.2)·0.3 = 0.03). The final sequence interval is [.27, .3).

  5. The algorithm To code a sequence of symbols with probabilities pi (i = 1..n) use the following algorithm: start from [0, 1) and, for each symbol, select within the current interval the sub-interval corresponding to that symbol (in the figure, the sub-interval [0.27, 0.3) of [0.2, 0.3) selected for c, with p(a) = .2, p(b) = .5, p(c) = .3).

  6. The algorithm Each symbol Ti narrows the interval by a factor p(Ti). The final interval size is sn = ∏i=1..n p(Ti), and the sequence interval is [ln, ln + sn). To encode, take a number inside it.
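The interval narrowing above can be sketched in a few lines of Python (a minimal sketch: the probability table is the slides' running example, and the function name is my own):

```python
# Slides' model: p(a)=.2, p(b)=.5, p(c)=.3 (an example, not part of the algorithm)
p = {'a': 0.2, 'b': 0.5, 'c': 0.3}
cum = {'a': 0.0, 'b': 0.2, 'c': 0.7}   # cum[s] = sum of p over symbols before s

def sequence_interval(msg):
    """Return (l, s) such that [l, l+s) is the sequence interval of msg."""
    l, s = 0.0, 1.0                    # start with [0, 1)
    for ch in msg:
        l = l + s * cum[ch]            # shift by the symbol's lower bound
        s = s * p[ch]                  # narrow by a factor p[ch]
    return l, s

l, s = sequence_interval("bac")
print(l, l + s)   # the slides' interval [0.27, 0.3)
```

Any number inside the returned interval identifies the message, which is what the following slides exploit.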

  7. Decoding example Decoding the number .49, knowing the message is of length 3: .49 lies in b's interval [.2, .7), and rescales to (.49 − .2)/.5 = .58; .58 again lies in b's interval, rescaling to (.58 − .2)/.5 = .76; .76 lies in c's interval [.7, 1.0). The message is bbc.
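The decoding walk can be sketched the same way (again using the slides' example model p(a)=.2, p(b)=.5, p(c)=.3; the function name is illustrative):

```python
# Slides' model: p(a)=.2, p(b)=.5, p(c)=.3
p = {'a': 0.2, 'b': 0.5, 'c': 0.3}
cum = {'a': 0.0, 'b': 0.2, 'c': 0.7}

def decode(x, n):
    """Decode n symbols from a number x inside the final sequence interval."""
    out = []
    for _ in range(n):
        # find the symbol whose interval [cum[s], cum[s] + p[s]) contains x
        for ch in 'abc':
            if cum[ch] <= x < cum[ch] + p[ch]:
                out.append(ch)
                x = (x - cum[ch]) / p[ch]   # rescale the interval back to [0, 1)
                break
    return ''.join(out)

print(decode(0.49, 3))   # → bbc
```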

  8. How do we encode that number? If x = v/2^k (a dyadic fraction), then the encoding is bin(v) over k digits (possibly padded with 0s in front).

  9. How do we encode that number? Binary fractional representation, generated incrementally: FractionalEncode(x): repeat { x = 2·x; if x < 1 output 0, else { output 1; x = x − 1 } }. E.g. for x = 1/3: 2·(1/3) = 2/3 < 1, output 0; 2·(2/3) = 4/3 ≥ 1, output 1 and set x = 4/3 − 1 = 1/3; and so on.
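The incremental-generation loop above translates directly (a sketch; exact fractions are used so the 1/3 example does not suffer floating-point drift):

```python
from fractions import Fraction

def fractional_encode(x, nbits):
    """Emit the first nbits of the binary expansion of x in [0, 1)."""
    bits = []
    for _ in range(nbits):
        x = 2 * x
        if x < 1:
            bits.append('0')
        else:
            bits.append('1')
            x -= 1
    return ''.join(bits)

print(fractional_encode(Fraction(1, 3), 6))   # → 010101
```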

  10. Which number do we encode? Encode ln + sn/2 and truncate its encoding to the first d = ⌈log2(2/sn)⌉ bits (i.e. those bits followed by 0s). Truncation gets a smaller number… how much smaller? By less than 2^−d ≤ sn/2, so the truncated number still lies in [ln, ln + sn): truncation preserves correctness and yields compression.

  11. Bound on code length Theorem: for a text T of length n, the arithmetic encoder generates at most ⌈log2(2/sn)⌉ < 1 + log2(2/sn) = 1 + (1 − log2 sn) = 2 − log2(∏i=1..n p(Ti)) = 2 − log2(∏s p(s)^occ(s)) = 2 − ∑s occ(s)·log2 p(s) ≈ 2 + ∑s (n·p(s))·log2(1/p(s)) = 2 + n·H(T) bits. E.g. for T = acabc, sn = p(a)·p(c)·p(a)·p(b)·p(c) = p(a)^2 · p(b) · p(c)^2.

  12. Document Compression Dictionary-based compressors

  13. LZ77 A buffer "window" of fixed length moves over the text; the dictionary is the set of all substrings starting in the window. Algorithm's step: output ⟨dist, len, next-char⟩ for the longest match, then advance by len + 1. E.g. ⟨6, 3, a⟩ copies 3 characters starting 6 positions back and then emits a; ⟨3, 4, c⟩ copies 4 characters starting 3 positions back and then emits c.

  14. LZ77 Decoding The decoder keeps the same dictionary window as the encoder: it finds the substring at the given distance and inserts a copy of it. What if len > dist (the copy overlaps the text still being produced)? E.g. seen = abcd, next codeword is (2, 9, e). Simply copy one character at a time starting at the cursor: for (i = 0; i < len; i++) out[cursor+i] = out[cursor−dist+i]. The output is correct: abcdcdcdcdcdce.
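The overlapping-copy rule can be verified with a small sketch (the function name is mine, and for simplicity literals are modeled as (0, 0, char) triples):

```python
def lz77_decode(triples):
    """Decode LZ77 triples (dist, length, next_char). Overlapping copies
    (length > dist) work because characters are copied one at a time."""
    out = []
    for dist, length, ch in triples:
        start = len(out) - dist
        for i in range(length):
            out.append(out[start + i])   # may read chars written by this very copy
        out.append(ch)
    return ''.join(out)

# the slide's overlap example: seen = abcd, next codeword (2, 9, e)
print(lz77_decode([(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (0, 0, 'd'), (2, 9, 'e')]))
# → abcdcdcdcdcdce
```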

  15. LZ77 optimizations used by gzip LZSS: output one of two formats, (0, position, length) or (1, char); typically the second format is used when length < 3. Special greedy parsing: possibly use a shorter match so that the next match is better. A hash table speeds up searches on triplets. The triples are coded with Huffman codes.

  16. LZ78 Dictionary: substrings stored in a trie (each with an id), possibly better for cache effects. Coding loop: find the longest match S in the dictionary; output its id and the next character c after the match in the input string; add the substring Sc to the dictionary. Decoding builds the same dictionary and looks up ids.

  17. LZ78: coding example On input a a b a a c a b c a b c b the parsing is a | ab | aa | c | abc | abcb: Output (0,a), Dict. 1 = a; Output (1,b), Dict. 2 = ab; Output (1,a), Dict. 3 = aa; Output (0,c), Dict. 4 = c; Output (2,c), Dict. 5 = abc; Output (5,b), Dict. 6 = abcb.
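The coding loop can be sketched with the trie stored as a dictionary keyed by (parent id, char), an implementation choice of mine rather than something the slide prescribes:

```python
def lz78_encode(text):
    """LZ78 sketch: the trie is a dict keyed by (parent_id, char);
    id 0 stands for the empty string."""
    trie = {}                # (parent_id, char) -> id
    next_id = 1
    out = []
    cur = 0                  # id of the longest match so far
    for ch in text:
        if (cur, ch) in trie:
            cur = trie[(cur, ch)]         # extend the current match
        else:
            out.append((cur, ch))         # output (match id, next char)
            trie[(cur, ch)] = next_id     # add S·c to the dictionary
            next_id += 1
            cur = 0
    if cur:                  # flush a pending match at end of input
        out.append((cur, ''))
    return out

print(lz78_encode("aabaacabcabcb"))
# → [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]
```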

  18. LZ78: decoding example Reading the same pairs, the decoder rebuilds the dictionary and the text: Input (0,a) → a, Dict. 1 = a; (1,b) → ab, Dict. 2 = ab; (1,a) → aa, Dict. 3 = aa; (0,c) → c, Dict. 4 = c; (2,c) → abc, Dict. 5 = abc; (5,b) → abcb, Dict. 6 = abcb; yielding a a b a a c a b c a b c b.

  19. LZW (Lempel-Ziv-Welch) ['84] Don't send the extra character c, but still add Sc to the dictionary. The dictionary is initialized with the byte values as the first 256 entries (in these slides' numbering, e.g. a = 112); otherwise there is no way to start it up. The decoder is one step behind the coder, since it does not know c. There is an issue for strings of the form SSc where S[0] = c, and these are handled specially!

  20. LZW: encoding example On input a a b a a c a b c a b c b (with a = 112, b = 113, c = 114): Output 112, Dict. 256 = aa; Output 112, Dict. 257 = ab; Output 113, Dict. 258 = ba; Output 256, Dict. 259 = aac; Output 114, Dict. 260 = ca; Output 257, Dict. 261 = abc; Output 260, Dict. 262 = cab.

  21. LZW: decoding example On input codes 112, 112, 113, 256, 114, 257, 261, … the decoder rebuilds the dictionary one step later: 112 → a (nothing added yet); 112 → a, Dict. 256 = aa; 113 → b, Dict. 257 = ab; 256 → aa, Dict. 258 = ba; 114 → c, Dict. 259 = aac; 257 → ab, Dict. 260 = ca; 261 → aba, Dict. 261 = aba (the special case: the received id refers to the very entry being created).
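Both sides can be sketched compactly, including the one-step-behind dictionary and the special case; note that here the dictionary is initialized with ASCII byte values, so a = 97 rather than the slides' 112:

```python
def lzw_encode(text):
    """LZW sketch: dictionary pre-loaded with all 256 byte values."""
    d = {chr(i): i for i in range(256)}
    w, out = '', []
    for ch in text:
        if w + ch in d:
            w += ch                # extend the match S
        else:
            out.append(d[w])       # output id of S
            d[w + ch] = len(d)     # add S·c, but do NOT transmit c
            w = ch
    if w:
        out.append(d[w])
    return out

def lzw_decode(codes):
    d = {i: chr(i) for i in range(256)}
    w = d[codes[0]]
    out = [w]
    for code in codes[1:]:
        # special case (SSc with S[0] = c): the code we need is the one
        # we are about to add, so its expansion must be w + w[0]
        entry = d[code] if code in d else w + w[0]
        out.append(entry)
        d[len(d)] = w + entry[0]   # decoder is one step behind the coder
        w = entry
    return ''.join(out)

msg = "aabaacabcabcb"
codes = lzw_encode(msg)
print(codes)                      # → [97, 97, 98, 256, 99, 257, 260, 98, 99, 98]
print(lzw_decode(codes) == msg)   # → True
```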

  22. Lempel-Ziv algorithms Keep a "dictionary" of recently-seen strings. The variants differ in: how the dictionary is stored; how it is extended; how it is indexed; how elements are removed; how phrases are encoded. LZ algorithms are asymptotically optimal, i.e. their compression ratio goes to H(T) as n → ∞, with no explicit frequency estimation!

  23. You find this at: www.gzip.org/zlib/

  24. Google’s solution

  25. Document Compression Can we use simpler copy-detectors?

  26. Simple compressors: too simple? • Move-to-Front (MTF): • As a freq-sorting approximator • As a caching strategy • As a compressor • Run-Length-Encoding (RLE): • FAX compression

  27. Move-to-Front coding Transforms a char sequence into an integer sequence, which can then be var-length coded. Start with the list of symbols L = [a,b,c,d,…]; for each input symbol s, output the position of s in L (≥ 1) and move s to the front of L. E.g. with L = [a,b,c,l] and S = cabala, mtf(S) = 3 2 3 2 4 2. Properties: it is a dynamic code, with memory.
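The MTF loop is short enough to write out (a sketch; positions are reported 1-based, as on the slide):

```python
def mtf_encode(s, alphabet):
    """Move-to-Front sketch: output 1-based positions, then move to front."""
    L = list(alphabet)
    out = []
    for ch in s:
        pos = L.index(ch)
        out.append(pos + 1)          # positions are >= 1, as on the slide
        L.insert(0, L.pop(pos))      # move ch to the front of L
    return out

print(mtf_encode("cabala", "abcl"))   # → [3, 2, 3, 2, 4, 2]
```

Recently-seen symbols get small positions, which is exactly the "caching strategy" view of the previous slide.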

  28. Run-Length Encoding (RLE) If spatial locality is very high, then abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1). In the case of binary strings, just the run lengths and one initial bit suffice. Properties: it is a dynamic code, with memory.
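A one-line sketch using the standard library, mirroring the slide's (char, run-length) pairs:

```python
from itertools import groupby

def rle(s):
    """Run-Length Encoding sketch: consecutive equal chars become one pair."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]

print(rle("abbbaacccca"))   # → [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]
```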

  29. Document Compression Burrows-Wheeler Transform and bzip2

  30. The Burrows-Wheeler Transform (1994) Let us be given a text T = mississippi#. Form all its cyclic rotations and sort the rows lexicographically; F is the first column and L the last. The sorted rows are: #mississippi, i#mississipp, ippi#mississ, issippi#miss, ississippi#m, mississippi#, pi#mississip, ppi#mississi, sippi#missis, sissippi#mis, ssippi#missi, ssissippi#mi; hence L = ipssm#pissii.
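The transform can be computed naively by materializing and sorting all rotations (fine for examples; quadratic work and space, so not how real tools do it):

```python
def bwt(t):
    """Naive BWT sketch: sort all cyclic rotations, take the last column.
    Assumes t ends with a unique sentinel such as '#' that sorts lowest."""
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return ''.join(row[-1] for row in rotations)

print(bwt("mississippi#"))   # → ipssm#pissii
```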

  31. A famous example Much longer...

  32. L is highly compressible Algorithm bzip: Move-to-Front coding of L; Run-Length coding; a statistical coder. Compressing L seems promising… Key observation: L is locally homogeneous. Bzip vs. gzip: 20% vs. 33% compression ratio, but bzip is slower in (de)compression!

  33. How to compute the BWT? We said that L[i] precedes F[i] in T. Given SA and T, we have L[i] = T[SA[i] − 1]. For T = mississippi#, SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3] and, e.g., L[3] = T[8 − 1]. This is one of the main reasons for the number of publications spurred in '94–'10 on Suffix Array construction.
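The formula L[i] = T[SA[i] − 1] translates directly (0-based here; the suffix array is built naively, whereas real implementations use the specialized constructions the slide alludes to):

```python
def bwt_from_sa(t):
    """BWT via the suffix array: L[i] = T[SA[i] - 1], 0-based."""
    sa = sorted(range(len(t)), key=lambda i: t[i:])   # naive SA construction
    # when sa[i] == 0, t[-1] is the character that cyclically precedes
    # position 0, i.e. the sentinel row of the BWT matrix
    return ''.join(t[i - 1] for i in sa)

print(bwt_from_sa("mississippi#"))   # → ipssm#pissii
```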

  34. A useful tool: L → F mapping Can we map L's chars onto F's chars? We need to distinguish equal chars. Take two equal chars of L and rotate their rows rightward by one position: they keep the same relative order! Rankchar(pos) and Selectchar(pos) are key operations nowadays.

  35. The BWT is invertible Two key properties: 1. the LF mapping maps L's chars to F's chars; 2. L[i] precedes F[i] in T. So T can be reconstructed backward, starting from the row of the sentinel # and repeatedly following the LF mapping. There are several issues about efficiency in time and space.
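The backward reconstruction can be sketched via a stable sort, which realizes exactly the same-relative-order property of the LF mapping from the previous slide:

```python
def ibwt(L):
    """Invert the BWT with the LF mapping (a sketch; assumes a unique
    sentinel '#' that sorts before every other character)."""
    n = len(L)
    # order[j] = index in L of the j-th char of F; a stable sort keeps the
    # relative order of equal chars, which is the key LF-mapping property
    order = sorted(range(n), key=lambda i: L[i])
    LF = [0] * n
    for j, i in enumerate(order):
        LF[i] = j
    r, t = 0, []             # row 0 is the rotation starting with '#'
    for _ in range(n):
        t.append(L[r])       # L[r] precedes F[r] in T: walk T backward
        r = LF[r]
    t.reverse()              # now t = '#' followed by T without its final '#'
    return ''.join(t[1:] + t[:1])

print(ibwt("ipssm#pissii"))   # → mississippi#
```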

  36. An encoding example T = mississippimississippimississippi, L = ipppssssssmmmii#pppiiissssssiiiiii. Initial Mtf list = [i,m,p,s] (i.e. S; the alphabet used has |S|+1 symbols, with # at 16). Mtf = 131141111141141 411211411111211111. RLE1 = 03141041403141410210 (run lengths in Wheeler's code, e.g. Bin(6) = 110). The bzip2 output is a statistical code for the |S|+1 symbols plus the original Mtf-list (i,m,p,s) [also S].

  37. You find this in your Linux distribution
