1 / 27

Data Compression

Data Compression. Lempel-Ziv algorithms Burrows-Wheeler. Lempel-Ziv Algorithms. LZ77 (Sliding Window) Variants: LZSS (Lempel-Ziv-Storer-Szymanski) Applications: gzip, Squeeze, LHA, PKZIP LZ78 (Dictionary Based) Variants: LZW (Lempel-Ziv-Welch), LZC (Lempel-Ziv-Compress)

shalin
Download Presentation

Data Compression

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Compression Lempel-Ziv algorithms Burrows-Wheeler

  2. Lempel-Ziv Algorithms LZ77 (Sliding Window) • Variants: LZSS (Lempel-Ziv-Storer-Szymanski) • Applications: gzip, Squeeze, LHA, PKZIP LZ78 (Dictionary Based) • Variants: LZW (Lempel-Ziv-Welch), LZC (Lempel-Ziv-Compress) • Applications: compress, GIF, CCITT (modems), ARC, PAK Traditionally LZ77 was better but slower, but the gzip version is almost as fast as any LZ78.

  3. a a c a a c a b c a b a b a c previously coded LZ overview Cursor Find the longest prefix of S[i,m] that appears in S[1,i-1] write its position and length

  4. a a c a a c a b c a b a b a c a(1,1)c(1,3)(1,1)b(5,3)(6,2)(2,2) Implement by maintaining a suffix tree for the coded part  linear time and space (e.g. via Ukkonen)

  5. a a c a a c a b c a b a b a c Dictionary(previously coded) Lookahead Buffer LZ77: Sliding Window Cursor Dictionary and buffer “windows” are fixed length and slide with the cursor On each step: • Output (p,l,c)p = relative position of the longest match in the dictionaryl = length of longest matchc = next char in buffer beyond longest match • Advance window by l + 1

  6. (1,1,c) a a c a a c a b c a b a a a c (3,4,b) a a c a a c a b c a b a a a c (3,3,a) a a c a a c a b c a b a a a c (1,2,c) a a c a a c a b c a b a a a c LZ77: Example (0,0,a) a a c a a c a b c a b a a a c Dictionary (size = 6) Longest match Next character

  7. LZ77 Optimizations used by gzip LZSS: Output one of the following formats (0, position, length) or (1,char) Typically use the second format if length < 3. (1,a) a a c a a c a b c a b a a a c (1,a) a a c a a c a b c a b a a a c (1,c) a a c a a c a b c a b a a a c (0,3,4) a a c a a c a b c a b a a a c

  8. Optimizations used by gzip (cont.) • Huffman code the positions, lengths and chars • Non greedy: possibly use shorter match so that next match is better • Use hash table to store dictionary. • Hash is based on strings of length 3. • Find the longest match within the correct hash bucket. • Limit on length of search. • Store within bucket in order of position

  9. LZ78: Dictionary Lempel-Ziv Basic algorithm: • Keep dictionary of words with integer id for each entry (e.g. keep it as a trie). • Coding loop • find the longest match S in the dictionary • Output the entry id of the match and the next character past the match from the input (id,c) • Add the string Sc to the dictionary • Decoding keeps same dictionary and looks up ids

  10. LZ78: Coding Example Dict. Output 1 = a (0,a) a a b a a c a b c a b c b 2 = ab (1,b) a a b a a c a b c a b c b 3 = aa (1,a) a a b a a c a b c a b c b 4 = c (0,c) a a b a a c a b c a b c b 5 = abc (2,c) a a b a a c a b c a b c b 6 = abcb (5,b) a a b a a c a b c a b c b

  11. a a a b a a b a a a a b a a c a a b a a c a b c a a b a a c a b c a b c b LZ78: Decoding Example Dict. Input 1 = a (0,a) 2 = ab (1,b) 3 = aa (1,a) 4 = c (0,c) 5 = abc (2,c) 6 = abcb (5,b)

  12. LZW (Lempel-Ziv-Welch) [Welch84] Don’t send extra character c, but still add Sc to the dictionary. The dictionary is initialized with byte values being the first 256 entries (e.g. a = 112, ascii), otherwise there is no way to start it up. The decoder is one step behind the coder since it does not know c • Only an issue for strings of the form SSc where S[0] = c, and these are handled specially

  13. LZW: Encoding Example Output Dict. 256=aa 112 a a b a a c a b c a b c b 257=ab 112 a a b a a c a b c a b c b 258=ba 113 a a b a a c a b c a b c b 259=aac 256 a a b a a c a b c a b c b 260=ca 114 a a b a a c a b c a b c b 261=abc 257 a a b a a c a b c a b c b 262=cab 260 a a b a a c a b c a b c b

  14. a a b a a c a b c a b c b LZW: Decoding Example Input Dict 112 112 256=aa a a b a a c a b c a b c b 113 257=ab a a b a a c a b c a b c b 256 258=ba a a b a a c a b c a b c b 114 259=aac a a b a a c a b c a b c b 257 260=ca a a b a a c a b c a b c b 260 261=abc a a b a a c a b c a b c b

  15. LZ78 and LZW issues What happens when the dictionary gets too large? • Throw the dictionary away when it reaches a certain size (used in GIF) • Throw the dictionary away when it is no longer effective at compressing (used in unix compress) • Throw the least-recently-used (LRU) entry away when it reaches a certain size (used in BTLZ, the British Telecom standard)

  16. Burrows -Wheeler Currently best “balanced” algorithm for text Basic Idea: • Sort the characters by their full context (typically done in blocks). This is called the block sorting transform. • Use move-to-front encoding to encode the sorted characters. The ingenious observation is that the decoder only needs the sorted characters and a pointer to the first character of the original sequence.

  17. a b r a c a # b r a c a # a r a c a # a b a c a # a b r c a # a b r a a # a b r a c # a b r a c a שלב I : יצירת מטריצה Mשבנויה מכל ההזזות הציקליות של S: S = abraca M =

  18. # a b r a c a a # a b r a c a b r a c a # a c a # a b r b r a c a # a c a # a b r a r a c a # a b שלב II : מיון השורות בסדר לקסיקוגרפי: F L L is the Burrows Wheeler Transform

  19. Claim: Every column contains all chars. F L a b r a c a # # a b r a c a a # a b r a c b r a c a # a a b r a c a # r a c a # a b a c a # a b r a c a # a b r b r a c a # a c a # a b r a c a # a b r a a # a b r a c # a b r a c a r a c a # a b You can obtain F from L by sorting

  20. F L # a b r a c a a # a b r a c a b r a c a # a c a # a b r b r a c a # a c a # a b r a r a c a # a b The “a’s” are in the same order in L and in F, Similarly for every other char.

  21. From L you can reconstruct the string F L # a a c a # a r b a c a r b What is the first char of S ?

  22. From L you can reconstruct the string F L # a a c a # a r b a c a r b a What is the first char of S ?

  23. From L you can reconstruct the string F L # a a c a # a r b a c a r b ab

  24. From L you can reconstruct the string F L # a a c a # a r b a c a r b abr

  25. Compression ? L a c Compress the transform to a string of integers usign move to front # r a 0 2 3 2 0 3 a Then use Huffman to code the integers b

  26. Why is it good ? F L a b r a c a # # a b r a c a a # a b r a c b r a c a # a a b r a c a # r a c a # a b a c a # a b r a c a # a b r b r a c a # a c a # a b r a c a # a b r a a # a b r a c # a b r a c a r a c a # a b Characters with the same (right) context appear together

  27. Sorting is equivalent to computing the suffix array. F L a b r a c a # # a b r a c a a # a b r a c b r a c a # a a b r a c a # r a c a # a b a c a # a b r a c a # a b r b r a c a # a c a # a b r a c a # a b r a a # a b r a c # a b r a c a r a c a # a b Can encode and decode in linear time

More Related