Efficient Data Coding: Huffman & Arithmetic Algorithms

Data Coding: Run Length Coding
• Rather than code each symbol of a lengthy run of the same symbol, describe the run as a symbol and a run length of that symbol.
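A minimal sketch of the idea above, assuming the data is a plain string and each run is stored as a (symbol, run length) pair. The function names `rle_encode` and `rle_decode` are illustrative, not from the slides.

```python
from itertools import groupby

def rle_encode(data):
    """Collapse each run of identical symbols into a (symbol, run_length) pair."""
    return [(symbol, len(list(run))) for symbol, run in groupby(data)]

def rle_decode(pairs):
    """Expand (symbol, run_length) pairs back into the original string."""
    return "".join(symbol * length for symbol, length in pairs)

encoded = rle_encode("AAAABBBCCD")
print(encoded)              # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
print(rle_decode(encoded))  # AAAABBBCCD
```

Run length coding only pays off when runs are long; a string with no repeats grows, because every symbol then carries a length of 1.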

Data Coding: Entropy
• The entropy of a sequence of symbols is the theoretical minimum average number of bits per symbol needed to represent the sequence, given the probability of occurrence of each symbol.
• Suppose a coin-flipping experiment records the probability that a head or a tail occurs: Prob(head) = 0.5, Prob(tail) = 0.5.
• log2(1/Prob(symbol)) gives the optimum number of bits required to encode a symbol, given its probability of occurrence: log2(1/Prob(head)) = 1 bit (e.g. "0"), log2(1/Prob(tail)) = 1 bit (e.g. "1").
• Multiplying each value by the probability of its symbol and summing gives the average number of bits required to encode a symbol: [Prob(head) x log2(1/Prob(head))] + [Prob(tail) x log2(1/Prob(tail))] = [0.5 x 1] + [0.5 x 1] = 1.
• In practice the number of symbols and their probabilities are more complicated than in this simple example, so an intuitive assignment of bits to symbols is not possible. Instead, the following algorithms have been devised: Shannon-Fano, Huffman, Arithmetic and Lempel-Ziv-Welch.
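The coin calculation above can be checked with a short sketch. The function name `entropy` and the biased-coin probabilities are assumptions added for illustration.

```python
from math import log2

def entropy(probabilities):
    """Average bits per symbol: sum of p * log2(1/p) over symbols with p > 0."""
    return sum(p * log2(1 / p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 (the fair coin from the slide)
print(entropy([0.9, 0.1]))  # a biased coin (assumed values) needs less than 1 bit on average
```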

Data Coding: Huffman Coding
• Encoding for the Huffman coding algorithm (a bottom-up approach):
• Initialization: put all nodes in an OPEN list and keep it sorted at all times (e.g., ABCDE).
• Repeat until the OPEN list has only one node left:
  (a) From OPEN, pick the two nodes with the lowest frequencies/probabilities and create a parent node for them.
  (b) Assign the sum of the children's frequencies/probabilities to the parent node and insert it into OPEN.
  (c) Assign codes 0 and 1 to the two branches of the tree, and delete the children from OPEN.
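The steps above can be sketched as follows. This is one possible implementation, not the slides' own: a heap stands in for the sorted OPEN list, the name `huffman_codes` is hypothetical, and symbols are assumed to be strings. The counts are those used in the deck's A to E example. Ties between equal frequencies may be broken differently from a hand-drawn tree, so individual codes can differ while the code lengths and total cost stay the same.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build {symbol: bit string} bottom-up from a {symbol: frequency} map."""
    tiebreak = count()  # unique counter so heap entries never compare nodes
    # The heap plays the role of the sorted OPEN list.
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # (a) pick the two lowest-frequency nodes (this also removes them from OPEN)
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # (b) the parent carries the summed frequency and goes back into OPEN
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}

    def walk(node, prefix):
        # (c) label the two branches 0 and 1 on the way down to the leaves
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes

freqs = {"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}
codes = huffman_codes(freqs)
print({sym: len(code) for sym, code in codes.items()})
print(sum(freqs[s] * len(codes[s]) for s in freqs))  # 87
```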

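Data Coding: Huffman Coding: Encoding Algorithm
• Encoding for the Huffman coding algorithm (with sorting after each merge).
[Figure: Huffman tree for the symbol counts A (15), B (7), C (6), D (6), E (5). D and E merge into a node of weight 11, B and C into a node of weight 13, these two nodes into a node of weight 24, and A joins that node at the root of weight 39. Each pair of branches is labelled 0 and 1.]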

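Data Coding: Huffman Coding: Encoding Algorithm
• Converting between logarithm bases: if log_e(x) = y then x = e^y, so log10(x) = log10(e^y) = y log10(e) and y = log10(x) / log10(e). In the same way, log2(x) = log10(x) / log10(2).

Symbol  Count  log2(1/p)  Code length  Bits
A       15     1.38       1            15
B        7     2.48       3            21
C        6     2.70       3            18
D        6     2.70       3            18
E        5     2.96       3            15
TOTAL   39                             87

• Entropy = (15 x 1.38 + 7 x 2.48 + 6 x 2.7 + 6 x 2.7 + 5 x 2.96) / 39 = 85.26 / 39 = 2.19 bits per symbol.
• Number of bits per symbol needed for Huffman coding: 87 / 39 = 2.23.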
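The two figures above can be reproduced with a few lines. The counts come from the slide. The code lengths (1 bit for A, 3 bits for each of B to E) are an assumption read off the Huffman tree for these counts; they are the lengths that give the 87-bit total.

```python
from math import log2

counts = {"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}
# Assumed code lengths, inferred from the tree: A sits one level below the root,
# the other four symbols three levels below it.
code_lengths = {"A": 1, "B": 3, "C": 3, "D": 3, "E": 3}

total = sum(counts.values())
# Entropy in bits per symbol: each symbol contributes count * log2(1/p), p = count / total.
entropy = sum(n * log2(total / n) for n in counts.values()) / total
huffman_bits = sum(counts[s] * code_lengths[s] for s in counts)

print(total)                           # 39
print(round(entropy, 2))               # 2.19
print(huffman_bits)                    # 87
print(round(huffman_bits / total, 2))  # 2.23
```

The Huffman average sits slightly above the entropy because each code length must be a whole number of bits.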

Data Coding: Adaptive Huffman Coding: Motivations
• The previous algorithms require a priori statistical knowledge, which is often not available (e.g., live audio, video).
• Even when it is available, it can be a heavy overhead, especially when many tables have to be sent for a non-order-0 model, i.e. one that takes into account the effect of the previous symbol on the probability of the current symbol (e.g., "q" and "u" often occur together).
• The solution is to use adaptive algorithms.

Data Coding: Adaptive Huffman Coding: Algorithm
ENCODER                                  DECODER
-------                                  -------
Initialize_model();                      Initialize_model();
while ((c = getc (input)) != eof)        while ((c = decode (input)) != eof)
{                                        {
    encode (c, output);                      putc (c, output);
    update_model (c);                        update_model (c);
}                                        }
• The encoder and decoder use exactly the same initialization and update_model routines.
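A simplified sketch of the encoder/decoder symmetry above. It is not the incremental tree update used by adaptive Huffman coding proper: for brevity it rebuilds the whole Huffman code from the current counts before every symbol. Other assumptions: the alphabet is known in advance, has at least two single-character symbols, and every count starts at 1; the decoder is told the message length instead of reading an eof marker. The names `build_codes`, `alphabet` and `n_symbols` are hypothetical.

```python
import heapq
from itertools import count

def build_codes(counts):
    """Deterministically build Huffman codes from the current symbol counts."""
    tiebreak = count()
    heap = [(n, next(tiebreak), sym) for sym, n in sorted(counts.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next(tiebreak), (left, right)))
    codes, stack = {}, [(heap[0][2], "")]
    while stack:
        node, prefix = stack.pop()
        if isinstance(node, tuple):
            stack.append((node[0], prefix + "0"))
            stack.append((node[1], prefix + "1"))
        else:
            codes[node] = prefix
    return codes

def initialize_model(alphabet):
    return {sym: 1 for sym in alphabet}  # every symbol starts with count 1

def update_model(counts, sym):
    counts[sym] += 1  # the code is rebuilt from these counts on the next step

def encode(message, alphabet):
    counts = initialize_model(alphabet)
    bits = ""
    for sym in message:
        bits += build_codes(counts)[sym]
        update_model(counts, sym)
    return bits

def decode(bits, alphabet, n_symbols):
    counts = initialize_model(alphabet)
    out, pos = [], 0
    for _ in range(n_symbols):
        inverse = {code: sym for sym, code in build_codes(counts).items()}
        end = pos + 1
        while bits[pos:end] not in inverse:  # grow until a full codeword is read
            end += 1
        sym = inverse[bits[pos:end]]
        out.append(sym)
        pos = end
        update_model(counts, sym)  # same update as the encoder keeps both in step
    return "".join(out)

message = "ABRACADABRA"
bits = encode(message, "ABCDR")
print(decode(bits, "ABCDR", len(message)))  # ABRACADABRA
```

Because both sides start from the same model and apply the same update after each symbol, no code table has to be transmitted.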

Data Coding: Adaptive Huffman Coding: Example
• update_model does two things: (a) increments the count, (b) updates the Huffman tree.
• During the updates, the Huffman tree maintains its sibling property, i.e. the nodes (internal and leaf) are arranged in order of increasing weights (see figure).

Data Coding: Adaptive Huffman Coding: Example (cont.)
• When swapping is necessary, the farthest node with weight W is swapped with the node whose weight has just been increased to W+1. Note: if the node with weight W has a subtree beneath it, the subtree goes with it.
• The Huffman tree can look very different after node swapping, e.g., in the third tree, node A is swapped again and becomes the #5 node. It is now encoded using only 2 bits.

Data Coding: Adaptive Huffman Coding: Example (cont.)

Data Coding: Arithmetic Coding: Algorithm
• A message is represented by an interval of real numbers (floating point) between 0.0 and 1.0.
• As the message becomes longer, the interval needed to represent it becomes smaller, and the number of bits needed to specify that interval grows.
• A single number (floating point) in the interval can be uniquely decoded to recreate the exact stream of symbols that went into its construction.

Data Coding: Arithmetic: Example
• Consider the letters (a, e, i, o, u, EOS), where EOS represents the end of the message.
• As an example, the message "eaiiEOS" is coded.
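A minimal sketch of interval narrowing for the message "eaiiEOS". The symbol probabilities below are assumptions chosen for illustration; they are not stated in this transcript. Exact fractions are used instead of floating point so the interval arithmetic is free of rounding error, and the names `ranges`, `encode` and `decode` are hypothetical.

```python
from fractions import Fraction

# Assumed model (not taken from the slides): probabilities summing to 1.
PROBS = {"a": Fraction(2, 10), "e": Fraction(3, 10), "i": Fraction(1, 10),
         "o": Fraction(2, 10), "u": Fraction(1, 10), "EOS": Fraction(1, 10)}

def ranges(probs):
    """Map each symbol to its half-open sub-interval [low, high) of [0, 1)."""
    table, low = {}, Fraction(0)
    for sym, p in probs.items():
        table[sym] = (low, low + p)
        low += p
    return table

def encode(symbols, probs):
    """Narrow [0, 1) once per symbol and return the final interval."""
    table = ranges(probs)
    low, high = Fraction(0), Fraction(1)
    for sym in symbols:
        width = high - low
        sym_low, sym_high = table[sym]
        low, high = low + width * sym_low, low + width * sym_high
    return low, high

def decode(value, probs):
    """Recover the symbols from any number inside the final interval."""
    table = ranges(probs)
    out = []
    while not out or out[-1] != "EOS":
        for sym, (sym_low, sym_high) in table.items():
            if sym_low <= value < sym_high:
                out.append(sym)
                # Rescale so the chosen sub-interval becomes [0, 1) again.
                value = (value - sym_low) / (sym_high - sym_low)
                break
    return out

low, high = encode(["e", "a", "i", "i", "EOS"], PROBS)
print(float(low), float(high))  # 0.23354 0.2336
print(decode(low, PROBS))       # ['e', 'a', 'i', 'i', 'EOS']
```

Practical coders work with fixed-width integers and emit bits incrementally rather than holding one exact fraction for the whole message.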

Data Coding: Arithmetic: Example

Data Coding: Lempel-Ziv-Welch Compression Algorithms
• Huffman and Arithmetic coding assume a stationary source.
• Lempel-Ziv-Welch is an adaptive lossless coding technique which "learns" its symbols as it codes.
• The original methods are due to Ziv and Lempel in 1977 and 1978. Terry Welch improved the scheme in 1984 (called LZW compression).
• Lempel-Ziv-family methods are used in, e.g., zip, gzip, pkzip, winzip, GIF, V.42 bis, Stacker. Of these, GIF and V.42 bis use LZW itself, while zip and gzip use the LZ77-based DEFLATE method.
Reference: Terry A. Welch, "A Technique for High Performance Data Compression", IEEE Computer, Vol. 17, No. 6, 1984, pp. 8-19.

Data Coding: Lempel-Ziv (LZ78) Binary Compression Algorithm
• The LZ78 algorithm maintains a stack or tree containing all the phrases into which it has divided the portion of the data sequence parsed so far.
• The next phrase is formed by concatenating two items:
  1. The phrase in the structure that achieves the longest match with the beginning of the as-yet-unparsed portion of the data.
  2. The source datum beyond the end of this maximal match.
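The two-part phrase construction above can be sketched as follows, with a Python dict standing in for the tree. The function name `lz78_parse` and the (index, symbol) output format are assumptions; index 0 denotes the empty root phrase. The test string is the concatenation of the phrases listed in the deck's binary LZ78 example.

```python
def lz78_parse(data):
    """Split data into LZ78 phrases and (parent index, symbol) output pairs."""
    dictionary = {"": 0}  # phrase -> index; index 0 is the empty root phrase
    phrases, pairs = [], []
    w = ""
    for symbol in data:
        if w + symbol in dictionary:
            w += symbol  # keep extending the longest match
        else:
            # New phrase = longest match + the datum just beyond it.
            pairs.append((dictionary[w], symbol))
            dictionary[w + symbol] = len(dictionary)
            phrases.append(w + symbol)
            w = ""
    if w:  # input ended inside a phrase that is already known
        pairs.append((dictionary[w], ""))
        phrases.append(w)
    return phrases, pairs

phrases, pairs = lz78_parse("00010110000010100100100010011")
print(phrases)
# ['0', '00', '1', '01', '10', '000', '010', '100', '1001', '0001', '001', '1']
```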

Data Coding: Lempel-Ziv LZ78 Binary Coding Example
• Input string, shown parsed into phrases: 0, 00, 1, 01, 10, 000, 010, 100, 1001, 0001, 001, 1 ...
• LZ78 tree is: [figure]

Data Coding: LZ78S, LZ78E Compression Algorithm
• The LZ78S algorithm improves compression performance by noting that when a node is specified for the second time as the ancestor node of the current phrase, both encoder and decoder immediately know that this phrase must end with the remaining descendant of that node rather than the descendant that was appended earlier. (E.g. if the descendant 0 of a node has already been used, the remaining descendant must be 1.)
• The LZ78E algorithm improves compression performance by noting that:
  • after phrase 3 has been parsed, the root node will never be used as the ancestor node, because both its descendants are now in the tree;
  • after phrase 4 has been parsed, node 1 will never be used as the ancestor node, because both its descendants are now in the tree.
• These "dead" nodes no longer need to be coded.

Data Coding: Lempel-Ziv-Welch Compression Algorithm
• Rather than rely on a fixed dictionary (Webster's English dictionary contains about 159,000 entries), find a way to build the dictionary adaptively.
w = NIL;
while ( read a character k )
{
    if wk exists in the dictionary
        w = wk;
    else
    {
        add wk to the dictionary;
        output the code for w;
        w = k;
    }
}
output the code for w;    /* flush the last match */
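A runnable sketch of the compression loop above. It assumes every input character has a code point below 256, so the dictionary starts with codes 0 to 255 and new entries start at 256. The function name `lzw_compress` and the choice to return a list of integers are illustrative. The input is the string used in the deck's LZW example.

```python
def lzw_compress(text):
    """Return the list of integer codes for text (codes 0-255 are single characters)."""
    dictionary = {chr(i): i for i in range(256)}
    w = ""
    codes = []
    for k in text:
        if w + k in dictionary:
            w += k  # keep extending the current match
        else:
            codes.append(dictionary[w])
            dictionary[w + k] = len(dictionary)  # next free code, starting at 256
            w = k
    if w:
        codes.append(dictionary[w])  # flush the final match
    return codes

print(lzw_compress("^WED^WE^WEE^WEB^WET"))
# [94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]
```

Codes below 256 are plain characters (94 is "^", 87 is "W", and so on), so the output reads as ^ W E D <256> E <260> <261> <257> B <260> T.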

Data Coding: Lempel-Ziv-Welch Coding: Example
• Input string is "^WED^WE^WEE^WEB^WET".
• A 19-symbol input has been reduced to a 7-symbol plus 5-code output.
• Each code/symbol will need more than 8 bits, say 9 bits. At 9 bits each, the 12 output items take 108 bits, against 152 bits for 19 symbols of 8 bits.
• Usually, compression doesn't start until a large number of bytes (e.g., > 100) have been read in.

Data Coding: Lempel-Ziv-Welch Decompression Algorithm
read a character k;
output k;
w = k;
while ( read a character k )    /* k could be a character or a code. */
{
    if k is in the dictionary
        entry = dictionary entry for k;
    else
        entry = w + w[0];       /* code created by the encoder's most recent step */
    output entry;
    add w + entry[0] to dictionary;
    w = entry;
}
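A runnable sketch of the decompression loop above, under the same assumption as before that codes 0 to 255 stand for single characters and new entries start at 256. The function name `lzw_decompress` is illustrative. The branch for a code that is not yet in the dictionary handles the case where the encoder uses an entry in the very step after creating it. The code list is the deck's decoding example written as integers.

```python
def lzw_decompress(codes):
    """Rebuild the text from a list of LZW codes, regrowing the dictionary on the fly."""
    if not codes:
        return ""
    dictionary = {i: chr(i) for i in range(256)}
    w = dictionary[codes[0]]
    out = [w]
    for k in codes[1:]:
        if k in dictionary:
            entry = dictionary[k]
        elif k == len(dictionary):
            entry = w + w[0]  # entry the encoder created in its most recent step
        else:
            raise ValueError(f"invalid LZW code: {k}")
        out.append(entry)
        dictionary[len(dictionary)] = w + entry[0]  # mirror the encoder's new entry
        w = entry
    return "".join(out)

codes = [94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]
print(lzw_decompress(codes))  # ^WED^WE^WEE^WEB^WET
```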

Data Coding: Lempel-Ziv-Welch Decoding Example
• Input string is "^WED<256>E<260><261><257>B<260>T".

Data Coding: Binary Lempel-Ziv-Welch Coding: Example
• Input string is "00010110000010100100100010011".
