
More on Canonical Huffman coding



  1. More on Canonical Huffman coding

  2. As we have seen, canonical Huffman coding allows faster decompression and a more efficient use of memory
  • But... until now we have assumed that the code lengths are given. To prove the viability of canonical Huffman coding, we need a way to compute the code lengths
  Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006

  3. Computing code lengths - I
  • The efficiency of computing the code lengths affects only encoding performance, and encoding is less important than decoding
  • However, when whole words are used as source symbols, it is common to have an alphabet composed of many thousands of different symbols (words). In this case efficient solutions can make the difference w.r.t. the use of the traditional Huffman tree

  4. The problem
  • Input data
    • a file of n integers, where the i-th integer is the number of times symbol i appears in the text
    • if f_i is the i-th integer in the file, the probability of the i-th symbol is p_i = f_i / (f_1 + ... + f_n)
    • ... only small changes are needed if the file directly contains the probabilities
  • Output data
    • the Huffman code lengths
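
  A minimal Python sketch of this setup, assuming the counts have already been read into a list (the variable names and the example counts are mine):

      freqs = [4, 5, 7, 1]                # freqs[i-1] = occurrences of symbol i
      total = sum(freqs)
      probs = [f / total for f in freqs]  # p_i = f_i / (f_1 + ... + f_n)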

  5. The key idea
  • We use a heap data structure
  • It is easy to find the smallest value, which is located at the root
  • There is an elegant and efficient way to store the binary tree in memory using a simple vector

  6. A closer view of the heap - I
  • As we have already seen, the left child of a node in position i is in position 2i, while the right child is stored in position 2i+1. This means that the parent of a node in position k is in position floor(k/2)
  • What is the depth of the heap if there are n elements? SOL. floor(log2 n) + 1 levels, i.e. O(log n)
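
  A minimal Python sketch of this index arithmetic, assuming a 1-based array where position 0 is left unused so the formulas hold exactly (the function names are mine):

      import math

      def left(i):
          return 2 * i        # left child of the node at position i

      def right(i):
          return 2 * i + 1    # right child

      def parent(k):
          return k // 2       # floor(k/2)

      def heap_depth(n):
          return math.floor(math.log2(n)) + 1   # number of levels for n items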

  7. A closer view of the heap - II
  • How is this heap stored in an array?
  • [Figure: a min-heap with root 2, stored level by level]
    position:  1  2  3  4  5  6  7  8  9 10
    value:     2  8 10  8 17 23 12 21 11 20

  8. A closer view of the heap - III
  • Removing the smallest item and reorganizing the heap costs O(log n): at most 2 comparisons for each level of the tree
  • [Figure: the root 2 is removed, the last element 20 is moved to the root and sifted down]
    20 > 8 and 8 < 10: swap 20 with the left child 8
    20 > 8 and 8 < 17: swap 20 with the left child 8
    20 > 11 and 11 < 21: swap 20 with the right child 11
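
  A hedged Python sketch of this operation, reusing the 1-based layout above (the function names are mine):

      def sift_down(heap, i, size):
          # Restore the min-heap property below position i:
          # at most 2 comparisons per level, O(log n) overall.
          while 2 * i <= size:
              c = 2 * i                              # left child
              if c + 1 <= size and heap[c + 1] < heap[c]:
                  c += 1                             # the smaller child
              if heap[i] <= heap[c]:
                  break
              heap[i], heap[c] = heap[c], heap[i]
              i = c

      def remove_min(heap, size):
          smallest = heap[1]
          heap[1] = heap[size]          # move the last item to the root
          sift_down(heap, 1, size - 1)  # sift it down
          return smallest

      heap = [None, 2, 8, 10, 8, 17, 23, 12, 21, 11, 20]  # the slide's heap
      assert remove_min(heap, 10) == 2
      # heap[1:10] is now [8, 8, 10, 11, 17, 23, 12, 21, 20]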

  9. Construction of the heap
  • It is possible to prove that constructing a heap from an unsorted list of n items requires about 2n comparisons
  • By the way, how much does it cost, in the worst case, to sort the vector with one of the popular sorting algorithms? SOL. O(n log n)
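
  A minimal sketch of this bottom-up construction, reusing sift_down from the previous sketch:

      def build_heap(heap, size):
          # Sift down every internal node, deepest first: the total work
          # is linear (about 2n comparisons), cheaper than a full sort.
          for i in range(size // 2, 0, -1):
              sift_down(heap, i, size)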

  10. Computing code lengths - II
  • The algorithm works on an initial array with 2n positions
    • the last n positions store the frequencies of the symbols
    • the first half of the array contains a heap, used to efficiently locate the symbol with the lowest frequency
  • As entries are removed from the heap, the freed space is used to store branches of the tree

  11. Computing code lengths - III
  • The frequencies, too, are overwritten to store the pointers that constitute the tree
  • At the end the array contains only a one-element heap and the Huffman tree
  • [Figure: the evolution of the array]
    start:      positions 1..n hold the heap, positions n+1..2n hold the leaves (frequencies)
    during:     positions 1..h hold the heap, positions h+1..2n hold leaves & tree pointers
    end (h=1):  positions 2..2n hold only tree pointers

  12. Computing code lengths: phase 1
  • The frequencies are read from the file and stored in the last n positions of the array (let's call it A)
  • Each position i, 1 ≤ i ≤ n, holds the pointer n+i to the corresponding frequency A[n+i]
  • Then the first half of A is organized as a heap with the method seen before, ordered by the pointed-to frequencies
  • In practice we must ensure that A[A[i]] ≤ A[A[2i]] and A[A[i]] ≤ A[A[2i+1]]
  • At the end A[1] stores m1 such that A[m1] = min{A[n+1..2n]}

  13. Computing code lengths: phase 2

      h=n
      while h>1 {
        m1=A[1]              -- take the root of the heap
        h=h-1                -- position h+1 is no longer part of the heap
        "reorder the heap"   -- move A[h+1] into A[1] and sift it down
        m2=A[1]              -- now m1, m2 point to the two smallest freq.
        A[h+1]=A[m1]+A[m2]   -- the new item is saved in position h+1
        A[1]=h+1             -- a pointer to it is pushed back into the heap
        A[m1]=A[m2]=h+1      -- the smallest frequencies are discarded and
                             -- changed into tree pointers
        "reorder the heap"
      }
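
  A hedged Python sketch of phases 1 and 2, assuming n ≥ 2 and a 1-based array with a dummy slot 0 (all names are mine). The heap in A[1..h] stores pointers into A, ordered by the values they point to, so every comparison reads A[A[i]] — this differs from the direct-value sift_down shown earlier:

      def sift_down_ptr(A, i, h):
          # Sift the pointer at heap position i down; keys are A[pointer].
          while 2 * i <= h:
              c = 2 * i
              if c + 1 <= h and A[A[c + 1]] < A[A[c]]:
                  c += 1
              if A[A[i]] <= A[A[c]]:
                  break
              A[i], A[c] = A[c], A[i]
              i = c

      def huffman_phases_1_2(freqs):
          n = len(freqs)
          # phase 1: A[n+1..2n] = frequencies, A[i] = n+i for i = 1..n
          A = [None] + [n + 1 + i for i in range(n)] + list(freqs)
          for i in range(n // 2, 0, -1):     # heapify the pointer half
              sift_down_ptr(A, i, n)
          # phase 2: n-1 combine steps
          h = n
          while h > 1:
              m1 = A[1]                  # pointer to the smallest frequency
              h -= 1                     # position h+1 leaves the heap
              A[1] = A[h + 1]            # last heap entry moves to the root
              sift_down_ptr(A, 1, h)     # reorder the heap
              m2 = A[1]                  # pointer to the second smallest
              A[h + 1] = A[m1] + A[m2]   # new internal node in the freed slot
              A[1] = h + 1               # push a pointer to it into the heap
              A[m1] = A[m2] = h + 1      # old entries become tree pointers
              sift_down_ptr(A, 1, h)     # reorder the heap again
          return A                       # A[1] == 2; A[2] = total weight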

  14. Example
  • [Figure: one combine step on the array of frequencies 4 5 7]
    m1 and m2 point to the two smallest frequencies, 4 and 5
    their sum 9 is written into the freed position h+1, and both old entries become pointers to it
    the heap is left containing 9 and 7
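
  Running the sketch above on these frequencies (assuming the figure's values are 4, 5 and 7) gives the expected final state:

      A = huffman_phases_1_2([4, 5, 7])
      assert A[1] == 2    # the one-element heap points to position 2
      assert A[2] == 16   # the root holds the total weight 4 + 5 + 7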

  15. Computing code lengths: phase 3
  • After n-1 iterations a single aggregate remains in the heap, in position 2, and A[1]=2, as this is the only item in the heap
  • To find the depth in the tree of a particular leaf we can simply start from it and follow the pointers until we reach position 2. The number of pointers followed is the desired code length

      for i=n+1 to 2n {
        d=0; r=i                      -- d is the counter, r is the current element
        while r>2 { d=d+1; r=A[r] }   -- follow the pointers up to position 2
        A[i]=d                        -- now A[i] is the length of the codeword for leaf i
      }
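
  A hedged sketch of this quadratic third phase, run on the array A produced by the huffman_phases_1_2 sketch above:

      def naive_code_lengths(A, n):
          # Follow parent pointers from each leaf up to the root at position 2.
          for i in range(n + 1, 2 * n + 1):   # leaves live in A[n+1..2n]
              d, r = 0, i
              while r > 2:
                  d += 1
                  r = A[r]
              A[i] = d                # overwrite the pointer with the depth
          return A[n + 1:]            # code lengths of symbols 1..n

      A = huffman_phases_1_2([4, 5, 7, 1])
      assert naive_code_lengths(A, 4) == [3, 2, 1, 3]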

  16. The cost of the algorithm
  • first phase → O(n)
  • second phase → O(n log n)
    • a heap of n elements is reordered about 2n times
    • each reordering takes at most about log2 n iterations, each of which has a constant cost
  • third phase → O(n^2) in the worst case
    • there is one iteration for each bit of each codeword
    • how many total bits?
    • uniform distribution → about n log n bits...
    • ... worst case: the i-th symbol requires i bits to be coded, giving 1 + 2 + ... + n = n(n+1)/2 bits

  17. Revised third phase - I
  • Note that
    • nodes are added to the tree from position h toward position 2
    • for this reason all pointers of the tree go "from right to left": each parent sits at a smaller index than its children
  • Then, if we scan from position 2 towards position 2n, whenever we reach a node we have always already visited its parent!

  18. Revised third phase - II
  • So a very efficient algorithm (O(n)) to find the code lengths is to start from position 2, which has depth 0, and then proceed towards position 2n, labeling each position with the depth of its parent plus 1
  • The third phase then becomes

      A[2]=0
      for i=3 to 2n
        A[i]=A[A[i]]+1   -- A[A[i]] is the depth already computed for the parent of node i
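
  A hedged Python sketch of this linear pass, again on the array A produced by huffman_phases_1_2 (the function name is mine):

      def fast_code_lengths(A, n):
          # One left-to-right pass: every parent sits at a smaller index,
          # so A[A[i]] already holds the parent's depth when i is reached.
          A[2] = 0                     # the root has depth 0
          for i in range(3, 2 * n + 1):
              A[i] = A[A[i]] + 1
          return A[n + 1:]             # code lengths of symbols 1..n

      A = huffman_phases_1_2([4, 5, 7, 1])
      assert fast_code_lengths(A, 4) == [3, 2, 1, 3]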
