
More on Canonical Huffman coding



  1. More on Canonical Huffman coding

  2. As we have seen, canonical Huffman coding allows faster decompression and a more efficient use of memory
  • But... until now we have assumed that the code lengths are given. To prove the viability of canonical Huffman coding, we need a way to compute the code lengths
  Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006

  3. Computing code lengths - I
  • The efficiency of computing the code lengths affects only encoding performance, and encoding is less important than decoding
  • However, when whole words are used as source symbols, it is common to have an alphabet composed of many thousands of different symbols (words). In this case efficient solutions can make the difference w.r.t. the use of the traditional Huffman tree

  4. The problem
  • Input data
    • a file of n integers, where the i-th integer is the number of times symbol i appears in the text
    • if f_i is the i-th integer in the file, the probability of the i-th symbol is p_i = f_i / (f_1 + ... + f_n)
    • ... only small changes are needed if the file directly contains the probabilities
  • Output data
    • the Huffman code lengths
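
  A minimal Python sketch of this setup, assuming the counts have already been read into a list (the variable names and the example counts are mine):

      freqs = [4, 5, 7, 1]                # freqs[i-1] = occurrences of symbol i
      total = sum(freqs)
      probs = [f / total for f in freqs]  # p_i = f_i / (f_1 + ... + f_n)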

  5. The key idea
  • We use a heap data structure
  • It is easy to find the smallest value, which is located at the root
  • There is an elegant and efficient way to store the binary tree in memory using a simple vector

  6. A closer view of the heap - I
  • As we have already seen, the left child of a node in position i is in position 2i, while the right child is stored in position 2i+1. This means that the parent of a node in position k is in position floor(k/2)
  • What is the depth of the heap if there are n elements? SOL. floor(log2 n) + 1 levels, i.e. O(log n)
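
  A minimal Python sketch of this index arithmetic, assuming a 1-based array where position 0 is left unused so the formulas hold exactly (the function names are mine):

      import math

      def left(i):
          return 2 * i        # left child of the node at position i

      def right(i):
          return 2 * i + 1    # right child

      def parent(k):
          return k // 2       # floor(k/2)

      def heap_depth(n):
          return math.floor(math.log2(n)) + 1   # number of levels for n items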

  7. A closer view of the heap - II
  • How is this heap stored in an array?
  • [Figure: a min-heap with root 2, stored level by level]
    position:  1  2  3  4  5  6  7  8  9 10
    value:     2  8 10  8 17 23 12 21 11 20

  8. A closer view of the heap - III
  • Removing the smallest item and reorganizing the heap costs O(log n): at most 2 comparisons for each level of the tree
  • [Figure: the root 2 is removed, the last element 20 is moved to the root and sifted down]
    20 > 8 and 8 < 10: swap 20 with the left child 8
    20 > 8 and 8 < 17: swap 20 with the left child 8
    20 > 11 and 11 < 21: swap 20 with the right child 11
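
  A hedged Python sketch of this operation, reusing the 1-based layout above (the function names are mine):

      def sift_down(heap, i, size):
          # Restore the min-heap property below position i:
          # at most 2 comparisons per level, O(log n) overall.
          while 2 * i <= size:
              c = 2 * i                              # left child
              if c + 1 <= size and heap[c + 1] < heap[c]:
                  c += 1                             # the smaller child
              if heap[i] <= heap[c]:
                  break
              heap[i], heap[c] = heap[c], heap[i]
              i = c

      def remove_min(heap, size):
          smallest = heap[1]
          heap[1] = heap[size]          # move the last item to the root
          sift_down(heap, 1, size - 1)  # sift it down
          return smallest

      heap = [None, 2, 8, 10, 8, 17, 23, 12, 21, 11, 20]  # the slide's heap
      assert remove_min(heap, 10) == 2
      # heap[1:10] is now [8, 8, 10, 11, 17, 23, 12, 21, 20]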

  9. Construction of the heap
  • It is possible to prove that constructing a heap from an unsorted list of n items requires about 2n comparisons
  • By the way, how much does it cost, in the worst case, to sort the vector with one of the popular sorting algorithms? SOL. O(n log n)
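
  A minimal sketch of this bottom-up construction, reusing sift_down from the previous sketch:

      def build_heap(heap, size):
          # Sift down every internal node, deepest first: the total work
          # is linear (about 2n comparisons), cheaper than a full sort.
          for i in range(size // 2, 0, -1):
              sift_down(heap, i, size)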

  10. Computing code lengths - II
  • The algorithm works on an initial array with 2n positions
    • the last n positions store the frequencies of the symbols
    • the first half of the array contains a heap, used to efficiently locate the symbol with the lowest frequency
  • As entries are removed from the heap, the freed space is used to store branches of the tree

  11. Computing code lengths - III
  • The frequencies, too, are overwritten to store the pointers that constitute the tree
  • At the end the array contains only a one-element heap and the Huffman tree
  • [Figure: the evolution of the array]
    start:      positions 1..n hold the heap, positions n+1..2n hold the leaves (frequencies)
    during:     positions 1..h hold the heap, positions h+1..2n hold leaves & tree pointers
    end (h=1):  positions 2..2n hold only tree pointers

  12. Computing code lengths: phase 1
  • The frequencies are read from the file and stored in the last n positions of the array (let's call it A)
  • Each position i, 1 ≤ i ≤ n, holds the pointer n+i to the corresponding frequency A[n+i]
  • Then the first half of A is organized as a heap with the method seen before, ordered by the pointed-to frequencies
  • In practice we must ensure that A[A[i]] ≤ A[A[2i]] and A[A[i]] ≤ A[A[2i+1]]
  • At the end A[1] stores m1 such that A[m1] = min{A[n+1..2n]}

  13. Computing code lengths: phase 2

      h=n
      while h>1 {
        m1=A[1]              -- take the root of the heap
        h=h-1                -- position h+1 is no longer part of the heap
        "reorder the heap"   -- move A[h+1] into A[1] and sift it down
        m2=A[1]              -- now m1, m2 point to the two smallest freq.
        A[h+1]=A[m1]+A[m2]   -- the new item is saved in position h+1
        A[1]=h+1             -- a pointer to it is pushed back into the heap
        A[m1]=A[m2]=h+1      -- the smallest frequencies are discarded and
                             -- changed into tree pointers
        "reorder the heap"
      }
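
  A hedged Python sketch of phases 1 and 2, assuming n ≥ 2 and a 1-based array with a dummy slot 0 (all names are mine). The heap in A[1..h] stores pointers into A, ordered by the values they point to, so every comparison reads A[A[i]] — this differs from the direct-value sift_down shown earlier:

      def sift_down_ptr(A, i, h):
          # Sift the pointer at heap position i down; keys are A[pointer].
          while 2 * i <= h:
              c = 2 * i
              if c + 1 <= h and A[A[c + 1]] < A[A[c]]:
                  c += 1
              if A[A[i]] <= A[A[c]]:
                  break
              A[i], A[c] = A[c], A[i]
              i = c

      def huffman_phases_1_2(freqs):
          n = len(freqs)
          # phase 1: A[n+1..2n] = frequencies, A[i] = n+i for i = 1..n
          A = [None] + [n + 1 + i for i in range(n)] + list(freqs)
          for i in range(n // 2, 0, -1):     # heapify the pointer half
              sift_down_ptr(A, i, n)
          # phase 2: n-1 combine steps
          h = n
          while h > 1:
              m1 = A[1]                  # pointer to the smallest frequency
              h -= 1                     # position h+1 leaves the heap
              A[1] = A[h + 1]            # last heap entry moves to the root
              sift_down_ptr(A, 1, h)     # reorder the heap
              m2 = A[1]                  # pointer to the second smallest
              A[h + 1] = A[m1] + A[m2]   # new internal node in the freed slot
              A[1] = h + 1               # push a pointer to it into the heap
              A[m1] = A[m2] = h + 1      # old entries become tree pointers
              sift_down_ptr(A, 1, h)     # reorder the heap again
          return A                       # A[1] == 2; A[2] = total weight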

  14. Example
  • [Figure: one combine step on the array of frequencies 4 5 7]
    m1 and m2 point to the two smallest frequencies, 4 and 5
    their sum 9 is written into the freed position h+1, and both old entries become pointers to it
    the heap is left containing 9 and 7
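
  Running the sketch above on these frequencies (assuming the figure's values are 4, 5 and 7) gives the expected final state:

      A = huffman_phases_1_2([4, 5, 7])
      assert A[1] == 2    # the one-element heap points to position 2
      assert A[2] == 16   # the root holds the total weight 4 + 5 + 7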

  15. Computing code lengths: phase 3
  • After n-1 iterations a single aggregate remains in the heap, in position 2, and A[1]=2, as this is the only item in the heap
  • To find the depth in the tree of a particular leaf we can simply start from it and follow the pointers until we reach position 2. The number of pointers followed is the desired code length

      for i=n+1 to 2n {
        d=0; r=i                      -- d is the counter, r is the current element
        while r>2 { d=d+1; r=A[r] }   -- follow the pointers up to position 2
        A[i]=d                        -- now A[i] is the length of the codeword for leaf i
      }
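
  A hedged sketch of this quadratic third phase, run on the array A produced by the huffman_phases_1_2 sketch above:

      def naive_code_lengths(A, n):
          # Follow parent pointers from each leaf up to the root at position 2.
          for i in range(n + 1, 2 * n + 1):   # leaves live in A[n+1..2n]
              d, r = 0, i
              while r > 2:
                  d += 1
                  r = A[r]
              A[i] = d                # overwrite the pointer with the depth
          return A[n + 1:]            # code lengths of symbols 1..n

      A = huffman_phases_1_2([4, 5, 7, 1])
      assert naive_code_lengths(A, 4) == [3, 2, 1, 3]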

  16. The cost of the algorithm
  • first phase → O(n)
  • second phase → O(n log n)
    • a heap of n elements is reordered about 2n times
    • each reordering takes at most about log2 n iterations, each of which has a constant cost
  • third phase → O(n^2) in the worst case
    • there is one iteration for each bit of each codeword
    • how many total bits?
    • uniform distribution → about n log n bits...
    • ... worst case: the i-th symbol requires i bits to be coded, giving 1 + 2 + ... + n = n(n+1)/2 bits

  17. Revised third phase - I
  • Note that
    • nodes are added to the tree from position h toward position 2
    • for this reason all pointers of the tree go "from right to left": each parent sits at a smaller index than its children
  • Then, if we scan from position 2 towards position 2n, whenever we reach a node we have always already visited its parent!

  18. Revised third phase - II
  • So a very efficient algorithm (O(n)) to find the code lengths is to start from position 2, which has depth 0, and then proceed towards position 2n, labeling each position with the depth of its parent plus 1
  • The third phase then becomes

      A[2]=0
      for i=3 to 2n
        A[i]=A[A[i]]+1   -- A[A[i]] is the depth already computed for the parent of node i
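
  A hedged Python sketch of this linear pass, again on the array A produced by huffman_phases_1_2 (the function name is mine):

      def fast_code_lengths(A, n):
          # One left-to-right pass: every parent sits at a smaller index,
          # so A[A[i]] already holds the parent's depth when i is reached.
          A[2] = 0                     # the root has depth 0
          for i in range(3, 2 * n + 1):
              A[i] = A[A[i]] + 1
          return A[n + 1:]             # code lengths of symbols 1..n

      A = huffman_phases_1_2([4, 5, 7, 1])
      assert fast_code_lengths(A, 4) == [3, 2, 1, 3]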
