
Data Compression and Huffman Trees (HW 4)




  1. Data Compression and Huffman Trees (HW 4)

  2. Representing Text (ASCII) • A way of representing characters as bits. • Characters are ‘a’, ‘b’, ‘1’, ‘%’, ‘@’, ‘\n’, ‘\t’… • Each character is represented by a unique 7-bit code, so there are 2^7 = 128 possible characters. • This is a fixed-length (static-length) encoding. • To encode a long text, we encode it character by character.
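The fixed-length scheme on this slide can be sketched in a few lines of Java. The class name `AsciiFixedLength` and its methods are made up for illustration and are not part of the homework; the sketch assumes every input character is in the 7-bit range 0..127.

```java
final class AsciiFixedLength {
    // Encode each character as exactly 7 bits (assumes chars in 0..127).
    static String encode(String text) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c > 127) {
                throw new IllegalArgumentException("not 7-bit ASCII: " + c);
            }
            String b = Integer.toBinaryString(c);
            // Left-pad with zeros so every code is exactly 7 bits long.
            for (int k = b.length(); k < 7; k++) {
                bits.append('0');
            }
            bits.append(b);
        }
        return bits.toString();
    }

    // Decode by cutting the bit string into 7-bit chunks.
    static String decode(String bits) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i + 7 <= bits.length(); i += 7) {
            text.append((char) Integer.parseInt(bits.substring(i, i + 7), 2));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String bits = encode("e&");
        System.out.println(bits);         // 14 bits: 7 per character
        System.out.println(decode(bits)); // round-trips back to the input
    }
}
```

Because every code has the same length, the decoder needs no separators: it simply reads 7 bits at a time.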

  3. Inefficiency of ASCII • Observation: In many natural-language files, we are far more likely to see the letter ‘e’ than the character ‘&’, yet both are encoded using 7 bits! • Solution: Use variable-length encoding! The encoding for ‘e’ should be shorter than the encoding for ‘&’.

  4. Variable Length Coding • Assume we know the distribution of characters (‘e’ appears 1000 times, ‘&’ appears 1 time). • Each character will be encoded using a number of bits that shrinks as its frequency grows: frequent characters get short codes, rare ones get long codes (made precise later). • Need a ‘prefix free’ encoding, in which no code is a prefix of another: if ‘e’ = 001 then we cannot assign ‘&’ to be 0011. Since the encoding is variable length, the decoder needs to know when one code stops and the next begins.
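As a small illustration of the prefix-free requirement, the following Java sketch checks whether any code in a set is a prefix of another. `PrefixFreeCheck` and `isPrefixFree` are hypothetical names introduced here; codes are assumed to be given as strings of ‘0’ and ‘1’.

```java
import java.util.Arrays;
import java.util.List;

final class PrefixFreeCheck {
    // Returns true if no code is a prefix of a different code in the list.
    // A duplicated code counts as a violation, since it is a prefix of its copy.
    static boolean isPrefixFree(List<String> codes) {
        for (int i = 0; i < codes.size(); i++) {
            for (int j = 0; j < codes.size(); j++) {
                if (i != j && codes.get(j).startsWith(codes.get(i))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The slide's counterexample: 001 is a prefix of 0011.
        System.out.println(isPrefixFree(Arrays.asList("001", "0011")));
        // A valid prefix-free code.
        System.out.println(isPrefixFree(Arrays.asList("0", "10", "11")));
    }
}
```

The quadratic pairwise check is chosen for clarity; it is fine for an alphabet of at most 128 characters.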

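  5. Encoding Trees • Think of encoding as an (unbalanced) tree. • Data is in leaf nodes only (prefix free). • ‘e’ = 0, ‘a’ = 10, ‘b’ = 11 • How to decode ‘01110’? [Figure: encoding tree. From the root, edge 0 leads to leaf ‘e’ and edge 1 leads to an internal node, whose edge 0 leads to leaf ‘a’ and edge 1 to leaf ‘b’.]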
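The decoding question can be answered by walking the tree: start at the root, follow one edge per bit, and emit a character each time a leaf is reached. Below is a minimal Java sketch of that walk for the slide's code (‘e’ = 0, ‘a’ = 10, ‘b’ = 11). The names `TreeDecoder`, `Node`, and `slideTree` are illustrative assumptions, not the homework's required classes.

```java
final class TreeDecoder {
    static final class Node {
        final char symbol;     // meaningful only for leaves
        final Node zero, one;  // children reached by bit 0 and bit 1

        Node(char symbol) {
            this.symbol = symbol;
            this.zero = null;
            this.one = null;
        }

        Node(Node zero, Node one) {
            this.symbol = '\0';
            this.zero = zero;
            this.one = one;
        }

        boolean isLeaf() {
            return zero == null && one == null;
        }
    }

    // The tree for 'e' = 0, 'a' = 10, 'b' = 11.
    static Node slideTree() {
        return new Node(new Node('e'), new Node(new Node('a'), new Node('b')));
    }

    // Follow one edge per bit; on reaching a leaf, emit it and restart at the root.
    // Assumes the input is a valid concatenation of codes made of '0' and '1'.
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char bit : bits.toCharArray()) {
            cur = (bit == '0') ? cur.zero : cur.one;
            if (cur.isLeaf()) {
                out.append(cur.symbol);
                cur = root;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode(slideTree(), "01110")); // prints "eba"
    }
}
```

Reading ‘01110’ bit by bit gives 0 → ‘e’, 11 → ‘b’, 10 → ‘a’. No separators are needed because data sits only in the leaves.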

  6. Cost of a Tree • For each character $c_i$ let $f_i$ be its frequency in the file. • Given an encoding tree T, let $d_i$ be the depth of $c_i$ in the tree (the number of bits needed to encode the character). • The length in bits of the file after encoding it with the coding scheme defined by T will be $C(T) = \sum_i d_i f_i$.
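The cost formula is a weighted sum of depths and is easy to evaluate directly. The Java sketch below does so for the three-letter tree with depths 1, 2, 2. The frequencies (1000, 30, 20) are made-up numbers chosen only to show the arithmetic, and `TreeCost` is a hypothetical helper.

```java
final class TreeCost {
    // C(T) = sum over characters of depth * frequency, in bits.
    static long cost(int[] depths, long[] freqs) {
        if (depths.length != freqs.length) {
            throw new IllegalArgumentException("one depth per frequency expected");
        }
        long total = 0;
        for (int i = 0; i < depths.length; i++) {
            total += (long) depths[i] * freqs[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] depths = {1, 2, 2};      // 'e' = 0, 'a' = 10, 'b' = 11
        long[] freqs = {1000, 30, 20}; // assumed frequencies, for illustration only
        System.out.println(cost(depths, freqs));                // 1*1000 + 2*30 + 2*20 = 1100
        System.out.println(cost(new int[] {7, 7, 7}, freqs));   // fixed 7-bit codes: 7 * 1050 = 7350
    }
}
```

With these assumed counts, the variable-length tree needs 1100 bits where 7-bit ASCII needs 7350.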

  7. Creating an Optimal T • Problem: Find a tree T with C(T) minimal. • Solution (Huffman 1952): • Create a single-node tree for each character. The weight of the tree W(T) is the frequency of the character. • Repeat n − 1 times (n = number of chars), until a single tree remains: • Select the two trees T’, T’’ with the lowest weights. Merge them as the two children of a new root to form T. • Set W(T) = W(T’) + W(T’’) • Implement using a min-heap. • What is the running time?
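The merge loop above can be sketched as follows. This is a minimal illustration, not the homework solution: it uses `java.util.PriorityQueue` as a stand-in for the min-heap, and the names `HuffmanSketch`, `Node`, `build`, `codes`, and `cost` are assumptions. The frequencies in `main` are the ones listed for ‘huff.txt’ on the homework slide.

```java
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

final class HuffmanSketch {
    static final class Node {
        final char symbol;  // meaningful only for leaves
        final long weight;
        final Node left, right;

        Node(char symbol, long weight) {
            this.symbol = symbol;
            this.weight = weight;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {
            this.symbol = '\0';
            this.weight = left.weight + right.weight; // W(T) = W(T') + W(T'')
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    // Repeatedly merge the two lightest trees until one tree remains (n - 1 merges).
    static Node build(Map<Character, Long> freqs) {
        PriorityQueue<Node> heap =
                new PriorityQueue<>((x, y) -> Long.compare(x.weight, y.weight));
        for (Map.Entry<Character, Long> e : freqs.entrySet()) {
            heap.add(new Node(e.getKey(), e.getValue()));
        }
        while (heap.size() > 1) {
            Node a = heap.poll();
            Node b = heap.poll();
            heap.add(new Node(a, b));
        }
        return heap.poll(); // null if freqs was empty
    }

    // Walk the tree, appending 0 for a left edge and 1 for a right edge.
    private static void collect(Node n, String prefix, Map<Character, String> out) {
        if (n.isLeaf()) {
            out.put(n.symbol, prefix.isEmpty() ? "0" : prefix); // lone symbol gets "0"
            return;
        }
        collect(n.left, prefix + "0", out);
        collect(n.right, prefix + "1", out);
    }

    static Map<Character, String> codes(Map<Character, Long> freqs) {
        Map<Character, String> out = new TreeMap<>();
        Node root = build(freqs);
        if (root != null) {
            collect(root, "", out);
        }
        return out;
    }

    // C(T): total encoded length in bits.
    static long cost(Map<Character, Long> freqs, Map<Character, String> codes) {
        long total = 0;
        for (Map.Entry<Character, Long> e : freqs.entrySet()) {
            total += e.getValue() * codes.get(e.getKey()).length();
        }
        return total;
    }

    public static void main(String[] args) {
        char[] letters = {'A', 'E', 'G', 'H', 'I', 'L', 'N', 'O', 'S', 'V', 'W'};
        long[] counts = {20, 24, 3, 4, 17, 6, 5, 10, 8, 1, 2};
        Map<Character, Long> freqs = new TreeMap<>();
        for (int i = 0; i < letters.length; i++) {
            freqs.put(letters[i], counts[i]);
        }
        Map<Character, String> legend = codes(freqs);
        for (Map.Entry<Character, String> e : legend.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
        System.out.println("total bits: " + cost(freqs, legend));
    }
}
```

Each of the n − 1 merges performs two removals and one insertion on a heap of at most n trees, each costing O(log n), so the loop runs in O(n log n). Ties between equal weights may be broken differently by different heaps, so the exact codes can vary while the total cost stays the same.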

  8. Optimality Intuition • Need to show that Huffman’s algorithm indeed results in a tree T with optimal $C(T) = \sum_i d_i f_i$. • The two least-weight letters should be at the bottom of the tree as siblings (otherwise we could improve the cost by swapping). • Intuitively, when we combine two trees we can think of the result as a new letter with the combined weight.
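The swapping step can be made concrete with a short calculation. Suppose a character x with frequency $f_x$ sits at depth $d_x$, a character y with frequency $f_y$ sits at depth $d_y$, and $f_x \le f_y$ while $d_x \le d_y$ (the rarer letter is the shallower one). Let T' be the tree obtained from T by exchanging the two leaves; the symbols x, y, and T' are notation introduced here for the sketch. Only the two swapped terms of the sum change:

```latex
\begin{aligned}
C(T') - C(T) &= (f_x d_y + f_y d_x) - (f_x d_x + f_y d_y) \\
             &= (f_x - f_y)(d_y - d_x) \;\le\; 0 .
\end{aligned}
```

So moving the rarer letter deeper never increases the cost, which is why an optimal tree can always be arranged with the two least frequent letters at the bottom.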

  9. Homework • Implement: • public class HuffmanTree • public class HuffmanNode • public class BinaryHeap • Read a file ‘huff.txt’ which includes letters and frequencies: • A 20 E 24 G 3 H 4 I 17 L 6 N 5 O 10 S 8 V 1 W 2 • Create a Huffman Tree using the discussed algorithm (book 389-395) • Print “legend”: the code of each character
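For the heap part of the assignment, here is a minimal array-based binary min-heap sketch in Java. It stores plain `long` weights to keep the example short, whereas the homework's `BinaryHeap` would presumably hold `HuffmanNode` objects ordered by weight. The class name `MinHeapSketch` and its method names are assumptions, not the required interface.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class MinHeapSketch {
    // Children of index i live at 2i + 1 and 2i + 2; the parent of i is (i - 1) / 2.
    private final List<Long> items = new ArrayList<>();

    int size() {
        return items.size();
    }

    void insert(long x) {
        items.add(x);
        siftUp(items.size() - 1);
    }

    long deleteMin() {
        if (items.isEmpty()) {
            throw new IllegalStateException("empty heap");
        }
        long min = items.get(0);
        long last = items.remove(items.size() - 1);
        if (!items.isEmpty()) {
            items.set(0, last); // move the last item to the root, then restore order
            siftDown(0);
        }
        return min;
    }

    private void siftUp(int i) {
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (items.get(i) >= items.get(parent)) {
                return;
            }
            Collections.swap(items, i, parent);
            i = parent;
        }
    }

    private void siftDown(int i) {
        int n = items.size();
        while (true) {
            int left = 2 * i + 1, right = 2 * i + 2, smallest = i;
            if (left < n && items.get(left) < items.get(smallest)) smallest = left;
            if (right < n && items.get(right) < items.get(smallest)) smallest = right;
            if (smallest == i) {
                return;
            }
            Collections.swap(items, i, smallest);
            i = smallest;
        }
    }

    public static void main(String[] args) {
        MinHeapSketch heap = new MinHeapSketch();
        for (long w : new long[] {20, 24, 3, 4, 17}) { // first five weights from huff.txt
            heap.insert(w);
        }
        while (heap.size() > 0) {
            System.out.println(heap.deleteMin()); // weights come out in increasing order
        }
    }
}
```

Both `insert` and `deleteMin` touch at most one node per level, so each takes O(log n) time.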
