### Data Structures

Week 6: Assignment #2 Problem

Requirement
• Encode a message using Huffman's algorithm
• Use a min heap as the priority queue
• Use dynamic memory allocation
• The input consists of strings
• Each string contains alphabetic characters only
• Uppercase and lowercase letters are treated as different characters
• The strings are stored in a text file
• Each string is given on a separate line
Requirement (cont'd)
• Output
• The output should be stored in a text file in the following format
• Due date
• 2001/5/23 24:00

Heap Traversal:

[character or string]...

Huffman Tree Traversal:

[character or string]...

character: frequency, code

the code for the message:

Encoding
• Encode the message as one long bit string
• Assign a bit-string code to each symbol of the alphabet
• Then concatenate the codes of the symbols that make up the message to produce the encoding of the message
Example#1

| Symbol | Code |
|--------|------|
| A | 010 |
| B | 100 |
| C | 000 |
| D | 111 |

• ABACCDA
• 010100010000000111010
• Three bits are used for each symbol
• 21 bits are needed to encode the message
• inefficient
Example#2

| Symbol | Code |
|--------|------|
| A | 00 |
| B | 01 |
| C | 10 |
| D | 11 |

• ABACCDA
• 00010010101100
• Two bits are used for each symbol
• 14 bits are needed to encode the message
Example#3
• ABACCDA
• Each of the letters B and D appears only once in the message
• The letter A appears three times
• The letter A is therefore assigned a shorter bit string than the letters B and D
Example#3 (cont'd)

| Symbol | Code |
|--------|------|
| A | 0 |
| B | 110 |
| C | 10 |
| D | 111 |

• ABACCDA
• 0110010101110
• Encoding of the message requires only 13 bits
• more efficient
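
The three examples can be checked mechanically. The sketch below is only an illustration: the function name `encode` and the table names are hypothetical, and the code tables are copied from Examples #1 to #3.

```python
# Code tables copied from Examples #1, #2 and #3 (the dictionary names are hypothetical).
FIXED_3BIT = {"A": "010", "B": "100", "C": "000", "D": "111"}
FIXED_2BIT = {"A": "00", "B": "01", "C": "10", "D": "11"}
VARIABLE = {"A": "0", "B": "110", "C": "10", "D": "111"}


def encode(message, table):
    """Concatenate the code of every symbol in the message."""
    return "".join(table[symbol] for symbol in message)


for name, table in [("3-bit", FIXED_3BIT), ("2-bit", FIXED_2BIT), ("variable", VARIABLE)]:
    bits = encode("ABACCDA", table)
    print(name, bits, len(bits))   # lengths: 21, 14 and 13 bits
```

The variable-length table wins here because A, the most frequent symbol, costs one bit instead of two or three.
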
Variable-Length Code
• If variable-length codes are used
• the code for one symbol must not be a prefix of the code for another
• Example
• Suppose the code for a symbol x, c(x), is
• a prefix of the code for another symbol y, c(y)
• When c(x) is encountered in a left-to-right scan
• it is unclear whether c(x) represents the symbol x or is only the first part of c(y)
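
The prefix condition can be tested directly, and a table that satisfies it can be decoded in a single left-to-right scan. This is a minimal sketch: `is_prefix_free`, `decode`, and the table `bad` are hypothetical names and data introduced for illustration, while `good` is the table from Example #3.

```python
def is_prefix_free(table):
    """Return True if no code in the table is a prefix of another code."""
    codes = list(table.values())
    for i, first in enumerate(codes):
        for j, second in enumerate(codes):
            if i != j and second.startswith(first):
                return False
    return True


def decode(bits, table):
    """Decode a bit string with a prefix-free table in one left-to-right scan."""
    symbol_of = {code: symbol for symbol, code in table.items()}
    message, current = [], ""
    for bit in bits:
        current += bit
        if current in symbol_of:          # a complete code has been read
            message.append(symbol_of[current])
            current = ""
    if current:
        raise ValueError("bit string ends in the middle of a code")
    return "".join(message)


good = {"A": "0", "B": "110", "C": "10", "D": "111"}   # table from Example #3
bad = {"A": "0", "B": "01", "C": "10", "D": "111"}     # hypothetical: c(A) is a prefix of c(B)

print(is_prefix_free(good), is_prefix_free(bad))       # True False
print(decode("0110010101110", good))                   # ABACCDA
```

With the `bad` table the bit string 010 is ambiguous: it can be read as A followed by C (0, 10) or as B followed by A (01, 0).
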
Optimal Encoding Scheme(1)

| Symbol | Frequency |
|--------|-----------|
| A | 3 |
| B | 1 |
| C | 2 |
| D | 1 |

• Find the two symbols that appear least frequently
• These are B and D
• Combine these two symbols into the single symbol BD
• The frequency of this new symbol is the sum of the frequencies of its two symbols
• The frequency of BD is 2
Optimal Encoding Scheme (2)

| Symbol | Frequency |
|--------|-----------|
| A | 3 |
| C | 2 |
| BD | 2 |

• Again choose the two symbols with smallest frequency
• These are C and BD
• Combine these two symbols into the single symbol CBD
• The frequency of this new symbol is the sum of the frequencies of its two symbols
• The frequency of CBD is 4
Optimal Encoding Scheme (3)

| Symbol | Frequency |
|--------|-----------|
| A | 3 |
| CBD | 4 |

• There are now only two symbols remaining
• These are combined into the single symbol ACBD
• The frequency of ACBD is 7

| Symbol | Frequency |
|--------|-----------|
| ACBD | 7 |
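
The three combination steps above can be reproduced with a short sketch. It uses Python's standard `heapq` module as a stand-in for the priority queue (the assignment itself asks for a hand-written min heap). The function name `merge_steps` and the tie-breaking rule (among equal frequencies, the symbol created earlier is taken first) are assumptions chosen so that the output matches the tables.

```python
import heapq


def merge_steps(frequencies):
    """Repeatedly combine the two least frequent symbols.

    Returns one tuple (left, right, merged, frequency) per combination.
    """
    # Entries are (frequency, insertion order, symbol); the order field breaks ties.
    heap = [(freq, order, symbol)
            for order, (symbol, freq) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    next_order = len(heap)
    steps = []
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_order, s1 + s2))
        next_order += 1
        steps.append((s1, s2, s1 + s2, f1 + f2))
    return steps


steps = merge_steps({"A": 3, "B": 1, "C": 2, "D": 1})
for left, right, merged, freq in steps:
    print(f"{left} + {right} -> {merged} ({freq})")
# B + D -> BD (2)
# C + BD -> CBD (4)
# A + CBD -> ACBD (7)
```
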

Optimal Encoding Scheme (4)
• ACBD is split into A and CBD
• A and CBD are assigned the codes 0 and 1
• CBD is split into C and BD
• C and BD are assigned the codes 10 and 11
• BD is split into B and D
• B and D are assigned the codes 110 and 111
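
The splitting above amounts to undoing the combinations in reverse order: each combined symbol passes its code to its two parts, extended by 0 and 1. A minimal sketch follows, assuming the combinations are recorded as (left, right) pairs; `codes_from_merges` is a hypothetical helper name.

```python
def codes_from_merges(merges):
    """Assign codes by undoing the combinations, last combination first."""
    last_left, last_right = merges[-1]
    codes = {last_left + last_right: ""}      # the final combined symbol has the empty code
    for left, right in reversed(merges):
        prefix = codes.pop(left + right)      # code of the combined symbol
        codes[left] = prefix + "0"
        codes[right] = prefix + "1"
    return codes


# Combinations in the order they were made: B+D, then C+BD, then A+CBD.
merges = [("B", "D"), ("C", "BD"), ("A", "CBD")]
print(codes_from_merges(merges))
# {'A': '0', 'C': '10', 'B': '110', 'D': '111'}
```
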
The Huffman’s Algorithm (7)

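Huffman tree for the message ABACCDA (each node shows its symbol and frequency; the first child listed is the left child):

• ACBD 7
  • A 3
  • CBD 4
    • C 2
    • BD 2
      • B 1
      • D 1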

The Huffman’s Algorithm (8)
• Step 1: Build a min heap that contains a node for every symbol, with the frequency values as the keys
• Step 2: Delete two nodes from the heap, concatenate the two symbols, add their frequencies, and put the resulting node back into the heap
• Step 3: Make the two deleted nodes the two children of the node of the concatenated symbol

i.e., if s = s1 s2 is the symbol concatenated from s1 and s2, then s1 and s2 become the left child and right child of s

• Repeat steps 2 and 3 until only one node remains in the priority queue; that node is the root of the Huffman tree
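
The heap operations these steps rely on can be sketched as an array-based binary min heap. This is one possible layout, not necessarily the one expected for the assignment; the class name `MinHeap` and its method names are hypothetical.

```python
class MinHeap:
    """Array-based binary min heap of (key, item) pairs."""

    def __init__(self):
        self._data = []

    def __len__(self):
        return len(self._data)

    def insert(self, key, item):
        """Append the new entry, then sift it up while it is smaller than its parent."""
        data = self._data
        data.append((key, item))
        child = len(data) - 1
        while child > 0:
            parent = (child - 1) // 2
            if data[parent][0] <= data[child][0]:
                break
            data[parent], data[child] = data[child], data[parent]
            child = parent

    def delete_min(self):
        """Remove and return the entry with the smallest key."""
        data = self._data
        if not data:
            raise IndexError("delete_min from an empty heap")
        smallest = data[0]
        last = data.pop()
        if data:
            data[0] = last                    # move the last entry to the root, then sift down
            parent = 0
            while True:
                left, right = 2 * parent + 1, 2 * parent + 2
                least = parent
                if left < len(data) and data[left][0] < data[least][0]:
                    least = left
                if right < len(data) and data[right][0] < data[least][0]:
                    least = right
                if least == parent:
                    break
                data[parent], data[least] = data[least], data[parent]
                parent = least
        return smallest


heap = MinHeap()
for symbol, freq in [("A", 3), ("B", 1), ("C", 2), ("D", 1)]:
    heap.insert(freq, symbol)
keys = [heap.delete_min()[0] for _ in range(len(heap))]
print(keys)   # [1, 1, 2, 3]
```

This heap does not promise any particular order among equal keys, so a different tie-breaking can produce a differently shaped tree. Every Huffman tree for the same frequencies still gives the same total encoded length.
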
The Huffman’s Algorithm (9)
• Once the Huffman tree is constructed
• the code of any symbol can be constructed by starting at the leaf representing that symbol
• and climbing up to the root
• The code is initialized to the empty string
• Each time a left branch is climbed
• 0 is prepended to the code
• Each time a right branch is climbed
• 1 is prepended to the code
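
The climb from a leaf to the root needs a link from each node to its parent. The sketch below builds the ABACCDA tree by hand and reads the codes off it; the `Node` class and the function `code_of` are hypothetical names used only for this illustration.

```python
class Node:
    """Tree node with a parent link so that a leaf can climb to the root."""

    def __init__(self, symbol, frequency, left=None, right=None):
        self.symbol = symbol
        self.frequency = frequency
        self.left = left
        self.right = right
        self.parent = None
        for child in (left, right):
            if child is not None:
                child.parent = self


def code_of(leaf):
    """Climb to the root, putting 0 (left branch) or 1 (right branch) in front of the code."""
    code = ""
    node = leaf
    while node.parent is not None:
        bit = "0" if node.parent.left is node else "1"
        code = bit + code
        node = node.parent
    return code


# The tree for ABACCDA: ACBD -> (A, CBD), CBD -> (C, BD), BD -> (B, D).
a, b, c, d = Node("A", 3), Node("B", 1), Node("C", 2), Node("D", 1)
bd = Node("BD", 2, b, d)
cbd = Node("CBD", 4, c, bd)
root = Node("ACBD", 7, a, cbd)

print({leaf.symbol: code_of(leaf) for leaf in (a, b, c, d)})
# {'A': '0', 'B': '110', 'C': '10', 'D': '111'}
```
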
The Huffman’s Algorithm (10)

VAR

position[i] : a pointer to the leaf node of the ith symbol

n : the number of symbols with nonzero frequency

frequency[i] : the relative frequency of the ith symbol

code[i] : the code assigned to the ith symbol

p, p1, p2 : pointers to nodes of the min heap or the Huffman tree

Main Function

{

initialization;

count the frequency of each symbol within the message;

// construct a node for each symbol

for (i = 0; i < n; i++) {

p = create a node with key frequency[i];

position[i] = p; // a pointer to the leaf containing the ith symbol

insert p into the min heap;

} // end for

The Huffman’s Algorithm (11)

while (the min heap contains more than one item) {

p1 = delete the minimum from the min heap;

p2 = delete the minimum from the min heap;

// combine p1 and p2 as branches of a single tree

p = create a node with key info(p1) + info(p2);

set p1 to be the left child of p;

set p2 to be the right child of p;

insert p into the min heap;

} // end while

The Huffman’s Algorithm (12)

// the tree is now constructed; use it to find codes

root = delete the minimum from the min heap;

for (i = 0; i < n; i++) {

p = position[i];

code[i] = the empty string;

while (p != root) {

// travel up to the root

if (p is a left child)

code[i] = 0 followed by code[i];

else

code[i] = 1 followed by code[i];

p = father of p;

} // end while

} // end for

} // end main
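
The pseudocode on the last three slides can be turned into a runnable sketch. This is not a reference solution for the assignment. It uses Python's `heapq` in place of the hand-written min heap and a dictionary in place of the `position` array, and the insertion-order tie-breaker is an assumption. The function name `huffman_codes` is hypothetical; the names `position`, `p1`, `p2`, `p`, `root` and `father` follow the slides.

```python
import heapq
from collections import Counter


class Node:
    def __init__(self, symbol, frequency, left=None, right=None):
        self.symbol, self.frequency = symbol, frequency
        self.left, self.right, self.father = left, right, None


def huffman_codes(message):
    """Build the tree bottom-up, then climb from each leaf to the root."""
    frequency = Counter(message)              # count the frequency of each symbol
    position = {}                             # symbol -> leaf node
    heap = []
    order = 0                                 # tie-breaker (assumption: older nodes first)
    for symbol, freq in frequency.items():    # construct a node for each symbol
        leaf = Node(symbol, freq)
        position[symbol] = leaf
        heapq.heappush(heap, (freq, order, leaf))
        order += 1

    while len(heap) > 1:                      # combine the two smallest nodes
        _, _, p1 = heapq.heappop(heap)
        _, _, p2 = heapq.heappop(heap)
        p = Node(p1.symbol + p2.symbol, p1.frequency + p2.frequency, p1, p2)
        p1.father = p2.father = p
        heapq.heappush(heap, (p.frequency, order, p))
        order += 1

    _, _, root = heapq.heappop(heap)          # raises IndexError for an empty message
    code = {}
    for symbol, p in position.items():
        bits = ""
        while p is not root:                  # travel up to the root
            bits = ("0" if p.father.left is p else "1") + bits
            p = p.father
        code[symbol] = bits
    return frequency, code


message = "ABACCDA"
frequency, code = huffman_codes(message)
for symbol in frequency:
    print(f"{symbol}: {frequency[symbol]}, {code[symbol]}")
encoded = "".join(code[symbol] for symbol in message)
print(encoded)   # 0110010101110
```

One edge case is left open: a message with a single distinct symbol produces a tree that is just the root, so that symbol receives the empty code and would need special handling.
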