Huffman Codes Computing an Optimal Code for a Document
Objectives You will be able to: • Create an optimal code for an ASCII text file. • Encode the text file using the optimal code and output the compressed text as a binary file. • Read the compressed binary file and reconstruct the original ASCII text. • Output the decoded message to a text file. • Encode and decode a large text file • Moby Dick
Getting Started • Download program from last class. • http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/2011_04_13_Huffman_Codes_with_Binary_IO/ • File Huffman_Codes_with_Binary_IO.zip • A bit of cleanup • Improve the prompts as shown on the following slides. • Delete commented out sections in main.cpp • Remove output of sorted list in Make_Decode_Tree
Modifications to Prompts
main.cpp
• In do_decode (line 29)
    //cout << "File name for input? ";
    cout << "File name for compressed input file? ";
• In do_encode (line 89)
    //cout << "File name for output? ";
    cout << "File name for compressed output file? ";
An Error on Circe
• Binary_File.h, line 14 should be:
    static const size_t FIRST_BIT_POSITION = 8*sizeof(size_t);
• int and size_t are the same size on 32-bit Windows systems.
• They are not the same size on Circe, and probably not on other 64-bit systems.
• Other errors and warnings on Circe have fairly obvious fixes.
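The point of the fix is that the bit count must come from the type that actually holds the bits. As a minimal sketch, the two helpers below (hypothetical names, not part of Binary_File.h) compute the width of size_t without hard-coding 8 and report whether int happens to have the same width on the current machine.

```cpp
#include <climits>
#include <cstddef>
#include <limits>

// Hypothetical helper (not part of Binary_File.h): number of bits in a size_t.
// CHAR_BIT is the number of bits per byte, so the 8 is not hard-coded.
inline std::size_t bits_in_size_t()
{
    return CHAR_BIT * sizeof(std::size_t);
}

// Hypothetical helper: true when int and size_t have the same width.
// This is the assumption that holds on 32-bit Windows but typically
// fails on 64-bit systems.
inline bool int_matches_size_t()
{
    return sizeof(int) == sizeof(std::size_t);
}
```

Printing sizeof(int), sizeof(size_t), and bits_in_size_t() on both machines is a quick way to confirm whether the two platforms differ.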
Text Files for Testing • Download to a convenient directory: • Full text of Moby Dick • http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/Moby_Dick.txt • Abridged version • http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/Moby_Quick.txt
Get Input from a File • Modify the Huffman Code program to get its input for encode from a text file rather than from the keyboard.
main.cpp
• Insert above do_encode:
    void get_text_input_file(string& input_filename, ifstream& infile)
    {
        string junk;
        while (true)
        {
            cout << "File name for text input? ";
            cin >> input_filename;
            getline(cin, junk);     // Skip newline char
            infile.open(input_filename.c_str());
            if (infile.good())
            {
                break;
            }
            infile.clear();
            cout << "Open failed for file " << input_filename << endl;
            cout << "Please try again\n";
        }
    }
http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/2011_04_18_Huffman_Code_for_Document/get_text_input_file.cpp.txt
do_encode() Revised version that gets input from a file rather than from the keyboard: http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/2011_04_18_Huffman_Code_for_Document/do_encode.cpp.txt
do_encode()
    void do_encode(void)
    {
        string msg;
        string output_filename;
        Binary_Output_File* outfile;
        string junk;
        string input_filename;
        ifstream infile;

        get_text_input_file(input_filename, infile);

        while (true)
        {
            cout << "\nFile name for compressed output file? ";
            cin >> output_filename;
            getline(cin, junk);     // Skip newline char
            try
            {
                outfile = new Binary_Output_File(output_filename);
                break;
            }
            catch (const string& msg)
            {
                cout << msg << endl;
            }
        }
do_encode()
        //cout << "\n\nEnter message to encode\n";
        //getline(cin, msg);
        while (infile.good())
        {
            char next_char;
            infile.get(next_char);
            if (!infile)
            {
                break;      // Read failed (end of file); nothing to encode
            }
            string code = huffman_tree.Encode_Char(tolower(next_char));
            if (code.size() == 0)
            {
                cout << endl << "Invalid character in input "
                     << next_char << endl;
                continue;
            }
            outfile->Output(code);
        }
        infile.close();
        cout << endl << endl;
        outfile->Close();
        delete(outfile);
        cout << "File " << output_filename << " written\n";
    }
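When reading a stream one character at a time, it matters where the end-of-file test happens. The sketch below uses hypothetical helper names and an in-memory stream instead of a file. It contrasts a loop whose condition is the read itself with a loop that tests good() before reading and never checks whether the read succeeded.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: returns every character in the stream, in order.
// The loop condition is the read itself, so the body only runs for
// characters that were actually read.
inline std::string read_all_chars(std::istream& in)
{
    std::string result;
    char next_char;
    while (in.get(next_char))
    {
        result += next_char;
    }
    return result;
}

// Hypothetical helper: counts how many times the loop body runs when
// good() is tested before the read and the read is not checked afterwards.
// After the last real character the stream is still good, so the body
// runs one extra time with a failed get().
inline int count_unchecked_iterations(std::istream& in)
{
    int iterations = 0;
    while (in.good())
    {
        char next_char = '\0';
        in.get(next_char);      // May fail; the result is ignored here
        ++iterations;
    }
    return iterations;
}
```

For a stream holding the two characters "ab", read_all_chars returns "ab", while count_unchecked_iterations reports three passes through the loop body. In an encoder, that third pass is where a stale character would be encoded.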
Some Issues • White space • newline characters lost • Punctuation • Capitalization • Let's build a code specifically for this document. • Include all characters. • Optimize weights for the document.
Developing a Code for the Document • New version of build_huffman_tree • Read the input text file and count occurrences of each character. • Also total number of characters in the file • For each ASCII value that appears in the input text file • Compute relative frequency. • Add char and frequency to the Huffman tree.
New build_huffman_tree()
http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/2011_04_18_Huffman_Code_for_Document/build_huffman_tree.cpp.txt
    void build_huffman_tree(ifstream& infile)
    {
        int counts[128] = {0};
        int total = 0;

        // Count characters in the input file.
        while (infile.good())
        {
            char next_char;
            infile.get(next_char);
            if (!infile)
            {
                break;      // Read failed (end of file); nothing to count
            }
            assert (next_char > 0);
            assert (next_char <= 127);
            ++counts[next_char];
            ++total;
        }
        infile.close();
        infile.clear();
New build_huffman_tree()
        for (int i = 0; i < 128; ++i)
        {
            if (counts[i] > 0)
            {
                huffman_tree.Add(i, (1.0*counts[i]) / total);
            }
        }
    }
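For intuition about what happens to the characters and frequencies handed to huffman_tree.Add, here is a self-contained sketch of the standard Huffman construction using std::priority_queue. It is not the course's Huffman tree class. Node, HeavierFirst, count_chars, assign_codes, and build_codes are names made up for this illustration.

```cpp
#include <istream>
#include <map>
#include <memory>
#include <queue>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical node type for this sketch only.
struct Node
{
    double weight;                  // Relative frequency of the subtree
    int symbol;                     // Character value, or -1 for internal nodes
    std::shared_ptr<Node> left;
    std::shared_ptr<Node> right;
};

// Orders the priority queue so the lightest node is on top (a min-heap).
struct HeavierFirst
{
    bool operator()(const std::shared_ptr<Node>& a,
                    const std::shared_ptr<Node>& b) const
    {
        return a->weight > b->weight;
    }
};

// Count occurrences of each character value in the stream.
inline std::map<int, long> count_chars(std::istream& in)
{
    std::map<int, long> counts;
    char c;
    while (in.get(c))
    {
        ++counts[static_cast<unsigned char>(c)];
    }
    return counts;
}

// Walk the tree: left edges append '0', right edges append '1'.
inline void assign_codes(const std::shared_ptr<Node>& node,
                         const std::string& prefix,
                         std::map<int, std::string>& codes)
{
    if (!node) return;
    if (node->symbol >= 0)
    {
        // A document with a single distinct character still needs one bit.
        codes[node->symbol] = prefix.empty() ? "0" : prefix;
        return;
    }
    assign_codes(node->left, prefix + "0", codes);
    assign_codes(node->right, prefix + "1", codes);
}

// Repeatedly merge the two lightest nodes until one tree remains.
inline std::map<int, std::string> build_codes(const std::map<int, long>& counts)
{
    long total = 0;
    for (const auto& kv : counts) total += kv.second;

    std::priority_queue<std::shared_ptr<Node>,
                        std::vector<std::shared_ptr<Node>>,
                        HeavierFirst> heap;
    for (const auto& kv : counts)
    {
        double freq = static_cast<double>(kv.second) / total;
        heap.push(std::make_shared<Node>(Node{freq, kv.first, nullptr, nullptr}));
    }
    while (heap.size() > 1)
    {
        std::shared_ptr<Node> a = heap.top(); heap.pop();
        std::shared_ptr<Node> b = heap.top(); heap.pop();
        heap.push(std::make_shared<Node>(Node{a->weight + b->weight, -1, a, b}));
    }
    std::map<int, std::string> codes;
    if (!heap.empty()) assign_codes(heap.top(), "", codes);
    return codes;
}
```

For the text "aaaabbc", the most frequent character a receives a 1-bit code, and b and c each receive a 2-bit code. This is the effect that weighting the tree for one specific document is meant to produce.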
main.cpp
• Add at top:
    #include <cassert>
    …
    string input_filename;
    ifstream infile;
• Add to main()
    int main(void)
    {
        cout << "This is the Huffman code program \n";
        get_text_input_file(input_filename, infile);
        build_huffman_tree(infile);
do_encode()
• We have to reopen the input file after reading it in build_huffman_tree.
• No longer call get_text_input_file.
• Comment out the call to get_text_input_file near the top.
• At line 104:
    //cout << "\n\nEnter message to encode\n";
    //getline(cin, msg);
    infile.open(input_filename.c_str());
    while (infile.good())
    {
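do_encode()
• At line 112 remove the call to tolower(), so that the character is passed to Encode_Char unchanged:
    infile.open(input_filename.c_str());
    while (infile.good())
    {
        char next_char;
        infile.get(next_char);
        if (!infile)
        {
            break;      // Read failed (end of file); nothing to encode
        }
        string code = huffman_tree.Encode_Char(next_char);
• We now can encode all characters.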
So far, so good! • The program seems to be working for a short file. • Let's try it on the full text. • You may not want to wait for the complete output!
Output Decoded Message to a File
• Add above do_decode():
http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/2011_04_18_Huffman_Code_for_Document/get_text_output_file.cpp.txt
    void get_text_output_file(string& output_filename, ofstream& outfile)
    {
        string junk;
        while (true)
        {
            cout << "File name for text output? ";
            cin >> output_filename;
            getline(cin, junk);     // Skip newline char
            outfile.open(output_filename.c_str());
            if (outfile.good())
            {
                break;
            }
            outfile.clear();
            cout << "Open failed for file " << output_filename << endl;
            cout << "Please try again\n";
        }
    }
Output Decoded Message to a File
• At end of do_decode
        original_message = huffman_tree.Decode_Msg(coded_message);
        //cout << "Original message: " << original_message << endl;
        //cout << endl << endl;
        string output_filename;
        ofstream outfile;
        get_text_output_file(output_filename, outfile);
        outfile << original_message;
        outfile.close();
        cout << "File " << output_filename << " written";
        cout << endl << endl;
    }
On Circe (After some tweaking)
Embedding the Code • In order for the compressed file to be useful, we have to store the code along with it. • Then we can read and decode the file at a later time. • Even on a different computer (with the same architecture) • In order to decode • First read the code. • Reconstitute the decode tree. • Then read and decode the message. Project 7
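One way to store the code with the compressed data is to write the character counts as a header, so that the decoder can rebuild the same tree before reading the message. The sketch below is only an illustration under assumed conventions: the header layout (an entry count followed by symbol and count pairs) and the names write_u32, read_u32, write_header, and read_header are all hypothetical, and nothing here is the required format for the project.

```cpp
#include <cstdint>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>

// Write a 32-bit value as four bytes, most significant byte first.
inline void write_u32(std::ostream& out, std::uint32_t value)
{
    for (int shift = 24; shift >= 0; shift -= 8)
    {
        out.put(static_cast<char>((value >> shift) & 0xFF));
    }
}

// Read four bytes written by write_u32. Returns false on a short read.
inline bool read_u32(std::istream& in, std::uint32_t& value)
{
    value = 0;
    for (int i = 0; i < 4; ++i)
    {
        char c;
        if (!in.get(c)) return false;
        value = (value << 8) | static_cast<unsigned char>(c);
    }
    return true;
}

// Assumed header layout: number of entries, then (symbol, count) pairs.
inline void write_header(std::ostream& out,
                         const std::map<unsigned char, std::uint32_t>& counts)
{
    write_u32(out, static_cast<std::uint32_t>(counts.size()));
    for (const auto& kv : counts)
    {
        out.put(static_cast<char>(kv.first));
        write_u32(out, kv.second);
    }
}

// Rebuild the count table from a header written by write_header.
inline bool read_header(std::istream& in,
                        std::map<unsigned char, std::uint32_t>& counts)
{
    counts.clear();
    std::uint32_t entries = 0;
    if (!read_u32(in, entries)) return false;
    for (std::uint32_t i = 0; i < entries; ++i)
    {
        char symbol;
        std::uint32_t count = 0;
        if (!in.get(symbol) || !read_u32(in, count)) return false;
        counts[static_cast<unsigned char>(symbol)] = count;
    }
    return true;
}
```

Writing each count byte by byte in a fixed order is a design choice in this sketch. It keeps the header independent of the machine's native integer layout, whereas dumping raw integers would tie the file to one architecture. The decoder must also build its tree from the counts in exactly the same way as the encoder, or the two codes will not match.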