
### An Overview of Different Compression Algorithms

Their application to compressing inverted files

• Arithmetic coding

• Huffman coding

• Character-based

• Word-based

• Dictionary-based coding – Ziv-Lempel family of coding

• Factors that need to be considered

• Compression ratio

• Speed

• Random access

• In modern IR systems, word-based Huffman coding is commonly used

• There is a lot of research on Ziv-Lempel family coding to see whether it can be applied to index compression
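To make the word-based Huffman idea concrete, here is a minimal sketch in Python. It treats whole words (and the separators between them) as the symbols of the alphabet, so frequent words get short codes. The tokenization rule and the function names `build_word_huffman`, `encode`, and `decode` are assumptions chosen for illustration, not part of any particular IR system.

```python
import heapq
import re
from collections import Counter


def build_word_huffman(text):
    """Return (tokens, table), where table maps each token to a bit string."""
    # Assumed tokenization: runs of word characters and runs of separators.
    tokens = re.findall(r"\w+|\W+", text)
    freq = Counter(tokens)
    if len(freq) <= 1:
        # Degenerate input: zero or one distinct symbol.
        return tokens, {sym: "0" for sym in freq}
    # Heap entries: (weight, unique tie-breaker, {symbol: partial code}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return tokens, heap[0][2]


def encode(tokens, table):
    """Concatenate the code of every token."""
    return "".join(table[t] for t in tokens)


def decode(bits, table):
    """Decode by reading bits until they match a code (codes are prefix-free)."""
    inverse = {code: sym for sym, code in table.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)
```

Because every code word is self-delimiting, decoding can begin at any code-word boundary, which is one reason this scheme fits the random-access factor listed above. A real system would pack the bits into bytes and store the code table alongside the compressed text.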

• Conventional LZ family compression algorithms use a sliding window approach.

• Based on longest matching length (m-length)

• An improved sliding window LZ algorithm is proposed by Bender and Wolf.

• Instead of m-length, the improved algorithm is based on the offset of the length (o-length) and the differential of the length (-length)

• Better compression ratio in the experiment

• Still linear compression and searching: O(n).

• It does not really provide an LZ algorithm that supports random access.
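The conventional sliding-window approach based on the longest match can be sketched as follows. This is a simplified, textbook-style LZ77 coder in Python, not the improved algorithm of Bender and Wolf; the function names, the `window` and `max_len` parameters, and the `(offset, length, next_char)` triple format are assumptions for illustration.

```python
def lz77_compress(data, window=4096, max_len=18):
    """Greedy sliding-window LZ: emit (offset, length, next_char) triples."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        start = max(0, i - window)
        # Scan the window for the longest match with the text starting at i.
        for j in range(start, i):
            k = 0
            # The match may overlap position i; one symbol is always left
            # over so that next_char exists.
            while (k < max_len and i + k < len(data) - 1
                   and data[j + k] == data[i + k]):
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out


def lz77_decompress(triples):
    """Rebuild the text by copying from already decoded output."""
    buf = []
    for off, length, ch in triples:
        for _ in range(length):
            buf.append(buf[-off])  # symbol-by-symbol copy handles overlaps
        buf.append(ch)
    return "".join(buf)
```

Note that the decoder has to replay every triple from the beginning to reach a given position, which illustrates why this family of coders does not offer random access.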

• Proposed by Williams

• Uses literal/copy items;

• At each step, transmit the original data if it is a literal item, or a pointer if it is a copy item;

• Aimed at faster compression speed and smaller memory footprint.

• Better suited to embedded systems where real-time compression is required.

• Inappropriate for index compression.
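The literal/copy item scheme above can be illustrated with a simplified Python sketch. It is written in the spirit of Williams's approach (a single table lookup per position instead of a full window search) but does not reproduce his exact data format; the item tuples, the limits `min_match`, `max_match`, and `max_offset`, and the function names are assumptions.

```python
def compress_items(data, min_match=3, max_match=16, max_offset=4095):
    """Return a list of items: ('lit', byte) or ('copy', offset, length)."""
    table = {}  # maps a min_match-byte key to the last position it was seen
    items, i = [], 0
    while i < len(data):
        key = data[i:i + min_match]
        cand = table.get(key)          # single candidate, no window scan
        if len(key) == min_match:
            table[key] = i
        length = 0
        if cand is not None and i - cand <= max_offset:
            while (length < max_match and i + length < len(data)
                   and data[cand + length] == data[i + length]):
                length += 1
        if length >= min_match:
            # Copy item: a pointer back into the already seen data.
            items.append(("copy", i - cand, length))
            i += length
        else:
            # Literal item: the original byte is transmitted as-is.
            items.append(("lit", data[i]))
            i += 1
    return items


def decompress_items(items):
    """Replay the items: append literals, resolve copies from the output."""
    out = bytearray()
    for item in items:
        if item[0] == "lit":
            out.append(item[1])
        else:
            _, offset, length = item
            for _ in range(length):
                out.append(out[-offset])
    return bytes(out)
```

Checking only one candidate per position trades compression ratio for speed and a small, fixed-size working set, which matches the stated design goals.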

• To date, the best practical compression algorithm for indexes is still word-based Huffman coding.

• There are theoretical studies of Ziv-Lempel family coding. None of them is practically applicable to our problem, but they can be used in other areas.

• An Improved Data Compression Algorithm Based on Ziv-Lempel Data Compression Algorithm, Paul Edward Bender and Jack Keil Wolf

• An Extremely Fast Ziv-Lempel Data Compression Algorithm, Ross N. Williams

• Modern Information Retrieval, Ricardo Baeza-Yates and Berthier Ribeiro-Neto