
Language-Model Based Text-Compression


Presentation Transcript


  1. Language-Model Based Text-Compression James Connor and Antoine El Daher

  2. Compressing with Structure • Compression • Huffman • Arithmetic • Lempel-Ziv (LZ78, LZ77) • Most popular compression tools are based on LZ77 • Exploiting structure • Our goal: incorporate prior knowledge about the structure of the input sequence
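Huffman coding, listed on this slide, builds a prefix-free code in which frequent symbols get short codewords. A minimal sketch of the standard heap-based construction (illustrative only, not the authors' code):

```python
# Minimal Huffman code construction: repeatedly merge the two
# least-frequent subtrees. Illustrative sketch, not the authors' code.
import heapq
from typing import Dict

def huffman_codes(freqs: Dict[str, int]) -> Dict[str, str]:
    """Map each symbol to a prefix-free bitstring; frequent symbols get short codes."""
    if len(freqs) == 1:
        return {sym: "0" for sym in freqs}  # degenerate one-symbol alphabet
    # Heap entries: (frequency, tie-breaker, {symbol: codeword-so-far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f0, _, left = heapq.heappop(heap)
        f1, _, right = heapq.heappop(heap)
        # Prepend one bit to every codeword in each merged subtree.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f0 + f1, tie, merged))
        tie += 1
    return heap[0][2]

print(huffman_codes({"e": 12, "t": 9, "a": 8, "q": 1}))
# 'e' gets the shortest codeword, 'q' the longest.
```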

  3. Perplexity and Entropy • The achievable compression rate is bounded by the entropy of the sequence to be compressed: $H(X) = -\sum_x p(x) \log_2 p(x)$ bits per symbol • A low-perplexity language model is also a low-entropy distribution: $PP(X) = 2^{H(X)}$, i.e. $H(X) = \log_2 PP(X)$
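To make the relationship concrete, a small sketch on a toy distribution (the numbers are hypothetical, not results from this project):

```python
# Toy illustration of the entropy/perplexity relationship
# (numbers are hypothetical, not from this project).
import math

probs = [0.5, 0.25, 0.125, 0.125]                 # p(x) over four symbols
entropy = -sum(p * math.log2(p) for p in probs)   # H(X) in bits/symbol
perplexity = 2 ** entropy                         # PP = 2^H

print(entropy)     # 1.75: no lossless code can average fewer bits/symbol
print(perplexity)  # 3.3636...: the model's effective branching factor
```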

  4. Character N-grams • Represent text as an n-th order Markov chain of characters • Maintain counts of character n-grams • Build a library of Huffman tables based on these counts
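A minimal sketch of this step, reusing the huffman_codes helper from the earlier sketch; the function and variable names are illustrative, not the authors' implementation:

```python
# Sketch: count character n-grams and build one Huffman table per
# (n-1)-character context, using huffman_codes from the sketch above.
from collections import defaultdict

def train_char_model(text: str, n: int = 3):
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text) - n + 1):
        context, ch = text[i:i + n - 1], text[i + n - 1]
        counts[context][ch] += 1           # how often ch follows context
    return {ctx: huffman_codes(freqs) for ctx, freqs in counts.items()}

tables = train_char_model("the theme then thaws", n=3)
# tables["th"] maps each character observed after "th" to a codeword.
```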

  5. Compressing the File • Training • For each bigram in the training set, we keep a map of all the words that can follow it, along with their probabilities. • E.g. “to have” → (“seen”, 0.1), (“been”, 0.1), (UNK, 0.1), etc. • Then for each bigram, we build a Huffman tree.
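A sketch of the training step just described, again reusing huffman_codes from above; the names (train_word_model, UNK) and the UNK pseudo-count are assumptions for illustration:

```python
# Map each bigram to counts of the words that follow it, reserve an
# UNK slot, and build one Huffman tree per bigram.
from collections import defaultdict

UNK = "<UNK>"

def train_word_model(words, unk_count: int = 1):
    followers = defaultdict(lambda: defaultdict(int))
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        followers[(w1, w2)][w3] += 1
    for dist in followers.values():
        dist[UNK] = unk_count             # codeword for unseen continuations
    return {bg: huffman_codes(dist) for bg, dist in followers.items()}

tables = train_word_model("to have seen it to have been to have seen".split())
# tables[("to", "have")] holds codewords for "seen", "been", and <UNK>.
```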

  6. Compressing the File • Compressing: • We go through the input file, using the Huffman trees from the training set to code each word based on the two preceding words. • If the trigram is unknown, we code the UNK token, then revert to a unigram model (also coded using Huffman). • If the unigram is unknown, we use a character-level Huffman code (trained on the training set) to code it. • Decompression works similarly; the decoder mimics the same behavior.
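The fallback chain might look like the following sketch, which assumes tables built as in the training sketch above; encode_word and the table names are hypothetical:

```python
# Fallback chain for coding one word: trigram table, then UNK + unigram
# table, then UNK + character-level Huffman. Assumes word_tables (per
# bigram), unigram_table, and char_table, each with an UNK entry.

def encode_word(prev2, prev1, word, word_tables, unigram_table, char_table):
    bits = ""
    table = word_tables.get((prev2, prev1))
    if table is not None:
        if word in table:
            return table[word]            # known trigram: single codeword
        bits += table[UNK]                # signal: fall back to unigrams
    if word in unigram_table:
        return bits + unigram_table[word]
    bits += unigram_table[UNK]            # signal: fall back to characters
    # Spell the word character by character (a real coder also needs a
    # terminator so the decoder knows where the word ends).
    return bits + "".join(char_table[c] for c in word)
```

Because the decoder trains on the same data, it knows exactly which table the encoder consulted at each step and can mirror every fallback decision.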

  7. Extensions • We maintain a sliding context window while compressing a file: a word's counts are incremented when it enters the window and decremented when it leaves. This makes better use of the local context for trigrams and bigrams and gives more representative weights.
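A sketch of the window bookkeeping; WINDOW_SIZE and the count structure are assumptions for illustration:

```python
# Sliding-window bookkeeping: a word's count rises when it enters the
# window and falls when it leaves.
from collections import deque, defaultdict

WINDOW_SIZE = 1000
window = deque()
local_counts = defaultdict(int)

def slide(word: str):
    window.append(word)
    local_counts[word] += 1               # entering word gains weight
    if len(window) > WINDOW_SIZE:
        old = window.popleft()
        local_counts[old] -= 1            # leaving word loses weight
        if local_counts[old] == 0:
            del local_counts[old]
    # The Huffman tables for affected contexts would be rebuilt (or
    # incrementally patched) so codeword lengths track the local counts.
```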

  8. Results • Competitive with Gzip
