
Lecture 4: Lossless Compression (1). Hongli Luo, Fall 2011.


Presentation Transcript


  1. Lecture 4: Lossless Compression (1) • Hongli Luo • Fall 2011

  2. Lecture 4: Lossless Compression (1) • Topics (Chapter 7) • Introduction • Basics of Information Theory • Compression techniques • Lossless compression • Lossy compression

  3. 4.1 Introduction • Compression: the process of coding that will effectively reduce the total number of bits needed to represent certain information.

  4. Introduction • Compression ratio: compression ratio = B0 / B1 • B0 - number of bits before compression • B1 - number of bits after compression • For example, if 40 bits of raw data are coded in 10 bits, the compression ratio is 40/10 = 4.

  5. Types of Compression • Lossless compression • Does not lose information – the signal can be perfectly reconstructed after decompression • Produces a variable bit-rate • It is not guaranteed to actually reduce the data size • Depends on the characteristics of the data • Example: WinZip • Lossy compression • Loses some information – the signal is not perfectly reconstructed after decompression • Can produce any desired constant bit-rate • Example: JPEG, MPEG

  6. 4.2 Basics of Information Theory • Model information at the source • Model data at the source as a stream of symbols – this defines the "vocabulary" of the source. • Each symbol in the vocabulary is represented by bits • If your vocabulary has N symbols, each symbol can be represented with log2 N bits. • Text by ASCII code – 8 bits/code: N = 2^8 = 256 symbols • Speech – 16 bits/sample: N = 2^16 = 65,536 symbols • Color image – 3 x 8 bits/sample: N = 2^24 ≈ 17 x 10^6 symbols • 8 x 8 image blocks – 8 x 64 bits/block: N = 2^512 ≈ 10^154 symbols
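As a quick illustration (a minimal Python sketch, not part of the lecture), the fixed-length cost per symbol is simply ⌈log2 N⌉ bits for a vocabulary of N symbols:

```python
import math

# Bits needed for a fixed-length code over a vocabulary of N symbols.
def bits_per_symbol(n_symbols: int) -> int:
    return math.ceil(math.log2(n_symbols))

print(bits_per_symbol(256))      # ASCII text: 8 bits/symbol
print(bits_per_symbol(65536))    # 16-bit speech samples: 16 bits/symbol
print(bits_per_symbol(2**24))    # 24-bit color pixels: 24 bits/symbol
```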

  7. Lossless Compression • Lossless compression techniques ensure no loss of data after compression/decompression. • Coding: "translate" each symbol in the vocabulary into a "binary codeword". Codewords may have different binary lengths. • Example: you have 4 symbols (a, b, c, d). A fixed-length binary code needs 2 bits per symbol, but the symbols can be coded with different numbers of bits: • a(00) -> 000 • b(01) -> 001 • c(10) -> 01 • d(11) -> 1 • The goal of coding is to minimize the average symbol length
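For illustration, a small sketch of encoding and decoding with the slide's 4-symbol codebook; the prefix property (no codeword is a prefix of another) is what makes unambiguous decoding possible:

```python
# Prefix codebook from the slide's example (a, b, c, d).
codebook = {'a': '000', 'b': '001', 'c': '01', 'd': '1'}

def encode(symbols):
    return ''.join(codebook[s] for s in symbols)

def decode(bits):
    reverse = {code: sym for sym, code in codebook.items()}
    out, buf = [], ''
    for bit in bits:
        buf += bit
        if buf in reverse:          # prefix property: a match is never ambiguous
            out.append(reverse[buf])
            buf = ''
    return out

msg = ['d', 'd', 'c', 'a', 'd']
bits = encode(msg)
print(bits)                         # 11010001 (8 bits instead of 10 with fixed 2-bit codes)
print(decode(bits) == msg)          # True
```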

  8. Average Symbol Length • The vocabulary of the source has N symbols • l(i) – binary length of the ith symbol • Symbol i has been emitted m(i) times • M – total number of symbols that the source emits (in every T seconds) • Number of bits emitted in T seconds: B = Σi m(i) l(i) • Probability P(i) of a symbol: the fraction of the transmission in which it occurs, defined as P(i) = m(i) / M

  9. Average Symbol Length • Average length per symbol (average symbol length): l_avg = Σi P(i) l(i) = (Σi m(i) l(i)) / M bits/symbol • Average bit rate: R = M × l_avg / T bits per second
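A short sketch of these formulas with made-up symbol counts and codeword lengths (the numbers are hypothetical, not from the lecture):

```python
# m(i): times each symbol was emitted; l(i): codeword length in bits.
counts  = {'a': 10, 'b': 20, 'c': 30, 'd': 40}
lengths = {'a': 3,  'b': 3,  'c': 2,  'd': 1}

M = sum(counts.values())                               # total symbols emitted
total_bits = sum(counts[s] * lengths[s] for s in counts)
avg_len = total_bits / M                               # = sum of P(i) * l(i)

T = 1.0                                                # assume the M symbols span 1 second
bit_rate = M * avg_len / T                             # bits per second

print(avg_len, bit_rate)                               # 1.9 bits/symbol, 190.0 bits/s
```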

  10. Minimum Average Symbol Length • Goal of compression • To minimize the number of bits being transmitted • Equivalent to minimizing the average symbol length • How to reduce the average symbol length • Assign shorter codewords to symbols that appear more frequently • Assign longer codewords to symbols that appear less frequently

  11. Minimum Average Symbol Length • What is the lower bound of the average symbol length? • It is determined by the entropy • Shannon's Information Theorem • The average binary length of the encoded symbols is always greater than or equal to the entropy H of the source

  12. Entropy • The entropy η of an information source with alphabet S = {s1, s2, …, sn} is: η = H(S) = Σi pi log2(1/pi) = − Σi pi log2 pi • pi – probability that symbol si will occur in S. • log2(1/pi) indicates the amount of information (self-information as defined by Shannon) contained in si, which corresponds to the number of bits needed to encode si.

  13. Entropy • The entropy is a characteristic of a given source of symbols • Entropy is largest (equal to log2 N) when all symbols are equally probable – the chance of each symbol appearing is similar, i.e., the symbols are uniformly distributed in the source • Entropy is small (but always >= 0) when some symbols are much more likely to appear than others
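A small sketch of the entropy formula, comparing a uniform distribution with a skewed one (the skewed probabilities are made up for illustration):

```python
import math

# Entropy of a discrete source: eta = sum of p_i * log2(1 / p_i).
def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

N = 4
uniform = [1 / N] * N                    # all symbols equally probable
skewed  = [0.85, 0.05, 0.05, 0.05]       # one symbol dominates

print(entropy(uniform))    # 2.0 = log2(4): the maximum for N = 4
print(entropy(skewed))     # about 0.85 bits/symbol: well below log2(4)
```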

  14. Entropy and code length • The entropy η represents the average amount of information contained per symbol in the source S. • The entropy specifies the lower bound for the average number of bits to code each symbol in S, i.e., η <= l_avg • l_avg – the average length (measured in bits) of the codewords produced by the encoder. • Efficiency of the encoder: η / l_avg
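A brief sketch of coding efficiency for a hypothetical source whose codeword lengths happen to match the symbol probabilities exactly:

```python
import math

# Coding efficiency = entropy / average codeword length (both in bits/symbol).
probs   = [0.5, 0.25, 0.125, 0.125]    # hypothetical symbol probabilities
lengths = [1, 2, 3, 3]                 # codeword lengths matched to those probabilities

eta = sum(p * math.log2(1 / p) for p in probs)
avg_len = sum(p * l for p, l in zip(probs, lengths))

print(eta, avg_len, eta / avg_len)     # 1.75, 1.75, efficiency 1.0
```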

  15. Distribution of Gray-Level Intensities • Fig. 7.2(a) shows the histogram of an image with a uniform distribution of gray-level intensities, i.e., pi = 1/256 for every level. Hence, the entropy of this image is η = log2 256 = 8 bits per symbol (Eq. 7.4). • No compression is possible for this image!

  16. 4.3 Compression Techniques • Compression techniques are broadly classified into • Lossless compression • Run-length encoding • Variable-length coding (entropy coding): • Shannon-Fano algorithm • Huffman coding • Adaptive Huffman coding • Arithmetic coding • LZW • Lossy compression

  17. Run-length Encoding • A sequence of elements c1, c2, …, ci, … is mapped to runs (ci, li) • ci = the symbol • li = length of the run of symbol ci • For example, given the sequence of symbols {1,1,1,1,3,3,4,4,4,3,3,5,2,2,2}, the run-length encoding is (1,4),(3,2),(4,3),(3,2),(5,1),(2,3) • Apply run-length encoding to a bi-level image (with only 1-bit black and white pixels) • Assume the starting run is of a particular color (either black or white) • Code the length of each run
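A minimal run-length encoder/decoder sketch reproducing the slide's example sequence:

```python
# Run-length encoding: collapse each run of identical symbols into (symbol, length).
def rle_encode(seq):
    runs = []
    for s in seq:
        if runs and runs[-1][0] == s:
            runs[-1] = (s, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((s, 1))               # start a new run
    return runs

def rle_decode(runs):
    return [s for s, length in runs for _ in range(length)]

seq = [1,1,1,1,3,3,4,4,4,3,3,5,2,2,2]
runs = rle_encode(seq)
print(runs)                        # [(1, 4), (3, 2), (4, 3), (3, 2), (5, 1), (2, 3)]
print(rle_decode(runs) == seq)     # True
```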

  18. Variable Length Coding • VLC maps fixed-length input symbols to variable-length codewords • VLC is one of the best-known entropy coding methods • Methods of VLC • Shannon-Fano algorithm • Huffman coding • Adaptive Huffman coding

  19. Shannon-Fano Algorithm • A top-down approach. Steps: 1. Sort the symbols according to the frequency count of their occurrences. 2. Recursively divide the symbols into two parts, each with approximately the same number of counts, until all parts contain only one symbol. • An example: coding of "HELLO" • Sort the symbols according to their frequencies: L(2), H(1), E(1), O(1)

  20. Assign bit 0 to the left branches and bit 1 to the right branches of the coding tree. One valid assignment is L: 0, H: 10, E: 110, O: 111, giving 10 coded bits in total.

  21. Coded bits: 10 bits • Raw data, 8 bits per character (5 characters): 40 bits • Compression ratio: B0/B1 = 40/10 = 4, i.e., the coded data is 25% of the original size
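A minimal Python sketch of the top-down procedure above. The tie-break when the two halves cannot be made exactly equal is a design choice, so other valid Shannon-Fano trees for "HELLO" exist; this one reproduces the 10-bit result from the slides.

```python
from collections import Counter

# Shannon-Fano: sort by frequency, then recursively split into two parts with
# roughly equal total counts, assigning 0 to one part and 1 to the other.
def shannon_fano(freqs, prefix=''):
    if len(freqs) == 1:
        return {freqs[0][0]: prefix or '0'}
    total = sum(c for _, c in freqs)
    # pick the split point that makes the two parts' counts as equal as possible
    running, best_i, best_diff = 0, 1, float('inf')
    for i, (_, c) in enumerate(freqs[:-1], start=1):
        running += c
        diff = abs(2 * running - total)
        if diff < best_diff:
            best_i, best_diff = i, diff
    codes = {}
    codes.update(shannon_fano(freqs[:best_i], prefix + '0'))
    codes.update(shannon_fano(freqs[best_i:], prefix + '1'))
    return codes

message = "HELLO"
freqs = sorted(Counter(message).items(), key=lambda kv: -kv[1])
codes = shannon_fano(freqs)
coded_bits = sum(len(codes[ch]) for ch in message)
print(codes)                                                    # {'L': '0', 'H': '10', 'E': '110', 'O': '111'}
print(coded_bits, "coded bits vs", 8 * len(message), "raw bits")  # 10 vs 40
```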
