
Information Retrieval

Explore the use of Huffman codes in data compression, their relationship to entropy, and the potential for maximizing compression efficiency. Discuss the concept of prefix codes, uniquely decodable codes, and the optimality of Huffman codes. Examine the use of canonical Huffman codes and byte-aligned codewords in compression algorithms. Highlight potential issues with Huffman coding and suggest strategies for improving compression efficiency.

Presentation Transcript


  1. Information Retrieval: Data compression

  2. Architectural features… The memory hierarchy we must deal with: CPU registers and L1/L2 cache (few Mbs, some nanosecs per access, few words fetched), RAM (few Gbs, tens of nanosecs, some words fetched), hard disk (few Tbs, few millisecs, pages of B = 32K), network (many Tbs, even secs, packets).

  3. How much can we compress? For lossless compression, assuming all input messages are valid, if even one string is compressed then some other must expand. Take all messages of length n: is it possible to compress them into fewer bits? No: there are 2^n of them, but fewer shorter bit strings to map them to. We therefore need to talk about stochastic sources.

  4. Entropy (Shannon, 1948) For a set of symbols S, each symbol s having probability p(s), the self-information of s is i(s) = log2(1/p(s)) bits. Lower probability → higher information. The entropy H(S) = Σs p(s) log2(1/p(s)) is the weighted average of the i(s).
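
These two definitions translate directly into code. A minimal Python sketch (the example distribution is the one used later for Huffman on slide 12):

```python
import math

def self_information(p: float) -> float:
    """i(s) = log2(1/p(s)): lower probability -> higher information (in bits)."""
    return math.log2(1.0 / p)

def entropy(probs) -> float:
    """H(S) = sum over s of p(s) * log2(1/p(s)), the weighted average of i(s)."""
    return sum(p * self_information(p) for p in probs if p > 0)

# Distribution used in the Huffman example of slide 12.
print(entropy([0.1, 0.2, 0.2, 0.5]))   # ~1.76 bits per symbol
```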

  5. Statistical Coding How do we use the p(s) to encode s? • Prefix codes and relationship to Entropy • Huffman codes • Arithmetic codes

  6. Uniquely Decodable Codes A variable length code assigns a bit string (codeword) of variable length to every symbol, e.g. a = 1, b = 01, c = 101, d = 011. What if you get the sequence 1011? Is it aba, ca, or ad? A uniquely decodable code is a variable length code in which bit strings can always be uniquely decomposed into their codewords.

  7. Prefix Codes A prefix code is a variable length code in which no codeword is a prefix of another, e.g. a = 0, b = 100, c = 101, d = 11. It can be viewed as a binary trie with edges labeled 0/1 and symbols at the leaves.

  8. Average Length For a code C with probability p(s) associated to symbol s and codeword length |C[s]|, the average length is defined as la(C) = Σs p(s) |C[s]|. We say that a prefix code C is optimal if for all prefix codes C’, la(C) ≤ la(C’). Fact (Kraft-McMillan): for any optimal uniquely decodable code there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…
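
A quick sanity check of these definitions in Python; the codeword lengths are those of slide 7’s prefix code, while the probabilities are illustrative (not from the slides):

```python
def average_length(probs, lengths):
    """la(C) = sum over s of p(s) * |C[s]|."""
    return sum(p * l for p, l in zip(probs, lengths))

def kraft_sum(lengths):
    """Kraft-McMillan: lengths are realizable by a prefix code iff sum 2^-l <= 1."""
    return sum(2.0 ** -l for l in lengths)

# Codeword lengths of slide 7's prefix code: a=0, b=100, c=101, d=11.
lengths = [1, 3, 3, 2]
probs = [0.5, 0.2, 0.2, 0.1]          # illustrative probabilities (assumption)
print(kraft_sum(lengths))             # 1.0 -> a complete prefix code exists
print(average_length(probs, lengths)) # 1.9 bits per symbol on average
```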

  9. A property of optimal codes Theorem. If C is an optimal prefix code for the source S = {p1, …, pn}, then pi < pj implies l(ci) ≥ l(cj). Golden rule: assign shorter codewords to more frequent symbols.

  10. Relationship to Entropy Theorem (lower bound, Shannon). For any probability distribution p(S) with associated uniquely decodable code C, we have H(S) ≤ la(C). Theorem (upper bound). For any probability distribution p(S) and its optimal prefix code C, la(C) ≤ H(S) + 1. (On the slide the two bounds are annotated Huffman and Arithmetic: Huffman attains the upper bound, Arithmetic coding gets essentially down to the entropy.)

  11. Huffman Codes Invented by Huffman as a class assignment in the early ’50s. Used in many, if not most, compression algorithms: gzip, bzip, jpeg (as an option), fax compression, … Properties: • generates optimal prefix codes • cheap to generate codes • cheap to encode and decode • la = H if the probabilities are powers of 2 • otherwise, at most 1 extra bit per symbol!!!

  12. Example p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5. Merging a and b (weight .3), then c (weight .5), then d (weight 1) gives the codewords a = 000, b = 001, c = 01, d = 1. What about ties? What about tree depth?

  13. Huffman Codes Huffman Algorithm • Start with a set of singleton trees (leaves), one per symbol s and with weight p(s) • Repeat: • Select two trees with minimum weight roots p1and p2 • Join into one single tree by adding a new root with weight p1 + p2
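
A minimal sketch of this greedy merging in Python, using a heap to pick the two minimum-weight roots (the alphabet and probabilities are those of slide 12):

```python
import heapq
from itertools import count

def huffman_code(freqs: dict) -> dict:
    """Return a codeword per symbol by repeatedly joining the two lightest trees."""
    tie = count()   # tie-breaker so that heap tuples never compare the dicts
    # Each heap entry: (weight, tie-breaker, {symbol: codeword-so-far}).
    heap = [(p, next(tie), {s: ""}) for s, p in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate one-symbol alphabet
        return {s: "0" for s in freqs}
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # two minimum-weight roots
        p2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

print(huffman_code({"a": .1, "b": .2, "c": .2, "d": .5}))
# Codeword lengths match slide 12: d gets 1 bit, c gets 2, a and b get 3
# (the actual 0/1 labels depend on how ties are broken).
```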

  14. Encoding and Decoding Encoding: emit the root-to-leaf path leading to the symbol to be encoded. Decoding: start at the root and take the branch indicated by each bit received; when a leaf is reached, output its symbol and return to the root. With the code of slide 12 (a = 000, b = 001, c = 01, d = 1): encoding abc… gives 00000101…, and decoding 101001… gives dcb…
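
The same encode/decode loop as a small table-driven Python sketch (equivalent to the tree walk described above; codewords from slide 12):

```python
CODE = {"a": "000", "b": "001", "c": "01", "d": "1"}   # code table from slide 12

def encode(text: str) -> str:
    """Concatenate the codeword (root-to-leaf path) of each symbol."""
    return "".join(CODE[ch] for ch in text)

def decode(bits: str) -> str:
    """Consume bits, emitting a symbol whenever a whole codeword has been read."""
    inverse = {cw: s for s, cw in CODE.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:       # reached a leaf of the code tree
            out.append(inverse[buf])
            buf = ""             # back to the root
    return "".join(out)

print(encode("abc"))     # 00000101, as on the slide
print(decode("101001"))  # dcb, as on the slide
```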

  15. A property on tree contraction

  16. Huffman codes are optimal

  17. Bounding the Huffman-codes’ length We derive the bound on lx and ly (by cases px < py and vice versa).

  18. Optimum vs. Huffman

  19. Canonical Huffman Codes Huffman codes can be made succinct in the representation of the codeword tree, and fast in (de)coding; the slide shows the resulting encoding and decoding tables.
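
One common way to build such a canonical code from the codeword lengths alone, sketched in Python; the exact ordering convention is an assumption, not taken from the slide:

```python
def canonical_codes(lengths: dict) -> dict:
    """Assign canonical codewords given only the codeword lengths.

    Symbols are sorted by (length, symbol); each codeword is the previous one
    plus 1, shifted left whenever the length grows. A decoder can rebuild the
    whole code from the list of lengths, with no explicit tree."""
    code, prev_len, out = 0, 0, {}
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)           # pad with zeros when length grows
        out[sym] = format(code, f"0{length}b")
        code += 1
        prev_len = length
    return out

# Codeword lengths from the Huffman example of slide 12: d:1, c:2, a:3, b:3.
print(canonical_codes({"a": 3, "b": 3, "c": 2, "d": 1}))
# {'d': '0', 'c': '10', 'a': '110', 'b': '111'}
```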

  20. You find this at

  21. Byte-aligned Huffword [Moura et al, 98] Compressed text derived from a word-based Huffman code: • the symbols of the Huffman tree are the words of T (example on the slide: T = “bzip or not bzip”, with words bzip, or, not and the space) • the Huffman tree has fan-out 128, so each codeword is a sequence of 7-bit chunks • codewords are byte-aligned: each 7-bit chunk goes into one byte, and the spare bit tags the codeword boundary.
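
A rough sketch of the tagging idea, assuming the convention that the spare 8th bit marks the first byte of each codeword; the per-word codewords (lists of 7-bit digits) below are made up for illustration:

```python
def tag_codeword(digits):
    """Pack one codeword, given as 7-bit digits of a fan-out-128 Huffman tree,
    into bytes whose high bit is 1 on the first byte and 0 on continuation
    bytes, so codeword boundaries stay visible inside C(T)."""
    out = bytearray()
    for i, d in enumerate(digits):
        assert 0 <= d < 128
        out.append((0x80 if i == 0 else 0x00) | d)
    return bytes(out)

# Hypothetical codewords for the words of T = "bzip or not bzip":
# "bzip" -> [5], " " -> [0], "or" -> [3, 17], "not" -> [9]   (made-up digits)
words = [[5], [0], [3, 17], [0], [9], [0], [5]]   # bzip, space, or, space, not, space, bzip
C_T = b"".join(tag_codeword(cw) for cw in words)

# Searching a word = byte-searching its tagged codeword directly in C(T).
print(tag_codeword([5]) in C_T)   # True, with no decompression
```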

  22. CGrep and other ideas… To search a pattern P = bzip in T = “bzip or not bzip”, take the byte-aligned tagged codeword of P and run a plain byte-search over the compressed text C(T). Note: (1) no need to decompress the text, the search works directly over C(T).

  23. You find this at You find it under my Software projects

  24. Problem with Huffman Coding Consider a message with probability .999. The self-information of this message is log2(1/.999) ≈ .00144 bits. If we were to send 1000 such messages we might hope to use 1000 · .00144 ≈ 1.44 bits. Using Huffman, we take at least one bit per message, so we would require 1000 bits.

  25. What can we do? • Huffman may lose up to 1 bit per symbol, and this may be a lot for highly probable symbols. • We can “enlarge” the symbol: a macro-symbol = block of k symbols gives 1/k bits of inefficiency per symbol, but a larger model has to be transmitted (possibly too much, since the message is of limited length). Shannon considered infinite sequences, i.e. k → ∞ !! • Arithmetic coding does better, achieving almost the optimum for order-0 entropy coders: nH0 + 2 bits in theory, nH0 + 0.02 n bits in practice.

  26. Arithmetic Coding: Introduction Allows using “fractional” parts of bits!! Used in PPM, JPEG/MPEG (as option), Bzip More time costly than Huffman, but integer implementation is not too bad.

  27. Arithmetic Coding (message intervals) Assign each symbol an interval in the range from 0 (inclusive) to 1 (exclusive): e.g. with p(a) = .2, p(b) = .5, p(c) = .3 take f(a) = .0, f(b) = .2, f(c) = .7, i.e. a → [0,.2), b → [.2,.7), c → [.7,1). The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

  28. Arithmetic Coding: Encoding Example Coding the message sequence bac: start from [0,1); b narrows it to [.2,.7), a narrows it to [.2,.3), c narrows it to [.27,.3). The final sequence interval is [.27,.3).
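
The interval narrowing of this example as a small floating-point Python sketch (a real coder uses the integer version of slide 37 to avoid precision problems):

```python
# Symbol intervals from slide 27: a = [.0,.2), b = [.2,.7), c = [.7,1.0).
F = {"a": (0.0, 0.2), "b": (0.2, 0.7), "c": (0.7, 1.0)}

def sequence_interval(msg: str):
    """Narrow [low, low + size) by each symbol's sub-interval (float sketch)."""
    low, size = 0.0, 1.0
    for ch in msg:
        f_lo, f_hi = F[ch]
        low, size = low + size * f_lo, size * (f_hi - f_lo)
    return low, low + size

print(sequence_interval("bac"))   # approximately (0.27, 0.3), as on the slide
```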

  29. Arithmetic Coding To code a sequence of symbols x1…xn with probabilities pi (i = 1..n), start from l0 = 0, s0 = 1 and update li = li-1 + si-1 · f(xi), si = si-1 · pi. Each message narrows the interval by a factor of pi, so the final interval size is s = p1 · p2 · … · pn. The interval [ln, ln + sn) for a message sequence will be called the sequence interval.

  30. Uniquely defining an interval Important property: the intervals for distinct messages of length n never overlap. Therefore specifying any number in the final interval uniquely determines the message. Decoding is similar to encoding, but at each step we need to determine which symbol interval the number falls into and then reduce the interval accordingly.

  31. Arithmetic Coding: Decoding Example Decoding the number .49, knowing the message is of length 3: .49 lies in b’s interval [.2,.7) → b; within it, .49 lies in b’s sub-interval [.3,.55) → b; within that, .49 lies in c’s sub-interval [.475,.55) → c. The message is bbc.
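
And a matching decoding sketch, which repeatedly finds the symbol interval containing the number and rescales:

```python
F = {"a": (0.0, 0.2), "b": (0.2, 0.7), "c": (0.7, 1.0)}   # intervals of slide 27

def decode(x: float, n: int) -> str:
    """Decode n symbols from a number x lying in the final sequence interval."""
    out = []
    for _ in range(n):
        for sym, (f_lo, f_hi) in F.items():
            if f_lo <= x < f_hi:                # which symbol interval holds x?
                out.append(sym)
                x = (x - f_lo) / (f_hi - f_lo)  # rescale x into that interval
                break
    return "".join(out)

print(decode(0.49, 3))   # bbc, as on the slide
```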

  32. Representing an Interval Binary fractional representation: .b1b2b3… = b1/2 + b2/4 + b3/8 + … So how about just using the shortest binary fractional number lying in the sequence interval? e.g. [0,.33) → .01, [.33,.66) → .1, [.66,1) → .11 • Algorithm (emitting the bits of x) • x = 2 · x • if x < 1 output 0 • else x = x − 1; output 1
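
The doubling rule above, made runnable as a small Python function:

```python
def fraction_bits(x: float, nbits: int):
    """First nbits of the binary fractional representation of x in [0, 1):
    repeatedly double x and output its integer part (the rule above)."""
    bits = []
    for _ in range(nbits):
        x *= 2
        if x < 1:
            bits.append(0)
        else:
            bits.append(1)
            x -= 1
    return bits

print(fraction_bits(0.5, 3))    # [1, 0, 0]  i.e. .1
print(fraction_bits(0.75, 3))   # [1, 1, 0]  i.e. .11
```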

  33. Representing an Interval (continued) We can view a binary fractional number as an interval by considering all of its completions: a k-bit fraction .b1…bk stands for the interval [.b1…bk, .b1…bk + 2^-k). We will call this the code interval.

  34. Selecting the Code Interval To find a prefix code, pick a binary fractional (dyadic) number whose code interval is contained in the sequence interval. One can use L + s/2 truncated to 1 + ⌈log(1/s)⌉ bits. Example from the slide: sequence interval [.61,.79), chosen number .101 with code interval [.625,.75).

  35. Bound on Arithmetic length Note that –log s+1 = log (1/s)+1

  36. Bound on Length Theorem: for a text of length n, the Arithmetic encoder generates at most 1 + ⌈log(1/s)⌉ = 1 + ⌈log ∏i=1..n (1/pi)⌉ ≤ 2 + ∑i=1..n log(1/pi) = 2 + ∑k=1..|Σ| n pk log(1/pk) = 2 + n H0 bits. In practice it is nH0 + 0.02 n bits, because of rounding.

  37. Integer Arithmetic Coding The problem is that operations on arbitrary-precision real numbers are expensive. Key ideas of the integer version: • keep integers in the range [0..R) where R = 2^k • use rounding to generate the integer intervals • whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2. The integer algorithm is an approximation.

  38. Integer Arithmetic (scaling) • If l ≥ R/2 (interval in the top half): output 1 followed by m 0s, set m = 0, and expand the message interval by 2. • If u < R/2 (bottom half): output 0 followed by m 1s, set m = 0, and expand by 2. • If l ≥ R/4 and u < 3R/4 (middle half): increment m and expand by 2. • In all other cases, just continue…
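
A Python sketch of this scaling step, following the slide’s three conditions literally; R = 2^16 is an illustrative choice, and a full coder would first narrow [l,u) using the symbol’s cumulative counts:

```python
R = 1 << 16                      # k = 16 bits of working precision (illustrative)
HALF, QUARTER = R // 2, R // 4

def renormalize(l: int, u: int, pending: int, out: list):
    """Expand the interval [l, u) by 2 while it lies in the bottom, top or
    middle half of [0, R); `pending` is the slide's counter m of deferred bits."""
    while True:
        if u < HALF:                              # bottom half: emit 0 then m 1s
            out.append(0); out.extend([1] * pending); pending = 0
        elif l >= HALF:                           # top half: emit 1 then m 0s
            out.append(1); out.extend([0] * pending); pending = 0
            l -= HALF; u -= HALF
        elif l >= QUARTER and u < 3 * QUARTER:    # middle half: defer one bit
            pending += 1
            l -= QUARTER; u -= QUARTER
        else:
            return l, u, pending
        l *= 2; u *= 2                            # expand by a factor of 2

out = []
print(renormalize(1000, 20000, 0, out), out)      # ((2000, 40000, 0), [0])
```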

  39. You find this at

  40. Using Conditional Probabilities: PPM Use the previous k characters as the context. • Makes use of conditional probabilities. • Base the probabilities on counts: e.g. if th has been seen 12 times, followed by e 7 of those times, then the conditional probability is p(e|th) = 7/12. • Keep k small so that the dictionary does not get too large (typically less than 8).

  41. PPM: Partial Matching Problem: what do we do if the current context has never been seen followed by the next character? We cannot code 0 probabilities! The key idea of PPM is to reduce the context size if the previous match has not been seen: if the character has not been seen with the current context of size 3, send an escape message and try the context of size 2, then again an escape and the context of size 1, … Statistics are kept for each context size < k. The escape is a special character with some probability; different variants of PPM use different heuristics for that probability.
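
A toy sketch of the context counting and the escape to shorter contexts (class name, data structures and the probability estimate are illustrative; real PPM variants also charge for the escapes and smooth the counts differently):

```python
from collections import defaultdict

class TinyPPM:
    """Toy order-k model: count which characters follow each context and fall
    back to shorter contexts, as described above."""

    def __init__(self, k: int = 2):
        self.k = k
        self.counts = defaultdict(lambda: defaultdict(int))  # context -> char -> count

    def update(self, text: str):
        for i, ch in enumerate(text):
            for order in range(self.k + 1):
                if i >= order:
                    self.counts[text[i - order:i]][ch] += 1

    def predict(self, context: str, ch: str):
        """Return (order used, raw probability), escaping to shorter contexts
        until ch has been seen; a real coder would also charge for the escapes."""
        for order in range(self.k, -1, -1):
            ctx = context[len(context) - order:]
            seen = self.counts.get(ctx)
            if seen and ch in seen:
                return order, seen[ch] / sum(seen.values())
        return -1, None       # never seen at any order: left to a base model

model = TinyPPM(k=2)
model.update("ACCBACCACBA")       # the string of slide 42
print(model.predict("AC", "C"))   # (2, 0.666...): 'AC' was followed by 'C' in 2 of 3 cases
```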

  42. PPM: Example Contexts for the string ACCBACCACBA with k = 2 (the slide shows the table of context counts).

  43. You find this at: compression.ru/ds/

  44. Lempel-Ziv Algorithms Keep a “dictionary” of recently seen strings. The variants differ in: • how the dictionary is stored • how it is extended • how it is indexed • how elements are removed. LZ algorithms are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !! No explicit frequency estimation is needed.

  45. LZ77: Sliding Window A buffer “window” of fixed length moves over the text; the dictionary consists of all substrings starting inside it. Algorithm’s step at the cursor: • output <d, len, c>, where d = distance of the copied string wrt the current position, len = length of the longest match, c = next char in the text beyond the longest match • advance the cursor by len + 1.

  46. Example: LZ77 with window Window size = 6; at each step the longest match within W and the next character are emitted. On the text aacaacabcabaaac the steps produce, in order: (0,0,a), (1,1,c), (3,4,b), (3,3,a), (1,2,c).
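
A brute-force Python sketch of this parsing step; running it on the slide’s text reproduces the triples above (real implementations use the hash tables of slide 49 instead of scanning every distance):

```python
def lz77_encode(text: str, window: int = 6):
    """Greedy LZ77 parsing (sketch): at each step emit (d, len, c), searching
    the match start only within the last `window` positions."""
    i, out = 0, []
    while i < len(text):
        best_d, best_len = 0, 0
        for d in range(1, min(i, window) + 1):
            length = 0
            # The match may run past the cursor (overlapping copy), which is allowed.
            while i + length < len(text) - 1 and text[i + length] == text[i - d + length]:
                length += 1
            if length > best_len:
                best_d, best_len = d, length
        out.append((best_d, best_len, text[i + best_len]))
        i += best_len + 1                      # advance by len + 1
    return out

print(lz77_encode("aacaacabcabaaac"))
# [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')], as on the slide
```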

  47. LZ77 Decoding The decoder keeps the same dictionary window as the encoder: it locates the referenced substring and inserts a copy of it. What if len > d (the copy overlaps the part of the text still to be produced)? E.g. already decoded abcd, next codeword is (2,9,e). Simply copy byte by byte starting at the cursor: for (i = 0; i < len; i++) out[cursor+i] = out[cursor-d+i] The output is correct: abcdcdcdcdcdce.
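
The same copy loop as a runnable Python decoder, overlapping case included:

```python
def lz77_decode(triples):
    """Rebuild the text from (d, len, c) triples; the byte-by-byte copy makes
    the overlapping case (len > d) work exactly like the loop above."""
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):
            out.append(out[start + i])   # may read characters written by this same copy
        out.append(c)
    return "".join(out)

print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (0, 0, "c"), (0, 0, "d"), (2, 9, "e")]))
# abcdcdcdcdcdce, the overlapping example above
print(lz77_decode([(0, 0, "a"), (1, 1, "c"), (3, 4, "b"), (3, 3, "a"), (1, 2, "c")]))
# aacaacabcabaaac, the text of slide 46
```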

  48. LZ77 Optimizations used by gzip LZSS: output one of the two formats (0, position, length) or (1, char); typically the second format is used if length < 3. Example on aacaacabcabaaac: (1,a), (1,a), (1,c), (0,3,4), …

  49. Optimizations used by gzip (cont.) • Triples are coded with Huffman’s code. • Special greedy parsing: possibly use a shorter match so that the next match is better. • Use a hash table to store the dictionary: the hash is based on strings of length 3; the longest match is found by extending the positions lying in the current hash bucket; within a bucket, positions are stored in decreasing order (see the sketch below).
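
A sketch of that hash-bucket search with 3-character keys (window bookkeeping and insertion while scanning are omitted; the text and cursor position reproduce the (3,4,b) step of slide 46):

```python
from collections import defaultdict

def longest_match(text: str, i: int, table: dict, min_len: int = 3):
    """gzip-style candidate search (sketch): hash the 3 characters at the cursor
    and extend only the positions stored in that bucket, most recent first."""
    best_pos, best_len = -1, 0
    for j in reversed(table.get(text[i:i + min_len], [])):   # decreasing positions
        length = 0
        while i + length < len(text) and text[j + length] == text[i + length]:
            length += 1
        if length > best_len:
            best_pos, best_len = j, length
    return best_pos, best_len

text = "aacaacabcabaaac"
table = defaultdict(list)
for pos in range(3):                   # buckets for the already scanned prefix "aac"
    table[text[pos:pos + 3]].append(pos)
print(longest_match(text, 3, table))   # (0, 4): the (3,4,b) step of slide 46
```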
