An introduction to Data Compression

  2. General information
  • Requirements
    • some programming skills (not too many...)
    • knowledge of data structures
    • ... some work!
  • Office hours
    • ... please write me an email: monfardini@dii.unisi.it
  Gabriele Monfardini - Corso di Basi di Dati Multimediali (Multimedia Databases course), academic year 2005-2006

  3. What is compression?
  • Intuitively, compression is a method “to press something into a smaller space”.
  • In our domain, a better definition is “to make information shorter”.

  4. Some basic questions
  • What is information?
  • How can we measure the amount of information?
  • Why is compression useful?
  • How do we compress?
  • How much can we compress?

  5. What is information? - I
  • Commonly, the term information refers to the knowledge of some fact, circumstance or thought.
  • For example, when we read a newspaper, the news is the information.
  • syntax
    • letters, punctuation marks, white spaces, grammar rules ...
  • semantics
    • meaning of the words and of the sentences

  6. What is information? - II
  • In our domain, information is merely the syntax, i.e. we are interested in the symbols of the alphabet used to express the information.
  • In order to give a mathematical definition of information, we need some principles of Information Theory.

  7. The fundamental concept
  • A key concept in Information Theory is that information is conveyed by randomness
  • What information does a biased coin give us, if its outcome is always heads?
  • What about another biased coin, whose outcome is heads with 90% probability?
  • We need a way to measure the amount of information quantitatively, in some mathematical sense

  8. The Uncertainty - I
  • Suppose we have a discrete random variable $X$, and $x$ is a particular outcome with probability $p(x)$. The uncertainty of $x$ is
    $$u(x) = \log \frac{1}{p(x)} = -\log p(x)$$
  • The units are given by the base of the logarithm
    • base 2 → bits
    • base e → nats
    • base 10 → hartleys

  9. The Uncertainty - II
  • Suppose the random variable outputs 0 or 1
    • $p(0) = p(1) = 1/2$ → each outcome has 1 bit of information
    • $p(0) = 1,\ p(1) = 0$ → 0 gives no information at all, while if the outcome is 1 the information is $\infty$
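The uncertainty of a single outcome can be computed directly from its probability. The following is a minimal Python sketch; the function name `uncertainty` and its `base` parameter are illustrative choices, not taken from the slides.

```python
import math


def uncertainty(p: float, base: float = 2.0) -> float:
    """Uncertainty log(1/p) of an outcome with probability p, in the given base."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if p == 0.0:
        # An outcome that never happens would carry infinite information.
        return math.inf
    return math.log(1.0 / p) / math.log(base)


# A fair coin: each outcome carries 1 bit.
fair = uncertainty(0.5)
# A coin that always lands heads: seeing heads tells us nothing.
certain = uncertainty(1.0)
# A coin that lands heads 90% of the time: heads carries little information.
likely = uncertainty(0.9)
```

With base 2 the result is in bits; passing `base=math.e` gives nats.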

  10. The Entropy
  • More useful is the entropy of a random variable $X$ with values in a space $\mathcal{X}$:
    $$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$
  • The entropy is a measure of the average uncertainty of the random variable

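  11. The entropy - examples
  • Consider again a r.v. $X$ with only two possible outcomes, 0 and 1, with $p(1) = p$ and $p(0) = 1 - p$. In this case
    $$H(X) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$
  • The entropy is 0 when $p = 0$ or $p = 1$, and reaches its maximum of 1 bit at $p = 1/2$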
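As a rough illustration of the entropy as an average uncertainty, the sketch below computes it for an arbitrary finite distribution and for the two-outcome case. The names `entropy` and `binary_entropy` are illustrative, and terms with zero probability are taken to contribute zero, as is conventional.

```python
import math


def entropy(probs, base: float = 2.0) -> float:
    """Entropy -sum p log p of a finite distribution given as a list of probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with probability 0 are skipped: they contribute nothing to the average.
    return sum(p * math.log(1.0 / p) for p in probs if p > 0) / math.log(base)


def binary_entropy(p: float) -> float:
    """Entropy in bits of a r.v. with two outcomes of probability p and 1 - p."""
    return entropy([p, 1.0 - p])


fair_coin = binary_entropy(0.5)    # maximal uncertainty for two outcomes
biased_coin = binary_entropy(0.9)  # less than 1 bit on average
always_heads = binary_entropy(1.0) # no uncertainty at all
```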

  12. Compression and loss
  • lossless
    • the decompressed message (file) is an exact copy of the original. Useful for text compression
  • lossy
    • some information is lost in the decompressed message (file). Useful for image and sound compression
  • We will ignore lossy compression for a while

  13. Definitions - I
  • A source code $C$ for a r.v. $X$ is a mapping from $\mathcal{X}$ to $\mathcal{D}^*$, the set of finite-length strings from a D-ary alphabet.
  • $C(x)$, codeword for $x$
  • $l(x)$, length of $C(x)$

  14. Definitions - II
  • non-singular code (... trivial ...)
    • every element of $\mathcal{X}$ is mapped to a different string of $\mathcal{D}^*$: $x_i \neq x_j \Rightarrow C(x_i) \neq C(x_j)$
  • extension of a code
    • $C(x_1 x_2 \dots x_n) = C(x_1) C(x_2) \dots C(x_n)$
  • uniquely decodable code
    • its extension is non-singular

  15. Definitions - III
  • prefix (better: prefix-free) or instantaneous code
    • no codeword is a prefix of any other codeword
    • the advantage is that decoding has no need to look ahead
  ... 11? ...

  16. Examples
  [Table: example codes in four columns: singular; not singular, but not uniquely decodable; uniquely decodable, but not instantaneous; instantaneous]
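The four categories can be checked mechanically. The sketch below is a minimal Python illustration: the four example codes are illustrative choices and are not claimed to be the ones shown on the slide. Unique decodability is tested with the Sardinas-Patterson procedure, a standard test that is swapped in here and is not described in the slides. Codewords are assumed to be non-empty strings.

```python
def is_nonsingular(words):
    """True if all codewords are distinct."""
    return len(set(words)) == len(words)


def is_prefix_free(words):
    """True if no codeword is a prefix of another codeword."""
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words))
        for j in range(len(words))
    )


def _dangling(prefixes, strings):
    """Non-empty suffixes w such that a + w = b, with a in prefixes and b in strings."""
    return {
        b[len(a):]
        for a in prefixes
        for b in strings
        if len(b) > len(a) and b.startswith(a)
    }


def is_uniquely_decodable(words):
    """Sardinas-Patterson test: fails if some dangling suffix is itself a codeword."""
    if not is_nonsingular(words):
        return False
    code = set(words)
    suffixes = _dangling(code, code)
    seen = set()
    while suffixes:
        if suffixes & code:
            return False
        key = frozenset(suffixes)
        if key in seen:  # the sequence of suffix sets has started to repeat
            return True
        seen.add(key)
        suffixes = _dangling(code, suffixes) | _dangling(suffixes, code)
    return True


def classify(words):
    """Return the most specific of the four categories for a list of codewords."""
    if not is_nonsingular(words):
        return "singular"
    if not is_uniquely_decodable(words):
        return "not singular, but not uniquely decodable"
    if not is_prefix_free(words):
        return "uniquely decodable, but not instantaneous"
    return "instantaneous"


# Illustrative codes for a four-symbol source (assumed examples).
examples = [
    ["0", "0", "0", "0"],
    ["0", "010", "01", "10"],
    ["10", "00", "11", "110"],
    ["0", "10", "110", "111"],
]
categories = [classify(words) for words in examples]
```

In the third example, a decoder that has read `11` cannot tell whether it has a complete codeword or the start of `110` without looking ahead, which is exactly what an instantaneous code avoids.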

  17. Kraft Inequality - I
  • Theorem (Kraft Inequality)
    For any instantaneous code over an alphabet of size D, the codeword lengths $l_1, l_2, \dots, l_m$ must satisfy
    $$\sum_{i=1}^{m} D^{-l_i} \le 1$$
    Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these word lengths

  18. Kraft Inequality - II
  • Consider a complete D-ary tree
    • at level $k$, there are $D^k$ nodes
    • a node at level $p < k$ has $D^{k-p}$ descendants that are nodes at level $k$
  [Figure: complete tree with level 0, level 1, level 2 and level 3 marked]

  19. Kraft Inequality - III
  • Proof
    Consider a D-ary tree (not necessarily complete) representing the codewords: each path down the tree is a sequence of symbols, and each leaf (with its unique path) is a codeword. Let $l_{\max}$ be the length of the longest codeword. A codeword of length $l_i$, being a leaf, implies that at level $l_{\max}$ there are $D^{l_{\max} - l_i}$ missing nodes

  20. Kraft Inequality - IV
    The total number of possible nodes at level $l_{\max}$ is $D^{l_{\max}}$. Summing over all codewords
    $$\sum_{i=1}^{m} D^{l_{\max} - l_i} \le D^{l_{\max}}$$
    Dividing by $D^{l_{\max}}$
    $$\sum_{i=1}^{m} D^{-l_i} \le 1$$
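The inequality is easy to check numerically for a given list of codeword lengths. The sketch below is a minimal illustration using exact rational arithmetic; the function names `kraft_sum` and `satisfies_kraft` are illustrative.

```python
from fractions import Fraction


def kraft_sum(lengths, D: int = 2) -> Fraction:
    """Exact value of the sum of D**(-l) over the given codeword lengths."""
    return sum((Fraction(1, D ** l) for l in lengths), Fraction(0))


def satisfies_kraft(lengths, D: int = 2) -> bool:
    """True if an instantaneous D-ary code with these lengths can exist."""
    return kraft_sum(lengths, D) <= 1


# Lengths of the binary code 0, 10, 110, 111: the sum is exactly 1.
tight = kraft_sum([1, 2, 3, 3])
# Two codewords of length 1 already use the whole binary tree,
# so a third codeword of length 2 cannot be added.
impossible = satisfies_kraft([1, 1, 2])
```

Using `Fraction` avoids floating-point rounding when the sum is exactly 1.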

  21. Kraft Inequality - V
  • Proof (converse)
    Suppose (without loss of generality) that codewords are ordered by length, i.e. $l_1 \le l_2 \le \dots \le l_m$. Consider a D-ary tree and start assigning each codeword to a node, starting from $l_1$. For a generic codeword $i$ with length $l_i$, consider the set $K$ of codewords with length $\le l_i$, except $i$. Suppose there is no available node at level $l_i$. That is,
    $$\sum_{j \in K} D^{l_i - l_j} = D^{l_i}$$

  22. Kraft Inequality - VI
    but this means that
    $$\sum_{j \in K} D^{-l_j} = 1$$
    Then
    $$\sum_{j=1}^{m} D^{-l_j} \ge \sum_{j \in K} D^{-l_j} + D^{-l_i} = 1 + D^{-l_i} > 1$$
    which is absurd, since the lengths satisfy the Kraft inequality. Therefore the obtained tree represents an instantaneous code with the desired codeword lengths
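The constructive argument can be mirrored in code. The sketch below is one possible realization, assuming positive lengths and an alphabet of at most 10 digits: it processes lengths in increasing order and always takes the leftmost available node at each level, which is a specific choice not stated in the slides. The function name `prefix_code_from_lengths` is illustrative.

```python
from fractions import Fraction


def prefix_code_from_lengths(lengths, D: int = 2):
    """Build an instantaneous D-ary code with the given codeword lengths."""
    if not 2 <= D <= 10:
        raise ValueError("this sketch writes digits 0-9, so D must be in 2..10")
    if any(l < 1 for l in lengths):
        raise ValueError("codeword lengths must be positive")
    if sum(Fraction(1, D ** l) for l in lengths) > 1:
        raise ValueError("lengths violate the Kraft inequality")

    # Visit codewords from shortest to longest, remembering original positions.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codewords = [None] * len(lengths)
    value, prev_len = 0, 0  # index of the next free node at the current level
    for i in order:
        l = lengths[i]
        # Descend to level l: each free node has D**(l - prev_len) descendants.
        value *= D ** (l - prev_len)
        digits, v = [], value
        for _ in range(l):
            v, d = divmod(v, D)
            digits.append(str(d))
        codewords[i] = "".join(reversed(digits))
        value += 1  # the node just used, and its subtree, is no longer available
        prev_len = l
    return codewords


binary_code = prefix_code_from_lengths([3, 1, 3, 2])
ternary_code = prefix_code_from_lengths([1, 1, 2, 2], D=3)
```

Because every assigned node blocks its whole subtree, no later codeword can have an earlier one as a prefix.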

  23. Models and coders
  • The model supplies the probabilities of the symbols (or of the groups of symbols, as we will see later)
  • The coder encodes and decodes starting from these probabilities
  [Figure: text → encoder → compressed text → decoder → text; a model feeds both the encoder and the decoder]

  24. Good modeling is crucial
  • What happens if the true probabilities of the symbols to be coded are $p_i$ but we use $q_i$?
  • Simply, the compressed text will be longer, i.e. the average number of bits/symbol will be greater
  • It is possible to calculate the difference in bits/symbol from the two probability mass functions $p$ and $q$; it is known as the relative entropy
    $$D(p \| q) = \sum_i p_i \log \frac{p_i}{q_i}$$
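The cost of a wrong model can be made concrete with a small computation. The sketch below assumes an ideal coder that spends $\log_2(1/q_i)$ bits on symbol $i$; the function names and the example distributions are illustrative.

```python
import math


def entropy_bits(p):
    """Average bits/symbol when coding with the true distribution p."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)


def cross_entropy_bits(p, q):
    """Average bits/symbol when symbols follow p but the coder assumes q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf  # a symbol that occurs but was given probability 0
        total += pi * math.log2(1.0 / qi)
    return total


def relative_entropy(p, q):
    """Relative entropy D(p || q) in bits: the extra cost of modeling p as q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total


true_p = [0.5, 0.5]
model_q = [0.25, 0.75]
penalty = relative_entropy(true_p, model_q)
actual = cross_entropy_bits(true_p, model_q)
best = entropy_bits(true_p)
```

Here `actual - best` equals `penalty` up to rounding, which is the sense in which the relative entropy measures the excess bits/symbol.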

  25. Finite-context models
  • in English text ... ... but
  • A finite-context model of order m uses the previous m symbols to make the prediction
  • Better modeling, but we need to estimate many more probabilities
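A finite-context model can be estimated by counting which symbol follows each context of length m. The sketch below is a minimal illustration; the function names and the sample text are made up for the example, and the probabilities are plain relative frequencies with no smoothing.

```python
from collections import Counter, defaultdict


def train_context_model(text: str, m: int):
    """Count, for every context of m symbols, how often each symbol follows it."""
    counts = defaultdict(Counter)
    for i in range(m, len(text)):
        counts[text[i - m:i]][text[i]] += 1
    return counts


def predict(counts, context: str, symbol: str) -> float:
    """Relative frequency of `symbol` after `context` (0.0 for unseen contexts)."""
    seen = counts.get(context)
    if not seen:
        return 0.0
    return seen[symbol] / sum(seen.values())


sample = "quick quiet queen quota unique"  # made-up sample text
order0 = train_context_model(sample, 0)
order1 = train_context_model(sample, 1)

p_u = predict(order0, "", "u")           # how often 'u' occurs overall
p_u_after_q = predict(order1, "q", "u")  # how often 'u' follows 'q'
```

An order-m model stores one distribution per context, so the number of probabilities to estimate grows roughly as the alphabet size raised to the power m.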

  26. Finite-state models
  • Although potentially more powerful (e.g. they can model whether an odd or even number of a's have occurred consecutively), they are not so popular.
  • Obviously the decoder uses the same model, so encoder and decoder are always in the same state
  [Figure: finite-state model with states 1 and 2; transition labels a 0.5, b 0.5, b 0.01, a 0.99]

  27. Static models
  • A model is static if we set up a reasonable probability distribution and use it for all the texts to be coded.
  • Poor performance in case of different kinds of sources (English text, financial data...)
  • One solution is to have K different models and to send the index of the model used
  • ... but cf. the book Gadsby by E. V. Wright (a novel written without the letter e)

  28. Adaptive models
  • In order to solve the problems of static modeling, adaptive (or dynamic) models begin with a bland probability distribution, which is refined as more symbols of the text become known
  • The encoder and the decoder have the same initial distribution and the same rules to alter it
  • There can be adaptive models of order m > 0

  29. The zero-frequency problem
  • The situation in which a symbol is predicted with probability zero should be avoided, as it cannot be coded
  • One solution: increase the total number of symbols in the text by 1; the resulting 1/total probability is divided among all unseen symbols
  • Another solution: add 1 to the count of every symbol
  • Many more solutions...
  • Which is the best? If the text is sufficiently long, the compression is similar
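An adaptive order-0 model using the second solution (every symbol starts with a count of 1) might look like the sketch below. The class name `AdaptiveModel` and the helper `ideal_code_length` are illustrative, and the code length assumes an ideal coder that spends $\log_2(1/p)$ bits per symbol.

```python
import math
from collections import Counter


class AdaptiveModel:
    """Order-0 adaptive model; every symbol of the alphabet starts with count 1."""

    def __init__(self, alphabet):
        self.counts = Counter({s: 1 for s in alphabet})
        self.total = len(self.counts)

    def probability(self, symbol) -> float:
        if symbol not in self.counts:
            raise KeyError(f"symbol {symbol!r} is not in the alphabet")
        return self.counts[symbol] / self.total

    def update(self, symbol) -> None:
        """Refine the distribution after a symbol has been coded."""
        self.counts[symbol] += 1
        self.total += 1


def ideal_code_length(text, alphabet) -> float:
    """Bits an ideal coder would spend on `text`, updating the model as it goes."""
    model = AdaptiveModel(alphabet)
    bits = 0.0
    for symbol in text:
        bits += math.log2(1.0 / model.probability(symbol))
        # The decoder performs the same update after decoding the same symbol,
        # so both sides always hold identical counts.
        model.update(symbol)
    return bits


bits_repetitive = ideal_code_length("aaaa", "ab")
bits_mixed = ideal_code_length("abab", "ab")
```

Since no count is ever zero, every symbol of the alphabet can always be coded, at the price of slightly worse predictions early in the text.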

  30. Symbolwise and dictionary models
  • The set of all possible symbols of a source is called the alphabet
  • Symbolwise models provide an estimated probability for each symbol in the alphabet
  • Dictionary models instead replace substrings in a text with codewords that identify each substring in a collection, called dictionary or codebook
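To make the dictionary idea concrete, the sketch below replaces substrings with their index in a fixed codebook, using greedy longest-match and falling back to literal characters. This is only one simple scheme chosen for illustration; the token format, the function names and the codebook contents are all hypothetical.

```python
def dictionary_encode(text: str, codebook):
    """Greedy longest-match encoding into ('ref', index) or ('lit', char) tokens."""
    # Try longer entries first so that the longest match wins.
    by_length = sorted(range(len(codebook)), key=lambda i: -len(codebook[i]))
    tokens, pos = [], 0
    while pos < len(text):
        for i in by_length:
            entry = codebook[i]
            if entry and text.startswith(entry, pos):
                tokens.append(("ref", i))
                pos += len(entry)
                break
        else:
            # No dictionary entry matches here: emit the character itself.
            tokens.append(("lit", text[pos]))
            pos += 1
    return tokens


def dictionary_decode(tokens, codebook) -> str:
    """Rebuild the text by expanding references and copying literals."""
    return "".join(codebook[v] if kind == "ref" else v for kind, v in tokens)


codebook = ["the ", "compress", "ion"]  # hypothetical codebook
tokens = dictionary_encode("the compression", codebook)
restored = dictionary_decode(tokens, codebook)
```

The decoder needs the same codebook as the encoder, just as symbolwise coders need the same model on both sides.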

