
Language and Information


Presentation Transcript


  1. Language and Information September 21, 2000 Handout #2

  2. Course Information • Instructor: Dragomir R. Radev (radev@si.umich.edu) • Office: 305A, West Hall • Phone: (734) 615-5225 • Office hours: TTh 3-4 • Course page: http://www.si.umich.edu/~radev/760 • Class meets on Thursdays, 5-8 PM in 311 West Hall

  3. Readings • Textbook: • Oakes, Chapter 2, pages 53–76 • Additional readings: • M&S, Chapter 7 (minus Section 7.4) • M&S, Chapter 8 (minus Sections 8.3–8.4)

  4. Information Theory

  5. Entropy • Let p(x) be the probability mass function of a random variable X, over a discrete set of symbols (or alphabet) X: p(x) = P(X=x), x ∈ X • Example: throwing two coins and counting heads and tails • Entropy (self-information) is the average uncertainty of a single random variable: H(X) = - Σx∈X p(x) log2 p(x)
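
A minimal sketch (not part of the handout) of this definition in Python; the two-coin example has outcomes 0, 1, and 2 heads with probabilities 1/4, 1/2, and 1/4:

from math import log2

def entropy(probs):
    """H(X) = -sum of p(x) log2 p(x), skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Two fair coins, counting heads: P(0) = 1/4, P(1) = 1/2, P(2) = 1/4
print(entropy([0.25, 0.5, 0.25]))  # 1.5 bits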

  6. Information theoretic measures • Claude Shannon (information theory): “information = unexpectedness” • Series of events (messages) with associated probabilities: pi (i = 1 .. n) • Goal: to measure the information content, H(p1, …, pn), of a particular message • Simplest case: the messages are words • When pi is low, the word is more informative (it is more unexpected)

  7. Properties of information content • H is a continuous function of the pi • If all pi are equal (pi = 1/n), then H is a monotone increasing function of n • If a message is broken into two successive messages, the original H should be the weighted sum of the resulting values of H

  8. Example • The only function satisfying all three properties is the entropy function: H = - Σ pi log2 pi • Example: p1 = 1/2, p2 = 1/3, p3 = 1/6

  9. Example (cont’d) H = - (1/2 log2 1/2 + 1/3 log2 1/3 + 1/6 log2 1/6) = 1/2 log2 2 + 1/3 log2 3 + 1/6 log2 6 = 1/2 + 1.585/3 + 2.585/6 = 1.46 Alternative formula for H: H = Σ pi log2 (1/pi)

  10. Another example • Example: • No tickets left: P = 1/2 • Matinee shows only: P = 1/4 • Eve. show, undesirable seats: P = 1/8 • Eve. show, orchestra seats: P = 1/8

  11. Example (cont’d) H = - (1/2 log 1/2 + 1/4 log 1/4 + 1/8 log 1/8 + 1/8 log 1/8) H = - [(1/2 x -1) + (1/4 x -2) + (1/8 x -3) + (1/8 x -3)] H = 1.75 (bits per symbol)
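
As a quick check (a sketch, not from the handout), both worked examples can be verified numerically:

from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([1/2, 1/3, 1/6]))       # ~1.46 bits (slide 9)
print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75 bits (slide 11)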

  12. Characteristics of Entropy • When one of the messages has a probability approaching 1, then entropy decreases. • When all messages have the same probability, entropy increases. • Maximum entropy: when pi = 1/n (H = ??) • Relative entropy: ratio of actual entropy to maximum entropy • Redundancy: 1 - relative entropy

  13. Entropy examples • Letter frequencies in Simplified Polynesian: P (1/8), T (1/4), K (1/8), A (1/4), I (1/8), U (1/8) • What is H(P)? • What is the shortest code that can be designed to describe Simplified Polynesian? • What is the entropy of a weighted coin? Draw a diagram.
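
A sketch for this exercise (the letter distribution is the one on the slide; the coin part just tabulates H for a few weightings):

from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Simplified Polynesian letters p, t, k, a, i, u
print(entropy([1/8, 1/4, 1/8, 1/4, 1/8, 1/8]))  # 2.5 bits per letter

# Entropy of a weighted coin with P(heads) = q; maximum of 1 bit at q = 0.5
for q in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(q, entropy([q, 1 - q]))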

  14. Joint entropy and conditional entropy • The joint entropy of a pair of discrete random variables X, Y ~ p(x,y) is the amount of information needed on average to specify both their values: H(X,Y) = - Σx Σy p(x,y) log2 p(x,y) • The conditional entropy of a discrete random variable Y given another X, for X, Y ~ p(x,y), expresses how much extra information is needed to communicate Y given that the other party knows X: H(Y|X) = - Σx Σy p(x,y) log2 p(y|x)

  15. Connection between joint and conditional entropies • There is a chain rule for entropy (note that the products in the chain rule for probabilities have become sums because of the log): H(X,Y) = H(X) + H(Y|X), and in general H(X1,…,Xn) = H(X1) + H(X2|X1) + … + H(Xn|X1,…,Xn-1)
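
A sketch that computes H(X,Y), H(X), and H(Y|X) directly from the definitions and checks the chain rule; the joint distribution below is made up for illustration:

from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Toy joint distribution p(x, y)
joint = {('a', 0): 1/4, ('a', 1): 1/4, ('b', 0): 3/8, ('b', 1): 1/8}

# Marginal p(x)
px = {}
for (x, _), p in joint.items():
    px[x] = px.get(x, 0) + p

H_XY = H(joint.values())
H_X = H(px.values())
# H(Y|X) = -sum over (x,y) of p(x,y) log2 p(y|x), with p(y|x) = p(x,y)/p(x)
H_Y_given_X = -sum(p * log2(p / px[x]) for (x, _), p in joint.items() if p > 0)

print(H_XY, H_X + H_Y_given_X)  # both ~1.91 bits: H(X,Y) = H(X) + H(Y|X)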

  16. Simplified Polynesian revisited

  17. Mutual information H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) • Mutual information: reduction in uncertainty of one random variable due to knowing about another, or the amount of information one random variable contains about another. H(X) – H(X|Y) = H(Y) – H(Y|X) = I(X;Y)

  18. Mutual information and entropy [Figure: H(X,Y) decomposed into H(X|Y), I(X;Y), and H(Y|X)] • I(X;Y) is 0 iff the two variables are independent • For two dependent variables, mutual information grows not only with the degree of dependence, but also according to the entropy of the variables

  19. Formulas for I(X;Y) I(X;Y) = H(X) – H(X|Y) = H(X) + H(Y) – H(X,Y) I(X;Y) = Σx Σy p(x,y) log2 [p(x,y) / (p(x)p(y))] Since H(X|X) = 0, note that H(X) = H(X) – H(X|X) = I(X;X) Pointwise mutual information: I(x;y) = log2 [p(x,y) / (p(x)p(y))]
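
A sketch computing I(X;Y) two ways (via entropies and via the double sum), plus the pointwise value for one cell, using the same illustrative joint distribution as above:

from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

joint = {('a', 0): 1/4, ('a', 1): 1/4, ('b', 0): 3/8, ('b', 1): 1/8}

px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0) + p
    py[y] = py.get(y, 0) + p

I_from_entropies = H(px.values()) + H(py.values()) - H(joint.values())
I_direct = sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)
print(I_from_entropies, I_direct)  # same value, ~0.049 bits

# Pointwise mutual information for the single pair (x, y) = ('a', 0)
print(log2(joint[('a', 0)] / (px['a'] * py[0])))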

  20. The noisy channel model W → Encoder → X → Channel p(y|x) → Y → Decoder → Ŵ • W: message from a finite alphabet • X: input to channel • Y: output from channel • Ŵ: attempt to reconstruct the message based on the output • Binary symmetric channel: each bit is transmitted correctly with probability 1-p and flipped with probability p
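
A sketch of the binary symmetric channel: each transmitted bit is flipped independently with crossover probability p (the message below is arbitrary):

import random

def binary_symmetric_channel(bits, p, seed=0):
    """Flip each bit independently with probability p."""
    rng = random.Random(seed)
    return [b ^ 1 if rng.random() < p else b for b in bits]

message = [0, 1, 1, 0, 1, 0, 0, 1]
received = binary_symmetric_channel(message, p=0.1)
print(message)
print(received)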

  21. Statistical NLP as decoding problems

  22. Coding

  23. Compression • Huffman coding (prefix property) • Ziv-Lempel codes (better) • arithmetic codes (better for images - why?)

  24. Huffman coding • Developed by David Huffman (1952) • Average of 5 bits per character • Based on frequency distributions of symbols • Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols
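
A minimal sketch of the algorithm on this slide (repeatedly merge the two least frequent symbols); the frequency table is made up for illustration:

import heapq

def huffman_code(freqs):
    """Build a prefix code from a {symbol: frequency} table."""
    # Heap entries: (frequency, tie-breaker, {symbol: codeword so far})
    heap = [(f, i, {sym: ''}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # least frequent subtree
        f2, _, right = heapq.heappop(heap)  # second least frequent subtree
        merged = {s: '0' + c for s, c in left.items()}
        merged.update({s: '1' + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

freqs = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
print(huffman_code(freqs))  # the most frequent symbol gets the shortest codeword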

  25. [Figure: example Huffman code tree over the symbols a–j, with branches labeled 0 and 1]

  26. Exercise • Consider the bit string: 01101101111000100110001110100111000110101101011101 • Use the Huffman code from the example to decode it. • Try inserting, deleting, and switching some bits at random locations and try decoding.

  27. Ziv-Lempel coding • Two types - one is known as LZ77 (used in GZIP) • Code: set of triples <a,b,c> • a: how far back in the decoded text to look for the upcoming text segment • b: how many characters to copy • c: new character to add to complete segment

  28. <0,0,p> p • <0,0,e> pe • <0,0,t> pet • <2,1,r> peter • <0,0,_> peter_ • <6,1,i> peter_pi • <8,2,r> peter_piper • <6,3,c> peter_piper_pic • <0,0,k> peter_piper_pick • <7,1,d> peter_piper_picked • <7,1,a> peter_piper_picked_a • <9,2,e> peter_piper_picked_a_pe • <9,2,_> peter_piper_picked_a_peck_ • <0,0,o> peter_piper_picked_a_peck_o • <0,0,f> peter_piper_picked_a_peck_of • <17,5,l> peter_piper_picked_a_peck_of_pickl • <12,1,d> peter_piper_picked_a_peck_of_pickled • <16,3,p> peter_piper_picked_a_peck_of_pickled_pep • <3,2,r> peter_piper_picked_a_peck_of_pickled_pepper • <0,0,s> peter_piper_picked_a_peck_of_pickled_peppers
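
A sketch of a decoder for triples of this <back, length, char> form; run on the triples above, it reconstructs the Peter Piper string:

def lz77_decode(triples):
    """Each triple (back, length, char): copy `length` characters starting
    `back` positions before the current end of the output, then append `char`."""
    out = []
    for back, length, char in triples:
        start = len(out) - back
        for i in range(length):
            out.append(out[start + i])
        out.append(char)
    return ''.join(out)

triples = [(0, 0, 'p'), (0, 0, 'e'), (0, 0, 't'), (2, 1, 'r'), (0, 0, '_'),
           (6, 1, 'i'), (8, 2, 'r'), (6, 3, 'c'), (0, 0, 'k'), (7, 1, 'd'),
           (7, 1, 'a'), (9, 2, 'e'), (9, 2, '_'), (0, 0, 'o'), (0, 0, 'f'),
           (17, 5, 'l'), (12, 1, 'd'), (16, 3, 'p'), (3, 2, 'r'), (0, 0, 's')]
print(lz77_decode(triples))  # peter_piper_picked_a_peck_of_pickled_peppers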

  29. Arithmetic coding • Uses probabilities • Achieves about 2.5 bits per character

  30. Exercise • Assuming the alphabet consists of a, b, and c, develop arithmetic encoding for the following strings: aaa aab aba baa abc cab cba bac
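
A sketch of the interval-narrowing idea behind arithmetic coding, assuming the three symbols are equally likely (probability 1/3 each); every string maps to a subinterval of [0, 1), and wider intervals need fewer bits to identify:

def arithmetic_interval(text, probs):
    """Return the [low, high) interval that encodes `text`."""
    # Cumulative distribution: symbol -> (start of its slice, width of its slice)
    cum, start = {}, 0.0
    for sym, p in probs.items():
        cum[sym] = (start, p)
        start += p

    low, width = 0.0, 1.0
    for ch in text:
        s, p = cum[ch]
        low += s * width   # move into the symbol's slice of the current interval
        width *= p         # narrow the interval
    return low, low + width

probs = {'a': 1/3, 'b': 1/3, 'c': 1/3}
for s in ['aaa', 'aab', 'aba', 'baa', 'abc', 'cab', 'cba', 'bac']:
    print(s, arithmetic_interval(s, probs))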
