
A Brief Introduction to Information Theory


Presentation Transcript


    1. A Brief Introduction to Information Theory. Readings: Manning & Schütze, Chap. 2.1 and 2.2; Cover & Thomas, Chap. 2 (handout).

    2. Entropy – A Measure of Uncertainty

    3. Entropy – A Measure of Uncertainty
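
    The defining formula for these two slides is not shown in the transcript; for reference, the standard definition, consistent with the horse-race calculation on the next slide, is (in LaTeX notation):

        H(X) = -\sum_{x} p(x) \log_2 p(x)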

    4. Entropy – Another Example. Example 1.1.2 (Cover): Suppose we have a horse race with eight horses taking part, and assume their respective probabilities of winning are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64). We can calculate the entropy as H(X) = -1/2 log(1/2) - 1/4 log(1/4) - 1/8 log(1/8) - 1/16 log(1/16) - 4 * (1/64) log(1/64) = 2 bits.
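
    A minimal Python check of this calculation (an illustrative sketch, not part of the original slides):

        from math import log2

        # Win probabilities from the horse-race example
        p = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

        # Entropy in bits: H(X) = -sum_i p_i * log2(p_i)
        H = -sum(pi * log2(pi) for pi in p)
        print(H)  # 2.0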

    5. Entropy is Concave (as a function of the probability distribution)

    6. Properties of information content: (1) H is a continuous function of the p_i; (2) if all p_i are equal (p_i = 1/n), then H is a monotonically increasing function of n; (3) if a message is broken into two successive messages, the original H is the weighted sum of the resulting values of H.
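
    The third property is Shannon's grouping (decomposition) requirement. A standard worked instance (an illustration, not taken from the slides): splitting a choice among probabilities 1/2, 1/3, 1/6 into a first binary choice followed by a second choice gives

        H\left(\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6}\right)
          = H\left(\tfrac{1}{2}, \tfrac{1}{2}\right)
          + \tfrac{1}{2}\, H\left(\tfrac{2}{3}, \tfrac{1}{3}\right)

    where the weight 1/2 is the probability of reaching the branch that gets split further.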

    7. Coding Interpretation of Entropy. Suppose we wish to send a message indicating which horse won the race as succinctly as possible. We could send the index of the winning horse, requiring 3 bits (since there are 8 horses). But because the win probabilities are not uniform, we should use shorter descriptions for more probable horses. For example, we could use the following set of bit strings to represent the eight horses: 0, 10, 110, 1110, 111100, 111101, 111110, 111111. The average description length is then 2 bits, equal to the entropy.
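
    A short Python check that this prefix code achieves the entropy (illustrative, assuming the win probabilities from the earlier horse-race example):

        # Expected description length under the slide's prefix code
        codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]
        p = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

        avg_len = sum(pi * len(c) for pi, c in zip(p, codes))
        print(avg_len)  # 2.0 bits, matching H(X)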

    8. Kullback-Leibler Divergence
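
    The formula for this slide is not shown in the transcript; the standard definition of the KL divergence (relative entropy) between distributions p and q is:

        D(p \,\|\, q) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)}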

    9. Mutual information. Mutual information is the reduction in uncertainty of one random variable due to knowing about another, or equivalently the amount of information one random variable contains about another.

    10. Mutual information and entropy

    11. Formulas for I(X;Y): I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)
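
    For reference, the standard expanded form, equivalent to the identities above (not shown in the transcript), is:

        I(X;Y) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}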

    12. Mutual Information

    13. Mutual Information. Mutual information can be used to segment Chinese characters (Sproat & Shi, 1990).
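
    A rough illustration of the idea (a sketch only, not Sproat & Shi's actual algorithm): score each pair of adjacent characters by pointwise mutual information estimated from raw counts, and place a word boundary where the score falls below a threshold. The function name, the threshold, and the estimation scheme are assumptions for the example.

        from math import log2
        from collections import Counter

        def segment(text, threshold=0.0):
            """Insert a boundary between adjacent characters whose PMI is below threshold."""
            chars = list(text)
            if len(chars) < 2:
                return text
            unigrams = Counter(chars)
            bigrams = Counter(zip(chars, chars[1:]))
            n = len(chars)
            pieces = [chars[0]]
            for a, b in zip(chars, chars[1:]):
                # PMI of the adjacent pair (a, b) under simple count-based estimates
                pmi = log2((bigrams[(a, b)] / (n - 1)) /
                           ((unigrams[a] / n) * (unigrams[b] / n)))
                pieces.append(b if pmi >= threshold else " " + b)
            return "".join(pieces)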

    14. Feature Selection via Mutual Information. Problem: from a training set of documents for some given class (topic), choose the k words which best discriminate that topic. One way is to use the terms with maximal mutual information with the class: for each word w and each category c, compute I(w; c) and keep the highest-scoring words.
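
    The slide's scoring formula is not in the transcript; judging from the table on slide 17 (where 2^I(w,'cancer') ≈ P(w|'cancer') / P(w)), the score appears to be pointwise mutual information. A sketch under that assumption (function and variable names are illustrative, not from the slides):

        from math import log2

        def pmi_score(count_w_in_class, class_word_total, count_w_corpus, corpus_word_total):
            """log2( P(w | class) / P(w) ): pointwise mutual information of word and class."""
            p_w_given_c = count_w_in_class / class_word_total
            p_w = count_w_corpus / corpus_word_total
            return log2(p_w_given_c / p_w)

        # The 'Lung' row from slide 17: 15 of 5519 subcorpus tokens, 15 of 704903 corpus tokens
        print(pmi_score(15, 5519, 15, 704903))   # ~6.997, so 2**I ~ 128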

    15. Feature Selection: An Example. Test corpus: Reuters document set. Words in corpus: 704903. Sample subcorpus: ten documents containing the word “cancer”. Words in subcorpus: 5519. “cancer” occurs 181 times in the subcorpus and 201 times in the entire document set.

    16. Most probable words given that “cancer” appears in the document: the (311), cancer (181), of (171), and (141), in (137), a (123), to (106), women (71), is (69), that (65), s (64), breast (61).

    17. Words sorted by I(w, 'cancer'):

        Word          #c   total   2^I(w,'cancer')   P(w|'cancer')   P(w)
        Lung          15   15      128               0.00272         2.13e-05
        Cancers       14   14      128               0.00254         1.99e-05
        Counseling    14   14      128               0.00254         1.99e-05
        Mammograms    11   11      128               0.00199         1.56e-05
        Oestrogen     10   10      128               0.00181         1.42e-05
        Brca           8    8      128               0.00145         1.13e-05
        Brewster       9    9      128               0.00163         1.28e-05
        Detection      7    7      128               0.00127         9.93e-06
        Ovarian        7    7      128               0.00127         9.93e-06
        Incidence      6    6      128               0.00109         8.51e-06
        Klausner       6    6      128               0.00109         8.51e-06
        Lerman         6    6      128               0.00109         8.51e-06
        Mammography    4    4      128               0.000725        5.67e-06

        (#c = count in the “cancer” subcorpus; total = count in the full corpus)

    18. Examples from: “Class-Based n-gram Models of Natural Language”, Brown et al., Computational Linguistics, 18(4), 1992.

    19. Word Cluster Trees Formed with MI

    20. Mutual Information Word Clusters

    21. Sticky Words and Relative Entropy. “Let Pnear(w1 w2) be the probability that a word chosen at random from the text is w1 and that a 2nd word, chosen at random from a window of 1,001 words centered on w1, but excluding the words in a window of 5 centered on w1, is w2.” “w1 and w2 are semantically sticky if Pnear(w1 w2) >> P(w1)P(w2).” But this is just saying that w1 and w2 are semantically sticky if D(Pnear(w1 w2) || P(w1)P(w2)) is large, where D is “point” relative entropy.
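
    In symbols, the “point” relative entropy referred to here is the single-term contribution

        \log_2 \frac{P_{\mathrm{near}}(w_1 w_2)}{P(w_1)\,P(w_2)}

    which is large exactly when Pnear(w1 w2) >> P(w1)P(w2).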

    22. Some Sticky Word Clusters
