LING / C SC 439/539 Statistical Natural Language Processing

Presentation Transcript


  1. LING / C SC 439/539 Statistical Natural Language Processing, Lecture 10, 2/13/2013

  2. Recommended Reading • Manning & Schütze • Chapter 2, Mathematical Foundations, Information Theory • Decision trees • Marsland chapter 6 • Nilsson Chapter 5, Decision Trees http://ai.stanford.edu/~nilsson/mlbook.html • Orphanos et al. 1999. Decision Trees and NLP: A Case Study in POS Tagging. • Memory-based learning • Nilsson 5.3, Nearest-Neighbor Methods • Hastie 13.3 http://www-stat.stanford.edu/~tibs/ElemStatLearn/

  3. Outline • Rule-based classifiers • Information theory: information gain • Decision Tree • Memory-based learning • Prototypes and vowel recognition

  4. Rule-based systems • Old-fashioned approach to AI and NLP, before machine learning. • Have some data you want to model. • Work on a series of if-then rules to perform a classification task. • Get some right • Get some wrong, expand and/or refine rules

  5. Let’s find names in text • The incident took place in Tumon on Guam's west coast, said Andy Gibson, an official in Gov. Eddie Calvo's office. • Information that will be useful (“features”) • Pre-defined lists of first and last names • Whether the preceding word is a first name • Whether the following word is a last name • Capitalization • Initials (George W. Bush) • Titles (Gov., Governor, President, Mr., Mrs.) • Whether word is preceded by said • Whether word is followed by , who

  6. Write a hierarchy of if-then rules • An attempt: (not supposed to work that well) • If string is in first name list, then YES • Else if string is in last name list: • If preceded by initial, then YES • Else NO • Else if string is capitalized: • If string is preceded by title, then YES • Else if string is preceded by a first name: • If string is followed by , who, then YES • Else NO • Else if string is preceded by said, then YES • Else if string is followed by a last name, then YES • Else NO • Else NO
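
A minimal sketch of this rule hierarchy in Python. The name lists, title list, and helper checks (FIRST_NAMES, LAST_NAMES, TITLES, is_initial) are hypothetical placeholders, not part of the lecture:

```python
# Hypothetical pre-defined lists (placeholders for illustration only)
FIRST_NAMES = {"Andy", "Eddie", "George"}
LAST_NAMES = {"Gibson", "Calvo", "Bush"}
TITLES = {"Gov.", "Governor", "President", "Mr.", "Mrs."}

def is_initial(tok):
    """A single capital letter followed by a period, e.g. 'W.'"""
    return len(tok) == 2 and tok[0].isupper() and tok[1] == "."

def is_name(tokens, i):
    """Apply the slide's rule hierarchy to tokens[i]; first matching rule wins."""
    w = tokens[i]
    prev = tokens[i - 1] if i > 0 else ""
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    nxt2 = tokens[i + 2] if i + 2 < len(tokens) else ""

    if w in FIRST_NAMES:
        return True
    elif w in LAST_NAMES:
        return is_initial(prev)                 # YES only when preceded by an initial
    elif w[:1].isupper():
        if prev in TITLES:
            return True
        elif prev in FIRST_NAMES:
            return nxt == "," and nxt2 == "who"
        elif prev == "said":
            return True
        elif nxt in LAST_NAMES:
            return True
        else:
            return False
    else:
        return False

tokens = "The incident took place , said Andy Gibson , an official".split()
print([t for i, t in enumerate(tokens) if is_name(tokens, i)])  # ['Andy']
```

Note that "Gibson" is missed: the last-name branch fires first and only says YES after an initial, which is exactly the kind of rule-ordering problem the next slide points out.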

  7. Rules are a series of decisions • Pros: • Very easy to interpret • Compare to the difficulty for a human of interpreting: • A hyperplane equation • Weights in a neural network • Cons: • Hard to come up with a good combination and ordering of rules to test • Can we learn them automatically?

  8. Weights in a neural network are hard to interpret http://www.emeraldinsight.com/content_images/fig/0580130202012.png

  9. Outline • Rule-based classifiers • Information theory: information gain • Decision Tree • Memory-based learning • Prototypes and vowel recognition

  10. Information theory • http://en.wikipedia.org/wiki/Information_theory • Quantifies the amount of “information” in data • Topics including: • Entropy (lecture #3, 1/16) • Information gain (today) • Mutual information (later) • KL divergence (later)

  11. Entropy of a probability distribution • Entropy is a measure of the average uncertainty in a random variable. • The entropy of a random variable X with probability mass function p(x) is: H(X) = −Σx p(x) log2 p(x) • which is the weighted average of the number of bits, −log2 p(x), needed to specify each outcome x • bit: binary digit, a number with two values
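
As a quick illustrative sketch (mine, not from the slides), the definition can be computed directly:

```python
import math

def entropy(probs):
    """H(X) = -sum over x of p(x) * log2 p(x); terms with p(x) = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin flip
print(entropy([0.25] * 4))    # 2.0 bits: four equally likely outcomes
print(entropy([0.9, 0.1]))    # ~0.47 bits: a heavily skewed coin, much less uncertainty
```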

  12. Guess my number: each number is equally likely. At each iteration, the range to be searched is cut in half • Input range: length N • Height of tree: log2 N • e.g. since 2^4 = 16, log2 16 = 4 • In the worst case, binary search requires log2 N iterations • Number of guesses equals the entropy of the probability distribution [1/N, 1/N, …, 1/N] over the possible answers • [Figure: binary search tree, subranges of size N/2, N/4, N/8, N/16]
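
A small sketch (mine, not from the lecture) that counts the guesses binary search needs; for any secret number in 1..16 it never exceeds log2 16 = 4:

```python
def guesses_needed(secret, lo=1, hi=16):
    """Count how many 'higher or lower than the midpoint?' guesses binary search uses."""
    count = 0
    while lo < hi:
        mid = (lo + hi) // 2
        count += 1
        if secret <= mid:
            hi = mid
        else:
            lo = mid + 1
    return count

print(max(guesses_needed(n) for n in range(1, 17)))  # 4 == log2(16)
```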

  13. Uncertainty about a random variable • Example: compare these probability distributions over five events 1. [ .2, .2, .2, .2, .2 ] 2. [ .1, .1, .1, .35, .35 ] 3. [ 0, 0, 1.0, 0, 0 ] • For which distributions are you more/less uncertain about what event(s) are likely to occur? • How do you quantify your degree of uncertainty? • With entropy
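
Working these out with H(X) = −Σ p(x) log2 p(x): distribution 1 gives log2 5 ≈ 2.32 bits (maximum uncertainty), distribution 2 gives 3(0.1 · 3.32) + 2(0.35 · 1.51) ≈ 2.06 bits, and distribution 3 gives 0 bits (no uncertainty). The more skewed the distribution, the lower the entropy.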

  14. Properties of entropy • Minimum value: H(X) = 0 • Outcome is fully certain • For example, if one event has 100% probability, and all others have 0% probability • Maximum value: H(X) = log2 |sample space| • Occurs when all events are equiprobable • No knowledge about which events are more/less likely • Maximizes uncertainty

  15. Fewer guesses if I give you additional information • I’m thinking of a number between 1 and 64 • Takes at most log2 64 = 6 guesses • Initially, the probability of each number is 1/64 • But suppose I give you additional information before you start making guesses: I’m thinking of an even number between 1 and 64 • Will now take at most log2 32 = 5 guesses • Uncertainty has been reduced • The probabilities are now: • Odd numbers: probability of each is 0 • Even numbers: probability of each is 1/32

  16. “Gain” information • Prior knowledge gives you an information gain of 1 bit • Initially: 64 possible choices • H(X) = 6 • Knowing that the number is even: 32 choices • H(X) = 5 • Have “gained” 1 bit of information: 6 – 5 = 1

  17. Another example • Suppose there are 64 possible numbers. • Average # of guesses = H(X) = log2 64 = 6 bits • Suppose you know my number is <= 10 • Only 10 possible answers • Can ignore other numbers in original range • Max guesses: log2 10 = 3.32 bits • Information gain: 6 – 3.32 = 2.68 bits

  18. Information gain: definition • Information gain is the change in entropy from a prior state to the state once some information is given IG(X, Y) = H(X) − H(X|Y) • H(X|Y) is called conditional entropy

  19. Specific conditional entropy: H(X|Y=v) • The specific conditional entropy H(X|Y=v) is the entropy of X restricted to cases where Y=v • Conditional entropy H(X|Y) is a weighted sum of the specific conditional entropies

  20. Example: guess whether a person speaks English • X = whether a person speaks English • Y = country of origin • Data (X, Y): • Yes USA • No USA • No England • Yes England • No USA • Yes India • Yes India • No USA

  21. First calculate H(X) • X = [Yes, No, No, Yes, No, Yes, Yes, No] • H(X) = 1 because p(Yes) = 0.5 and p(No) = 0.5 • Data (X, Y): • Yes USA • No USA • No England • Yes England • No USA • Yes India • Yes India • No USA

  22. Specific conditional entropy: Suppose Y = India • For Y = India, X = [Yes, Yes] • p(X=yes) = 1.0, p(X=no) = 0.0 • H(X|Y = India) = H([1.0, 0.0]) = 0.0 • Data (X, Y): • Yes USA • No USA • No England • Yes England • No USA • Yes India • Yes India • No USA

  23. Specific conditional entropy: Suppose Y = England • For Y = England, X = [No, Yes] • p(X=yes) = 0.5, p(X=no) = 0.5 • H(X|Y = England) = H([0.5, 0.5]) = 1.0 • Data (X, Y): • Yes USA • No USA • No England • Yes England • No USA • Yes India • Yes India • No USA

  24. Specific conditional entropy: Suppose Y = USA • For Y = USA, X = [Yes, No, No, No] • p(X=yes) = 0.25, p(X=no) = 0.75 • H(X|Y = USA) = H([0.25, 0.75]) = 0.81 • Data (X, Y): • Yes USA • No USA • No England • Yes England • No USA • Yes India • Yes India • No USA

  25. Conditional entropy • Weighted sum of specific conditional entropies • v p(Y = v) H(X | Y = v) • India .25 0.0 • England .25 1.0 • USA .5 0.81 • H(X|Y) = .25*0.0 + .25*1.0 + .5*0.81 = .655

  26. Information gain • H(X) = 1.0 (not given country) • H(X|Y) = 0.655 (given country) • IG(X, Y) = H(X) – H(X|Y) = 1.0 – 0.655 = 0.345 bits • Interpretation: we gain information about what language a person speaks, given knowledge of people’s countries • If we know what country a person is from, on average, we need 0.345 fewer bits to predict whether the person speaks English
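
A short sketch, not part of the lecture, that reproduces these numbers from the eight (X, Y) pairs above:

```python
import math
from collections import Counter

# The eight (X, Y) pairs from the slides: X = speaks English?, Y = country
data = [("Yes", "USA"), ("No", "USA"), ("No", "England"), ("Yes", "England"),
        ("No", "USA"), ("Yes", "India"), ("Yes", "India"), ("No", "USA")]

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

H_X = entropy([x for x, y in data])                       # 1.0

# H(X|Y): weighted sum of the specific conditional entropies H(X|Y=v)
H_X_given_Y = sum(
    (sum(1 for _, y in data if y == v) / len(data))
    * entropy([x for x, y in data if y == v])
    for v in {y for _, y in data}
)

print(H_X, H_X_given_Y, H_X - H_X_given_Y)
# 1.0, ~0.656, ~0.344 (the slide's 0.655 and 0.345 come from rounding 0.811 to 0.81)
```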

  27. Outline • Rule-based classifiers • Information theory: information gain • Decision Tree • Memory-based learning • Prototypes and vowel recognition

  28. Decision tree • You have an instance and its features. • Make a series of decisions to determine its classification. • Decision tree is a series of hierarchical If-Then clauses.

  29. Example problem: what should I do this evening? • Predict Activity given values of the features Deadline, Party, and Lazy

  30. Solution decision tree

  31. Components of decision tree • Node = Feature • Label on branch to child node = Value of feature • Leaf: Classification decision

  32. Learning a decision tree • Key question: which feature should we select for a node? • At a node, we split the cases according to their value for a feature. • A good split is one that breaks apart the data into their different classes. • Answer: split on the feature with maximal information gain.

  33. Suppose we have no features. Many choices for activity → high entropy • Party (5), Study (3), TV (1), Pub (1) • Compute H(Activity)
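
Using the counts above (5, 3, 1, 1 out of 10 examples), H(Activity) = −(0.5 log2 0.5 + 0.3 log2 0.3 + 2 · 0.1 log2 0.1) ≈ 0.5 + 0.521 + 0.664 ≈ 1.685 bits.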

  34. Suppose we split on the Party feature. Clean split between the ‘Party’ activity and all other activities • Party = Yes: activities = Party (5) • Party = No: activities = Study (3), TV (1), Pub (1)
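
Worked out from these counts: H(Activity | Party=Yes) = 0, H(Activity | Party=No) = H([3/5, 1/5, 1/5]) ≈ 1.371, so H(Activity | Party) = 0.5 · 0 + 0.5 · 1.371 ≈ 0.685 and the information gain is 1.685 − 0.685 = 1.0 bit, the value that appears on slide 40.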

  35. Suppose we split on the Lazy feature. Not a good split for Activity. • Lazy = Yes: activities = Party (3), Study (1), Pub (1), TV (1) • Lazy = No: activities = Party (2), Study (2)
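
Similarly, with these counts: H(Activity | Lazy=Yes) = H([3/6, 1/6, 1/6, 1/6]) ≈ 1.792, H(Activity | Lazy=No) = H([2/4, 2/4]) = 1.0, so H(Activity | Lazy) = 0.6 · 1.792 + 0.4 · 1.0 ≈ 1.475 and the gain is 1.685 − 1.475 ≈ 0.21 bits, far less than for Party.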

  36. Decision tree learning algorithm (ID3) • Input: a labeled set of data S and a set of features • If all examples have the same label: • Return a leaf with that label • Else if there are no features left to test: • Return a leaf with the most common label • Else: • Choose the feature F that maximizes the information gain on S to be the next node • Add a branch from the node for each possible value f of F • For each branch: • Let Sf be the subset of examples with F = f, with F removed from the remaining set of features • Recursively call the algorithm on Sf to build the subtree below that branch
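
A compact Python sketch of this algorithm. The representation (each example as a dict of feature values plus a label key) is my own choice for illustration, not from the reading:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, feature, label="label"):
    """IG = H(labels) minus the weighted entropy of labels within each feature value."""
    labels = [ex[label] for ex in examples]
    remainder = 0.0
    for v in {ex[feature] for ex in examples}:
        subset = [ex[label] for ex in examples if ex[feature] == v]
        remainder += (len(subset) / len(examples)) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, features, label="label"):
    labels = [ex[label] for ex in examples]
    if len(set(labels)) == 1:                 # all examples share a label
        return labels[0]
    if not features:                          # no features left to test
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(examples, f, label))
    tree = {best: {}}
    for v in {ex[best] for ex in examples}:   # one branch per observed value
        subset = [ex for ex in examples if ex[best] == v]
        tree[best][v] = id3(subset, [f for f in features if f != best], label)
    return tree
```

On the evening-activity data, calling id3 with the features Deadline, Party, and Lazy would split on Party first, since it has the largest information gain.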

  37. Information gain again • Information gain is the change in entropy from a prior state to the state once some information is given IG(X, Y) = H(X) − H(X|Y) • H(X|Y) is called conditional entropy

  38. Calculations for root node • H(X) − H(X|Y) • First calculate entropy H(X), for current set of data

  39. Calculations for root node • Next calculate H(X) - H(X|Y) for each feature Y

  40. Choose a feature to split on • H(X) - H(X | Deadline ) = 0.5435 • H(X) - H(X | Party ) = 1.0 • H(X) - H(X | Lazy ) = 0.21 • Greatest information gain for Party • Split data set on Party • Create branches labeled ‘Yes’ and ‘No’

  41. Recursively construct the decision tree; here are the calculations for the “Party = No” node

  42. Overfitting • Decision trees are prone to overfitting • Constructs a series of if-then clauses that exactly fit particular cases in the training set, instead of more general properties of the data • Preventing overfitting with a validation set, two approaches: • Early pruning: stop growing tree when error starts to increase • Late pruning: Construct the full tree. Remove branches at the bottom, going up.

  43. Geometric interpretation of a decision tree • We have data points in a multidimensional feature space • A decision tree partitions the feature space into “nested” “rectangles” • Decision boundaries are lines parallel to the axes

  44. Example: data points from three classes in 2-D feature space, and learned decision tree

  45. Decision tree: split on value of x1, then split on value of x2

  46. Construct a decision tree for this data, split on integer value of a feature

  47. Root: if x1 >= 2, assign Class 1; else continue with the rest of the cases • [Figure: root node “x1 >= 2?”, YES branch → Class 1, NO branch → further splits]

  48. Next split: if x2 >= 2, assign Class 2, else Class 1 • [Figure: full tree with root “x1 >= 2?” (YES → Class 1) and a second split on x2 whose branches lead to Class 1 and Class 2]

  49. For this data, a decision tree is a poor fit compared to a hyperplane
