LING / C SC 439/539 Statistical Natural Language Processing Lecture 10 2/13/2013
Recommended Reading • Manning & Schutze • Chapter 2, Mathematical Foundations, Information Theory • Decision trees • Marsland chapter 6 • Nilsson Chapter 5, Decision Trees http://ai.stanford.edu/~nilsson/mlbook.html • Orphanos et al. 1999. Decision Trees and NLP: A Case Study in POS Tagging. • Memory-based learning • Nilsson 5.3, Nearest-Neighbor Methods • Hastie 13.3 http://www-stat.stanford.edu/~tibs/ElemStatLearn/
Outline • Rule-based classifiers • Information theory: information gain • Decision Tree • Memory-based learning • Prototypes and vowel recognition
Rule-based systems • Old-fashioned approach to AI and NLP, before machine learning. • Have some data you want to model. • Work on a series of if-then rules to perform a classification task. • Get some right • Get some wrong, expand and/or refine rules
Let’s find names in text • The incident took place in Tumon on Guam's west coast, said Andy Gibson, an official in Gov. Eddie Calvo's office. • Information that will be useful (“features”) • Pre-defined lists of first and last names • Whether the preceding word is a first name • Whether the following word is a last name • Capitalization • Initials (George W. Bush) • Titles (Gov., Governor, President, Mr., Mrs.) • Whether word is preceded by said • Whether word is followed by , who
Write a hierarchy of if-then rules • An attempt (not expected to work especially well; see the code sketch below): • If string is in first name list, then YES • Else if string is in last name list: • If preceded by initial, then YES • Else NO • Else if string is capitalized: • If string is preceded by title, then YES • Else if string is preceded by a first name: • If string is followed by , who, then YES • Else NO • Else if string is preceded by said, then YES • Else if string is followed by a last name, then YES • Else NO • Else NO
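A rough sketch of this rule hierarchy as Python code (the word lists, helper functions, and their names are illustrative placeholders, not from the lecture):

# Minimal sketch of the name-detection rules above; not expected to work well.
# FIRST_NAMES, LAST_NAMES, and TITLES stand in for the pre-defined lists.
FIRST_NAMES = {"andy", "eddie", "george"}
LAST_NAMES = {"gibson", "calvo", "bush"}
TITLES = {"gov.", "governor", "president", "mr.", "mrs."}

def is_initial(word):
    # e.g. the "W." in "George W. Bush"
    return len(word) == 2 and word[0].isupper() and word[1] == "."

def is_name(word, prev_word, next_word, next_next_word):
    if word.lower() in FIRST_NAMES:
        return True
    elif word.lower() in LAST_NAMES:
        return is_initial(prev_word)
    elif word[:1].isupper():
        if prev_word.lower() in TITLES:
            return True
        elif prev_word.lower() in FIRST_NAMES:
            return next_word == "," and next_next_word.lower() == "who"
        elif prev_word.lower() == "said":
            return True
        elif next_word.lower() in LAST_NAMES:
            return True
        else:
            return False
    else:
        return False

print(is_name("Andy", "said", "Gibson", ","))   # True: in the first-name list
print(is_name("Gibson", "Andy", ",", "an"))     # False: the last-name rule only fires after an initial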
Rules are a series of decisions • Pros: • Very easy to interpret • Compare to the difficulty for a human of interpreting: • A hyperplane equation • Weights in a neural network • Cons: • Hard to come up with a good combination and ordering of rules to test • Can we learn the rules automatically?
Weights in a neural network are hard to interpret • (figure: http://www.emeraldinsight.com/content_images/fig/0580130202012.png)
Outline • Rule-based classifiers • Information theory: information gain • Decision Tree • Memory-based learning • Prototypes and vowel recognition
Information theory • http://en.wikipedia.org/wiki/Information_theory • Quantifies the amount of “information” in data • Topics include: • Entropy (lecture #3, 1/16) • Information gain (today) • Mutual information (later) • KL divergence (later)
Entropy of a probability distribution • Entropy is a measure of the average uncertainty in a random variable. • The entropy of a random variable X with probability mass function p(x) is: H(X) = −Σx p(x) log2 p(x) • i.e., the weighted average of the number of bits needed to specify an outcome drawn from p(x) • bit: binary digit, a number with two values
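A minimal sketch of this computation in Python (the function name and the example distribution are just illustrative):

import math

def entropy(probs):
    # H(X) = -sum over x of p(x) * log2 p(x); events with p(x) = 0 contribute nothing
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0: a fair coin carries one bit of uncertainty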
Guess my number: each number is equally likely. At each iteration, the range to be searched is cut in half • Input range: length N • Height of the search tree: log2 N • e.g. since 2^4 = 16, log2 16 = 4 • In the worst case, binary search requires log2 N iterations • Number of guesses equals entropy of the probability distribution [1/N, 1/N, …, 1/N] of possible guesses • (figure: search tree with levels of size N/2, N/4, N/8, N/16, …)
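A quick check of the log2 N claim, under the assumption that each yes/no question keeps at most half of the remaining range (the function name is illustrative):

import math

def worst_case_guesses(n):
    guesses = 0
    while n > 1:
        n = math.ceil(n / 2)   # each guess cuts the range (at most) in half
        guesses += 1
    return guesses

print(worst_case_guesses(16), math.log2(16))   # 4 4.0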
Uncertainty about a random variable • Example: compare these probability distributions over five events 1. [ .2, .2, .2, .2, .2 ] 2. [ .1, .1, .1, .35, .35 ] 3. [ 0, 0, 1.0, 0, 0 ] • For which distributions are you more/less uncertain about what event(s) are likely to occur? • How do you quantify your degree of uncertainty? • With entropy
Properties of entropy • Minimum value: H(X) = 0 • Outcome is fully certain • For example, if one event has 100% probability, and all others have 0% probability • Maximum value: H(X) = log2 |sample space| • Occurs when all events are equiprobable • No knowledge about which events are more/less likely • Maximizes uncertainty
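These properties can be checked on the three distributions from the previous slide, reusing an entropy helper like the one sketched earlier:

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.2, 0.2, 0.2, 0.2, 0.2]))     # ~2.32 = log2(5): equiprobable, maximum uncertainty
print(entropy([0.1, 0.1, 0.1, 0.35, 0.35]))   # ~2.06: somewhat less uncertain
print(entropy([0.0, 0.0, 1.0, 0.0, 0.0]))     # 0.0: outcome fully certain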
Fewer guesses if I give you additional information • I’m thinking of a number between 1 and 64 • Takes at most log2 64 = 6 guesses • Initially, the probability of each number is 1/64 • But suppose I give you additional information before you start making guesses: I’m thinking of an even number between 1 and 64 • Will now take at most log2 32 = 5 guesses • Uncertainty has been reduced • The probabilities are now: • Odd numbers: probability of each is 0 • Even numbers: probability of each is 1/32
“Gain” information • Prior knowledge gives you an information gain of 1 bit • Initially: 64 possible choices • H(X) = 6 • Knowing that the number is even: 32 choices • H(X) = 5 • Have “gained” 1 bit of information: 6 – 5 = 1
Another example • Suppose there are 64 possible numbers. • Average # of guesses = H(X) = log2 64 = 6 bits • Suppose you know my number is <= 10 • Only 10 possible answers • Can ignore other numbers in original range • Max guesses: log2 10 = 3.32 bits • Information gain: 6 – 3.32 = 2.68 bits
Information gain: definition • Information gain is the change in entropy from a prior state to the state once some information is given IG(X, Y) = H(X) − H(X|Y) • H(X|Y) is called conditional entropy
Specific conditional entropy: H(X|Y=v) • The specific conditional entropy H(X|Y=v) is the entropy of X restricted to cases where Y=v • Conditional entropy H(X|Y) is a weighted sum of the specific conditional entropies
Example: guess whether a person speaks English • X = whether a person speaks English • Y = country of origin • Data: XY • Yes USA • No USA • No England • Yes England • No USA • Yes India • Yes India • No USA
First calculate H(X) • X = [Yes, No, No, Yes, No, Yes, Yes, No] • H(X) = 1 because p(Yes) = 0.5 and p(No) = 0.5 • Data: XY • Yes USA • No USA • No England • Yes England • No USA • Yes India • Yes India • No USA
Specific conditional entropy: Suppose Y = India • For Y = India, X = [Yes, Yes] • p(X=yes) = 1.0, p(X=no) = 0.0 • H(X|Y = India) = H([1.0, 0.0]) = 0.0 • Data: XY • Yes USA • No USA • No England • Yes England • No USA • Yes India • Yes India • No USA
Specific conditional entropy: Suppose Y = England • For Y = England, X = [No, Yes] • p(X=yes) = 0.5, p(X=no) = 0.5 • H(X|Y = England) = H([0.5, 0.5]) = 1.0 • Data: XY • Yes USA • No USA • No England • Yes England • No USA • Yes India • Yes India • No USA
Specific conditional entropy: Suppose Y = USA • For Y = USA, X = [Yes, No, No, No] • p(X=yes) = 0.25, p(X=no) = 0.75 • H(X|Y = USA) = H([0.25, 0.75]) = 0.81 • Data: XY • Yes USA • No USA • No England • Yes England • No USA • Yes India • Yes India • No USA
Conditional entropy • Weighted sum of the specific conditional entropies H(X|Y = v), weighted by p(Y = v): • India: p = .25, H = 0.0 • England: p = .25, H = 1.0 • USA: p = .5, H = 0.81 • H(X|Y) = .25*0.0 + .25*1.0 + .5*0.81 = .655
Information gain • H(X) = 1.0 (not given country) • H(X|Y) = 0.655 (given country) • IG(X, Y) = H(X) – H(X|Y) = 1.0 – 0.655 = 0.345 bits • Interpretation: we gain information about whether a person speaks English, given knowledge of the person’s country • If we know what country a person is from, on average we need 0.345 fewer bits to specify whether the person speaks English
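The whole worked example can be reproduced with a short Python sketch (function and variable names are illustrative):

import math
from collections import Counter

def entropy(values):
    # entropy of the empirical distribution over a list of outcomes
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def conditional_entropy(X, Y):
    # H(X|Y): weighted sum of the specific conditional entropies H(X|Y=v)
    n = len(Y)
    total = 0.0
    for v in set(Y):
        X_given_v = [x for x, y in zip(X, Y) if y == v]
        total += (len(X_given_v) / n) * entropy(X_given_v)
    return total

# X = speaks English?, Y = country of origin (the 8 cases from the slides)
X = ["Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No"]
Y = ["USA", "USA", "England", "England", "USA", "India", "India", "USA"]

print(entropy(X))                                # 1.0
print(conditional_entropy(X, Y))                 # ~0.66
print(entropy(X) - conditional_entropy(X, Y))    # ~0.34 bits of information gain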
Outline • Rule-based classifiers • Information theory: information gain • Decision Tree • Memory-based learning • Prototypes and vowel recognition
Decision tree • You have an instance and its features. • Make a series of decisions to determine its classification. • Decision tree is a series of hierarchical If-Then clauses.
Example problem: what should I do this evening? • Predict Activity given values of the features Deadline, Party, and Lazy
Components of decision tree • Node = Feature • Label on branch to child node = Value of feature • Leaf: Classification decision
Learning a decision tree • Key question: which feature should we select for a node? • At a node, we split the cases according to their value for a feature. • A good split is one that breaks apart the data into their different classes. • Answer: split on the feature with maximal information gain.
Suppose we have no features. Many choices for activity, so high entropy • Party (5), Study (3), TV (1), Pub (1) • Compute H(Activity)
Suppose we split on Party feature. Clean split between ‘Party’ activity and all other activities • Party = Yes: activities = Party (5) • Party = No: activities = Study (3), TV (1), Pub (1)
Suppose we split on Lazy feature. Not a good split for Activity. • Lazy = Yes: activities = Party (3), Study (1), Pub (1), TV (1) • Lazy = No: activities = Party (2), Study (2)
Decision tree learning algorithm (ID3) • Input: a set of labeled data S • If all examples have the same label: • Return a leaf with that label • Else if there are no features left to test: • Return a leaf with the most common label • Else: • Choose the feature F that maximizes the information gain of S to be the next node • Add a branch from the node for each possible value f of F • For each branch: • Calculate Sf, the subset of S with F = f, and remove F from the set of remaining features • Recursively call the algorithm on Sf with the remaining features (see the sketch below)
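A compact Python sketch of ID3 (the dictionary-of-dictionaries tree representation and the helper names are illustrative choices, not from the lecture):

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(examples, labels, feature):
    # IG(labels, feature) = H(labels) - H(labels | feature)
    n = len(examples)
    cond = 0.0
    for value in {ex[feature] for ex in examples}:
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == value]
        cond += (len(subset) / n) * entropy(subset)
    return entropy(labels) - cond

def id3(examples, labels, features):
    if len(set(labels)) == 1:               # all examples share a label -> leaf
        return labels[0]
    if not features:                        # no features left -> most common label
        return Counter(labels).most_common(1)[0][0]
    # choose the feature with maximal information gain and branch on its values
    best = max(features, key=lambda f: information_gain(examples, labels, f))
    remaining = [f for f in features if f != best]
    tree = {best: {}}
    for value in {ex[best] for ex in examples}:
        sub = [(ex, lab) for ex, lab in zip(examples, labels) if ex[best] == value]
        sub_examples = [ex for ex, _ in sub]
        sub_labels = [lab for _, lab in sub]
        tree[best][value] = id3(sub_examples, sub_labels, remaining)
    return tree

On the evening-activity data, this would split on Party at the root, matching the information-gain calculations on the following slides.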
Information gain again • Information gain is the change in entropy from a prior state to the state once some information is given IG(X, Y) = H(X) − H(X|Y) • H(X|Y) is called conditional entropy
Calculations for root node • IG = H(X) − H(X|Y) • First calculate the entropy H(X) of the current set of data
Calculations for root node • Next calculate H(X) - H(X|Y) for each feature Y
Choose a feature to split on • H(X) - H(X | Deadline ) = 0.5435 • H(X) - H(X | Party ) = 1.0 • H(X) - H(X | Lazy ) = 0.21 • Greatest information gain for Party • Split data set on Party • Create branches labeled ‘Yes’ and ‘No’
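The Party gain can be checked directly from the class counts on the earlier Party-split slide (the Deadline and Lazy gains would need the full feature table, which is not reproduced here); a small sketch:

import math

def entropy_from_counts(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

H_activity = entropy_from_counts([5, 3, 1, 1])        # Party/Study/TV/Pub over all 10 days, ~1.685 bits
H_party_yes = entropy_from_counts([5])                # Party = Yes: all 'Party', 0 bits
H_party_no = entropy_from_counts([3, 1, 1])           # Party = No: Study/TV/Pub, ~1.371 bits
H_given_party = 0.5 * H_party_yes + 0.5 * H_party_no  # weighted by p(Yes) = p(No) = 0.5
print(H_activity - H_given_party)                     # ~1.0, the gain listed above for Party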
Recursively construct the decision tree; here are the calculations for the “Party = No” node
Overfitting • Decision trees are prone to overfitting • The tree builds a series of if-then clauses that exactly fit particular cases in the training set, instead of capturing more general properties of the data • Preventing overfitting with a validation set, two approaches: • Early pruning: stop growing the tree when error on the validation set starts to increase • Late pruning: construct the full tree, then remove branches at the bottom, going up
Geometric interpretation of decision tree • We have data points in a multidimensional feature space • A decision tree partitions the feature space into “nested” “rectangles” • The boundary lines are parallel to the axes
Example: data points from three classes in 2-D feature space, and learned decision tree
Decision tree: split on value of x1, then split on value of x2
Construct a decision tree for this data, split on integer value of a feature
Root: if x1 >= 2, assign class 1; else continue with the rest of the cases • (tree diagram: root node tests x1 >= 2; YES branch → Class 1)
Next split: if x2 >= 2, assign class 2, else class 1 • (tree diagram: below the root, a second node tests x2, with one branch → Class 1 and the other → Class 2)
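Taken literally, the two splits described in the last two slides amount to a tiny classifier (thresholds and class labels are taken from the slide text; the original figure may use different values):

def classify(x1, x2):
    # axis-parallel tests: first on x1 at the root, then on x2
    if x1 >= 2:
        return 1        # class 1
    elif x2 >= 2:
        return 2        # class 2
    else:
        return 1        # class 1

print(classify(3, 0), classify(0, 5), classify(0, 0))   # 1 2 1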
For this data, decision tree is a poor fit compared to hyperplane