Information Theory, Classification & Decision Trees Ling 572 Advanced Statistical Methods in NLP January 5, 2012
Entropy • Information-theoretic measure • Measures the information in a model • Conceptually, a lower bound on the # of bits needed to encode the data • Entropy: H(X) = -\sum_x p(x) \log_2 p(x), where X is a random variable and p is its probability function
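A minimal sketch (not from the original slides) of computing entropy for a discrete distribution in Python; the distribution values are purely illustrative:

```python
import math

def entropy(p):
    """Entropy H(X) = -sum_x p(x) * log2 p(x), in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

# Hypothetical distribution (illustrative values only)
p = {"a": 0.5, "b": 0.25, "c": 0.25}
print(entropy(p))  # 1.5 bits
```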
Cross-Entropy • Comparing models • The actual distribution p is unknown • Use a simplified model m to estimate it • Cross-entropy: H(p, m) = -\sum_x p(x) \log_2 m(x) • A closer match will have lower cross-entropy
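As a hedged illustration of the comparison, the sketch below scores two hypothetical models m against a true distribution p; the closer model gets the lower cross-entropy (all values are made up):

```python
import math

def cross_entropy(p, m):
    """H(p, m) = -sum_x p(x) * log2 m(x), in bits."""
    return -sum(px * math.log2(m[x]) for x, px in p.items() if px > 0)

p       = {"a": 0.5, "b": 0.25, "c": 0.25}   # true distribution (illustrative)
m_close = {"a": 0.4, "b": 0.3,  "c": 0.3}    # closer model
m_far   = {"a": 0.1, "b": 0.1,  "c": 0.8}    # poorer model
print(cross_entropy(p, m_close) < cross_entropy(p, m_far))  # True
```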
Relative Entropy • Commonly known as Kullback-Leibler (KL) divergence • Expresses the difference between two probability distributions: KL(p || q) = \sum_x p(x) \log_2 (p(x) / q(x)) • Not a proper distance metric: asymmetric • KL(p||q) != KL(q||p)
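A minimal sketch (assumed, not from the slides) that makes the asymmetry concrete; p and q are arbitrary illustrative distributions:

```python
import math

def kl(p, q):
    """KL(p || q) = sum_x p(x) * log2(p(x) / q(x)), in bits."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.1, "b": 0.1,  "c": 0.8}
print(kl(p, q), kl(q, p))  # different values: KL is not symmetric
```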
Joint & Conditional Entropy • Joint entropy: H(X, Y) = -\sum_x \sum_y p(x, y) \log_2 p(x, y) • Conditional entropy: H(Y | X) = -\sum_x \sum_y p(x, y) \log_2 p(y | x) = H(X, Y) - H(X)
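A small sketch (illustrative joint distribution, not from the slides) that computes H(X,Y) and H(X) and recovers H(Y|X) via the chain rule:

```python
import math
from collections import defaultdict

# Hypothetical joint distribution over (X, Y) pairs (illustrative values only)
joint = {("rain", "wet"): 0.3, ("rain", "dry"): 0.1,
         ("sun",  "wet"): 0.1, ("sun",  "dry"): 0.5}

h_xy = -sum(p * math.log2(p) for p in joint.values() if p > 0)

p_x = defaultdict(float)
for (x, _), p in joint.items():
    p_x[x] += p
h_x = -sum(p * math.log2(p) for p in p_x.values() if p > 0)

# Chain rule: H(Y | X) = H(X, Y) - H(X)
print(h_xy, h_x, h_xy - h_x)
```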
Perplexity and Entropy • Given that H(L, P) = -(1/N) \log_2 P(w_1 ... w_N) • Consider the perplexity equation: • PP(W) = P(W)^{-1/N} • = 2^{\log_2 P(W)^{-1/N}} • = 2^{-(1/N) \log_2 P(W)} • = 2^{H(L,P)} • Where H is the entropy of the language L
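The relation PP(W) = 2^{H(L,P)} can be checked numerically; the per-word probabilities below are hypothetical:

```python
import math

# Hypothetical per-word probabilities assigned by a model to a 4-word test string
word_probs = [0.25, 0.1, 0.5, 0.2]
N = len(word_probs)
log_p_W = sum(math.log2(p) for p in word_probs)       # log2 P(W)

cross_entropy = -log_p_W / N                           # H(L, P)
perplexity    = math.prod(word_probs) ** (-1.0 / N)    # PP(W) = P(W)^(-1/N)
print(perplexity, 2 ** cross_entropy)                  # the two agree
```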
Mutual Information • Measure of the information in common between two distributions: I(X; Y) = \sum_x \sum_y p(x, y) \log_2 (p(x, y) / (p(x) p(y))) • Equivalently, I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) • Symmetric: I(X;Y) = I(Y;X) • I(X;Y) = KL(p(x,y)||p(x)p(y))
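A sketch (illustrative values only) computing mutual information directly as KL(p(x,y) || p(x)p(y)), using the same toy joint distribution as the conditional-entropy sketch above:

```python
import math
from collections import defaultdict

joint = {("rain", "wet"): 0.3, ("rain", "dry"): 0.1,
         ("sun",  "wet"): 0.1, ("sun",  "dry"): 0.5}

p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    p_x[x] += p
    p_y[y] += p

# I(X;Y) = KL(p(x,y) || p(x)p(y))
mi = sum(p * math.log2(p / (p_x[x] * p_y[y]))
         for (x, y), p in joint.items() if p > 0)
print(mi)
```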
Classification Task • Task: • C is a finite set of labels (aka categories, classes) • Given x, determine its category y in C • Instance: (x,y) • x: thing to be labeled/classified • y: label/class • Data: set of instances • labeled data: y is known • unlabeled data: y is unknown • Training data, test data
Two Stages • Training: • Learner: training data → classifier • Classifier: f(x) = y: x is the input; y is in C • Testing: • Decoder: test data + classifier → classification output • Also: • Preprocessing • Postprocessing • Evaluation
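A minimal sketch of the two-stage interface, with hypothetical class and function names; the majority-class learner is only a stand-in for a real training procedure:

```python
class MajorityClassifier:
    """Toy classifier: always predicts the most frequent training label."""
    def __init__(self, label):
        self.label = label

    def classify(self, x):          # f(x) = y
        return self.label

def train(training_data):           # learner: training data -> classifier
    labels = [y for _, y in training_data]
    return MajorityClassifier(max(set(labels), key=labels.count))

def decode(test_data, classifier):  # decoder: test data + classifier -> output
    return [classifier.classify(x) for x, _ in test_data]

train_data = [({"hair": "blonde"}, "Burn"), ({"hair": "brown"}, "None"),
              ({"hair": "brown"}, "None")]
clf = train(train_data)
print(decode([({"hair": "red"}, None)], clf))  # ['None']
```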
Roadmap • Decision Trees: • Sunburn example • Decision tree basics • From trees to rules • Key questions • Training procedure? • Decoding procedure? • Overfitting? • Different feature types? • Analysis: Pros & Cons
Learning about Sunburn • Goal: • Train on labeled examples • Predict Burn/None for new instances • Possible solutions? • Exact match: same features, same output • Problem: 2*3^3 = 54 feature combinations • Could be much worse • Same label as the 'most similar' example • Problem: What counts as close? Which features matter? • Many examples match on two features but differ in the result
Learning about Sunburn • Better Solution: Decision tree • Training: • Divide examples into subsets based on feature tests • Sets of samples at leaves define classification • Prediction: • Route NEW instance through tree to leaf based on feature tests • Assign same value as samples at leaf
Sunburn Decision Tree (figure) • Hair Color = Blonde → test Lotion Used: • No → Sarah: Burn, Annie: Burn • Yes → Katie: None, Dana: None • Hair Color = Red → Emily: Burn • Hair Color = Brown → Alex: None, John: None, Pete: None
Decision Tree Structure • Internal nodes: • Each node is a test • Generally tests a single feature • E.g. Hair == ? • Theoretically could test multiple features • Branches: • Each branch corresponds to an outcome of the test • E.g. Hair == Red; Hair != Blond • Leaves: • Each leaf corresponds to a decision • Discrete class: Classification / Decision Tree • Real value: Regression tree
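One possible encoding of this structure (a sketch, not code from the course): internal Node objects test a single feature, each branch maps a test outcome to a subtree, and Leaf objects hold decisions; the sunburn tree from the figure above is used as the example:

```python
from dataclasses import dataclass, field

@dataclass
class Leaf:
    label: str                      # decision at this leaf, e.g. "Burn" / "None"

@dataclass
class Node:
    feature: str                    # single feature tested at this internal node
    branches: dict = field(default_factory=dict)   # test outcome -> subtree

# The sunburn tree from the figure, encoded with this structure
tree = Node("hair", {
    "blonde": Node("lotion", {"no": Leaf("Burn"), "yes": Leaf("None")}),
    "red":    Leaf("Burn"),
    "brown":  Leaf("None"),
})

def classify(tree, instance):
    """Route an instance through the tree to a leaf via its feature tests."""
    while isinstance(tree, Node):
        tree = tree.branches[instance[tree.feature]]
    return tree.label

print(classify(tree, {"hair": "blonde", "lotion": "no"}))  # Burn
```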
From Trees to Rules • Tree: • Each branch from root to leaf = a sequence of tests => a classification • Tests = if-antecedents; leaf label = consequent • Every decision tree can be converted to rules • Not every rule set can be expressed as a tree
From ID Trees to Rules (rules read off the sunburn tree above): • (if (equal haircolor blonde) (equal lotionused yes) (then None)) • (if (equal haircolor blonde) (equal lotionused no) (then Burn)) • (if (equal haircolor red) (then Burn)) • (if (equal haircolor brown) (then None))
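Continuing the Node/Leaf sketch above (a hypothetical helper, not from the slides), each root-to-leaf path can be read off as one if-then rule:

```python
def tree_to_rules(tree, conditions=()):
    """Each root-to-leaf path becomes one if-then rule."""
    if isinstance(tree, Leaf):
        return [(list(conditions), tree.label)]
    rules = []
    for outcome, subtree in tree.branches.items():
        rules += tree_to_rules(subtree, conditions + ((tree.feature, outcome),))
    return rules

for antecedents, label in tree_to_rules(tree):
    print("if", " and ".join(f"{f} == {v}" for f, v in antecedents), "then", label)
```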
Which Tree? • Many possible decision trees for any problem • How can we select among them? • What would be the ‘best’ tree? • Smallest? • Shallowest? • Most accurate on unseen data?
Simplicity • Occam’s Razor: • Simplest explanation that covers the data is best • Occam’s Razor for decision trees: • Smallest tree consistent with samples will be best predictor for new data • Problem: • Finding all trees & finding smallest: Expensive! • Solution: • Greedily build a small tree
Building Trees: Basic Algorithm • Goal: Build a small tree such that all samples at leaves have the same class • Greedy solution: • At each node, pick a test using the 'best' feature • Split into subsets based on the outcomes of the feature test • Repeat the process until the stopping criterion is met • i.e. until leaves have the same class
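A sketch of this greedy procedure (an assumed implementation, reusing the Node and Leaf classes from the earlier sketch; choose_best is any scorer for the 'best' feature, e.g. the disorder measure introduced later):

```python
from collections import Counter

def build_tree(instances, features, choose_best):
    """Greedy decision-tree construction (sketch).

    instances:   list of (feature_dict, label) pairs
    features:    set of features still available to test
    choose_best: function picking the 'best' feature to split on
    """
    labels = [y for _, y in instances]
    if len(set(labels)) == 1 or not features:      # stopping criterion
        return Leaf(Counter(labels).most_common(1)[0][0])

    best = choose_best(instances, features)        # pick test at this node
    node = Node(best, {})
    for value in {x[best] for x, _ in instances}:  # split on test outcomes
        subset = [(x, y) for x, y in instances if x[best] == value]
        node.branches[value] = build_tree(subset, features - {best}, choose_best)
    return node
```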
Key Questions • Splitting: • How do we select the ‘best’ feature? • Stopping: • When do we stop splitting to avoid overfitting? • Features: • How do we split different types of features? • Binary? Discrete? Continuous?
Building Decision Trees: I • Goal: Build a small tree such that all samples at leaves have the same class • Greedy solution: • At each node, pick the test whose branches come closest to having a single class • Split into subsets where most instances are of a uniform class
Picking a Test (candidate splits of the full training set) • Hair Color: Blonde → Sarah:B, Dana:N, Annie:B, Katie:N; Red → Emily:B; Brown → Alex:N, Pete:N, John:N • Height: Tall → Dana:N, Pete:N; Short → Alex:N, Annie:B, Katie:N; Average → Sarah:B, Emily:B, John:N • Lotion Used: Yes → Dana:N, Alex:N, Katie:N; No → Sarah:B, Annie:B, Emily:B, Pete:N, John:N • Weight: Heavy → Emily:B, Pete:N, John:N; Light → Sarah:B, Katie:N; Average → Dana:N, Alex:N, Annie:B
Picking a Test (splits of the remaining Blonde subset) • Height: Tall → Dana:N; Short → Annie:B, Katie:N; Average → Sarah:B • Lotion Used: Yes → Dana:N, Katie:N; No → Sarah:B, Annie:B • Weight: Heavy → (none); Light → Sarah:B, Katie:N; Average → Dana:N, Annie:B
Measuring Disorder • Problem: • In general, tests on large databases don't yield homogeneous subsets • Solution: • A general information-theoretic measure of disorder • Desired features: • Homogeneous set: least disorder = 0 • Even split: most disorder = 1
Measuring Entropy • If we split m objects into 2 bins of size m1 & m2, what is the entropy? • H = -(m1/m) \log_2 (m1/m) - (m2/m) \log_2 (m2/m)
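A sketch (not from the slides) of this disorder measure and its size-weighted average over a split, applied to two of the candidate sunburn tests; the per-branch class counts are taken from the splits listed above:

```python
import math

def bin_entropy(counts):
    """Disorder of one subset, from the class counts it contains."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def split_disorder(bins):
    """Size-weighted average entropy over the subsets a test produces."""
    total = sum(sum(b) for b in bins)
    return sum(sum(b) / total * bin_entropy(b) for b in bins)

# Candidate tests on the sunburn data, as (#Burn, #None) counts per branch
hair   = [(2, 2), (1, 0), (0, 3)]   # blonde, red, brown
height = [(0, 2), (1, 2), (2, 1)]   # tall, short, average
print(split_disorder(hair), split_disorder(height))  # hair yields less disorder
```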