Information Theory, Classification & Decision Trees Ling 572 Advanced Statistical Methods in NLP January 5, 2012
Entropy • Information-theoretic measure • Measures the information in a model • Conceptually, a lower bound on the # of bits needed to encode the data • Entropy: H(X) = -\sum_x p(x) \log_2 p(x), where X is a random variable and p is its probability function
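A minimal sketch (not from the original slides) of computing entropy for a discrete distribution in Python; the distribution values are purely illustrative:

```python
import math

def entropy(p):
    """Entropy H(X) = -sum_x p(x) * log2 p(x), in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

# Hypothetical distribution (illustrative values only)
p = {"a": 0.5, "b": 0.25, "c": 0.25}
print(entropy(p))  # 1.5 bits
```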
Cross-Entropy • Comparing models • The actual distribution p is unknown • Use a simplified model m to estimate it • Cross-entropy: H(p, m) = -\sum_x p(x) \log_2 m(x) • A closer match will have lower cross-entropy
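As a hedged illustration of the comparison, the sketch below scores two hypothetical models m against a true distribution p; the closer model gets the lower cross-entropy (all values are made up):

```python
import math

def cross_entropy(p, m):
    """H(p, m) = -sum_x p(x) * log2 m(x), in bits."""
    return -sum(px * math.log2(m[x]) for x, px in p.items() if px > 0)

p       = {"a": 0.5, "b": 0.25, "c": 0.25}   # true distribution (illustrative)
m_close = {"a": 0.4, "b": 0.3,  "c": 0.3}    # closer model
m_far   = {"a": 0.1, "b": 0.1,  "c": 0.8}    # poorer model
print(cross_entropy(p, m_close) < cross_entropy(p, m_far))  # True
```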
Relative Entropy • Commonly known as Kullback-Leibler (KL) divergence • Expresses the difference between two probability distributions: KL(p || q) = \sum_x p(x) \log_2 (p(x) / q(x)) • Not a proper distance metric: asymmetric • KL(p||q) != KL(q||p)
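A minimal sketch (assumed, not from the slides) that makes the asymmetry concrete; p and q are arbitrary illustrative distributions:

```python
import math

def kl(p, q):
    """KL(p || q) = sum_x p(x) * log2(p(x) / q(x)), in bits."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.1, "b": 0.1,  "c": 0.8}
print(kl(p, q), kl(q, p))  # different values: KL is not symmetric
```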
Joint & Conditional Entropy • Joint entropy: H(X, Y) = -\sum_x \sum_y p(x, y) \log_2 p(x, y) • Conditional entropy: H(Y | X) = -\sum_x \sum_y p(x, y) \log_2 p(y | x) = H(X, Y) - H(X)
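A small sketch (illustrative joint distribution, not from the slides) that computes H(X,Y) and H(X) and recovers H(Y|X) via the chain rule:

```python
import math
from collections import defaultdict

# Hypothetical joint distribution over (X, Y) pairs (illustrative values only)
joint = {("rain", "wet"): 0.3, ("rain", "dry"): 0.1,
         ("sun",  "wet"): 0.1, ("sun",  "dry"): 0.5}

h_xy = -sum(p * math.log2(p) for p in joint.values() if p > 0)

p_x = defaultdict(float)
for (x, _), p in joint.items():
    p_x[x] += p
h_x = -sum(p * math.log2(p) for p in p_x.values() if p > 0)

# Chain rule: H(Y | X) = H(X, Y) - H(X)
print(h_xy, h_x, h_xy - h_x)
```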
Perplexity and Entropy • Given that H(L, P) = -(1/N) \log_2 P(w_1 ... w_N) • Consider the perplexity equation: • PP(W) = P(W)^{-1/N} • = 2^{\log_2 P(W)^{-1/N}} • = 2^{-(1/N) \log_2 P(W)} • = 2^{H(L,P)} • Where H is the entropy of the language L
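The relation PP(W) = 2^{H(L,P)} can be checked numerically; the per-word probabilities below are hypothetical:

```python
import math

# Hypothetical per-word probabilities assigned by a model to a 4-word test string
word_probs = [0.25, 0.1, 0.5, 0.2]
N = len(word_probs)
log_p_W = sum(math.log2(p) for p in word_probs)       # log2 P(W)

cross_entropy = -log_p_W / N                           # H(L, P)
perplexity    = math.prod(word_probs) ** (-1.0 / N)    # PP(W) = P(W)^(-1/N)
print(perplexity, 2 ** cross_entropy)                  # the two agree
```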
Mutual Information • Measure of the information in common between two distributions: I(X; Y) = \sum_x \sum_y p(x, y) \log_2 (p(x, y) / (p(x) p(y))) • Equivalently, I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) • Symmetric: I(X;Y) = I(Y;X) • I(X;Y) = KL(p(x,y)||p(x)p(y))
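A sketch (illustrative values only) computing mutual information directly as KL(p(x,y) || p(x)p(y)), using the same toy joint distribution as the conditional-entropy sketch above:

```python
import math
from collections import defaultdict

joint = {("rain", "wet"): 0.3, ("rain", "dry"): 0.1,
         ("sun",  "wet"): 0.1, ("sun",  "dry"): 0.5}

p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    p_x[x] += p
    p_y[y] += p

# I(X;Y) = KL(p(x,y) || p(x)p(y))
mi = sum(p * math.log2(p / (p_x[x] * p_y[y]))
         for (x, y), p in joint.items() if p > 0)
print(mi)
```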
Classification Task • Task: • C is a finite set of labels (aka categories, classes) • Given x, determine its category y in C • Instance: (x,y) • x: thing to be labeled/classified • y: label/class • Data: set of instances • labeled data: y is known • unlabeled data: y is unknown • Training data, test data
Two Stages • Training: • Learner: training data → classifier • Classifier: f(x) = y: x is the input; y is in C • Testing: • Decoder: test data + classifier → classification output • Also: • Preprocessing • Postprocessing • Evaluation
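A minimal sketch of the two-stage interface, with hypothetical class and function names; the majority-class learner is only a stand-in for a real training procedure:

```python
class MajorityClassifier:
    """Toy classifier: always predicts the most frequent training label."""
    def __init__(self, label):
        self.label = label

    def classify(self, x):          # f(x) = y
        return self.label

def train(training_data):           # learner: training data -> classifier
    labels = [y for _, y in training_data]
    return MajorityClassifier(max(set(labels), key=labels.count))

def decode(test_data, classifier):  # decoder: test data + classifier -> output
    return [classifier.classify(x) for x, _ in test_data]

train_data = [({"hair": "blonde"}, "Burn"), ({"hair": "brown"}, "None"),
              ({"hair": "brown"}, "None")]
clf = train(train_data)
print(decode([({"hair": "red"}, None)], clf))  # ['None']
```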
Roadmap • Decision Trees: • Sunburn example • Decision tree basics • From trees to rules • Key questions • Training procedure? • Decoding procedure? • Overfitting? • Different feature types? • Analysis: Pros & Cons
Learning about Sunburn • Goal: • Train on labeled examples • Predict Burn/None for new instances • Possible solutions? • Exact match: same features, same output • Problem: 2*3^3 = 54 feature combinations • Could be much worse • Same label as the 'most similar' example • Problem: What counts as close? Which features matter? • Many examples match on two features but differ in the result
Learning about Sunburn • Better Solution: Decision tree • Training: • Divide examples into subsets based on feature tests • Sets of samples at leaves define classification • Prediction: • Route NEW instance through tree to leaf based on feature tests • Assign same value as samples at leaf
Sunburn Decision Tree (figure) • Hair Color = Blonde → test Lotion Used: • No → Sarah: Burn, Annie: Burn • Yes → Katie: None, Dana: None • Hair Color = Red → Emily: Burn • Hair Color = Brown → Alex: None, John: None, Pete: None
Decision Tree Structure • Internal nodes: • Each node is a test • Generally tests a single feature • E.g. Hair == ? • Theoretically could test multiple features • Branches: • Each branch corresponds to an outcome of the test • E.g. Hair == Red; Hair != Blond • Leaves: • Each leaf corresponds to a decision • Discrete class: Classification / Decision Tree • Real value: Regression tree
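One possible encoding of this structure (a sketch, not code from the course): internal Node objects test a single feature, each branch maps a test outcome to a subtree, and Leaf objects hold decisions; the sunburn tree from the figure above is used as the example:

```python
from dataclasses import dataclass, field

@dataclass
class Leaf:
    label: str                      # decision at this leaf, e.g. "Burn" / "None"

@dataclass
class Node:
    feature: str                    # single feature tested at this internal node
    branches: dict = field(default_factory=dict)   # test outcome -> subtree

# The sunburn tree from the figure, encoded with this structure
tree = Node("hair", {
    "blonde": Node("lotion", {"no": Leaf("Burn"), "yes": Leaf("None")}),
    "red":    Leaf("Burn"),
    "brown":  Leaf("None"),
})

def classify(tree, instance):
    """Route an instance through the tree to a leaf via its feature tests."""
    while isinstance(tree, Node):
        tree = tree.branches[instance[tree.feature]]
    return tree.label

print(classify(tree, {"hair": "blonde", "lotion": "no"}))  # Burn
```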
From Trees to Rules • Tree: • Each branch from root to leaf = a sequence of tests => a classification • Tests = if-antecedents; leaf label = consequent • Every decision tree can be converted to rules • Not every rule set can be expressed as a tree
From ID Trees to Rules (rules read off the sunburn tree above): • (if (equal haircolor blonde) (equal lotionused yes) (then None)) • (if (equal haircolor blonde) (equal lotionused no) (then Burn)) • (if (equal haircolor red) (then Burn)) • (if (equal haircolor brown) (then None))
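Continuing the Node/Leaf sketch above (a hypothetical helper, not from the slides), each root-to-leaf path can be read off as one if-then rule:

```python
def tree_to_rules(tree, conditions=()):
    """Each root-to-leaf path becomes one if-then rule."""
    if isinstance(tree, Leaf):
        return [(list(conditions), tree.label)]
    rules = []
    for outcome, subtree in tree.branches.items():
        rules += tree_to_rules(subtree, conditions + ((tree.feature, outcome),))
    return rules

for antecedents, label in tree_to_rules(tree):
    print("if", " and ".join(f"{f} == {v}" for f, v in antecedents), "then", label)
```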
Which Tree? • Many possible decision trees for any problem • How can we select among them? • What would be the ‘best’ tree? • Smallest? • Shallowest? • Most accurate on unseen data?
Simplicity • Occam’s Razor: • Simplest explanation that covers the data is best • Occam’s Razor for decision trees: • Smallest tree consistent with samples will be best predictor for new data • Problem: • Finding all trees & finding smallest: Expensive! • Solution: • Greedily build a small tree
Building Trees: Basic Algorithm • Goal: Build a small tree such that all samples at leaves have the same class • Greedy solution: • At each node, pick a test using the 'best' feature • Split into subsets based on the outcomes of the feature test • Repeat the process until the stopping criterion is met • i.e. until leaves have the same class
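A sketch of this greedy procedure (an assumed implementation, reusing the Node and Leaf classes from the earlier sketch; choose_best is any scorer for the 'best' feature, e.g. the disorder measure introduced later):

```python
from collections import Counter

def build_tree(instances, features, choose_best):
    """Greedy decision-tree construction (sketch).

    instances:   list of (feature_dict, label) pairs
    features:    set of features still available to test
    choose_best: function picking the 'best' feature to split on
    """
    labels = [y for _, y in instances]
    if len(set(labels)) == 1 or not features:      # stopping criterion
        return Leaf(Counter(labels).most_common(1)[0][0])

    best = choose_best(instances, features)        # pick test at this node
    node = Node(best, {})
    for value in {x[best] for x, _ in instances}:  # split on test outcomes
        subset = [(x, y) for x, y in instances if x[best] == value]
        node.branches[value] = build_tree(subset, features - {best}, choose_best)
    return node
```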
Key Questions • Splitting: • How do we select the ‘best’ feature? • Stopping: • When do we stop splitting to avoid overfitting? • Features: • How do we split different types of features? • Binary? Discrete? Continuous?
Building Decision Trees: I • Goal: Build a small tree such that all samples at leaves have the same class • Greedy solution: • At each node, pick the test whose branches come closest to having a single class • Split into subsets where most instances are of a uniform class
Picking a Test (candidate splits of the full training set) • Hair Color: Blonde → Sarah:B, Dana:N, Annie:B, Katie:N; Red → Emily:B; Brown → Alex:N, Pete:N, John:N • Height: Tall → Dana:N, Pete:N; Short → Alex:N, Annie:B, Katie:N; Average → Sarah:B, Emily:B, John:N • Lotion Used: Yes → Dana:N, Alex:N, Katie:N; No → Sarah:B, Annie:B, Emily:B, Pete:N, John:N • Weight: Heavy → Emily:B, Pete:N, John:N; Light → Sarah:B, Katie:N; Average → Dana:N, Alex:N, Annie:B
Picking a Test (splits of the remaining Blonde subset) • Height: Tall → Dana:N; Short → Annie:B, Katie:N; Average → Sarah:B • Lotion Used: Yes → Dana:N, Katie:N; No → Sarah:B, Annie:B • Weight: Heavy → (none); Light → Sarah:B, Katie:N; Average → Dana:N, Annie:B
Measuring Disorder • Problem: • In general, tests on large databases don't yield homogeneous subsets • Solution: • A general information-theoretic measure of disorder • Desired features: • Homogeneous set: least disorder = 0 • Even split: most disorder = 1
Measuring Entropy • If we split m objects into 2 bins of size m1 & m2, what is the entropy? • H = -(m1/m) \log_2 (m1/m) - (m2/m) \log_2 (m2/m)
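A sketch (not from the slides) of this disorder measure and its size-weighted average over a split, applied to two of the candidate sunburn tests; the per-branch class counts are taken from the splits listed above:

```python
import math

def bin_entropy(counts):
    """Disorder of one subset, from the class counts it contains."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def split_disorder(bins):
    """Size-weighted average entropy over the subsets a test produces."""
    total = sum(sum(b) for b in bins)
    return sum(sum(b) / total * bin_entropy(b) for b in bins)

# Candidate tests on the sunburn data, as (#Burn, #None) counts per branch
hair   = [(2, 2), (1, 0), (0, 3)]   # blonde, red, brown
height = [(0, 2), (1, 2), (2, 1)]   # tall, short, average
print(split_disorder(hair), split_disorder(height))  # hair yields less disorder
```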