
Information Theory, Classification & Decision Trees

Presentation Transcript


  1. Information Theory, Classification & Decision Trees Ling 572 Advanced Statistical Methods in NLP January 5, 2012

  2. Information Theory

  3. Entropy • An information-theoretic measure • Measures the information in a model • Conceptually, a lower bound on the number of bits needed to encode a message • Entropy H(X), where X is a random variable and p is its probability function
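
In symbols, for a random variable X with probability function p, the standard definition is:

    H(X) = -\sum_{x} p(x) \log_2 p(x)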

  4. Cross-Entropy • Used to compare models • The actual distribution p is unknown • Use a simplified model m to estimate it • The closer the match, the lower the cross-entropy
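
In symbols, with p the true distribution and m the model, the standard definition is:

    H(p, m) = -\sum_{x} p(x) \log_2 m(x)

Since H(p, m) ≥ H(p), with equality only when m = p, a better-matching model yields a lower cross-entropy.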

  13. Relative Entropy • Commonly known as Kullback-Leibler (KL) divergence • Expresses the difference between two probability distributions • Not a proper distance metric: asymmetric • KL(p||q) != KL(q||p)
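
In symbols, the standard definition is:

    KL(p \,\|\, q) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)}

Equivalently, KL(p||q) = H(p, q) - H(p): the extra bits incurred by encoding p with a code built for q.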

  17. Joint & Conditional Entropy • Joint entropy: • Conditional entropy:
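
The standard definitions behind the two bullets are:

    H(X, Y) = -\sum_{x} \sum_{y} p(x, y) \log_2 p(x, y)

    H(Y \mid X) = -\sum_{x} \sum_{y} p(x, y) \log_2 p(y \mid x) = H(X, Y) - H(X)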

  18. Perplexity and Entropy • Given that H(L, P) = -(1/N) log2 P(W) • Consider the perplexity equation: • PP(W) = P(W)^(-1/N) = 2^(-(1/N) log2 P(W)) = 2^(H(L, P)) • Where H is the entropy of the language L
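
A worked-out version of the chain the slide compresses, assuming W = w_1 ... w_N and the per-word cross-entropy estimate H(L, P) = -\frac{1}{N} \log_2 P(w_1 \dots w_N):

    PP(W) = P(w_1 \dots w_N)^{-1/N}
          = 2^{\log_2 P(w_1 \dots w_N)^{-1/N}}
          = 2^{-\frac{1}{N} \log_2 P(w_1 \dots w_N)}
          = 2^{H(L, P)}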

  24. Mutual Information • Measure of information in common between two distributions • Symmetric: I(X;Y) = I(Y;X) • I(X;Y) = KL(p(x,y)||p(x)p(y))
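
In symbols, the standard definition is:

    I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)} = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)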

  25. Decision Trees

  29. Classification Task • Task: • C is a finite set of labels (aka categories, classes) • Given x, determine its category y in C • Instance: (x,y) • x: thing to be labeled/classified • y: label/class • Data: set of instances • labeled data: y is known • unlabeled data: y is unknown • Training data, test data

  32. Two Stages • Training: • Learner: training data → classifier • Classifier: f(x) = y, where x is the input and y is in C • Testing: • Decoder: test data + classifier → classification output • Also: • Preprocessing • Postprocessing • Evaluation
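
A minimal sketch of this two-stage interface in Python, with a trivial majority-class baseline standing in for a real learner (the names and data are illustrative assumptions, not course code):

from collections import Counter

def train(labeled_data):
    """Learner: training data -> classifier (here, a majority-class baseline)."""
    majority = Counter(y for _, y in labeled_data).most_common(1)[0][0]
    def classify(x):          # f(x) = y, with y in C
        return majority
    return classify

def decode(test_data, classify):
    """Decoder: test data + classifier -> classification output."""
    return [classify(x) for x in test_data]

classifier = train([({"hair": "blonde"}, "Burn"),
                    ({"hair": "brown"}, "None"),
                    ({"hair": "brown"}, "None")])
print(decode([{"hair": "red"}], classifier))   # ['None']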

  33. Roadmap • Decision Trees: • Sunburn example • Decision tree basics • From trees to rules • Key questions • Training procedure? • Decoding procedure? • Overfitting? • Different feature type? • Analysis: Pros & Cons

  34. Sunburn Example

  35. Learning about Sunburn • Goal: • Train on labeled examples • Predict Burn/None for new instances • Solution?? • Exact match: same features, same output • Problem: 2×3³ = 54 possible feature combinations • Could be much worse • Same label as ‘most similar’ • Problem: What’s close? Which features matter? • Many instances match on two features but differ in the result

  36. Learning about Sunburn • Better Solution: Decision tree • Training: • Divide examples into subsets based on feature tests • Sets of samples at leaves define classification • Prediction: • Route NEW instance through tree to leaf based on feature tests • Assign same value as samples at leaf

  37. Sunburn Decision Tree
      Hair Color?
        Blonde → Lotion Used?
          No  → Sarah: Burn; Annie: Burn
          Yes → Katie: None; Dana: None
        Red   → Emily: Burn
        Brown → Alex: None; John: None; Pete: None

  38. Decision Tree Structure • Internal nodes: • Each node is a test • Generally tests a single feature • E.g. Hair == ? • Theoretically could test multiple features • Branches: • Each branch corresponds to an outcome of the test • E.g. Hair == Red; Hair != Blonde • Leaves: • Each leaf corresponds to a decision • Discrete class: Classification / Decision Tree • Real value: Regression

  39. From Trees to Rules • Tree: • Each branch from root to leaf = • a conjunction of tests => a classification • Tests form the if-part (antecedents); the leaf label is the then-part (consequent) • All decision trees can be converted to rules • Not all rule sets can be expressed as trees

  40. From ID Trees to Rules • Reading the sunburn tree (slide 37) off branch by branch:
      (if (equal haircolor blonde) (equal lotionused yes) (then None))
      (if (equal haircolor blonde) (equal lotionused no) (then Burn))
      (if (equal haircolor red) (then Burn))
      (if (equal haircolor brown) (then None))
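
For readers who prefer Python to the Lisp-style notation, a hypothetical rendering of the same four rules (the function name is illustrative):

def classify_sunburn(hair_color, lotion_used):
    """Apply the rules read off the sunburn decision tree."""
    if hair_color == "blonde" and lotion_used == "yes":
        return "None"
    if hair_color == "blonde" and lotion_used == "no":
        return "Burn"
    if hair_color == "red":
        return "Burn"
    if hair_color == "brown":
        return "None"
    raise ValueError("feature combination not covered by the rules")

print(classify_sunburn("blonde", "no"))   # Burn
print(classify_sunburn("red", "no"))      # Burn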

  41. Which Tree? • Many possible decision trees for any problem • How can we select among them? • What would be the ‘best’ tree? • Smallest? • Shallowest? • Most accurate on unseen data?

  42. Simplicity • Occam’s Razor: • Simplest explanation that covers the data is best • Occam’s Razor for decision trees: • Smallest tree consistent with samples will be best predictor for new data • Problem: • Finding all trees & finding smallest: Expensive! • Solution: • Greedily build a small tree

  43. Building Trees: Basic Algorithm • Goal: Build a small tree such that all samples at leaves have the same class • Greedy solution: • At each node, pick a test using the ‘best’ feature • Split into subsets based on the outcomes of the feature test • Repeat the process until the stopping criterion is met • i.e., until all leaves have the same class
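
A compact Python sketch of this greedy procedure, using the weighted average entropy of the resulting subsets to pick the ‘best’ feature; the helper names and the nested-dict tree representation are illustrative assumptions, not the course's implementation:

import math
from collections import Counter, defaultdict

def entropy(labels):
    """Disorder of a label set: 0 if homogeneous, 1 for an even two-way split."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_feature(instances):
    """Pick the feature whose split has the lowest weighted average entropy."""
    def avg_entropy(f):
        subsets = defaultdict(list)
        for x, y in instances:
            subsets[x[f]].append(y)
        n = len(instances)
        return sum(len(ys) / n * entropy(ys) for ys in subsets.values())
    return min(instances[0][0].keys(), key=avg_entropy)

def build_tree(instances):
    """Greedy training: split on the 'best' feature until leaves are homogeneous."""
    labels = [y for _, y in instances]
    if len(set(labels)) == 1:              # stopping criterion: pure leaf
        return labels[0]
    f = best_feature(instances)
    subsets = defaultdict(list)
    for x, y in instances:
        subsets[x[f]].append((x, y))
    if len(subsets) == 1:                  # features can no longer separate these instances
        return Counter(labels).most_common(1)[0][0]
    return {f: {v: build_tree(sub) for v, sub in subsets.items()}}

def predict(tree, x):
    """Decoding: route a new instance through the tree to a leaf label."""
    while isinstance(tree, dict):
        f = next(iter(tree))
        tree = tree[f][x[f]]               # assumes x[f] was seen in training
    return tree

The avg_entropy helper is exactly the disorder measure sketched on slides 48-49, weighted by subset size.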

  44. Key Questions • Splitting: • How do we select the ‘best’ feature? • Stopping: • When do we stop splitting to avoid overfitting? • Features: • How do we split different types of features? • Binary? Discrete? Continuous?

  45. Building Decision Trees: I • Goal: Build a small tree such that all samples at leaves have the same class • Greedy solution: • At each node, pick a test such that the branches are as close as possible to having a single class • Split into subsets so that most instances in each subset share one class

  46. Picking a Test • Candidate splits of all eight instances:
      Hair Color: Blonde → Sarah:B, Dana:N, Annie:B, Katie:N | Red → Emily:B | Brown → Alex:N, Pete:N, John:N
      Height: Short → Alex:N, Annie:B, Katie:N | Average → Sarah:B, Emily:B, John:N | Tall → Dana:N, Pete:N
      Lotion: No → Sarah:B, Annie:B, Emily:B, Pete:N, John:N | Yes → Dana:N, Alex:N, Katie:N
      Weight: Light → Sarah:B, Katie:N | Average → Dana:N, Alex:N, Annie:B | Heavy → Emily:B, Pete:N, John:N

  47. Picking a Test • Candidate splits of the remaining Blonde instances:
      Height: Short → Annie:B, Katie:N | Average → Sarah:B | Tall → Dana:N
      Lotion: No → Sarah:B, Annie:B | Yes → Dana:N, Katie:N
      Weight: Light → Sarah:B, Katie:N | Average → Dana:N, Annie:B | Heavy → (none)

  48. Measuring Disorder • Problem: • In general, tests on large DB’s don’t yield homogeneous subsets • Solution: • General information theoretic measure of disorder • Desired features: • Homogeneous set: least disorder = 0 • Even split: most disorder = 1

  49. Measuring Entropy • If we split m objects into 2 bins of sizes m1 and m2, what is the entropy?
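
Following the entropy definition from slide 3, with m = m_1 + m_2:

    H = -\frac{m_1}{m} \log_2 \frac{m_1}{m} - \frac{m_2}{m} \log_2 \frac{m_2}{m}

An even split (m_1 = m_2) gives H = 1, and a homogeneous split (m_2 = 0, taking 0 \log 0 = 0) gives H = 0, matching the desired properties on slide 48.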
