
CS B551: Decision Trees

Presentation Transcript


  1. CS B551: Decision Trees

  2. Agenda • Decision trees • Complexity • Learning curves • Combatting overfitting • Boosting

  3. Recap • Still in supervised setting with logical attributes • Find a representation of CONCEPT in the form: CONCEPT(x) ⇔ S(A,B,…), where S(A,B,…) is a sentence built with the observable attributes, e.g.: CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x))

  4. Predicate as a Decision Tree • The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree: [Figure: root tests A? (False → False); if A is true, test B? (False → True); if B is true, test C? (True → True, False → False)] • Example: a mushroom is poisonous iff it is yellow and small, or yellow, big and spotted • x is a mushroom • CONCEPT = POISONOUS • A = YELLOW • B = BIG • C = SPOTTED

  5. Predicate as a Decision Tree • The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree: [Figure: the same tree as slide 4] • Example: a mushroom is poisonous iff it is yellow and small, or yellow, big and spotted • x is a mushroom • CONCEPT = POISONOUS • A = YELLOW • B = BIG • C = SPOTTED • D = FUNNEL-CAP • E = BULKY

  6. Training Set

  7. Possible Decision Tree • [Figure: a deeper decision tree over the attributes D, E, B, A, C that is also consistent with the training set]

  8. Possible Decision Tree • [Figure: the deep tree from slide 7 shown next to the small tree from slide 4] • The deep tree computes CONCEPT ⇔ (D ∧ (E ∨ A)) ∨ (¬D ∧ C ∧ (¬B ∨ (B ∧ ((E ∧ ¬A) ∨ (¬E ∧ A))))) • The small tree computes CONCEPT ⇔ A ∧ (¬B ∨ C)

  9. Possible Decision Tree • [Figure: the same two trees as slide 8] • The deep tree computes CONCEPT ⇔ (D ∧ (E ∨ A)) ∨ (¬D ∧ C ∧ (¬B ∨ (B ∧ ((E ∧ ¬A) ∨ (¬E ∧ A))))), the small one CONCEPT ⇔ A ∧ (¬B ∨ C) • KIS ("keep it simple") bias ⇒ build the smallest decision tree • Finding the smallest tree is a computationally intractable problem ⇒ use a greedy algorithm

  10. Top-Down Induction of a DT • DTL(D, Predicates): • If all examples in D are positive then return True • If all examples in D are negative then return False • If Predicates is empty then return the majority rule • A ← the error-minimizing predicate in Predicates • Return the tree whose root is A, whose left branch is DTL(D+A, Predicates − {A}), and whose right branch is DTL(D−A, Predicates − {A}), where D+A and D−A are the examples that do and do not satisfy A • [Figure: the tree DTL builds for the training set, splitting on A, then C, then B]
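
Below is a minimal Python sketch of DTL as described on this slide, under assumptions that are mine rather than the slides': each example is a (dict of boolean attributes, boolean label) pair, and a returned tree is either a label or a nested tuple (attribute, true-branch, false-branch). With the mushroom attributes one would call dtl(examples, ['A', 'B', 'C', 'D', 'E']).

    from collections import Counter

    def majority(examples):
        # Majority rule: the most common label among the examples.
        return Counter(label for _, label in examples).most_common(1)[0][0]

    def errors(examples, attr):
        # Number of mistakes the majority rule makes on each branch of attr.
        err = 0
        for value in (True, False):
            branch = [ex for ex in examples if ex[0][attr] == value]
            if branch:
                maj = majority(branch)
                err += sum(1 for _, label in branch if label != maj)
        return err

    def dtl(examples, attributes):
        labels = {label for _, label in examples}
        if labels == {True}:
            return True
        if labels == {False}:
            return False
        if not attributes:
            return majority(examples)
        a = min(attributes, key=lambda attr: errors(examples, attr))  # error-minimizing predicate
        rest = [attr for attr in attributes if attr != a]
        pos = [ex for ex in examples if ex[0][a]]      # D+A
        neg = [ex for ex in examples if not ex[0][a]]  # D-A
        return (a,
                dtl(pos, rest) if pos else majority(examples),
                dtl(neg, rest) if neg else majority(examples))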

  11. Learnable Concepts • Some simple concepts cannot be represented compactly in DTs • Parity(x) = x1 xor x2 xor … xor xn • Majority(x) = 1 if most of the xi's are 1, 0 otherwise • Both need trees of exponential size in the # of attributes • … and hence an exponential # of examples to be learned exactly • The ease of learning depends on shrewdly (or luckily) chosen attributes that correlate with CONCEPT
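
A quick empirical check of the parity claim, sketched under the assumption that scikit-learn is available (DecisionTreeClassifier and its tree_.node_count field are sklearn details, not part of the slides): fit an unpruned tree to the full truth table of n-bit parity and watch the node count blow up.

    from itertools import product
    from sklearn.tree import DecisionTreeClassifier

    for n in range(2, 9):
        X = list(product([0, 1], repeat=n))    # all 2^n attribute vectors
        y = [sum(x) % 2 for x in X]            # parity label
        tree = DecisionTreeClassifier().fit(X, y)
        print(n, tree.tree_.node_count)        # grows roughly like 2^(n+1)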

  12. Performance Issues • Assessing performance: • Training set and test set • Learning curve • [Figure: typical learning curve, plotting % correct on the test set against the size of the training set, rising toward 100]

  13. Performance Issues • Assessing performance: • Training set and test set • Learning curve • [Figure: the same learning curve plateauing below 100%: some concepts are unrealizable within a machine's capacity]
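
A sketch of how such a curve is produced, assuming scikit-learn and a synthetic dataset (make_classification, train_test_split, and DecisionTreeClassifier are sklearn specifics chosen only for brevity): train on growing prefixes of the training set and score on a fixed held-out test set.

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, random_state=0)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)
    for m in [10, 30, 100, 300, 1000]:
        clf = DecisionTreeClassifier(random_state=0).fit(Xtr[:m], ytr[:m])
        print(m, clf.score(Xte, yte))   # fraction correct on the test set vs. training-set size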

  14. Performance Issues • Assessing performance: • Training set and test set • Learning curve • Overfitting: the risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set • [Figure: typical learning curve]

  15. Performance Issues • Assessing performance: • Training set and test set • Learning curve • Overfitting: the risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set • Tree pruning: terminate the recursion when the number of errors (or the information gain) is small

  16. Performance Issues • Assessing performance: • Training set and test set • Learning curve • Overfitting: the risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set • Tree pruning: terminate the recursion when the number of errors (or the information gain) is small • The resulting decision tree + majority rule may then no longer classify every training example correctly
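
Building on the dtl() sketch above, here is a hedged pre-pruning variant of that idea: stop recursing when even the best split does not reduce the error count by at least min_improvement (a hypothetical tuning parameter, not from the slides).

    def dtl_pruned(examples, attributes, min_improvement=1):
        maj = majority(examples)
        base = sum(1 for _, label in examples if label != maj)  # errors of the majority rule
        if attributes and base > 0:
            a = min(attributes, key=lambda attr: errors(examples, attr))
            if base - errors(examples, a) >= min_improvement:   # split only if it helps enough
                pos = [ex for ex in examples if ex[0][a]]
                neg = [ex for ex in examples if not ex[0][a]]
                rest = [attr for attr in attributes if attr != a]
                return (a,
                        dtl_pruned(pos, rest, min_improvement) if pos else maj,
                        dtl_pruned(neg, rest, min_improvement) if neg else maj)
        return maj  # prune: fall back to the majority rule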

  17. Performance Issues • Assessing performance: • Training set and test set • Learning curve • Overfitting • Tree pruning • Incorrect examples • Missing data • Multi-valued and continuous attributes

  18. Using Information Theory • Rather than minimizing the probability of error, minimize the expected number of questions needed to decide whether an object x satisfies CONCEPT • Use the information-theoretic quantity known as information gain • Split on the variable with the highest information gain

  19. Entropy / Information Gain • Entropy encodes the amount of uncertainty in a random variable: • H(X) = −Σ_{x∈Val(X)} P(x) log P(x) • Properties: • H(X) = 0 if X is known, i.e. P(x) = 1 for some value x • H(X) > 0 if X is not known with certainty • H(X) is maximal when P(X) is the uniform distribution • Information gain measures the reduction in uncertainty about X given knowledge of Y: • I(X;Y) = E_y[H(X) − H(X|Y=y)] = Σ_y P(y) Σ_x [P(x|y) log P(x|y) − P(x) log P(x)] • Properties: • Always nonnegative • = 0 iff X and Y are independent • If Y is the choice of attribute, maximizing information gain ⇔ minimizing E_y[H(X|Y=y)]
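
The same formulas in code, as a minimal sketch for the boolean-label, boolean-attribute setting and the example representation assumed in the dtl() sketch above.

    from math import log2

    def entropy(probs):
        # H = -sum p log p, with the convention 0 log 0 = 0
        return -sum(p * log2(p) for p in probs if p > 0)

    def label_entropy(examples):
        # Entropy of the boolean label distribution in a set of examples.
        if not examples:
            return 0.0
        p = sum(1 for _, label in examples if label) / len(examples)
        return entropy([p, 1 - p])

    def info_gain(examples, attr):
        # I(X;Y) = H(X) - E_y[H(X|Y=y)], where X is the label and Y is attr.
        n = len(examples)
        branches = [[ex for ex in examples if ex[0][attr]],
                    [ex for ex in examples if not ex[0][attr]]]
        cond = sum(len(b) / n * label_entropy(b) for b in branches)
        return label_entropy(examples) - cond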

  20. Maximizing IG / Minimizing Conditional Entropy in Decision Trees • E_y[H(X|Y)] = −Σ_y P(y) Σ_x P(x|y) log P(x|y) • Let n be the # of examples • Let n+, n− be the # of examples on the true/false branches of Y • Let p+, p− be the accuracy of the majority rule on the true/false branches of Y • Then P(correct) = (p+n+ + p−n−)/n, with P(correct|Y) = p+ and P(correct|¬Y) = p− • E_y[H(X|Y)] = −(1/n) · (n+[p+ log p+ + (1−p+) log(1−p+)] + n−[p− log p− + (1−p−) log(1−p−)])
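
The last expression in code; the four inputs (branch sizes n+, n− and branch accuracies p+, p−) are exactly the quantities defined on the slide.

    from math import log2

    def h(p):
        # Binary entropy of a probability p, with h(0) = h(1) = 0.
        return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

    def expected_cond_entropy(n_plus, p_plus, n_minus, p_minus):
        # E_y[H(X|Y)] for a boolean test Y: weighted entropy of the two branches.
        n = n_plus + n_minus
        return (n_plus / n) * h(p_plus) + (n_minus / n) * h(p_minus)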

  21. Continuous Attributes • [Figure: training examples plotted along the axis of a continuous attribute, with candidate thresholds between consecutive values] • Continuous attributes can be converted into logical ones via thresholds: X becomes X < a • When considering a split on X, pick the threshold a that minimizes the # of errors (or the entropy)
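
A sketch of threshold selection by entropy, assuming boolean (0/1) labels; candidate thresholds are the midpoints between consecutive distinct sorted values.

    from math import log2

    def H(labels):
        # Entropy of a list of boolean (or 0/1) labels.
        p = sum(labels) / len(labels)
        if p == 0 or p == 1:
            return 0.0
        return -(p * log2(p) + (1 - p) * log2(1 - p))

    def best_threshold(values, labels):
        # Try midpoints between consecutive distinct values; keep the threshold
        # whose split X < a minimizes the expected label entropy.
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best_a, best_h = None, float("inf")
        for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
            if v1 == v2:
                continue
            a = (v1 + v2) / 2
            left = [lab for v, lab in pairs if v < a]
            right = [lab for v, lab in pairs if v >= a]
            cond = len(left) / n * H(left) + len(right) / n * H(right)
            if cond < best_h:
                best_a, best_h = a, cond
        return best_a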

  22. Multi-Valued Attributes • Simple change: consider splits on all values A can take on • Caveat: the more values A can take on, the more important it may appear to be, even if it is irrelevant • More values ⇒ the dataset is split into smaller example sets when picking attributes • Smaller example sets ⇒ more likely to fit spurious noise well
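
A small demonstration of this bias, reusing the H helper from the threshold sketch above (the data here are fabricated noise, purely for illustration): an ID-like attribute with one distinct value per example isolates each label, so its apparent information gain is the maximum possible even though it is pure noise.

    import random

    random.seed(0)
    labels = [random.random() < 0.5 for _ in range(100)]   # random boolean labels
    ids = list(range(100))                                 # irrelevant attribute, 100 distinct values
    groups = {}                                            # split the labels by attribute value
    for v, lab in zip(ids, labels):
        groups.setdefault(v, []).append(lab)
    cond = sum(len(g) / len(labels) * H(g) for g in groups.values())  # 0: each branch is pure
    print(H(labels) - cond)   # ~1.0 bit: the noise attribute looks maximally informative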

  23. Statistical Methods for Addressing Overfitting / Noise • There may be few training examples that match the path leading to a deep node in the decision tree • With such a small sample, the learner is more susceptible to choosing irrelevant or incorrect attributes • Idea: • Make a statistical estimate of predictive power (such estimates become more reliable with larger samples) • Prune branches with low predictive power • Chi-squared pruning

  24. Top-Down DT Pruning • Consider an inner node X that by itself (majority rule) predicts p examples correctly and n examples incorrectly • At its k leaf nodes, the numbers of correctly/incorrectly predicted examples are p1/n1, …, pk/nk • Chi-squared statistical significance test: • Null hypothesis: example labels are chosen at random with distribution p/(p+n) (the split at X is irrelevant) • Alternative hypothesis: examples are not randomly chosen (the split at X is relevant) • Prune below X if the split is not statistically significant

  25. Chi-Squared Test • Let Z = Σ_i [(p_i − p_i')²/p_i' + (n_i − n_i')²/n_i'] • where p_i' = p(p_i + n_i)/(p + n) and n_i' = n(p_i + n_i)/(p + n) are the expected numbers of correct/incorrect examples at leaf node i if the null hypothesis holds • Z is a statistic approximately drawn from the chi-squared distribution with k − 1 degrees of freedom • Look up the p-value of Z in a table; prune if the p-value exceeds α for some α (usually ≈ 0.05)
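
The test in code, as a sketch assuming SciPy is available (chi2.sf is the survival function, 1 − CDF, of the chi-squared distribution):

    from scipy.stats import chi2

    def should_prune(leaf_counts, alpha=0.05):
        # leaf_counts: list of (p_i, n_i) = correct/incorrect counts at each leaf.
        # Returns True when the split is not statistically significant.
        p = sum(pi for pi, _ in leaf_counts)
        n = sum(ni for _, ni in leaf_counts)
        if p == 0 or n == 0:
            return True   # node is already pure; the split tests nothing
        z = 0.0
        for pi, ni in leaf_counts:
            if pi + ni == 0:
                continue
            expected_p = p * (pi + ni) / (p + n)   # expected correct under the null
            expected_n = n * (pi + ni) / (p + n)   # expected incorrect under the null
            z += (pi - expected_p) ** 2 / expected_p + (ni - expected_n) ** 2 / expected_n
        p_value = chi2.sf(z, df=len(leaf_counts) - 1)
        return p_value > alpha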

  26. Ensemble Learning (Boosting)

  27. Idea • It may be difficult to search for a single hypothesis that explains the data • Construct multiple hypotheses (ensemble), and combine their predictions • “Can a set of weak learners construct a single strong learner?” – Michael Kearns, 1988

  28. Motivation • Take 5 classifiers, each with 60% accuracy • On a new example, run them all and pick the prediction by majority voting • If the errors are independent, the combined classifier is correct ≈68% of the time, and with 80%-accurate base classifiers it reaches ≈94% • (In reality the errors will not be independent, but we hope they will be mostly uncorrelated)
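
Checking those numbers with the binomial distribution (the independence assumption is the slide's, and is what makes the arithmetic clean):

    from math import comb

    def majority_accuracy(k, p):
        # P(a strict majority of k independent classifiers is correct),
        # each correct independently with probability p.
        need = k // 2 + 1
        return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(need, k + 1))

    print(majority_accuracy(5, 0.6))    # ~0.683
    print(majority_accuracy(5, 0.8))    # ~0.942
    print(majority_accuracy(101, 0.6))  # ~0.98: many weak voters help a lot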

  29. Boosting • Main idea: • If learner 1 fails to learn an example correctly, this example is more important for learner 2 • If learner 1 and 2 fail to learn an example correctly, this example is more important for learner 3 • … • Weighted training set • Weights encode importance

  30. Boosting • Weighted training set

  31. Boosting • Start with uniform weights wi=1/N • Use learner 1 to generate hypothesis h1 • Adjust weights to give higher importance to misclassified examples • Use learner 2 to generate hypothesis h2 • … • Weight hypotheses according to performance, and return weighted majority
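
A hedged sketch of that loop in the style of AdaBoost (the exact update formulas are in R&N p. 667; the learn_stump argument, a weak learner that accepts example weights, is an assumption of this sketch, and labels are booleans as in the earlier examples):

    from math import log

    def adaboost(examples, learn_stump, rounds):
        # examples: list of (x, y) pairs; learn_stump(examples, weights) -> hypothesis h(x).
        n = len(examples)
        w = [1.0 / n] * n                        # start with uniform weights
        ensemble = []                            # (hypothesis, hypothesis weight) pairs
        for _ in range(rounds):
            h = learn_stump(examples, w)         # weak learner sees the weighted set
            err = sum(wi for wi, (x, y) in zip(w, examples) if h(x) != y)
            if err == 0 or err >= 0.5:           # perfect or useless hypothesis: stop (a simplification)
                break
            beta = err / (1 - err)
            # Down-weight correctly classified examples, so misclassified ones
            # become relatively more important for the next learner.
            w = [wi * (beta if h(x) == y else 1.0) for wi, (x, y) in zip(w, examples)]
            total = sum(w)
            w = [wi / total for wi in w]
            ensemble.append((h, log(1 / beta)))  # more accurate hypotheses get larger votes
        def weighted_majority(x):
            vote = sum(z if h(x) else -z for h, z in ensemble)
            return vote > 0
        return weighted_majority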

  32. Mushroom Example • "Decision stumps": single-attribute DTs

  33. Mushroom Example • Pick C first, learns CONCEPT = C

  34. Mushroom Example • Pick C first, learns CONCEPT = C

  35. Mushroom Example • Update weights (precise formula given in R&N)

  36. Mushroom Example • Next try A, learn CONCEPT=A

  37. Mushroom Example • Next try A, learn CONCEPT=A

  38. Mushroom Example • Update weights

  39. Mushroom Example • Next try E, learn CONCEPT=E

  40. Mushroom Example • Next try E, learn CONCEPT=E

  41. Mushroom Example • Update Weights…

  42. Mushroom Example • Final classifier, order C, A, E, D, B • Weights on the hypotheses are determined by their overall error • Weighted majority with weights A=2.1, B=0.9, C=0.8, D=1.4, E=0.09 • 100% accuracy on the training set

  43. Boosting Strategies • The preceding weighting strategy is the popular AdaBoost algorithm (see R&N p. 667) • Many other strategies exist • Typically, as the number of hypotheses increases, accuracy increases as well • Does this conflict with Occam's razor?

  44. Announcements • Next class: • Neural networks & function learning • R&N 18.6-7
