
CS B351: Decision Trees


Presentation Transcript


  1. CS B351: Decision Trees

  2. Agenda • Decision trees • Learning curves • Combatting overfitting

  3. Classification Tasks • Supervised learning setting • The target function f(x) takes on values True and False • An example is positive if f is True, else it is negative • The set X of all possible examples is the example set • The training set is a subset of X (a small one!)

  4. Logical Classification Dataset • Here, examples (x, f(x)) take on discrete values

  5. Logical Classification Dataset • Here, examples (x, f(x)) take on discrete values; the CONCEPT column of the dataset gives f(x) • Note that the training set does not say whether an observable predicate is pertinent or not

  6. Logical Classification Task • Find a representation of CONCEPT in the form: CONCEPT(x) ⇔ S(A, B, …), where S(A, B, …) is a sentence built with the observable attributes, e.g.: CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x))

  7. Predicate as a Decision Tree • The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree: [tree: test A?; if A is False, return False; if A is True, test B?; if B is False, return True; if B is True, return the value of C] • Example: A mushroom is poisonous iff it is yellow and small, or yellow, big and spotted • x is a mushroom • CONCEPT = POISONOUS • A = YELLOW • B = BIG • C = SPOTTED

  8. Predicate as a Decision Tree • The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree: [tree: test A?; if A is False, return False; if A is True, test B?; if B is False, return True; if B is True, return the value of C] • Example: A mushroom is poisonous iff it is yellow and small, or yellow, big and spotted • x is a mushroom • CONCEPT = POISONOUS • A = YELLOW • B = BIG • C = SPOTTED • D = FUNNEL-CAP • E = BULKY
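
As a concrete illustration, the tree above can be read as nested conditionals. A minimal Python sketch of the mushroom example, assuming Boolean attribute values are passed in directly (the function name and argument interface are not from the slides):

    # Sketch: the decision tree for CONCEPT(x) <=> A(x) AND (NOT B(x) OR C(x)),
    # with A = YELLOW, B = BIG, C = SPOTTED, written as nested conditionals.
    def poisonous(yellow: bool, big: bool, spotted: bool) -> bool:
        if not yellow:        # root test A?: the False branch is a False leaf
            return False
        if not big:           # A true, B false: yellow and small -> poisonous
            return True
        return spotted        # A true, B true: poisonous iff spotted (test C?)

    # yellow, big, not spotted -> not poisonous
    assert poisonous(True, True, False) is False
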

  9. Training Set • [Table of 13 training examples over the observable predicates A–E with their CONCEPT labels; the examples are referred to by number 1–13 on the following slides]

  10. Possible Decision Tree • [Figure: a decision tree consistent with the training set, with root D? and further tests on E, A, C, B, E, and A along its branches]

  11. Possible Decision Tree • [Figure: the large tree from the previous slide shown alongside the small tree for CONCEPT ⇔ A ∧ (¬B ∨ C)] • The large tree computes CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ (C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A))))))

  12. Possible Decision Tree • [Figure: the same two trees] • CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ (C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A)))))) versus CONCEPT ⇔ A ∧ (¬B ∨ C) • KIS bias ⇒ build the smallest decision tree • Computationally intractable problem ⇒ greedy algorithm

  13. Getting Started: Top-Down Induction of Decision Tree • The distribution of the training set is: True: 6, 7, 8, 9, 10, 13; False: 1, 2, 3, 4, 5, 11, 12

  14. Getting Started: Top-Down Induction of Decision Tree • The distribution of the training set is: True: 6, 7, 8, 9, 10, 13; False: 1, 2, 3, 4, 5, 11, 12 • Without testing any observable predicate, we could report that CONCEPT is False (majority rule) with an estimated probability of error P(E) = 6/13 • Assuming that we will only include one observable predicate in the decision tree, which predicate should we test to minimize the probability of error (i.e., the # of misclassified examples in the training set)? ⇒ Greedy algorithm
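
The calculation carried out by hand on the next slides can be summarized in a few lines of Python. This is a hedged sketch, not code from the course: each training example is assumed to be an (attributes_dict, label) pair, and the majority rule is applied in each branch of the candidate split.

    from collections import Counter

    # Sketch: error of testing a single predicate, using the majority rule in each branch.
    def misclassified_if_testing(examples, predicate):
        errors = 0
        for value in (True, False):
            branch = [label for attrs, label in examples if attrs[predicate] == value]
            if branch:
                majority_size = Counter(branch).most_common(1)[0][1]
                errors += len(branch) - majority_size  # everything outside the majority
        return errors

    # Greedy choice: the predicate with the fewest misclassified training examples.
    def best_predicate(examples, predicates):
        return min(predicates, key=lambda p: misclassified_if_testing(examples, p))
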

  15. Assume It’s A • A = True: True examples 6, 7, 8, 9, 10, 13; False examples 11, 12 • A = False: True examples: none; False examples 1, 2, 3, 4, 5 • If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise ⇒ The number of misclassified examples from the training set is 2

  16. Assume It’s B • B = True: True examples 9, 10; False examples 2, 3, 11, 12 • B = False: True examples 6, 7, 8, 13; False examples 1, 4, 5 • If we test only B, we will report that CONCEPT is False if B is True and True otherwise ⇒ The number of misclassified examples from the training set is 5

  17. Assume It’s C • C = True: True examples 6, 8, 9, 10, 13; False examples 1, 3, 4 • C = False: True example 7; False examples 2, 5, 11, 12 • If we test only C, we will report that CONCEPT is True if C is True and False otherwise ⇒ The number of misclassified examples from the training set is 4

  18. Assume It’s D • D = True: True examples 7, 10, 13; False examples 3, 5 • D = False: True examples 6, 8, 9; False examples 1, 2, 4, 11, 12 • If we test only D, we will report that CONCEPT is True if D is True and False otherwise ⇒ The number of misclassified examples from the training set is 5

  19. Assume It’s E • E = True: True examples 8, 9, 10, 13; False examples 1, 3, 5, 12 • E = False: True examples 6, 7; False examples 2, 4, 11 • If we test only E, we will report that CONCEPT is False, independent of the outcome ⇒ The number of misclassified examples from the training set is 6

  20. Assume It’s E • E = True: True examples 8, 9, 10, 13; False examples 1, 3, 5, 12 • E = False: True examples 6, 7; False examples 2, 4, 11 • If we test only E, we will report that CONCEPT is False, independent of the outcome ⇒ The number of misclassified examples from the training set is 6 • So, the best predicate to test is A

  21. Choice of Second Predicate • [Partial tree: A at the root; the A = False branch is a False leaf; the A = True branch tests C] • Among the A = True examples: C = True: True examples 6, 8, 9, 10, 13; C = False: True example 7, False examples 11, 12 ⇒ The number of misclassified examples from the training set is 1

  22. Choice of Third Predicate • [Partial tree: A at the root; A = False → False; A = True → test C; C = True → True; C = False → test B] • Among the A = True, C = False examples: True example 7; False examples 11, 12

  23. Final Tree • [Tree: test A?; if A is False, return False; if A is True, test C?; if C is True, return True; if C is False, test B?; if B is True, return False; if B is False, return True] • The learned tree computes CONCEPT ⇔ A ∧ (C ∨ ¬B), which is the same as the target concept CONCEPT ⇔ A ∧ (¬B ∨ C)

  24. Top-Down Induction of a DT • [Figure: the final tree; the A = True branch is labeled “subset of examples that satisfy A”] • DTL(D, Predicates) • If all examples in D are positive then return True • If all examples in D are negative then return False • If Predicates is empty then return failure • A ← error-minimizing predicate in Predicates • Return the tree whose: - root is A, - left branch is DTL(D+A, Predicates−A), - right branch is DTL(D−A, Predicates−A)

  25. Top-Down Induction of a DT • [Figure: the final tree] • DTL(D, Predicates) • If all examples in D are positive then return True • If all examples in D are negative then return False • If Predicates is empty then return failure (noise in the training set! may return the majority rule instead of failure) • A ← error-minimizing predicate in Predicates • Return the tree whose: - root is A, - left branch is DTL(D+A, Predicates−A), - right branch is DTL(D−A, Predicates−A)

  26. Top-Down Induction of a DT • [Figure: the final tree] • DTL(D, Predicates) • If all examples in D are positive then return True • If all examples in D are negative then return False • If Predicates is empty then return the majority rule • A ← error-minimizing predicate in Predicates • Return the tree whose: - root is A, - left branch is DTL(D+A, Predicates−A), - right branch is DTL(D−A, Predicates−A)
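
A direct Python rendering of the DTL pseudocode above, reusing best_predicate from the earlier sketch. The data representation and helper names are assumptions; trees are returned as nested (predicate, true_branch, false_branch) tuples with Boolean leaves.

    from collections import Counter

    def majority(examples):
        # Majority label among (attributes, label) pairs; ties broken arbitrarily.
        return Counter(label for _, label in examples).most_common(1)[0][0]

    def dtl(examples, predicates):
        labels = {label for _, label in examples}
        if labels == {True}:
            return True
        if labels == {False}:
            return False
        if not predicates:
            return majority(examples)             # majority rule instead of failure
        a = best_predicate(examples, predicates)  # error-minimizing predicate
        pos = [(x, y) for x, y in examples if x[a]]        # D+A: examples satisfying A
        neg = [(x, y) for x, y in examples if not x[a]]    # D-A: examples falsifying A
        rest = [p for p in predicates if p != a]
        left = dtl(pos, rest) if pos else majority(examples)   # empty branch: fall back
        right = dtl(neg, rest) if neg else majority(examples)  # to the parent's majority
        return (a, left, right)
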

  27. Comments • Widely used algorithm • Easy to extend to k-class classification • Greedy • Robust to noise (incorrect examples) • Not incremental

  28. Human-Readability • DTs also have the advantage of being easily understood by humans • Legal requirement in many areas • Loans & mortgages • Health insurance • Welfare

  29. Learnable Concepts • Some simple concepts cannot be represented compactly in DTs • Parity(x) = X1 xor X2 xor … xor Xn • Majority(x) = 1 if most of Xi’s are 1, 0 otherwise • Exponential size in # of attributes • Need exponential # of examples to learn exactly • The ease of learning is dependent on shrewdly (or luckily) chosen attributes that correlate with CONCEPT
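
For concreteness, the two concepts named above, written over a Boolean attribute vector (a sketch; the tuple representation of x is an assumption):

    # Parity and Majority over x = (x1, ..., xn); both are easy to state,
    # but their smallest decision trees grow exponentially with n.
    def parity(x):
        return sum(x) % 2 == 1          # x1 xor x2 xor ... xor xn

    def majority_concept(x):
        return sum(x) > len(x) / 2      # 1 if most of the xi are 1, 0 otherwise
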

  30. Performance Issues • Assessing performance: • Training set and test set • Learning curve • [Figure: typical learning curve — % correct on the test set (up to 100) vs. size of the training set]

  31. Performance Issues • Assessing performance: • Training set and test set • Learning curve • [Figure: typical learning curve — % correct on the test set vs. size of the training set] • Some concepts are unrealizable within a machine’s capacity

  32. Performance Issues • Assessing performance: • Training set and test set • Learning curve • Overfitting: risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set • [Figure: typical learning curve]

  33. Performance Issues • Assessing performance: • Training set and test set • Learning curve • Overfitting: risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set • Tree pruning: terminate recursion when the # of errors (or the information gain) is small

  34. Performance Issues • Assessing performance: • Training set and test set • Learning curve • Overfitting: risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set • Tree pruning: terminate recursion when the # of errors (or the information gain) is small • The resulting decision tree + majority rule may not classify all examples in the training set correctly
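
The slides leave "information gain" informal; one standard reading of "terminate recursion when information gain is small" uses the entropy of the label counts at a node. A sketch (the function names and the 0.01 threshold are assumptions):

    import math

    # p, n: positive/negative counts at a node; branches: list of (pi, ni) per child.
    def entropy(p, n):
        total = p + n
        h = 0.0
        for c in (p, n):
            if c:
                q = c / total
                h -= q * math.log2(q)
        return h

    def information_gain(p, n, branches):
        remainder = sum((pi + ni) / (p + n) * entropy(pi, ni) for pi, ni in branches)
        return entropy(p, n) - remainder

    def should_stop(p, n, branches, threshold=0.01):
        return information_gain(p, n, branches) < threshold
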

  35. Statistical Methods for Addressing Overfitting / Noise • There may be few training examples that match the path leading to a deep node in the decision tree • More susceptible to choosing irrelevant/incorrect attributes when sample is small • Idea: • Make a statistical estimate of predictive power (which increases with larger samples) • Prune branches with low predictive power • Chi-squared pruning

  36. Top-down DT pruning • Consider an inner node X that by itself (majority rule) predicts p examples correctly and n examples incorrectly • At its k leaf nodes, the numbers of correct/incorrect examples are p1/n1, …, pk/nk • Chi-squared statistical significance test: • Null hypothesis: example labels are randomly chosen with distribution p/(p+n) (X is irrelevant) • Alternative hypothesis: examples are not randomly chosen (X is relevant) • Prune X if testing X is not statistically significant

  37. Chi-Squared test • Let Z = Σi [ (pi − pi′)² / pi′ + (ni − ni′)² / ni′ ] • where pi′ = p (pi + ni)/(p + n) and ni′ = n (pi + ni)/(p + n) are the expected numbers of true/false examples at leaf node i if the null hypothesis holds • Z is a statistic that is approximately drawn from the chi-squared distribution with k degrees of freedom • Look up the p-value of Z from a table; prune if the p-value > α for some α (usually ≈ 0.05)
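
A sketch of the pruning test above in Python, replacing the table lookup with SciPy's chi-squared survival function (the data layout and the use of scipy.stats are assumptions, not part of the slides):

    from scipy.stats import chi2

    def chi_squared_prune(branches, alpha=0.05):
        # branches: list of (pi, ni) positive/negative counts at the k leaves under X.
        # Returns True if the split at X is not statistically significant, i.e. prune.
        p = sum(pi for pi, ni in branches)
        n = sum(ni for pi, ni in branches)
        z = 0.0
        for pi, ni in branches:
            expected_p = p * (pi + ni) / (p + n)   # expected positives under the null
            expected_n = n * (pi + ni) / (p + n)   # expected negatives under the null
            if expected_p:
                z += (pi - expected_p) ** 2 / expected_p
            if expected_n:
                z += (ni - expected_n) ** 2 / expected_n
        p_value = chi2.sf(z, df=len(branches))     # df = k per the slide; R&N uses k - 1
        return p_value > alpha                     # large p-value -> X looks irrelevant
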

  38. Performance Issues • Assessing performance: • Training set and test set • Learning curve • Overfitting • Tree pruning • Incorrect examples • Missing data • Multi-valued and continuous attributes

  39. Multi-Valued Attributes • Simple change: consider splits on all values A can take on • Caveat: the more values A can take on, the more important it may appear to be, even if it is irrelevant • More values => dataset split into smaller example sets when picking attributes • Smaller example sets => more likely to fit well to spurious noise

  40. Continuous Attributes • [Figure: candidate thresholds along the observed values of a continuous attribute] • Continuous attributes can be converted into logical ones via thresholds • X => X < a • When considering splitting on X, pick the threshold a to minimize the # of errors / the entropy
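
A sketch of threshold selection for one continuous attribute: sort the observed values, take midpoints between consecutive distinct values as candidate thresholds, and keep the threshold whose majority-rule split misclassifies the fewest training examples (names and data layout are assumptions; entropy could be minimized instead of errors):

    from collections import Counter

    def best_threshold(values_and_labels):
        # values_and_labels: list of (value, label) pairs for one continuous attribute X.
        # Returns (threshold a, errors) minimizing errors for the split X < a vs. X >= a.
        pairs = sorted(values_and_labels)
        candidates = [(u + v) / 2 for (u, _), (v, _) in zip(pairs, pairs[1:]) if u != v]
        best = (None, float("inf"))
        for a in candidates:
            errors = 0
            for in_branch in (lambda v: v < a, lambda v: v >= a):
                branch = [label for v, label in pairs if in_branch(v)]
                if branch:
                    errors += len(branch) - Counter(branch).most_common(1)[0][1]
            if errors < best[1]:
                best = (a, errors)
        return best
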

  41. Decision Boundaries • With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples • [Figure: the (x1, x2) plane split by the single test x1 >= 20, next to the corresponding one-node tree]

  42. Decision Boundaries • With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples • [Figure: the (x1, x2) plane partitioned by the tests x1 >= 20 and x2 >= 10, next to the corresponding depth-2 tree]

  43. Decision Boundaries • With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples • [Figure: the (x1, x2) plane partitioned by the tests x1 >= 20, x2 >= 10, and x2 >= 15, next to the corresponding depth-3 tree]

  44. Decision Boundaries • With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples

  45. Exercise • With 2 attributes, what kinds of decision boundaries can be achieved by a decision tree with arbitrary splitting threshold and maximum depth: • 1? • 2? • 3? • Describe the appearance and the complexity of these decision boundaries

  46. Reading • Next class: • Neural networks & function learning • R&N 18.6-7
