
Decision Tree Learning


  1. Lehrstuhl für Informatik 2 Gabriella Kókai: Machine Learning Decision Tree Learning

  2. Contents • Introduction • Decision Tree representation • Appropriate problems for Decision Tree learning • The basic Decision Tree learning algorithm (ID3) • Hypothesis space search in Decision Tree learning • Inductive bias in Decision Tree learning • Issues in Decision Tree learning • Summary

  3. Introduction • A widely used practical method for inductive inference • Approximates discrete-valued target functions • Searches a completely expressive hypothesis space • Inductive bias: preferring small trees over large ones • Robust to noisy data and capable of learning disjunctive expressions • Learned trees can also be re-represented as sets of if-then rules • Algorithms: ID3, ASSISTANT, C4.5

  4. Contents • Introduction • Decision Tree representation • Appropriate problems for Decision Tree learning • The basic Decision Tree learning algorithm (ID3) • Hypothesis space search in Decision Tree learning • Inductive bias in Decision Tree learning • Issues in Decision Tree learning • Summary

  5. Decision Tree representation • A tree classifies instances: • Node: tests an attribute of the instance • Branch: corresponds to one possible value of that attribute • Leaf: assigns the class to which the instances reaching it belong • Procedure (of classifying): • An instance is classified by starting at the root node of the tree • Repeat: - test the attribute specified by the node - move down the branch corresponding to the value of that attribute in the given example • Example: an instance with Outlook = Sunny and Humidity = High is sorted down the corresponding branches and classified as a negative example (see the sketch after this slide) • In general: a decision tree represents a disjunction of conjunctions of constraints on the attribute values of the instances
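
A minimal Python sketch of this classification procedure (not part of the original slides). The nested-dict tree representation and the example tree, modelled on the PlayTennis tree used later in the lecture, are illustrative assumptions.

    # A decision tree stored as nested dicts: an inner node maps an attribute
    # name to {value: subtree}; a leaf is simply a class label.
    tree = {"Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }}

    def classify(node, instance):
        # Walk from the root down to a leaf, following the branch that matches
        # the instance's value of the attribute tested at the current node.
        while isinstance(node, dict):
            attribute, branches = next(iter(node.items()))
            node = branches[instance[attribute]]
        return node

    # An instance with Outlook = Sunny and Humidity = High is classified as negative.
    print(classify(tree, {"Outlook": "Sunny", "Humidity": "High", "Wind": "Strong"}))  # -> "No"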

  6. Decision Tree Representation 2

  7. Contents • Introduction • Decision Tree Representation • Appropriate Problems for Decision Tree Learning • The Basic Decision Tree Learning Algorithm (ID3) • Hypothesis Space Search in Decision Tree Learning • Inductive Bias in Decision Tree Learning • Issues in Decision Tree Learning • Summary

  8. Appropriate Problems for Decision Tree Learning • Decision tree learning is generally best suited to problems with the following characteristics: • Instances are represented by attribute-value pairs: easiest when each attribute takes on a small number of disjoint possible values; extensions handle real-valued attributes • The target function has discrete output values: extensions allow learning functions with more than two possible output values • Disjunctive descriptions may be required • The training data may contain errors: errors in the classification of the training examples, or errors in the attribute values that describe these examples • The training data may contain missing attribute values • Classification problems: problems in which the task is to classify examples into one of a discrete set of possible categories

  9. Contents • Introduction • Decision Tree Representation • Appropriate Problems for Decision Tree Learning • The Basic Decision Tree Learning Algorithm (ID3) • Hypothesis Space Search in Decision Tree Learning • Inductive Bias in Decision Tree Learning • Issues in Decision Tree Learning • Summary

  10. The Basic Decision Tree Learning Algorithm • Top-down, greedy search through the space of possible decision trees • ID3 (Quinlan 1986), C4.5 (Quinlan 1993) and other variations • Question: Which attribute should be tested at each node of the tree? • Answer: • A statistical test selects the best attribute (the one that, by itself, best classifies the training examples) • A descendant of the root node is created for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node • The process is then repeated at each descendant node • The algorithm never backtracks to reconsider earlier choices

  11. The Basic Decision Tree Learning Algorithm 2
  ID3(examples, attributes)
      Create a root node for the tree
      if all examples are positive, return root with label +
      if all examples are negative, return root with label -
      if attributes is empty, return root with label = most common value of target_attr in examples
      A <- the attribute in attributes with the highest information gain on examples
      attr(root) <- A
      for each possible value vi of A do
          add a new branch below root for the test A = vi
          examples_vi <- the subset of examples that have value vi for A
          if examples_vi is empty
              below this branch add a leaf with label = most common value of target_attr in examples
          else
              below this branch add the subtree ID3(examples_vi, attributes - {A})
      return root
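
The same algorithm as a small runnable Python sketch (an illustration, not Quinlan's implementation). Entropy and information gain are defined as on the following slides; branches are only created for attribute values that actually occur in the examples, so the empty-branch case of the pseudocode does not arise here.

    import math
    from collections import Counter

    def entropy(examples, target):
        # Entropy of the class distribution of `examples` (see the next slides).
        counts = Counter(e[target] for e in examples)
        total = len(examples)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(examples, attribute, target):
        # Expected reduction in entropy from partitioning `examples` on `attribute`.
        remainder = 0.0
        for value in set(e[attribute] for e in examples):
            subset = [e for e in examples if e[attribute] == value]
            remainder += len(subset) / len(examples) * entropy(subset, target)
        return entropy(examples, target) - remainder

    def id3(examples, attributes, target):
        classes = [e[target] for e in examples]
        if len(set(classes)) == 1:              # all examples in one class
            return classes[0]
        if not attributes:                      # no attributes left to test
            return Counter(classes).most_common(1)[0][0]
        best = max(attributes, key=lambda a: information_gain(examples, a, target))
        node = {best: {}}
        for value in set(e[best] for e in examples):
            subset = [e for e in examples if e[best] == value]
            node[best][value] = id3(subset, [a for a in attributes if a != best], target)
        return node

The returned nested-dict tree has the same shape as the one in the classification sketch shown after slide 5.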

  12. Which Attribute Is the Best Classifier • INFORMATION GAIN: measures how well a given attribute separates the training examples • ENTROPY: characterizes the (im)purity of an arbitrary collection of examples • Given: a collection S of positive and negative examples, with p+ the proportion of positive examples in S and p- the proportion of negative examples in S: Entropy(S) = -p+ log2(p+) - p- log2(p-) • Example: S = [9+, 5-]: Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 • Notice: • Entropy is 0 if all members belong to the same class • Entropy is 1 when the collection contains an equal number of positive and negative examples

  13. Which Attribute Is the Best Classifier 2 • Entropy specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S • Generally, for a target attribute that can take on c different values, where p_i is the proportion of S belonging to class i: Entropy(S) = Σ_{i=1..c} -p_i log2(p_i) • (Figure: the entropy function relative to a boolean classification, as the proportion p+ of positive examples varies between 0 and 1)
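
A quick numeric check of these entropy figures (an illustrative sketch, not from the slides):

    import math

    def entropy(p_pos, p_neg):
        # Define 0 * log2(0) = 0, as usual.
        return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

    print(entropy(9/14, 5/14))   # S = [9+, 5-]  -> about 0.940
    print(entropy(7/14, 7/14))   # equal split   -> 1.0
    print(entropy(14/14, 0))     # pure node     -> 0.0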

  14. Information Gain Measures the Expected Reduction in Entropy • INFORMATION GAIN Gain(S, A): the expected reduction in entropy caused by partitioning the examples according to attribute A: Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v) • Values(A) is the set of all possible values for A, and S_v is the subset of S for which attribute A has value v • Example: Values(Wind) = {Weak, Strong}, S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-]: Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong) = 0.940 - (8/14) · 0.811 - (6/14) · 1.00 = 0.048
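
Checking the Gain(S, Wind) figure numerically (an illustrative sketch; the subset counts are those given above):

    import math

    def entropy(pos, neg):
        total = pos + neg
        return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

    gain = entropy(9, 5) - (8/14) * entropy(6, 2) - (6/14) * entropy(3, 3)
    print(round(gain, 3))   # -> 0.048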

  15. Information Gain Measures the Expected Reduction in Entropy 2

  16. An Illustrative Example • ID3 determines the information gain for each candidate attribute: Gain(S, Outlook) = 0.246, Gain(S, Humidity) = 0.151, Gain(S, Wind) = 0.048, Gain(S, Temperature) = 0.029 • Outlook provides the best prediction and is chosen as the root attribute (see the snippet after this slide); for Outlook = Overcast all examples are positive
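
The selection step itself is just an argmax over these gains (a trivial sketch using the figures above):

    gains = {"Outlook": 0.246, "Humidity": 0.151, "Wind": 0.048, "Temperature": 0.029}
    print(max(gains, key=gains.get))   # -> "Outlook"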

  17. An Illustrative Example 2

  18. An Illustrative Example 3 • The process continues for each new leaf node until either: • Every attribute has already been included along the path through the tree, or • The training examples associated with the leaf node all have the same target attribute value

  19. Contents • Introduction • Decision Tree Representation • Appropriate Problems for Decision Tree Learning • The Basic Decision Tree Learning Algorithm (ID3) • Hypothesis Space Search in Decision Tree Learning • Inductive Bias in Decision Tree Learning • Issues in Decision Tree Learning • Summary

  20. Hypothesis Space Search in Decision Tree Learning • Hypothesis space of ID3: the set of possible decision trees • ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space • Beginning: the empty tree • Considering: progressively more elaborate hypotheses • Evaluation function guiding the search: information gain

  21. Hypothesis Space Search in Decision Tree Learning 2 • Capabilities and limitations: • ID3's hypothesis space of all decision trees is the complete space of finite discrete-valued functions, relative to the available attributes => every finite discrete-valued function can be represented by some decision tree => avoids the risk that the hypothesis space might not contain the target function • Maintains only a single current hypothesis (in contrast to the Candidate-Elimination algorithm) • No backtracking in the search => may converge to a locally optimal solution • Uses all training examples at each step => the resulting search is much less sensitive to errors in individual training examples

  22. Contents • Introduction • Decision Tree Representation • Appropriate Problems for Decision Tree Learning • The Basic Decision Tree Learning Algorithm (ID3) • Hypothesis Space Search in Decision Tree Learning • Inductive Bias in Decision Tree Learning • Issues in Decision Tree Learning • Summary

  23. Inductive Bias in Decision Tree Learning • INDUCTIVE BIAS: the set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances • In ID3: the basis is how it chooses one consistent hypothesis over the others. ID3's search strategy: • Favours shorter trees over larger ones • Selects trees in which the attributes with the highest information gain are closest to the root • The bias is difficult to characterise precisely, but approximately: shorter trees are preferred over larger ones • One could imagine an algorithm like ID3 that performs a breadth-first search (BFS-ID3). ID3 can be viewed as an efficient approximation of BFS-ID3, but it exhibits a more complex bias: it does not always find the shortest tree.

  24. Inductive Bias in Decision Tree Learning 2 • A closer approximation to the inductive bias of ID3: shorter trees are preferred over longer trees, and trees that place high-information-gain attributes close to the root are preferred over those that do not • Occam's razor: prefer the simplest hypothesis that fits the data

  25. Restriction Biases and Preference Biases • Difference between the inductive bias exhibited by ID3 and by Candidate-Elimination: • ID3 searches a complete hypothesis space incompletely • Candidate-Elimination searches an incomplete hypothesis space completely • The inductive bias of ID3 follows from its search strategy; the inductive bias of the Candidate-Elimination algorithm follows from the definition of its search space • The inductive bias of ID3 is thus a preference for certain hypotheses over others (a preference bias); the bias of the Candidate-Elimination algorithm takes the form of a categorical restriction on the set of hypotheses (a restriction bias) • Typically a preference bias is more desirable than a restriction bias (the learner can work within a complete hypothesis space) • A restriction bias (which strictly limits the set of potential hypotheses) is generally less desirable, because of the possibility of excluding the unknown target function

  26. Contents • Introduction • Decision Tree Representation • Appropriate Problems for Decision Tree Learning • The Basic Decision Tree Learning Algorithm (ID3) • Hypothesis Space Search in Decision Tree Learning • Inductive Bias in Decision Tree Learning • Issues in Decision Tree Learning • Summary

  27. Issues in Decision Tree Learning • These include: • Determining how deeply to grow the decision tree • Handling continuous attributes • Choosing an appropriate attribute-selection measure • Handling training data with missing attribute values • Handling attributes with different costs • Improving computational efficiency

  28. Avoiding Overfitting the Data • Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training examples, but h' has a smaller error over the entire distribution of instances than h
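
In symbols (a restatement of the definition above, writing error_train for the error over the training examples and error_D for the error over the entire distribution D of instances):

    error_train(h) < error_train(h')   and   error_D(h) > error_D(h')   for some h' ∈ H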

  29. Avoiding Overfitting the Data 2 • How can it be possible that a tree h fits the training examples better than h', yet performs more poorly over subsequent examples? • The training examples contain random errors or noise. Example: adding a positive training example that is incorrectly labeled as negative; it is sorted to the node containing examples D9 and D11, and ID3 then searches for further refinements below that node • Small numbers of examples are associated with leaf nodes (coincidental regularities) • In an experimental study of ID3 involving five different learning tasks with noisy, nondeterministic data, overfitting decreased the accuracy by 10-20% • APPROACHES: • Stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data • Allow the tree to overfit the data and then post-prune it

  30. Avoiding Overfitting the Data 3 • Criteria to determine the correct, final tree size: • Training and validation set: use a set of examples, separate from the training examples, to evaluate the utility of post-pruning nodes from the tree • Use all the available data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set • Use an explicit measure of the complexity of encoding the training examples and the decision tree

  31. Reduced-Error Pruning • How exactly might a validation set be used to prevent overfitting? • Reduced-error pruning: consider each of the decision nodes to be a candidate for pruning • Pruning a node means removing the subtree rooted at that node and replacing it with a leaf whose label is the most common class of the training examples assigned to that node • Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set • Nodes are pruned iteratively, always choosing the node whose removal most increases the accuracy of the decision tree over the validation set • Pruning continues until further pruning is harmful, i.e. until it decreases accuracy over the validation set (see the sketch after this slide)
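
A Python sketch of this pruning loop for the nested-dict trees used in the earlier sketches (an illustration under those assumptions, not a reference implementation):

    import copy
    from collections import Counter

    def classify(node, instance):
        # Same walk-down classification as in the earlier sketch.
        while isinstance(node, dict):
            attribute, branches = next(iter(node.items()))
            node = branches[instance[attribute]]
        return node

    def accuracy(tree, examples, target):
        return sum(classify(tree, e) == e[target] for e in examples) / len(examples)

    def internal_nodes(tree, path=()):
        # Yield the (attribute, value) path leading to every internal (non-leaf) node.
        if isinstance(tree, dict):
            yield path
            attribute, branches = next(iter(tree.items()))
            for value, subtree in branches.items():
                yield from internal_nodes(subtree, path + ((attribute, value),))

    def prune_at(tree, path, leaf_label):
        # Return a copy of the tree with the subtree at `path` replaced by a leaf.
        if not path:
            return leaf_label
        pruned = copy.deepcopy(tree)
        node = pruned
        for attribute, value in path[:-1]:
            node = node[attribute][value]
        attribute, value = path[-1]
        node[attribute][value] = leaf_label
        return pruned

    def reduced_error_pruning(tree, train, validation, target):
        while isinstance(tree, dict):
            baseline = accuracy(tree, validation, target)
            candidates = []
            for path in internal_nodes(tree):
                # Most common class of the training examples sorted to this node.
                reaching = [e for e in train if all(e[a] == v for a, v in path)]
                if not reaching:
                    continue
                label = Counter(e[target] for e in reaching).most_common(1)[0][0]
                pruned = prune_at(tree, path, label)
                candidates.append((accuracy(pruned, validation, target), pruned))
            best_acc, best_tree = max(candidates, key=lambda c: c[0])
            if best_acc < baseline:     # stop when further pruning is harmful
                return tree
            tree = best_tree
        return tree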

  32. Reduced-Error Pruning 2 • Here the validation set used for pruning is distinct from both the training and the test sets • Disadvantage: when data is limited, withholding part of it for the validation set reduces even further the number of examples available for training • Many additional pruning techniques have been proposed

  33. Rule Post-Pruning • STEPS: • Infer the decision tree from the training data, growing the tree until the training data are fit as well as possible and allowing overfitting to occur • Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root to a leaf node • Prune each rule by removing any preconditions whose removal improves its estimated accuracy • Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances • Example: one rule per root-to-leaf path of the PlayTennis tree (see the sketch after this slide and the next slide)
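
A small sketch of the tree-to-rules conversion in step 2, again using the nested-dict tree representation from the earlier sketches (the example tree and the target name PlayTennis are illustrative):

    def tree_to_rules(node, preconditions=()):
        # A leaf yields one rule: (list of attribute tests, class label).
        if not isinstance(node, dict):
            return [(list(preconditions), node)]
        attribute, branches = next(iter(node.items()))
        rules = []
        for value, subtree in branches.items():
            rules += tree_to_rules(subtree, preconditions + ((attribute, value),))
        return rules

    tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                        "Overcast": "Yes",
                        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}}}}
    for conditions, label in tree_to_rules(tree):
        tests = " AND ".join(f"{a} = {v}" for a, v in conditions)
        print(f"IF {tests} THEN PlayTennis = {label}")
    # e.g. IF Outlook = Sunny AND Humidity = High THEN PlayTennis = No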

  34. Rule Post-Pruning 2 • One rule is generated for each leaf node in the tree • Antecedent: each attribute test along the path from the root to the leaf. Consequent: the classification at the leaf. Example: IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No • Prune a rule by removing any antecedent whose removal does not worsen its estimated accuracy. Example: consider removing (Outlook = Sunny) or (Humidity = High) • C4.5 evaluates rule performance using a pessimistic estimate: • Calculate the rule accuracy over the training examples • Calculate the standard deviation in this estimated accuracy, assuming a binomial distribution • For the given confidence level, the lower-bound estimate is then taken as the measure of rule performance (see the sketch after this slide) • Advantage: for large data sets the pessimistic estimate is very close to the observed accuracy
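
A sketch of the pessimistic estimate described above: the observed rule accuracy minus z standard deviations of a binomial proportion. C4.5's actual formula differs in its details; the z value for 95% confidence and the counts used below are illustrative assumptions.

    import math

    def pessimistic_accuracy(correct, covered, z=1.96):
        # Lower bound of a binomial confidence interval around the observed accuracy.
        acc = correct / covered
        std = math.sqrt(acc * (1 - acc) / covered)
        return acc - z * std

    print(pessimistic_accuracy(17, 20))      # small sample: well below the observed 0.85
    print(pessimistic_accuracy(1700, 2000))  # large sample: close to the observed 0.85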

  35. Rule Post-Pruning 3 • Why is it good to convert the decision tree to rules before pruning? • It allows distinguishing among the different contexts in which a decision node is used: 1 path = 1 rule => pruning decisions can be made differently for each path • It removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves: this avoids having to reorganise the tree if the root node is pruned • Converting to rules improves readability: rules are often easier for people to understand

  36. Summary • Decision tree learning is a practical method for concept learning and for learning other discrete-valued functions • ID3 infers decision trees by growing them from the root downward, greedily selecting the next attribute by information gain • ID3 searches a complete hypothesis space => avoids the risk that the target function is not representable in the hypothesis space • The inductive bias implicit in ID3 includes a preference for smaller trees • Overfitting the training data is an important issue, addressed by pruning methods • Extensions of the basic algorithm handle issues such as continuous attributes and missing attribute values
