
Decision Tree Learning


  1. Lehrstuhl für Informatik 2 Gabriella Kókai: Machine Learning Decision Tree Learning

  2. Contents • Introduction • Decision Tree representation • Appropriate problems for Decision Tree learning • The basic Decision Tree learning algorithm (ID3) • Hypothesis space search in Decision Tree learning • Inductive bias in Decision Tree learning • Issues in Decision Tree learning • Summary

  3. Introduction • A widely used practical method for inductive inference • Approximates discrete-valued target functions • Searches a completely expressive hypothesis space • Inductive bias: preferring small trees over large ones • Robust to noisy data and capable of learning disjunctive expressions • Learned trees can also be re-represented as sets of if-then rules • Algorithms: ID3, ASSISTANT, C4.5

  4. Contents • Introduction • Decision Tree representation • Appropriate problems for Decision Tree learning • The basic Decision Tree learning algorithm (ID3) • Hypothesis space search in Decision Tree learning • Inductive bias in Decision Tree learning • Issues in Decision Tree learning • Summary

  5. Decision Tree representation • A tree classifies instances: • Node: tests an attribute of the instance • Branch: corresponds to one possible value of that attribute • Leaf: assigns the class to which the instances reaching it belong • Procedure (of classifying): • An instance is classified by starting at the root node of the tree • Repeat: - test the attribute specified by the node - move down the branch corresponding to the value of that attribute in the given example • Example: an instance with Outlook = Sunny and Humidity = High is sorted down the corresponding branches and classified as a negative example (see the sketch after this slide) • In general: a decision tree represents a disjunction of conjunctions of constraints on the attribute values of the instances
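
A minimal Python sketch of this classification procedure (not part of the original slides). The nested-dict tree representation and the example tree, modelled on the PlayTennis tree used later in the lecture, are illustrative assumptions.

    # A decision tree stored as nested dicts: an inner node maps an attribute
    # name to {value: subtree}; a leaf is simply a class label.
    tree = {"Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }}

    def classify(node, instance):
        # Walk from the root down to a leaf, following the branch that matches
        # the instance's value of the attribute tested at the current node.
        while isinstance(node, dict):
            attribute, branches = next(iter(node.items()))
            node = branches[instance[attribute]]
        return node

    # An instance with Outlook = Sunny and Humidity = High is classified as negative.
    print(classify(tree, {"Outlook": "Sunny", "Humidity": "High", "Wind": "Strong"}))  # -> "No"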

  6. Decision Tree Representation 2

  7. Contents • Introduction • Decision Tree Representation • Appropriate Problems for Decision Tree Learning • The Basic Decision Tree Learning Algorithm (ID3) • Hypothesis Space Search in Decision Tree Learning • Inductive Bias in Decision Tree Learning • Issues in Decision Tree Learning • Summary

  8. Appropriate Problems for Decision Tree Learning • Decision tree learning is generally best suited to problems with the following characteristics: • Instances are represented by attribute-value pairs: easiest when each attribute takes on a small number of disjoint possible values; extensions handle real-valued attributes • The target function has discrete output values: extensions allow learning functions with more than two possible output values • Disjunctive descriptions may be required • The training data may contain errors: errors in the classification of the training examples, or errors in the attribute values that describe these examples • The training data may contain missing attribute values • Classification problems: problems in which the task is to classify examples into one of a discrete set of possible categories

  9. Contents • Introduction • Decision Tree Representation • Appropriate Problems for Decision Tree Learning • The Basic Decision Tree Learning Algorithm (ID3) • Hypothesis Space Search in Decision Tree Learning • Inductive Bias in Decision Tree Learning • Issues in Decision Tree Learning • Summary

  10. The Basic Decision Tree Learning Algorithm • Top-down, greedy search through the space of possible decision trees • ID3 (Quinlan 1986), C4.5 (Quinlan 1993) and other variations • Question: Which attribute should be tested at each node of the tree? • Answer: • A statistical test selects the best attribute (the one that, by itself, best classifies the training examples) • A descendant of the root node is created for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node • The process is then repeated at each descendant node • The algorithm never backtracks to reconsider earlier choices

  11. The Basic Decision Tree Learning Algorithm 2
  ID3(examples, attributes)
      Create a root node for the tree
      if all examples are positive, return root with label +
      if all examples are negative, return root with label -
      if attributes is empty, return root with label = most common value of target_attr in examples
      A <- the attribute in attributes with the highest information gain on examples
      attr(root) <- A
      for each possible value vi of A do
          add a new branch below root for the test A = vi
          examples_vi <- the subset of examples that have value vi for A
          if examples_vi is empty
              below this branch add a leaf with label = most common value of target_attr in examples
          else
              below this branch add the subtree ID3(examples_vi, attributes - {A})
      return root
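
The same algorithm as a small runnable Python sketch (an illustration, not Quinlan's implementation). Entropy and information gain are defined as on the following slides; branches are only created for attribute values that actually occur in the examples, so the empty-branch case of the pseudocode does not arise here.

    import math
    from collections import Counter

    def entropy(examples, target):
        # Entropy of the class distribution of `examples` (see the next slides).
        counts = Counter(e[target] for e in examples)
        total = len(examples)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(examples, attribute, target):
        # Expected reduction in entropy from partitioning `examples` on `attribute`.
        remainder = 0.0
        for value in set(e[attribute] for e in examples):
            subset = [e for e in examples if e[attribute] == value]
            remainder += len(subset) / len(examples) * entropy(subset, target)
        return entropy(examples, target) - remainder

    def id3(examples, attributes, target):
        classes = [e[target] for e in examples]
        if len(set(classes)) == 1:              # all examples in one class
            return classes[0]
        if not attributes:                      # no attributes left to test
            return Counter(classes).most_common(1)[0][0]
        best = max(attributes, key=lambda a: information_gain(examples, a, target))
        node = {best: {}}
        for value in set(e[best] for e in examples):
            subset = [e for e in examples if e[best] == value]
            node[best][value] = id3(subset, [a for a in attributes if a != best], target)
        return node

The returned nested-dict tree has the same shape as the one in the classification sketch shown after slide 5.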

  12. Which Attribute Is the Best Classifier • INFORMATION GAIN: measures how well a given attribute separates the training examples • ENTROPY: characterizes the (im)purity of an arbitrary collection of examples • Given: a collection S of positive and negative examples, with p+ the proportion of positive examples in S and p- the proportion of negative examples in S: Entropy(S) = -p+ log2(p+) - p- log2(p-) • Example: S = [9+, 5-]: Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 • Notice: • Entropy is 0 if all members belong to the same class • Entropy is 1 when the collection contains an equal number of positive and negative examples

  13. Which Attribute Is the Best Classifier 2 • Entropy specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S • Generally, for a target attribute that can take on c different values, where p_i is the proportion of S belonging to class i: Entropy(S) = Σ_{i=1..c} -p_i log2(p_i) • (Figure: the entropy function relative to a boolean classification, as the proportion p+ of positive examples varies between 0 and 1)
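
A quick numeric check of these entropy figures (an illustrative sketch, not from the slides):

    import math

    def entropy(p_pos, p_neg):
        # Define 0 * log2(0) = 0, as usual.
        return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

    print(entropy(9/14, 5/14))   # S = [9+, 5-]  -> about 0.940
    print(entropy(7/14, 7/14))   # equal split   -> 1.0
    print(entropy(14/14, 0))     # pure node     -> 0.0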

  14. Information Gain Measures the Expected Reduction in Entropy • INFORMATION GAIN Gain(S, A): the expected reduction in entropy caused by partitioning the examples according to attribute A: Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v) • Values(A) is the set of all possible values for A, and S_v is the subset of S for which attribute A has value v • Example: Values(Wind) = {Weak, Strong}, S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-]: Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong) = 0.940 - (8/14) · 0.811 - (6/14) · 1.00 = 0.048
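
Checking the Gain(S, Wind) figure numerically (an illustrative sketch; the subset counts are those given above):

    import math

    def entropy(pos, neg):
        total = pos + neg
        return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

    gain = entropy(9, 5) - (8/14) * entropy(6, 2) - (6/14) * entropy(3, 3)
    print(round(gain, 3))   # -> 0.048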

  15. Information Gain Measures the Expected Reduction in Entropy 2

  16. An Illustrative Example • ID3 determines the information gain for each candidate attribute: Gain(S, Outlook) = 0.246, Gain(S, Humidity) = 0.151, Gain(S, Wind) = 0.048, Gain(S, Temperature) = 0.029 • Outlook provides the best prediction and is chosen as the root attribute (see the snippet after this slide); for Outlook = Overcast all examples are positive
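
The selection step itself is just an argmax over these gains (a trivial sketch using the figures above):

    gains = {"Outlook": 0.246, "Humidity": 0.151, "Wind": 0.048, "Temperature": 0.029}
    print(max(gains, key=gains.get))   # -> "Outlook"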

  17. An Illustrative Example 2

  18. An Illustrative Example 3 • The process continues for each new leaf node until either: • Every attribute has already been included along the path through the tree, or • The training examples associated with the leaf node all have the same target attribute value

  19. Contents • Introduction • Decision Tree Representation • Appropriate Problems for Decision Tree Learning • The Basic Decision Tree Learning Algorithm (ID3) • Hypothesis Space Search in Decision Tree Learning • Inductive Bias in Decision Tree Learning • Issues in Decision Tree Learning • Summary

  20. Hypothesis Space Search in Decision Tree Learning • Hypothesis space of ID3: the set of possible decision trees • ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space • Beginning: the empty tree • Considering: progressively more elaborate hypotheses • Evaluation function guiding the search: information gain

  21. Hypothesis Space Search in Decision Tree Learning 2 • Capabilities and limitations: • ID3's hypothesis space of all decision trees is the complete space of finite discrete-valued functions, relative to the available attributes => every finite discrete-valued function can be represented by some decision tree => avoids the risk that the hypothesis space might not contain the target function • Maintains only a single current hypothesis (in contrast to the Candidate-Elimination algorithm) • No backtracking in the search => may converge to a locally optimal solution • Uses all training examples at each step => the resulting search is much less sensitive to errors in individual training examples

  22. Contents • Introduction • Decision Tree Representation • Appropriate Problems for Decision Tree Learning • The Basic Decision Tree Learning Algorithm (ID3) • Hypothesis Space Search in Decision Tree Learning • Inductive Bias in Decision Tree Learning • Issues in Decision Tree Learning • Summary

  23. Inductive Bias in Decision Tree Learning • INDUCTIVE BIAS: the set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances • In ID3: the basis is how it chooses one consistent hypothesis over the others. ID3's search strategy: • Favours shorter trees over larger ones • Selects trees in which the attributes with the highest information gain are closest to the root • The bias is difficult to characterise precisely, but approximately: shorter trees are preferred over larger ones • One could imagine an algorithm like ID3 that performs a breadth-first search (BFS-ID3). ID3 can be viewed as an efficient approximation of BFS-ID3, but it exhibits a more complex bias: it does not always find the shortest tree.

  24. Inductive Bias in Decision Tree Learning 2 • A closer approximation to the inductive bias of ID3: shorter trees are preferred over longer trees, and trees that place high-information-gain attributes close to the root are preferred over those that do not • Occam's razor: prefer the simplest hypothesis that fits the data

  25. Restriction Biases and Preference Biases • Difference between the inductive bias exhibited by ID3 and by Candidate-Elimination: • ID3 searches a complete hypothesis space incompletely • Candidate-Elimination searches an incomplete hypothesis space completely • The inductive bias of ID3 follows from its search strategy; the inductive bias of the Candidate-Elimination algorithm follows from the definition of its search space • The inductive bias of ID3 is thus a preference for certain hypotheses over others (a preference bias); the bias of the Candidate-Elimination algorithm takes the form of a categorical restriction on the set of hypotheses (a restriction bias) • Typically a preference bias is more desirable than a restriction bias (the learner can work within a complete hypothesis space) • A restriction bias (which strictly limits the set of potential hypotheses) is generally less desirable, because of the possibility of excluding the unknown target function

  26. Contents • Introduction • Decision Tree Representation • Appropriate Problems for Decision Tree Learning • The Basic Decision Tree Learning Algorithm (ID3) • Hypothesis Space Search in Decision Tree Learning • Inductive Bias in Decision Tree Learning • Issues in Decision Tree Learning • Summary

  27. Issues in Decision Tree Learning • These include: • Determining how deeply to grow the decision tree • Handling continuous attributes • Choosing an appropriate attribute-selection measure • Handling training data with missing attribute values • Handling attributes with different costs • Improving computational efficiency

  28. Avoiding Overfitting the Data • Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training examples, but h' has a smaller error over the entire distribution of instances than h
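
In symbols (a restatement of the definition above, writing error_train for the error over the training examples and error_D for the error over the entire distribution D of instances):

    error_train(h) < error_train(h')   and   error_D(h) > error_D(h')   for some h' ∈ H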

  29. Avoiding Overfitting the Data 2 • How can it be possible that a tree h fits the training examples better than h', yet performs more poorly over subsequent examples? • The training examples contain random errors or noise. Example: adding a positive training example that is incorrectly labeled as negative; it is sorted to the node containing examples D9 and D11, and ID3 then searches for further refinements below that node • Small numbers of examples are associated with leaf nodes (coincidental regularities) • In an experimental study of ID3 involving five different learning tasks with noisy, nondeterministic data, overfitting decreased the accuracy by 10-20% • APPROACHES: • Stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data • Allow the tree to overfit the data and then post-prune it

  30. Avoiding Overfitting the Data 3 • Criteria to determine the correct, final tree size: • Training and validation set: use a set of examples, separate from the training examples, to evaluate the utility of post-pruning nodes from the tree • Use all the available data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set • Use an explicit measure of the complexity of encoding the training examples and the decision tree

  31. Reduced-Error Pruning • How exactly might a validation set be used to prevent overfitting? • Reduced-error pruning: consider each of the decision nodes to be a candidate for pruning • Pruning a node means removing the subtree rooted at that node and replacing it with a leaf whose label is the most common class of the training examples assigned to that node • Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set • Nodes are pruned iteratively, always choosing the node whose removal most increases the accuracy of the decision tree over the validation set • Pruning continues until further pruning is harmful, i.e. until it decreases accuracy over the validation set (see the sketch after this slide)
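
A Python sketch of this pruning loop for the nested-dict trees used in the earlier sketches (an illustration under those assumptions, not a reference implementation):

    import copy
    from collections import Counter

    def classify(node, instance):
        # Same walk-down classification as in the earlier sketch.
        while isinstance(node, dict):
            attribute, branches = next(iter(node.items()))
            node = branches[instance[attribute]]
        return node

    def accuracy(tree, examples, target):
        return sum(classify(tree, e) == e[target] for e in examples) / len(examples)

    def internal_nodes(tree, path=()):
        # Yield the (attribute, value) path leading to every internal (non-leaf) node.
        if isinstance(tree, dict):
            yield path
            attribute, branches = next(iter(tree.items()))
            for value, subtree in branches.items():
                yield from internal_nodes(subtree, path + ((attribute, value),))

    def prune_at(tree, path, leaf_label):
        # Return a copy of the tree with the subtree at `path` replaced by a leaf.
        if not path:
            return leaf_label
        pruned = copy.deepcopy(tree)
        node = pruned
        for attribute, value in path[:-1]:
            node = node[attribute][value]
        attribute, value = path[-1]
        node[attribute][value] = leaf_label
        return pruned

    def reduced_error_pruning(tree, train, validation, target):
        while isinstance(tree, dict):
            baseline = accuracy(tree, validation, target)
            candidates = []
            for path in internal_nodes(tree):
                # Most common class of the training examples sorted to this node.
                reaching = [e for e in train if all(e[a] == v for a, v in path)]
                if not reaching:
                    continue
                label = Counter(e[target] for e in reaching).most_common(1)[0][0]
                pruned = prune_at(tree, path, label)
                candidates.append((accuracy(pruned, validation, target), pruned))
            best_acc, best_tree = max(candidates, key=lambda c: c[0])
            if best_acc < baseline:     # stop when further pruning is harmful
                return tree
            tree = best_tree
        return tree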

  32. Reduced-Error Pruning 2 • Here the validation set used for pruning is distinct from both the training and the test sets • Disadvantage: when data is limited, withholding part of it for the validation set reduces even further the number of examples available for training • Many additional pruning techniques have been proposed

  33. Rule Post-Pruning • STEPS: • Infer the decision tree from the training data, growing the tree until the training data are fit as well as possible and allowing overfitting to occur • Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root to a leaf node • Prune each rule by removing any preconditions whose removal improves its estimated accuracy • Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances • Example: one rule per root-to-leaf path of the PlayTennis tree (see the sketch after this slide and the next slide)
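
A small sketch of the tree-to-rules conversion in step 2, again using the nested-dict tree representation from the earlier sketches (the example tree and the target name PlayTennis are illustrative):

    def tree_to_rules(node, preconditions=()):
        # A leaf yields one rule: (list of attribute tests, class label).
        if not isinstance(node, dict):
            return [(list(preconditions), node)]
        attribute, branches = next(iter(node.items()))
        rules = []
        for value, subtree in branches.items():
            rules += tree_to_rules(subtree, preconditions + ((attribute, value),))
        return rules

    tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                        "Overcast": "Yes",
                        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}}}}
    for conditions, label in tree_to_rules(tree):
        tests = " AND ".join(f"{a} = {v}" for a, v in conditions)
        print(f"IF {tests} THEN PlayTennis = {label}")
    # e.g. IF Outlook = Sunny AND Humidity = High THEN PlayTennis = No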

  34. Rule Post-Pruning 2 • One rule is generated for each leaf node in the tree • Antecedent: each attribute test along the path from the root to the leaf. Consequent: the classification at the leaf. Example: IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No • Prune a rule by removing any antecedent whose removal does not worsen its estimated accuracy. Example: consider removing (Outlook = Sunny) or (Humidity = High) • C4.5 evaluates rule performance using a pessimistic estimate: • Calculate the rule accuracy over the training examples • Calculate the standard deviation in this estimated accuracy, assuming a binomial distribution • For the given confidence level, the lower-bound estimate is then taken as the measure of rule performance (see the sketch after this slide) • Advantage: for large data sets the pessimistic estimate is very close to the observed accuracy
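
A sketch of the pessimistic estimate described above: the observed rule accuracy minus z standard deviations of a binomial proportion. C4.5's actual formula differs in its details; the z value for 95% confidence and the counts used below are illustrative assumptions.

    import math

    def pessimistic_accuracy(correct, covered, z=1.96):
        # Lower bound of a binomial confidence interval around the observed accuracy.
        acc = correct / covered
        std = math.sqrt(acc * (1 - acc) / covered)
        return acc - z * std

    print(pessimistic_accuracy(17, 20))      # small sample: well below the observed 0.85
    print(pessimistic_accuracy(1700, 2000))  # large sample: close to the observed 0.85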

  35. Rule Post-Pruning 3 • Why is it good to convert the decision tree to rules before pruning? • It allows distinguishing among the different contexts in which a decision node is used: 1 path = 1 rule => pruning decisions can be made differently for each path • It removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves: this avoids having to reorganise the tree if the root node is pruned • Converting to rules improves readability: rules are often easier for people to understand

  36. Summary • Decision tree learning is a practical method for concept learning and for learning other discrete-valued functions • ID3 infers decision trees by growing them from the root downward, greedily selecting the next attribute by information gain • ID3 searches a complete hypothesis space => avoids the risk that the target function is not representable in the hypothesis space • The inductive bias implicit in ID3 includes a preference for smaller trees • Overfitting the training data is an important issue, addressed by pruning methods • Extensions of the basic algorithm handle issues such as continuous attributes and missing attribute values
