Decision Tree Classification Prof. Navneet Goyal BITS, Pilani BITS C464 – Machine Learning


General Approach Figure taken from text book (Tan, Steinbach, Kumar)

Classification by Decision Tree Induction
• Decision tree: a classification scheme
• Represents: a model of the different classes
• Generates: a tree and a set of rules
• A node without children is a leaf node; otherwise it is an internal node.
• Each internal node has an associated splitting predicate, e.g. a binary predicate.
• Example predicates:
  • Age <= 20
  • Profession in {student, teacher}
  • 5000*Age + 3*Salary - 10000 > 0
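The example predicates can be read as boolean functions of a record. The following is a minimal sketch, assuming a record is a Python dict with the hypothetical keys `age`, `profession`, and `salary`; the function names are illustrative only.

```python
# Each splitting predicate maps a record to True/False (a binary split).
# The record layout (dict keys) is an assumption made for this illustration.

def age_at_most_20(record):
    return record["age"] <= 20

def is_student_or_teacher(record):
    return record["profession"] in {"student", "teacher"}

def linear_combination(record):
    # A predicate may combine several attributes at once.
    return 5000 * record["age"] + 3 * record["salary"] - 10000 > 0

predicates = (age_at_most_20, is_student_or_teacher, linear_combination)

record = {"age": 30, "profession": "engineer", "salary": 1000}  # hypothetical record
outcomes = [p(record) for p in predicates]
print(outcomes)
```

Each outcome decides which of the two child branches the record follows at the node that carries that predicate.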

Classification by Decision Tree Induction
• Decision tree
  • A flow-chart-like tree structure
  • Internal node denotes a test on an attribute
  • Branch represents an outcome of the test
  • Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
  • Tree construction
    • At start, all the training examples are at the root
    • Partition examples recursively based on selected attributes
  • Tree pruning
    • Identify and remove branches that reflect noise or outliers
• Use of decision tree: classifying an unknown sample
  • Test the attribute values of the sample against the decision tree

Classification by Decision Tree Induction
Decision tree classifiers are very popular. Why?
• They do not require domain knowledge or parameter setting, and are therefore suitable for exploratory knowledge discovery
• They can handle high-dimensional data
• Representation of acquired knowledge in tree form is intuitive and easy for humans to assimilate
• Learning and classification steps are simple and fast
• Accuracy is generally good

Classification by Decision Tree Induction
Main Algorithms
• Hunt’s algorithm
• ID3
• C4.5
• CART
• SLIQ, SPRINT

Example of a Decision Tree
Training data columns: categorical, categorical, continuous, class
Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund
  Yes: NO
  No: MarSt
    Married: NO
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES
Figure taken from text book (Tan, Steinbach, Kumar)

Another Example of Decision Tree
Training data columns: categorical, categorical, continuous, class
MarSt
  Married: NO
  Single, Divorced: Refund
    Yes: NO
    No: TaxInc
      < 80K: NO
      > 80K: YES
There could be more than one tree that fits the same data!
Figure taken from text book (Tan, Steinbach, Kumar)

Some Questions
• Which tree is better, and why?
• How many decision trees are there?
• How do we find the optimal tree?
• Is it computationally feasible?
  • (Try constructing a suboptimal tree in a reasonable amount of time: greedy algorithm)
• What should be the order of the splits?
• Look for answers in the “20 questions” and “Guess Who” games!

Apply Model to Test Data
Start from the root of the tree.
Refund
  Yes: NO
  No: MarSt
    Married: NO
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES
Figure taken from text book (Tan, Steinbach, Kumar)

Apply Model to Test Data
The traversal of the Refund / MarSt / TaxInc tree ends at a leaf labeled NO: assign Cheat to “No”.
Figure taken from text book (Tan, Steinbach, Kumar)
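The traversal can be sketched in Python. This is an illustration under stated assumptions, not the textbook's implementation: the nested-dict encoding, the function names, and the test record (Refund = No, MarSt = Married, TaxInc = 80K) are all hypothetical, chosen so that the walk ends with Cheat = "No" as on the slide.

```python
# Internal node: {"test": function(record) -> branch key, "branches": {...}}
# Leaf: a plain string holding the predicted value of the class "Cheat".

def taxinc_branch(record):
    # The slide labels the branches "< 80K" and "> 80K"; sending exactly 80K
    # to the upper branch is an assumption of this sketch.
    return "< 80K" if record["TaxInc"] < 80 else ">= 80K"

TAXINC = {"test": taxinc_branch, "branches": {"< 80K": "No", ">= 80K": "Yes"}}
MARST = {
    "test": lambda r: r["MarSt"],
    "branches": {"Married": "No", "Single": TAXINC, "Divorced": TAXINC},
}
ROOT = {"test": lambda r: r["Refund"], "branches": {"Yes": "No", "No": MARST}}

def classify(node, record):
    """Start at the root and follow matching branches until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][node["test"](record)]
    return node

test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}  # hypothetical
cheat = classify(ROOT, test_record)
print(cheat)  # No
```

Classification cost is proportional to the depth of the path followed, not to the size of the training set.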

Decision Trees: Example
Training Data Set
Outlook    Temp  Humidity  Windy  Class
Sunny      79    90        True   No Play
Sunny      56    70        False  Play
Sunny      79    75        True   Play
Sunny      60    90        True   No Play
Overcast   88    88        False  Play
Overcast   63    75        True   Play
Overcast   88    95        False  Play
Rain       78    60        False  Play
Rain       66    70        False  No Play
Rain       68    60        True   No Play
Numerical attributes: Temperature, Humidity
Categorical attributes: Outlook, Windy
Class label: Class

Decision Trees: Example
Sample Decision Tree
Outlook
  sunny: Humidity
    <= 75: Play
    > 75: No Play
  overcast: Play
  rain: Windy
    true: No Play
    false: Play
Five leaf nodes, each of which represents a rule

Decision Trees: Example
Rules corresponding to the given tree
• If it is a sunny day and humidity is not above 75%, then play.
• If it is a sunny day and humidity is above 75%, then do not play.
• If it is overcast, then play.
• If it is rainy and not windy, then play.
• If it is rainy and windy, then do not play.
Is it the best classification?

Decision Trees: Example
Classification of a new record
New record: outlook = rain, temp = 70, humidity = 65, windy = true.
Class: “No Play”
The accuracy of the classifier is determined by the percentage of the test data set that is correctly classified.
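The five rules of the sample tree and the new record can be checked with a few lines of Python. This is a sketch; the function name and the dict layout of the record are illustrative choices.

```python
def classify(outlook, humidity, windy):
    """Apply the five rules of the sample tree (temperature is not tested by it)."""
    if outlook == "sunny":
        return "Play" if humidity <= 75 else "No Play"
    if outlook == "overcast":
        return "Play"
    if outlook == "rain":
        return "No Play" if windy else "Play"
    raise ValueError(f"unknown outlook: {outlook!r}")

new_record = {"outlook": "rain", "temp": 70, "humidity": 65, "windy": True}
label = classify(new_record["outlook"], new_record["humidity"], new_record["windy"])
print(label)  # No Play
```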

Decision Trees: Example
Test Data Set
Outlook    Temp  Humidity  Windy  Class
Sunny      79    90        True   Play
Sunny      56    70        False  Play
Sunny      79    75        True   No Play
Sunny      60    90        True   No Play
Overcast   88    88        False  No Play
Overcast   63    75        True   Play
Overcast   88    95        False  Play
Rain       78    60        False  Play
Rain       66    70        False  No Play
Rain       68    60        True   Play
Rule 1 (sunny, humidity <= 75): two records, one correctly classified; accuracy = 50%
Rule 2 (sunny, humidity > 75): accuracy = 50%
Rule 3 (overcast): accuracy ≈ 67% (two of three records)
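The per-rule accuracies can be recomputed from the test data set. The sketch below hard-codes the ten test records from the slide and the five rules of the sample tree; the rule numbering follows the order in which the rules were listed, and the helper name is illustrative.

```python
from collections import defaultdict

def rule_and_prediction(outlook, humidity, windy):
    """Return (rule number, predicted class) for the five rules of the sample tree."""
    if outlook == "Sunny":
        return (1, "Play") if humidity <= 75 else (2, "No Play")
    if outlook == "Overcast":
        return (3, "Play")
    return (5, "No Play") if windy else (4, "Play")  # Rain

# Test data set: (outlook, temp, humidity, windy, actual class)
TEST_SET = [
    ("Sunny", 79, 90, True, "Play"),
    ("Sunny", 56, 70, False, "Play"),
    ("Sunny", 79, 75, True, "No Play"),
    ("Sunny", 60, 90, True, "No Play"),
    ("Overcast", 88, 88, False, "No Play"),
    ("Overcast", 63, 75, True, "Play"),
    ("Overcast", 88, 95, False, "Play"),
    ("Rain", 78, 60, False, "Play"),
    ("Rain", 66, 70, False, "No Play"),
    ("Rain", 68, 60, True, "Play"),
]

hits = defaultdict(int)    # correctly classified records per rule
totals = defaultdict(int)  # records covered per rule
for outlook, _temp, humidity, windy, actual in TEST_SET:
    rule, predicted = rule_and_prediction(outlook, humidity, windy)
    totals[rule] += 1
    hits[rule] += int(predicted == actual)

per_rule = {rule: hits[rule] / totals[rule] for rule in sorted(totals)}
overall = sum(hits.values()) / len(TEST_SET)
print(per_rule)
print(overall)
```

On this test set, rules 1, 2, and 4 each classify one of their two records correctly, rule 3 two of three, and rule 5 none, so the overall accuracy is 5 of 10 records.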

Practical Issues of Classification • Underfitting and Overfitting • Missing Values • Costs of Classification

Overfitting the Data
• A classification model commits two kinds of errors:
  • Training Errors (TE) (resubstitution, apparent errors)
  • Generalization Errors (GE)
• A good classification model must have low TE as well as low GE
• A model that fits the training data too well can have a higher GE than a model with a higher TE
• This problem is known as model overfitting

Underfitting and Overfitting
• Underfitting: when the model is too simple, both training and test errors are large. TE and GE are both large when the tree is very small.
• Underfitting occurs because the model has yet to learn the true structure of the data, so it performs poorly on both the training and test sets.
Figure taken from text book (Tan, Steinbach, Kumar)

Overfitting the Data • When a decision tree is built, many of the branches may reflect anomalies in the training data due to noise or outliers. • We may grow the tree just deeply enough to perfectly classify the training data set. • This problem is known as overfitting the data.

Overfitting the Data
• TE of a model can be reduced by increasing the model complexity
• Leaf nodes of the tree can be expanded until the tree perfectly fits the training data
• TE for such a complex tree = 0
• GE can be large because the tree may accidentally fit noise points in the training set
• Overfitting and underfitting are two pathologies that are related to model complexity

Occam’s Razor
• Given two models with similar generalization errors, one should prefer the simpler model over the more complex one
• A complex model has a greater chance of having been fitted accidentally to errors in the data
• Therefore, one should include model complexity when evaluating a model

Definition A decision tree T is said to overfit the training data if there exists some other tree T’ which is a simplification of T, such that T has smaller error than T’ over the training set but T’ has a smaller error than T over the entire distribution of the instances.
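The definition can be written compactly. The notation below is introduced here for illustration and is not from the slides: err_train denotes the error over the training set and err_D the error over the entire distribution of instances. T overfits if there exists a simplification T' of T such that:

```latex
\[
\operatorname{err}_{\text{train}}(T) < \operatorname{err}_{\text{train}}(T')
\quad\text{and}\quad
\operatorname{err}_{\mathcal{D}}(T') < \operatorname{err}_{\mathcal{D}}(T)
\]
```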

Problems of Overfitting Overfitting can lead to many difficulties: • Overfitted models are incorrect. • Require more space and more computational resources • Require collection of unnecessary features • They are more difficult to comprehend

Overfitting
Overfitting can be due to:
• Presence of noise
• Lack of representative samples

Overfitting: Example Presence of Noise: Training Set Table taken from text book (Tan, Steinbach, Kumar)

Overfitting: Example Presence of Noise:Test Set Table taken from text book (Tan, Steinbach, Kumar)

Overfitting: Example
Presence of Noise: Models
Model M1 (TE = 0%, GE = 30%):
Body Temp
  Warm blooded: Gives Birth
    Yes: 4-legged
      Yes: Mammals
      No: Non-mammals
    No: Non-mammals
  Cold blooded: Non-mammals
Model M2 (TE = 20%, GE = 10%):
Body Temp
  Warm blooded: Gives Birth
    Yes: Mammals
    No: Non-mammals
  Cold blooded: Non-mammals
Find out why?
Figure taken from text book (Tan, Steinbach, Kumar)

Overfitting: Example Lack of representative samples: Training Set Table taken from text book (Tan, Steinbach, Kumar)

Overfitting: Example
Lack of representative samples: Model
Model M3 (TE = 0%, GE = 30%):
Body Temp
  Warm blooded: Hibernates
    Yes: 4-legged
      Yes: Mammals
      No: Non-mammals
    No: Non-mammals
  Cold blooded: Non-mammals
Find out why?
Figure taken from text book (Tan, Steinbach, Kumar)

Overfitting due to Noise Decision boundary is distorted by noise point Figure taken from text book (Tan, Steinbach, Kumar)

Overfitting due to Insufficient Examples Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region - Insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task Figure taken from text book (Tan, Steinbach, Kumar)

How to Address Overfitting
• Pre-pruning (early stopping rule)
  • Stop the algorithm before it becomes a fully grown tree
  • Typical stopping conditions for a node:
    • Stop if all instances belong to the same class
    • Stop if all the attribute values are the same
  • More restrictive conditions:
    • Stop if the number of instances is less than some user-specified threshold
    • Stop if the class distribution of the instances is independent of the available features (e.g., using the χ² test)
    • Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
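The stopping conditions can be gathered into one check that a tree builder calls before splitting a node. This is a minimal sketch: the function name, the parameters `min_records`, `best_gain`, and `min_gain`, and their default values are assumptions, and the χ² independence test is left out for brevity.

```python
def should_stop(records, labels, min_records=2, best_gain=None, min_gain=1e-9):
    """Return the reason a node should become a leaf, or None to keep splitting.

    records: list of attribute-value tuples; labels: parallel list of class labels.
    best_gain: impurity improvement of the best candidate split, if already computed.
    """
    if len(set(labels)) <= 1:
        return "all instances belong to the same class"
    if len(set(records)) <= 1:
        return "all attribute values are the same"
    if len(records) < min_records:
        return "fewer instances than the user-specified threshold"
    if best_gain is not None and best_gain < min_gain:
        return "best split does not improve impurity"
    return None

mixed = [("Sunny", 70), ("Rain", 60)]
print(should_stop(mixed, ["Play", "No Play"], best_gain=0.5))    # None: keep splitting
print(should_stop(mixed, ["Play", "No Play"], min_records=5))    # threshold reached
```

The first two checks are the conditions that end any induction run; the last two are the more restrictive ones that stop growth early.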

How to Address Overfitting… • Post-pruning • Grow decision tree to its entirety • Trim the nodes of the decision tree in a bottom-up fashion • If generalization error improves after trimming, replace sub-tree by a leaf node. • Class label of leaf node is determined from majority class of instances in the sub-tree • Can use MDL for post-pruning
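The bottom-up trimming step can be sketched as follows. Several things here are assumptions rather than the slides' method: generalization error is estimated on a held-out validation set (reduced-error pruning, one of several possible estimators; MDL is not shown), the tree is encoded as nested dicts, each internal node already stores the majority class of its training instances, and ties are resolved in favour of the simpler tree.

```python
# Leaf: {"label": cls}. Internal node: {"attr": name, "branches": {...}, "majority": cls}.

def predict(node, record):
    while "label" not in node:
        child = node["branches"].get(record[node["attr"]])
        if child is None:              # unseen attribute value: fall back to majority
            return node["majority"]
        node = child
    return node["label"]

def errors(node, validation):
    return sum(predict(node, rec) != y for rec, y in validation)

def prune(node, validation):
    """Bottom-up pruning; returns the node or the leaf that replaces it."""
    if "label" in node:
        return node
    for value, child in list(node["branches"].items()):
        # Only validation records that follow this branch reach the child.
        subset = [(r, y) for r, y in validation if r[node["attr"]] == value]
        node["branches"][value] = prune(child, subset)
    leaf = {"label": node["majority"]}
    # Replace the sub-tree when the leaf does no worse on the validation records.
    if errors(leaf, validation) <= errors(node, validation):
        return leaf
    return node

# Hypothetical tree whose "windy" sub-tree fits a noisy training record.
tree = {
    "attr": "outlook", "majority": "Play",
    "branches": {
        "sunny": {"attr": "windy", "majority": "Play",
                  "branches": {True: {"label": "No Play"}, False: {"label": "Play"}}},
        "rain": {"label": "No Play"},
    },
}
validation = [
    ({"outlook": "sunny", "windy": True}, "Play"),
    ({"outlook": "sunny", "windy": False}, "Play"),
    ({"outlook": "rain", "windy": False}, "No Play"),
]
pruned = prune(tree, validation)
print(pruned)
```

In this toy run the "windy" sub-tree is replaced by a leaf labeled Play, while the root split on "outlook" is kept because removing it would increase the validation error.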

Post-pruning
The post-pruning approach removes branches of a fully grown tree.
• Subtree replacement replaces a subtree with a single leaf node
[Figure: example tree with Alt and Price ($, $$, $$$) nodes, before and after subtree replacement]

Post-pruning
• Subtree raising moves a subtree to a higher level in the decision tree, subsuming its parent
[Figure: example tree with Alt, Res, and Price ($, $$, $$$) nodes, before and after subtree raising]

Post-pruning: Techniques
• Cost complexity pruning: a pruning operation is performed if it does not increase the estimated error rate.
  • Error on the training data is not a useful estimator here (it would result in almost no pruning).
• Minimum Description Length (MDL): the best tree is the one that can be encoded using the fewest bits.
  • The challenge for the pruning phase is to find the subtree that can be encoded with the fewest bits.

Hunt’s Algorithm
• Let D_t be the set of training records that reach a node t
• Let y = {y_1, y_2, …, y_c} be the class labels
Step 1:
• If all records in D_t belong to the same class y_t, then t is a leaf node labeled y_t. If D_t is an empty set, then t is a leaf node labeled with the default class, y_d.
Step 2:
• If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each child node.
Figure taken from text book (Tan, Steinbach, Kumar)

Hunt’s Algorithm
[Figure: the tree grown step by step, starting from a single leaf (Don’t Cheat), then splitting on Refund, then on Marital Status, and finally on Taxable Income]
Final tree:
Refund
  Yes: Don’t Cheat
  No: Marital Status
    Married: Don’t Cheat
    Single, Divorced: Taxable Income
      < 80K: Don’t Cheat
      >= 80K: Cheat
Figure taken from text book (Tan, Steinbach, Kumar)

Hunt’s Algorithm • Should handle the following additional conditions: • Child nodes created in step 2 are empty. When can this happen? Declare the node as leaf node (majority class label of the training records of parent node) • In step 2, if all the records associated with Dt have identical attributes (except for the class label), then it is not possible to split these records further. Declare the node as leaf with the same class label as the majority class of training records associated with this node.
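The recursion, including the two additional conditions, can be sketched in Python. This is an illustration under simplifying assumptions: attributes are categorical with a known set of values (`domains`), each attribute is used at most once per path, and the attribute test is simply the first attribute that still separates the records, standing in for a real goodness-of-split measure. The toy records are hypothetical, not the textbook table.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def hunt(records, labels, attributes, domains, default):
    """Return a class label (leaf) or {"attr": name, "branches": {value: subtree}}."""
    if not records:                      # empty D_t: leaf with the default class
        return default
    if len(set(labels)) == 1:            # Step 1: all records in the same class
        return labels[0]
    usable = [a for a in attributes if len({r[a] for r in records}) > 1]
    if not usable:                       # identical attribute values: majority leaf
        return majority(labels)
    attr = usable[0]                     # placeholder for "best" attribute test
    remaining = [a for a in attributes if a != attr]
    parent_majority = majority(labels)   # label used for empty child nodes
    branches = {}
    for value in domains[attr]:          # Step 2: split and recurse on each child
        idx = [i for i, r in enumerate(records) if r[attr] == value]
        branches[value] = hunt([records[i] for i in idx], [labels[i] for i in idx],
                               remaining, domains, parent_majority)
    return {"attr": attr, "branches": branches}

domains = {"Refund": ["Yes", "No"], "MarSt": ["Single", "Married", "Divorced"]}
records = [
    {"Refund": "Yes", "MarSt": "Single"},
    {"Refund": "No", "MarSt": "Married"},
    {"Refund": "No", "MarSt": "Single"},
    {"Refund": "No", "MarSt": "Single"},
    {"Refund": "No", "MarSt": "Single"},
    {"Refund": "No", "MarSt": "Married"},
]
labels = ["No", "No", "Yes", "No", "Yes", "No"]   # hypothetical "Cheat" labels
tree = hunt(records, labels, ["Refund", "MarSt"], domains, default="No")
print(tree)
```

In this toy run the Divorced branch receives no training records, so it becomes a leaf labeled with the majority class of its parent; the Single branch holds records with identical attributes but mixed classes, so it becomes a majority-class leaf.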

Tree Induction • Greedy strategy. • Split the records based on an attribute test that optimizes certain criterion. • Issues • Determine how to split the records • How to specify the attribute test condition? • How to determine the best split? • Determine when to stop splitting

Hunt’s Algorithm
• Design Issues of Decision Tree Induction
  • How should the training records be split? At each recursive step, an attribute test condition must be selected. The algorithm must provide a method for specifying the test condition for different attribute types, as well as an objective measure for evaluating the goodness of each test condition.
  • How should the splitting procedure stop? A stopping condition is needed to terminate the tree-growing process. Stop when:
    • all records belong to the same class
    • all records have identical attribute values
  • Both conditions are sufficient to stop any decision tree induction algorithm; other criteria can be imposed to terminate the procedure early (do we need to do this? Think of model overfitting!)

How to determine the Best Split? Before Splitting: 10 records of class 0, 10 records of class 1 Which test condition is the best? Slide taken from DM book slides available at companion website (Tan, Steinbach, Kumar)

How to determine the Best Split?
• Greedy approach:
  • Nodes with homogeneous class distribution are preferred
• Need a measure of node impurity:
  • Non-homogeneous class distribution: high degree of impurity
  • Homogeneous class distribution: low degree of impurity
Slide taken from text book slides available at companion website (Tan, Steinbach, Kumar)
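One common impurity measure is the Gini index. The sketch below compares two candidate splits of a parent node with 10 records of class 0 and 10 of class 1; the per-child class counts of both splits are hypothetical and chosen only to contrast a weak split with a strong one.

```python
def gini(counts):
    """Gini impurity of a node, given its per-class record counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def split_gini(children):
    """Weighted Gini impurity of a split; children is a list of per-class counts."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

parent = (10, 10)                 # 10 records of class 0, 10 of class 1
split_a = [(6, 4), (4, 6)]        # hypothetical: children stay mixed
split_b = [(9, 1), (1, 9)]        # hypothetical: children are nearly homogeneous

gain_a = gini(parent) - split_gini(split_a)
gain_b = gini(parent) - split_gini(split_b)
print(round(gini(parent), 2), round(gain_a, 2), round(gain_b, 2))
```

A greedy learner would prefer the second split because it reduces impurity more, that is, it produces more homogeneous children.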
