
Decision Tree Learning Algorithms


Presentation Transcript


  1. Decision Tree Learning Algorithms Sagar Kasukurthy

  2. DECISION TREE ALGORITHMS • One of the simplest forms of machine learning. • Supervised learning – the output for the training data is known. • Takes as input a vector of attribute values and returns a single output value (the decision). • We first build a decision tree from the training data and then apply it to the test samples. • Decision – Will a new customer default on their credit card payment? • Goal – To arrive at a value: yes or no. • Attributes – The random variables in the problem are the attributes.

  3. Training Data

  4. Generated Decision Tree

  5. New Test Sample • Consider a person who is not a home owner, is Single and has an annual income of 94k. Would he default on the payment?

  6. Observations on the tree • Do not need to check all attributes to make a decision. • Very intuitive. • Why is home owner the root node and not marital status?

  7. Observations on the tree • Node Types • Root Node: No incoming edges, only outgoing edges. • Example: Home Owner. • Internal Node: Exactly one incoming edge and two or more outgoing edges. • Example: Marital Status. • Leaf Node: Exactly one incoming edge and no outgoing edges. • Example: a class label. • Edges: Represent the possible values of the attributes.

  8. Attribute Types • Binary Attribute – Two possible values. • Example: Home Owner: Yes or No. • Nominal Attribute – Many possible values. • k values can be grouped into two sets in 2^(k-1) - 1 ways. • Example: Marital Status with values Single, Divorced, Married can be grouped as: • (Single, Divorced/Married) • (Single/Married, Divorced) • (Single/Divorced, Married) – see the enumeration sketch below.
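As a quick check of the 2^(k-1) - 1 count, here is a small sketch (not from the slides) that enumerates the binary groupings of a nominal attribute's values, using the Marital Status example:

```python
from itertools import combinations

def binary_groupings(values):
    """Enumerate the ways to split a set of nominal values into two
    non-empty groups. Fixing the first value in the left group avoids
    counting each grouping twice, giving 2**(len(values)-1) - 1 splits."""
    first, rest = values[0], values[1:]
    groupings = []
    for size in range(len(rest) + 1):
        for combo in combinations(rest, size):
            left = {first, *combo}
            right = set(values) - left
            if right:                      # both groups must be non-empty
                groupings.append((left, right))
    return groupings

print(binary_groupings(["Single", "Divorced", "Married"]))
# 3 groupings, matching 2**(3-1) - 1
```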

  9. Attribute Types • Ordinal Attributes – Similar to nominal attributes, except that the grouping must not violate the order of the attribute values. • Example: Shirt Size with values small, medium, large and extra large; group only in the order small, medium, large, extra large. • Continuous Attributes • Binary outcome. Example: Annual Income > 80k (Yes/No). • Range query. Example: Annual Income with branches: • <10k • 10k – 25k • 25k – 50k • 50k – 80k • >80k

  10. Learning Algorithm • Aim: Find a small tree consistent with the training examples. • Idea: (recursively) choose the "most significant" attribute as the root of each (sub)tree. A runnable sketch of this recursion follows the pseudocode.

  function DTL(examples, attributes, parent_examples) returns a decision tree
      if examples is empty then return MAJORITY_VALUE(parent_examples)
      else if all examples have the same classification then return the classification
      else if attributes is empty then return MAJORITY_VALUE(examples)
      else
          best ← CHOOSE_BEST_ATTRIBUTE(attributes, examples)
          tree ← a new decision tree with root test best
          for each value vi of best do
              examples_i ← { elements of examples with best = vi }
              subtree ← DTL(examples_i, attributes − best, examples)
              add a branch to tree with label vi and subtree subtree
          return tree
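A minimal Python sketch of this recursion (not from the slides), assuming each example is a pair of an attribute-value dict and a class label. CHOOSE_BEST_ATTRIBUTE is left as a parameter; slides 11–14 describe how to implement it with information gain.

```python
from collections import Counter

def majority_value(examples):
    """MAJORITY_VALUE: the most common class label among the examples."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def dtl(examples, attributes, parent_examples, choose_best_attribute):
    """DTL as in the pseudocode above. choose_best_attribute(attributes, examples)
    returns the attribute to test at this node (e.g. by information gain)."""
    if not examples:
        return majority_value(parent_examples)
    labels = {label for _, label in examples}
    if len(labels) == 1:                       # all examples agree on the class
        return labels.pop()
    if not attributes:
        return majority_value(examples)
    best = choose_best_attribute(attributes, examples)
    tree = {best: {}}                          # root test on the chosen attribute
    for value in {attrs[best] for attrs, _ in examples}:
        subset = [(attrs, label) for attrs, label in examples if attrs[best] == value]
        tree[best][value] = dtl(subset,
                                [a for a in attributes if a != best],
                                examples,
                                choose_best_attribute)
    return tree
```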

  11. CHOOSE_BEST_ATTRIBUTE • Compute the information gain for each attribute. • Choose the attribute with the highest information gain. • Equation: Gain(V) = I(parent) - sum over j of (N(vj) / N) * I(vj) • I = impurity measure • N = number of samples • N(vj) = number of samples for which attribute V takes the value vj.

  12. CHOOSE_BEST_ATTRIBUTE • Impurity measure: a measure of the goodness of a split at a node. • When is a split pure? • A split is pure if, after the split, all the instances reaching each branch belong to the same class. • The measures for selecting the best split are based on the degree of impurity of the child nodes.

  13. IMPURITY MEASURES • ENTROPY: Entropy(t) = - sum over i of p(i|t) log2 p(i|t) • GINI INDEX: Gini(t) = 1 - sum over i of p(i|t)^2 • MISCLASSIFICATION ERROR: Error(t) = 1 - max over i of p(i|t) • c = number of classes; the sums and the max run over i = 1, ..., c • p(i|t) = fraction of records belonging to class i at node t. A sketch of all three measures follows.
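The three measures can be read off directly as code. A minimal sketch (not from the slides), evaluated on the 3-versus-5 class split that appears in the worked example later:

```python
from math import log2

def class_fractions(labels):
    """p(i|t): fraction of records at node t belonging to each class i."""
    return [labels.count(c) / len(labels) for c in set(labels)]

def entropy(labels):
    return -sum(p * log2(p) for p in class_fractions(labels) if p > 0)

def gini(labels):
    return 1 - sum(p ** 2 for p in class_fractions(labels))

def classification_error(labels):
    return 1 - max(class_fractions(labels))

node = [0, 0, 0, 1, 1, 1, 1, 1]          # 3 records of class 0, 5 of class 1
print(entropy(node), gini(node), classification_error(node))
# roughly 0.954, 0.469, 0.375
```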

  14. ENTROPY • Measure of the uncertainty of a random variable. • The more the uncertainty, the higher the entropy. • Example: a coin toss that always comes up heads. • No uncertainty, so the entropy is zero. • We gain no information by observing the value, since it is always heads. • Entropy: H(V) = - sum over k of P(vk) log2 P(vk) • V = random variable • P(vk) = probability of V taking the value vk • For a fair coin, H(Fair) = -(0.5 log2 0.5 + 0.5 log2 0.5) = 1. A quick numeric check of the two coin examples follows.
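A small check of the two coin examples (a sketch, not from the slides):

```python
from math import log2

def H(probabilities):
    """H(V) = -sum of P(vk) * log2 P(vk), skipping zero-probability values."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(H([0.5, 0.5]))   # fair coin: 1.0 bit
print(H([1.0]))        # coin that always lands heads: 0.0 bits
```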

  15. DECISION TREE USING ENTROPY AS IMPURITY MEASURE Let 0 be Class 0 and 1 be Class 1.

  16. DECISION TREE USING ENTROPY AS IMPURITY MEASURE • I(Parent) • Total of 8 samples: 3 of class 0 and 5 of class 1 • I(Parent) = -(3/8) log2(3/8) - (5/8) log2(5/8) = 0.95

  17. DECISION TREE USING ENTROPY AS IMPURITY MEASURE • For attribute A • A takes value 0: 3 of class 0 and 1 of class 1, N(vj) = 4, N = 8 • A takes value 1: 0 of class 0 and 4 of class 1, N(vj) = 4, N = 8 • Information gain for attribute A = 0.95 - [ (4/8)(-(3/4) log2(3/4) - (1/4) log2(1/4)) + (4/8)(-(4/4) log2(4/4) - (0/4) log2(0/4)) ] = 0.54 (taking 0 log2 0 = 0)

  18. DECISION TREE USING ENTROPY AS IMPURITY MEASURE • For attribute B • B takes value 0: 2 of class 0 and 2 of class 1, N(vj) = 4, N = 8 • B takes value 1: 1 of class 0 and 3 of class 1, N(vj) = 4, N = 8 • Information gain for attribute B = 0.95 - [ (4/8)(-(2/4) log2(2/4) - (2/4) log2(2/4)) + (4/8)(-(1/4) log2(1/4) - (3/4) log2(3/4)) ] = 0.04

  19. DECISION TREE USING ENTROPY AS IMPURITY MEASURE • For attribute C • C takes value 0: 2 of class 0 and 2 of class 1, N(vj) = 4, N = 8 • C takes value 1: 1 of class 0 and 3 of class 1, N(vj) = 4, N = 8 • Information gain for attribute C = 0.95 - [ (4/8)(-(2/4) log2(2/4) - (2/4) log2(2/4)) + (4/8)(-(1/4) log2(1/4) - (3/4) log2(3/4)) ] = 0.04

  20. DECISION TREE USING ENTROPY AS IMPURITY MEASURE • The information gain for attribute A is the highest, so we use A as the root node. • When A = 1, all samples belong to class 1. • The remaining 4 samples have A = 0; we now compute the information gain for attributes B and C on these samples. • Class 0 = 3 and Class 1 = 1. I(Parent) = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.81

  21. DECISION TREE USING ENTROPY AS IMPURITY MEASURE • For attribute B • B takes value 0: 2 of class 0 and 0 of class 1, N(vj) = 2, N = 4 • B takes value 1: 1 of class 0 and 1 of class 1, N(vj) = 2, N = 4 • Information gain for attribute B = 0.81 - [ (2/4)(-(2/2) log2(2/2) - (0/2) log2(0/2)) + (2/4)(-(1/2) log2(1/2) - (1/2) log2(1/2)) ] = 0.311

  22. DECISION TREE USING ENTROPY AS IMPURITY MEASURE • For attribute C • C takes value 0: 2 of class 0 and 0 of class 1, N(vj) = 2, N = 4 • C takes value 1: 1 of class 0 and 1 of class 1, N(vj) = 2, N = 4 • Information gain for attribute C = 0.81 - [ (2/4)(-(2/2) log2(2/2) - (0/2) log2(0/2)) + (2/4)(-(1/2) log2(1/2) - (1/2) log2(1/2)) ] = 0.311

  23. DECISION TREE USING ENTROPY AS IMPURITY MEASURE • Both B and C have the same information gain, so either can be chosen for the next split. The sketch below reproduces these calculations.
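The gain values in slides 16–22 can be reproduced directly from the class counts. A minimal sketch (not part of the original slides) using entropy as the impurity measure; the slides truncate the printed values to two decimal places:

```python
from math import log2

def entropy(counts):
    """Entropy of a node from its per-class counts, taking 0 * log2(0) = 0."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c)

def gain(parent_counts, children_counts):
    """Information gain: parent impurity minus the weighted child impurities."""
    n = sum(parent_counts)
    remainder = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - remainder

# First level: 8 samples, 3 of class 0 and 5 of class 1
print(round(gain([3, 5], [[3, 1], [0, 4]]), 3))   # attribute A: 0.549 (0.54 on the slide)
print(round(gain([3, 5], [[2, 2], [1, 3]]), 3))   # attribute B: 0.049 (0.04 on the slide)
print(round(gain([3, 5], [[2, 2], [1, 3]]), 3))   # attribute C: 0.049 (0.04 on the slide)

# Second level (A = 0): 4 samples, 3 of class 0 and 1 of class 1
print(round(gain([3, 1], [[2, 0], [1, 1]]), 3))   # attribute B: 0.311
print(round(gain([3, 1], [[2, 0], [1, 1]]), 3))   # attribute C: 0.311
```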

  24. ENTROPY, GINI • Entropy – used by the ID3, C4.5 and C5.0 algorithms. • Gini index – used by the CART algorithm.

  25. PERFORMANCE

  26. OVERFITTING • The algorithm generates a large tree even when there is no real pattern in the data. • Example: predicting whether a roll of a die comes up 6 or not. • Suppose we run experiments with various dice and decide to use attributes such as the color of the die, its weight, etc. • If in the experiments a 7-gram blue die happened to come up 6, the decision tree will build a pattern from that training sample.

  27. REASONS FOR OVERFITTING • Choosing attributes with little meaning in an attempt to fit noisy data. • A huge number of attributes. • A small training data set.

  28. HOW TO COMBAT OVERFITTING - PRUNING • Eliminate irrelevant nodes – nodes that have zero information gain. • Example: a node with 50 Yes and 50 No out of 100 examples whose split leaves each branch with the same 50/50 proportion (see the sketch below).
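A small numeric illustration of this pruning test (a sketch, not from the slides; the 25/25 children are an assumed split consistent with the slide's 50/50 example):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c)

def gain(parent_counts, children_counts):
    n = sum(parent_counts)
    remainder = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - remainder

# A 50 Yes / 50 No node whose split leaves each child with the same class mix
# has zero information gain, so the split can be pruned away.
print(gain([50, 50], [[25, 25], [25, 25]]))   # 0.0
```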

  29. Problems associated with Decision Trees • Missing data. • Multi-valued attributes: an attribute with many possible values may have high information gain, but choosing it first might not yield the best tree. • Continuous attributes.

  30. Continuous attributes • Steps (see the sketch below): • Sort the records based on the value of the attribute. • Scan the values, updating the Yes/No count matrix at each candidate split position and computing the impurity. • Choose the split position with the least impurity. • Splitting on continuous attributes is the most expensive part of real-world decision tree learning applications.
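A minimal sketch of this threshold search (not from the slides), using entropy as the impurity measure and made-up income data:

```python
from collections import Counter
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def best_threshold(values, labels):
    """Find the split value of a continuous attribute with the lowest
    weighted impurity. values and labels are parallel lists."""
    records = sorted(zip(values, labels))            # 1. sort by attribute value
    n = len(records)
    left, right = Counter(), Counter(label for _, label in records)
    best = (float("inf"), None)
    for i in range(n - 1):                           # 2. scan candidate positions
        value, label = records[i]
        left[label] += 1                             # move one record to the left side
        right[label] -= 1
        if value == records[i + 1][0]:
            continue                                 # cannot split between equal values
        threshold = (value + records[i + 1][0]) / 2
        impurity = ((i + 1) / n) * entropy(left.values()) \
                   + ((n - i - 1) / n) * entropy(right.values())
        best = min(best, (impurity, threshold))      # 3. keep the lowest impurity
    return best

# e.g. annual income (in thousands) vs. whether the customer defaulted
print(best_threshold([60, 70, 75, 85, 90, 95, 100, 120],
                     ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No"]))
```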

  31. Thank You

  32. References • Artificial Intelligence: A Modern Approach, Third Edition, by Russell and Norvig. • Video lecture – Prof. P. Dasgupta, Dept. of Computer Science, IIT Kharagpur. • Neural networks course, classroom lecture on decision trees – Dr. Eun Youn, Texas Tech University, Lubbock.
