
Decision Tree Learning


Presentation Transcript


  1. Decision Tree Learning
     Debapriyo Majumdar
     Data Mining – Fall 2014
     Indian Statistical Institute Kolkata
     August 25, 2014

  2. Example: Age, Income and Owning a flat
  [Figure: training set scatter plot of monthly income (thousand rupees) vs. age; points are marked "owns a house" or "does not own a house", with two lines L1 and L2 separating the classes]
  • If the training data were as above, could we define some simple rules by observation? (A code sketch of one such rule follows below.)
  • Any point above the line L1 → owns a house
  • Any point to the right of L2 → owns a house
  • Any other point → does not own a house
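A minimal sketch of such a hand-written rule, assuming hypothetical thresholds standing in for the lines L1 (a cut on income) and L2 (a cut on age); the numbers 100 and 45 are made up for illustration:

```python
def owns_house(age, income_thousands, income_cut=100, age_cut=45):
    """Hand-written rule read off the figure. The thresholds are hypothetical
    stand-ins for the lines L1 (on income) and L2 (on age)."""
    if income_thousands > income_cut:   # above L1
        return True
    if age > age_cut:                   # to the right of L2
        return True
    return False                        # any other point

print(owns_house(age=50, income_thousands=40))   # True  (right of L2)
print(owns_house(age=30, income_thousands=60))   # False
```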

  3. Example: Age, Income and Owning a flat
  [Figure: the same training set, but now no pair of lines L1, L2 separates the two classes cleanly]
  • In general, the data won't be as neatly separable as above.

  4. Example: Age, Income and Owning a flat
  [Figure: the same training set, with the income–age plane recursively split into rectangular partitions]
  • Approach: recursively split the data into partitions so that each partition becomes purer, till … (a sketch of the recursion follows below)
  • How to decide the split? How to measure purity? When to stop?
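A minimal sketch of the recursive-partitioning idea, under the assumption that data points are lists of numeric values and tree nodes are plain dicts; impurity and best_split are supplied by the caller, with concrete versions sketched after the next slides:

```python
from collections import Counter

def build_tree(points, labels, impurity, best_split, min_points=5):
    """Recursively split (points, labels) until a partition is pure or tiny.

    `impurity(labels)` and `best_split(points, labels, impurity)` are supplied
    by the caller; concrete versions are sketched after the following slides.
    """
    if impurity(labels) == 0 or len(labels) < min_points:      # when to stop
        label = Counter(labels).most_common(1)[0][0]           # majority class
        return {"leaf": True, "label": label, "freq": len(labels)}

    split = best_split(points, labels, impurity)               # how to split
    if split is None:                                          # no useful split left
        label = Counter(labels).most_common(1)[0][0]
        return {"leaf": True, "label": label, "freq": len(labels)}
    variable, threshold = split

    left = [i for i, p in enumerate(points) if p[variable] <= threshold]
    right = [i for i, p in enumerate(points) if p[variable] > threshold]
    return {
        "leaf": False, "variable": variable, "threshold": threshold,
        "left": build_tree([points[i] for i in left], [labels[i] for i in left],
                           impurity, best_split, min_points),
        "right": build_tree([points[i] for i in right], [labels[i] for i in right],
                            impurity, best_split, min_points),
    }
```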

  5. Approach for splitting
  • What are the possible lines for splitting?
  • For each variable, midpoints between pairs of consecutive values of that variable
  • How many? If N = number of points in the training set and m = number of variables, about O(N × m)
  • How to choose which line to use for splitting?
  • The line which reduces impurity (~ heterogeneity of composition) the most – see the sketch below
  • How to measure impurity?
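A sketch of both steps, continuing the dict-based code above: enumerate candidate thresholds as midpoints between consecutive distinct values, then keep the split whose children have the lowest weighted impurity (the impurity function itself, e.g. the Gini index, comes on the next slide):

```python
def candidate_splits(values):
    """Midpoints between pairs of consecutive distinct values of one variable."""
    vs = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(vs, vs[1:])]

def best_split(points, labels, impurity):
    """Choose the (variable, threshold) that reduces impurity the most,
    i.e. whose two children have the lowest weighted impurity."""
    best, best_score = None, float("inf")
    n, n_vars = len(labels), len(points[0])
    for v in range(n_vars):
        for t in candidate_splits([p[v] for p in points]):
            left = [labels[i] for i, p in enumerate(points) if p[v] <= t]
            right = [labels[i] for i, p in enumerate(points) if p[v] > t]
            score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
            if score < best_score:
                best, best_score = (v, t), score
    return best

# candidate thresholds for one variable, e.g. age values 25, 30, 30, 50
print(candidate_splits([25, 30, 30, 50]))   # [27.5, 40.0]
```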

  6. Gini Index for Measuring Impurity
  • Suppose there are C classes
  • Let p(i|t) = fraction of observations belonging to class i in rectangle (node) t
  • Gini index: Gini(t) = 1 − Σ_i [p(i|t)]², summing over the C classes (see the sketch below)
  • If all observations in t belong to one single class, Gini(t) = 0
  • When is Gini(t) maximum?
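A minimal sketch of the Gini index for a node, given just the list of class labels in that node; it plugs into the best_split/build_tree sketches above as the impurity function:

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: 1 - sum_i p(i|t)^2 over the classes present."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["house"] * 10))                 # 0.0 – all observations in one class
print(gini(["house"] * 5 + ["none"] * 5))   # 0.5 – maximum for two classes (uniform mix)
```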

  7. Entropy
  • Average amount of information contained
  • From another point of view – average amount of information expected – hence amount of uncertainty
  • We will study this in more detail later
  • Entropy: Entropy(t) = − Σ_i p(i|t) log2 p(i|t), where 0 log2 0 is defined to be 0 (see the sketch below)
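A matching sketch of entropy as a node impurity measure; classes with zero count never appear in the Counter, which matches the 0 log2 0 = 0 convention:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a node: -sum_i p(i|t) * log2 p(i|t); a pure node has entropy 0."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["house"] * 5 + ["none"] * 5))   # 1.0 – maximum for two classes
print(entropy(["house"] * 7 + ["none"] * 3))   # ≈ 0.881
```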

  8. Classification Error
  • What if we stop the tree building at a node?
  • That is, do not create any further branches for that node; make that node a leaf
  • Classify the node with the most frequent class present in the node
  • Classification error as a measure of impurity: Error(t) = 1 − max_i p(i|t) (see the sketch below)
  [Figure: a rectangle (node) that is still impure – it contains points of both classes]
  • Intuitively – the fraction of observations in the rectangle (node) that do not belong to its most frequent class
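A minimal sketch of classification error as an impurity measure for a node:

```python
from collections import Counter

def classification_error(labels):
    """Classification error of a node: 1 - max_i p(i|t), i.e. the fraction of
    observations that the node's majority-class label would get wrong."""
    n = len(labels)
    if n == 0:
        return 0.0
    return (n - max(Counter(labels).values())) / n

print(classification_error(["house"] * 7 + ["none"] * 3))   # 0.3
```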

  9. The Full Blown Tree
  [Figure: a tree with a root node of 1000 points split into nodes of 400 and 600 points, then 200, 200, 160, 240, …, down to leaves with only 2, 1 or 5 points – statistically not significant]
  • Recursive splitting: suppose we don't stop until all nodes are pure
  • The result is a large decision tree with leaf nodes having very few data points
  • It does not represent the classes well – overfitting
  • Solution: stop earlier, or prune back the tree

  10. Prune back
  • Pruning step: collapse leaf nodes and make the immediate parent a leaf node (see the sketch below)
  • Effect of pruning: we lose purity of nodes
  • But were they really pure, or was that noise? Too many nodes ≈ noise
  • Trade-off between loss of purity and gain in simplicity (reduced complexity)
  [Figure: a decision node (Freq = 7) with two leaf children – label Y (Freq = 5) and label B (Freq = 2) – is pruned into a single leaf node with label Y (Freq = 7)]
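A minimal sketch of that pruning step on the dict-based trees from the earlier sketches (the split variable and threshold in the example are made up; the leaf frequencies match the figure):

```python
def prune_step(node):
    """Collapse a decision node whose children are both leaves into a single
    leaf, labelled with the more frequent of the two child classes."""
    if node["leaf"]:
        return node
    left, right = node["left"], node["right"]
    if left["leaf"] and right["leaf"]:
        winner = left if left["freq"] >= right["freq"] else right
        return {"leaf": True, "label": winner["label"],
                "freq": left["freq"] + right["freq"]}
    return node

# the example from the figure (split variable/threshold are made up)
parent = {"leaf": False, "variable": 0, "threshold": 40.0,
          "left":  {"leaf": True, "label": "Y", "freq": 5},
          "right": {"leaf": True, "label": "B", "freq": 2}}
print(prune_step(parent))   # {'leaf': True, 'label': 'Y', 'freq': 7}
```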

  11. Prune back: cost complexity
  • Cost complexity of a (sub)tree: classification error (based on training data) plus a penalty for the size of the tree
  • Cost complexity: CC(T) = Err(T) + α × L(T) (see the sketch below)
  • Err(T) is the classification error
  • L(T) = number of leaves in T
  • Penalty factor α is between 0 and 1
  • If α = 0, no penalty for a bigger tree
  [Figure: the same pruning example – a decision node (Freq = 7) with leaf children Y (Freq = 5) and B (Freq = 2) collapsed into a single leaf Y (Freq = 7)]
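A sketch of that cost-complexity measure on the same dict-based trees; `classify` routes a point down the tree, and the idea is that a pruned subtree is preferred when its cost complexity is no higher than the unpruned one:

```python
def classify(node, point):
    """Route a point down the dict-based tree to a leaf and return its label."""
    while not node["leaf"]:
        node = node["left"] if point[node["variable"]] <= node["threshold"] else node["right"]
    return node["label"]

def num_leaves(node):
    """L(T): number of leaves in the (sub)tree T."""
    if node["leaf"]:
        return 1
    return num_leaves(node["left"]) + num_leaves(node["right"])

def cost_complexity(tree, points, labels, alpha):
    """CC(T) = Err(T) + alpha * L(T): training classification error plus a
    penalty of alpha per leaf."""
    errors = sum(classify(tree, p) != y for p, y in zip(points, labels))
    return errors / len(labels) + alpha * num_leaves(tree)
```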

  12. Different Decision Tree Algorithms
  • Chi-square Automatic Interaction Detector (CHAID)
    – Gordon Kass (1980)
    – Stop subtree creation if not statistically significant by the chi-square test
  • Classification and Regression Trees (CART)
    – Breiman et al.
    – Decision tree building by the Gini index (see the library illustration below)
  • Iterative Dichotomizer 3 (ID3)
    – Ross Quinlan (1986)
    – Splitting by information gain (difference in entropy)
  • C4.5
    – Quinlan's next algorithm, improved over ID3
    – Bottom-up pruning, both categorical and continuous variables
    – Handling of incomplete data points
  • C5.0
    – Ross Quinlan's commercial version
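As an illustration (not part of the original slides), scikit-learn's DecisionTreeClassifier builds CART-style trees and exposes the impurity measure through its criterion parameter; the toy feature values below are made up:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# toy training set: [age, monthly income in thousand rupees] – values are made up
X = [[25, 30], [30, 90], [50, 40], [55, 120], [28, 35], [60, 45]]
y = ["no house", "house", "house", "house", "no house", "house"]

for criterion in ("gini", "entropy"):   # CART-style Gini vs. ID3/C4.5-style entropy
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=2, random_state=0)
    tree.fit(X, y)
    print(criterion)
    print(export_text(tree, feature_names=["age", "income"]))
```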

  13. Properties of Decision Trees
  • Non-parametric approach: does not require any prior assumptions regarding the probability distribution of the class and attributes
  • Finding an optimal decision tree is an NP-complete problem
  • Heuristics used: greedy, recursive partitioning, top-down, bottom-up pruning
  • Fast to generate, fast to classify
  • Easy to interpret or visualize
  • Error propagation: an error at the top of the tree propagates all the way down

  14. References
  • Introduction to Data Mining, by Tan, Steinbach, Kumar
  • Chapter 4 is available online: http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf
