
AMCS/CS 340: Data Mining



  1. Classification I: Decision Tree. AMCS/CS 340: Data Mining. Xiangliang Zhang, King Abdullah University of Science and Technology

  2. Classification: Definition. Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model: usually the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
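
A minimal sketch of this train/validate workflow, assuming scikit-learn is available; the iris data here is only a stand-in for any labeled record collection:

```python
# Minimal sketch of the classification workflow on the slide:
# split labeled records into training and test sets, build the model
# on the training set, and estimate accuracy on the held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)          # any labeled dataset works here

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```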

  3. Classification Example: predicting borrowers who cheat on loan payments. [Figure: a training set with two categorical attributes, one continuous attribute, and a class label is fed to a learning algorithm to produce a classifier (the model); the model is then applied to a test set.]

  4. Issues: Evaluating Classification Methods • Accuracy • classifier accuracy: how well the class labels of test data are predicted • Speed • time to construct the model (training time) • time to use the model (classification/prediction time) • Robustness: handling noise and missing values • Scalability: efficiency on large-scale data • Interpretability • understanding and insight provided by the model • Other measures of goodness, e.g., decision tree size or compactness of classification rules

  5. Classification Techniques: Decision Tree based Methods, Rule-based Methods, Learning from Neighbors, Bayesian Classification, Neural Networks, Ensemble Methods, Support Vector Machines

  6. Example of a Decision Tree. [Figure: training data (two categorical attributes, one continuous attribute, and the class) and the learned model. Splitting attributes: Refund? Yes → NO; No → MarSt? Married → NO; Single, Divorced → TaxInc? < 80K → NO, > 80K → YES.]

  7. Another Example of a Decision Tree. [Figure: same training data, different tree: MarSt? Married → NO; Single, Divorced → Refund? Yes → NO; No → TaxInc? < 80K → NO, > 80K → YES.] There could be more than one tree that fits the same data!

  8. Decision Tree Classification Task. [Figure: a tree induction algorithm learns a decision tree from the training set; the learned model is then applied to the test set.]

  9.–14. Apply Model to Test Data. Start from the root of the tree and route the test record down: its Refund value is No, so follow the No branch to MarSt; its marital status is Married, so follow the Married branch to the leaf and assign Cheat = "No". [Figures: six slides stepping the same test record through the tree Refund → MarSt → leaf.]

  15. Decision Tree Classification Task. [Figure: the same learn-model / apply-model diagram as slide 8.]

  16. Algorithm for Decision Tree Induction. Basic algorithm (a greedy algorithm): the tree is constructed in a top-down, recursive, divide-and-conquer manner. At the start, all training examples are at the root; examples are then partitioned recursively based on selected attributes. Attributes are categorical (continuous-valued attributes are discretized in advance). Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain). Conditions for stopping the partitioning: all samples at a given node belong to the same class; there are no remaining attributes for further partitioning (majority voting is used to label the leaf); or there are no samples left. A minimal sketch of this procedure follows.
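
The sketch below is a minimal, self-contained illustration of this greedy procedure for categorical attributes, using Gini impurity as the selection heuristic; the record format, helper names, and tiny dataset are illustrative, not taken from the slides:

```python
# Greedy top-down tree induction over categorical attributes (illustrative).
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_quality(partitions):
    # Weighted average impurity of the children (lower is better).
    n = sum(len(lbls) for lbls in partitions)
    return sum(len(lbls) / n * gini(lbls) for lbls in partitions)

def build_tree(records, labels, attributes, default=None):
    if not records:                       # no samples left -> default class
        return {"leaf": default}
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:             # node is pure -> leaf
        return {"leaf": labels[0]}
    if not attributes:                    # no attributes left -> majority vote
        return {"leaf": majority}

    def partition(attr):
        # Multi-way partition of (records, labels) on one attribute's values.
        groups = {}
        for rec, lbl in zip(records, labels):
            groups.setdefault(rec[attr], ([], []))
            groups[rec[attr]][0].append(rec)
            groups[rec[attr]][1].append(lbl)
        return groups

    # Greedy step: split on the attribute with the lowest weighted impurity.
    best_attr = min(attributes,
                    key=lambda a: split_quality([g[1] for g in partition(a).values()]))
    children = {v: build_tree(recs, lbls,
                              [a for a in attributes if a != best_attr], majority)
                for v, (recs, lbls) in partition(best_attr).items()}
    return {"split_on": best_attr, "children": children}

# Tiny illustrative run (attribute values are made up):
data = [{"Refund": "Yes", "MarSt": "Single"}, {"Refund": "No", "MarSt": "Married"},
        {"Refund": "No", "MarSt": "Single"}]
print(build_tree(data, ["No", "No", "Yes"], ["Refund", "MarSt"]))
```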

  17. Decision Tree Induction: Many Algorithms. Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT.

  18. General Structure of Hunt's Algorithm. Let Dt be the set of training records that reach a node t. General procedure: if Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt; if Dt is an empty set, then t is a leaf node labeled with the default class yd; if Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.

  19. Hunt's Algorithm on the loan data. [Figure: the tree grows in stages — start with a single leaf (Don't Cheat); split on Refund (Yes → Don't Cheat); split the No branch on Marital Status (Married → Don't Cheat); finally split the Single/Divorced branch on Taxable Income (< 80K → Don't Cheat, >= 80K → Cheat).]

  20. Issues of Hunt's Algorithm: determining how to split the records. (1) How to specify the attribute test condition — how many branches, and what partition threshold for splitting? (2) How to determine the best split — which attribute to choose?

  21. How to Specify the Test Condition? It depends on the attribute type (nominal, ordinal, continuous) and on the number of ways to split (2-way split vs. multi-way split).

  22. Splitting Based on Nominal Attributes. Multi-way split: use as many partitions as there are distinct values (e.g., CarType → Family / Sports / Luxury). Binary split: divide the values into two subsets, e.g., {Sports, Luxury} vs. {Family} or {Family, Luxury} vs. {Sports}; the optimal partitioning must be found. A sketch that enumerates the candidate binary partitions follows.
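
A small sketch (the helper name is mine) that enumerates the candidate two-way groupings of a nominal attribute's values; for k distinct values there are 2^(k-1) - 1 of them:

```python
# Enumerate all non-trivial binary splits of a nominal attribute's values.
from itertools import combinations

def binary_partitions(values):
    values = sorted(values)
    seen = set()
    for r in range(1, len(values)):
        for left in combinations(values, r):
            right = tuple(v for v in values if v not in left)
            key = frozenset([left, right])
            if key not in seen:      # avoid listing {A}|{B,C} and {B,C}|{A} twice
                seen.add(key)
                yield set(left), set(right)

for left, right in binary_partitions({"Family", "Sports", "Luxury"}):
    print(left, "|", right)
# three candidate splits: {Family} vs {Luxury, Sports},
# {Luxury} vs {Family, Sports}, {Sports} vs {Family, Luxury}
```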

  23. Splitting Based on Ordinal Attributes. Multi-way split: use as many partitions as there are distinct values (e.g., Size → Small / Medium / Large). Binary split: divide the values into two subsets that respect the order, e.g., {Small, Medium} vs. {Large} or {Small} vs. {Medium, Large}; the optimal partitioning must be found. What about the split {Small, Large} vs. {Medium}? It violates the order property of the attribute values, so it is not a valid split for an ordinal attribute.

  24. Splitting Based on Continuous Attributes: different ways of handling. Binary decision (A < v) or (A >= v): consider all possible splits and find the best cut. Discretization to form an ordinal categorical attribute: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering. Both options are sketched below.
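
A short sketch of both options; the income values below are made up for illustration and assume NumPy is available:

```python
# Two common ways to handle a continuous attribute.
import numpy as np

income = np.array([60, 70, 75, 85, 90, 95, 100, 120, 125, 220], dtype=float)

# (1) Binary decision A < v: every midpoint between adjacent sorted values
#     is a candidate cut; the best one is chosen by an impurity measure.
sorted_income = np.sort(income)
candidates = (sorted_income[:-1] + sorted_income[1:]) / 2

# (2) Discretization into an ordinal categorical attribute:
equal_width = np.histogram_bin_edges(income, bins=4)      # equal-interval bucketing
equal_freq = np.quantile(income, [0.25, 0.5, 0.75])       # equal-frequency (percentiles)
print(candidates, equal_width, equal_freq)
```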

  25. Splitting Based on Continuous Attributes. [Figure: examples of a binary threshold split and a multi-way split into value ranges for a continuous attribute.]

  26. Issues of Hunt's Algorithm (continued): determining how to split the records. Having seen how to specify the attribute test condition (branches and partition thresholds), we turn to the second question: how to determine the best split, i.e., which attribute to choose?

  27. How to Determine the Best Split. Before splitting: 10 records of class 0 and 10 records of class 1. [Figure: several candidate attribute splits producing partitions of differing class purity.]

  28. How to Determine the Best Split. Greedy approach: nodes with a homogeneous class distribution are preferred, so we need a measure of node impurity. [Figure: a non-homogeneous node has a high degree of impurity; a homogeneous node has a low degree of impurity.]

  29. How to Determine the Best Split (recap). Before splitting: 10 records of class 0 and 10 records of class 1; among the candidate splits, prefer those whose children are more homogeneous (lower impurity).

  30. Measures of Node Impurity. Gini index: a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset. Entropy: a measure of the uncertainty associated with a random variable. Misclassification error: the proportion of misclassified samples.

  31. Measures of Node Impurity. Let p(j | t) be the relative frequency of class j at node t. • Gini index: GINI(t) = 1 − Σ_j [p(j | t)]² • Entropy: Entropy(t) = − Σ_j p(j | t) log₂ p(j | t) • Misclassification error: Error(t) = 1 − max_j p(j | t). A small sketch computing the three measures follows.
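
The sketch below computes the three standard measures from a node's class counts (these are the textbook formulas, not code from the course):

```python
# Node impurity measures from a vector of class counts (p(j|t) = count_j / total).
import math

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def classification_error(counts):
    n = sum(counts)
    return 1.0 - max(counts) / n

for counts in ([0, 6], [1, 5], [3, 3]):
    print(counts, round(gini(counts), 3), round(entropy(counts), 3),
          round(classification_error(counts), 3))
# a pure node ([0, 6]) scores 0 on all three; a 50/50 node ([3, 3]) is maximal
```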

  32. Comparison among Measures of Node Impurity. For a 2-class problem: [Figure: Gini, entropy, and misclassification error plotted against the fraction of records in one class; all three are 0 for a pure node and maximal at a 50/50 split.]

  33. Comparison among Measures of Node Impurity (continued). For a 2-class problem: [Figure: the same comparison of the three impurity curves.]

  34. Quality of a Split. When a node p with n records is split into k partitions (children), where child i holds n_i records, the quality of the split is computed as the weighted impurity of the children: GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i) (and analogously with entropy, giving the information gain). GAIN_split = GINI(p) − GINI_split measures the reduction in Gini/entropy achieved by the split; choose the split that achieves the largest reduction (maximizes the gain).

  35. Quality of a Split: binary attributes. A binary attribute splits node P into two partitions, N1 (7 records: 4 and 3 per class) and N2 (5 records: 2 and 3 per class). Gini(N1) = 1 − (4/7)² − (3/7)² = 0.4898; Gini(N2) = 1 − (2/5)² − (3/5)² = 0.480; GINI_split(children) = 7/12 × 0.4898 + 5/12 × 0.480 = 0.486; GAIN_split = Gini(parent) − GINI_split(children) = 0.014.
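
Recomputing the slide's numbers from the class counts they imply (N1 holds 4 and 3 records of the two classes, N2 holds 2 and 3, hence a 6/6 parent) gives a quick check:

```python
# Verify the worked binary-split example from the class counts.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

n1, n2, parent = [4, 3], [2, 3], [6, 6]
g1, g2 = gini(n1), gini(n2)
gini_split = 7 / 12 * g1 + 5 / 12 * g2
print(round(g1, 4), round(g2, 3), round(gini_split, 3),
      round(gini(parent) - gini_split, 3))
# -> 0.4898 0.48 0.486 0.014
```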

  36. Decision Tree Induction: Many Algorithms. Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT.

  37. CART • CART: Classification and Regression Trees • constructs trees with only binary splits (this simplifies the splitting criterion) • uses the Gini index as the splitting criterion • splits on the attribute that provides the smallest GINI_split(p), i.e., the largest GAIN_split(p) • needs to enumerate all the possible splitting points for each attribute
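
As a point of reference, scikit-learn's decision tree learner is CART-style (binary splits, Gini by default); a minimal sketch, assuming scikit-learn and one of its bundled datasets:

```python
# CART-style tree in scikit-learn: binary splits, Gini criterion.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
cart = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y)
print(export_text(cart))   # every internal node is a binary test "feature <= threshold"
```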

  38. Continuous Attributes: Computing the Gini Index efficiently. For each attribute: • sort the records on the attribute's values • set the candidate split positions as the midpoints between adjacent sorted values • linearly scan these values, each time updating the class-count matrix and computing the Gini index • choose the split position with the least Gini index. [Figure: a table of sorted values, candidate split positions, and the Gini index at each position.]
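
A sketch of this sort-once, scan-linearly procedure; the ten income/label pairs are made up in the spirit of the running Cheat example:

```python
# Sort the (value, label) pairs once, then slide the split position across
# the midpoints while updating the class counts on each side incrementally.
from collections import Counter

def gini(counts):
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    left, right = Counter(), Counter(p[1] for p in pairs)
    best, n = (float("inf"), None), len(pairs)
    for i in range(n - 1):
        v, lbl = pairs[i]
        left[lbl] += 1               # move one record from the right side to the left
        right[lbl] -= 1
        if pairs[i + 1][0] == v:
            continue                 # only cut between two distinct adjacent values
        split = (i + 1) / n * gini(left) + (n - i - 1) / n * gini(right)
        threshold = (v + pairs[i + 1][0]) / 2
        best = min(best, (split, threshold))
    return best                      # (weighted Gini of the split, midpoint threshold)

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_threshold(income, cheat))   # -> (0.3, 97.0) for this made-up data
```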

  39. Decision Tree Induction: Many Algorithms. Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT.

  40. How to Find the Best Split of a nominal attribute (CarType). Two-way split: find the best partition of the values, e.g., {Sports, Luxury} vs. {Family} or {Family, Luxury} vs. {Sports}. Multi-way split: Family / Sports / Luxury, which here gives the largest gain, 0.337.

  41. Which Attribute to Split On? [Figure: three candidate attributes with Gain = 0.02, Gain = 0.337, and Gain = 0.5.] Is the attribute with the highest gain really the best? • Disadvantage of information gain: it tends to prefer splits that result in a large number of partitions, each being small but pure. • An attribute with a unique value for each record is not predictive on new data. • A small number of records in each node does not give reliable predictions.

  42. Splitting Based on Gain Ratio. When a parent node p is split into k partitions, where n_i is the number of records in partition i: SplitINFO = − Σ_{i=1..k} (n_i / n) log₂ (n_i / n), and GainRATIO_split = GAIN_split / SplitINFO. The gain ratio is designed to overcome the disadvantage of information gain: it adjusts the information gain by the entropy of the partitioning (SplitINFO), so a higher-entropy partitioning (a large number of small partitions) is penalized.
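
A small sketch of the adjustment; the gain value 0.4 and the partition sizes are made-up inputs used only to show the penalty:

```python
# Gain ratio: SplitINFO is the entropy of the partition sizes,
# so many small partitions are penalized.
import math

def split_info(partition_sizes):
    n = sum(partition_sizes)
    return -sum((ni / n) * math.log2(ni / n) for ni in partition_sizes if ni)

def gain_ratio(information_gain, partition_sizes):
    return information_gain / split_info(partition_sizes)

# The same (hypothetical) gain of 0.4, split 2 ways vs. 20 ways over 20 records:
print(gain_ratio(0.4, [10, 10]))     # SplitINFO = 1.0  -> ratio 0.4
print(gain_ratio(0.4, [1] * 20))     # SplitINFO ~ 4.32 -> ratio ~ 0.09
```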

  43. Comparing Attribute Selection Measures. The three measures generally return good results, but: • Gini gain is biased toward multivalued attributes, has difficulty when the number of classes is large, and tends to favor tests that result in equal-sized, pure partitions. • Information gain is biased toward multivalued attributes. • Gain ratio tends to prefer unbalanced splits in which one partition is much smaller than the others.

  44. ID3 and C4.5. ID3 (Ross Quinlan, 1986) is the precursor to the C4.5 algorithm (Ross Quinlan, 1993); C4.5 is an extension of the earlier ID3 algorithm. For each unused attribute Ai, compute the information GAIN (ID3) or the GAIN ratio (C4.5) from splitting on Ai; find the best splitting attribute Abest with the highest GAIN or GAIN ratio; create a decision node that splits on Abest; then recurse on the sublists obtained by splitting on Abest and add those subtrees as children of the node.
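
A minimal sketch of the selection step shared by ID3 and C4.5, using information gain; the toy records are invented for illustration (C4.5 would additionally divide the gain by SplitINFO as on slide 42):

```python
# Score every unused attribute by information gain and split on the best one.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(records, labels, attr):
    groups = {}
    for rec, lbl in zip(records, labels):
        groups.setdefault(rec[attr], []).append(lbl)
    n = len(labels)
    children = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - children

records = [{"Refund": "Yes", "MarSt": "Single"}, {"Refund": "No", "MarSt": "Married"},
           {"Refund": "No", "MarSt": "Single"}, {"Refund": "No", "MarSt": "Divorced"}]
labels = ["No", "No", "Yes", "Yes"]
best = max(["Refund", "MarSt"], key=lambda a: information_gain(records, labels, a))
print(best)   # the attribute with the highest gain becomes the next decision node
```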

  45. Improvements of C4.5 over the ID3 algorithm. • Handling both continuous and discrete attributes: to handle continuous attributes, C4.5 creates a threshold and then splits the list into records whose attribute value is above the threshold and those whose value is less than or equal to it. • Handling training data with missing attribute values: C4.5 allows attribute values to be marked as '?' for missing; missing attribute values are simply not used in the gain and entropy calculations. • Pruning trees after creation: C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help, replacing them with leaf nodes.

  46. C4.5 Issues. It needs the entire dataset to fit in memory, which makes it unsuitable for large datasets, and it needs a lot of computation at every stage of constructing the decision tree. The software can be downloaded from http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/c4.5r8.tar.gz; more information: http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/tutorial.html

  47. Decision Tree Induction: Many Algorithms. Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT.

  48. SLIQ — a decision tree classifier. SLIQ (Supervised Learning In Quest; Mehta et al., EDBT '96): • uses a pre-sorting technique in the tree-growing phase, which eliminates the need to sort the data at each node • creates a separate attribute list for each attribute of the training data, and a separate list, called the class list, for the class labels attached to the examples • requires that only the class list and one attribute list be kept in memory at any time • is suitable for classification of large, disk-resident datasets • applies to both numerical and categorical attributes. A sketch of this data layout follows.
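
A rough sketch of this data layout; the record values and field names are invented for illustration, and this is not the SLIQ implementation itself:

```python
# SLIQ-style layout: one pre-sorted list per attribute, each entry carrying
# (attribute value, record id), plus a single class list mapping
# record id -> (class label, current tree node).
records = [
    {"Age": 30, "Salary": 65, "Class": "G"},   # made-up training data
    {"Age": 23, "Salary": 15, "Class": "B"},
    {"Age": 40, "Salary": 75, "Class": "G"},
]

# Class list: kept in memory for the whole run.
class_list = {rid: {"label": r["Class"], "node": "root"} for rid, r in enumerate(records)}

# Attribute lists: numeric attributes are sorted once, before tree growth,
# so no re-sorting is needed at each node.
attribute_lists = {
    attr: sorted((r[attr], rid) for rid, r in enumerate(records))
    for attr in ("Age", "Salary")
}
print(attribute_lists["Age"])   # [(23, 1), (30, 0), (40, 2)]
print(class_list[1])            # {'label': 'B', 'node': 'root'}
```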

  49. SLIQ Methodology. Create the decision tree by partitioning the records: generate an attribute list for each attribute, then sort the attribute lists of the numeric attributes (only the numeric attribute lists are sorted). [Figure: an example of the attribute lists before (Start) and after (End) pre-sorting.]

  50. Numeric attributes: splitting index. [Figure: for a sorted numeric attribute, the class histograms of the left and right sides are updated as the partition position moves; GINI_split = 0.44 at position 0, 0.22 at position 3, and 0.44 at position 6.]
