Classification: Decision Tree

1. Classification: Decision Tree Jialei Wang SCGY-USTC tdwjl@mail.ustc.edu.cn 1 USTC KDD CUP TEAM

2. Classification: Definition
Task: given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, develop a model from the training data that predicts the class label of new data. The class should be assigned as accurately as possible. Classification is a major task in data mining research and in the KDD CUP competition.

3. Illustrating the classification task

4. Classification: Induction and deduction
Induction: starting from training records with predetermined classes, construct a model such as classification rules, a decision tree, or mathematical formulae.
Deduction: use the constructed model to classify new objects. First estimate the accuracy of the model; if the accuracy is acceptable, use the model to classify tuples whose class labels are unknown.

5. Process one: Induction

6. Process two: Deduction

7. Classification techniques
- Decision tree based methods
- Rule-based methods
- Artificial neural networks
- Naïve Bayes and Bayesian belief networks
- SVM
- Bagging and boosting
- ...

8. Decision Tree: an important method
Ranks first among the top ten algorithms.

9. What is a Decision Tree?
A decision tree is a flow-chart-like tree structure. Each internal node denotes a test on an attribute. Each branch represents an outcome of the test. Each leaf represents a class distribution. To classify a record, start at the root node, test the attribute, and move down the matching branch, repeating until a leaf node is reached.
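The traversal described on this slide can be sketched in a few lines of Python. The nested-dictionary layout, the attribute names, and the function name `classify` are illustrative assumptions, not part of the original slides; for brevity each leaf holds a single class label rather than a full class distribution.

```python
# Hypothetical tree layout: an internal node is a dict holding the attribute
# it tests and one subtree per test outcome; a leaf is just a class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rain": "yes",
    },
}

def classify(node, record):
    """Walk from the root to a leaf, following the branch that matches the record."""
    while isinstance(node, dict):
        value = record[node["attribute"]]   # outcome of the test at this node
        node = node["branches"][value]      # move down the matching branch
    return node

label = classify(tree, {"outlook": "sunny", "humidity": "normal"})
print(label)
```

Only the attributes on the path from root to leaf are ever read, which is why classifying a record with a small tree is so cheap.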

10. Decision Tree: an example (the induction process)

11. Decision Tree: an example (the deduction process)






17. How to construct a decision tree?
Basic strategy: construct the tree in a top-down, recursive, divide-and-conquer manner:
- At first, all training examples are at the root.
- Data are partitioned recursively based on selected attributes.
- Partitioning terminates when a predesigned stopping condition is met.

18. Decision tree: questions
- How should attributes be selected when constructing the decision tree?
- How should the condition for stopping the recursion be designed?

19. Which attribute is best?
Basic idea: choose the attribute most useful for classifying examples. The purer the class labels are after splitting, the easier classification becomes, so we choose the attribute that makes the class labels purest. The choice also depends on the attribute type (nominal or continuous) and the number of ways to split (2-way split or multi-way split).

20. Measures of node impurity
- Information gain
- Gini index
- Gain ratio
- Misclassification error

21. Entropy
Given a node S:
Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
S is the training data set, c is the number of classes, and p_i is the proportion of S belonging to class i. Entropy ranges from 0 (all records belong to one class) to \log_2 c (records spread evenly over all c classes). The smaller the entropy, the purer the data set.
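As a minimal sketch, entropy can be computed directly from the class labels at a node. The function name `entropy` and the label values are illustrative; the sum is written as p * log2(1/p), which equals -p * log2(p).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    # Counter gives the size of each class; k / n is the proportion p_i.
    return sum((k / n) * log2(n / k) for k in Counter(labels).values())

print(entropy(["yes"] * 6))               # pure node: 0.0
print(entropy(["yes"] * 3 + ["no"] * 3))  # evenly split two-class node: 1.0
```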

22. Examples of computing entropy

23. Information Gain (ID3)
Select the attribute with the highest information gain, which is based on the entropy:
Gain(S, A) = Entropy(S) - \sum_{v \in Value(A)} \frac{|S_v|}{|S|} Entropy(S_v)
Value(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v.
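A small Python sketch of information gain: the entropy of S minus the size-weighted entropy of each subset S_v. The record layout (a list of dictionaries), the attribute names, and the function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return sum((k / n) * log2(n / k) for k in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy of the whole set minus the weighted entropy of the subsets S_v."""
    subsets = defaultdict(list)            # value v -> class labels of S_v
    for r in records:
        subsets[r[attribute]].append(r[target])
    all_labels = [r[target] for r in records]
    remainder = sum(len(s) / len(all_labels) * entropy(s) for s in subsets.values())
    return entropy(all_labels) - remainder

records = [
    {"windy": "no",  "day": "odd",  "play": "yes"},
    {"windy": "no",  "day": "even", "play": "yes"},
    {"windy": "yes", "day": "odd",  "play": "no"},
    {"windy": "yes", "day": "even", "play": "no"},
]
print(information_gain(records, "windy", "play"))  # splits the classes perfectly
print(information_gain(records, "day", "play"))    # tells us nothing about the class
```

ID3 would pick `windy` here, because it has the higher gain.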

24. Information Gain: example

25. Information Gain: problem

26. Information Gain: problem
Information gain tends to prefer splits that produce a large number of partitions, each small but pure. A good split needs not only a gain in purity but also generality.

27. Gain ratio (C4.5)
Gain ratio penalizes many-valued attributes such as a date by incorporating split information (C4.5). Split information is sensitive to how broadly and uniformly the attribute splits the data.
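The slide's formulas did not survive in this transcript, so the sketch below assumes the usual C4.5 definition: gain ratio is the information gain divided by the split information, where split information is the entropy of the partition sizes themselves. Function and variable names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return sum((k / n) * log2(n / k) for k in Counter(values).values())

def gain_ratio(attr_values, labels):
    """Information gain of the split divided by its split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(attr_values, labels):
        groups.setdefault(v, []).append(y)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(attr_values)   # large when the split is broad and uniform
    return gain / split_info if split_info > 0 else 0.0

labels = ["yes", "yes", "no", "no"]
date   = ["d1", "d2", "d3", "d4"]   # unique per record: pure but useless partitions
windy  = ["no", "no", "yes", "yes"]

print(gain_ratio(date, labels), gain_ratio(windy, labels))
```

Both attributes have the same information gain on this toy data, but the date-like attribute pays for its four-way split and ends up with the lower gain ratio.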

28. GINI index

29. GINI index (CART)
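The formulas on the two GINI slides are not preserved in this transcript. As a hedged sketch using the standard definition, the Gini index of a node is 1 minus the sum of squared class proportions, and a split is scored by the size-weighted Gini of its partitions. The function names are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini index of one node: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())

def gini_split(partitions):
    """Size-weighted Gini index of the partitions produced by a split."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * gini(p) for p in partitions)

print(gini(["yes", "no"]))                          # maximally impure two-class node
print(gini_split([["yes", "yes"], ["no", "no"]]))   # perfect split
```

A learner in the style of CART would choose the split with the lowest `gini_split` value.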

30. Classification error

31. Comparison among impurity criteria
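The comparison figure is missing from this transcript. As a rough substitute, the sketch below tabulates the three impurity measures for a two-class node in which a fraction p of the records belongs to one class. It assumes the standard definitions, with classification error taken as 1 minus the largest class proportion; the function names are illustrative.

```python
from math import log2

def entropy2(p):
    """Entropy of a two-class node with class proportions p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def gini2(p):
    return 1.0 - p ** 2 - (1 - p) ** 2

def error2(p):
    """Classification error: fraction misclassified when predicting the majority class."""
    return 1.0 - max(p, 1 - p)

for p in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"p={p:.1f}  entropy={entropy2(p):.3f}  gini={gini2(p):.3f}  error={error2(p):.3f}")
```

All three measures are zero for a pure node and largest at p = 0.5; they differ in how quickly they rise in between.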

32. When to stop splitting?
Basic idea: partitioning terminates as soon as any of these conditions is met:
- All examples falling into a node belong to the same class: the node becomes a leaf labeled with that class.
- No attribute remains that can further partition the data: the node becomes a leaf labeled with the majority class of the examples falling into the node.
- No instance falls into a node: the node becomes a leaf labeled with the majority class of the examples falling into the parent of the node.
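The top-down recursive strategy and the three stopping conditions can be put together in a short ID3-style sketch. This is an illustration under assumptions, not the presenter's implementation: the nested-dictionary tree layout, the `domains` argument (all possible values per attribute, needed to create empty branches), and the toy data are all hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum((k / n) * log2(n / k) for k in Counter(labels).values())

def majority(records, target):
    return Counter(r[target] for r in records).most_common(1)[0][0]

def build_tree(records, attributes, domains, target, parent_majority=None):
    # Stop: no instance falls into this node -> majority class of the parent.
    if not records:
        return parent_majority
    labels = [r[target] for r in records]
    # Stop: all examples belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attribute left to partition on -> majority class of this node.
    if not attributes:
        return majority(records, target)

    def remainder(a):
        groups = {}
        for r in records:
            groups.setdefault(r[a], []).append(r[target])
        return sum(len(g) / len(records) * entropy(g) for g in groups.values())

    # Lowest weighted entropy after the split == highest information gain.
    best = min(attributes, key=remainder)
    rest = [a for a in attributes if a != best]
    node_majority = majority(records, target)
    return {
        "attribute": best,
        "branches": {
            v: build_tree([r for r in records if r[best] == v],
                          rest, domains, target, node_majority)
            for v in domains[best]
        },
    }

data = [
    {"outlook": "sunny", "windy": "no",  "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no",  "play": "no"},
    {"outlook": "rain",  "windy": "no",  "play": "yes"},
    {"outlook": "rain",  "windy": "yes", "play": "no"},
]
domains = {"outlook": ["sunny", "overcast", "rain"], "windy": ["no", "yes"]}
tree = build_tree(data, ["outlook", "windy"], domains, "play")
print(tree)
```

In this toy data no record has `outlook = overcast`, so that branch exercises the third stopping condition and inherits the majority class of the root.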

33. C4.5: an example
- Construction: simple depth-first
- Impurity measure: information gain (normalized as gain ratio)
- Needs the entire data set to fit in memory
- Unsuitable for large datasets

34. Some practical issues
- Underfitting and overfitting
- Algorithm cost
- Missing values

35. Underfitting and Overfitting
Underfitting: the model is too simple, so both training and test errors are large.
Overfitting: the decision tree is more complex than necessary.
Causes of overfitting: noise and a biased training set.

36. Overfitting: solution
Occam's Razor: prefer the simplest hypothesis that fits the data. A complex model has a greater chance of fitting the data accidentally because of errors in the data, whereas a short tree that fits is less likely to be a statistical coincidence.
When a decision tree is built, many branches may reflect anomalies in the training set caused by noise or outliers, so pruning can be used to address overfitting.

37. How to prune a decision tree?
Two popular methods: prepruning and postpruning.
Prepruning: stop the algorithm before the tree is fully grown by using more restrictive stopping conditions. It is sometimes hard to choose an appropriate threshold.
Postpruning: remove branches from a "fully grown" tree. In general, postpruning is more accurate than prepruning but costs more computation. The key is how to determine the correct final tree size.
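The slide does not name a specific postpruning algorithm. One common choice is reduced-error pruning, sketched below as an illustration: working bottom-up, a subtree is replaced by a leaf with its majority class whenever that does not increase the error on a held-out validation set. The tree layout (each internal node stores a `majority` label), the data, and the function names are assumptions.

```python
def classify(node, record):
    while isinstance(node, dict):
        # Fall back to the node's majority class for an unseen attribute value.
        node = node["branches"].get(record[node["attribute"]], node["majority"])
    return node

def errors(node, records, target):
    return sum(classify(node, r) != r[target] for r in records)

def prune(node, validation, target):
    """Reduced-error pruning, bottom-up; modifies the tree in place."""
    if not isinstance(node, dict):
        return node
    for value, child in list(node["branches"].items()):
        subset = [r for r in validation if r[node["attribute"]] == value]
        node["branches"][value] = prune(child, subset, target)
    leaf = node["majority"]
    # Keep the simpler model unless the subtree is strictly better on validation data.
    if errors(leaf, validation, target) <= errors(node, validation, target):
        return leaf
    return node

tree = {
    "attribute": "outlook", "majority": "no",
    "branches": {
        "sunny": "yes",
        "rain": {"attribute": "windy", "majority": "no",
                 "branches": {"no": "yes", "yes": "no"}},
    },
}
validation = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "rain",  "windy": "no",  "play": "no"},
    {"outlook": "rain",  "windy": "yes", "play": "no"},
]
pruned = prune(tree, validation, "play")
print(pruned)
```

Here the `windy` subtree misclassifies a validation record that a plain "no" leaf gets right, so it is collapsed, while the root test on `outlook` is kept because removing it would add an error.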

38. Enhancing the basic decision tree algorithm
Continuous-valued attributes: find one or more split points that partition the domain of a continuous-valued attribute into a set of ranges, with one branch per range.
Missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
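For a continuous-valued attribute, one simple way to find a single split point is to sort the values, try the midpoint between each pair of adjacent distinct values, and keep the candidate with the lowest weighted entropy. The sketch below assumes a binary split and uses entropy as the impurity measure; the function name `best_threshold` and the sample values are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum((k / n) * log2(n / k) for k in Counter(labels).values())

def best_threshold(values, labels):
    """Return the midpoint threshold giving the purest binary split, or None."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_rem = None, float("inf")
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue                      # no boundary between equal values
        left = [y for _, y in pairs[:i]]  # records with value <= threshold
        right = [y for _, y in pairs[i:]]
        rem = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if rem < best_rem:
            best_t, best_rem = (lo + hi) / 2, rem
    return best_t

income = [60, 70, 75, 85, 90, 95]
label = ["no", "no", "no", "yes", "yes", "yes"]
print(best_threshold(income, label))
```

The resulting node has two branches, one for values at or below the threshold and one for values above it.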

39. Enhancing the basic decision tree algorithm
- SLIQ [Mehta et al., EDBT96]: builds an index for each attribute. Only the class list and the current attribute list reside in memory.
- SPRINT [J. Shafer et al., VLDB96]: constructs an attribute list. When a node is partitioned, the attribute list is partitioned as well.
- RainForest [J. Gehrke et al., VLDB98]: builds an AVC-list (attribute, value, class label) and separates the scalability aspects from the criteria that determine the quality of the tree.

40. Other enhancing methods
Ensemble methods: construct a set of base classifiers and take a vote on their predictions.
Representative algorithms:
- Bagging
- AdaBoost
- Random forests
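A minimal sketch of bagging with majority voting: each base classifier is trained on a bootstrap sample (drawn with replacement) and the ensemble predicts the most frequent vote. The base learner `train_majority` is a deliberately trivial stand-in so the example stays self-contained; in practice it would be a decision tree learner. All names and parameters are illustrative.

```python
import random
from collections import Counter

def bagging_fit(records, train, n_models=11, seed=0):
    """Train n_models base classifiers, each on its own bootstrap sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(records) for _ in records]  # sample with replacement
        models.append(train(sample))
    return models

def bagging_predict(models, record):
    """Majority vote over the predictions of the base classifiers."""
    votes = Counter(model(record) for model in models)
    return votes.most_common(1)[0][0]

def train_majority(sample):
    # Stand-in base learner: always predicts the majority class of its sample.
    label = Counter(r["play"] for r in sample).most_common(1)[0][0]
    return lambda record: label

records = [{"play": "yes"}] * 5
models = bagging_fit(records, train_majority)
print(bagging_predict(models, {"outlook": "sunny"}))
```

Using an odd number of base classifiers avoids tied votes in two-class problems.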

41. Summary

42. Summary
Advantages:
- Inexpensive to construct
- Insensitive to errors in individual data
- Extremely fast at classifying unknown records
- Easy to interpret for small-sized trees
- Accuracy is acceptable
Disadvantages:
- No backtracking: may converge to a locally optimal solution
- Mines only a small set of rules and tests only one attribute at a time
- Poor stability
- Accuracy is not very high

43. Summary: accuracy

44. Summary: accuracy

45. References
- Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques, 2nd ed. China Machine Press.
- Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Post & Telecom Press.
- Xindong Wu, Vipin Kumar. The Top Ten Algorithms in Data Mining. Chapman & Hall.
- Bing Liu. Web Data Mining. Tsinghua University Press.
- Some slides made by Jiawei Han, Jian Pei, Bing Liu, Pang-Ning Tan, Hong Cheng, Zhi-hua Zhou, Xindong Wu.

46. Selected Readings
- J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
- C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
- J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

47. The end
THANK YOU! tdwjl@mail.ustc.edu.cn
