1. Classification: Decision Tree
1 USTC KDD CUP TEAM
2. Classification: Definition
Task: given a collection of records (the training set), where each record consists of a set of attributes and one of the attributes is the class, develop a model from the training data that predicts the class label of new data.
The class should be assigned as accurately as possible.
Classification is a major task in data mining research and in the KDD CUP competition.
3. Illustrating classification task
4. Classification: Induction and deduction
Induction: summarize the training records of a set of predetermined classes and construct a model, such as classification rules, a decision tree, or a mathematical formula.
Deduction: use the constructed model to classify new objects. First estimate the accuracy of the model; if the accuracy is acceptable, use the model to classify tuples with unknown class labels.
5. Process one: Induction
6. Process two: Deduction
7. Classification techniques
Decision Tree based Methods
Artificial Neural Networks
Naïve Bayes and Bayesian Belief Networks
Bagging and Boosting
8. Decision Tree: an important method
Ranked first among the top ten algorithms in data mining.
9. What is a Decision Tree?
A decision tree is a flow-chart-like tree structure.
Each internal node denotes a test on an attribute.
Each branch represents an outcome of the test.
Each leaf represents a class distribution.
To classify a record, we start at the root node, test the attribute at that node, and move down the matching branch, repeating until we reach a leaf node.
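The root-to-leaf walk described on this slide can be sketched in a few lines of Python. The nested-dict encoding, the attribute names, and the class labels below are hypothetical and chosen only for illustration. For simplicity each leaf holds a single class label rather than a full class distribution.

```python
# Hypothetical tree encoding (an assumption, not from the slides): an internal
# node is a dict naming the attribute it tests, with one subtree per outcome;
# a leaf is simply a class label.
toy_tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {"yes": "stay_in", "no": "go_out"},
        },
        "rainy": "stay_in",
    },
}


def classify(tree, record):
    """Walk from the root to a leaf, following the branch that matches
    the record's value for the attribute tested at each node."""
    node = tree
    while isinstance(node, dict):          # still at an internal node
        value = record[node["attribute"]]  # outcome of the test
        node = node["branches"][value]     # move down the matching branch
    return node                            # a leaf: the predicted class


print(classify(toy_tree, {"outlook": "sunny", "windy": "no"}))
```

Only the attributes on the path from the root to the leaf are ever tested, which is why classifying a record is fast.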
10. Decision Tree: an example (the induction process)
11. Decision Tree: an example (the deduction process)
12. Decision Tree: an example (the deduction process)
13. Decision Tree: an example (the deduction process)
14. Decision Tree: an example (the deduction process)
15. Decision Tree: an example (the deduction process)
16. Decision Tree: an example (the deduction process)
17. How to construct a decision tree?
Basic strategy: construct the tree in a top-down, recursive, divide-and-conquer manner:
At first, all training examples are at the root.
Data are partitioned recursively based on selected attributes.
Partitioning terminates when a predefined stopping condition is met.
18. Decision tree: questions
How to select the attributes when constructing the decision tree?
How to design the condition for stopping the recursion?
19. Which attribute is best?
Basic idea: choose the attribute most useful for classifying examples.
The purer the class labels are after a split, the easier classification becomes, so we choose the attribute whose split makes the class labels purest.
The choice also depends on the attribute type (nominal or continuous) and the number of ways to split (2-way or multi-way).
20. Measures of node impurity
Information gain
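21. Entropy
Given a node S with c classes:
$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$
S is the training data set at the node, and $p_i$ is the proportion of records in S that belong to class i.
Entropy ranges from 0 (all records belong to one class) to $\log_2 c$ (the records are spread evenly over the c classes).
The smaller the entropy, the purer the data set.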
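As a minimal sketch, entropy can be computed directly from a list of class labels. The function name and the toy label lists are assumptions made for illustration. The sum uses the equivalent form p * log2(1/p), which avoids printing a negative zero for pure nodes.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels, i.e. -sum(p_i * log2(p_i)),
    written as sum(p_i * log2(1 / p_i))."""
    total = len(labels)
    counts = Counter(labels)  # class -> number of records in that class
    return sum((n / total) * log2(total / n) for n in counts.values())


# Made-up label lists, not taken from the slides.
print(entropy(["yes", "yes", "yes", "yes"]))  # pure node
print(entropy(["yes", "yes", "no", "no"]))    # evenly mixed, two classes
print(entropy(["a", "b", "c", "d"]))          # evenly mixed, four classes
```

The three calls illustrate the range stated above: 0 for a pure node, and log2 of the number of classes when the classes are evenly mixed.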
22. Examples of computing entropy
23. Information Gain (ID3)
Select the attribute with the highest information gain.
Based on the entropy:
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Value}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
Value(A) is the set of all possible values for attribute A, and $S_v$ is the subset of S for which attribute A has value v.
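The following sketch computes information gain as the entropy of S minus the size-weighted entropy of the subsets produced by splitting on A. Records are assumed to be dicts; the attribute names and the four toy records are made up for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def information_gain(records, attribute, target):
    """Entropy(S) minus the weighted entropy of the subsets S_v."""
    subsets = defaultdict(list)  # value v -> class labels of S_v
    for r in records:
        subsets[r[attribute]].append(r[target])
    remainder = sum(len(s) / len(records) * entropy(s) for s in subsets.values())
    return entropy([r[target] for r in records]) - remainder


# Made-up toy data, not taken from the slides.
records = [
    {"windy": "yes", "humid": "yes", "play": "no"},
    {"windy": "yes", "humid": "no", "play": "no"},
    {"windy": "no", "humid": "yes", "play": "yes"},
    {"windy": "no", "humid": "no", "play": "yes"},
]

for attribute in ("windy", "humid"):
    print(attribute, information_gain(records, attribute, "play"))
```

In this toy data `windy` separates the classes perfectly while `humid` leaves every subset as mixed as the whole set, so an ID3-style learner would split on `windy`.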
24. Information Gain: example
25. Information Gain: problem
26. Information Gain: problem
Information gain tends to prefer splits that result in a large number of partitions, each small but pure.
We need not only a gain in purity but also to retain generality.
27. Gain ratio (C4.5)
Gain ratio: penalizes attributes like date by incorporating split information (C4.5).
Split information is sensitive to how broadly and uniformly the attribute splits the data.
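A small sketch of the penalty, assuming the usual definition: gain ratio is the information gain divided by the split information, and split information is the entropy of the attribute's own value distribution. The `date` attribute and the toy records are made up; `date` takes a different value in every record.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(items):
    """Entropy of the empirical distribution of a list of hashable items."""
    total = len(items)
    return sum((n / total) * log2(total / n) for n in Counter(items).values())


def information_gain(records, attribute, target):
    subsets = defaultdict(list)
    for r in records:
        subsets[r[attribute]].append(r[target])
    remainder = sum(len(s) / len(records) * entropy(s) for s in subsets.values())
    return entropy([r[target] for r in records]) - remainder


def gain_ratio(records, attribute, target):
    # Split information: entropy of how the attribute partitions the records.
    split_info = entropy([r[attribute] for r in records])
    if split_info == 0:  # attribute has a single value, so it cannot split
        return 0.0
    return information_gain(records, attribute, target) / split_info


# Made-up toy data: "date" is unique per record, "windy" is a genuine predictor.
records = [
    {"date": "d1", "windy": "yes", "play": "no"},
    {"date": "d2", "windy": "yes", "play": "no"},
    {"date": "d3", "windy": "no", "play": "yes"},
    {"date": "d4", "windy": "no", "play": "yes"},
]

for attribute in ("date", "windy"):
    print(attribute,
          information_gain(records, attribute, "play"),
          gain_ratio(records, attribute, "play"))
```

Both attributes reach the same information gain here, because splitting on `date` yields four tiny pure partitions. The larger split information of `date` halves its gain ratio, so `windy` is preferred.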
28. GINI index
29. GINI index (CART)
30. Classification error
31. Comparison among impurity criteria
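The figures for these slides are not available here, so the following is only a sketch using the standard textbook definitions: the Gini index is 1 minus the sum of squared class proportions, and the classification error is 1 minus the largest class proportion. It tabulates the three measures for a two-class node as the proportion p of one class varies.

```python
from math import log2


def entropy(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)


def gini(probs):
    return 1 - sum(p * p for p in probs)


def classification_error(probs):
    return 1 - max(probs)


# Two-class node: p is the proportion of the first class.
for p in (0.0, 0.1, 0.3, 0.5):
    dist = (p, 1 - p)
    print(f"p={p:.1f}  entropy={entropy(dist):.3f}  "
          f"gini={gini(dist):.3f}  error={classification_error(dist):.3f}")
```

All three measures are 0 for a pure node and reach their maximum when the two classes are evenly mixed; they differ in how quickly they rise in between.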
32. When to stop splitting?
Basic idea: partitioning terminates if any of the following conditions is met:
All examples falling into a node belong to the same class: the node becomes a leaf labeled with that class.
No attribute remains that can further partition the data: the node becomes a leaf labeled with the majority class of the examples falling into the node.
No instance falls into a node: the node becomes a leaf labeled with the majority class of the examples falling into its parent.
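The three stopping conditions can be sketched as the base cases of a recursive, ID3-style builder that splits on the attribute with the highest information gain. The function names, the `domains` argument (the possible values of each attribute), and the toy records are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def gain(records, attr, target):
    remainder = 0.0
    for v in {r[attr] for r in records}:
        subset = [r[target] for r in records if r[attr] == v]
        remainder += len(subset) / len(records) * entropy(subset)
    return entropy([r[target] for r in records]) - remainder


def build(records, attrs, domains, target, parent_majority=None):
    if not records:              # no instance falls into the node
        return parent_majority
    labels = [r[target] for r in records]
    if len(set(labels)) == 1:    # all examples belong to the same class
        return labels[0]
    if not attrs:                # no attribute left to partition the data
        return majority(labels)
    best = max(attrs, key=lambda a: gain(records, a, target))
    rest = [a for a in attrs if a != best]
    return {
        "attribute": best,
        "branches": {
            v: build([r for r in records if r[best] == v],
                     rest, domains, target, majority(labels))
            for v in domains[best]
        },
    }


# Made-up toy data; no training record has outlook == "cloudy".
domains = {"outlook": ["sunny", "rainy", "cloudy"], "windy": ["yes", "no"]}
records = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
tree = build(records, ["outlook", "windy"], domains, "play")
print(tree)
```

In this toy run the `sunny` branch stops because its examples are pure, and the empty `cloudy` branch takes the majority class of its parent.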
33. C4.5: an example
Construction: simple depth-first
Impurity measure: information gain (normalized as the gain ratio)
Needs the entire data set to fit in memory
Unsuitable for large data sets
34. Some practical issues
Underfitting and Overfitting
Missing values
35. Underfitting and Overfitting
Underfitting: the model is too simple, so both training and test errors are large.
Overfitting: the decision tree is more complex than necessary, so it fits the training data well but generalizes poorly.
Reasons for overfitting:
a biased training set
36. Overfitting: solution
Occam's Razor: prefer the simplest hypothesis that fits the data.
A complex model has a greater chance of having been fitted accidentally to errors in the data, whereas a short tree that fits the data is less likely to be a statistical coincidence.
When a decision tree is built, many branches may reflect anomalies in the training set caused by noise or outliers, so we can use pruning to address overfitting.
37. How to prune a decision tree?
Two popular methods: prepruning and postpruning
Prepruning: stop the algorithm before the tree is fully grown by using more restrictive stopping conditions; however, it is sometimes hard to choose an appropriate threshold.
Postpruning: remove branches from a fully grown tree.
In general, postpruning is more accurate than prepruning but requires more computation. The key is how to determine the correct final tree size.
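The slides do not name a specific postpruning method. As one possible illustration, the sketch below uses reduced-error pruning: working bottom-up, a subtree is replaced by a leaf with its majority class whenever that does not increase the error on a held-out validation set. The tree encoding (with a stored `majority` label per internal node) and the toy data are assumptions.

```python
def classify(tree, record):
    node = tree
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node


def prune(node, validation, target):
    """Return a pruned copy of `node`, judged on the validation records
    that reach it."""
    if not isinstance(node, dict):
        return node  # leaves are kept as they are
    attr = node["attribute"]
    pruned = {
        "attribute": attr,
        "majority": node["majority"],
        "branches": {
            v: prune(child, [r for r in validation if r[attr] == v], target)
            for v, child in node["branches"].items()
        },
    }
    subtree_errors = sum(classify(pruned, r) != r[target] for r in validation)
    leaf_errors = sum(node["majority"] != r[target] for r in validation)
    # Prefer the simpler leaf whenever it is at least as accurate.
    return node["majority"] if leaf_errors <= subtree_errors else pruned


# Made-up overfitted tree: the split on "humid" is assumed to come from noise.
tree = {
    "attribute": "windy",
    "majority": "yes",
    "branches": {
        "no": "yes",
        "yes": {
            "attribute": "humid",
            "majority": "no",
            "branches": {"yes": "no", "no": "yes"},
        },
    },
}
validation = [
    {"windy": "yes", "humid": "no", "play": "no"},
    {"windy": "yes", "humid": "yes", "play": "no"},
    {"windy": "no", "humid": "no", "play": "yes"},
    {"windy": "no", "humid": "yes", "play": "yes"},
]
print(prune(tree, validation, "play"))
```

Here the `humid` subtree misclassifies one validation record while a plain leaf misclassifies none, so it is collapsed. The root split on `windy` is kept because replacing it with a leaf would add errors.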
38. Enhance basic decision tree algorithm
Continuous-valued attributes: find one or more split points that partition the domain of a continuous-valued attribute into a set of ranges, with one branch per range.
Missing attribute values: assign the most common value of the attribute, or assign a probability to each of its possible values.
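Both enhancements can be sketched briefly. For a continuous attribute, one common choice (an assumption here, since the slides do not fix a procedure) is to try the midpoints between consecutive sorted values and keep the binary split with the highest information gain. For missing values, the sketch implements only the first option above, filling in the most common observed value. All names and data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) of the best binary split `value <= threshold`."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no split point between equal values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - weighted > best[1]:
            best = ((low + high) / 2, base - weighted)
    return best


def fill_most_common(records, attribute):
    """Replace missing (None) values with the attribute's most common value."""
    known = [r[attribute] for r in records if r[attribute] is not None]
    mode = Counter(known).most_common(1)[0][0]
    return [{**r, attribute: mode} if r[attribute] is None else r for r in records]


# Made-up toy data, not taken from the slides.
print(best_threshold([60, 70, 75, 85, 90, 95],
                     ["no", "no", "no", "yes", "yes", "yes"]))
print(fill_most_common(
    [{"windy": "yes"}, {"windy": None}, {"windy": "yes"}, {"windy": "no"}],
    "windy"))
```

Applying `best_threshold` repeatedly to the resulting ranges would give a multi-way partition instead of a single binary split.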
39. Enhance basic decision tree algorithm
SLIQ [Mehta et al., EDBT96]
builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT [J. Shafer et al., VLDB96]
constructs an attribute list data structure; when a node is partitioned, its attribute lists are partitioned as well
RainForest [J. Gehrke et al., VLDB98]
builds an AVC-list (attribute, value, class label)
separates the scalability aspects from the criteria that determine the quality of the tree
40. Other enhancing methods
Ensemble methods: construct a set of base classifiers and classify by taking a vote on their predictions.
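The voting step can be sketched as a simple majority over the predictions of the base classifiers. The three one-attribute classifiers below are hypothetical stand-ins for trained decision trees.

```python
from collections import Counter


def vote(classifiers, record):
    """Return the class predicted by the largest number of base classifiers."""
    predictions = [clf(record) for clf in classifiers]
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical base classifiers, each testing a single attribute.
base_classifiers = [
    lambda r: "yes" if r["windy"] == "no" else "no",
    lambda r: "yes" if r["humid"] == "no" else "no",
    lambda r: "yes" if r["outlook"] == "sunny" else "no",
]

record = {"windy": "no", "humid": "yes", "outlook": "sunny"}
print(vote(base_classifiers, record))
```

Using an odd number of base classifiers avoids ties in the two-class case.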
41. Summary
42. Summary
Advantages:
Inexpensive to construct
Insensitive to errors in individual data
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is acceptable
Disadvantages:
No backtracking: may converge to a locally optimal solution
Mines only a small set of rules and tests only one attribute at a time
Accuracy is not so high
43. Summary: accuracy
44. Summary: accuracy
45. References
Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques, 2nd ed. China Machine Press.
Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Post & Telecom Press.
Xindong Wu, Vipin Kumar. The Top Ten Algorithms in Data Mining. Chapman & Hall.
Bing Liu. Web Data Mining. Tsinghua University Press.
Some slides were made by Jiawei Han, Jian Pei, Bing Liu, Pang-Ning Tan, Hong Cheng, Zhi-hua Zhou, and Xindong Wu.
46. Selected Readings
J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
47. The end