
Classification I


Presentation Transcript


  1. Classification I

  2. Today • What is classification • Overview of classification methods • Decision trees

  3. Predictive data mining • Our task • Input: data representing objects that have been assigned labels • Goal: accurately predict labels for new (previously unseen) objects

  4. An example: email classification [table: rows are examples (observations), columns are features (attributes)]

  5. Decision tree [figure: a decision tree over the e-mail features, with leaves Spam = yes, Spam = yes and Spam = no]

  6. Rules [figure: rules derived from the tree, with conclusions Spam = no, Spam = yes, Spam = yes, Spam = no and Spam = no]

  7. Forests

  8. Classification • What is the class of the following e-mail? • No Caps: Yes • No. excl. marks: 0 • Missing date: Yes • No. digits in From: 4 • Image fraction: 0.3

  9. Classification • Classification: • predicts categorical class labels • classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute • uses the model for classifying new data • Typical Applications • credit approval • target marketing • medical diagnosis • treatment effectiveness analysis

  10. Why Classification? A motivating application • Credit approval • A bank wants to classify its customers based on whether they are expected to pay back their approved loans • The history of past customers is used to train the classifier • The classifier provides rules, which identify potentially reliable future customers

  11. Why Classification? A motivating application • Credit approval • Classification rule: • If age = “31...40” and income = high then credit_rating = excellent • Future customers • Paul: age = 35, income = high → excellent credit rating • John: age = 20, income = medium → fair credit rating

  12. Evaluation of classification models • Counts of test records that are correctly (or incorrectly) predicted by the classification model • Confusion matrix [table: rows = actual class, columns = predicted class]
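
A confusion matrix is simply a table of these counts. As a minimal sketch (with made-up label lists for a binary spam task, not data from the slides), the counts can be tallied like this:

    from collections import Counter

    actual    = ["spam", "spam", "ham", "ham", "spam", "ham"]   # true labels (made up)
    predicted = ["spam", "ham",  "ham", "spam", "spam", "ham"]  # model output (made up)

    # Count (actual, predicted) pairs; pairs with actual == predicted are the correct predictions.
    confusion = Counter(zip(actual, predicted))
    for (a, p), n in sorted(confusion.items()):
        print(f"actual={a:4} predicted={p:4} count={n}")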

  13. Classification—A Two-Step Process • Model construction: describing a set of predetermined classes • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute • The set of tuples used for model construction: training set • The model is represented as classification rules, decision trees, or mathematical formulas

  14. Classification—A Two-Step Process • Model usage: for classifying future or unknown objects • Estimate accuracy of the model • The known label of test samples is compared with the classified result from the model • Accuracy rate is the percentage of test set samples that are correctly classified by the model • Test set is independent of training set, otherwise over-fitting will occur
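
A minimal sketch of this two-step process, assuming scikit-learn (which the slides do not mention) and a synthetic dataset standing in for real labelled records:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=300, random_state=0)   # stand-in for labelled data

    # Keep the test set independent of the training set.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)   # step 1: model construction
    accuracy = accuracy_score(y_test, model.predict(X_test))               # step 2: model usage
    print(f"accuracy on the independent test set: {accuracy:.2%}")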

  15. Classification Process (1): Model Construction [figure: training data → classification algorithms → classifier (model)] • Example rule produced by the classifier: IF LDL = ‘high’ OR Gluc > 6 mmol/l THEN Heart attack = ‘yes’

  16. Classification Process (2): Use the Model in Prediction [figure: the classifier is applied to testing data to estimate accuracy (Accuracy = ?), and then to unseen data, e.g. (Jeff, LDL = high, Gluc = 7.5) → Heart attack?]

  17. Classification by Decision Tree Induction • Decision tree • A flow-chart-like tree structure • Internal node denotes a test on an attribute • Branch represents an outcome of the test • Leaf nodes represent class labels or class distribution

  18. Classification by Decision Tree Induction • Decision tree generation consists of two phases: • Tree construction • At start, all the training examples are at the root • Partition examples recursively based on selected attributes • Tree pruning • Identify and remove branches that reflect noise or outliers • Use of decision tree: • Classifying an unknown sample • Test the attribute values of the sample against the decision tree
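
As a minimal sketch of the "use" phase, the tree can be stored as a nested dict (an encoding chosen here for illustration, not taken from the slides) and a sample classified by walking from the root to a leaf; the attribute names below are simplified versions of the e-mail features on slide 8:

    # Internal node: {"attribute": name, "branches": {value: subtree}}; a leaf is a class label.
    tree = {"attribute": "missing_date",
            "branches": {"yes": "spam = yes",
                         "no": {"attribute": "caps_in_subject",
                                "branches": {"no": "spam = yes", "yes": "spam = no"}}}}

    def classify(node, sample):
        """Test one attribute per internal node until a leaf (class label) is reached."""
        while isinstance(node, dict):
            node = node["branches"][sample[node["attribute"]]]
        return node

    email = {"missing_date": "yes", "caps_in_subject": "no"}   # made-up attribute values
    print(classify(tree, email))                               # -> spam = yes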

  19. Training Dataset Example [table: training tuples for the buys_computer example, with attributes such as age, student and credit_rating and the class label buys_computer]

  20. Output: A Decision Tree for “buys_computer” [figure: root node age? with branches <=30 (→ student?: no → no, yes → yes), 30..40 (→ yes) and >40 (→ credit rating?: excellent → no, fair → yes)]

  21. Design issues • How should the training records be split? • How should the splitting procedure stop?

  22. Splitting methods • Binary attributes

  23. Splitting methods • Nominal attributes

  24. Splitting methods • Ordinal attributes

  25. Splitting methods • Continuous attributes
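
Slides 22–25 illustrate these splitting methods with figures that are not reproduced in the transcript. One possible way (an assumption for illustration, with hypothetical attribute names) to express the four kinds of test condition as predicates over a record:

    # Each test condition is a predicate over a record r (a dict of attribute values).
    binary_test     = lambda r: r["refund"] == "yes"                           # binary: two outcomes
    nominal_test    = lambda r: r["marital_status"] in {"single", "divorced"}  # nominal: any grouping of values
    ordinal_test    = lambda r: r["shirt_size"] in {"small", "medium"}         # ordinal: groups must respect the order
    continuous_test = lambda r: r["annual_income"] > 80_000                    # continuous: threshold comparison

    record = {"refund": "no", "marital_status": "single",
              "shirt_size": "medium", "annual_income": 95_000}
    print(binary_test(record), nominal_test(record),
          ordinal_test(record), continuous_test(record))                       # False True True True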

  26. Choice of attribute [figure: scatter plots of + and – examples, illustrating that some attributes separate the classes better than others]

  27. Algorithm for Decision Tree Induction: Hunt’s algorithm • Basic algorithm (a greedy algorithm) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all the training examples are at the root • Attributes are categorical (if continuous-valued, they are discretized in advance) • Samples are partitioned recursively based on selected attributes • Test (split) attributes are selected on the basis of a heuristic or impurity measure (e.g., information gain)

  28. Algorithm for Decision Tree Induction • Conditions for stopping partitioning • All samples for a given node belong to the same class • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf • There are no samples left

  29. Algorithm for Decision Tree Induction (pseudocode) Algorithm GenDecTree(Sample S, Attlist A) • Create a node N • If all samples in S are of the same class C, then label N with C; terminate • If A is empty, then label N with the most common class C in S (majority voting); terminate • Select a ∈ A with the highest information gain; label N with a • For each value v of a: • Grow a branch from N with condition a = v • Let Sv be the subset of samples in S with a = v • If Sv is empty, then attach a leaf labeled with the most common class in S • Else attach the node generated by GenDecTree(Sv, A - {a})
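
A compact Python rendering of GenDecTree, under a few assumptions the slide leaves open: each sample is a dict with a "class" key, attributes are categorical, and information gain (defined formally on the next two slides) is the selection measure:

    import math
    from collections import Counter

    def entropy(samples):
        counts = Counter(s["class"] for s in samples)
        n = len(samples)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def info_gain(samples, attr):
        n = len(samples)
        remainder = 0.0
        for v in {s[attr] for s in samples}:
            sv = [s for s in samples if s[attr] == v]
            remainder += len(sv) / n * entropy(sv)
        return entropy(samples) - remainder

    def gen_dec_tree(samples, attributes):
        classes = [s["class"] for s in samples]
        majority = Counter(classes).most_common(1)[0][0]
        if len(set(classes)) == 1:        # all samples belong to the same class C
            return classes[0]
        if not attributes:                # A is empty: label with the most common class (majority voting)
            return majority
        a = max(attributes, key=lambda attr: info_gain(samples, attr))
        node = {"attribute": a, "branches": {}}
        # One branch per observed value a = v; if branching on the full domain of a instead,
        # an empty subset Sv would get a leaf labeled with the most common class in S.
        for v in {s[a] for s in samples}:
            sv = [s for s in samples if s[a] == v]
            node["branches"][v] = gen_dec_tree(sv, [x for x in attributes if x != a])
        return node

Called on the training data of slide 19 with attributes such as age, student and credit_rating, this sketch should reproduce a tree like the one shown on slide 20.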

  30. Attribute Selection Measure: Information Gain • Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D| • where Ci,D denotes the set of tuples of D that belong to class Ci • Expected information (entropy) needed to classify a tuple in D: Info(D) = -Σi=1..m pi log2(pi) • where m is the number of classes

  31. Attribute Selection Measure: Information Gain • Information needed (after using A to split D into v partitions) to classify D: InfoA(D) = Σj=1..v (|Dj| / |D|) × Info(Dj) • Information gained by branching on attribute A: Gain(A) = Info(D) - InfoA(D)

  32. Attribute Selection: Information Gain • Class P: buys_computer = “yes” • Class N: buys_computer = “no” [table: yes/no sample counts for each attribute value, used to compute the information gain of each attribute]
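
The yes/no counts on this slide are not preserved in the transcript. Assuming the classic 14-tuple buys_computer training set that these slides appear to follow (9 “yes” and 5 “no” overall; for age: <=30 has 2 yes / 3 no, 30...40 has 4 yes / 0 no, >40 has 3 yes / 2 no), the computation can be checked numerically:

    from math import log2

    def info(*counts):
        """Info (entropy) of a class distribution given as raw counts."""
        n = sum(counts)
        return -sum(c / n * log2(c / n) for c in counts if c)

    info_D   = info(9, 5)                                                       # 9 yes / 5 no overall
    info_age = (5/14) * info(2, 3) + (4/14) * info(4, 0) + (5/14) * info(3, 2)  # weighted by partition size
    print(f"Info(D)   = {info_D:.3f}")             # ~0.940
    print(f"Gain(age) = {info_D - info_age:.3f}")  # ~0.246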

  33. Splitting the samples using age [figure: the training samples partitioned by age? into branches <=30, 30...40 and >40; the 30...40 partition is labeled yes because all of its samples belong to that class]

  34. Gini index • If a data set D contains examples from n classes, the gini index gini(D) is defined as gini(D) = 1 - Σj=1..n pj^2, where pj is the relative frequency of class j in D • If a data set D is split on A into two subsets D1 and D2, the gini index of the split is defined as giniA(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

  35. Gini index • Reduction in impurity: Δgini(A) = gini(D) - giniA(D) • The attribute that provides the smallest giniA(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node
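
A minimal sketch of both formulas, using made-up class counts rather than numbers from the slides:

    def gini(counts):
        """gini(D) = 1 - sum(pj^2), with pj the relative frequency of class j in D."""
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def gini_split(counts_d1, counts_d2):
        """Weighted gini index giniA(D) of a binary split of D into D1 and D2."""
        n1, n2 = sum(counts_d1), sum(counts_d2)
        n = n1 + n2
        return n1 / n * gini(counts_d1) + n2 / n * gini(counts_d2)

    parent = [7, 5]              # e.g. 7 positive and 5 negative examples (made up)
    split  = ([6, 1], [1, 4])    # class counts of a candidate binary split D1, D2
    print(f"gini(D)               = {gini(parent):.3f}")
    print(f"reduction in impurity = {gini(parent) - gini_split(*split):.3f}")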

  36. Comparing Attribute Selection Measures • The two measures generally return good results, but • Both are biased towards multivalued attributes • Gini may have difficulty when the number of classes is large • Gini tends to favor tests that result in equal-sized partitions and purity in both partitions

  37. Is minimizing impurity / maximizing Δ enough? • Information gain favors attributes with a large number of values • A test condition with a large number of outcomes may not be desirable: the number of records in each partition becomes too small to make reliable predictions

  38. Gain ratio • A modification of information gain that reduces its bias towards high-branching attributes • Gain ratio should be • Large when data is evenly spread across branches • Small when all data belongs to one branch • Takes the number and size of branches into account when choosing an attribute • It corrects Δ by taking the intrinsic information of a split into account (i.e. how much information we need to tell which branch an instance belongs to)

  39. Gain ratio • GainRatio = Δ / SplitInfo • SplitInfo = -Σi=1..k P(vi) log(P(vi)) • k: total number of splits (branches) • P(vi): probability that an instance belongs to branch vi • If each branch holds the same number of records, P(vi) = 1/k and SplitInfo = log k • A large number of splits → large SplitInfo → small gain ratio
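
A small sketch that makes the effect of many branches concrete, for one hypothetical gain value (0.4) and two hypothetical splits of 14 records:

    from math import log2

    def split_info(partition_sizes):
        """SplitInfo = -sum P(vi) log2 P(vi), with P(vi) the fraction of records in branch vi."""
        n = sum(partition_sizes)
        return -sum(s / n * log2(s / n) for s in partition_sizes if s)

    def gain_ratio(gain, partition_sizes):
        return gain / split_info(partition_sizes)

    print(gain_ratio(0.4, [7, 7]))     # 2 equal branches: SplitInfo = 1, gain ratio = 0.4
    print(gain_ratio(0.4, [1] * 14))   # 14 singleton branches: SplitInfo = log2(14), much smaller gain ratio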

  40. Decision boundary for decision trees • The border line between two neighboring regions of different classes is known as the decision boundary • The decision boundary of a decision tree is parallel to the axes because each test condition involves a single attribute at a time

  41. Oblique Decision Trees [figure: a dataset whose classes are best separated by an oblique, non-axis-parallel boundary] • Not all datasets can be partitioned optimally using test conditions on single attributes!

  42. Oblique Decision Trees • Test on multiple attributes, e.g. if x + y < 1 then class = red • Not all datasets can be partitioned optimally using test conditions on single attributes!

  43. Oblique Decision Trees • 500 circular and 500 triangular data points • Circular points: 0.5 ≤ sqrt(x1^2 + x2^2) ≤ 1 • Triangular points: sqrt(x1^2 + x2^2) > 1 or sqrt(x1^2 + x2^2) < 0.5
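
A sketch that generates data matching this description; the exact sampling scheme used for the slide is unknown, so uniform rejection sampling over the square [0, 1] x [0, 1] is assumed here:

    import random

    random.seed(0)

    def radius(x1, x2):
        return (x1 ** 2 + x2 ** 2) ** 0.5

    def sample(region, n):
        """Rejection-sample n points uniformly from [0, 1] x [0, 1] that satisfy `region`."""
        points = []
        while len(points) < n:
            x1, x2 = random.random(), random.random()
            if region(x1, x2):
                points.append((x1, x2))
        return points

    circular   = sample(lambda x1, x2: 0.5 <= radius(x1, x2) <= 1, 500)
    triangular = sample(lambda x1, x2: radius(x1, x2) > 1 or radius(x1, x2) < 0.5, 500)
    print(len(circular), len(triangular))   # 500 500

The boundary between the two classes is circular, which is exactly the situation where single-attribute (axis-parallel) tests struggle.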

  44. Overfitting due to noise [figure: a training set containing a mislabeled (noise) point] • The decision boundary is distorted by the noise point

  45. Overfitting due to insufficient samples [figure: training samples from class 1 (x) and class 2, together with test samples (o); the lower half of the plot contains no training data] • Why?

  46. Overfitting due to insufficient samples [same figure as the previous slide] • Lack of data points in the lower half of the diagram makes it difficult to predict the class labels of that region correctly • The insufficient number of training records in that region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task

  47. Overfitting and Tree Pruning • Overfitting: An induced tree may overfit the training data • Too many branches, some may reflect anomalies due to noise or outliers • Poor accuracy for unseen samples • Two approaches to avoid overfitting

  48. Overfitting and Tree Pruning • Two approaches to avoid overfitting • Prepruning: Halt tree construction early • do not split a node if this would result in the goodness measure falling below a threshold • Difficult to choose an appropriate threshold • Postpruning: Remove branches from a “fully grown” tree • get a sequence of progressively pruned trees
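
A minimal postpruning sketch, assuming scikit-learn's cost-complexity pruning as one concrete way to obtain "a sequence of progressively pruned trees" (the slide does not prescribe a specific method):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=400, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    # Each ccp_alpha along the pruning path yields a progressively smaller tree;
    # the tree that does best on held-out data is kept.
    path  = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
    trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
             for a in path.ccp_alphas if a >= 0]
    best  = max(trees, key=lambda t: t.score(X_val, y_val))
    print(f"pruned tree: {best.get_n_leaves()} leaves, validation accuracy {best.score(X_val, y_val):.2%}")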

  49. Pros and Cons of decision trees • Cons • Cannot handle complicated relationships between features • Only simple, axis-parallel decision boundaries • Problems with lots of missing data • Finding an optimal tree is NP-complete, so construction relies on greedy heuristics • Pros • Reasonable training time • Fast application • Easy to interpret • Easy to implement • Can handle a large number of features

  50. Some well-known decision tree learning implementations • CART: Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Wadsworth • ID3: Quinlan JR (1986) Induction of decision trees. Machine Learning 1:81–106 • C4.5: Quinlan JR (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann • J48: implementation of C4.5 in WEKA
