
Basics of Data Mining Classification Techniques

This content discusses the fundamental concepts of data mining classification, covering topics such as the definition of classification, the importance of classification models in predictive and descriptive analysis, and examples of classification tasks. It explains the general framework for building classification models, including learning algorithms, induction, deduction, and evaluation metrics. Various classification techniques like decision tree, rule-based methods, nearest-neighbor, Naïve Bayes, support vector machines, and neural networks are also highlighted.


Presentation Transcript


  1. Data Mining Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr http://www3.yildiz.edu.tr/~naydin 1

  2. Data Mining Classification Basic Concepts and Techniques • Outline – Basic Concepts – General Framework for Classification – Decision Tree Classifier – Characteristics of Decision Tree Classifiers – Model Overfitting – Model Selection – Model Evaluation – Presence of Hyper-parameters – Pitfalls of Model Selection and Evaluation – Model Comparison 2

  3. Classification: Definition • Given a collection of records (training set) – Each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label • x: attribute, predictor, independent variable, input • y: class, response, dependent variable, output • Classification task: – Learn a model that maps each attribute set x into one of the predefined class labels y – A classification model is an abstract representation of the relationship between the attribute set and the class label.

  4. Classification: Definition • A classification model serves two important roles in data mining: – it is used as a predictive model to classify previously unlabeled instances • A good classification model must provide accurate predictions with a fast response time – it serves as a descriptive model to identify the characteristics that distinguish instances from different classes • Useful for critical applications, such as medical diagnosis, where it is insufficient to have a model that makes a prediction without justifying how it reaches such a decision 4

  5. Examples of Classification Task
  • Categorizing email messages (spam filtering) – Attribute set x: features extracted from the email message header and content – Class label y: spam or non-spam (binary)
  • Identifying tumor cells – Attribute set x: features extracted from x-rays or MRI scans – Class label y: malignant or benign cells (binary)
  • Cataloging galaxies – Attribute set x: features extracted from telescope images – Class label y: elliptical, spiral, or irregular-shaped galaxies (multiclass)

  6. Example-Vertebrate Classification • A sample data set for classifying vertebrates into mammals, reptiles, birds, fishes, and amphibians. – The attribute set includes characteristics of the vertebrate such as its body temperature, skin cover, and ability to fly. 6

  7. Example-Loan Borrower Classification • A sample data set for the problem of predicting whether a loan borrower will repay the loan or default on the loan payments. – The attribute set includes personal information of the borrower such as marital status and annual income, while the class label indicates whether the borrower had defaulted on the loan payments. 7

  8. General Framework for Building Classification Model • Classification is the task of assigning labels to unlabeled data instances – a classifier is used to perform such a task • The model is created from a given set of instances, known as the training set, which contains attribute values as well as class labels for each instance. • Learning algorithm – the systematic approach for learning a classification model given a training set

  9. General Framework for Building Classification Model • Induction – The process of using a learning algorithm to build a classification model from the training data. – AKA learning a model or building a model. • Deduction – Process of applying a classification model on unseen test instances to predict their class labels. • Process of classification involves two steps: – applying a learning algorithm to training data to learn a model, – applying the model to assign labels to unlabeled instances • A classification technique refers to a general approach to classification 9

  10. General Framework for Building Classification Model • the induction and deduction steps should be performed separately. • the training and test sets should be independent of each other to ensure that the induced model can accurately predict the class labels of instances it has never encountered before. 10

  11. General Framework for Building Classification Model • Models that deliver such predictive insights are said to have good generalization performance. • The performance of a model (classifier) can be evaluated by comparing the predicted labels against the true labels of instances. • This information can be summarized in a table called a confusion matrix. – Each entry f_ij denotes the number of instances from class i predicted to be of class j. • Number of correct predictions: f11 + f00 • Number of incorrect predictions: f10 + f01

  12. Classification Performance Evaluation Metrics • The learning algorithms of most classification techniques are designed to learn models that attain the highest accuracy, or equivalently, the lowest error rate when applied to the test set 12
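For concreteness, here is a minimal Python sketch (with made-up labels rather than data from the slides) of how the confusion-matrix entries f_ij and the resulting accuracy and error rate can be computed:

```python
# Build a 2-class confusion matrix f[(i, j)] = number of class-i instances
# predicted as class j, then derive accuracy and error rate from it.
from collections import Counter

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # illustrative true labels (not from the slides)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # illustrative predicted labels

f = Counter(zip(y_true, y_pred))
correct   = f[(1, 1)] + f[(0, 0)]   # f11 + f00
incorrect = f[(1, 0)] + f[(0, 1)]   # f10 + f01
accuracy   = correct / (correct + incorrect)
error_rate = incorrect / (correct + incorrect)
print(f"accuracy = {accuracy:.3f}, error rate = {error_rate:.3f}")
```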

  13. Classification Techniques • Base Classifiers – Decision Tree based Methods – Rule-based Methods – Nearest-neighbor – Naïve Bayes and Bayesian Belief Networks – Support Vector Machines – Neural Networks, Deep Neural Nets • Ensemble Classifiers – Boosting, Bagging, Random Forests 13

  14. Decision Tree Classifier • solve a classification problem by asking a series of carefully crafted questions about the attributes of the test instance. • The series of questions and their possible answers can be organized into a hierarchical structure called a decision tree • The tree has three types of nodes: – A root node, • with no incoming links and zero or more outgoing links – Internal nodes, • each of which has exactly one incoming link and two or more outgoing links. – Leaf or terminal nodes, • each of which has exactly one incoming link and no outgoing links. 14

  15. Decision Tree Classifier • Every leaf node in the decision tree is associated with a class label • The non-terminal nodes, which include the root and internal nodes, contain attribute test conditions that are typically defined using a single attribute. • Each possible outcome of the attribute test condition is associated with exactly one child of this node. 15

  16. Decision Tree - Example • A decision tree for the mammal classification problem – the root node of the tree here uses the attribute Body Temperature to define an attribute test condition that has two outcomes, warm and cold, resulting in two child nodes. 16
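As a concrete illustration of the node types from the previous slide, the following Python sketch represents the mammal-classification tree as a small data structure; the Leaf/Internal names are illustrative, and the follow-up test on the warm-blooded branch (Gives Birth) is an assumption, since the slide only describes the Body Temperature test at the root.

```python
# Minimal sketch of decision tree nodes: internal nodes hold an attribute test
# condition, and each outcome maps to exactly one child; leaves hold a class label.
from dataclasses import dataclass, field
from typing import Dict, Union

@dataclass
class Leaf:
    label: str                                  # class label assigned at this leaf

@dataclass
class Internal:
    attribute: str                              # attribute used by the test condition
    children: Dict[str, Union["Internal", Leaf]] = field(default_factory=dict)

# Root uses Body Temperature (warm / cold), as on the slide; the warm branch's
# second test is an assumed continuation for illustration only.
root = Internal("Body Temperature", {
    "cold": Leaf("Non-mammals"),
    "warm": Internal("Gives Birth", {
        "yes": Leaf("Mammals"),
        "no":  Leaf("Non-mammals"),
    }),
})
```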

  17. Decision Tree - Example • Classifying an unlabelled vertebrate. – The dashed lines represent the outcomes of applying various attribute test conditions on the unlabeled vertebrate. – The vertebrate is eventually assigned to the Non- mammals class. 17

  18. Decision Tree - Example
  Training data (loan borrower data set):
  ID | Home Owner | Marital Status | Annual Income | Defaulted Borrower
  1 | Yes | Single | 125K | No
  2 | No | Married | 100K | No
  3 | No | Single | 70K | No
  4 | Yes | Married | 120K | No
  5 | No | Divorced | 95K | Yes
  6 | No | Married | 60K | No
  7 | Yes | Divorced | 220K | No
  8 | No | Single | 85K | Yes
  9 | No | Married | 75K | No
  10 | No | Single | 90K | Yes
  Model (decision tree; splitting attributes: Home Owner, Marital Status, Annual Income):
  – Home Owner? Yes → Defaulted = NO; No → test MarSt
  – MarSt? Married → Defaulted = NO; Single, Divorced → test Income
  – Income? < 80K → Defaulted = NO; > 80K → Defaulted = YES
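The induction step for this data set can be sketched with scikit-learn's DecisionTreeClassifier (a CART-style learner); this is a minimal illustration, and the tree it grows from such a small sample is not guaranteed to match the tree drawn on the slide.

```python
# Fit a Gini-based decision tree to the loan borrower training data and use it
# to classify one unseen borrower (the deduction step).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "HomeOwner":     ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MaritalStatus": ["Single", "Married", "Single", "Married", "Divorced",
                      "Married", "Divorced", "Single", "Married", "Single"],
    "AnnualIncome":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],  # in thousands
    "Defaulted":     ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# One-hot encode the categorical attributes; the income stays numeric.
X = pd.get_dummies(train[["HomeOwner", "MaritalStatus", "AnnualIncome"]])
y = train["Defaulted"]

clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# Unseen borrower: Home Owner = No, Married, Annual Income = 80K
test = pd.get_dummies(pd.DataFrame(
    {"HomeOwner": ["No"], "MaritalStatus": ["Married"], "AnnualIncome": [80]}
)).reindex(columns=X.columns, fill_value=0)
print(clf.predict(test))   # predicted Defaulted label for the test record
```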

  19.–24. Apply Model to Test Data
  Test data (one unlabeled record): Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted Borrower = ?
  Start from the root of the tree and follow the outcome of each attribute test condition:
  – Home Owner = No → follow the No branch to the MarSt node
  – Marital Status = Married → follow the Married branch to a leaf node
  – Assign Defaulted to "No"
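The traversal carried out on these slides can also be written as a hand-coded function over the slide's tree; this is a sketch of the deduction step for this one model, not a general-purpose classifier.

```python
# Walk the loan-default decision tree from the slides for a single borrower.
def classify_borrower(home_owner: str, marital_status: str, annual_income_k: float) -> str:
    """Return the predicted Defaulted label ('Yes' or 'No')."""
    if home_owner == "Yes":
        return "No"                      # Home Owner = Yes leaf
    if marital_status == "Married":
        return "No"                      # MarSt = Married leaf
    # Single or Divorced: test Annual Income
    return "No" if annual_income_k < 80 else "Yes"

# The unlabeled test record from the slides: No, Married, 80K
print(classify_borrower("No", "Married", 80))   # -> 'No'
```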

  25. Another Example of Decision Tree
  An alternative tree fitted to the same loan borrower training data:
  – MarSt? Married → Defaulted = NO; Single, Divorced → test Home Owner
  – Home Owner? Yes → Defaulted = NO; No → test Income
  – Income? < 80K → Defaulted = NO; > 80K → Defaulted = YES
  There could be more than one tree that fits the same data!

  26. Decision Tree Classification Task
  Training Set (Tid 1–10: Attrib1, Attrib2, Attrib3, and a known Class label) → Tree Induction algorithm → Induction (Learn Model) → Model (Decision Tree)
  Test Set (Tid 11–15: attribute values only, Class = ?) → Apply Model → Deduction (predicted class labels)

  27. Decision Tree Induction • Many Algorithms: – Hunt’s Algorithm (one of the earliest) – CART – ID3, C4.5 – SLIQ, SPRINT • These algorithms employ a greedy strategy to grow the decision tree in a top-down fashion – by making a series of locally optimal decisions about which attribute to use when partitioning the training data

  28. General Structure of Hunt’s Algorithm • Let D_t be the set of training records that reach a node t • General Procedure: – If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t – If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets – Recursively apply the procedure to each subset • (Illustrated on the loan borrower training data, with D_t the set of records reaching the node currently being split.)
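The procedure above can be sketched in a few lines of Python; for illustration it splits on attributes in a fixed, naive order, since choosing the best attribute test is the subject of the impurity measures discussed later in the deck.

```python
# Minimal sketch of Hunt's recursive algorithm over records given as dicts of
# attribute values; the attribute order is naive and only for illustration.
from collections import Counter

def hunt(records, labels, attributes):
    """Grow a tree node from the training records D_t that reach it."""
    class_counts = Counter(labels)
    majority = class_counts.most_common(1)[0][0]
    # Stop: all records belong to the same class, or no attributes left to test
    if len(class_counts) == 1 or not attributes:
        return {"leaf": majority}
    attr = attributes[0]                     # naive choice of attribute test
    node = {"test": attr, "children": {}}
    for value in sorted({r[attr] for r in records}):
        subset = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        sub_r, sub_y = zip(*subset)
        node["children"][value] = hunt(list(sub_r), list(sub_y), attributes[1:])
    return node

# Toy usage on two attributes (an assumed toy encoding, not the full slide table)
records = [{"HomeOwner": "Yes", "MarSt": "Single"},
           {"HomeOwner": "No",  "MarSt": "Married"},
           {"HomeOwner": "No",  "MarSt": "Single"}]
labels = ["No", "No", "Yes"]
print(hunt(records, labels, ["HomeOwner", "MarSt"]))
```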

  29. Hunt’s Algorithm – The figure traces Hunt’s algorithm on the loan borrower training data; each node is annotated with its class distribution (number of Defaulted = No, number of Defaulted = Yes), e.g. (7,3) at the root, (3,0) and (4,3) after splitting on Home Owner, and (3,0), (1,3), (1,0), (0,3) at deeper nodes.

  30. Hunt’s Algorithm – The figure shows the tree being grown in stages: (a) a single leaf, Defaulted = No (7,3); (b) split on Home Owner: Yes → Defaulted = No (3,0), No → Defaulted = No (4,3); (c) the No branch further split on Marital Status: Married → Defaulted = No (3,0), Single/Divorced → Defaulted = Yes (1,3); (d) the Single/Divorced branch further split on Annual Income: < 80K → Defaulted = No (1,0), >= 80K → Defaulted = Yes (0,3).

  31. Design Issues of Decision Tree Induction • How should training records be split? – Method for expressing test condition • depending on attribute types – Measure for evaluating the goodness of a test condition • How should the splitting procedure stop? – Stop splitting if all the records belong to the same class or have identical attribute values – Early termination 31

  32. Methods for Expressing Test Conditions • Decision tree induction algorithms must provide a method for expressing an attribute test condition and its corresponding outcomes for different attribute types – Binary Attributes – Nominal Attributes – Ordinal Attributes – Continuous Attributes 32

  33. Test Condition for Binary Attributes • generates two potential outcomes 33

  34. Test Condition for Nominal Attributes • Multi-way split: – Use as many partitions as distinct values, e.g. Marital Status → Single | Divorced | Married • Binary split: – Divides values into two subsets, e.g. {Married} vs {Single, Divorced}, {Single} vs {Married, Divorced}, or {Single, Married} vs {Divorced}

  35. Test Condition for Ordinal Attributes • Multi-way split: – Use as many partitions as distinct values, e.g. Shirt Size → Small | Medium | Large | Extra Large • Binary split: – Divides values into two subsets, e.g. {Small, Medium} vs {Large, Extra Large}, or {Small} vs {Medium, Large, Extra Large} – Some decision tree algorithms, such as CART (Classification & Regression Trees), produce only binary splits by considering all 2^(k−1) − 1 ways of creating a binary partition of k attribute values – The split must preserve the order property among attribute values; a grouping such as {Small, Large} vs {Medium, Extra Large} violates it
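A small sketch of the counting argument: of the 2^(k−1) − 1 binary partitions of k ordinal values, only the k − 1 prefix-style groupings preserve the order property (the values here are the slide's shirt sizes).

```python
# Enumerate the order-preserving binary splits of an ordinal attribute.
values = ["Small", "Medium", "Large", "Extra Large"]     # ordered values

k = len(values)
print("binary partitions in total:", 2 ** (k - 1) - 1)   # 7 for k = 4
order_preserving = [(values[:i], values[i:]) for i in range(1, k)]
for left, right in order_preserving:
    print(left, "vs", right)
# ['Small'] vs ['Medium', 'Large', 'Extra Large']
# ['Small', 'Medium'] vs ['Large', 'Extra Large']
# ['Small', 'Medium', 'Large'] vs ['Extra Large']
```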

  36. Test Condition for Continuous Attributes • For continuous attributes, the attribute test condition can be expressed as a comparison test (e.g., A < v) producing a binary split, or as a range query of the form v_i ≤ A < v_{i+1}, for i = 1, . . . , k, producing a multi-way split – (i) Binary split: Annual Income > 80K? → Yes / No – (ii) Multi-way split: Annual Income? → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K

  37. Splitting Based on Continuous Attributes • Different ways of handling – Discretization to form an ordinal categorical attribute • Ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering. – Static • discretize once at the beginning – Dynamic • repeat at each node • Binary Decision: (A < v) or (A ≥ v) – consider all possible splits and find the best cut – can be more compute intensive
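A minimal sketch of the binary-decision approach on the loan data's Annual Income attribute: sort the records, take the midpoint between each pair of adjacent distinct values as a candidate cut v, and keep the cut (A < v vs A ≥ v) with the lowest weighted Gini index. This naive version recomputes the impurity from scratch for every candidate, which is what makes the approach compute intensive.

```python
# Exhaustive search for the best binary cut point on a continuous attribute.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_cut(values, labels):
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    best_v, best_impurity = None, float("inf")
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                       # no cut between identical values
        v = (xs[i] + xs[i - 1]) / 2        # candidate threshold
        left, right = ys[:i], ys[i:]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if weighted < best_impurity:
            best_v, best_impurity = v, weighted
    return best_v, best_impurity

incomes   = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]      # in thousands
defaulted = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_cut(incomes, defaulted))   # chosen threshold and its weighted Gini
```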

  38. How to determine the Best Split • There are many measures that can be used to determine the goodness of an attribute test condition. – These measures try to give preference to attribute test conditions that partition the training instances into purer subsets in the child nodes, • which mostly have the same class labels. • Having purer nodes is useful since a node that has all of its training instances from the same class does not need to be expanded further. 38

  39. How to determine the Best Split • Before Splitting: – 10 records of class 0, – 10 records of class 1 • Which test condition is the best? – Split on Gender: two children with class distributions (C0: 6, C1: 4) and (C0: 4, C1: 6) – Split on Car Type (Family, Sports, Luxury): children with distributions (C0: 1, C1: 3), (C0: 8, C1: 0), and (C0: 1, C1: 7) – Split on Customer ID (c1 ... c20): 20 children, each containing a single record, e.g. (C0: 1, C1: 0) or (C0: 0, C1: 1)

  40. How to determine the Best Split • Greedy approach: – Nodes with purer class distribution are preferred • Need a measure of node impurity: C0: 5 C1: 5 C0: 9 C1: 1 High degree of impurity Low degree of impurity 40

  41. Measures of Node Impurity • Entropy: $\mathrm{Entropy}(t) = -\sum_{i=0}^{c-1} p_i(t)\,\log_2 p_i(t)$ – where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes • Gini Index: $\mathrm{Gini}(t) = 1 - \sum_{i=0}^{c-1} p_i(t)^2$ • Misclassification error: $\mathrm{Error}(t) = 1 - \max_i\,[p_i(t)]$
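These three measures translate directly into a few lines of Python; here p is the list of class frequencies p_i(t) at a node (assumed to sum to 1).

```python
# Impurity measures for a single node, given its class frequency distribution.
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def gini(p):
    return 1.0 - sum(pi ** 2 for pi in p)

def classification_error(p):
    return 1.0 - max(p)

# A pure node has zero impurity; a balanced two-class node is maximally impure.
for dist in [(1.0, 0.0), (0.5, 0.5), (0.9, 0.1)]:
    print(dist, round(entropy(dist), 3), round(gini(dist), 3),
          round(classification_error(dist), 3))
```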

  42. Measures of Node Impurity • All three measures give a zero impurity value if a node contains instances from a single class, and maximum impurity if the node has an equal proportion of instances from multiple classes. • Relative magnitude of the impurity measures when applied to binary classification problems. – Since there are only two classes, $p_0(t) + p_1(t) = 1$.

  43. Measures of Node Impurity • The following examples illustrate how the values of the impurity measures vary as we alter the class distribution. 43

  44. Collective Impurity of Child Nodes • Consider an attribute test condition that splits a node containing N training instances into k children, {v1, v2, · · · , vk}, where every child node represents a partition of the data resulting from one of the k outcomes of the attribute test condition 44

  45. Finding the Best Split 1. Compute impurity measure (P) before splitting 2. Compute impurity measure (M) after splitting – Compute impurity measure of each child node – M is the weighted impurity of child nodes 3. Choose the attribute test condition that produces the highest gain Gain = P - M or equivalently, lowest impurity measure after splitting (M) 45
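A minimal sketch of this gain computation, using the Gini index as the impurity measure; the counts reproduce the worked example on the "Binary Attributes: Computing GINI Index" slide later in the deck.

```python
# Gain of a candidate split: parent impurity P minus weighted child impurity M.
def gini_from_counts(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gain(parent_counts, children_counts):
    n = sum(parent_counts)
    P = gini_from_counts(parent_counts)
    M = sum(sum(child) / n * gini_from_counts(child) for child in children_counts)
    return P - M

# Parent with 7 / 5 records of the two classes, split into children (5,1) and (2,4)
print(round(gain([7, 5], [[5, 1], [2, 4]]), 3))   # ≈ 0.125
```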

  46. Measure of Impurity: GINI • Gini Index for a given node $t$: $\mathrm{Gini}(t) = 1 - \sum_{i=0}^{c-1} p_i(t)^2$ • where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes – Maximum of $1 - 1/c$ when records are equally distributed among all classes, implying the least beneficial situation for classification – Minimum of 0 when all records belong to one class, implying the most beneficial situation for classification – Gini index is used in decision tree algorithms such as • CART (Classification & Regression Trees) • SLIQ (Supervised Learning in Quest) • SPRINT (Scalable Parallelizable Induction of Decision Tree)

  47. Measure of Impurity: GINI • Gini Index for a given node $t$: $\mathrm{Gini}(t) = 1 - \sum_{i=0}^{c-1} p_i(t)^2$ – For a 2-class problem with class frequencies $(p, 1-p)$: Gini $= 1 - p^2 - (1-p)^2 = 2p(1-p)$ – Examples: (C1: 0, C2: 6) → Gini = 0.000; (C1: 1, C2: 5) → Gini = 0.278; (C1: 2, C2: 4) → Gini = 0.444; (C1: 3, C2: 3) → Gini = 0.500

  48. Computing Gini Index of a Single Node • Gini Index for a given node $t$: $\mathrm{Gini}(t) = 1 - \sum_{i=0}^{c-1} p_i(t)^2$ – (C1: 0, C2: 6): P(C1) = 0/6 = 0, P(C2) = 6/6 = 1, Gini = 1 − 0 − 1 = 0 – (C1: 1, C2: 5): P(C1) = 1/6, P(C2) = 5/6, Gini = 1 − (1/6)^2 − (5/6)^2 = 0.278 – (C1: 2, C2: 4): P(C1) = 2/6, P(C2) = 4/6, Gini = 1 − (2/6)^2 − (4/6)^2 = 0.444

  49. Computing Gini Index of a Collection of Nodes • When a node $p$ is split into $k$ partitions (children): $\mathrm{Gini}_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\,\mathrm{Gini}(i)$ – where $n_i$ = number of records at child $i$, and $n$ = number of records at parent node $p$.

  50. Binary Attributes: Computing GINI Index • Splits into two partitions (child nodes) • Effect of weighing partitions: – Larger and purer partitions are sought • Example: parent node with class counts C1 = 7, C2 = 5, Gini = 0.486, split by attribute B into nodes N1 and N2 – N1: (C1: 5, C2: 1), Gini(N1) = 1 − (5/6)^2 − (1/6)^2 = 0.278 – N2: (C1: 2, C2: 4), Gini(N2) = 1 − (2/6)^2 − (4/6)^2 = 0.444 – Weighted Gini of N1, N2 = 6/12 × 0.278 + 6/12 × 0.444 = 0.361 – Gain = 0.486 − 0.361 = 0.125
