
Classification



  1. Classification

  2. Classification task
• Input: a training set of tuples, each labeled with one class label
• Output: a model (classifier) that assigns a class label to each tuple based on the other attributes
• The model can be used to predict the class of new tuples, for which the class label is missing or unknown

  3. What is Classification
• Data classification is a two-step process:
• first step: a model is built describing a predetermined set of data classes or concepts
• second step: the model is used for classification
• Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute
• Data tuples are also referred to as samples, examples, or objects

  4. Train and test
• The tuples (examples, samples) are divided into a training set and a test set
• The classification model is built in two steps:
• training: build the model from the training set
• test: check the accuracy of the model using the test set

  5. Train and test
• Kinds of models:
• if-then rules
• logical formulae
• decision trees
• Accuracy of models:
• the known class of each test sample is matched against the class predicted by the model
• accuracy rate = % of test set samples correctly classified by the model (see the sketch below)
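
As a concrete illustration of the train-and-test procedure, here is a minimal sketch using scikit-learn, with the Iris data standing in for the labeled tuples (the library and dataset are my choice, not part of the slides):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# divide the tuples into a training set and a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# training step: build the model (here a decision tree) from the training set
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# test step: accuracy rate = % of test set samples correctly classified
print("accuracy rate:", accuracy_score(y_test, model.predict(X_test)))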

  6. Training step
The training data is fed to a classification algorithm, which produces the classifier (model), e.g. the rule: if age < 31 or Car Type = Sports then Risk = High

  7. Test step
The classifier (model) is applied to the test data.

  8. Classification (prediction)
The classifier (model) is applied to new data to predict the missing class labels.

  9. Classification vs. Prediction
• There are two forms of data analysis that can be used to extract models describing data classes or to predict future data trends:
• classification: predicts categorical class labels
• prediction: models continuous-valued functions

  10. Comparing Classification Methods (1)
• Predictive accuracy: the ability of the model to correctly predict the class label of new or previously unseen data
• Speed: the computation cost involved in generating and using the model
• Robustness: the ability of the model to make correct predictions given noisy data or data with missing values

  11. Comparing Classification Methods (2)
• Scalability: the ability to construct the model efficiently given large amounts of data
• Interpretability: the level of understanding and insight provided by the model
• Simplicity:
• decision tree size
• rule compactness
• Domain-dependent quality indicators

  12. Problem formulation
Given records in the database with a class label, find a model for each class. Example tree: test Age < 31; if yes, Risk = High; if no, test whether Car Type is sports (yes: Risk = High, no: Risk = Low).

  13. Classification techniques
• Decision Tree Classification
• Bayesian Classifiers
• Neural Networks
• Statistical Analysis
• Genetic Algorithms
• Rough Set Approach
• k-nearest neighbor classifiers

  14. Classification by Decision Tree Induction
• A decision tree is a tree structure, where
• each internal node denotes a test on an attribute,
• each branch represents the outcome of the test,
• leaf nodes represent classes or class distributions
• Example (the tree from slide 12): Age < 31 → Risk = High; otherwise, Car Type is sports → Risk = High, else Risk = Low (a minimal encoding of such a tree is sketched below)
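
A minimal sketch of this tree structure in Python; the Node fields and the example tree encoding are assumptions, not from the slides:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    test: Optional[str] = None     # e.g. "Age < 31"; None for a leaf
    yes: Optional["Node"] = None   # branch taken when the test succeeds
    no: Optional["Node"] = None    # branch taken when the test fails
    label: Optional[str] = None    # class label stored at a leaf

# the example tree from this slide
tree = Node(test="Age < 31",
            yes=Node(label="High"),
            no=Node(test="Car Type is sports",
                    yes=Node(label="High"),
                    no=Node(label="Low")))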

  15. Decision Tree Induction (1)
• A decision tree is a class discriminator that recursively partitions the training set until each partition consists entirely or dominantly of examples from one class
• Each non-leaf node of the tree contains a split point, which is a test on one or more attributes and determines how the data is partitioned

  16. Decision Tree Induction (2)
• Basic algorithm: a greedy algorithm that constructs decision trees in a top-down, recursive, divide-and-conquer manner
• Many variants:
• from machine learning (ID3, C4.5)
• from statistics (CART)
• from pattern recognition (CHAID)
• Main difference: the split criterion

  17. Decision Tree Induction (3)
• The algorithm consists of two phases:
• build an initial tree from the training data such that each leaf node is pure
• prune this tree to increase its accuracy on test data

  18. Tree Building
• In the growth phase the tree is built by recursively partitioning the data until each partition is either "pure" (contains members of the same class) or sufficiently small
• The form of the split used to partition the data depends on the type of the attribute used in the split:
• for a continuous attribute A, splits are of the form value(A) < x, where x is a value in the domain of A
• for a categorical attribute A, splits are of the form value(A) ∈ X, where X ⊆ domain(A)

  19. Tree Building Algorithm
MakeTree(Training Data T)
{
  Partition(T)
}

Partition(Data S)
{
  if (all points in S are in the same class) then return
  for each attribute A do
    evaluate splits on attribute A
  use the best split found to partition S into S1 and S2
  Partition(S1)
  Partition(S2)
}
A runnable sketch of this procedure follows.
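
A runnable Python sketch of MakeTree/Partition, restricted to numeric attributes, with records given as (attribute dictionary, class label) pairs and the Gini index (introduced on the following slides) as the splitting index; the record representation and sample data are assumptions:

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(records):
    # return (score, attribute, threshold, S1, S2) with the lowest gini_SPLIT, or None
    best = None
    for attr in records[0][0]:
        for threshold in sorted({attrs[attr] for attrs, _ in records})[:-1]:
            s1 = [r for r in records if r[0][attr] <= threshold]
            s2 = [r for r in records if r[0][attr] > threshold]
            score = (len(s1) * gini([lbl for _, lbl in s1]) +
                     len(s2) * gini([lbl for _, lbl in s2])) / len(records)
            if best is None or score < best[0]:
                best = (score, attr, threshold, s1, s2)
    return best

def partition(records):
    labels = [lbl for _, lbl in records]
    if len(set(labels)) == 1:          # all points in S are in the same class
        return {"class": labels[0]}
    split = best_split(records)
    if split is None:                  # no further split possible
        return {"class": max(set(labels), key=labels.count)}
    _, attr, threshold, s1, s2 = split
    return {"test": f"{attr} <= {threshold}",
            "yes": partition(s1), "no": partition(s2)}

# hypothetical training data, not the example set used later in the slides
train = [({"Age": 23, "Income": 30}, "High"), ({"Age": 27, "Income": 50}, "High"),
         ({"Age": 40, "Income": 80}, "Low"),  ({"Age": 55, "Income": 40}, "Low")]
print(partition(train))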

  20. Tree Building Algorithm
• While growing the tree, the goal at each node is to determine the split point that "best" divides the training records belonging to that leaf
• To evaluate the goodness of a split, several splitting indices have been proposed

  21. Split Criteria
• Gini index (CART, SPRINT)
• select the attribute that minimizes the impurity of a split
• Information gain (ID3, C4.5)
• the impurity of a split is measured by entropy
• select the attribute that maximizes the entropy reduction
• χ² contingency table statistic (CHAID)
• measures the correlation between each attribute and the class label
• select the attribute with maximal correlation

  22. Gini index (1)
Given a sample training set in which each record represents a car-insurance applicant, we want to build a model of what makes an applicant a high or low insurance risk. The classifier (model) built from the training set can be used to screen future insurance applicants by classifying them into the High or Low risk categories.

  23. Gini index (2)
SPRINT algorithm:
Partition(Data S)
{
  if (all points in S are of the same class) then return
  for each attribute A do
    evaluate splits on attribute A
  use the best split found to partition S into S1 and S2
  Partition(S1)
  Partition(S2)
}
Initial call: Partition(Training Data)

  24. Gini index (3)
• Definition: gini(S) = 1 − Σj pj²
where:
• S is a data set containing examples from n classes
• pj is the relative frequency of class j in S
• E.g. for two classes, Pos and Neg, and a dataset S with p Pos-elements and n Neg-elements:
ppos = p/(p+n), pneg = n/(p+n)
gini(S) = 1 − ppos² − pneg²

  25. Gini index (4)
• If dataset S is split into S1 and S2, then the splitting index is defined as follows:
giniSPLIT(S) = (p1+n1)/(p+n) · gini(S1) + (p2+n2)/(p+n) · gini(S2)
where p1, n1 (respectively p2, n2) denote the numbers of Pos- and Neg-elements in S1 (respectively S2)
• With this definition, the "best" split point is the one with the lowest value of the giniSPLIT index (see the sketch below)
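
A minimal sketch of gini and giniSPLIT for the two-class (Pos/Neg) case, written directly from the formulas above (the function names are my own):

def gini(p, n):
    # p Pos-elements and n Neg-elements in a dataset S
    if p + n == 0:
        return 0.0
    p_pos, p_neg = p / (p + n), n / (p + n)
    return 1.0 - p_pos ** 2 - p_neg ** 2

def gini_split(p1, n1, p2, n2):
    # S split into S1 (p1 Pos, n1 Neg) and S2 (p2 Pos, n2 Neg)
    total = p1 + n1 + p2 + n2
    return (p1 + n1) / total * gini(p1, n1) + (p2 + n2) / total * gini(p2, n2)

print(gini(3, 0), gini(2, 2))    # a pure set: 0.0; a 50/50 set: 0.5
print(gini_split(3, 0, 1, 2))    # the Age <= 23 split of the later example: 0.222...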

  26. Example (1)
[Training set table: six car-insurance records with attributes Age and Car Type and class label Risk]

  27. Example (1)
[Tables: attribute list for 'Age' and attribute list for 'Car Type']

  28. Example (2)
• Possible split points for the Age attribute are: Age ≤ 17, Age ≤ 20, Age ≤ 23, Age ≤ 32, Age ≤ 43, Age ≤ 68
• G(Age ≤ 17) = 1 − (1² + 0²) = 0
• G(Age > 17) = 1 − ((3/5)² + (2/5)²) = 1 − 13/25 = 12/25
• GSPLIT = (1/6)·0 + (5/6)·(12/25) = 2/5

  29. Example (3)
• G(Age ≤ 20) = 1 − (1² + 0²) = 0
• G(Age > 20) = 1 − ((1/2)² + (1/2)²) = 1/2
• GSPLIT = (2/6)·0 + (4/6)·(1/2) = 1/3
• G(Age ≤ 23) = 1 − (1² + 0²) = 0
• G(Age > 23) = 1 − ((1/3)² + (2/3)²) = 1 − 1/9 − 4/9 = 4/9
• GSPLIT = (3/6)·0 + (3/6)·(4/9) = 2/9

  30. Example (4)
• G(Age ≤ 32) = 1 − ((3/4)² + (1/4)²) = 1 − 10/16 = 6/16 = 3/8
• G(Age > 32) = 1 − ((1/2)² + (1/2)²) = 1/2
• GSPLIT = (4/6)·(3/8) + (2/6)·(1/2) = 1/4 + 1/6 = 5/12
The lowest value of GSPLIT is for Age ≤ 23, thus we place the split point at Age = (23+32)/2 = 27.5 (reproduced by the sketch below).
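
The split-point search above can be reproduced as follows; the exact class labels of the six records are an assumed reconstruction consistent with the counts quoted on the slides (which of the records aged 43 and 68 is High cannot be read off, and only affects the Age <= 43 row):

records = [(17, "High"), (20, "High"), (23, "High"),
           (32, "Low"), (43, "High"), (68, "Low")]   # assumed labels

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0.0

def gini_split(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

for threshold in sorted({age for age, _ in records}):
    left = [cls for age, cls in records if age <= threshold]
    right = [cls for age, cls in records if age > threshold]
    print(f"Age <= {threshold}: G_SPLIT = {gini_split(left, right):.3f}")
# lowest value at Age <= 23 (0.222), so the split point is (23 + 32) / 2 = 27.5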

  31. Example (5)
Decision tree after the first split of the example set: the root tests Age ≤ 27.5; the Age ≤ 27.5 branch is a leaf with Risk = High, the Age > 27.5 branch is labeled Risk = Low (this partition is not yet pure and will be split next).

  32. Example (6)
Attribute lists are divided at the split point: one set of attribute lists for the records with Age ≤ 27.5 and one for the records with Age > 27.5.

  33. Example (7)
Evaluating splits for categorical attributes: we have to evaluate the splitting index for each of the 2^N subset combinations, where N is the cardinality of the categorical attribute's domain.
G(Car type ∈ {sport}) = 1 − 1² − 0² = 0
G(Car type ∈ {family}) = 1 − 0² − 1² = 0
G(Car type ∈ {truck}) = 1 − 0² − 1² = 0

  34. Example (8)
G(Car type ∈ {sport, family}) = 1 − (1/2)² − (1/2)² = 1/2
G(Car type ∈ {sport, truck}) = 1/2
G(Car type ∈ {family, truck}) = 1 − 0² − 1² = 0
GSPLIT(Car type ∈ {sport}) = (1/3)·0 + (2/3)·0 = 0
GSPLIT(Car type ∈ {family}) = (1/3)·0 + (2/3)·(1/2) = 1/3
GSPLIT(Car type ∈ {truck}) = (1/3)·0 + (2/3)·(1/2) = 1/3
GSPLIT(Car type ∈ {sport, family}) = (2/3)·(1/2) + (1/3)·0 = 1/3
GSPLIT(Car type ∈ {sport, truck}) = (2/3)·(1/2) + (1/3)·0 = 1/3
GSPLIT(Car type ∈ {family, truck}) = (2/3)·0 + (1/3)·0 = 0
(The snippet below enumerates these subsets.)
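
The subset enumeration above can be reproduced as follows for the Age > 27.5 partition (the record order is assumed; gini is repeated here so the snippet runs on its own):

from itertools import combinations

partition = [("sport", "High"), ("family", "Low"), ("truck", "Low")]
values = sorted({v for v, _ in partition})

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0.0

for r in range(1, len(values)):                 # proper, non-empty subsets
    for subset in combinations(values, r):
        left = [cls for v, cls in partition if v in subset]
        right = [cls for v, cls in partition if v not in subset]
        g_split = len(left) / 3 * gini(left) + len(right) / 3 * gini(right)
        print(set(subset), round(g_split, 3))
# Car type in {sport} (equivalently its complement {family, truck}) gives G_SPLIT = 0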

  35. Example (9)
The lowest value of GSPLIT is for Car type ∈ {sport}, thus this is our split point. Decision tree after the second split of the example set: Age ≤ 27.5 → Risk = High; Age > 27.5 → test Car type: Car type ∈ {sport} → Risk = High, Car type ∈ {family, truck} → Risk = Low.

  36. Information Gain (1)
• The information gain measure is used to select the test attribute at each node in the tree
• The attribute with the highest information gain (i.e. the greatest entropy reduction) is chosen as the test attribute for the current node
• This attribute minimizes the information needed to classify the samples in the resulting partitions

  37. Information Gain (2)
• Let S be a set consisting of s data samples. Suppose the class label attribute has m distinct values defining m classes Ci (for i = 1, ..., m)
• Let si be the number of samples of S in class Ci
• The expected information needed to classify a given sample is given by
I(s1, s2, ..., sm) = − Σi=1..m pi log2(pi)
where pi is the probability that an arbitrary sample belongs to class Ci and is estimated by si/s

  38. Information Gain (3)
• Let attribute A have v distinct values {a1, a2, ..., av}. Attribute A can be used to partition S into {S1, S2, ..., Sv}, where Sj contains those samples in S that have value aj of A
• If A were selected as the test attribute, then these subsets would correspond to the branches grown from the node containing the set S

  39. Information Gain (4)
• Let sij be the number of samples of class Ci in subset Sj. The entropy, or expected information based on the partitioning into subsets by A, is given by:
E(A) = Σj=1..v [(s1j + s2j + ... + smj)/s] · I(s1j, s2j, ..., smj)
• The smaller the entropy value, the greater the purity of the subset partitions

  40. Information Gain (5)
• The term (s1j + s2j + ... + smj)/s acts as the weight of the j-th subset and is the number of samples in the subset (i.e. having value aj of A) divided by the total number of samples in S
• Note that for a given subset Sj,
I(s1j, s2j, ..., smj) = − Σi pij log2(pij)
where pij = sij/|Sj| is the probability that a sample in Sj belongs to class Ci

  41. Information Gain (6)
The encoding information that would be gained by branching on A is
Gain(A) = I(s1, s2, ..., sm) − E(A)
Gain(A) is the expected reduction in entropy caused by knowing the value of attribute A. A sketch of these formulas follows.
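
A sketch of I, E(A) and Gain(A) as defined on the last few slides, for records given as (attribute dictionary, class label) pairs; the representation and the toy data are assumptions:

from math import log2
from collections import Counter

def info(labels):
    # I(s1, ..., sm) = - sum_i p_i * log2(p_i), with p_i estimated by s_i / s
    s = len(labels)
    return -sum(c / s * log2(c / s) for c in Counter(labels).values())

def entropy(records, attribute):
    # E(A) = sum_j (s_1j + ... + s_mj) / s * I(s_1j, ..., s_mj)
    s = len(records)
    subsets = {}
    for attrs, label in records:
        subsets.setdefault(attrs[attribute], []).append(label)
    return sum(len(lbls) / s * info(lbls) for lbls in subsets.values())

def gain(records, attribute):
    # Gain(A) = I(s1, ..., sm) - E(A)
    return info([label for _, label in records]) - entropy(records, attribute)

toy = [({"age": "<=30"}, "no"), ({"age": "<=30"}, "yes"), ({"age": "31..40"}, "yes")]
print(round(gain(toy, "age"), 3))    # 0.252 on this hypothetical 3-record set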

  42. Example (1)
[Training set table: the customer database with attributes age, income, student, credit_rating and class label buys_computer]

  43. Example (2)
• Let us consider the training set of tuples taken from the customer database (table above).
• The class label attribute, buys_computer, has two distinct values (yes, no); therefore there are two classes (m = 2).
C1 corresponds to yes: s1 = 9
C2 corresponds to no: s2 = 5
I(s1, s2) = I(9, 5) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) = 0.940

  44. Example (3)
• Next, we need to compute the entropy of each attribute. Let us start with the attribute age:
for age = '<=30': s11 = 2, s21 = 3, I(s11, s21) = 0.971
for age = '31..40': s12 = 4, s22 = 0, I(s12, s22) = 0
for age = '>40': s13 = 2, s23 = 3, I(s13, s23) = 0.971

  45. Example (4)
The entropy of age is
E(age) = (5/14)·I(s11, s21) + (4/14)·I(s12, s22) + (5/14)·I(s13, s23) = 0.694
The gain in information from such a partitioning would be
Gain(age) = I(s1, s2) − E(age) = 0.940 − 0.694 = 0.246
(These numbers are reproduced by the sketch below.)
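
The numbers on slides 43-45 can be checked directly from the class counts given there:

from math import log2

def info(*counts):
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c > 0)

i_total = info(9, 5)                         # I(9, 5)
age_partitions = [(2, 3), (4, 0), (2, 3)]    # (yes, no) counts for <=30, 31..40, >40
e_age = sum((a + b) / 14 * info(a, b) for a, b in age_partitions)
print(f"I = {i_total:.3f}, E(age) = {e_age:.3f}, Gain(age) = {i_total - e_age:.3f}")
# I = 0.940, E(age) = 0.694, Gain(age) = 0.247 (0.246 with the slides' rounding)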

  46. Example (5)
• Similarly, we can compute Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048. Since age has the highest information gain among the attributes, it is selected as the test attribute. A node is created and labeled with age, and branches are grown for each of the attribute's values.

  47. Example (6)
Partial tree after the first split: the root tests age; the branch age <= 30 still contains both classes (buys_computer: yes, no), the branch age 31..40 is pure (buys_computer: yes), and the branch age > 40 still contains both classes (buys_computer: yes, no).

  48. Example (7)
Final decision tree: the root tests age; the age <= 30 branch tests student (no → no, yes → yes), the age 31..40 branch is a leaf predicting yes, and the age > 40 branch tests credit_rating (excellent → no, fair → yes).

  49. Entropy vs. Gini index
• Entropy tends to find groups of classes that together add up to 50% of the data
• The Gini index tends to isolate the largest class from all other classes
• Example: for classes A (40), B (30), C (20), D (10), an entropy-based split chooses "if age < 65", separating {A, D} (50) from {B, C} (50), while a Gini-based split chooses "if age < 40", separating {A} (40) from {B, C, D} (60). The snippet below checks this.
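
The following snippet checks this behaviour on the class counts above (function names are my own):

from math import log2

def gini(counts):
    s = sum(counts)
    return 1 - sum((c / s) ** 2 for c in counts)

def entropy(counts):
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c)

def weighted(measure, left, right):
    s = sum(left) + sum(right)
    return sum(left) / s * measure(left) + sum(right) / s * measure(right)

splits = {"age < 65": ([40, 10], [30, 20]),     # {A, D} vs {B, C}
          "age < 40": ([40], [30, 20, 10])}     # {A} vs {B, C, D}
for name, (left, right) in splits.items():
    print(name, round(weighted(entropy, left, right), 3), round(weighted(gini, left, right), 3))
# entropy is lower for "age < 65" (0.846 vs 0.875); Gini is lower for "age < 40" (0.367 vs 0.4)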

  50. Tree pruning
• When a decision tree is built, many of the branches may reflect anomalies in the training data due to noise or outliers
• Tree pruning methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify independent test data (one concrete pruning sketch follows)
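
The slides do not commit to a particular pruning method; as one concrete illustration, here is a sketch of cost-complexity post-pruning with scikit-learn (my choice of method and data, not necessarily what the original course used):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# candidate pruning strengths for the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# keep the pruned tree that classifies the independent test data best
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_test, y_test),
)
print(best.get_n_leaves(), best.score(X_test, y_test))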
