
Classification


Presentation Transcript


  1. Classification (supplemental)

  2. Scalable Decision Tree Induction Methods in Data Mining Studies
  • SLIQ (EDBT'96, Mehta et al.)
    • builds an index for each attribute; only the class list and the current attribute list reside in memory
  • SPRINT (VLDB'96, J. Shafer et al.)
    • constructs an attribute-list data structure
  • PUBLIC (VLDB'98, Rastogi & Shim)
    • integrates tree splitting and tree pruning: stops growing the tree earlier
  • RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)
    • separates the scalability aspects from the criteria that determine the quality of the tree
    • builds an AVC-list (attribute, value, class label)
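
For reference, a minimal sketch of what a RainForest-style AVC-set could look like in Python; the toy records and the avc_set helper are illustrative assumptions, not structures from the slides:

    from collections import Counter

    # Toy training records: two attributes plus a class label each.
    records = [
        {"age": 17, "car": "sports", "label": "H"},
        {"age": 20, "car": "family", "label": "H"},
        {"age": 23, "car": "family", "label": "H"},
        {"age": 32, "car": "truck",  "label": "L"},
        {"age": 43, "car": "sports", "label": "H"},
        {"age": 68, "car": "family", "label": "L"},
    ]

    def avc_set(records, attribute):
        """Counts of (attribute value, class label) pairs: enough to
        evaluate splits on this attribute, far smaller than the data."""
        return Counter((r[attribute], r["label"]) for r in records)

    print(avc_set(records, "car"))
    # e.g. Counter({('sports', 'H'): 2, ('family', 'H'): 2, ...})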

  3. SPRINT: for large data sets. [Figure: a small decision tree with root test Age < 25 (yes: leaf H); otherwise test Car = Sports (yes: leaf H, no: leaf L).]

  4. Gini Index (IBM IntelligentMiner)
  • If a data set T contains examples from n classes, the gini index gini(T) is defined as
      gini(T) = 1 - [p1² + p2² + ... + pn²]
    where pj is the relative frequency of class j in T.
  • If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
      gini_split(T) = (N1/N)·gini(T1) + (N2/N)·gini(T2)
  • The attribute providing the smallest gini_split(T) is chosen to split the node (this requires enumerating all possible split points for each attribute).
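
The two formulas translate directly into code. A minimal sketch in Python (the function names gini and gini_split are mine, not IntelligentMiner's API):

    def gini(counts):
        """gini(T) = 1 - sum_j pj^2, from the per-class counts in T."""
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

    def gini_split(counts1, counts2):
        """Weighted gini of a binary split: (N1/N)*gini(T1) + (N2/N)*gini(T2)."""
        n1, n2 = sum(counts1), sum(counts2)
        return n1 / (n1 + n2) * gini(counts1) + n2 / (n1 + n2) * gini(counts2)

    print(gini([4, 2]))                 # 0.444..., a node with 4 H and 2 L tuples
    print(gini_split([3, 0], [1, 2]))   # 0.222..., the best split in slide 9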

  5. SPRINT
  Partition(S):
      if all points of S are in the same class then
          return
      for each attribute A do
          evaluate splits on A
      use the best split found to partition S into S1 and S2
      Partition(S1)
      Partition(S2)
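
A toy, in-memory rendering of that loop in Python. Real SPRINT evaluates splits on disk-resident attribute lists; this sketch considers only a single hypothetical continuous attribute (age, with distinct values) and H/L class labels, reusing the gini_split helper sketched above:

    def best_age_split(rows):
        """rows: (age, label) pairs. Score the midpoint between each pair
        of consecutive ages; return (gini_split value, threshold)."""
        rows = sorted(rows)          # SPRINT keeps continuous lists pre-sorted
        to_counts = lambda ys: [ys.count("H"), ys.count("L")]
        best = None
        for k in range(1, len(rows)):
            g = gini_split(to_counts([y for _, y in rows[:k]]),
                           to_counts([y for _, y in rows[k:]]))
            threshold = (rows[k - 1][0] + rows[k][0]) / 2
            if best is None or g < best[0]:
                best = (g, threshold)
        return best

    def partition(rows, depth=0):
        if len({y for _, y in rows}) == 1:        # pure node: emit a leaf
            print("  " * depth + "leaf:", rows[0][1])
            return
        g, t = best_age_split(rows)               # evaluate candidate splits
        print("  " * depth + f"split Age < {t} (gini_split = {g:.3f})")
        partition([r for r in rows if r[0] < t], depth + 1)
        partition([r for r in rows if r[0] >= t], depth + 1)

    partition([(17, "H"), (20, "H"), (23, "H"), (32, "L"), (43, "H"), (68, "L")])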

  6. SPRINT Data Structures. [Figure: a small training set and the attribute lists derived from it, one list for Age and one for Car; each attribute-list entry carries the attribute value, the class label, and the record id (rid).]
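
A sketch of how such attribute lists could be built in Python; the training set is the same hypothetical one used in the earlier sketches, and the continuous list is pre-sorted once:

    training_set = [  # (age, car, label); rid is the position in this list
        (17, "sports", "H"), (20, "family", "H"), (23, "family", "H"),
        (32, "truck", "L"), (43, "sports", "H"), (68, "family", "L"),
    ]

    # One attribute list per attribute: entries of (value, class label, rid).
    age_list = sorted((age, label, rid)
                      for rid, (age, _, label) in enumerate(training_set))
    car_list = [(car, label, rid)
                for rid, (_, car, label) in enumerate(training_set)]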

  7. Splits. [Figure: the split Age < 27.5 partitions the attribute lists into Group 1 and Group 2.]

  8. Histograms. For continuous attributes, two class histograms are associated with each node: C_below for the tuples already processed and C_above for the tuples still to process; a sketch of the scan follows.
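
A Python sketch of that scan, reusing the gini helpers above: all tuples start in C_above, and each step moves one tuple (in age order) into C_below and scores the boundary. The class sequence is the one from the worked example on the next slide:

    from collections import Counter

    labels_in_age_order = ["H", "H", "H", "L", "H", "L"]

    c_below = Counter()                       # already processed
    c_above = Counter(labels_in_age_order)    # still to process

    for k, y in enumerate(labels_in_age_order, start=1):
        c_below[y] += 1
        c_above[y] -= 1
        g = gini_split([c_below["H"], c_below["L"]],
                       [c_above["H"], c_above["L"]])
        print(f"split after tuple {k}: gini_split = {g:.3f}")
    # gini_split = 0.400, 0.333, 0.222, 0.417, 0.267, 0.444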

  9. Example
  Evaluating every candidate split position on Age (class sequence in age order: H, H, H, L, H, L):
  • gini_split0 = 0/6·gini(S1) + 6/6·gini(S2); gini(S2) = 1 - [(4/6)² + (2/6)²] = 0.444; gini_split0 = 0.444
  • gini_split1 = 1/6·gini(S1) + 5/6·gini(S2); gini(S1) = 1 - (1/1)² = 0; gini(S2) = 1 - [(3/5)² + (2/5)²] = 0.480; gini_split1 = 0.400
  • gini_split2 = 2/6·gini(S1) + 4/6·gini(S2); gini(S1) = 1 - (2/2)² = 0; gini(S2) = 1 - [(2/4)² + (2/4)²] = 0.500; gini_split2 = 0.333
  • gini_split3 = 3/6·gini(S1) + 3/6·gini(S2); gini(S1) = 1 - (3/3)² = 0; gini(S2) = 1 - [(1/3)² + (2/3)²] = 0.444; gini_split3 = 0.222
  • gini_split4 = 4/6·gini(S1) + 2/6·gini(S2); gini(S1) = 1 - [(3/4)² + (1/4)²] = 0.375; gini(S2) = 1 - [(1/2)² + (1/2)²] = 0.500; gini_split4 = 0.417
  • gini_split5 = 5/6·gini(S1) + 1/6·gini(S2); gini(S1) = 1 - [(4/5)² + (1/5)²] = 0.320; gini(S2) = 1 - (1/1)² = 0; gini_split5 = 0.267
  • gini_split6 = 6/6·gini(S1) + 0/6·gini(S2); gini(S1) = 1 - [(4/6)² + (2/6)²] = 0.444; gini_split6 = 0.444
  The minimum is gini_split3 = 0.222, so the winning split on Age is Age < 27.5 (the midpoint between the third and fourth age values, as on slide 7).

  10. Splitting categorical attributes. A single scan through the attribute list collects counts in a count matrix, one cell for each combination of class label and attribute value.

  11. Example
  • gini_split(family) = 3/6·gini(S1) + 3/6·gini(S2); gini(S1) = 1 - [(2/3)² + (1/3)²] = 4/9; gini(S2) = 1 - [(2/3)² + (1/3)²] = 4/9; gini_split(family) = 0.444
  • gini_split(sports) = 2/6·gini(S1) + 4/6·gini(S2); gini(S1) = 1 - (2/2)² = 0; gini(S2) = 1 - [(2/4)² + (2/4)²] = 0.5; gini_split(sports) = 0.333
  • gini_split(truck) = 1/6·gini(S1) + 5/6·gini(S2); gini(S1) = 1 - (1/1)² = 0; gini(S2) = 1 - [(4/5)² + (1/5)²] = 0.32; gini_split(truck) = 0.267
  The best categorical split is Car Type = Truck (gini_split = 0.267).
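
The same numbers fall out of a few lines of Python; the car attribute list is the hypothetical one from the earlier sketches, and the gini_split helper is reused:

    from collections import Counter

    car_list = [("sports", "H"), ("family", "H"), ("family", "H"),
                ("truck", "L"), ("sports", "H"), ("family", "L")]

    count_matrix = Counter(car_list)          # one cell per (value, class) pair
    totals = Counter(label for _, label in car_list)

    for v in ("family", "sports", "truck"):
        s1 = [count_matrix[(v, c)] for c in ("H", "L")]              # CarType = v
        s2 = [totals[c] - count_matrix[(v, c)] for c in ("H", "L")]  # the rest
        print(f"CarType = {v}: gini_split = {gini_split(s1, s2):.3f}")
    # gini_split = 0.444 (family), 0.333 (sports), 0.267 (truck)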

  12. Example (two attributes). Comparing the best split per attribute, Age < 27.5 (gini_split = 0.222) beats Car Type = Truck (gini_split = 0.267), so the winner is Age < 27.5. [Figure: the root test with branches Y and N; the Y branch is a pure leaf of class H.]

  13. Example for Bayes' Rule
  • The patient either has cancer or does not.
  • Prior knowledge: over the entire population, 0.008 of people have cancer.
  • The lab test result (+ or -) is imperfect. It returns
    • a correct positive result in only 98% of the cases in which the cancer is actually present, and
    • a correct negative result in only 97% of the cases in which the cancer is not present.
  • Suppose the lab test for a new patient returns +. How likely is it that the patient actually has cancer?

  14. Example for Bayes' Rule
  Pr(cancer) = 0.008, Pr(not cancer) = 0.992
  Pr(+|cancer) = 0.98, Pr(-|cancer) = 0.02
  Pr(+|not cancer) = 0.03, Pr(-|not cancer) = 0.97
  Pr(+|cancer)·Pr(cancer) = 0.98 · 0.008 = 0.0078
  Pr(+|not cancer)·Pr(not cancer) = 0.03 · 0.992 = 0.0298
  Hence, Pr(cancer|+) = 0.0078 / (0.0078 + 0.0298) = 0.21
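
As a quick check, the same computation in Python (the variable names are mine):

    p_cancer = 0.008                 # prior
    p_pos_given_cancer = 0.98        # correct positive rate
    p_pos_given_no_cancer = 0.03     # false positive rate

    joint_cancer = p_pos_given_cancer * p_cancer              # Pr(+, cancer)
    joint_no_cancer = p_pos_given_no_cancer * (1 - p_cancer)  # Pr(+, not cancer)

    posterior = joint_cancer / (joint_cancer + joint_no_cancer)
    print(f"Pr(cancer | +) = {posterior:.2f}")                # 0.21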
