
Unsupervised Learning with Mixed Numeric and Nominal Data



  1. Unsupervised Learning with Mixed Numeric and Nominal Data Advisor: Dr. Hsu Graduate: Yu-Cheng Chen Authors: Cen Li, Gautam Biswas 2002 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

  2. Outline • Motivation • Objective • Introduction • Background • SBAC • Experimental results • Conclusions • Personal Opinion

  3. Motivation • Traditional clustering algorithms assume features are either numeric or categorical valued. • The majority of useful real-world data is described by a mixture of numeric and nominal valued features.

  4. Objective • Develop unsupervised learning techniques that exhibit good performance on mixed data.

  5. Introduction • Traditional approaches for handling mixed data include the following (a preprocessing sketch follows below): • Binary encoding of nominal attributes. • Discretizing numeric attributes. • Generalizing criterion functions to handle mixed data directly.
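The first two approaches are preprocessing tricks that force mixed data into a single type. Here is a minimal Python sketch of both; the toy column names, values, and bin count are illustrative assumptions, not from the paper:

    import pandas as pd

    # Toy mixed data: one nominal and one numeric attribute (illustrative only).
    df = pd.DataFrame({
        "color": ["a", "b", "c", "a"],      # nominal
        "weight": [7.5, 9.0, 10.5, 9.0],    # numeric
    })

    # Approach 1: binary (one-hot) encoding turns each nominal value into a 0/1
    # column, so a purely numeric clustering algorithm can be applied.
    encoded = pd.get_dummies(df, columns=["color"])

    # Approach 2: discretization turns each numeric attribute into categorical
    # bins, so a purely categorical algorithm can be applied. Three equal-width
    # bins is an arbitrary choice.
    df["weight_bin"] = pd.cut(df["weight"], bins=3, labels=["low", "mid", "high"])

    print(encoded)
    print(df[["color", "weight_bin"]])

Both tricks lose information (one-hot encoding discards value ordering and magnitude; discretization discards within-bin differences), which is what motivates the third approach and the SBAC system below.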

  6. Background • COBWEB/3 • uses the category utility (CU) measure for categorical attributes • extends the CU measure to numeric attributes

  7. Background (cont.) • COBWEB/3 • The CU measure for numeric attributes is defined as: • The overall CU is defined as:
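The formulas on slides 6-7 were images in the original deck; the following is a reconstruction from the COBWEB/CLASSIT literature that COBWEB/3 builds on, so the notation is mine rather than the slide's. For a nominal attribute A_i with values V_{ij}, the contribution of class C_k to category utility is

    CU(C_k) = P(C_k) \sum_i \sum_j \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right]

For a numeric attribute, assumed normally distributed within each class, the inner sum over values is replaced by its continuous analogue, which evaluates to 1/(2\sqrt{\pi}\,\sigma), giving

    CU(C_k) = P(C_k) \, \frac{1}{2\sqrt{\pi}} \sum_i \left( \frac{1}{\sigma_{ik}} - \frac{1}{\sigma_{ip}} \right)

where \sigma_{ik} and \sigma_{ip} are the standard deviations of attribute i in class C_k and in the parent node. The overall CU of a partition into K classes is the average

    CU = \frac{1}{K} \sum_{k=1}^{K} CU(C_k)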

  8. Background (cont.) • COBWEB/3 • Limitations: • The normal distribution assumption for numeric data. • The accuracy of the estimate is suspect when the sample size is small. • When all objects in C_k share a single value for attribute i, σ_ik = 0 and 1/σ_ik → ∞; the implementation works around this by setting 1/σ_ik = 1 whenever σ_ik < 1.

  9. Background (cont.) • ECOBWEB attempts to remedy the disadvantages of COBWEB/3: • The normal distribution assumption. • The σ_ik = 0 problem.

  10. Background (cont.) • ECOBWEB • Limitations: • The choice of the parameters has a significant effect on CU computation.

  11. Background (cont.) • AUTOCLASS • Uses a Bayesian method for clustering. • Derives the most probable class distribution for the data given prior information. • Limitations: • Computational complexity is too high. • Overfitting problem.

  12. SBAC System • SBAC • uses a similarity measure defined by Goodall • adopts a hierarchical agglomerative approach to build partition structures. • The similarity is governed by the uncommonality of feature value matches: • X1 = {a, b}, X2 = {a, b}, X3 = {c, d}, X4 = {c, d} • If (P(a) = P(b)) ≥ (P(c) = P(d)), then c and d are the rarer values, so a match on them is more significant. • Hence the similarity of X3 and X4 should be greater than that of X1 and X2.

  13. SBAC System • Summary • For numeric feature values, the similarity takes into account: • The feature value difference. • The uniqueness of the feature value pair.

  14. SBAC System • Computing Similarity for Numeric Attributes • Define the More Similar Feature Segment Set (MSFSS): • The set of all pairs of values of feature k that are equally or more similar to the pair ((V_i)_k, (V_j)_k); for numeric values, a pair with a smaller absolute difference counts as more similar.

  15. SBAC System • The probability of picking a pair of values ((V_l)_k, (V_m)_k) ∈ MSFSS((V_i)_k, (V_j)_k) is defined below. • The dissimilarity of the pair, (D_ij)_k, is defined as the summation of these probabilities over the MSFSS. • The similarity of the pair ((V_i)_k, (V_j)_k) is then defined from the dissimilarity.
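The formulas themselves were slide images; the following reconstruction is consistent with Goodall's measure as adopted by SBAC (pairs of values drawn without replacement from the n observed values of feature k), and should be read as a sketch rather than the paper's exact notation. With f_l the frequency of value (V_l)_k,

    P\big((V_l)_k, (V_m)_k\big) = \frac{2 f_l f_m}{n(n-1)} \quad (l \neq m), \qquad P\big((V_l)_k, (V_l)_k\big) = \frac{f_l (f_l - 1)}{n(n-1)}

    (D_{ij})_k = \sum_{((V_l)_k, (V_m)_k) \in MSFSS((V_i)_k, (V_j)_k)} P\big((V_l)_k, (V_m)_k\big)

    (S_{ij})_k = 1 - (D_{ij})_k

Intuitively, the fewer equally-or-more-similar pairs the population could produce, the smaller the dissimilarity sum and the higher the similarity of the observed pair.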

  16. SBAC System • For nominal feature values, the similarity of a mismatch is 0, and the similarity of a match is computed from its uncommonality. • Define the More Similar Feature Value Set (MSFVS): • The set of all pairs of values of feature k that are equally or more similar to the matching pair ((V_i)_k, (V_i)_k), i.e., the matching pairs whose value is equally or less frequent. • Example: f(a) = 3, f(b) = 3, f(c) = 4 • MSFVS(c, c) = {(a, a), (b, b), (c, c)} • MSFVS(b, b) = {(a, a), (b, b)}

  17. SBAC System • The probability of picking a pair ((V_l)_k, (V_l)_k) ∈ MSFVS((V_i)_k) is defined below. • The dissimilarity of the pair, (D_ii)_k, is defined as the summation of these probabilities over the MSFVS.
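Again reconstructing the missing slide formulas in the same sampling-without-replacement form (notation mine): with f_l the frequency of nominal value (V_l)_k among the n objects,

    P\big((V_l)_k, (V_l)_k\big) = \frac{f_l (f_l - 1)}{n(n-1)}

    (D_{ii})_k = \sum_{((V_l)_k, (V_l)_k) \in MSFVS((V_i)_k)} \frac{f_l (f_l - 1)}{n(n-1)}, \qquad (S_{ii})_k = 1 - (D_{ii})_k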

  18. SBAC System • Worked example: f(a) = 3, f(b) = 3, f(c) = 4 • MSFVS(c, c) = {(a, a), (b, b), (c, c)} • MSFVS(b, b) = {(a, a), (b, b)}
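Plugging these frequencies into the reconstructed formula above (so the numbers below are my computation, not the slide's): n = 3 + 3 + 4 = 10, and

    (D_{cc})_k = \frac{3 \cdot 2 + 3 \cdot 2 + 4 \cdot 3}{10 \cdot 9} = \frac{24}{90} \approx 0.267, \qquad (S_{cc})_k \approx 0.733

    (D_{bb})_k = \frac{3 \cdot 2 + 3 \cdot 2}{10 \cdot 9} = \frac{12}{90} \approx 0.133, \qquad (S_{bb})_k \approx 0.867

As intended, the match on the rarer value b scores higher similarity than the match on the more common value c.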

  19. SBAC System • Aggregating Similarity from Multiple Features • Assuming the per-feature results are expressed as Fisher's χ² values: • For numeric features: • For nominal features:
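The per-feature formulas are missing from the transcript; what follows is the standard Fisher χ² combination that the slide's wording points to, offered as a sketch rather than the paper's exact expressions. Each per-feature dissimilarity (D)_k, viewed as a probability, transforms as

    \chi^2_k = -2 \ln (D)_k

and combining the m features gives

    \chi^2 = -2 \sum_{k=1}^{m} \ln (D)_k

which under independence follows a χ² distribution with 2m degrees of freedom. For nominal features a corrected form of this transformation is needed, since (D)_k then takes only discrete values.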

  20. SBAC System • Combining the two types of features: • [Worked example with objects described by one nominal and one numeric feature, e.g., {c, 9}, {a, 7.5}, {c, 10.5}, {c, 9}]

  21. SBAC System • The agglomerative clustering algorithm (a hedged code sketch is given after slide 22 below):

  22. SBAC System • The predefined threshold t • Set t = 0.3 × D(root); with D(root) = 0.876, this gives t = 0.263. • Tracing the dendrogram, if the drop in dissimilarity between successive levels exceeds t, stop and cut the tree there.
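A minimal sketch of the agglomerative step and the cut rule from slides 21-22. This is generic average-linkage merging on a precomputed dissimilarity matrix, not the paper's exact pseudocode; the function names and the linkage choice are my assumptions.

    import numpy as np

    def agglomerate(D, t_frac=0.3):
        """Greedy agglomerative clustering on an n x n dissimilarity matrix D.

        Merges the closest pair of clusters (average linkage) until one cluster
        remains, then cuts where the jump in merge dissimilarity exceeds
        t = t_frac * (dissimilarity at the root), mimicking slide 22's rule.
        """
        n = len(D)
        clusters = [[i] for i in range(n)]
        merges = []  # (merge dissimilarity, snapshot of clusters before merging)

        def avg_link(a, b):
            return np.mean([D[i][j] for i in a for j in b])

        while len(clusters) > 1:
            # Find the pair of clusters with the smallest average dissimilarity.
            d, p, q = min((avg_link(clusters[p], clusters[q]), p, q)
                          for p in range(len(clusters))
                          for q in range(p + 1, len(clusters)))
            merges.append((d, [c[:] for c in clusters]))
            clusters[p] = clusters[p] + clusters[q]
            del clusters[q]

        # Cut rule: t is a fraction of the root-level (final) merge dissimilarity.
        t = t_frac * merges[-1][0]
        for (d_prev, _), (d_next, snapshot) in zip(merges, merges[1:]):
            if d_next - d_prev > t:
                return snapshot  # partition just before the large jump
        return [list(range(n))]  # no big jump: everything in one cluster

    # Toy usage with a random symmetric dissimilarity matrix.
    rng = np.random.default_rng(0)
    X = rng.random((6, 6))
    D = (X + X.T) / 2
    np.fill_diagonal(D, 0.0)
    print(agglomerate(D))

In SBAC, D would be built from the aggregated Goodall similarities of slides 14-19 (dissimilarity = 1 - similarity) rather than from random numbers.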

  23. Experimental results • Artificial data • 180 data points in three classes: G1, G2, G3. • Two nominal and two numeric attributes. • Each class has 60 data points.

  24. Experimental results (cont.)

  25. Experimental results (cont.) • [Clustering results on the artificial data from COBWEB, SBAC, AUTOCLASS, and ECOBWEB]

  26. Experimental results (cont.) • Real data • Handwritten Character (8OX) Data • Numeric features • 45 objects • Mushroom Data • Nominal features • 200 objects (100 of them poisonous) • Heart Disease Data • Mixed features • 303 patients

  27. Experimental results (cont.) • Results

  28. Experimental results (cont.) • Results

  29. Experimental results (cont.) • Results

  30. Conclusions • This paper proposed a new similarity measure that assigns greater weight to feature value matches that are uncommon in the population. • The approach achieves better clustering performance than the other methods studied.

  31. Personal Opinion • The time complexity of this approach is too high. • The processes of computing the similarity and clustering are overly complicated.
