
Unsupervised pattern recognition models for mixed feature-type symbolic data






Presentation Transcript


  1. Unsupervised pattern recognition models for mixed feature-type symbolic data Francisco de A.T. de Carvalho*, Renata M.C.R. de Souza PRL, Vol. 31, 2010, pp. 430–443. Presenter: Wei-Shen Tai, 2010/3/10

  2. Outline • Introduction • Dynamic clustering algorithms for mixed feature-type symbolic data • Cluster interpretation • Experimental evaluation • Concluding remarks • Comments

  3. Motivation • Partitioning dynamic clustering algorithms • None of the earlier dynamic clustering models can manage mixed feature-type symbolic data.

  4. Objective • Dynamic clustering methods for mixed feature-type symbolic data based on suitable adaptive squared Euclidean distances. • To homogenize the mixed feature-type symbolic data into histogram-valued symbolic data in a preprocessing step, so that all variables can be compared with a single distance.

  5. Partitioning dynamic clustering • Iterative two-step relocation algorithms • Construct clusters and identify a suitable representation, or prototype, for each cluster at each iteration. • Optimize a criterion that measures the fit between the clusters and their prototypes (see the criterion sketched below). • Adaptive dynamic clustering algorithm • The distances that compare objects with cluster prototypes may differ from one cluster to another.
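For concreteness, the optimized criterion in adaptive-distance dynamic clustering typically takes the following form (a sketch in our own notation, not copied verbatim from the paper): each object x_i in cluster C_k is compared with the cluster prototype g_k under a per-cluster weight vector λ_k,

J(P, G, \Lambda) = \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \lambda_k^j \, (x_i^j - g_k^j)^2, \qquad \prod_{j=1}^{p} \lambda_k^j = 1,

where the product-one constraint keeps the weights from collapsing to zero; in the global variant a single weight vector λ is shared by all clusters.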

  6. Data homogenization pre-processing • Set-valued and list-valued variables • Ordered list-valued variables • Interval-valued variables

  7. Interval-valued variables • X1 is the minimum and the maximum of the gross national product (in millions). • The set of elementary intervals is obtained from the sorted bounds of all observed intervals. • Country 1: X1 = [10, 30] • I1 = [10, 25] => l([10, 25] ∩ [10, 30]) / l([10, 30]) = 15/20 = 0.75 • I2 = [25, 30] => l([25, 30] ∩ [10, 30]) / l([10, 30]) = 5/20 = 0.25 • Q2 = 0.75 + 0.25 = 1.0 (the weights sum to one)
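A minimal sketch of this recoding (function names are ours, not the paper's) reproduces the slide's numbers:

```python
def elementary_intervals(intervals):
    """Build the elementary intervals of one interval-valued variable:
    sort all observed bounds and pair consecutive values."""
    bounds = sorted({b for interval in intervals for b in interval})
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

def interval_to_weights(interval, elem):
    """Recode an interval [a, b] as a weight vector over the elementary
    intervals: weight_j = l(I_j ∩ [a, b]) / l([a, b])."""
    a, b = interval
    weights = []
    for lo, hi in elem:
        overlap = max(0.0, min(hi, b) - max(lo, a))
        weights.append(overlap / (b - a))
    return weights

# Worked example from the slide: country 1 has X1 = [10, 30], and the
# elementary intervals include [10, 25] and [25, 30].
print(interval_to_weights((10, 30), [(10, 25), (25, 30)]))  # [0.75, 0.25]
```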

  8. Set-valued and list-valued variables • Domain A2 = {A = agriculture, C = chemistry, Co = commerce, E = engineering, En = energy, I = information} • Country 1: X2 = {A, Co} => weights over {A, C, Co, E, En, I} are (1/2, 0, 1/2, 0, 0, 0) • Ordered domain A9 = {worst, bad, fair, good, best} • Country 1: A9 = good => (0, 0, 0, 1, 1), a cumulative 0/1 encoding over the ordered categories
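The same homogenization can be sketched for nominal sets and ordered categories; the cumulative 0/1 encoding for ordered variables is inferred from the slide's "good" example, so treat it as an assumption:

```python
def set_to_weights(values, domain):
    """Nominal set-valued variable: spread unit mass uniformly over the
    categories the object actually takes."""
    share = 1.0 / len(values)
    return [share if c in values else 0.0 for c in domain]

def ordered_to_weights(value, ordered_domain):
    """Ordered list-valued variable: cumulative 0/1 vector that is 1 from
    the observed category onward, as in the slide's 'good' example."""
    idx = ordered_domain.index(value)
    return [1.0 if i >= idx else 0.0 for i in range(len(ordered_domain))]

print(set_to_weights({"A", "Co"}, ["A", "C", "Co", "E", "En", "I"]))
# [0.5, 0.0, 0.5, 0.0, 0.0, 0.0]
print(ordered_to_weights("good", ["worst", "bad", "fair", "good", "best"]))
# [0.0, 0.0, 0.0, 1.0, 1.0]
```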

  9. Squared adaptive Euclidean distances • Single squared adaptive Euclidean distance (global) • One weight vector is shared by all clusters. • Cluster squared adaptive Euclidean distances (local) • Each cluster has its own weight vector, different from one cluster to another.
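Both variants reduce to the same weighted distance computation; only where the weights come from differs. A small sketch (our own helper, not the paper's code):

```python
import numpy as np

def adaptive_sq_dist(x, g, lam):
    """Weighted squared Euclidean distance between a homogenized object x
    and a cluster prototype g, with one weight per component."""
    x, g, lam = (np.asarray(a, dtype=float) for a in (x, g, lam))
    return float(np.sum(lam * (x - g) ** 2))

# Global variant: a single weight vector `lam` shared by every cluster.
# Local variant: one weight vector `lams[k]` per cluster k.
```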

  10. Algorithm schema • Pre-processing step: data homogenization. • Initialization step: • Randomly choose a partition, or randomly choose K distinct objects of X as initial prototypes. • Step 1: definition of the best prototypes • With the partition and weights fixed, compute the prototype of each cluster that minimizes the criterion. • Step 2: definition of the best vector(s) of weights • With the partition and prototypes fixed, compute a single weight vector for all clusters (global) or one weight vector per cluster (local). • Step 3: definition of the best partition • Assign each object to the cluster whose prototype is closest under the adaptive distance. • Stopping criterion • No object changes its cluster. (A minimal sketch of the loop follows.)
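The following sketch illustrates the local variant of this loop on already-homogenized vectors. The weight update uses the standard product-one closed form from the adaptive-distance literature; the paper's exact update rules may differ, so read this as an illustration rather than the authors' implementation:

```python
import numpy as np

def dynamic_clustering(X, K, n_iter=100, seed=0):
    """Sketch of the adaptive dynamic clustering loop (local variant) on
    homogenized data X of shape (n objects, p components)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, p = X.shape
    protos = X[rng.choice(n, size=K, replace=False)].copy()  # initialization
    lams = np.ones((K, p))            # one weight vector per cluster
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Step 3 (allocation): nearest prototype under the adaptive distance.
        dists = np.stack([((X - protos[k]) ** 2 * lams[k]).sum(axis=1)
                          for k in range(K)], axis=1)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                     # stopping criterion: no object moved
        labels = new_labels
        for k in range(K):
            members = X[labels == k]
            if len(members) == 0:
                continue
            # Step 1 (representation): the mean minimizes the criterion.
            protos[k] = members.mean(axis=0)
            # Step 2 (weighting): closed form under prod(lams[k]) == 1.
            s = ((members - protos[k]) ** 2).sum(axis=0) + 1e-12
            lams[k] = np.prod(s) ** (1.0 / p) / s
    return labels, protos, lams
```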

  11. Experimental results • Measurement of the quality of the results • Overall error rate of classification (OERC) • Corrected Rand index (CR) • Let U = {u1, ..., ui, ..., uR} and V = {v1, ..., vj, ..., vC} be two partitions of the same data set having respectively R and C clusters. With n_ij = |u_i ∩ v_j|, marginals n_i. and n.j, and n objects in total, the corrected Rand index (Hubert and Arabie, 1985) is

CR = \frac{\sum_{i,j} \binom{n_{ij}}{2} - \binom{n}{2}^{-1} \sum_{i} \binom{n_{i\cdot}}{2} \sum_{j} \binom{n_{\cdot j}}{2}}{\frac{1}{2}\left[\sum_{i} \binom{n_{i\cdot}}{2} + \sum_{j} \binom{n_{\cdot j}}{2}\right] - \binom{n}{2}^{-1} \sum_{i} \binom{n_{i\cdot}}{2} \sum_{j} \binom{n_{\cdot j}}{2}}
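The CR can be computed directly from the contingency table of the two partitions; a small sketch with label arrays as input:

```python
import numpy as np
from math import comb

def corrected_rand(u, v):
    """Corrected (adjusted) Rand index between two partitions of the same
    objects, given as label arrays (Hubert & Arabie, 1985)."""
    u, v = np.asarray(u), np.asarray(v)
    n = len(u)
    # Contingency table: n_ij = number of objects in cluster u_i and v_j.
    table = np.array([[np.sum((u == a) & (v == b)) for b in np.unique(v)]
                      for a in np.unique(u)])
    sum_ij = sum(comb(int(x), 2) for x in table.ravel())
    sum_i = sum(comb(int(x), 2) for x in table.sum(axis=1))
    sum_j = sum(comb(int(x), 2) for x in table.sum(axis=0))
    expected = sum_i * sum_j / comb(n, 2)
    max_index = 0.5 * (sum_i + sum_j)
    return (sum_ij - expected) / (max_index - expected)

print(corrected_rand([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0: identical partitions
```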

  12. Conclusions and remarks • Clustering for mixed feature-type symbolic data based on the dynamic clustering methodology with adaptive distances. • The adaptive distances allow the method to recognize clusters of different shapes and sizes. • The method yields the best prototype of each cluster together with the best adaptive distance for that cluster.

  13. Comments • Advantages • The proposed framework provides a solution for clustering mixed feature-type symbolic data. • It also offers an alternative way to measure the similarity between a cluster and an input object for categorical data, via dynamic adaptive distances. • Drawbacks • A categorical attribute with a large value set may dominate the clustering after the attributes are transformed into histograms. • Hierarchical relationships among categorical values are not considered in this method. • Application • Mixed feature-type data clustering.
