
Unsupervised pattern recognition models for mixed feature-type symbolic data






Presentation Transcript


  1. Unsupervised pattern recognition models for mixed feature-type symbolic data Francisco de A.T. de Carvalho*, Renata M.C.R. de Souza PRL, Vol. 31, 2010, pp. 430–443. Presenter: Wei-Shen Tai, 2010/3/10

  2. Outline • Introduction • Dynamic clustering algorithms for mixed feature-type symbolic data • Cluster interpretation • Experimental evaluation • Concluding remarks • Comments

  3. Motivation • Partitioning dynamic clustering algorithms • None of the earlier dynamic clustering models can manage mixed feature-type symbolic data.

  4. Objective • Dynamic clustering methods for mixed feature-type symbolic data based on suitable adaptive squared Euclidean distances. • To homogenize the mixed feature-type symbolic data into histogram-valued symbolic data in a preprocessing step, so that all variables can be compared with a single distance.

  5. Partitioning dynamic clustering • Iterative two-step relocation algorithms • Construct clusters and identify a suitable representation, or prototype, for each cluster at each iteration. • Optimize a criterion that measures the fit between the clusters and their prototypes (see the criterion sketched below). • Adaptive dynamic clustering algorithm • The distances that compare objects with cluster prototypes may differ from one cluster to another.
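For concreteness, the optimized criterion in adaptive-distance dynamic clustering typically takes the following form (a sketch in our own notation, not copied verbatim from the paper): each object x_i in cluster C_k is compared with the cluster prototype g_k under a per-cluster weight vector λ_k,

J(P, G, \Lambda) = \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \lambda_k^j \, (x_i^j - g_k^j)^2, \qquad \prod_{j=1}^{p} \lambda_k^j = 1,

where the product-one constraint keeps the weights from collapsing to zero; in the global variant a single weight vector λ is shared by all clusters.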

  6. Data homogenization pre-processing • Set-valued and list-valued variables • Ordered list-valued variables • Interval-valued variables

  7. Interval-valued variables • X1 is the minimum and the maximum of the gross national product (in millions). • The set of elementary intervals is obtained from the sorted bounds of all observed intervals. • Country 1: X1 = [10, 30] • I1 = [10, 25] => l([10, 25] ∩ [10, 30]) / l([10, 30]) = 15/20 = 0.75 • I2 = [25, 30] => l([25, 30] ∩ [10, 30]) / l([10, 30]) = 5/20 = 0.25 • Q2 = 0.75 + 0.25 = 1.0 (the weights sum to one)
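A minimal sketch of this recoding (function names are ours, not the paper's) reproduces the slide's numbers:

```python
def elementary_intervals(intervals):
    """Build the elementary intervals of one interval-valued variable:
    sort all observed bounds and pair consecutive values."""
    bounds = sorted({b for interval in intervals for b in interval})
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

def interval_to_weights(interval, elem):
    """Recode an interval [a, b] as a weight vector over the elementary
    intervals: weight_j = l(I_j ∩ [a, b]) / l([a, b])."""
    a, b = interval
    weights = []
    for lo, hi in elem:
        overlap = max(0.0, min(hi, b) - max(lo, a))
        weights.append(overlap / (b - a))
    return weights

# Worked example from the slide: country 1 has X1 = [10, 30], and the
# elementary intervals include [10, 25] and [25, 30].
print(interval_to_weights((10, 30), [(10, 25), (25, 30)]))  # [0.75, 0.25]
```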

  8. Set-valued and list-valued variables • Domain A2 = {A = agriculture, C = chemistry, Co = commerce, E = engineering, En = energy, I = information} • Country 1: X2 = {A, Co} => weights over {A, C, Co, E, En, I} are (1/2, 0, 1/2, 0, 0, 0) • Ordered domain A9 = {worst, bad, fair, good, best} • Country 1: A9 = good => (0, 0, 0, 1, 1), a cumulative 0/1 encoding over the ordered categories
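The same homogenization can be sketched for nominal sets and ordered categories; the cumulative 0/1 encoding for ordered variables is inferred from the slide's "good" example, so treat it as an assumption:

```python
def set_to_weights(values, domain):
    """Nominal set-valued variable: spread unit mass uniformly over the
    categories the object actually takes."""
    share = 1.0 / len(values)
    return [share if c in values else 0.0 for c in domain]

def ordered_to_weights(value, ordered_domain):
    """Ordered list-valued variable: cumulative 0/1 vector that is 1 from
    the observed category onward, as in the slide's 'good' example."""
    idx = ordered_domain.index(value)
    return [1.0 if i >= idx else 0.0 for i in range(len(ordered_domain))]

print(set_to_weights({"A", "Co"}, ["A", "C", "Co", "E", "En", "I"]))
# [0.5, 0.0, 0.5, 0.0, 0.0, 0.0]
print(ordered_to_weights("good", ["worst", "bad", "fair", "good", "best"]))
# [0.0, 0.0, 0.0, 1.0, 1.0]
```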

  9. Squared adaptive Euclidean distances • Single squared adaptive Euclidean distance (global) • One weight vector is shared by all clusters. • Cluster squared adaptive Euclidean distances (local) • Each cluster has its own weight vector, different from one cluster to another.
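Both variants reduce to the same weighted distance computation; only where the weights come from differs. A small sketch (our own helper, not the paper's code):

```python
import numpy as np

def adaptive_sq_dist(x, g, lam):
    """Weighted squared Euclidean distance between a homogenized object x
    and a cluster prototype g, with one weight per component."""
    x, g, lam = (np.asarray(a, dtype=float) for a in (x, g, lam))
    return float(np.sum(lam * (x - g) ** 2))

# Global variant: a single weight vector `lam` shared by every cluster.
# Local variant: one weight vector `lams[k]` per cluster k.
```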

  10. Algorithm schema • Pre-processing step: data homogenization. • Initialization step: • Randomly choose a partition, or randomly choose K distinct objects of X as initial prototypes. • Step 1: definition of the best prototypes • With the partition and weights fixed, compute the prototype of each cluster that minimizes the criterion. • Step 2: definition of the best vector(s) of weights • With the partition and prototypes fixed, compute a single weight vector for all clusters (global) or one weight vector per cluster (local). • Step 3: definition of the best partition • Assign each object to the cluster whose prototype is closest under the adaptive distance. • Stopping criterion • No object changes its cluster. (A minimal sketch of the loop follows.)
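The following sketch illustrates the local variant of this loop on already-homogenized vectors. The weight update uses the standard product-one closed form from the adaptive-distance literature; the paper's exact update rules may differ, so read this as an illustration rather than the authors' implementation:

```python
import numpy as np

def dynamic_clustering(X, K, n_iter=100, seed=0):
    """Sketch of the adaptive dynamic clustering loop (local variant) on
    homogenized data X of shape (n objects, p components)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, p = X.shape
    protos = X[rng.choice(n, size=K, replace=False)].copy()  # initialization
    lams = np.ones((K, p))            # one weight vector per cluster
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Step 3 (allocation): nearest prototype under the adaptive distance.
        dists = np.stack([((X - protos[k]) ** 2 * lams[k]).sum(axis=1)
                          for k in range(K)], axis=1)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                     # stopping criterion: no object moved
        labels = new_labels
        for k in range(K):
            members = X[labels == k]
            if len(members) == 0:
                continue
            # Step 1 (representation): the mean minimizes the criterion.
            protos[k] = members.mean(axis=0)
            # Step 2 (weighting): closed form under prod(lams[k]) == 1.
            s = ((members - protos[k]) ** 2).sum(axis=0) + 1e-12
            lams[k] = np.prod(s) ** (1.0 / p) / s
    return labels, protos, lams
```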

  11. Experimental results • Measurement of the quality of the results • Overall error rate of classification (OERC) • Corrected Rand index (CR) • Let U = {u1, ..., ui, ..., uR} and V = {v1, ..., vj, ..., vC} be two partitions of the same data set having respectively R and C clusters. With n_ij = |u_i ∩ v_j|, marginals n_i. and n.j, and n objects in total, the corrected Rand index (Hubert and Arabie, 1985) is

CR = \frac{\sum_{i,j} \binom{n_{ij}}{2} - \binom{n}{2}^{-1} \sum_{i} \binom{n_{i\cdot}}{2} \sum_{j} \binom{n_{\cdot j}}{2}}{\frac{1}{2}\left[\sum_{i} \binom{n_{i\cdot}}{2} + \sum_{j} \binom{n_{\cdot j}}{2}\right] - \binom{n}{2}^{-1} \sum_{i} \binom{n_{i\cdot}}{2} \sum_{j} \binom{n_{\cdot j}}{2}}
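The CR can be computed directly from the contingency table of the two partitions; a small sketch with label arrays as input:

```python
import numpy as np
from math import comb

def corrected_rand(u, v):
    """Corrected (adjusted) Rand index between two partitions of the same
    objects, given as label arrays (Hubert & Arabie, 1985)."""
    u, v = np.asarray(u), np.asarray(v)
    n = len(u)
    # Contingency table: n_ij = number of objects in cluster u_i and v_j.
    table = np.array([[np.sum((u == a) & (v == b)) for b in np.unique(v)]
                      for a in np.unique(u)])
    sum_ij = sum(comb(int(x), 2) for x in table.ravel())
    sum_i = sum(comb(int(x), 2) for x in table.sum(axis=1))
    sum_j = sum(comb(int(x), 2) for x in table.sum(axis=0))
    expected = sum_i * sum_j / comb(n, 2)
    max_index = 0.5 * (sum_i + sum_j)
    return (sum_ij - expected) / (max_index - expected)

print(corrected_rand([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0: identical partitions
```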

  12. Conclusions and remarks • Clustering for mixed feature-type symbolic data based on the dynamic clustering methodology with adaptive distances. • The adaptive distances allow the method to recognize clusters of different shapes and sizes. • The method yields the best prototype of each cluster together with the best adaptive distance for that cluster.

  13. Comments • Advantages • The proposed framework provides a solution for clustering mixed feature-type symbolic data. • It also offers an alternative way to measure the similarity between a cluster and an input object for categorical data, via dynamic adaptive distances. • Drawbacks • A categorical attribute with a large value set may dominate the clustering after the attributes are transformed into histograms. • Hierarchical relationships among categorical values are not considered in this method. • Application • Mixed feature-type data clustering.
