
Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications




Presentation Transcript


  1. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications Authors: Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, Prabhakar Raghavan

  2. Schedule • (1). Introduction • (2). Motivation • (3). Contributions of the paper • (4). Subspace clustering • (5). CLIQUE (CLustering In QUEst) • (6). Performance experiments • (7). Conclusions

  3. Introduction • (1). Clustering is a descriptive task that seeks to identify homogeneous groups of objects based on their attributes (dimensions). • (2). Clustering techniques have been studied in statistics (multivariate analysis: cluster analysis), machine learning, and databases (CLARANS, BIRCH, DBSCAN). • (3). Clustering techniques: (a) partitional clustering: the K-Means method, CLARANS, BIRCH, DBSCAN; (b) hierarchical clustering.

  4. Motivation • The need for developing new algorithms • (1). Effective treatment of high dimensionality: to effectively extract information from a huge amount of data in databases. In other words, the running time of algorithms must be predictable and usable on large databases. • (2). Interpretability of results: users expect clustering results on high dimensional data to be interpretable and comprehensible. • (3). Scalability and usability: many clustering algorithms do not work well on a large database that may contain millions of objects, and clustering on a sample of a given data set may lead to biased results. In other words, the clustering technique should be fast, should scale with the number of dimensions and the size of the input, and should be insensitive to the order of the input data.

  5. Contributions of the paper • (1). CLIQUE satisfies the above desiderata (effectiveness, interpretability, scalability, and usability). • (2). CLIQUE automatically finds subspaces with high-density clusters. • (3). CLIQUE generates a minimal description for each cluster as a DNF expression. • (4). Empirical evaluation shows that CLIQUE scales linearly with the number of input records and has good scalability as the number of dimensions and the dimensionality of the hidden clusters grow.

  6. Subspace Clustering • What is (a) a unit, (b) a dense unit, (c) a cluster, (d) a minimal description of a cluster? • In Figure 1, the two-dimensional space (age, salary) has been partitioned by a 10×10 grid, ξ = 10. • The unit u = (30 ≤ age < 35) ∧ (1 ≤ salary < 2). • A and B are both regions: • A = (30 ≤ age < 35) ∧ (4 ≤ salary < 8) • B = (40 ≤ age < 60) ∧ (2 ≤ salary < 6) • Assuming the dense units have been shaded, A ∪ B is a cluster (A and B are connected regions), and A ∩ B is not a maximal region. • The minimal description for this cluster A ∪ B is the DNF expression: ((30 ≤ age < 35) ∧ (4 ≤ salary < 8)) ∨ ((40 ≤ age < 60) ∧ (2 ≤ salary < 6))

  7. In Figure 2, assume τ = 20% (the density threshold). • If selectivity(u) > τ, then u is a dense unit, where the selectivity is the fraction of the total data points contained in the unit. • No 2-dimensional unit is dense, and there are no clusters in the original data space. • But if the points are projected on the salary dimension, there are three 1-dimensional dense units, and there are two clusters in the 1-dimensional salary subspace, C′ = (5 ≤ salary < 7) and D′ = (2 ≤ salary < 3). • But there is no dense unit and no cluster in the 1-dimensional age subspace.

  8. The CLIQUE Algorithm • CLIQUE consists of the following three steps: • (1). Identification of subspaces that contain clusters. • (2). Identification of clusters. • (3). Generation of minimal descriptions for the clusters.

  9. (1). Identification of subspaces that contain clusters • The difficulty in identifying subspaces that contain clusters lies in finding dense units in different subspaces. • (1). We use a bottom-up algorithm to find dense units; it exploits the monotonicity of the clustering criterion with respect to dimensionality to prune the search space. • Lemma 1 (monotonicity): if a k-dimensional unit is dense, then so are its projections in (k−1)-dimensional space. • The bottom-up algorithm proceeds as follows: • (a). Determine the 1-dimensional dense units and self-join them to get the candidate 2-dimensional units. In general, having the (k−1)-dimensional dense units D_{k−1}, we self-join D_{k−1} to get the candidate k-dimensional units C_k. • (b). We discard those candidate units from C_k that have a (k−1)-dimensional projection not included in D_{k−1}. • (2). The bottom-up algorithm is made faster with MDL-based pruning.
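The join-and-prune step above can be sketched as follows. This is a simplified illustration, not the paper's implementation: units are represented as frozensets of (dimension, interval-index) pairs, and the join pairs any two units that differ in exactly one dimension rather than using the paper's ordered join on the first k−2 dimensions.

```python
# Sketch of the bottom-up candidate generation (Apriori-style):
# self-join the (k-1)-dimensional dense units, then prune every
# candidate that has a non-dense (k-1)-dimensional projection (Lemma 1).
from itertools import combinations

def candidates(dense_prev):
    """dense_prev: set of (k-1)-dim dense units, each a frozenset of
    (dimension, interval) pairs. Returns the candidate k-dim units."""
    prev = set(dense_prev)
    cands = set()
    for a, b in combinations(prev, 2):
        u = a | b
        if len(u) == len(a) + 1 and len({d for d, _ in u}) == len(u):
            cands.add(u)  # a and b agree everywhere but one dimension
    # prune: every (k-1)-dim projection of a candidate must be dense
    return {u for u in cands
            if all(u - {p} in prev for p in u)}

d1 = {frozenset({("age", 3)}), frozenset({("salary", 5)}),
      frozenset({("salary", 6)})}
print(candidates(d1))  # the two (age, salary) combinations
```

Note that the two salary units cannot join with each other (same dimension), which is exactly why the dimension-count check is needed.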

  10. (2). Finding clusters • The input to this step of CLIQUE is a set of dense units D, all in the same k-dimensional space. The output is a partition of D into D_1, D_2, …, D_q such that all units in each D_i are connected and no two units u_i ∈ D_i, u_j ∈ D_j with i ≠ j are connected. • Depth-first search algorithm • We use a depth-first search algorithm to find the connected components of the graph: starting with some unit u in D, assign it the first cluster number and find all the units it is connected to; then, if there are still units in D that have not yet been visited, we pick one and repeat the procedure.
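The depth-first search over dense units can be sketched as below. As an assumption for illustration, units are grid cells given as integer coordinate tuples, and two units are connected when they share a common face (their coordinates differ by 1 in exactly one dimension).

```python
# Sketch: partition a set of dense units into connected components
# via depth-first search, as in the finding-clusters step.
def find_clusters(units):
    units = set(units)
    def neighbors(u):
        for i in range(len(u)):
            for d in (-1, 1):
                v = u[:i] + (u[i] + d,) + u[i + 1:]
                if v in units:
                    yield v
    clusters, seen = [], set()
    for start in sorted(units):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(neighbors(u))
        seen |= comp
        clusters.append(comp)
    return clusters

grid = [(0, 0), (0, 1), (1, 1), (5, 5)]
print(find_clusters(grid))  # two connected components
```

An explicit stack is used instead of recursion so the sketch also works on large unit sets without hitting Python's recursion limit.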

  11. (3). Generating minimal cluster descriptions • The input to this step consists of disjoint clusters in a k-dimensional subspace. The goal is to generate a minimal description of each cluster in two steps: (1) covering with maximal regions, and (2) minimal cover. • (1). Covering with maximal regions • We begin with an arbitrary dense unit u_1 ∈ C and greedily grow a maximal region R_1 that covers u_1; we add R_1 to R. Then we find another unit u_2 ∈ C that is not yet covered by any maximal region in R, and greedily grow a maximal region R_2 that covers u_2. We repeat this procedure until all units in C are covered by some maximal region in R.

  12. How to obtain a maximal region covering a dense unit U
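The growth procedure the slide asks about can be sketched as follows. This is our illustrative take, assuming units are integer grid cells in a set of dense units: the region is extended dimension by dimension, first leftward then rightward, and an extension is kept only while every cell it adds to the region is dense. The result depends on the order in which dimensions are processed, which is why the region obtained is maximal but not unique.

```python
# Sketch: greedily grow a maximal axis-aligned region from one
# dense unit, extending along each dimension while the whole
# rectangle stays dense.
from itertools import product

def grow_maximal_region(u, dense):
    lo, hi = list(u), list(u)          # inclusive bounds per dimension
    for i in range(len(u)):
        for bound, step in ((lo, -1), (hi, 1)):
            while True:
                bound[i] += step
                cells = product(*(range(l, h + 1)
                                  for l, h in zip(lo, hi)))
                if not all(c in dense for c in cells):
                    bound[i] -= step   # last extension broke density
                    break
    return tuple(lo), tuple(hi)

dense = {(0, 0), (1, 0), (2, 0), (0, 1), (1, 1)}
print(grow_maximal_region((0, 0), dense))  # ((0, 0), (2, 0))
```

Here growing along dimension 0 first absorbs (2, 0), after which dimension 1 cannot be extended because (2, 1) is not dense; processing the dimensions in the other order would yield the 2×2 region instead.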

  13. (2). Minimal cover • The last step of CLIQUE takes as input a cover for each cluster and finds a minimal cover. • Minimality is defined in terms of the number of maximal regions required to cover the cluster. • Removal heuristic: • (1). Remove from the cover the smallest (in number of units) maximal region that is redundant (i.e., every unit in it is also contained in some other maximal region). • (2). Break ties arbitrarily. • (3). Repeat the procedure until no maximal region can be removed.
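The removal heuristic can be sketched directly from the three rules above. As a simplifying assumption, each maximal region is represented just as the set of dense units it covers, abstracting away the region geometry.

```python
# Sketch of the removal heuristic: repeatedly drop the smallest
# redundant region (one whose every unit is also covered by some
# other remaining region) until none is redundant.
def minimal_cover(regions):
    regions = [set(r) for r in regions]
    while True:
        redundant = [r for r in regions
                     if all(any(u in s for s in regions if s is not r)
                            for u in r)]
        if not redundant:
            return regions
        # smallest in number of units; min() breaks ties arbitrarily
        regions.remove(min(redundant, key=len))

cover = [{1, 2, 3, 4}, {3, 4, 5}, {4, 5}]
print([sorted(r) for r in minimal_cover(cover)])  # {4, 5} is dropped
```

In the example, {4, 5} is redundant (4 and 5 are each covered elsewhere) and smallest, so it is removed; the remaining two regions each contain a unit covered by no other region, so the loop stops.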

  14. Performance Experiments • We now empirically evaluate CLIQUE using synthetic data (generator from M. Zait and H. Messatfa, "A comparative study of clustering methods"). • The goals of the experiments are to assess the efficiency and accuracy of CLIQUE: • Efficiency: determine how the running time scales with • - the dimensionality of the data space, • - the dimensionality of the clusters, • - the size of the data. • Accuracy: test whether CLIQUE recovers known clusters in some subspaces of a high dimensional data space.

  15. Synthetic data results

  16. Accuracy: comparisons with BIRCH, DBSCAN, and SVD • We used clusters embedded in 5-dimensional subspaces while varying the dimensionality of the space from 5 to 50. CLIQUE was able to recover all clusters in every case.

  17. Real data results

  18. Conclusions • (1). The problem of high dimensionality is often tackled by requiring the user to specify the subspace for cluster analysis, but user identification of subspaces is quite error-prone. CLIQUE can find clusters embedded in subspaces of high dimensional data without requiring the user to guess subspaces that might have interesting clusters. • (2). CLIQUE generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. • (3). CLIQUE is insensitive to the order of input records. Some clustering algorithms are sensitive to the order of the input data; for example, the same set of data, input in different orderings, may generate dramatically different clusters. • (4). CLIQUE does not presume any canonical data distribution, which is very useful in applications.

  19. (5). Empirical evaluation shows that CLIQUE scales linearly with the size of the input and has good scalability as the number of dimensions in the data grows. • (6). CLIQUE can accurately discover clusters embedded in lower dimensional subspaces.
