
δ-Clusters: Capturing Subspace Correlation in a Large Data Set

δ-Clusters: Capturing Subspace Correlation in a Large Data Set. Authors: Jiong Yang, Wei Wang, et al. (ICDE 2002). Presenter: Xuehua Shen (xshen@uiuc.edu). Presentation layout: overview of clustering, related work on δ-clusters, the δ-clusters model, and the FLOC algorithm.


Presentation Transcript


  1. δ-Clusters: Capturing Subspace Correlation in a Large Data Set • Authors: Jiong Yang, Wei Wang, et al. (ICDE 2002) • Presenter: Xuehua Shen, xshen@uiuc.edu • Data Mining: Concepts and Techniques

  2. Presentation Layout • Overview of Clustering • Related Work on δ-Clusters • δ-Clusters Model • FLOC Algorithm

  3. Clustering • Clustering: the process of grouping a set of objects into classes of similar objects • Objects are similar to one another within the same cluster • Objects are dissimilar to the objects in other clusters

  4. Major Clustering Methods • Partitioning algorithms • Hierarchical algorithms • Density-based • Grid-based • Model-based

  5. Similarity • Clustering: the process of grouping a set of objects into classes of similar objects • But how to define similarity?

  6. Similarity cont. • Traditional clustering model: based on distance functions • A popular one is the Minkowski distance: $d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q \right)^{1/q}$, where $i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \dots, x_{jp})$ are two p-dimensional data objects and q is a positive integer • But strong correlations may still exist among a set of objects even if they are far apart from each other as measured by the distance function
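
To make the last point of slide 6 concrete, here is a minimal sketch in plain Python. The helper names `minkowski` and `pearson` and the sample vectors are illustrative assumptions, not taken from the slides. It shows two objects that are far apart under the Minkowski distance yet perfectly correlated.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# Hypothetical data: b has exactly the same shape as a, shifted up by 100.
a = [1.0, 5.0, 3.0, 8.0]
b = [v + 100.0 for v in a]

manhattan = minkowski(a, b, 1)   # q = 1
euclidean = minkowski(a, b, 2)   # q = 2
correlation = pearson(a, b)      # close to 1 despite the large distance
```

With q = 1 the function gives the Manhattan distance and with q = 2 the Euclidean distance; both are large here even though the two objects rise and fall together.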

  7. Similarity cont. • δ-Clusters model: objects are similar when they exhibit a coherent pattern on a subset of dimensions • Can cluster objects that show a shifting pattern or a scaling pattern

  8. Similarity cont. • Examples of coherent patterns: shifting pattern, scaling pattern
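
Since the figures for slide 8 did not survive in the transcript, a small sketch may help. It assumes that a shifting pattern means two objects differ by a constant offset on the chosen dimensions, and that a scaling pattern means they differ by a constant factor. The function names and the tolerance `tol` are hypothetical.

```python
def is_shifted(x, y, tol=1e-9):
    """True if y equals x plus one constant offset on every dimension."""
    diffs = [b - a for a, b in zip(x, y)]
    return max(diffs) - min(diffs) <= tol


def is_scaled(x, y, tol=1e-9):
    """True if y equals x times one constant factor (x must be nonzero)."""
    ratios = [b / a for a, b in zip(x, y)]
    return max(ratios) - min(ratios) <= tol


base = [1.0, 5.0, 3.0]
shifted = [11.0, 15.0, 13.0]   # base + 10 on every dimension
scaled = [3.0, 15.0, 9.0]      # base * 3 on every dimension
```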

  9. Subspace Clustering • From high-dimensional clustering (problematic) to subspace clustering • Not restricted to a fixed ordering of columns, in contrast to patterns in time-series data • Challenge: curse of dimensionality!

  10. Subspace Clustering cont. • Example of subspace clustering

  11. Applications • Microarray Data Analysis in Biology • E-Commerce

  12. Microarray Data Analysis • Matrix (dense) • Rows: genes • Columns: various samples (experimental conditions or tissues) • Values in the matrix: expression level, that is, the relative abundance of the mRNA of a gene under a specific condition

  13. Microarray Data Analysis cont. • From scaling pattern to shifting pattern • Red: gene of interest, green: control gene • Investigations show that several genes contribute to a disease, which motivates researchers to identify a subset of genes whose expression levels rise and fall coherently under a subset of conditions
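
The slide does not say how a scaling pattern is turned into a shifting one. A common way, assumed here, is to take logarithms, since log(k·x) = log(k) + log(x). The sketch below uses hypothetical expression values and a hypothetical helper `log_transform`.

```python
import math


def log_transform(row):
    """Natural log of every (strictly positive) expression value."""
    return [math.log(v) for v in row]


# Hypothetical expression levels: the second gene is 3 times the first.
gene_a = [2.0, 8.0, 4.0]
gene_b = [3.0 * v for v in gene_a]

# After the transform, the constant factor 3 becomes a constant offset log(3).
offsets = [b - a for a, b in zip(log_transform(gene_a), log_transform(gene_b))]
```

Every entry of `offsets` equals log(3) up to floating-point rounding, so a model that captures shifting patterns also covers scaling patterns on log-transformed data.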

  14. E-Commerce • Example: movie ratings (1: lowest rating, 10: highest rating) • Shifting pattern • If, for a new movie, the 1st viewer rates it 7 and the 3rd viewer rates it 9, the 2nd viewer will probably like this movie too
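
The rating table from this slide is not in the transcript, so the sketch below uses made-up ratings in which the three viewers follow a shifting pattern. The function `predict_from_shift` is a hypothetical illustration of how such a pattern could be used for prediction, not a method described in the slides.

```python
def predict_from_shift(history, target, new_ratings):
    """Estimate the target viewer's rating of a new movie.

    history: viewer -> ratings of the same earlier movies (assumed data).
    new_ratings: viewer -> rating of the new movie, for the other viewers.
    Each other viewer's new rating is adjusted by the average offset between
    the target and that viewer on the earlier movies.
    """
    estimates = []
    for viewer, rating in new_ratings.items():
        offsets = [t - o for t, o in zip(history[target], history[viewer])]
        estimates.append(rating + sum(offsets) / len(offsets))
    return sum(estimates) / len(estimates)


# Hypothetical earlier ratings: viewer 2 is always 1 above viewer 1
# and 1 below viewer 3.
history = {
    "viewer1": [2, 5, 4],
    "viewer2": [3, 6, 5],
    "viewer3": [4, 7, 6],
}
predicted = predict_from_shift(history, "viewer2", {"viewer1": 7, "viewer3": 9})
```

With these assumed ratings the estimate for the 2nd viewer is 8, which agrees with the slide's conclusion that this viewer will probably like the movie.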

  15. Presentation Layout • Overview of Clustering • Related Work on δ-Clusters • δ-Clusters Model • FLOC Algorithm

  16. Related Work • CLIQUE, ORCLUS, PROCLUS (subspace clustering) • Can capture neither the shifting pattern nor the scaling pattern • Bicluster model: proposed as a measure of the coherence of genes and conditions in a submatrix of a DNA array

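  17. Bicluster • Model: mean squared residue score of a submatrix, $H(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$, where $a_{iJ}$, $a_{Ij}$ and $a_{IJ}$ are the row mean, column mean and overall mean of the submatrix; a submatrix $A_{IJ}$ is called a δ-bicluster if $H(I, J) \le \delta$ • Algorithm: a randomized algorithm that gives an approximate answer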
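
As a rough illustration of the bicluster score, the sketch below computes the mean squared residue of a submatrix with no missing values. It assumes the usual definition, in which each entry is compared against its row mean, column mean and the overall mean. The function name and the sample matrices are hypothetical.

```python
def mean_squared_residue(a, rows, cols):
    """Mean squared residue H(I, J) of the submatrix a[rows][cols]."""
    n = len(rows) * len(cols)
    row_mean = {i: sum(a[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(a[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(a[i][j] for i in rows for j in cols) / n
    return sum(
        (a[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
        for i in rows
        for j in cols
    ) / n


# A perfect shifting pattern: every row is the first row plus a constant.
coherent = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0], [5.0, 6.0, 7.0]]
# Two rows that move in opposite directions.
incoherent = [[1.0, 2.0], [2.0, 1.0]]

h_coherent = mean_squared_residue(coherent, [0, 1, 2], [0, 1, 2])
h_incoherent = mean_squared_residue(incoherent, [0, 1], [0, 1])
```

The coherent submatrix scores (numerically) zero, so it is a δ-bicluster for any δ ≥ 0, while the incoherent one scores 0.25.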

  18. Weakness of Bicluster • Missing Values • Constraints

  19. Presentation Layout • Overview • Related Work • δ-Clusters Model • FLOC Algorithm

  20. Occupancy Threshold • A parameter α that controls the percentage of missing values in a submatrix: $|J'_i| / |J| \ge \alpha$ for every object i • $|J'_i|$ is the number of specified attributes for object i in the δ-cluster • $|J|$ is the number of attributes in the δ-cluster

  21. Occupancy Threshold cont. • A similar occupancy threshold applies to every attribute j in the δ-cluster • Example: α = 0.6
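
A minimal sketch of the occupancy check follows, assuming missing values are stored as `None` and that the same threshold applies to objects (rows) and attributes (columns). The function name `satisfies_occupancy` and the sample matrix are illustrative.

```python
def satisfies_occupancy(d, rows, cols, alpha):
    """True if every row and every column of the submatrix has at least
    a fraction alpha of its entries specified (not None)."""
    for i in rows:
        specified = sum(1 for j in cols if d[i][j] is not None)
        if specified / len(cols) < alpha:
            return False
    for j in cols:
        specified = sum(1 for i in rows if d[i][j] is not None)
        if specified / len(rows) < alpha:
            return False
    return True


# Hypothetical matrix: each row and each column has 2 of 3 entries specified.
d = [
    [1.0, None, 3.0],
    [4.0, 5.0, None],
    [None, 8.0, 9.0],
]
ok_at_06 = satisfies_occupancy(d, [0, 1, 2], [0, 1, 2], alpha=0.6)
ok_at_07 = satisfies_occupancy(d, [0, 1, 2], [0, 1, 2], alpha=0.7)
```

Here every row and column has an occupancy of 2/3, so the submatrix passes with α = 0.6 but fails with α = 0.7.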

  22. Volume • The volume of a δ-cluster (I, J) is the number of specified entries $d_{ij}$ in (I, J) • Example: the volume is 3 × 3 = 9
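
Counting the volume is straightforward; the sketch below again assumes that `None` marks an unspecified entry, and the name `volume` is hypothetical.

```python
def volume(d, rows, cols):
    """Number of specified (not None) entries in the submatrix."""
    return sum(1 for i in rows for j in cols if d[i][j] is not None)


full = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
with_gap = [[1, 2, 3], [4, None, 6], [7, 8, 9]]

v_full = volume(full, [0, 1, 2], [0, 1, 2])      # all 3 x 3 entries specified
v_gap = volume(with_gap, [0, 1, 2], [0, 1, 2])   # one entry missing
```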

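  23. Base • Object base: $d_{iJ}$, the average of the specified entries of object i over the attributes in J • Attribute base: $d_{Ij}$, the average of the specified entries of attribute j over the objects in I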

  24. Base cont. • δ-cluster base: $d_{IJ}$, the average of all specified entries in (I, J) • For a perfect δ-cluster: $d_{ij} = d_{iJ} + d_{Ij} - d_{IJ}$
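
The three bases can be sketched as plain averages over the specified entries. This is an assumption about the formulas, which were images on the original slides; the names `bases`, `obj_base`, `attr_base` and `cluster_base` are illustrative, and every row and column is assumed to have at least one specified entry.

```python
def bases(d, rows, cols):
    """Return (object bases, attribute bases, cluster base) of a submatrix,
    averaging only over specified (not None) entries."""
    obj_base = {}
    for i in rows:
        vals = [d[i][j] for j in cols if d[i][j] is not None]
        obj_base[i] = sum(vals) / len(vals)
    attr_base = {}
    for j in cols:
        vals = [d[i][j] for i in rows if d[i][j] is not None]
        attr_base[j] = sum(vals) / len(vals)
    all_vals = [d[i][j] for i in rows for j in cols if d[i][j] is not None]
    cluster_base = sum(all_vals) / len(all_vals)
    return obj_base, attr_base, cluster_base


# A perfect shifting pattern with no missing values (hypothetical data).
d = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0], [5.0, 6.0, 7.0]]
rows, cols = [0, 1, 2], [0, 1, 2]
obj_base, attr_base, cluster_base = bases(d, rows, cols)

# In a perfect cluster each entry is object base + attribute base - cluster base.
is_perfect = all(
    abs(d[i][j] - (obj_base[i] + attr_base[j] - cluster_base)) < 1e-9
    for i in rows
    for j in cols
)
```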

  25. Residue • Entry residue: $r_{ij} = d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}$ if $d_{ij}$ is specified; otherwise $r_{ij} = 0$

  26. Residue cont. • δ-cluster residue: the mean of the absolute entry residues $|r_{ij}|$ over the specified entries of (I, J) • A δ-cluster is an r-residue δ-cluster if its residue is less than or equal to r
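
Putting the pieces of slides 22 to 26 together, the sketch below computes a cluster residue as the mean absolute entry residue over the specified entries. Using the arithmetic mean of absolute values is an assumption here (other averages are possible), and the names `cluster_residue` and `is_r_residue` are hypothetical.

```python
def cluster_residue(d, rows, cols):
    """Mean absolute entry residue over the specified entries of (rows, cols).
    Unspecified entries (None) contribute a residue of 0 and no volume."""
    spec = [(i, j) for i in rows for j in cols if d[i][j] is not None]
    vol = len(spec)
    d_IJ = sum(d[i][j] for i, j in spec) / vol
    d_iJ = {}
    for i in rows:
        vals = [d[i][j] for j in cols if d[i][j] is not None]
        d_iJ[i] = sum(vals) / len(vals)
    d_Ij = {}
    for j in cols:
        vals = [d[i][j] for i in rows if d[i][j] is not None]
        d_Ij[j] = sum(vals) / len(vals)
    total = sum(abs(d[i][j] - d_iJ[i] - d_Ij[j] + d_IJ) for i, j in spec)
    return total / vol


def is_r_residue(d, rows, cols, r):
    """True if the cluster's residue does not exceed the threshold r."""
    return cluster_residue(d, rows, cols) <= r


coherent = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0], [5.0, 6.0, 7.0]]
incoherent = [[1.0, 2.0], [2.0, 1.0]]

res_coherent = cluster_residue(coherent, [0, 1, 2], [0, 1, 2])
res_incoherent = cluster_residue(incoherent, [0, 1], [0, 1])
```

The lower the residue, the more coherent the cluster; the perfectly shifted matrix scores (numerically) zero while the incoherent one scores 0.5.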

  27. Presentation Layout • Overview of Clustering • Related Work on δ-Clusters • δ-Clusters Model • FLOC Algorithm (Flexible Overlapping Clustering)

  28. Flow Chart • Generate initial clusters • Determine the best action for each row and each column • Perform the best actions sequentially • Improved? Y: determine the best actions again; N: stop

  29. Initial Clusters • Randomly generate k initial clusters • Different parameter values produce clusters of different sizes
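
One way to generate random initial clusters is sketched below. It assumes that each row and each column joins a cluster independently with some probability, so that the probability controls the expected cluster size. The parameter name `p`, the function name and the `seed` argument are hypothetical, since the slide does not name them.

```python
import random


def initial_clusters(n_rows, n_cols, k, p, seed=None):
    """Generate k random clusters; each row and column is included in a
    cluster independently with probability p (assumed scheme)."""
    rng = random.Random(seed)
    clusters = []
    for _ in range(k):
        rows = {i for i in range(n_rows) if rng.random() < p}
        cols = {j for j in range(n_cols) if rng.random() < p}
        clusters.append((rows, cols))
    return clusters


seeds = initial_clusters(n_rows=6, n_cols=4, k=3, p=0.5, seed=0)
everything = initial_clusters(n_rows=6, n_cols=4, k=2, p=1.0)
nothing = initial_clusters(n_rows=6, n_cols=4, k=2, p=0.0)
```

A larger `p` yields larger initial clusters on average; `p = 1.0` includes every row and column, and `p = 0.0` includes none.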

  30. Choose Best Actions • For every object or attribute there are k possible actions, one per cluster • Choose the best action among the k candidates according to gain • Gain is the difference between the original residue and the residue the cluster would have if the action were performed

  31. Choose Best Actions cont. • Even if the gain is negative, the action is sometimes still performed in order to reach the global optimum
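
The gain computation from slides 30 and 31 can be sketched as follows, under several simplifying assumptions: an action toggles the membership of a row in a cluster, the matrix has no missing values, and the residue is the mean absolute entry residue. The names `residue`, `toggle` and `best_row_action` are hypothetical.

```python
def residue(d, rows, cols):
    """Mean absolute entry residue of a fully specified submatrix.
    An empty cluster is given residue 0 (a simplification)."""
    if not rows or not cols:
        return 0.0
    n = len(rows) * len(cols)
    d_iJ = {i: sum(d[i][j] for j in cols) / len(cols) for i in rows}
    d_Ij = {j: sum(d[i][j] for i in rows) / len(rows) for j in cols}
    d_IJ = sum(d[i][j] for i in rows for j in cols) / n
    return sum(
        abs(d[i][j] - d_iJ[i] - d_Ij[j] + d_IJ) for i in rows for j in cols
    ) / n


def toggle(members, x):
    """Remove x if it is a member, otherwise add it."""
    return members - {x} if x in members else members | {x}


def best_row_action(d, clusters, row):
    """Among the k candidate actions for this row (one per cluster),
    return (cluster index, gain) with the largest gain."""
    best = None
    for c, (rows, cols) in enumerate(clusters):
        gain = residue(d, rows, cols) - residue(d, toggle(rows, row), cols)
        if best is None or gain > best[1]:
            best = (c, gain)
    return best


# Hypothetical data: rows 0 and 1 are coherent, row 2 is not.
d = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0], [5.0, 9.0, 1.0]]
clusters = [({0, 1, 2}, {0, 1, 2}), ({0, 1}, {0, 1, 2})]
cluster_index, gain = best_row_action(d, clusters, row=2)
```

Removing row 2 from the first cluster lowers its residue to zero, so that action has a positive gain, while adding row 2 to the second cluster would have a negative gain. Note that the best action is returned even when its gain is negative, in line with slide 31.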

  32. Do the Actions Sequentially • Generate the action sequence: 1) the same order in all iterations 2) random order 3) weighted random order
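
The three ordering strategies could look like the sketch below. The weighting scheme for the third strategy is an assumption made for illustration (actions with a larger gain are more likely to be drawn earlier); the slide does not specify it. The function name and strategy labels are hypothetical.

```python
import random


def order_actions(gains, strategy, rng):
    """Return the order in which to perform actions 0..len(gains)-1."""
    ids = list(range(len(gains)))
    if strategy == "fixed":          # 1) same order in all iterations
        return ids
    if strategy == "random":         # 2) uniformly random order
        rng.shuffle(ids)
        return ids
    if strategy == "weighted":       # 3) weighted random order (assumed weights)
        if not ids:
            return []
        low = min(gains)
        # Shift gains so that every weight is strictly positive.
        weights = {a: gains[a] - low + 1e-9 for a in ids}
        order = []
        while ids:
            pick = rng.choices(ids, weights=[weights[a] for a in ids])[0]
            order.append(pick)
            ids.remove(pick)
        return order
    raise ValueError("unknown strategy: " + strategy)


rng = random.Random(0)
gains = [0.4, -0.1, 0.0, 1.2]   # hypothetical gains of four actions
fixed = order_actions(gains, "fixed", rng)
shuffled = order_actions(gains, "random", rng)
weighted = order_actions(gains, "weighted", rng)
```

All three strategies return a permutation of the same actions; they differ only in which actions get to modify the clusters first.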

  33. Output the Best Clusters • When, after some iterations, the minimum residue no longer improves, the algorithm stops and outputs the k best clusters

  34. End • Thank you!
