
Subspace Clustering/Biclustering



Presentation Transcript


  1. Subspace Clustering/Biclustering CS 685: Special Topics in Data Mining Spring 2008 Jinze Liu

  2. Data Mining: Clustering K-means clustering minimizes the within-cluster sum of squared distances Σ_{k=1..K} Σ_{x ∈ C_k} ||x − μ_k||², where μ_k is the centroid (mean) of cluster C_k
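The k-means objective, the within-cluster sum of squared distances to centroids, can be computed directly; a minimal sketch (all names are illustrative, not from the slides):

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid.

    X: (n, d) data matrix; labels: (n,) cluster index per point;
    centroids: (k, d) cluster centers. Names are illustrative.
    """
    return float(np.sum((X - centroids[labels]) ** 2))

# Two well-separated 1-D clusters: objective is 0 when centroids sit on the points.
X = np.array([[0.0], [0.0], [10.0], [10.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0], [10.0]])
print(kmeans_objective(X, labels, centroids))  # → 0.0
```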

  3. Clustering by Pattern Similarity (p-Clustering) • The microarray “raw” data shows 3 genes and their values in a multi-dimensional space • In a parallel coordinates plot, their patterns are difficult to see • This motivates “non-traditional” clustering

  4. Clusters Are Clear After Projection

  5. Motivation • E-Commerce: collaborative filtering

  6. Motivation

  7. Motivation

  8. Motivation

  9. Gene Expression Data

  10. Biclustering of Gene Expression Data • Genes not regulated under all conditions • Genes regulated by multiple factors/processes concurrently • Key to determine function of genes • Key to determine classification of conditions

  11. Motivation • DNA microarray analysis

  12. Motivation

  13. Motivation • Strong coherence is exhibited by the selected objects on the selected attributes. • They are not necessarily close to each other but rather bear a constant shift. • Object/attribute bias • bi-cluster

  14. Challenges • The set of objects and the set of attributes are usually unknown. • Different objects/attributes may possess different biases and such biases • may be local to the set of selected objects/attributes • are usually unknown in advance • May have many unspecified entries

  15. What’s Biclustering? • Given an n × m matrix A, find a set of submatrices Bk such that the contents of each Bk follow a desired pattern • Row/column order need not be consistent between different Bk

  16. Bipartite Graphs • A matrix can be thought of as a bipartite graph: rows are one set of vertices L, columns are another set R • Edges are weighted by the corresponding entries in the matrix • If all weights are binary, biclustering becomes biclique finding
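The binary special case on this slide can be made concrete: in the bipartite graph of a 0/1 matrix, a bicluster is an all-ones submatrix, i.e. a biclique. A minimal sketch (the function name is illustrative):

```python
def is_biclique(A, rows, cols):
    """In the bipartite graph of a 0/1 matrix A (rows = left vertices L,
    columns = right vertices R, edge iff A[i][j] == 1), a biclique is a
    row/column subset whose induced submatrix is all ones."""
    return all(A[i][j] == 1 for i in rows for j in cols)

A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
print(is_biclique(A, [0, 1], [0, 1]))  # True
print(is_biclique(A, [0, 2], [0, 2]))  # False (A[0][2] == 0)
```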

  17. Bicluster Structures

  18. Previous Work • Subspace clustering • Identifying a set of objects and a set of attributes such that the set of objects are physically close to each other on the subspace formed by the set of attributes. • Collaborative filtering: Pearson R • Only considers global offset of each object/attribute.

  19. bi-cluster • Consists of a (sub)set of objects and a (sub)set of attributes • Corresponds to a submatrix • Occupancy threshold • Each object/attribute has to be filled by a certain percentage. • Volume: number of specified entries in the submatrix • Base: average value of each object/attribute (in the bi-cluster)

  20. bi-cluster

  21. Bicluster Structures

  22. Cheng and Church • Example: • Correlation between any two columns = correlation between any two rows = 1. • a_ij = a_iJ + a_Ij − a_IJ, where a_iJ = mean of row i, a_Ij = mean of column j, a_IJ = overall mean of A. • Biological meaning: the genes have the same (amount of) response to the conditions.
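The additive model on this slide implies zero mean squared residue for a perfect bicluster; a small numeric sketch (the function name is an assumption, the formula follows the slide's a_ij = a_iJ + a_Ij − a_IJ):

```python
import numpy as np

def mean_squared_residue(A):
    """Mean squared residue H(I, J) of a (sub)matrix A: the mean over
    (i, j) of (a_ij - a_iJ - a_Ij + a_IJ)^2, where a_iJ is the row mean,
    a_Ij the column mean, and a_IJ the overall mean."""
    residue = (A - A.mean(axis=1, keepdims=True)
                 - A.mean(axis=0, keepdims=True) + A.mean())
    return float((residue ** 2).mean())

# A perfectly additive bicluster (background + row effect + column effect)
# has residue ~0, matching a_ij = a_iJ + a_Ij - a_IJ.
A = 10.0 + np.array([[0.0], [1.0], [3.0]]) + np.array([[0.0, 2.0, 5.0]])
print(mean_squared_residue(A) < 1e-12)  # True
```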

  23. bi-cluster • Perfect δ-cluster • Imperfect δ-cluster • Residue: r_ij = d_ij − d_iJ − d_Ij + d_IJ (d_iJ: row mean, d_Ij: column mean, d_IJ: overall mean)

  24. Cheng and Church • Model: • A bicluster is represented by a submatrix A of the whole expression matrix (the involved rows and columns need not be contiguous in the original matrix). • Each entry Aij in the bicluster is the superposition (summation) of: • The background level • The row (gene) effect • The column (condition) effect • A dataset contains a number of biclusters, which are not necessarily disjoint.

  25. Cheng and Church • Finding the largest δ-bicluster: • The problem of finding the largest square δ-bicluster (|I| = |J|) is NP-hard. • Objective function for heuristic methods (to minimize): the mean squared residue H(I, J), a sum of components from each row and column, which suggests simple greedy algorithms that evaluate each row and column independently.

  26. Cheng and Church • Greedy methods: • Algorithm 0: Brute-force deletion (skipped) • Algorithm 1: Single node deletion • Parameter(s): δ (maximum mean squared residue). • Initialization: the bicluster contains all rows and columns. • Iteration: • Compute all a_Ij, a_iJ, a_IJ and H(I, J) for reuse. • Remove the row or column that gives the maximum decrease of H. • Termination: when no action will decrease H or H <= δ. • Time complexity: O(MN)
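Algorithm 1 can be sketched roughly as follows; this is a simplification under stated assumptions, not the paper's exact pseudocode (the minimum-size guard and all names are illustrative):

```python
import numpy as np

def single_node_deletion(A, delta):
    """Sketch of single node deletion: start from the full matrix and
    repeatedly drop the row or column with the largest mean squared
    residue contribution until H(I, J) <= delta. The size guard of 2
    is an added assumption."""
    rows, cols = list(range(A.shape[0])), list(range(A.shape[1]))
    while True:
        S = A[np.ix_(rows, cols)]
        residue = (S - S.mean(axis=1, keepdims=True)
                     - S.mean(axis=0, keepdims=True) + S.mean())
        H = (residue ** 2).mean()
        if H <= delta or len(rows) <= 2 or len(cols) <= 2:
            return rows, cols
        row_scores = (residue ** 2).mean(axis=1)  # residue contribution of each row
        col_scores = (residue ** 2).mean(axis=0)  # residue contribution of each column
        if row_scores.max() >= col_scores.max():
            rows.pop(int(row_scores.argmax()))
        else:
            cols.pop(int(col_scores.argmax()))

# 4 additive rows plus one noisy row; the noisy row should be deleted.
A = np.vstack([np.arange(4.0).reshape(4, 1) + np.arange(4.0),
               [[50.0, 0.0, 50.0, 0.0]]])
print(single_node_deletion(A, 0.01))  # ([0, 1, 2, 3], [0, 1, 2, 3])
```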

  27. Cheng and Church • Greedy methods: • Algorithm 2: Multiple node deletion (takes one more parameter α; in iteration step 2, delete all rows and columns whose row/column residue exceeds α·H(I, J)). • Algorithm 3: Node addition (allows both additions and deletions of rows/columns).

  28. Cheng and Church • Handling missing values and masking discovered biclusters: replace them by random numbers so that no recognizable structures are introduced. • Data preprocessing: • Yeast: x → 100·log(10⁵·x) • Lymphoma: x → 100·x (original data is already log-transformed)

  29. Cheng and Church • Some results on yeast cell cycle data (2884 × 17):

  30. Cheng and Church • Some results on lymphoma data (4026 × 96):

  31. Cheng and Church • Discussion: • Biological validation: comparing with the clusters in previously published results. • No evaluation of the statistical significance of the clusters. • Neither the model nor the algorithm is tailored for discovering multiple non-disjoint clusters. • Normalization is of utmost importance for the model, but this issue is not well discussed.

  32. The FLOC algorithm 1. Generate initial clusters. 2. Determine the best action for each row and each column. 3. Perform the best action of each row and column sequentially. 4. If the clustering improved, return to step 2; otherwise stop.

  33. The FLOC algorithm • Action: the change of membership of a row (or column) with respect to a cluster • M + N actions are performed at each iteration (the slide illustrates this with an N = 3 row by M = 4 column example matrix)
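A FLOC-style action can be scored by the change in cluster quality it causes; a sketch assuming the quality measure is the Cheng-Church mean squared residue (the function name and this choice of measure are illustrative assumptions):

```python
import numpy as np

def toggle_gain(A, cluster_rows, cluster_cols, i):
    """Gain of the action 'toggle row i's membership in the cluster',
    measured as H(before) - H(after); positive means the action improves
    the cluster. cluster_rows/cluster_cols are sets of indices."""
    def H(rows, cols):
        if len(rows) < 2 or len(cols) < 2:
            return float("inf")
        S = A[np.ix_(sorted(rows), sorted(cols))]
        res = (S - S.mean(axis=1, keepdims=True)
                 - S.mean(axis=0, keepdims=True) + S.mean())
        return float((res ** 2).mean())
    # Symmetric difference adds row i if absent, removes it if present.
    return H(cluster_rows, cluster_cols) - H(cluster_rows ^ {i}, cluster_cols)

# Rows 0-2 follow one additive pattern; row 3 is noise.
A = np.vstack([np.arange(3.0).reshape(3, 1) + np.array([0.0, 2.0, 5.0]),
               [[9.0, 0.0, 9.0]]])
print(abs(toggle_gain(A, {0, 1}, {0, 1, 2}, 2)) < 1e-9)  # True: row 2 fits
print(toggle_gain(A, {0, 1}, {0, 1, 2}, 3) < 0)          # True: noisy row hurts
```

At each FLOC iteration, every row's and column's best toggle is scored this way, giving the M + N candidate actions the slide mentions.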

  34. Performance • Microarray data: 2884 genes, 17 conditions • The 100 bi-clusters with the smallest residue were returned. • Average residue = 10.34 • The average residue of clusters found via the state-of-the-art method in the computational biology field is 12.54 • The average volume is 25% bigger • The response time is an order of magnitude faster

  35. Coherent Cluster Want to accommodate noise but not outliers

  36. Coherent Cluster • Coherent cluster • Subspace clustering • Pair-wise disparity • For a 2×2 (sub)matrix consisting of objects {x, y} and attributes {a, b}: D = |(d_xa − d_ya) − (d_xb − d_yb)|, where d_xa − d_ya is the mutual bias of attribute a and d_xb − d_yb is the mutual bias of attribute b

  37. Coherent Cluster • A 22 (sub)matrix is a -coherent cluster if its D value is less than or equal to . • An mn matrix X is a -coherent cluster if every 22 submatrix of X is -coherent cluster. • A -coherent cluster is a maximum -coherent cluster if it is not a submatrix of any other -coherent cluster. • Objective: given a data matrix and a threshold , find all maximum -coherent clusters.

  38. Coherent Cluster • Challenges: • Finding subspace clusters based on distance is already a difficult task due to the curse of dimensionality. • The (sub)set of objects and the (sub)set of attributes that form a cluster are unknown in advance and may not be adjacent to each other in the data matrix. • The actual values of the objects in a coherent cluster may be far apart from each other. • Each object or attribute in a coherent cluster may bear some relative bias (unknown in advance), and such bias may be local to the coherent cluster.

  39. References • J. Yang, W. Wang, H. Wang, P. Yu, “δ-clusters: capturing subspace correlation in a large data set”, Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp. 517–528, 2002. • H. Wang, W. Wang, J. Yang, P. Yu, “Clustering by pattern similarity in large data sets”, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002. • S. Yoon, C. Nardini, L. Benini, G. De Micheli, “Enhanced pClustering and its applications to gene expression data”, IEEE Symposium on Bioinformatics and Bioengineering (BIBE), 2004. • J. Liu and W. Wang, “OP-Cluster: clustering by tendency in high dimensional space”, ICDM 2003.
