A Consensus Framework for Integrating Distributed Clusterings Under Limited Knowledge Sharing

This paper presents a consensus framework for integrating distributed clusterings under limited knowledge sharing, with applications in knowledge reuse, distributed data mining, and cluster ensembles.


Presentation Transcript


  1. A Consensus Framework for Integrating Distributed Clusterings Under Limited Knowledge Sharing • Joydeep Ghosh, Alex Strehl, Srujana Merugu • The University of Texas at Austin

  2. Setting • Given multiple clusterings • possibly distributed in time and space • possibly over non-identical sets of objects • Obtain a single integrated clustering • without sharing the underlying algorithms or the object features (records)

  3. Application Scenarios • Knowledge reuse • consolidate legacy clusterings without accessing detailed object descriptions • Distributed data mining • only some features available per clusterer • only some objects available per clusterer • Improve quality and robustness • reduce variance • good results on a wide range of data using a diverse portfolio of algorithms • estimate a reasonable number of clusters k

  4. Cluster Ensembles • Given a set of provisional partitionings, we want to aggregate them into a single consensus partitioning, even without access to the original features. • (Figure: each clusterer produces its individual cluster labels; a consensus function combines them into the consensus labels)

  5. Cluster Ensemble Problem • Let there be r clusterings λ^(1), …, λ^(r), the q-th having k^(q) clusters • What is the integrated clustering that optimally summarizes the r given clusterings using k clusters? • Much more difficult than classification ensembles

  6. What is the “best” consensus? • Maximize the average [0, 1]-normalized mutual information of the consensus labeling with all the individual labelings λ^(q) of the ensemble, given the number of clusters k • Normalized mutual information (NMI) between random variables X, Y: NMI(X, Y) = I(X, Y) / sqrt(H(X) H(Y)) • Empirical validation
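
The objective on this slide can be made concrete with a short, self-contained sketch; this is not the authors' code, and the function and variable names (nmi, anmi, ensemble) are illustrative only.

```python
# Minimal sketch: NMI between two labelings and the average NMI (ANMI) of a
# candidate consensus labeling against an ensemble of labelings.
import numpy as np

def nmi(labels_a, labels_b):
    """[0, 1]-normalized mutual information, I(X, Y) / sqrt(H(X) * H(Y))."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    n = a.size
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    # Joint distribution of the two labelings (contingency table / n).
    counts = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(counts, (ai, bi), 1.0)
    pxy = counts / n
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return 0.0 if hx == 0.0 or hy == 0.0 else mi / np.sqrt(hx * hy)

def anmi(consensus, ensemble):
    """Average NMI of a consensus labeling with every labeling in the ensemble."""
    return float(np.mean([nmi(consensus, lam) for lam in ensemble]))
```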

  7. Designing a Consensus Function • Direct optimization is impractical • Three efficient heuristics: • Cluster-based Similarity Partitioning Algorithm (CSPA), O(n² k r) • HyperGraph Partitioning Algorithm (HGPA), O(n k r) • Meta-CLustering Algorithm (MCLA), O(n k² r²) • All three exploit a hypergraph representation of the sets of cluster labels (the input to the consensus function) • Supra-consensus function: performs all three and picks the one with the highest ANMI (fully unsupervised)
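
The supra-consensus step itself is easy to sketch once candidate consensus labelings exist. The sketch below assumes the CSPA/HGPA/MCLA outputs have already been computed elsewhere and simply keeps the candidate with the highest ANMI; it uses scikit-learn's NMI with geometric normalization, matching the definition on the previous slide.

```python
# Sketch of supra-consensus selection: keep the candidate consensus labeling
# with the highest average NMI (ANMI) against the input ensemble.
# `candidates` (e.g. the CSPA/HGPA/MCLA outputs) are assumed to be given.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def supra_consensus(candidates, ensemble):
    def anmi(lam):
        return np.mean([normalized_mutual_info_score(lam, q, average_method="geometric")
                        for q in ensemble])
    return max(candidates, key=anmi)
```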

  8. Hypergraph Representation • One hyperedge per cluster • Example: (figure omitted; each cluster of each labeling becomes a binary hyperedge over the objects)
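
A minimal sketch of this representation, assuming the usual encoding in which every cluster of every input labeling becomes one binary indicator column (hyperedge) over the n objects; the toy labelings are illustrative, not necessarily the slide's example.

```python
import numpy as np

def hypergraph(labelings):
    """One 0/1 indicator column (hyperedge) per cluster of every labeling."""
    cols = []
    for lam in labelings:
        lam = np.asarray(lam)
        for c in np.unique(lam):
            cols.append((lam == c).astype(int))
    return np.column_stack(cols)          # n objects x (total number of clusters)

# Toy ensemble: r = 3 labelings of n = 7 objects.
labelings = [[1, 1, 1, 2, 2, 3, 3],
             [2, 2, 2, 3, 3, 1, 1],
             [1, 1, 2, 2, 3, 3, 3]]
print(hypergraph(labelings).shape)        # (7, 9): nine hyperedges
```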

  9. Cluster-based Similarity Partitioning (CSPA) • Pairwise object similarity = number of shared hyperedges • Cluster the objects based on this “consensus” similarity matrix • using, e.g., graph partitioning
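
A sketch of the CSPA idea, assuming pairwise similarity is the fraction of labelings in which two objects share a cluster and that the similarity matrix is re-clustered with spectral clustering on the precomputed affinity; the paper partitions the induced similarity graph with METIS, so spectral clustering here is only a readily available stand-in.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cspa(labelings, k):
    """Consensus labels from the co-association (consensus similarity) matrix."""
    labelings = np.asarray(labelings)                    # r x n label matrix
    # S[i, j] = fraction of labelings in which objects i and j share a cluster.
    S = np.mean([np.equal.outer(lam, lam) for lam in labelings], axis=0)
    model = SpectralClustering(n_clusters=k, affinity="precomputed", random_state=0)
    return model.fit_predict(S)

labelings = [[1, 1, 1, 2, 2, 3, 3],
             [2, 2, 2, 3, 3, 1, 1],
             [1, 1, 2, 2, 3, 3, 3]]
print(cspa(labelings, k=3))
```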

  10. HyperGraph Partitioning Algorithm (HGPA) • Partition the hypergraph so that a minimum number of hyperedges is cut • Hypergraph partitioning is a well-known problem, e.g., from VLSI design • We use HMETIS • (Figure: example hypergraph over objects 1–7 before and after partitioning)

  11. Meta-CLustering Algorithm (MCLA) • Build a meta-graph such that • each vertex is a cluster (vertex weight = cluster size) • each edge weight is the similarity between the two clusters • similarity = intersection / union (binary Jaccard similarity between hyperedges h_a and h_b) • Balanced partitioning of this r-partite meta-graph (METIS) • Assign each object to its best matching meta-cluster
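
Only the first MCLA step, building the meta-graph edge weights, is sketched below; the balanced METIS partitioning of the meta-graph and the final assignment of objects to meta-clusters are omitted. H is the binary hyperedge-membership matrix from the hypergraph representation, and all names are illustrative.

```python
import numpy as np

def jaccard(ha, hb):
    """Binary Jaccard similarity |ha AND hb| / |ha OR hb| of two hyperedges."""
    ha, hb = np.asarray(ha, bool), np.asarray(hb, bool)
    union = np.logical_or(ha, hb).sum()
    return np.logical_and(ha, hb).sum() / union if union else 0.0

def meta_graph_weights(H):
    """Edge weights of the MCLA meta-graph; H is the n x m hyperedge matrix."""
    m = H.shape[1]
    W = np.zeros((m, m))
    for a in range(m):
        for b in range(a + 1, m):
            W[a, b] = W[b, a] = jaccard(H[:, a], H[:, b])
    return W

# Example with two hyperedges over 7 objects: intersection 2, union 3.
print(jaccard([1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0]))   # 0.666...
```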

  12. MCLA Example 1/2 • Each vertex is a cluster • Edge weight is the Jaccard similarity between hyperedges h_a and h_b

  13. MCLA Example 2/2 • (Figure: the individual labelings and the resulting consensus labeling) • In this illustrative example, CSPA, HGPA, and MCLA give the same result

  14. Applications and Experiments • Proprietary datasets • Data sets: • 2-dimensional 2-Gaussian simulated data (k=2, d=2, n=1000) • 5 Gaussians in 8 dimensions (k=5, d=8, n=1000) • Pen digit data (k=3, d=4, n=7494) • Yahoo news web-document data (k=40, d=2903, n=2340) • Application setups: • Feature Distributed Clustering (FDC) • integrating clusters of varying resolution • robust consensus • Object Distributed Clustering (ODC) • Extrinsic evaluation

  15. FDC Example • Data: 5 Gaussians in 8 dimensions • Experiment: 5 clusterings, each in a 2-dimensional subspace • Result: average individual quality 0.70, best individual 0.77, ensemble 0.99

  16. Experimental Results: FDC • Reference clustering and consensus clustering • The ensemble is always equal to or better than the individual clusterings: • more than double the average individual quality on YAHOO

  17. Combining Clusterings of Different Resolutions • Motivation: robust combining of cluster solutions of different resolutions, as produced in real-life distributed data scenarios • The ensemble helps estimate the “natural” number of clusters • Use ANMI to guide the choice of k
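
A sketch of how ANMI could guide the choice of k. It assumes a simple stand-in consensus function (k-means on the binary hypergraph columns) rather than the paper's CSPA/HGPA/MCLA, and picks the k whose consensus labeling has the highest average NMI with the ensemble; all names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def pick_k(labelings, k_range):
    """Return the k whose (stand-in) consensus labeling maximizes ANMI."""
    labelings = [np.asarray(lam) for lam in labelings]
    # Binary hyperedge-membership matrix of the ensemble.
    H = np.column_stack([(lam == c).astype(float)
                         for lam in labelings for c in np.unique(lam)])
    best_k, best_anmi = None, -np.inf
    for k in k_range:
        consensus = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(H)
        anmi = np.mean([normalized_mutual_info_score(consensus, lam,
                                                     average_method="geometric")
                        for lam in labelings])
        if anmi > best_anmi:
            best_k, best_anmi = k, anmi
    return best_k
```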

  18. Experiments • Table 1: Details of data sets and cluster ensembles with varying k

  19. Behavior of ANMI w.r.t. k (#clusters)

  20. ANMI vs. NMI Correlation • NMI (match with the ground truth) vs. ANMI (match with the consensus) • Correlation coefficient 0.923, except Yahoo (0.61)

  21. Results 1 • Table 2: Quality of clusterings in terms of NMI w.r.t. the corresponding original categorization

  22. Object Distributed Clustering (ODC) • Scenario: the data is divided into p overlapping partitions; each object is repeated, on average, a fixed number of times (the repetition factor) • Advantages: • distributed clustering • speed-up when the inner clustering algorithm has super-linear complexity and a fast consensus function (MCLA, HGPA) is used • for an O(n²) clustering algorithm and an O(n) consensus function, the asymptotic sequential speedup is p/2 (e.g., clustering the YAHOO data can be sped up 64-fold with 16 processors while retaining 80% of the full-length quality, assuming a repetition factor of 2) • easily yields a p-fold parallelization • Experiments: • individual partitions were clustered using graph partitioning • results were combined using the consensus framework
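
The slide's object split can be illustrated with a minimal sketch that assigns every object to a fixed number of randomly chosen partitions; the paper's exact splitting scheme is not reproduced here, so this is only an assumption for illustration, and the names are hypothetical.

```python
import numpy as np

def overlapping_partitions(n, p, rep=2, seed=0):
    """Split n objects into p overlapping partitions, each object in `rep` of them."""
    rng = np.random.default_rng(seed)
    parts = [[] for _ in range(p)]
    for i in range(n):
        for j in rng.choice(p, size=rep, replace=False):
            parts[j].append(i)
    return [np.array(part) for part in parts]

# Each partition would be clustered independently and the resulting labelings
# combined with a consensus function (e.g. MCLA or HGPA).
parts = overlapping_partitions(n=1000, p=4, rep=2)
print([len(part) for part in parts])
```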

  23. ODC Results • Figure 3: Clustering quality (measured by relative mutual information) as a function of the number of partitions p, on the data sets (a) 2D2K, (b) 8D5K, (c) PENDIG, (d) YAHOO. The repetition factor was set to 2 and graph partitioning was used for clustering the data.

  24. Robust Consensus Clustering (RCC) • Goal: create an “auto-focus” clusterer that works for a wide variety of data sets • Diverse portfolio of 10 approaches: • SOM, HGP • GP (Euclidean, correlation, cosine, extended Jaccard) • KM (Euclidean, correlation, cosine, extended Jaccard) • Each approach is run on the same subsample of the data and the 10 clusterings are combined using our supra-consensus function • Evaluation: increase in NMI of the supra-consensus result over random labeling
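
A sketch of the RCC recipe: run a small, diverse portfolio of clusterers on the same subsample and hand all resulting labelings to the supra-consensus function. The portfolio below (k-means, two agglomerative linkages, spectral clustering) is only a readily available stand-in for the paper's 10-method SOM/HGP/GP/KM portfolio; names are illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans, SpectralClustering

def portfolio_labelings(X, k, seed=0):
    """Run a diverse set of clusterers on the same data and collect their labels."""
    models = [
        KMeans(n_clusters=k, n_init=10, random_state=seed),
        AgglomerativeClustering(n_clusters=k, linkage="ward"),
        AgglomerativeClustering(n_clusters=k, linkage="average"),
        SpectralClustering(n_clusters=k, random_state=seed),
    ]
    return [m.fit_predict(np.asarray(X)) for m in models]

# The returned labelings would then be combined with the supra-consensus
# function sketched earlier.
```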

  25. Robustness Summary • Average quality versus ensemble quality • For several sample sizes n (50, 100, 200, 400, 800) • 10-fold experiments • ±1 standard deviation bars

  26. Remarks • Cluster ensembles • enable knowledge reuse • work with distributed data under strong privacy constraints • improve quality and robustness • are as yet largely unexplored • Future work • combining soft clusterings • preferential consensus • what if (some) features are known? • what if segments are ordered? • Applications and data sets • bioinformatics • Papers, data, demos, and code at http://strehl.com/
