Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach


Presentation Transcript


  1. Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach. Xiaoli Zhang Fern and Carla E. Brodley, ICML 2003. Presented by Dehong Liu

  2. Contents • Motivation • Random projection and the cluster ensemble approach • Experimental results • Conclusion

  3. Motivation • High dimensionality poses two challenges for unsupervised learning: • The presence of irrelevant and noisy features can mislead the clustering algorithm. • In high dimensions, data may be sparse, making it difficult to find any structure in the data. • Two basic approaches to reducing the dimensionality: • Feature subset selection; • Feature transformation, e.g., PCA or random projection.

  4. Motivation • Random projection • Advantage • A general data reduction technique; • Has been shown to have special promise for high dimensional data clustering. • Disadvantage • Highly unstable. Different random projections may lead to radically different clustering results.

  5. Idea • Aggregate multiple runs of clusterings to achieve better clustering performance. • A single run of clustering consists of applying random projection to the high dimensional data and clustering the reduced data using EM. • Multiple runs of clustering are performed and the results are aggregated to form an n × n similarity matrix. • An agglomerative clustering algorithm is then applied to the matrix to produce the final clusters.

  6. A single run • Random projection: X' = X R • X': n × d', the reduced-dimension data set • X: n × d, the high-dimensional data set • R: d × d', generated by first setting each entry of the matrix to a value drawn i.i.d. from N(0,1) and then normalizing the columns to unit length. • EM clustering (a sketch of one run is given below)
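A minimal sketch of one RP + EM run, assuming NumPy and scikit-learn's GaussianMixture as the EM clustering step; the function name single_run and its arguments are illustrative, not from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def single_run(X, d_prime, k, rng):
    """One RP + EM run: project X (n x d) to d' dimensions, fit a k-component GMM."""
    n, d = X.shape
    # R (d x d'): i.i.d. N(0, 1) entries, columns normalized to unit length.
    R = rng.normal(size=(d, d_prime))
    R /= np.linalg.norm(R, axis=0)
    X_reduced = X @ R  # X' = X R, shape (n, d')
    gmm = GaussianMixture(n_components=k).fit(X_reduced)
    # Soft membership probabilities P(l | i, theta), shape (n, k).
    return gmm.predict_proba(X_reduced)

# Example use (hypothetical data X):
# memberships = single_run(X, d_prime=5, k=3, rng=np.random.default_rng(0))
```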

  7. Aggregating multiple clustering results • The probability that data point i belongs to cluster l under the model θ: P(l | i, θ), the soft membership output by EM. • The probability that data points i and j belong to the same cluster under the model θ: Pij = Σ_l P(l | i, θ) P(l | j, θ)

  8. Averaged over the multiple runs, Pij forms an n × n “similarity” matrix.
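A hedged sketch of this aggregation step (names are illustrative): with each run's soft memberships P(l | i, θ) stored as an n × k array, Pij for that run is a single matrix product, and averaging over runs gives the n × n similarity matrix.

```python
import numpy as np

def similarity_matrix(membership_list):
    """membership_list: one (n x k) soft-membership array per run."""
    n = membership_list[0].shape[0]
    S = np.zeros((n, n))
    for P in membership_list:
        S += P @ P.T                     # Pij = sum_l P(l|i) * P(l|j) for this run
    return S / len(membership_list)      # average over runs
```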

  9. Producing final clusters • An agglomerative clustering algorithm is applied to the similarity matrix, repeatedly merging the most similar clusters to produce the final clusters.

  10. How to decide k? We can use the occurrence of a sudden similarity drop as a heuristic to determine k.
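A minimal sketch of these last two steps, assuming SciPy's complete-linkage agglomerative clustering on distance = 1 - similarity can stand in for the merging procedure, and reading the "sudden similarity drop" off the merge heights; both choices are assumptions, not the authors' exact implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def final_clusters(S):
    """S: n x n aggregated similarity matrix with values in [0, 1]."""
    D = 1.0 - S
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="complete")
    # Merge heights are non-decreasing; the largest jump between consecutive
    # merges corresponds to a sudden drop in similarity, and the number of
    # clusters just before that merge is taken as k.
    heights = Z[:, 2]
    jump = int(np.argmax(np.diff(heights)))
    k = len(S) - (jump + 1)
    return fcluster(Z, t=k, criterion="maxclust"), k
```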

  11. Experimental results • Evaluation Criteria • Conditional Entropy (CE): measures the uncertainty of the class labels given a clustering solution. • Normalized Mutual Information (NMI) between the distribution of class labels and the distribution of cluster labels. • CE: the smaller the better. NMI: the larger the better.
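A hedged sketch of the two criteria, assuming scikit-learn's normalized_mutual_info_score for NMI and a direct computation of the conditional entropy H(class | cluster) in bits; the helper name conditional_entropy is illustrative.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score  # NMI: the larger the better

def conditional_entropy(class_labels, cluster_labels):
    """CE = H(class | cluster); the smaller the better."""
    class_labels = np.asarray(class_labels)
    cluster_labels = np.asarray(cluster_labels)
    ce = 0.0
    for c in np.unique(cluster_labels):
        mask = cluster_labels == c
        _, counts = np.unique(class_labels[mask], return_counts=True)
        p = counts / counts.sum()                     # class distribution inside cluster c
        ce += mask.mean() * -(p * np.log2(p)).sum()   # weighted by cluster size
    return ce
```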

  12. Experimental results • Cluster ensemble versus single RP+EM

  13. Experimental results • Cluster ensemble versus PCA+EM

  14. Experimental results • Cluster ensemble versus PCA+EM

  15. Analysis of Diversity for Cluster Ensembles • Diversity: the NMI between each pair of clustering solutions. • Quality: the average of the NMI values between each of the solutions and the class labels.
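An illustrative sketch of the two measures as described on the slide, assuming each clustering solution is a label vector; solutions and class_labels are hypothetical names.

```python
import itertools
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def diversity(solutions):
    """Mean pairwise NMI between solutions; lower NMI means more diverse."""
    return np.mean([nmi(a, b) for a, b in itertools.combinations(solutions, 2)])

def quality(solutions, class_labels):
    """Mean NMI between each solution and the true class labels."""
    return np.mean([nmi(s, class_labels) for s in solutions])
```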

  16. Conclusion • Techniques have been investigated to produce and combine multiple clusterings in order to achieve an improved final clustering. • The major contributions of this paper: 1) examined random projection for high dimensional data clustering and identified its instability problem; 2) formed a novel cluster ensemble framework based on random projection and demonstrated its effectiveness for high dimensional data clustering; and 3) identified the importance of the quality and diversity of individual clustering solutions and illustrated their influence on the ensemble performance with empirical results.
