  1. COALA : A Novel Approach for the Extraction of an Alternate Clustering of High Quality and High Dissimilarity Eric Bae and James Bailey Proceedings of the IEEE International Conference on Data Mining, ICDM, 2006 Presenter: 吳建良

  2. Outline • Motivation • Problem Definition • COALA • Quantitative Evaluation • Experiment

  3. Motivation • Traditional clustering techniques • Produce only a single solution • Difficult for the user to validate whether the solution is in fact appropriate, particularly if the dataset is large and complex • The user has limited knowledge about the clustering algorithm being used • Goal: provide another, alternative clustering solution • High quality, yet different from the original solution

  4. Related Work - Ensemble Clustering • Objective • Generate multiple clusterings • Merge them to offer a final consensus clustering • Method • Apply many algorithms • Change initial conditions of an algorithm • Random samples of data

  5. Challenges • An inability to know which algorithms to apply and how many • A difficulty in quantitatively evaluating the degree of (dis)similarity/quality for the candidate solutions • The inefficiency of running algorithms multiple times

  6. Requirement of clustering • Dissimilarity requirement • Given two clusterings C and S, they can be presented as solutions if they are as dissimilar from one another as possible • Cannot-link constraint • Quality requirement • Given two clusterings C and S, they can be considered as solutions if they are both high quality clusterings • Quality threshold ω

  7. Problem Definition • Problem definition • Given a clustering C (provided as pre-defined class labels) with r clusters, find a second clustering S with r clusters, having high dissimilarity to C, but also satisfying the quality requirement threshold ω

  8. Notation • D={x1, x2, …, xn}: a set of n objects • C={c1, c2, …, cr}: existing clustering (background knowledge) • S={s1, s2, …, sr}: new clustering with respect to C • d(ci, cj): the distance between clusters ci and cj • Average linkage • Compute the average distance over all pairwise objects between clusters ci and cj
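The average-linkage distance d(ci, cj) above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the function names and the use of Euclidean distance between points are assumptions.

```python
import math

def euclidean(x, y):
    # Straight-line distance between two points given as tuples of floats
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def avg_linkage(ci, cj):
    """Average-linkage d(ci, cj): mean distance over all cross-cluster
    pairs of objects, as defined on the Notation slide."""
    return sum(euclidean(x, y) for x in ci for y in cj) / (len(ci) * len(cj))
```

For example, `avg_linkage([(0, 0)], [(3, 4)])` is simply the point distance 5.0, while for multi-object clusters the pairwise distances are averaged.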

  9. Cannot-link Constraints • Cannot-link constraint • A pair of distinct data objects (xi, xj) • In any feasible clustering, objects xi and xj must not be in the same cluster • Use cannot-link constraints to ensure that the second clustering S is dissimilar from the given clustering C • Each pair of objects that is in the same cluster in C is added to the constraint set L
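Building the constraint set L from the given clustering C is a direct pairwise enumeration. A minimal sketch (illustrative names; clusters assumed to be sets of hashable object ids, pairs stored in sorted order):

```python
from itertools import combinations

def cannot_link_set(C):
    """Constraint set L: every pair of objects co-clustered in C becomes a
    cannot-link pair that the new clustering S must not place together."""
    L = set()
    for cluster in C:
        for xi, xj in combinations(sorted(cluster), 2):
            L.add((xi, xj))
    return L
```

On the example used later in the talk, `cannot_link_set([{'A','B','C','D'}, {'E','F'}])` yields the seven pairs (A,B), (A,C), (A,D), (B,C), (B,D), (C,D), (E,F).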

  10. COALA • Agglomerative hierarchical clustering algorithm • Starts by treating each object as a single cluster • At each iteration, two candidate pairs of clusters for a possible merge are found • Qualitative pair (cq1, cq2) • The minimum distance over all pairs of clusters • Dissimilar pair (co1, co2) • The minimum distance over all pairs of clusters that also satisfy the cannot-link constraints

  11. COALA (contd.) • Quality threshold ω: balances the trade-off between the qualitative merge and the dissimilar merge • If d(cq1, cq2) &lt; ω · d(co1, co2) then merge(cq1, cq2), else merge(co1, co2) • If no pair of clusters satisfies the cannot-link constraints, the algorithm proceeds with the qualitative merge
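One COALA merge step can be sketched as follows. This is an illustrative reconstruction from the slides, not the authors' implementation: clusters are sets of object ids, `dist` is the average-linkage function, `L` holds cannot-link pairs in sorted order, and the merge condition d(cq1, cq2) &lt; ω · d(co1, co2) is inferred from the worked example with ω = 0.6.

```python
def coala_step(clusters, dist, L, omega):
    """One merge decision of COALA (sketch). clusters: list of sets;
    dist(ci, cj): cluster distance; L: cannot-link pairs; omega in (0, 1]."""
    pairs = [(dist(ci, cj), i, j)
             for i, ci in enumerate(clusters)
             for j, cj in enumerate(clusters) if i < j]
    d_q, qi, qj = min(pairs)          # qualitative pair: global minimum distance
    # Dissimilar candidates: merges that violate no cannot-link constraint
    ok = [(d, i, j) for d, i, j in pairs
          if not any((min(x, y), max(x, y)) in L
                     for x in clusters[i] for y in clusters[j])]
    if not ok:
        i, j = qi, qj                 # no valid dissimilar merge: fall back
    else:
        d_o, oi, oj = min(ok)
        # Merge the qualitative pair only when it is clearly better than the
        # best constraint-respecting merge, as controlled by omega
        i, j = (qi, qj) if d_q < omega * d_o else (oi, oj)
    merged = clusters[i] | clusters[j]
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
```

With a larger ω the qualitative merge wins more often (favoring quality); a smaller ω pushes the algorithm toward constraint-respecting, dissimilar merges.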

  12. Example • Given clustering C = {{A,B,C,D}, {E,F}}, quality threshold ω = 0.6 • Initialization: each point forms a cluster • Cannot-link constraint set L = {(A,B), (A,C), (A,D), (B,C), (B,D), (C,D), (E,F)} • [figure: points A–F, colored by their clusters in C]

  13. Example (contd.) • Minimal qualitative pair: (A,B) or (B,C) or (D,F) • Minimal dissimilar pair: (D,F) • Suppose (A,B) is picked as the qualitative pair → merge the dissimilar pair (D,F) • Result: [figure: clusters after merging D and F]

  14. Example (contd.) • Minimal qualitative pair: (A,B) or (B,C) • Minimal dissimilar pair: (C,E) • Suppose (A,B) is picked as the qualitative pair → merge the qualitative pair (A,B) • Result: [figure: clusters after merging A and B]

  15. Example (contd.) • Minimal qualitative pair: ({A,B},C) • Minimal dissimilar pair: (C,E) • → merge the qualitative pair ({A,B},C) • Result: [figure: clusters after merging {A,B} and C]

  16. Example (contd.) • Minimal qualitative pair: ({D,F},E) • Minimal dissimilar pair: ({A,B,C},E) • → merge the qualitative pair ({D,F},E) • Result: [figure: final clustering S = {{A,B,C}, {D,E,F}}]

  17. Quantitative Evaluation • Dissimilarity • Jaccard index: J(C, S) = N11 / (N11 + N01 + N10) • N11: the number of pairs of points in the same cluster in both C and S • N00: the number of pairs that are in different clusters in both C and S • N01 and N10: the number of pairs that belong to the same cluster in one clustering, but not the other • Quality • Dunn index: DI(S) = min δ(si, sj) / max Δ(sk) • δ: cluster-to-cluster distance • Δ: cluster diameter measure
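The Jaccard index over pairs of objects can be computed directly from its definition above. A minimal sketch (illustrative names; clusterings given as lists of sets over the same objects):

```python
from itertools import combinations

def jaccard(C, S):
    """Pairwise Jaccard index J = N11 / (N11 + N01 + N10).
    Lower values mean the two clusterings are more dissimilar."""
    def together(clustering, x, y):
        return any(x in c and y in c for c in clustering)
    objects = sorted(set().union(*C))
    n11 = n_mixed = 0                  # n_mixed counts N01 + N10
    for x, y in combinations(objects, 2):
        in_c, in_s = together(C, x, y), together(S, x, y)
        if in_c and in_s:
            n11 += 1
        elif in_c != in_s:
            n_mixed += 1
    return n11 / (n11 + n_mixed)
```

Identical clusterings score 1.0; clusterings that share no co-clustered pair score 0.0, which is the desirable outcome for an alternative clustering.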

  18. Quantitative Evaluation (contd.) • Jaccard index value ↓ ⇒ dissimilarity ↑ • Dunn index value ↑ ⇒ quality ↑ • Overall clustering score DQ combines the two measures

  19. Experiment • Synthetic datasets

  20. Two competing approaches • Naïve method • Apply the k-means algorithm three times using different initial points • Select the two clusterings with the highest DQ • Of those two, take the higher-quality clustering as the 'known' clustering • CIB (Conditional information bottleneck) • Retrieves dissimilar clusterings • Finds the optimal assignment of objects to clusters while preserving as much feature information as possible, conditioned on the information provided by the pre-defined class labels

  21. Result

  22. Result (contd.) • Four real-world datasets

  23. Impact of quality threshold ω
