
Correlation Clustering



1. Correlation Clustering
Shuchi Chawla, Carnegie Mellon University
Joint work with Nikhil Bansal and Avrim Blum

2. Document Clustering
• Given a collection of documents, classify them into salient topics
• Typical characteristics:
  • No well-defined “similarity metric”
  • The number of clusters is unknown
  • No predefined topics; it is desirable to discover them as part of the algorithm

3. Research Communities
• Given data on research papers, divide researchers into communities by co-authorship
• Typical characteristics:
  • How to divide depends strongly on the given set of researchers
  • Fuzzy boundaries between communities

4. Traditional Approaches to Clustering
• Approximation algorithms
  • k-means, k-median, k-min sum
• Matrix methods
  • Spectral clustering
• AI techniques
  • EM, classification algorithms

5. Problems with Traditional Approaches
• Dependence on an underlying metric
  • Objective functions are meaningless without a metric, e.g. k-means
  • Some algorithms work only on specific metrics (such as Euclidean), e.g. spectral methods

6. Problems with Traditional Approaches
• Fixed number of clusters
  • The objective is meaningless without a prespecified number of clusters
  • e.g. for k-means or k-median, if k is unspecified, the optimum puts every point in its own cluster

7. Problems with Traditional Approaches
• No clean notion of the “quality” of a clustering
  • Objective functions do not directly translate into how many items have been grouped wrongly
• Heuristic approaches
• Objective functions derived from generative models

8. Cohen, McCallum & Richman’s Idea
• “Learn” a similarity measure on documents
  • It may not be a metric!
  • f(x,y) = amount of similarity between x and y
  • Use labeled data to train up this function
• Our task:
  • Classify all pairs with the learned function
  • Find the “most consistent” clustering
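As a sketch of how the learned function plugs in (the names, the threshold, and the dict representation are illustrative assumptions, not from the talk), the pairwise scores can be thresholded into the ‘+’/‘-’ edge labels used below:

```python
from itertools import combinations

def label_edges(items, f, threshold=0.5):
    """Threshold a learned pairwise similarity f(x, y) into '+'/'-' labels.

    f need not be a metric. The 0.5 cutoff is an arbitrary assumption;
    in practice it would come from the trained classifier.
    """
    return {(x, y): '+' if f(x, y) >= threshold else '-'
            for x, y in combinations(items, 2)}
```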

9. An Example
• Edge labels: ‘+’ = same, ‘-’ = different
• Consistent clustering: ‘+’ edges inside clusters, ‘-’ edges between clusters
[Figure: a labeled graph on the nodes Harry B., Harry Bovik, H. Bovik, and Tom X.]

10. An Example
• A disagreement: an edge whose label the clustering violates
[Figure: the same graph, with a disagreeing edge highlighted]

11. An Example
• Task: find the most consistent clustering, i.e. the one with the fewest possible disagreements or, equivalently, the maximum possible agreements
[Figure: the same graph with ‘+’/‘-’ labels]

12. Correlation Clustering
• Given a complete graph with each edge labeled ‘+’ or ‘-’
• Our measure of a clustering: how many edge labels does it agree with?
• The number of clusters depends on the edge labels
• NP-complete; we consider approximations
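A minimal sketch of the objective itself (same illustrative representation as above: edge labels as a pair-keyed dict, a clustering as a node-to-cluster-id map):

```python
def disagreements(labels, cluster_of):
    """Count the edges whose label the clustering violates.

    A '+' edge disagrees if its endpoints land in different clusters;
    a '-' edge disagrees if its endpoints share a cluster.
    """
    mistakes = 0
    for (x, y), sign in labels.items():
        same = cluster_of[x] == cluster_of[y]
        if (sign == '+') != same:  # label and clustering conflict
            mistakes += 1
    return mistakes
```

Maximizing agreements is the complementary count: the total number of edges minus the disagreements.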

13. Compared to Traditional Approaches…
• We do not have to specify k
• No condition on the weights; they can be arbitrary
• Clean notion of the quality of a clustering: the number of examples on which the clustering differs from f
• If a good (perfect) clustering exists, it is easy to find (see the sketch below)
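The last point can be made concrete (a sketch of a standard observation, not code from the talk): if a clustering with zero disagreements exists, its clusters must be exactly the connected components of the ‘+’ edges, so union-find recovers it directly:

```python
def perfect_clustering(nodes, labels):
    """Recover the zero-disagreement clustering, assuming one exists:
    its clusters are the connected components of the '+' edges."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for (x, y), sign in labels.items():
        if sign == '+':
            parent[find(x)] = find(y)  # union endpoints of '+' edges
    return {v: find(v) for v in nodes}
```

If the returned clustering still has disagreements, no perfect clustering exists.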

14. Some Machine Learning Justification
• Noise removal
  • There is some true classification function f, but there are a few errors in the data
  • We want to find the true function
• Agnostic learning
  • There is no inherent clustering
  • Try to find the best representation using a hypothesis with limited expressivity

15. Our Results
• A constant-factor approximation for minimizing disagreements
• A PTAS for maximizing agreements
• Results for the random-noise case

16. Minimizing Disagreements
• Goal: a constant-factor approximation
• Problem: even if we find a cluster as good as one in OPT, we are headed toward a log n approximation (a set-cover-like bound)
• Idea: lower bound D_OPT, the number of disagreements in the optimal clustering

17. Lower Bounding Idea: Bad Triangles
• Consider a “bad triangle”: two ‘+’ edges and one ‘-’ edge
• Any clustering has to disagree with at least one of these three edges

18. Lower Bounding Idea: Bad Triangles
• If there are several edge-disjoint bad triangles, any clustering makes a distinct mistake on each one
• Hence D_OPT ≥ #{edge-disjoint bad triangles}
[Figure: nodes 1–5 with edge-disjoint bad triangles (1,2,3) and (1,4,5)]
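As an illustration (a greedy heuristic sketch, not the paper’s algorithm), one can pack edge-disjoint bad triangles to compute a concrete lower bound on D_OPT:

```python
from itertools import combinations

def bad_triangle_bound(nodes, labels):
    """Greedily collect edge-disjoint bad triangles (two '+' edges, one '-').

    Edge-disjointness forces a distinct mistake per triangle in any
    clustering, so the count lower-bounds D_OPT. Greedy packing is a
    heuristic and may miss the maximum number of such triangles.
    """
    def sign(x, y):
        return labels.get((x, y)) or labels.get((y, x))

    used, count = set(), 0          # edges claimed by chosen triangles
    for tri in combinations(nodes, 3):
        edges = list(combinations(tri, 2))
        if any(e in used for e in edges):
            continue                # not edge-disjoint from earlier picks
        if sorted(sign(x, y) for x, y in edges) == ['+', '+', '-']:
            used.update(edges)
            count += 1
    return count
```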

19. Using the Lower Bound
• δ-clean cluster: a cluster C in which each node has fewer than δ|C| “bad” edges (‘-’ edges inside C and ‘+’ edges leaving C)
• δ-clean clusters have few bad triangles ⇒ few mistakes
• Possible solution: find a δ-clean clustering (a test is sketched below)
• Caveat: it may not exist
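A sketch of the δ-clean test, under the reading of “bad” edges given above (the helper names and dict representation are illustrative assumptions):

```python
def is_delta_clean(cluster, nodes, labels, delta):
    """Check that every node of the cluster has fewer than delta*|C| bad
    edges: '-' edges to other nodes of C, or '+' edges leaving C."""
    C = set(cluster)

    def sign(x, y):
        return labels.get((x, y)) or labels.get((y, x))

    for v in C:
        # An edge (u, v) is bad exactly when membership and sign conflict.
        bad = sum(1 for u in nodes if u != v and
                  ((u in C) != (sign(u, v) == '+')))
        if bad >= delta * len(C):
            return False
    return True
```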

20. Using the Lower Bound
• Recall the caveat: a fully δ-clean clustering may not exist
• We show: there exists a clustering whose clusters are each δ-clean or singletons
• Further, it has few mistakes
• Its nice structure helps us find it easily

21. Extensions & Open Problems
• Weighted edges or incomplete graphs
  • Recent work by Bartal et al.: a log-approximation based on multiway cut
• A better constant for the unweighted case
  • Can we use bad triangles (or polygons) more directly for a tighter bound?
• Experimental performance

22. Other Problems I Have Worked On
• Game theory and mechanism design
• Approximation algorithms for Orienteering and related problems
• Online search algorithms based on machine-learning approaches
• Theoretical properties of power-law graphs
• Currently working on privacy with Cynthia

23. Thanks!
