
Data reduction for weighted and outlier-resistant clustering


Presentation Transcript


  1. Data reduction for weighted and outlier-resistant clustering Leonard J. Schulman, Caltech joint with Dan Feldman, MIT

  2. Talk outline • Clustering-type problems: • k-median • weighted k-median • k-median with m outliers (small m) • k-median with penalty (clustering with many outliers) • k-line median • Unifying framework: tame loss functions • Core-sets, a.k.a. ε-approximations • Common existence proof and algorithm

  3. Voronoi regions have spherical boundaries

  4. k-Median with penalty

  5. k-Median with penalty: good for outliers. Figures: a 2-median clustering of a data set; the same data set plus an outlier; the same data clustered with the h-robust loss function.

  6. Related work and our results

  7. Why are all these problems in the same paper? In each case the objective function is a suitably tame “loss function”. The loss in representing a point p by a center c is: • k-median: D(p) = dist(p,c) • Weighted k-median: D(p) = w · dist(p,c) • Robust k-median: D(p) = min{h, dist(p,c)} What qualifies as a “tame” loss function?
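To make the three loss functions concrete, here is a minimal Python sketch (illustrative only; the function names and the representation of points are assumptions, not code from the talk):

```python
import math

def dist(p, c):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((pi - ci) ** 2 for pi, ci in zip(p, c)))

def kmedian_loss(p, centers):
    """k-median: distance from p to its nearest center."""
    return min(dist(p, c) for c in centers)

def weighted_kmedian_loss(p, weighted_centers):
    """Weighted k-median: each center c carries a multiplicative weight w."""
    return min(w * dist(p, c) for c, w in weighted_centers)

def robust_kmedian_loss(p, centers, h):
    """Robust k-median: the per-point loss is capped at h, so a distant
    outlier can contribute at most h to the objective."""
    return min(h, kmedian_loss(p, centers))

def total_cost(points, per_point_loss):
    """Clustering objective: sum of per-point losses over the data set."""
    return sum(per_point_loss(p) for p in points)
```

For example, total_cost(P, lambda p: robust_kmedian_loss(p, C, h=1.0)) evaluates the h-robust objective of slide 5 for a candidate set of centers C.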

  8. Log-Log Lipschitz (LgLgLp) condition on the loss function
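The slide's formal statement is not reproduced in this transcript. One standard way to phrase a log-log Lipschitz condition on a loss D (given here as an assumption, not a quotation from the slide) is:

```latex
% Assumed formulation, not quoted from the slide:
% D is non-decreasing in the distance, and there is a constant r >= 1 such that
\[
  D(x \cdot \Delta) \;\le\; x^{r}\, D(\Delta)
  \qquad \text{for all } x \ge 1 \text{ and } \Delta > 0,
\]
% i.e., \log D(\Delta) is Lipschitz, with constant r, as a function of \log \Delta.
```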

  9. Many examples of LgLgLp loss functions: Robust M-estimators in Statistics (figure: Z. Zhang)

  10. Classic Data Reduction

  11. Same notion for LgLgLp loss functions

  12. k-clustering core-set for loss D

  13. Weighted-k-clustering core-set for loss D Handling arbitrary-weight centers is the “hard part”

  14. Our main technical result • For every LgLgLp loss fcn D on a metric space, for every set P of n points, there is a weighted-(D,k)-core-set S of size |S| = O(log² n) (In more detail: |S| = (d·k^O(k)/ε²)·log² n in ℝ^d. For finite metrics, d = log n.) • S can be computed in time O(n)

  15. Sensitivity [Langberg and S, SODA’11] The sensitivity of a point p ∈ P determines how important it is to include p in a core-set: s(p) = max_C D_W(p,C) / Σ_{q∈P} D_W(q,C) Why this works: If s(p) is small, then p has many “surrogates” in the data, and we can take any one of them for the core-set. If s(p) is large, then there is some C for which p alone contributes a significant fraction of the loss, so we need to include p in any core-set.
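As a naive but concrete reading of this definition, the following Python sketch evaluates the ratio over a finite list of candidate queries (all names are illustrative; the true sensitivity maximizes over all queries, so this brute force only lower-bounds it and is not the talk's algorithm):

```python
def sensitivity(p, points, candidate_queries, loss):
    """Largest fraction of the total loss that p alone accounts for,
    maximized over the given candidate queries C.  loss(q, C) plays the
    role of the per-point loss D_W(q, C)."""
    best = 0.0
    for C in candidate_queries:
        total = sum(loss(q, C) for q in points)
        if total > 0:
            best = max(best, loss(p, C) / total)
    return best
```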

  16. Total sensitivity The total sensitivity T(P) is the sum of the sensitivities of all the points: T(P) = Σ_{p∈P} s(p). The total sensitivity of the problem is the maximum of T(P) over all input sets P. Total sensitivity ~ n: cannot have small core-sets. Total sensitivity constant or polylog: there may exist small core-sets.

  17. Small total sensitivity ⇒ small core-set

  18. Small total sensitivity ⇒ small core-set
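The mechanism behind this implication, in the sensitivity framework of [Langberg and S, SODA’11], is importance sampling: sample points with probability proportional to (an upper bound on) their sensitivity and reweight. A hedged Python sketch follows; the sample size m needed for an ε-approximation depends on the total sensitivity and the dimension, and that analysis belongs to the paper, not to this sketch:

```python
import random

def sample_coreset(points, sens, m):
    """Sensitivity-based sampling sketch (an illustration of the framework,
    not the paper's exact construction).  sens[p] is an upper bound on the
    sensitivity of p; the returned (point, weight) pairs give a weighted
    sample whose loss is, in expectation, the loss of the full data set."""
    total = sum(sens[p] for p in points)
    probs = [sens[p] / total for p in points]
    sample = random.choices(points, weights=probs, k=m)
    return [(p, total / (m * sens[p])) for p in sample]
```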

  19. The main thing we need to do in order to produce a small core-set for weighted-k-median: For each p ∈ P, compute a good upper bound on s(p) in amortized O(1) time per point. (The upper bound should be good enough that Σ_p s(p) stays small.)

  20. Algorithm for computing sensitivities Recursive-Robust-Median(P,k) • Input: • A set P of n points in a metric space • An integer k ≥ 1 • Output: • A subset Q ⊆ P of Ω(n/k^k) points We prove that any two points in Q can serve as each other’s surrogates w.r.t. any query. Hence each point p ∈ Q has sensitivity s(p) ≤ O(1/|Q|). Outer loop: call Recursive-Robust-Median(P,k), then set P := P − Q. Repeat until P is empty. Total sensitivity bound: T ≤ # calls to Recursive-Robust-Median ≤ k^k log n.
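Read literally, the outer loop amounts to the following Python sketch (the inner routine is left abstract; the function and variable names are assumptions made for illustration):

```python
def sensitivity_upper_bounds(P, k, recursive_robust_median):
    """Sketch of the outer loop above.  recursive_robust_median stands in for
    the talk's Recursive-Robust-Median routine, assumed to return a subset Q
    of size Omega(|P|/k^k) whose points can serve as each other's surrogates
    with respect to any query."""
    remaining = list(P)
    bounds = {}
    while remaining:
        Q = recursive_robust_median(remaining, k)
        for p in Q:
            bounds[p] = 1.0 / len(Q)   # each point of Q gets s(p) = O(1/|Q|)
        removed = set(Q)
        remaining = [p for p in remaining if p not in removed]
    return bounds
```

Each call contributes exactly 1 to the sum of the assigned bounds (|Q| points at 1/|Q| each), which is why the total sensitivity is bounded by the number of calls.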

  21. The algorithm to find the Ω(n)-size set Q:

  22. Recursive-Robust-Median: illustration

  23. Recursive-Robust-Median: illustration

  24. A detail Actually it’s more complicated than described, because we can’t afford to look for a (1+ε)-approximation, or even a 2-approximation, to the best k-median of any b·n points (b constant). Instead look for a bicriteria approximation: a 2-approximation of the best k-median of any b·n/2 points. Linear-time algorithm from [F, Langberg STOC’11].

  25. High-level intuition for the correctness of Recursive-Robust-Median Consider any p in the “output” set Q. If for all queries C, D(p,C) is small, then p has low sensitivity. If there is a query C for which D(p,C) is large, then in that query all points of Q are assigned to the same center c ∈ C, and are closer to each other than to c; so they are surrogates.

  26. Thank you

  27. appendices

  28. Many examples of LgLgLp loss functions: Robust M-estimators in Statistics …
