
A Fast PTAS for k-Means Clustering


Presentation Transcript


  1. A Fast PTAS for k-Means Clustering. Dan Feldman, Tel Aviv University; Morteza Monemizadeh, Christian Sohler, Universität Paderborn

  2. Simple Coresets for Clustering Problems: Overview • Introduction • Weak coresets • Definition • Intuition • The construction • A sketch of the analysis • The k-means PTAS • Conclusions

  3. Introduction: Clustering • Clustering • Partition the input into sets (clusters) such that: - objects in the same cluster are similar - objects in different clusters are dissimilar • Goals • Simplification • Discovery of patterns • Procedure • Map the objects to Euclidean space => point set P • Points in the same cluster are close • Points in different clusters are far away from each other

  4. Introduction: k-Means Clustering • Clustering with prototypes • One prototype (center) for each cluster • k-means clustering • k clusters C_1, …, C_k • One center c_i for each cluster C_i • Minimize $\sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)^2$

  7. Introduction: Simplification / Lossy Compression • [Figure: an image is compressed by replacing each pixel's color with the center of its cluster, e.g. the RGB values (128,59,88) and (218,181,163)]

  10. Introduction: Properties of k-Means • Properties of k-means • The solution is optimal if • the centers are given: assign each point to its nearest center • the clusters are given: take the centroid (mean) of each cluster • Notation: cost(P,C) denotes the cost of the solution defined this way
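
These two facts are easy to make concrete in code. Below is a minimal NumPy sketch (the function names are mine, not from the talk): given centers, the optimal assignment maps each point to its nearest center; given an assignment, the optimal center of each cluster is its centroid; and cost(P,C) as defined above falls out of the first step.

    import numpy as np

    def assign_to_nearest(P, C):
        # Given fixed centers C, the optimal clustering assigns
        # each point of P to its nearest center.
        d2 = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    def optimal_centers(P, labels, k):
        # Given a fixed clustering, the optimal center of each
        # (non-empty) cluster is its centroid, i.e. its mean.
        return np.array([P[labels == i].mean(axis=0) for i in range(k)])

    def cost(P, C):
        # cost(P, C): sum of squared distances of the points in P
        # to their respective nearest centers in C.
        d2 = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        return d2.min(axis=1).sum()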

  15. Weak Coresets: Centroid Sets • Definition (ε-approximate centroid set) • A set S is called an ε-approximate centroid set if it contains a subset C ⊆ S of k centers such that cost(P,C) ≤ (1+ε) · cost(P,Opt) • Lemma [KSS04] • The centroid of a random sample of 2/ε points of P is, with constant probability, a (1+ε)-approximation of the optimal center of P • Corollary • The set of all centroids of subsets of 2/ε points is an ε-approximate centroid set
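
The [KSS04] lemma is easy to observe empirically. A small sketch, assuming uniform sampling with repetition (the helper name kss_candidate_center is mine); for 1-means the true optimum is the centroid of all of P, so we can compare directly:

    import numpy as np

    def kss_candidate_center(P, eps, rng):
        # Centroid of ceil(2/eps) points drawn uniformly at random
        # (with repetition) from P; by [KSS04] this is a
        # (1+eps)-approximate 1-means center with constant probability.
        m = int(np.ceil(2.0 / eps))
        return P[rng.choice(len(P), size=m, replace=True)].mean(axis=0)

    rng = np.random.default_rng(0)
    P = rng.normal(size=(1000, 5))
    opt = P.mean(axis=0)                      # optimal 1-means center of P
    cand = kss_candidate_center(P, eps=0.1, rng=rng)
    ratio = ((P - cand) ** 2).sum() / ((P - opt) ** 2).sum()
    print(f"cost ratio vs. optimum: {ratio:.4f}")   # typically close to 1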

  16. Weak Coresets: Definition • Definition (weak ε-coreset for k-means) • A pair (K,S) is called a weak ε-coreset for P if, for every set C of k centers from the ε-approximate centroid set S, we have • (1-ε) · cost(P,C) ≤ cost(K,C) ≤ (1+ε) · cost(P,C) • [Figure: the point set P (light blue); the set of solutions S (yellow); a possible coreset with weights (red), e.g. 3, 4, 5, 5, 4; the coreset approximates the cost of any k centers (violet) from S]

  20. Weak Coresets: Ideal Sampling • Problem • Given n numbers a_1, …, a_n > 0 • Task: approximate A := Σ a_i by random sampling • Ideal sampling • Assign weights w_1, …, w_n to the numbers, with w_j = A / a_j • Draw an index x with Pr[x = j] = a_j / A • Estimator: w_x · a_x • Properties of the estimator: (1) w_x · a_x = A always (zero variance), (2) the expected weight of number j is Pr[x = j] · w_j = 1 • Only problem: the weights can be very large
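
A minimal sketch of ideal sampling (the function name ideal_sample is mine); it also makes the one problem visible: when a_x is tiny, its weight A / a_x blows up:

    import numpy as np

    def ideal_sample(a, rng):
        # Draw index x with Pr[x = j] = a_j / A and weight it
        # w_x = A / a_x, so that the estimate w_x * a_x is always
        # exactly A (zero variance) and E[weight of j] = 1.
        a = np.asarray(a, dtype=float)
        A = a.sum()
        x = rng.choice(len(a), p=a / A)
        return x, A / a[x]

    rng = np.random.default_rng(0)
    a = rng.random(10) + 1e-6
    x, w = ideal_sample(a, rng)
    print(w * a[x], a.sum())   # identical, for every drawn x
    print(w)                   # can be huge when a_x is tiny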

  23. Weak Coresets: Construction • Step 1 • Compute a constant-factor approximation

  24. Weak Coresets: Construction • Step 2 • Consider each cluster C with center c separately • Main idea: apply ideal sampling within each cluster, with a point's cost contribution d(p_i, c)² in the role of a_i: • Pr[p_i is taken] = d(p_i, c)² / cost(C, c) • w(p_i) = cost(C, c) / d(p_i, c)² • But what about high weights?

  28. Weak Coresets: Construction • Step 3 • A little twist • Sample uniformly from a small ball around the center, with radius = (average distance) / ε • Apply ideal sampling only to the 'outliers' outside the ball
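
Putting the pieces together, here is a per-cluster sketch under my reading of these slides (the function name, the sample-size parameter m, and the exact weight bookkeeping are mine; the paper's parameter choices differ):

    import numpy as np

    def coreset_for_cluster(P, c, eps, m, rng):
        # Points within radius (average distance)/eps of the center c
        # are sampled uniformly; the 'outliers' outside the ball are
        # sampled ideally, with a_p = d(p, c)^2. Returns points, weights.
        d = np.linalg.norm(P - c, axis=1)
        radius = d.mean() / eps          # Markov: >= (1-eps)-fraction inside
        inside, outside = P[d <= radius], P[d > radius]

        # Uniform sampling from the ball: each sample stands in for
        # |inside| / (#samples) original points.
        idx = rng.choice(len(inside), size=min(m, len(inside)), replace=True)
        pts = [inside[idx]]
        wts = [np.full(len(idx), len(inside) / len(idx))]

        # Ideal sampling from the outliers (averaged over the samples,
        # so the expected total weight of each outlier is still 1).
        if len(outside) > 0:
            a = np.linalg.norm(outside - c, axis=1) ** 2
            idx = rng.choice(len(outside), size=min(m, len(outside)), p=a / a.sum())
            pts.append(outside[idx])
            wts.append(a.sum() / (a[idx] * len(idx)))
        return np.vstack(pts), np.concatenate(wts)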

  30. Weak Coresets: Analysis • Fix an arbitrary set of centers K • Case (a): the nearest center is 'far away' • At least a (1-ε)-fraction of the points lies inside the ball, by the choice of its radius • The total weight of the samples from the outliers is at most ε|C|, so we can forget about the outliers • If the nearest center is at distance D and the ball has radius at most εD, it doesn't matter where inside the ball the points lie
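
The "(1-ε)-fraction" claim is just Markov's inequality applied to the distances; with avg denoting the average distance of the points of C to their center c and ball radius avg/ε:

    \[
      \Pr_{p \in C}\!\left[ d(p,c) > \frac{\mathrm{avg}}{\varepsilon} \right]
      \;\le\; \frac{\mathbb{E}_{p \in C}\,[\,d(p,c)\,]}{\mathrm{avg}/\varepsilon}
      \;=\; \varepsilon .
    \]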

  36. Weak Coresets: Analysis • Fix an arbitrary set of centers K • Case (b): the nearest center is 'near' • Here the construction behaves like almost ideal sampling • The expectation of the estimate is cost(C,K) • The variance is low

  38. Weak Coresets: Result • The centroid set • S is the set of all centroids of 2/ε points (with repetition) from our sample set K • One can show that K approximates all solutions from S • One can show that S is an ε-approximate centroid set w.h.p. • Theorem • One can compute in O(nkd) time a weak ε-coreset (K,S). The size of K is poly(k, 1/ε); S is the set of all centroids of subsets of K of size 2/ε.

  39. Weak Coresets: Applications • Fast-k-Means-PTAS(P, k) • Compute a weak coreset K • Project K onto a poly(1/ε, k)-dimensional space • Exhaustively search the (projection of the) centroid set for the best solution C • Return the centroids of the points that generate C • Running time: O(nkd + (k/ε)^Õ(k/ε))
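
A runnable outline of this procedure, as a sketch only: the coreset and projection steps below are trivial placeholders (a uniform sample and the identity map, NOT the paper's constructions), so that the exhaustive search over the centroid set, the source of the (k/ε)^Õ(k/ε) term, can be spelled out. Use toy parameters (small coreset, ε close to 1, small k), since the search is intentionally exhaustive:

    from itertools import combinations, combinations_with_replacement
    import numpy as np

    def fast_kmeans_ptas(P, k, eps, rng, coreset_size=10):
        # Placeholder for "compute weak coreset K": a uniform sample
        # with uniform weights (NOT the paper's construction).
        n_s = min(coreset_size, len(P))
        idx = rng.choice(len(P), size=n_s, replace=False)
        K, w = P[idx], np.full(n_s, len(P) / n_s)
        # Placeholder for "project K on a poly(1/eps, k)-dim space";
        # with a real projection, map the winning centroids back at the end.
        K_low = K

        # Centroid set: centroids of all size-(2/eps) multisets of
        # coreset points.
        m = int(np.ceil(2.0 / eps))
        cands = np.array([K_low[list(s)].mean(axis=0)
                          for s in combinations_with_replacement(range(len(K_low)), m)])

        # Exhaustive search over all k-subsets of the centroid set,
        # evaluated on the weighted coreset.
        best, best_cost = None, np.inf
        for C in combinations(range(len(cands)), k):
            centers = cands[list(C)]
            d2 = ((K_low[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            c = float((w * d2.min(axis=1)).sum())
            if c < best_cost:
                best, best_cost = centers, c
        return best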

  40. Summary • Weak coresets • of size independent of n and d • a fast PTAS for k-means • the first PTAS for kernel k-means (if the kernel maps into a finite-dimensional space)

  41. Thank you! Christian Sohler Heinz Nixdorf Institut & Institut für Informatik Universität Paderborn Fürstenallee 11 33102 Paderborn, Germany Tel.: +49 (0) 52 51/60 64 27 Fax: +49 (0) 52 51/62 64 82 E-Mail: csohler@upb.de http://www.upb.de/cs/ag-madh
