Clustering with k-means: faster, smarter, cheaper

Charles Elkan

University of California, San Diego

April 24, 2004

Acknowledgments
  • Funding from Sun Microsystems, with sponsor Dr. Kenny Gross.
  • Advice from colleagues and students, especially Sanjoy Dasgupta (UCSD), Greg Hamerly (Baylor University, starting Fall ’04), and Doug Turnbull.

Clustering is difficult!

Source: Patrick de Smet, University of Ghent

The standard k-means algorithm
  • Input: n points, a distance function d(), and the number k of clusters to find.
  • 1. Start with k centers.
  • 2. Compute d(each point x, each center c).
  • 3. For each x, find the closest center c(x).  [“ALLOCATE”]
  • 4. If no point has changed “owner” c(x), stop.
  • 5. Each c ← mean of the points owned by it.  [“LOCATE”]
  • 6. Repeat from step 2.  (A sketch in code follows.)
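
For concreteness, a minimal NumPy sketch of these six steps (Lloyd's algorithm with a Forgy-style random start; the function name `kmeans` and the optional `init` argument are mine, added so later sketches can restart it from given centers):

```python
import numpy as np

def kmeans(X, k, init=None, rng=np.random.default_rng(0), max_iter=100):
    """Standard (Lloyd's) k-means: alternate ALLOCATE and LOCATE steps."""
    # Step 1: start with k centers (given, or k random data points).
    centers = init if init is not None else X[rng.choice(len(X), size=k, replace=False)]
    owner = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: compute d(each point x, each center c).
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Step 3: ALLOCATE - for each x, find the closest center c(x).
        new_owner = dist.argmin(axis=1)
        # Step 4: stop if no point changed owner.
        if np.array_equal(new_owner, owner):
            break
        owner = new_owner
        # Step 5: LOCATE - move each center to the mean of its points
        # (keep the old center if a cluster is empty).
        centers = np.array([X[owner == c].mean(axis=0) if np.any(owner == c)
                            else centers[c] for c in range(k)])
        # Step 6: repeat from step 2.
    return centers, owner
```
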
Observations
  • Theorem: If d() is Euclidean, then k-means converges monotonically to a local minimum of the within-class squared distortion Σ_x d(c(x), x)².  (A small helper for this quantity follows the list.)
  • Many variants, complex history since 1956, over 100 papers per year currently
  • Iterative, related to expectation-maximization (EM)
  • # of iterations to converge grows slowly with n, k, d
  • No accepted method exists to discover k.
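
The distortion in the theorem is cheap to evaluate; a small helper used by the later sketches to compare solutions (illustrative name):

```python
import numpy as np

def distortion(X, centers, owner):
    """Within-class squared distortion: sum over x of d(c(x), x)^2 (Euclidean)."""
    return float(((X - centers[owner]) ** 2).sum())
```
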
We want to …
  • (1) … make the algorithm faster.
  • (2) … find lower-cost local minima. (Finding the global optimum is NP-hard.)
  • (3) … choose the correct k intelligently.
  • With success at (1), we can try more alternatives for (2).
  • With success at (2), comparisons for different k are less likely to be misleading.
Standard initialization methods
  • Forgy initialization: choose k points at random as starting center locations.
  • Random partitions: divide the data points randomly into k subsets.
  • Both these methods are bad. (Sketches of both follow, for reference.)
  • E. Forgy. Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics, 21(3):768, 1965.
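
For reference, both baselines are a few lines each. A hedged sketch; the random-partition version assumes every one of the k subsets is non-empty, which holds with high probability when n ≫ k:

```python
import numpy as np

def forgy_init(X, k, rng=np.random.default_rng(0)):
    """Forgy: choose k data points at random as the starting centers."""
    return X[rng.choice(len(X), size=k, replace=False)]

def random_partition_init(X, k, rng=np.random.default_rng(0)):
    """Random partitions: assign points to k random subsets, use subset means.
    Assumes each subset is non-empty (true w.h.p. for n >> k)."""
    labels = rng.integers(0, k, size=len(X))
    return np.array([X[labels == c].mean(axis=0) for c in range(k)])
```
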
Smarter initialization
  • The “furthest first” algorithm (FF):
  • Pick first center randomly.
  • Next is the point furthest from the first center.
  • Third is the point furthest from both previous centers.
  • In general: next center is argmax_x min_c d(x, c).  (A sketch in code follows.)
  • D. Hochbaum, D. Shmoys. A best possible heuristic for the k-center problem, Mathematics of Operations Research, 10(2):180-184, 1985.
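
A direct sketch of furthest-first with the argmax_x min_c d(x, c) rule (Euclidean distance; illustrative names):

```python
import numpy as np

def furthest_first(X, k, rng=np.random.default_rng(0)):
    """Furthest-first traversal: each new center maximizes min_c d(x, c)."""
    centers = [X[rng.integers(len(X))]]            # first center: a random point
    min_dist = np.linalg.norm(X - centers[0], axis=1)
    for _ in range(k - 1):
        centers.append(X[min_dist.argmax()])        # argmax_x min_c d(x, c)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - centers[-1], axis=1))
    return np.array(centers)
```
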
(Figures: furthest-first initialization.)

Subset furthest-first (SFF)
  • FF finds outliers, by definition not good cluster centers!
  • Can we choose points far apart and typical of the dataset?
  • Idea:  A random sample includes many representative points, but few outliers.
  • But: How big should the random sample be?
  • Lemma: Given k equal-size sets and c > 1, with high probability ck log k random points intersect each set.  (A sketch in code follows.)
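
Combining the lemma with FF gives one possible SFF sketch; the constant c = 2 is an illustrative choice, and `furthest_first` is the sketch above:

```python
import numpy as np

def subset_furthest_first(X, k, c=2.0, rng=np.random.default_rng(0)):
    """SFF: run furthest-first on a random sample of ~c*k*log(k) points,
    so outliers are unlikely to be picked as centers."""
    m = min(len(X), int(np.ceil(c * k * np.log(max(k, 2)))))
    sample = X[rng.choice(len(X), size=m, replace=False)]
    return furthest_first(sample, k, rng)   # furthest_first from the earlier sketch
```
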
Comparing initialization methods

(Results table: entries are percentages worse than the best clustering known, e.g. 218 means 218% worse. Lower is better.)

How to find lower-cost local minima
  • Random restarts, even initialized well, are inadequate.
  • The “central limit catastrophe”: almost all local minima are merely of average quality.
  • K. D. Boese, A. B. Kahng, & S. Muddu, A new adaptive multi-start technique for combinatorial global optimizations. Operations Research Letters 16 (1994) 101-113.
  • The art of designing a local search algorithm: defining a neighborhood rich in improving candidate moves.
Our local search method
  • k-means alternates two guaranteed-improvement steps: “allocate” and “locate.” 
  • Sadly, we know of no other guaranteed-improvement steps.
  • So, we perform non-guaranteed “jump” operations: delete an existing center and create a new center at a data point.  (A sketch in code follows.)
  • After each “jump”, run k-means to convergence, starting with an “allocate” step.
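
A jump, as described, can be sketched as a one-center replacement followed by a rerun of k-means (`kmeans` is the earlier sketch, which begins with an allocate step):

```python
def jump(X, centers, remove_idx, add_point):
    """One 'jump': delete the center at remove_idx and create a new center
    at the data point add_point, then run k-means to convergence."""
    new_centers = centers.copy()
    new_centers[remove_idx] = add_point          # delete + create in one move
    return kmeans(X, len(new_centers), init=new_centers)
```
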
(Figures: an example of adding a center and of removing a center.)

Theory versus practice
  • Theorem:  Let C be a set of centers such that no “jump” operation improves the value of C.  Then C is at most 25 times worse than the global optimum.
  • T. Kanungo et al. A local search approximation algorithm for clustering, ACM Symposium on Computational Geometry, 2002.
  • Our aim: Find heuristics to identify “jump” steps that are likely to be good.
  • Experiments indicate we can solve problems with up to 2000 points and 20 centers optimally.
An upper bound ...
  • Lemma 1: The maximum loss from removing center c is n·d(b’,c)² + m·d(b’,b)², with b, b’, m, n as defined below.
  • Proof:
  • Suppose b is the center closest to c; let B and C be the subsets owned by b and c, with m = |B| and n = |C|.
  • If B and C merge, the new center is b’ = (mb + nc)/(m + n).
  • Because c is the mean of C, for any z, Σ_{x∈C} d(z,x)² = Σ_{x∈C} d(c,x)² + n·d(z,c)².
  • So the loss from the merge is n·d(b’,c)² + m·d(b’,b)².
  • This computation is cheap, so we do it for every center.
… and a lower bound
  • Suppose we add a new center at point z.
  • Lemma 2: The gain from adding a center at z is at least
  • Σ_{x : d(x,c(x)) > d(x,z)} [ d(x,c(x))² − d(x,z)² ].
  • This computation is more expensive, so we do it for only 2k log k random candidates z.
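
Both bounds are easy to compute in the Euclidean case. A sketch with illustrative names; `owner` holds the index of c(x) for each point:

```python
import numpy as np

def removal_loss(centers, owner, c):
    """Lemma 1 (upper bound): loss from deleting center c and merging its
    cluster C into the cluster of the nearest other center b."""
    cc = np.linalg.norm(centers - centers[c], axis=1)
    cc[c] = np.inf
    b = cc.argmin()                                   # nearest other center
    n = int((owner == c).sum())
    m = int((owner == b).sum())
    b_new = (m * centers[b] + n * centers[c]) / (m + n)
    return (n * np.linalg.norm(b_new - centers[c]) ** 2
            + m * np.linalg.norm(b_new - centers[b]) ** 2)

def addition_gain(X, centers, owner, z):
    """Lemma 2 (lower bound): gain from adding a new center at point z,
    counting only points that are closer to z than to their current owner."""
    d_own = np.linalg.norm(X - centers[owner], axis=1)
    d_z = np.linalg.norm(X - z, axis=1)
    improve = d_z < d_own
    return float((d_own[improve] ** 2 - d_z[improve] ** 2).sum())
```
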
Sometimes a jump should only be a jiggle
  • How to use Lemmas 1 and 2:
  • delete the center with smallest maximum loss,
  • make new center at point with greatest minimum gain.
  • This procedure identifies good global improvements.
  • Small-scale improvements come from “jiggling” the center of an existing cluster: moving the center to a point inside the same cluster.
jj-means: the smarter k-means algorithm
  • Run k-means with SFF initialization.
  • Repeat:
    • While improvement do: try the best jump according to Lemmas 1 and 2.
    • Until improvement do: try a random jiggle.
  • “Try” means run k-means to convergence afterwards.
  • Insert occasional random jumps to satisfy the theorem.  (A sketch in code follows.)
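
One way to assemble the earlier sketches (`subset_furthest_first`, `kmeans`, `distortion`, `removal_loss`, `addition_gain`, `jump`) into the loop above. The fixed round count, the jiggle patience, and the omission of the theorem-satisfying random jumps are simplifications of mine:

```python
import numpy as np

def jj_means(X, k, n_rounds=20, rng=np.random.default_rng(0)):
    """Sketch of jj-means: SFF init, then best jumps while they improve,
    then random jiggles until one improves; repeat."""
    centers, owner = kmeans(X, k, init=subset_furthest_first(X, k, rng=rng))
    best = distortion(X, centers, owner)
    for _ in range(n_rounds):
        # While improvement: try the best jump suggested by Lemmas 1 and 2.
        improved = True
        while improved:
            losses = [removal_loss(centers, owner, c) for c in range(k)]
            cands = X[rng.choice(len(X), size=int(2 * k * np.log(max(k, 2))))]
            gains = [addition_gain(X, centers, owner, z) for z in cands]
            c_out = int(np.argmin(losses))          # smallest maximum loss
            z_in = cands[int(np.argmax(gains))]     # greatest minimum gain
            new_centers, new_owner = jump(X, centers, c_out, z_in)
            new_cost = distortion(X, new_centers, new_owner)
            improved = new_cost < best
            if improved:
                centers, owner, best = new_centers, new_owner, new_cost
        # Until improvement: try random jiggles (move one center to a random
        # point inside its own cluster), keeping the first one that helps.
        for _ in range(10):                         # illustrative patience
            c = rng.integers(k)
            members = np.where(owner == c)[0]
            if members.size == 0:
                continue
            trial = centers.copy()
            trial[c] = X[rng.choice(members)]
            t_centers, t_owner = kmeans(X, k, init=trial)
            t_cost = distortion(X, t_centers, t_owner)
            if t_cost < best:
                centers, owner, best = t_centers, t_owner, t_cost
                break
    return centers, owner, best
```
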
Results with 1000 points, 8 dimensions, 10 centers

Conclusion: Running the search 10× longer is both faster and better than restarting it 10 times.

Goal: Make k-means faster, but with same answer
  • Allow any black-box d() and any initialization method.
  • In later iterations, little movement of centers.
  • Distance calculations use the most time.
  • Geometrically, these are mostly redundant.

(Figure source: D. Pelleg.)

Let x be a point, c(x) its owner, and c a different center.
  • If we already know d(x,c) ≥ d(x,c(x)),
  • then computing d(x,c) precisely is not necessary.
  • Strategy: Use the triangle inequality d(x,z) ≤ d(x,y) + d(y,z) to get sufficient conditions for d(x,c) ≥ d(x,b).
  • kd-trees are useful up to ~10 dimensions.
  • Distance-based data structures can be better.
  • Our approach is adaptive.
Lemma 1: Let x be a point, and let b and c be centers.
  • If d(b,c) ≥ 2d(x,b), then d(x,c) ≥ d(x,b).
  • Proof: We know d(b,c) ≤ d(b,x) + d(x,c). So d(b,c) − d(x,b) ≤ d(x,c). Now d(b,c) − d(x,b) ≥ 2d(x,b) − d(x,b) = d(x,b). So d(x,b) ≤ d(x,c).


Lemma 2: Let x be a point, and let b and c be centers.
  • Then d(x,c) ≥ max [ 0, d(x,b) − d(b,c) ].
  • Proof: We know d(x,b) ≤ d(x,c) + d(b,c),
  • so d(x,c) ≥ d(x,b) − d(b,c).
  • Also d(x,c) ≥ 0.


How to use Lemma 1
  • Let c(x) be the owner of point x and c' another center:
  • compute d(x,c') only if
  • d(x,c(x)) > ½ d(c(x),c').
  • If we know an upper bound u(x) ≥ d(x,c(x)):
  • compute d(x,c') and d(x,c(x)) only if
  • u(x) > ½ d(c(x),c').
  • If u(x) ≤ ½ min_{c' ≠ c(x)} [ d(c(x), c') ]:
  • eliminate all distance calculations for x.
How to use Lemma 2
  • Let x be any point and c any center,
  • and let c’ be c at the previous iteration.
  • Assume a previous lower bound: d(x,c’) ≥ l'.
  • Then we get a new lower bound for the current iteration:
  • d(x,c) ≥ max [ 0, d(x,c’) − d(c,c’) ]
  • ≥ max [ 0, l' − d(c,c’) ].
  • If l' is a good approximation and the center only moves slightly, then we get a good updated approximation.
Pick initial centers c.
  • For all x and c, compute d(x,c):
    • Initialize lower bounds l(x,c) ← d(x,c)
    • Initialize upper bounds u(x) ← min_c d(x,c)
    • Initialize ownership c(x) ← argmin_c d(x,c)
  • Repeat until convergence:
    • Find all x s.t. u(x) ≤ ½ min_{c' ≠ c(x)} [ d(c(x), c') ]
    • For each remaining x and c ≠ c(x) s.t.
          • u(x) > l(x,c)
          • u(x) > ½ d(c(x), c)
      • Compute d(x,c) and d(x,c(x))
      • If d(x,c) < d(x,c(x)), then change owner c(x) ← c
      • Update l(x,c) ← d(x,c) and u(x) ← d(x,c(x))
    • For each c, m(c) ← mean of points owned by c
    • For each x and c, update l(x,c) ← max [ 0, l(x,c) − d(m(c),c) ]
    • For each x, update u(x) ← u(x) + d(c(x), m(c(x)))
    • Update each center c ← m(c)
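
The pseudocode above translates fairly directly to NumPy in the Euclidean case. The sketch below is an illustrative rendering (names such as `elkan_kmeans`, `lower`, and `u` are mine), and it simplifies one point: it re-tightens u(x) for all non-skipped points once per iteration rather than lazily per comparison:

```python
import numpy as np

def elkan_kmeans(X, centers, max_iter=100):
    """Sketch of k-means accelerated with the two triangle-inequality lemmas
    (Euclidean distance, dense NumPy arrays)."""
    n, k = X.shape[0], centers.shape[0]
    centers = centers.astype(float).copy()

    # Initialization: one full pass of exact distances gives exact bounds.
    d_all = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    owner = d_all.argmin(axis=1)                 # c(x)
    u = d_all[np.arange(n), owner]               # u(x) >= d(x, c(x))
    lower = d_all                                # l(x, c) <= d(x, c)

    for _ in range(max_iter):
        # Inter-center distances and s(c) = 1/2 * min_{c' != c} d(c, c').
        cc = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
        np.fill_diagonal(cc, np.inf)
        s = 0.5 * cc.min(axis=1)

        # Points with u(x) <= s(c(x)) need no distance computations at all.
        active = np.where(u > s[owner])[0]

        # Tighten u(x) for the remaining points by recomputing d(x, c(x)).
        d_own = np.linalg.norm(X[active] - centers[owner[active]], axis=1)
        u[active] = d_own
        lower[active, owner[active]] = d_own

        for c in range(k):
            # Only consider points where center c could beat the current owner.
            mask = ((owner[active] != c)
                    & (u[active] > lower[active, c])
                    & (u[active] > 0.5 * cc[owner[active], c]))
            idx = active[mask]
            if idx.size == 0:
                continue
            d_xc = np.linalg.norm(X[idx] - centers[c], axis=1)
            lower[idx, c] = d_xc
            switch = d_xc < u[idx]
            owner[idx[switch]] = c
            u[idx[switch]] = d_xc[switch]

        # LOCATE step: new centers m(c); keep the old center if a cluster empties.
        new_centers = np.array([X[owner == c].mean(axis=0) if np.any(owner == c)
                                else centers[c] for c in range(k)])

        # Update all bounds using how far each center moved (Lemma 2).
        shift = np.linalg.norm(new_centers - centers, axis=1)
        lower = np.maximum(0.0, lower - shift[None, :])
        u = u + shift[owner]
        centers = new_centers
        if np.all(shift == 0.0):
            break

    return centers, owner
```
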
Notes on the new algorithm
  • Empirical issue: which checks to do in which order.
  • Implement “for each remaining x and c” by looping over c, with vectorized code processing all x together.
  • Or, sequentially scan x and l(x,c) from disk.
  • Obvious initialization computes O(nk) distances. Faster methods give inaccurate l(x,c) and u(x), hence may do more distance calculations later.
Experimental observations
  • Natural clusters are found while computing the distance between each point and each center fewer than once on average!
  • We find k = 100 clusters in n = 100,000 covtype points with 7,353,400 < nk = 10,000,000 distance calculations.
  • Number of distance calculations is o(kc), because later iterations compute very few distances.
Current limitations
  • Computing distances is no longer the dominant cost.
  • Reason: After each iteration, we
  • update nk lower bounds l(x,c)
  • use O(kd) time to recompute k means
  • use O(k²d) time to recompute all inter-center distances
  • Moreover, we can approximate distances in o(d) time, by considering the largest dimensions first.
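
As an illustration of the last bullet only (not a method specified in the talk): one common way to exploit a partial distance computation is to scan coordinates in a fixed order, for example largest-variance first, and stop as soon as the accumulated squared difference already exceeds a threshold such as the current upper bound u(x). The function name and the ordering choice below are assumptions:

```python
import numpy as np

def partial_distance_exceeds(x, c, threshold, order):
    """Hypothetical early-exit test: accumulate squared coordinate differences
    in a precomputed order (e.g. order = np.argsort(-X.var(axis=0))) and stop
    once the partial sum already exceeds threshold**2."""
    t2 = threshold ** 2
    acc = 0.0
    for j in order:
        acc += (x[j] - c[j]) ** 2
        if acc > t2:
            return True          # d(x, c) > threshold for certain
    return False                 # d(x, c) <= threshold (acc is now exact)
```
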
Deeper questions
  • What is the minimum # of distance calculations needed?
    • Adversary argument? If some calculations are omitted, an opponent can choose their values to make any clustering algorithm’s output incorrect.
  • Can we extend to clustering with general Bregman divergences?
  • Can we extend to soft-assignment clustering? Via lower and upper bounds on weights?