1. Analyzing and Improving Local Search: k-means and ICP
David Arthur
Special University Oral Exam
Stanford University
2. What is this talk about? Two popular but poorly understood algorithms
Fast in practice
Nobody knows why
Find (highly) sub-optimal solutions
The big questions:
What makes these algorithms fast?
How can the solutions they find be improved?
3. What is k-means? Divides point-set into k tightly-packed clusters
4. What is ICP (Iterative Closest Point)? Finds a subset of A similar to B
5. Main outline Focus on k-means in talk
Goals:
Understand running time
Harder than you might think!
Worst case = exponential
Smoothed = polynomial
Find better clusterings
k-means++ (modification of k-means)
Provably near-optimal
In practice: faster and more accurate than the competition
6. Main outline What is k-means?
k-means worst-case complexity
k-means smoothed complexity
k-means++
7. Main outline What is k-means?
What exactly is being solved?
Why this algorithm?
How does it work?
k-means worst-case complexity
k-means smoothed complexity
k-means++
8. The k-means problem Input:
An integer k
A set X of n points in R^d
Task:
Partition the points into k clusters C1, C2, ..., Ck
Also choose centers c1, c2, ..., ck for the clusters
9. The k-means problem Input:
An integer k
A set X of n points in R^d
Task:
Partition the points into k clusters C1, C2, ..., Ck
Also choose centers c1, c2, ..., ck for the clusters
Minimize the objective function:
$f = \sum_{x \in X} \lVert x - c(x) \rVert^2$
where c(x) is the center of the cluster containing x
Similar to variance
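As a concrete reference, here is that objective as a minimal NumPy sketch (the function name and array layout are illustrative, not from the talk):

```python
import numpy as np

def kmeans_potential(points, centers):
    """f = sum over x of the squared distance from x to its closest center.

    points: (n, d) array of data points; centers: (k, d) array.
    """
    # (n, k) matrix of squared distances from every point to every center
    sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # each point contributes the squared distance to its nearest center
    return sq_dists.min(axis=1).sum()
```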
10. The k-means problem Problem is NP-hard
Even when k = 2 (Drineas et al., 04)
Even when d = 2 (Mahajan et al., 09)
Some (1+ε)-approximation algorithms are known:
Example running times:
$O(n + k^{k+2}\,\varepsilon^{-(2d+1)k}\,\log^{k+1} n \cdot \log^{k+1}(1/\varepsilon))$
(Har-Peled and Mazumdar, 04)
$O(2^{(k/\varepsilon)^{O(1)}}\, d\, n)$
(Kumar et al., 04)
All exponential (or worse) in k
11. The k-means problem An example real-world data set:
From the UC-Irvine Machine Learning Repository
Looking to detect malevolent network connections:
n=494,021
k=100
d=38
(1+ε)-approximation algorithms are too slow for this!
12. k-means method The fast way:
k-means method (Lloyd 82, MacQueen 67)
"By far the most popular clustering algorithm used in scientific and industrial applications" (Berkhin, 02)
13–31. k-means method (animation frames)
Start with k arbitrary centers {ci}
(In practice: chosen at random from the data points)
Assign points to clusters Ci based on the closest ci
Set each ci to be the center of mass of Ci
Repeat the last two steps until stable
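A minimal NumPy sketch of the loop above (names are illustrative; as one simplifying assumption, a cluster that goes empty keeps its old center):

```python
import numpy as np

def kmeans(points, k, rng=np.random.default_rng()):
    points = np.asarray(points, dtype=float)
    # Start with k arbitrary centers (in practice: random data points)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    while True:
        # Assign points to clusters based on the closest center
        sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # Set each center to the center of mass of its cluster
        new_centers = centers.copy()
        for i in range(k):
            members = points[labels == i]
            if len(members):  # keep the old center if a cluster goes empty
                new_centers[i] = members.mean(axis=0)
        # Repeat the last two steps until stable
        if np.allclose(new_centers, centers):
            return labels, centers
        centers = new_centers
```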
32. What is known already Number of iterations:
Finite!
Each iteration decreases f
In practice:
Sub-linear (Duda, 00)
Worst-case:
Lower bound: Ω(n) (Har-Peled and Sadri, 04)
Upper bound: min(O(n^{3kd}), O(k^n)) (Inaba et al., 00)
Very large gap!
33. What is known already Accuracy:
Only finds a local optimum
A local optimum can be arbitrarily bad
i.e., f / fOPT is unbounded
Even with high probability in 1 dimension
Even in natural examples: well-separated Gaussians
34. Main outline What is k-means?
k-means worst-case complexity*
Number of iterations can be:
Super-polynomial!
k-means smoothed complexity
k-means++
* How slow is the k-means method? (Arthur and Vassilvitskii, 06)
35. Worst-case overview Recursive construction:
Start with input X (goes from A to B in T iterations)
Then modify X:
Add 1 dimension, O(k) points, O(1) clusters
Old part of input still goes from A to B in T iterations
New part: Resets everything once
36. Worst-case overview Recursive construction:
Repeat m times:
O(m²) points
O(m) clusters
2^m iterations
Lower bound follows: 2^{Ω(√n)} iterations
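Spelling out the arithmetic of that last step (a worked step, not on the original slide):

```latex
% With m repetitions: n = \Theta(m^2) points, and the run takes 2^m iterations.
n = \Theta(m^2)
\quad\Longrightarrow\quad
\#\text{iterations} \;\ge\; 2^{m} \;=\; 2^{\Omega(\sqrt{n})}.
```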
37–61. Recursive construction (figure slides)
Overview, then a trace of iterations t = 0 ... T and t = T+1 through t = T+4; each iteration assigns points to clusters, then recomputes the centers.
62–63. Worst-case complexity summary k-means can require 2^{Ω(√n)} iterations
For random centers: even with high probability
Even when d=2 (Vattani, 09)
(d=1 is open)
ICP:
Can require Ω((n/d)^d) iterations
Similar (but easier) argument
64. Main outline What is k-means?
k-means worst-case complexity
k-means smoothed complexity*
Take an arbitrary input, but randomly perturb it
Expected number of iterations is polynomial
Works for any k, d
k-means++
* k-means has polynomial smoothed complexity (Arthur, Manthey, and Röglin, 09)
65. The problem with worst-case complexity What is the problem?
k-means has bad worst-case complexity
But is not actually slow in practice
Need a different model to understand real world
One simple explanation, average-case analysis:
But real-world data is not random
66. The problem with worst-case complexity A better explanation:
Smoothed analysis (Spielman and Teng, 01)
Between average case and worst case
Perturb each point by a normal distribution with variance σ²
Show the expected running time is polynomial in n and D/σ
D = diameter of the point-set
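The perturbation model itself is one line of NumPy; a sketch, assuming `X` is the (n, d) input array and `sigma` the perturbation scale:

```python
import numpy as np

def smooth(X, sigma, rng=np.random.default_rng()):
    # Add independent Gaussian noise (variance sigma^2) to every coordinate
    return X + rng.normal(scale=sigma, size=X.shape)
```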
67–69. Proof overview Recall the potential function:
$f = \sum_{x \in X} \lVert x - c(x) \rVert^2$
X is the set of all data points
c(x) is the corresponding cluster center
Bound f:
f ≤ nD² initially
Will prove f is very likely to drop by at least ε² each iteration
Gives: # of iterations is at most n(D/ε)²
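The arithmetic behind that bound, written out:

```latex
f_0 \le nD^2, \qquad f_{t+1} \le f_t - \varepsilon^2
\quad\Longrightarrow\quad
\#\text{iterations} \;\le\; \frac{nD^2}{\varepsilon^2} \;=\; n\left(\frac{D}{\varepsilon}\right)^{2}.
```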
70. The easy approach Do union bound over all possible k-means steps
What defines a step?
Original clustering A (≤ k^n choices)
Actually ≤ n^{3kd} choices (Inaba et al., 00)
Resulting clustering B
Total number of possible steps ≤ n^{6kd}
Probability a fixed step can be bad:
Bounded by the probability that A and B have near-identical f
Probability ≤ (ε/σ)^d
71. The easy approach The argument:
P[k-means takes more than n(D/ε)² iterations]
≤ P[there exists a possible bad step]
≤ (# of possible steps) · P[step is bad]
≤ n^{6kd} · (ε/σ)^d
= small... if ε ≤ σ · (1/n)^{O(k)}
Resulting bound: n^{O(k)} iterations
Not polynomial!
(Arthur and Vassilvitskii, 06), (Manthey and Röglin, 09)
72–75. How can this be improved? (figure frames) Union bound is wasteful!
These two k-means steps can be analyzed together:
76. How can this be improved? A transition blueprint:
Which points switched clusters
Approximate positions for relevant centers
Bonus: Most approximate centers determined by above!
Non-obvious facts:
m = number of points switching clusters
# of transition blueprints ≤ (nk²)^m · (D/ε)^{O(m)}
P[blueprint is bad] ≤ (ε/σ)^m
(for most blueprints)
77. A good approach The new argument:
P[k-means takes more than n(D/ε)² iterations]
≤ P[there exists a possible bad blueprint]
≤ (# of possible blueprints) · P[blueprint is bad]
≤ (nk²)^m · (D/ε)^{O(m)} · (ε/σ)^m
= small... if ε ≤ σ · (σ/nD)^{O(1)}
Resulting bound: polynomially many iterations!
78. Smoothed complexity summary Smoothed complexity is polynomial
Still have work to do: O(n^{26})
Getting tight exponents in smoothed analysis is hard
Original theorem for Simplex: O(n^{96})!
ICP:
Also polynomial smoothed complexity
Much easier argument!
79. Main outline What is k-means?
k-means worst-case complexity
k-means smoothed complexity
k-means++*
What's wrong with k-means?
What's k-means++?
O(log k)-competitive with OPT
Experimental results
* k-means++: The advantages of careful seeding (Arthur and Vassilvitskii, 07)
80. What's wrong with k-means? Recall:
Only finds a local optimum
Local optimum can be arbitrarily bad
i.e., f / fOPT unbounded
Even with high probability in 1 dimension
Even in natural examples: well-separated Gaussians
81–83. What's wrong with k-means? (figure frames) k-means locally optimizes a clustering
But it can miss the big picture!
84. The solution Easy way to fix this mistake:
Make centers far apart
85. k-means++ The right way of choosing initial centers
Choose first center uniformly at random
86. k-means++ The right way of choosing initial centers
Choose first center uniformly at random
Repeat until k centers:
Add a new center
Choose x₀ with probability $D(x_0)^2 / \sum_{x \in X} D(x)^2$
D(x) = distance between x and the closest existing center
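A minimal NumPy sketch of this D²-weighted seeding (function and variable names are illustrative):

```python
import numpy as np

def kmeans_pp_seed(points, k, rng=np.random.default_rng()):
    points = np.asarray(points, dtype=float)
    # Choose the first center uniformly at random
    centers = [points[rng.integers(len(points))]]
    # D(x)^2: squared distance from x to the closest chosen center
    d2 = ((points - centers[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        # Add a new center: pick x with probability D(x)^2 / sum of D(x)^2
        centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])
        # Update each D(x)^2 against the newly added center
        d2 = np.minimum(d2, ((points - centers[-1]) ** 2).sum(axis=1))
    return np.array(centers)
```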
87–91. k-means++ Example: (figure frames)
92. Theoretical guarantee Claim:
E[f] ≤ O(log k) · fOPT
The guarantee holds as soon as the initial centers are picked
But running k-means steps afterwards still helps in practice
93. Proof idea Let C1, C2, ..., Ck be OPT clusters
Points in Ci contribute:
fOPT(Ci) to OPT potential
fkm++(Ci) to k-means++ potential
Lemma: If we pick a center from Ci:
E[fkm++(Ci)] ≤ 8 · fOPT(Ci)
Proof: Linear algebra magic
True for any reasonable probability distribution
94. Proof idea Real danger is we waste center on already covered Ci
Probability of choosing covered cluster:
fcurrent(covered clusters) / fcurrent
≤ 8 fOPT / fcurrent
Cost of choosing one:
If t uncovered clusters are left:
a 1/t fraction of fcurrent becomes unfixable
Cost ≈ fcurrent / t
Expected cost:
≤ (8 fOPT / fcurrent) · (fcurrent / t) = 8 fOPT / t
95. Proof idea Cost over all k steps:
≤ 8 fOPT · (1/1 + 1/2 + ... + 1/k) = O(fOPT · log k)
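Written out (with the lemma's factor of 8 kept explicit), the per-step waste sums to a harmonic series:

```latex
\mathbf{E}[f] \;\le\; 8 f_{\mathrm{OPT}} \left( \frac{1}{1} + \frac{1}{2} + \cdots + \frac{1}{k} \right)
\;=\; 8 f_{\mathrm{OPT}} \, H_k \;=\; O(\log k) \, f_{\mathrm{OPT}}.
```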
96–97. k-means++ vs k-means (accuracy charts)
* Values above 1 indicate k-means++ is out-performing k-means
98. Other algorithms? Theory community has proposed other reasonable algorithms too
Iterative swapping best in practice (Kanungo et al., 04)
Theoretically an O(1)-approximation
The implementation gives up this guarantee to be viable in practice
Actual guarantees: none
99–100. k-means++ vs Iterative Swapping (Kanungo et al., 04) (accuracy charts)
* Values above 1 indicate k-means++ is out-performing Iterative Swapping
101. k-means++ summary Friends don't let friends use vanilla k-means!
k-means++ has provable accuracy guarantee
O(log k)-competitive with OPT
k-means++ is faster on average
k-means++ gets better clusterings (almost) always
102. Main outline Goals:
Understand number of iterations
Harder than you might think!
Worst case = exponential
Smoothed = polynomial
Find better clusterings
k-means++ (modification of k-means)
Provably near-optimal
In practice: faster and better than the competition
103. Special thanks! My advisor:
Rajeev Motwani
My defense committee:
Ashish Goel
Vladlen Koltun
Serge Plotkin
Tim Roughgarden
104. Special thanks! My co-authors:
Bodo Manthey
Rina Panigrahy
Heiko Röglin
Aneesh Sharma
Sergei Vassilvitskii
Ying Xu
105. Special thanks! My other fellow students:
Gagan Aggarwal
Brian Babcock
Bahman Bahmani
Krishnaram Kenthapadi
Aleksandra Korolova
Shubha Nabar
Dilys Thomas
106. Special thanks! And my listeners!