Analyzing and Improving Local Search: k-means and ICP

**1. **Analyzing and Improving Local Search: k-means and ICP David Arthur
Special University Oral Exam
Stanford University

**2. **What is this talk about? Two popular but poorly understood algorithms
Fast in practice
Nobody knows why
Find (highly) sub-optimal solutions
The big questions:
What makes these algorithms fast?
How can the solutions they find be improved?

**3. **What is k-means? Divides point-set into k “tightly-packed” clusters

**4. **What is ICP (Iterative Closest Point)? Finds a subset of A “similar” to B

**5. **Main outline Focus on k-means in talk
Goals:
Understand running time
Harder than you might think!
Worst case = exponential
“Smoothed” = polynomial
Find better clusterings
k-means++ (modification of k-means)
Provably “near”-optimal
In practice: faster and more accurate than the competition

**6. **Main outline What is k-means?
k-means worst-case complexity
k-means smoothed complexity
k-means++

**7. **Main outline What is k-means?
What exactly is being solved?
Why this algorithm?
How does it work?
k-means worst-case complexity
k-means smoothed complexity
k-means++

**9. **The k-means problem Input:
An integer k
A set X of n points in R^d
Task:
Partition the points into k clusters C1, C2, …, Ck
Also choose “centers” c1, c2, …, ck for the clusters
Minimize the objective function:
f = Σ_{x ∈ X} ||x − c(x)||²
where c(x) is the center of the cluster containing x
Similar to variance
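
The objective is easy to compute directly; a minimal NumPy sketch (function and array names are illustrative, not from the talk):

```python
import numpy as np

def kmeans_cost(X, centers, labels):
    """k-means potential f: sum of squared distances from each point
    to its assigned cluster center.

    X: (n, d) array of points; centers: (k, d); labels: (n,) index of
    each point's cluster.
    """
    return float(np.sum((X - centers[labels]) ** 2))
```

With a single center at the mean of its cluster, this is n times the (biased) per-cluster variance, which is the sense in which f is “similar to variance”.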

**10. **The k-means problem Problem is NP-hard
Even when k = 2 (Drineas et al., 04)
Even when d = 2 (Mahajan et al., 09)
Some 1+ε approximation algorithms are known:
Example running times:
O(n + k^(k+2) ε^(−(2d+1)k) log^(k+1)(n) log^(k+1)(1/ε))
(Har-Peled and Mazumdar, 04)
O(2^((k/ε)^O(1)) d n)
(Kumar et al., 04)
All exponential (or worse) in k

**11. **The k-means problem An example real-world data set:
From the UC-Irvine Machine Learning Repository
Looking to detect malicious network connections:
n=494,021
k=100
d=38
1+e approximation algorithms too slow for this!

**12. **k-means method The fast way:
k-means method (Lloyd 82, MacQueen 67)
“By far the most popular clustering algorithm used in scientific and industrial applications” (Berkhin, 02)

**13. **k-means method Start with k arbitrary centers {ci}
(In practice: chosen at random from the data points)
Assign each point to the cluster Ci of its closest center ci
Set each ci to be the center of mass of Ci
Repeat the last two steps until stable
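
The loop above can be sketched in a few lines of NumPy (an illustrative implementation, not the talk's code; the `max_iter` safety cap is my addition):

```python
import numpy as np

def lloyd(X, k, rng=None, max_iter=100):
    """Plain k-means (Lloyd's method): alternate assigning points to
    their nearest center and moving each center to its cluster's
    center of mass, until nothing changes."""
    rng = np.random.default_rng(rng)
    # In practice: initial centers chosen at random from the data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # stable: no center moved, so the clustering repeats forever
        centers = new_centers
    return centers, labels
```

Each iteration can only decrease the potential f, which is why the method always terminates.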

**32. **What is known already Number of iterations:
Finite!
Each iteration decreases f
In practice:
Sub-linear (Duda, 00)
Worst-case:
Lower bound: Ω(n) (Har-Peled and Sadri, 04)
Upper bound: min(O(n^(3kd)), O(k^n)) (Inaba et al., 00)
Very large gap!

**33. **What is known already Accuracy:
Only finds a local optimum
Local optimum can be arbitrarily bad
i.e., f / fOPT is unbounded
Even with high probability in 1 dimension
Even in “natural” examples: well-separated Gaussians

**34. **Main outline What is k-means?
k-means worst-case complexity*
Number of iterations can be:
Super-polynomial!
k-means smoothed complexity
k-means++
* “How slow is the k-means method?” (Arthur and Vassilvitskii, 06)

**35. **Worst-case overview Recursive construction:
Start with input X (goes from A to B in T iterations)
Then modify X:
Add 1 dimension, O(k) points, O(1) clusters
Old part of input still goes from A to B in T iterations
New part: Resets everything once

**36. **Worst-case overview Recursive construction:
Repeat m times:
O(m²) points
O(m) clusters
2^m iterations
Lower bound follows: with n = O(m²) points, 2^Ω(√n) iterations

**37. **Recursive construction (Overview)

**38. **Recursive construction (Trace: t=0...T)

**39. **Recursive construction (Trace: t=T+1 through t=T+4, alternating between assigning points to clusters and recomputing centers)

**63. **Worst-case complexity summary k-means can require 2^Ω(√n) iterations
For random centers: even with high probability
Even when d=2 (Vattani, 09)
(d=1 is open)
ICP:
Can require Ω((n/d)^d) iterations
Similar (but easier) argument

**64. **Main outline What is k-means?
k-means worst-case complexity
k-means smoothed complexity*
Take an arbitrary input, but randomly perturb it
Expected number of iterations is polynomial
Works for any k, d
k-means++
* “k-means has smoothed polynomial complexity” (Arthur, Manthey, and Röglin, 09)

**65. **The problem with worst-case complexity What is the problem?
k-means has bad worst-case complexity
But it is not actually slow in practice
We need a different model to understand the real world
A simple candidate explanation:
Average-case analysis?
But real-world data is not random

**66. **The problem with worst-case complexity A better explanation:
Smoothed analysis (Spielman and Teng, 01)
Between average case and worst case
Perturb each point by a normal distribution with variance σ²
Show the expected running time is polynomial in n and D/σ
D = diameter of the point-set
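
The perturbation model is simple to state in code; a sketch of the smoothed-analysis setup (function name is mine, not from the talk):

```python
import numpy as np

def smoothed_instance(X, sigma, rng=None):
    """Smoothed-analysis model: an adversary picks the point-set X,
    then each coordinate of each point is independently perturbed by
    Gaussian noise with standard deviation sigma."""
    rng = np.random.default_rng(rng)
    return X + rng.normal(scale=sigma, size=X.shape)
```

The claim is then about the expected number of k-means iterations over this random perturbation, for the worst adversarial X.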

**69. **Proof overview Recall the potential function:
f = Σ_{x ∈ X} ||x − c(x)||²
X is the set of all data points
c(x) is the corresponding cluster center
Bound f:
f ≤ nD² initially
Will prove f is very likely to drop by at least ε² each iteration
This gives: # of iterations is at most n(D/ε)²

**70. **The easy approach Do a union bound over all possible k-means steps
What defines a step?
The original clustering A (≤ k^n choices)
Actually ≤ n^(3kd) choices (Inaba et al., 00)
The resulting clustering B
Total number of possible steps ≤ n^(6kd)
Probability a fixed step can be bad:
Bounded by the probability that A and B have near-identical f
Probability ≤ (ε/σ)^d

**71. **The easy approach The argument:
P[k-means takes more than n(D/ε)² iterations]
≤ P[there exists a possible bad step]
≤ (# of possible steps) · P[a fixed step is bad]
≤ n^(6kd) · (ε/σ)^d
= small... if ε ≤ σ · (1/n)^O(k)
Resulting bound: n^O(k) iterations
Not polynomial!
(Arthur and Vassilvitskii, 06), (Manthey and Röglin, 09)

**72. **How can this be improved? Union bound is wasteful!
These two k-means steps can be analyzed together:

**76. **How can this be improved? A transition blueprint:
Which points switched clusters
Approximate positions for the relevant centers
Bonus: most approximate centers are determined by the above!
Non-obvious facts:
m = number of points switching clusters
# of transition blueprints ≤ (nk²)^m · (D/ε)^O(m)
P[blueprint is bad] ≤ (ε/σ)^m
(for most blueprints)

**77. **A good approach The new argument:
P[k-means takes more than n(D/ε)² iterations]
≤ P[there exists a possible bad blueprint]
≤ (# of possible blueprints) · P[blueprint is bad]
≤ (nk²)^m · (D/ε)^O(m) · (ε/σ)^m
= small... if ε ≤ σ · (σ/(nD))^O(1)
Resulting bound: polynomially many iterations!

**78. **Smoothed complexity summary Smoothed complexity is polynomial
Still have work to do: O(n^26)
Getting tight exponents in smoothed analysis is hard
The original theorem for the Simplex method: O(n^96)!
ICP:
Also polynomial smoothed complexity
Much easier argument!

**79. **Main outline What is k-means?
k-means worst-case complexity
k-means smoothed complexity
k-means++*
What’s wrong with k-means?
What’s k-means++?
O(log k)-competitive with OPT
Experimental results
* “k-means++: The advantages of careful seeding” (Arthur and Vassilvitskii, 07)

**80. **What’s wrong with k-means? Recall:
Only finds a local optimum
Local optimum can be arbitrarily bad
i.e., f / fOPT is unbounded
Even with high probability in 1 dimension
Even in “natural” examples: well-separated Gaussians

**81. **What’s wrong with k-means? k-means locally optimizes a clustering
But can miss the big picture!

**84. **The solution Easy way to fix this mistake:
Make centers far apart

**85. **k-means++ The “right” way of choosing initial centers
Choose the first center uniformly at random
Repeat until there are k centers:
Add a new center: choose x0 with probability D(x0)² / Σ_{x ∈ X} D(x)²
D(x) = distance between x and the closest center chosen so far
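
The D² seeding rule can be sketched as follows (illustrative NumPy, not the talk's code; the naive O(nk) distance recomputation per draw is for clarity):

```python
import numpy as np

def kmeanspp_seed(X, k, rng=None):
    """k-means++ seeding: first center uniform at random; each later
    center is a data point x chosen with probability proportional to
    D(x)^2, where D(x) is x's distance to the closest center so far."""
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # D(x)^2 for every point: squared distance to nearest chosen center.
        C = np.array(centers)
        d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Sample a data point with probability proportional to D(x)^2.
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

Because far-away points get large D(x)², the sampled centers tend to spread across well-separated clusters, which is exactly the fix the previous slide asks for.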

**87. **k-means++ Example:

**92. **Theoretical guarantee Claim:
E[f] ≤ O(log k) · fOPT
The guarantee holds as soon as the initial centers are picked
Subsequent k-means steps can only decrease f, and help in practice

**93. **Proof idea Let C1, C2, ..., Ck be the OPT clusters
Points in Ci contribute:
fOPT(Ci) to the OPT potential
fkm++(Ci) to the k-means++ potential
Lemma: if we pick a center from Ci:
E[fkm++(Ci)] ≤ 8 fOPT(Ci)
Proof: linear algebra magic
True for any reasonable probability distribution

**94. **Proof idea The real danger is wasting a center on an already covered Ci
Probability of choosing a covered cluster:
fcurrent(covered clusters) / fcurrent
≤ 8 fOPT / fcurrent
Cost of choosing one:
If t uncovered clusters are left:
a 1/t fraction of fcurrent is now unfixable
Cost ≈ fcurrent / t
Expected cost:
(fOPT / fcurrent) · (fcurrent / t) = fOPT / t

**95. **Proof idea Cost over all k steps:
fOPT · (1/1 + 1/2 + ... + 1/k) = fOPT · O(log k)
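
The sum over the k steps is the harmonic number H_k, which grows like ln k; a quick illustrative check:

```python
import math

def harmonic(k):
    """H_k = 1 + 1/2 + ... + 1/k, the factor multiplying fOPT
    in the bound above."""
    return sum(1.0 / t for t in range(1, k + 1))

# Standard bounds: ln(k) < H_k <= 1 + ln(k) for k >= 1,
# so the total expected cost is fOPT * O(log k).
```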

**96. **k-means++ accuracy improvement

**97. **k-means++ vs k-means
* Values above 1 indicate k-means++ is out-performing k-means

**98. **Other algorithms? Theory community has proposed other reasonable algorithms too
Iterative swapping best in practice (Kanungo et al., 04)
Theoretically an O(1)-approximation
The practical implementation gives up this guarantee in order to be viable
Actual guarantees for the practical version: none

**99. **k-means++ accuracy improvement vs Iterative Swapping (Kanungo et al., 04)

**100. **k-means++ vs Iterative Swapping
* Values above 1 indicate k-means++ is out-performing Iterative Swapping

**101. **k-means++ summary Friends don’t let friends use vanilla k-means!
k-means++ has provable accuracy guarantee
O(log k)-competitive with OPT
k-means++ is faster on average
k-means++ gets better clusterings (almost) always

**102. **Main outline Goals:
Understand number of iterations
Harder than you might think!
Worst case = exponential
Smoothed = polynomial
Find better clusterings
k-means++ (modification of k-means)
Provably “near”-optimal
In practice: faster and better than the competition

**103. **Special thanks! My advisor:
Rajeev Motwani
My defense committee:
Ashish Goel
Vladlen Koltun
Serge Plotkin
Tim Roughgarden

**104. **Special thanks! My co-authors:
Bodo Manthey
Rina Panigrahy
Heiko Röglin
Aneesh Sharma
Sergei Vassilvitskii
Ying Xu

**105. **Special thanks! My other fellow students:
Gagan Aggarwal
Brian Babcock
Bahman Bahmani
Krishnaram Kenthapadi
Aleksandra Korolova
Shubha Nabar
Dilys Thomas

**106. **Special thanks! And my listeners!