
k-Means and DBSCAN




  1. k-Means and DBSCAN. Gyozo Gidofalvi, Uppsala Database Laboratory

  2. Announcements
     • Updated material for assignment 2 on the lab course home page.
     • Posted sign-up sheets for labs and examinations for assignment 2 outside P1321.
     • Posted office hours.

  3. k-Means
     • Input
       • M (set of points)
       • k (number of clusters)
     • Output
       • µ1, …, µk (cluster centroids)
     • k-Means partitions the points in M into k clusters by minimizing the squared error function $\sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$ over the clusters Si, i = 1, …, k, where µi is the centroid of all xj ∈ Si.

  4. k-Means algorithm
       select (m1 … mK) randomly from M        % initial centroids
       do
         (µ1 … µK) = (m1 … mK)
         all clusters Ci = {}
         for each point p in M
           % compute cluster membership of p
           [i] = argminj(dist(µj, p))
           % assign p to the corresponding cluster:
           Ci = Ci ∪ {p}
         end
         for each cluster Ci                   % recompute the centroids
           mi = avg(p in Ci)
       while exists mi ≠ µi                    % convergence criterion
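
The loop on this slide can be sketched in Python as follows. This is a minimal illustration rather than the course implementation: the names `kmeans`, `dist`, and `mean`, the `seed` parameter, the `max_iter` safety cap, and the rule that an empty cluster keeps its previous centroid are all assumptions that do not appear on the slide.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(M, k, seed=0, max_iter=100):
    """Return (centroids, clusters) for a list M of coordinate tuples."""
    rng = random.Random(seed)
    m = rng.sample(M, k)                  # initial centroids, drawn from M
    clusters = []
    for _ in range(max_iter):             # safety cap (assumption, not on the slide)
        mu = m
        clusters = [[] for _ in range(k)]
        for p in M:
            # cluster membership of p: index of the closest centroid
            i = min(range(k), key=lambda j: dist(mu[j], p))
            clusters[i].append(p)
        # recompute the centroids; an empty cluster keeps its old centroid
        m = [mean(c) if c else mu[i] for i, c in enumerate(clusters)]
        if m == mu:                       # convergence: no centroid moved
            break
    return m, clusters
```

Calling `kmeans(points, 2)` on a list of 2-D tuples returns the final centroids together with the points assigned to each of them.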

  5. k-Means on three clusters

  6. I’m feeling unlucky: bad initial points

  7. k-Means in practice
     • How to choose initial centroids
       • select randomly among the data points
       • generate completely randomly
     • How to choose k
       • study the data
       • run k-Means for different k
       • measure squared error for each k
     • Run k-Means many times!
       • Get many choices of initial points
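
One way to act on this advice is to restart k-Means several times for each candidate k and keep the run with the lowest squared error. The sketch below assumes initial centroids are selected among the data points; the helper names `lloyd`, `best_of`, and `sse_by_k`, as well as the `restarts` and `seed` parameters, are hypothetical and only serve the illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(M, k, rng, max_iter=100):
    """One k-Means run from random data points; returns (centroids, sse)."""
    centroids = rng.sample(M, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in M:
            j = min(range(k), key=lambda c: sq_dist(centroids[c], p))
            clusters[j].append(p)
        # an empty cluster keeps its previous centroid (assumption)
        new = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    sse = sum(min(sq_dist(c, p) for c in centroids) for p in M)
    return centroids, sse


def best_of(M, k, restarts=10, seed=0):
    """Run k-Means `restarts` times and keep the run with the lowest error."""
    rng = random.Random(seed)
    return min((lloyd(M, k, rng) for _ in range(restarts)), key=lambda r: r[1])


def sse_by_k(M, ks, restarts=10, seed=0):
    """Squared error of the best run for each candidate k."""
    return {k: best_of(M, k, restarts, seed)[1] for k in ks}
```

Plotting or simply inspecting the values returned by `sse_by_k(points, range(1, 8))` shows where adding more clusters stops reducing the error noticeably, which is one informal way to pick k.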

  8. k-Means iteration step in AmosQL
     • Calculate point-to-centroid distances: calp2c_distance(…)
         select p, c, d
         from Vector of Number p, Vector of Number c, Number d
         where p in bag({iota(1,10)})
           and c in bag({iota(1,10)})
           and d = euclid(p,c);
     • Assign each point to the closest centroid: calc_cluster_assignment(…)
         groupby((p2c_distances1(…)), #'argminv');
     • Recalculate centroids: calc_clust_means(…)
         groupby(calc_cluster_assignment1(…), #'col_means');

  9. Transitive closure
     • tclose is a second-order function to explore graphs where the edges are expressed by a transition function fno:
         tclose(Function fno, Object o) -> Bag of Object
     • fno(o) produces the children of o
     • tclose applies the transition function fno(o), then fno(fno(o)), then fno(fno(fno(o))), and so on, until fno returns no new results
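
The behaviour described on this slide can be imitated in Python with a breadth-first traversal. This is only a sketch of the idea, not the AmosQL implementation: in particular, it returns only objects produced by `fno`, so the start object appears in the result only if it is reachable through a cycle, and whether the real tclose includes the start object is not stated on the slide.

```python
def tclose(fno, o):
    """Return every object reachable from o by repeatedly applying fno.

    fno maps one object to an iterable of its children. Iteration stops
    as soon as a round of fno calls yields no object that is new.
    """
    seen = set()
    frontier = [o]
    while frontier:
        next_frontier = []
        for node in frontier:
            for child in fno(node):
                if child not in seen:        # only new results are explored
                    seen.add(child)
                    next_frontier.append(child)
        frontier = next_frontier
    return seen


# Usage: children given by an adjacency dictionary (hypothetical example graph)
graph = {1: [2, 3], 2: [4], 3: [4], 4: []}
reachable = tclose(lambda n: graph.get(n, []), 1)
```

Because visited objects are remembered, the traversal also terminates on cyclic graphs and on transition functions that eventually reach a fixed point, such as repeated integer halving.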

  10. Iterate until convergence with tclose in AmosQL
        create function bagidiv2(Bag of Number b) -> Bag of Bag of Number
          as (select floor(n/2) from Number n where n in b);
        create function vecchild_idiv2(Vector of Number vb) -> Bag of Vector of Number
          as sort(bagidiv2(in(vb)));
        create function vecconverge_tclose(Bag of Number ib) -> Bag of Vector of Number
          /* tclose function iterating the vecchild_idiv2 function until convergence */
          as select ov from Vector of Number ov
             where ov in tclose(#'vecchild_idiv2', sort(ib));

  11. What about this?! Non-spherical clusters; noise

  12. k-Means pros and cons

  13. Questions
      • Euclidean distance results in spherical clusters
        • What cluster shape does the Manhattan distance give?
        • Think of other distance measures too. What cluster shapes will those yield?
      • Assuming that the k-Means algorithm converges in I iterations, with N points and X features for each point
        • give an approximation of the complexity of the algorithm expressed in K, I, N, and X.
      • Can the k-Means algorithm be parallelized?
        • How?

  14. DBSCAN
      • Density-Based Spatial Clustering of Applications with Noise
      • Basic idea:
        • If an object p is density connected to q, then p and q belong to the same cluster
        • If an object is not density connected to any other object, it is considered noise

  15. Definitions
      • ε-neighborhood: the ε-neighborhood of an object p is the set of objects within ε-distance of p
      • core object: an object q is a core object iff there are at least MinPts objects in q’s ε-neighborhood
      • directly density reachable (ddr): an object p is ddr from q iff q is a core object and p is inside the ε-neighborhood of q
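
The three definitions translate almost directly into code. The sketch below works on point indices and uses Euclidean distance; the function names, the use of indices, and the choice to count an object as a member of its own ε-neighborhood are assumptions made for the illustration.

```python
import math


def eps_neighborhood(points, p, eps):
    """Indices of all objects within eps of points[p] (p itself included)."""
    return [i for i, x in enumerate(points)
            if math.dist(points[p], x) <= eps]


def is_core(points, q, eps, min_pts):
    """q is a core object iff its eps-neighborhood holds at least min_pts objects."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts


def ddr(points, q, eps, min_pts):
    """Indices of the objects directly density reachable from q.

    The result is empty unless q is a core object.
    """
    nbrs = eps_neighborhood(points, q, eps)
    return nbrs if len(nbrs) >= min_pts else []
```

Note that ddr is not symmetric: a border object can be ddr from a core object without the reverse being true, because the border object has too few neighbors to be a core object itself.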

  16. Reachability and Connectivity
      • density reachable (dr): an object p is dr from q iff there exists a chain of objects q1, …, qn such that
        - q1 is ddr from q,
        - q2 is ddr from q1,
        - q3 is ddr from …,
        - and p is ddr from qn
      • density connected (dc): p is dc to r iff there exists an object q such that
        - p is dr from q
        - and r is dr from q

  17. Recall…
      • Basic idea:
        • If an object p is density connected to q, then p and q belong to the same cluster
        • If an object is not density connected to any other object, it is considered noise

  18. DBSCAN
        i = 1
        do
          take a point p from M
          find the set of points P which are density connected to p    % HOW?
          if P = {}
            M = M \ {p}
          else
            Ci = P
            i = i + 1
            M = M \ P
          end
        while M ≠ {}
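
A runnable Python sketch of this procedure is given below. It follows the common label-based formulation, which grows each cluster outward from a core object, instead of literally removing sets from M as on the slide. The function name `dbscan`, the `NOISE` marker, and the returned list of labels are assumptions made for the illustration.

```python
import math

NOISE = -1  # label for objects that end up in no cluster


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""

    def neighbors(i):
        # eps-neighborhood of point i (i itself included)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster_id = 0
    for p in range(len(points)):
        if labels[p] is not None:
            continue                          # already handled
        nbrs = neighbors(p)
        if len(nbrs) < min_pts:
            labels[p] = NOISE                 # may become a border point later
            continue
        labels[p] = cluster_id                # p is a core object: new cluster
        frontier = list(nbrs)
        while frontier:
            q = frontier.pop()
            if labels[q] == NOISE:
                labels[q] = cluster_id        # border point reached from a core object
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            q_nbrs = neighbors(q)
            if len(q_nbrs) >= min_pts:        # only core objects extend the chain
                frontier.extend(q_nbrs)
        cluster_id += 1
    return labels
```

The neighborhood search here is a linear scan, so the whole run is quadratic in the number of points; a spatial index would be the usual remedy for larger data sets.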

  19. Finding density connected components
      • If r is dc to p, there exists q such that both p and r are dr from q, i.e., there exists a ddr-chain from q to both r and p, and q is a core object.
      • Recall: tclose is a second-order function to explore graphs where the edges are expressed by a transition function fno.
      • fno = ddr

  20. Finding dc components in AmosQL
      • Assuming q is a core object and that a ddr function with the following signature is defined:
          ddr(Integer q) -> Bag of Integer p
      • Then:
          create function dc(Integer q) -> Bag of Integer
            as select p from Integer p
               where p in tclose(#'ddr', q);

  21. DBSCAN pros and cons

  22. Questions
      • Why is the dc criterion useful to define a cluster, instead of dr or ddr?
      • For which points is density reachability symmetric, i.e., for which p, q do both dr(p, q) and dr(q, p) hold?
      • Express, using only core objects and ddr, which objects will belong to a cluster
