  1. Clustering An overview of clustering algorithms Dènis de Keijzer GIA 2004

  2. Overview • Algorithms • GRAVIclust • AUTOCLUST • AUTOCLUST+ • 3D Boundary-based Clustering • SNN

  3. Gravity based spatial clustering • GRAVIclust • Initialisation Phase • calculate the initial cluster centres • Optimisation Phase • improve the position of the cluster centres so as to achieve a solution which minimises the distance function

  4. GRAVIclust: Initialisation Phase • Input: • set of points P

  5. GRAVIclust: Initialisation Phase • Input: • set of points P • matrix of distances between all pairs of points • assumption: actual access-path distances • exist in GIS maps • e.g. http://www.transinfo.qld.gov.au • very versatile • footpath • road map • rail map

  6. GRAVIclust: Initialisation Phase • Input: • set of points P • matrix of distances between all pairs of points • # of required clusters k

  7. GRAVIclust: Initialisation Phase • Step 1: • calculate first initial centre • the point with the largest number of points within radius r • remove first initial centre & all points within radius r from further consideration • Step 2: • repeat Step 1 until k initial centres have been chosen • Step 3: • create initial clusters by assigning all points to the closest cluster centre
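
A minimal Python sketch of this initialisation phase, assuming a precomputed n×n distance matrix dist (any path-based metric, per slide 5) and a given radius r (derived on the next slide); the function name and structure are illustrative, not from the paper:

```python
import numpy as np

def graviclust_init(dist, k, r):
    """Steps 1-2: repeatedly pick the point with the most points within
    radius r as a centre, removing it and its covered points each time."""
    n = dist.shape[0]
    active = np.ones(n, dtype=bool)               # points still under consideration
    centres = []
    for _ in range(k):
        # for each active point, count the active points within radius r
        counts = np.where(active, ((dist <= r) & active).sum(axis=1), -1)
        c = int(np.argmax(counts))                # densest remaining neighbourhood
        centres.append(c)
        active &= dist[c] > r                     # drop the centre and its ball
    labels = np.argmin(dist[:, centres], axis=1)  # Step 3: assign to closest centre
    return centres, labels
```

For the dynamic-radius variant of the next slide, r would be recalculated inside the loop from the region still covered by active points.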

  8. GRAVIclust: radius calculation • Radius r • calculated based on the area of the region considered for clustering • static radius • based on the assumption that all clusters are of the same size • dynamic radius • recalculated after each initial cluster centre is chosen

  9. GRAVIclust: Static vs. Dynamic • Static • reduced computation • # of points within radius r has to be calculated only once • not suitable for problems where the points are separated by large empty areas • Dynamic • increases computation time • ensures the radius is adjusted as points are removed • the two differ only when the distribution is non-uniform

  10. GRAVIclust: Optimisation Phase • Step 1: • for each cluster, calculate a new centre • based on the point closest to the cluster's centre of gravity • Step 2: • re-assign points to the new cluster centres • Step 3: • recalculate the distance function • never greater than the previous value • Step 4: • repeat Steps 1 to 3 until the value of the distance function equals the previous value
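
A sketch of the optimisation loop; since only the distance matrix is assumed available here, the point closest to the centre of gravity is approximated by the cluster medoid (the member with the smallest summed distance to the other members), which is a substitution, not the slides' exact rule:

```python
import numpy as np

def graviclust_optimise(dist, centres, labels):
    """Repeat: recentre each cluster, reassign points, recompute the distance
    function; stop once it stops decreasing (it can never increase)."""
    prev = np.inf
    while True:
        for j in range(len(centres)):
            members = np.where(labels == j)[0]
            if members.size == 0:                          # keep an empty cluster's centre
                continue
            within = dist[np.ix_(members, members)].sum(axis=1)
            centres[j] = int(members[np.argmin(within)])   # Step 1: new centre (medoid)
        labels = np.argmin(dist[:, centres], axis=1)       # Step 2: reassign points
        total = dist[np.arange(len(labels)),               # Step 3: distance function
                     [centres[j] for j in labels]].sum()
        if total >= prev:                                  # Step 4: value equals previous
            return centres, labels
        prev = total
```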

  11. GRAVIclust • Deterministic • Can handle obstacles • Monotonic convergence of the distance function to a stable point

  12. AUTOCLUST • Definitions

  13. AUTOCLUST • Definitions II

  14. AUTOCLUST • Phase 1: • finding boundaries • Phase 2: • restoring and re-attaching • Phase 3: • detecting second-order inconsistency

  15. AUTOCLUST: Phase 1 • Finding boundaries • Calculate • Delaunay Diagram • for each point pi • ShortEdges(pi) • LongEdges(pi) • OtherEdges(pi) • Remove • ShortEdges(pi) and LongEdges(pi)
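
A Phase 1 sketch in Python; the classification thresholds (shorter than LocalMean(pi) − MeanStDev(P) for ShortEdges, longer than LocalMean(pi) + MeanStDev(P) for LongEdges) follow the standard AUTOCLUST definitions that the "Definitions" slides presumably carried, and scipy is assumed for the Delaunay diagram:

```python
import numpy as np
from collections import defaultdict
from scipy.spatial import Delaunay

def autoclust_phase1(points):
    """Classify each point's Delaunay edges and keep only its OtherEdges."""
    tri = Delaunay(points)
    edges = {tuple(sorted((s[a], s[b])))
             for s in tri.simplices for a in range(3) for b in range(a + 1, 3)}
    length = {e: float(np.linalg.norm(points[e[0]] - points[e[1]])) for e in edges}
    incident = defaultdict(list)
    for e in edges:
        incident[e[0]].append(e)
        incident[e[1]].append(e)
    # LocalMean(pi): mean length of pi's incident edges;
    # MeanStDev(P): st. dev. of incident edge lengths, averaged over all points
    local_mean = {p: np.mean([length[e] for e in es]) for p, es in incident.items()}
    mean_st_dev = np.mean([np.std([length[e] for e in es]) for es in incident.values()])
    keep = set(edges)
    for p, es in incident.items():
        for e in es:
            if abs(length[e] - local_mean[p]) > mean_st_dev:
                keep.discard(e)            # e is a ShortEdge or LongEdge for p
    return keep, length, mean_st_dev
```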

  16. AUTOCLUST: Phase 2 • Restoring and re-attaching • for each point pi where ShortEdges(pi) ≠ ∅ • Determine a candidate connected component C for pi • If there are 2 edges ej = (pi,pj) and ek = (pi,pk) in ShortEdges(pi) with CC[pj] ≠ CC[pk], then • Compute, for each edge e = (pi,pj) ∈ ShortEdges(pi), the size ||CC[pj]|| and let M = max over e = (pi,pj) ∈ ShortEdges(pi) of ||CC[pj]|| • Let C be the class label of the largest connected component (if there are two different connected components with cardinality M, we let C be the one with the shortest edge to pi)

  17. AUTOCLUST: Phase 2 • Restoring and re-attaching • for each point pi where ShortEdges(pi) ≠ ∅ • Determine a candidate connected component C for pi • If … • Otherwise, let C be the label of the connected component all edges e ∈ ShortEdges(pi) connect pi to

  18. AUTOCLUST: Phase 2 • Restoring and re-attaching • for each point pi where ShortEdges(pi) ≠ ∅ • Determine a candidate connected component C for pi • If the edges in OtherEdges(pi) connect to a connected component different from C, remove them. Note that • all edges in OtherEdges(pi) are removed, and • only in this case will pi swap connected components • Add all edges e ∈ ShortEdges(pi) that connect to C
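
A simplified Phase 2 sketch, with networkx assumed for connected components; short_edges maps each pi with ShortEdges(pi) ≠ ∅ to its short edges as (pi, pj) pairs, and kept_edges and length come from the Phase 1 sketch above:

```python
import networkx as nx

def autoclust_phase2(n, kept_edges, short_edges, length):
    """Re-attach each pi with short edges to its candidate component C: the
    largest component reachable via its short edges (ties broken by the
    shortest connecting edge); pi's edges into other components are removed
    and its short edges into C are restored."""
    G = nx.Graph(list(kept_edges))
    G.add_nodes_from(range(n))
    for pi, edges in short_edges.items():
        comp = {v: i for i, cc in enumerate(nx.connected_components(G)) for v in cc}
        csize = {}
        for c in comp.values():
            csize[c] = csize.get(c, 0) + 1
        # shortest short edge from pi into each candidate component
        cand = {}
        for _, pj in edges:
            L = length[tuple(sorted((pi, pj)))]
            cand[comp[pj]] = min(cand.get(comp[pj], L), L)
        C = max(cand, key=lambda c: (csize[c], -cand[c]))
        for nb in list(G.neighbors(pi)):      # OtherEdges into foreign components go
            if comp[nb] != C:
                G.remove_edge(pi, nb)
        for _, pj in edges:                   # restore short edges into C
            if comp[pj] == C:
                G.add_edge(pi, pj)
    return G
```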

  19. AUTOCLUST: Phase 3 • Detecting second-order inconsistency • compute the LocalMean for 2-neighbourhoods • remove all edges in N2,G(pi) that are long edges
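
A Phase 3 sketch, reading the 2-neighbourhood N2,G(pi) as the edges among points within two hops of pi in the current graph (an assumption about the notation); G is the networkx graph from the Phase 2 sketch:

```python
def autoclust_phase3(G, length, mean_st_dev):
    """Remove edges that are long relative to the local mean of each point's
    2-neighbourhood (local mean + MeanStDev(P) as the cut-off)."""
    to_remove = set()
    for pi in G.nodes:
        hop1 = set(G.neighbors(pi)) | {pi}
        hop2 = hop1 | {w for v in hop1 for w in G.neighbors(v)}
        edges2 = [tuple(sorted(e)) for e in G.edges(hop2)
                  if e[0] in hop2 and e[1] in hop2]
        if not edges2:
            continue
        local_mean = sum(length[e] for e in edges2) / len(edges2)
        to_remove |= {e for e in edges2 if length[e] > local_mean + mean_st_dev}
    G.remove_edges_from(to_remove)
    return G
```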

  20. AUTOCLUST

  21. AUTOCLUST • No user supplied arguments • eliminates expensive human-based exploration time for finding best-fit arguments • Robust to noise, outliers, bridges and type of distribution • Able to detect clusters with arbitrary shapes, different sizes and different densities • Can handle multiple bridges • O(n log n)

  22. AUTOCLUST+ • Construct Delaunay Diagram • Calculate MeanStDev(P) • For all edges e, remove e if it intersects an obstacle • Apply the 3 phases of AUTOCLUST to the planar graph resulting from the previous steps
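
A sketch of the obstacle step, with obstacles given as shapely geometries (an assumption; any segment-intersection test would do):

```python
from shapely.geometry import LineString

def remove_obstructed_edges(edges, points, obstacles):
    """Drop every Delaunay edge whose segment crosses an obstacle."""
    kept = set()
    for i, j in edges:
        seg = LineString([tuple(points[i]), tuple(points[j])])
        if not any(seg.intersects(obs) for obs in obstacles):
            kept.add((i, j))
    return kept
```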

  23. 3D Boundary-based Clustering • Benefits from 3D Clustering • more accurate spatial analysis • distinguish • positive clusters: • clusters in higher dimensions but not in lower dimensions

  24. 3D Boundary-based Clustering • Benefits from 3D Clustering • more accurate spatial analysis • distinguish • positive clusters: • clusters in higher dimensions but not in lower dimensions • negative clusters: • clusters in lower dimensions but not in higher dimensions

  25. 3D Boundary-based Clustering • Based on AUTOCLUST • Uses Delaunay Tetrahedrizations • Definitions: • ej is a potential inter-cluster edge if:

  26. 3D Boundary-based Clustering • Phase I • For all pi ∈ P, classify each edge ej incident to pi into one of three groups • ShortEdges(pi) when the length of ej is less than the range in AI(pi) • LongEdges(pi) when the length of ej is greater than the range in AI(pi) • OtherEdges(pi) when the length of ej is within AI(pi) • For all pi ∈ P, remove all edges in ShortEdges(pi) and LongEdges(pi)
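
The same classification carries over to 3D, since scipy's Delaunay returns tetrahedra for 3D input; here the acceptance interval AI(pi) is assumed to be LocalMean(pi) ± MeanStDev(P), mirroring AUTOCLUST:

```python
import numpy as np
from collections import defaultdict
from scipy.spatial import Delaunay

def classify_edges_3d(points):
    """Phase I sketch: split each point's incident edges of the Delaunay
    tetrahedrization into Short/Other/Long against AI(pi)."""
    tri = Delaunay(points)                        # tetrahedra for 3D points
    edges = {tuple(sorted((s[a], s[b])))
             for s in tri.simplices for a in range(4) for b in range(a + 1, 4)}
    length = {e: float(np.linalg.norm(points[e[0]] - points[e[1]])) for e in edges}
    incident = defaultdict(list)
    for e in edges:
        incident[e[0]].append(e)
        incident[e[1]].append(e)
    mean_sd = np.mean([np.std([length[e] for e in es]) for es in incident.values()])
    groups = {}
    for p, es in incident.items():
        lm = np.mean([length[e] for e in es])
        groups[p] = {
            "short": [e for e in es if length[e] < lm - mean_sd],
            "long":  [e for e in es if length[e] > lm + mean_sd],
            "other": [e for e in es if lm - mean_sd <= length[e] <= lm + mean_sd],
        }
    return groups
```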

  27. 3D Boundary-based Clustering • Phase II • Recuperate ShortEdges(pi) incident to border points using connected component analysis • Phase III • Remove exceptionally long edges in local regions

  28. Shared Nearest Neighbour • Clustering in higher dimensions • Distances or similarities between points become more uniform, making clustering more difficult • Also, similarity between points can be misleading • i.e. a point can be more similar to a point that “actually” belongs to a different cluster • Solution • Shared nearest neighbour approach to similarity

  29. SNN: An alternative definition of similarity • Euclidean distance • the most commonly used distance metric • while useful in low dimensions, it doesn’t work well in high dimensions

  30. SNN: An alternative definition of similarity • Define similarity in terms of their shared nearest neighbours • the similarity of the points is “confirmed” by their common shared nearest neighbours
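
A small sketch of shared-nearest-neighbour similarity, using brute-force k-nearest neighbours (k is the usual SNN parameter; the function name is illustrative):

```python
import numpy as np

def snn_similarity(X, k):
    """Similarity of two points = number of neighbours their
    k-nearest-neighbour lists share."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # a point is not its own neighbour
    knn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbours per point
    sets = [set(row) for row in knn]
    n = len(X)
    S = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = len(sets[i] & sets[j])
    return S
```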

  31. SNN: An alternative definition of density • SNN similarity, with the k-nearest-neighbour approach • if the k-nearest neighbours of a point, with respect to SNN similarity, are close, then we say that there is high density at this point • since it reflects the local configuration of the points in the data space, it is relatively insensitive to variations in density and to the dimensionality of the space

  32. SNN: Algorithm • Compute the similarity matrix • corresponds to a similarity graph with data points for nodes and edges whose weights are the similarities between data points

  33. SNN: Algorithm • Compute the similarity matrix • Sparsify the similarity matrix by keeping only the k most similar neighbours • corresponds to keeping only the k strongest links of the similarity graph

  34. SNN: Algorithm • Compute the similarity matrix • Sparsify the similarity matrix … • Construct the shared nearest neighbour graph from the sparsified similarity matrix

  35. SNN: Algorithm • Compute the similarity matrix • Sparsify the similarity matrix … • Construct the shared … • Find the SNN density of each point

  36. SNN: Algorithm • Compute the similarity matrix • Sparsify the similarity matrix … • Construct the shared … • Find the SNN density of each point • Find the core points

  37. SNN: Algorithm • Compute the similarity matrix • Sparsify the similarity matrix … • Construct the shared … • Find the SNN density of each point • Find the core points • Form clusters from the core points

  38. SNN: Algorithm • Compute the similarity matrix • Sparsify the similarity matrix … • Construct the shared … • Find the SNN density of each point • Find the core points • Form clusters from the core points • Discard all noise points

  39. SNN: Algorithm • Compute the similarity matrix • Sparsify the similarity matrix … • Construct the shared … • Find the SNN density of each point • Find the core points • Form clusters from the core points • Discard all noise points • Assign all non-noise, non-core points to clusters
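
Putting the eight steps together, a compact end-to-end sketch (reusing snn_similarity from the earlier sketch; eps and min_pts are illustrative threshold names, not from the slides):

```python
import numpy as np

def snn_cluster(X, k=7, eps=3, min_pts=4):
    S = snn_similarity(X, k)                     # step 1: similarity matrix
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :k]
    link = np.zeros_like(S, dtype=bool)
    for i, row in enumerate(knn):
        link[i, row] = True
    S = np.where(link & link.T, S, 0)            # steps 2-3: sparsified SNN graph
    density = (S >= eps).sum(axis=1)             # step 4: SNN density
    core = density >= min_pts                    # step 5: core points
    labels = -np.ones(len(X), dtype=int)         # step 6: grow clusters from cores
    cid = 0
    for i in np.where(core)[0]:
        if labels[i] != -1:
            continue
        stack, labels[i] = [i], cid
        while stack:
            v = stack.pop()
            for w in np.where((S[v] >= eps) & core & (labels == -1))[0]:
                labels[w] = cid
                stack.append(w)
        cid += 1
    for i in np.where(~core)[0]:                 # steps 7-8: noise vs. attachment
        j = int(np.argmax(np.where(core, S[i], -1)))
        if core[j] and S[i, j] >= eps:
            labels[i] = labels[j]                # non-noise points join a cluster
    return labels                                # label -1 marks discarded noise
```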

  40. Shared Nearest Neighbour • Finds clusters of varying shapes, sizes, and densities, even in the presence of noise and outliers • Handles data of high dimensionality and varying densities • Automatically detects the # of clusters
