Clustering

An overview of clustering algorithms

Dènis de Keijzer

GIA 2004

- Algorithms
- GRAVIclust
- AUTOCLUST
- AUTOCLUST+
- 3D Boundary-based Clustering
- SNN

- GRAVIclust
- Initialisation Phase
- calculate the initial cluster centres

- Optimisation Phase
- improve the position of the cluster centres so as to achieve a solution which minimizes the distance function

- Initialisation Phase

- Input:
- set of points P
- matrix of distances between all pairs of points
- assumption: distances are actual access-path distances
- these exist in GIS maps
- e.g. http://www.transinfo.qld.gov.au

- very versatile
- footpath
- road map
- rail map

- Input:
- # of required clusters k

- Step 1:
- calculate first initial centre
- the point with the largest number of points within radius r
- remove first initial centre & all points within radius r from further consideration

- Step 2:
- repeat Step 1 until k initial centres have been chosen

- Step 3:
- create initial clusters by assigning all points to the closest cluster centre
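
The three initialisation steps above can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes a static radius `r` supplied by the caller, a full distance matrix `dist`, and ties broken by the lowest point index. The function names `initial_centres` and `assign` are hypothetical.

```python
def initial_centres(dist, k, r):
    """Pick k initial centres from a full distance matrix, using a static radius r."""
    remaining = set(range(len(dist)))
    centres = []
    while len(centres) < k and remaining:
        # Step 1: the remaining point with the most remaining points within radius r
        # (ties go to the lowest index, an assumption made here for determinism)
        best = max(
            sorted(remaining),
            key=lambda p: sum(1 for q in remaining if q != p and dist[p][q] <= r),
        )
        centres.append(best)
        # remove the centre and all points within radius r from further consideration
        remaining -= {q for q in remaining if dist[best][q] <= r}
    # Step 2 is the loop itself: repeat until k centres have been chosen
    return centres


def assign(dist, centres):
    """Step 3: assign every point to its closest centre."""
    return [min(centres, key=lambda c: dist[p][c]) for p in range(len(dist))]


# six points on a line, forming two groups of three
xs = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in xs] for a in xs]
centres = initial_centres(dist, k=2, r=1.5)
labels = assign(dist, centres)
```

Because the distance matrix is an input, the same sketch works with access-path distances taken from a road or rail map instead of straight-line distances.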

- Radius r
- calculated based on the area of the region considered for clustering
- static radius
- based on the assumption that all clusters are of the same size

- dynamic radius
- recalculated after each initial cluster centre is chosen

- Static
- reduced computation
- the # of points within radius r has to be calculated only once
- not suitable for problems where the points are separated by large empty areas

- Dynamic
- increases computation time
- ensures the radius is adjusted as the points are removed

- Static and dynamic radius differ only when the distribution is non-uniform

- Optimisation Phase

- Step 1:
- for each cluster, calculate new centre
- based on the point closest to the cluster's centre of gravity

- Step 2:
- re-assign points to new cluster centres

- Step 3:
- recalculate distance function
- never greater than previous

- Step 4:
- repeat Steps 1 to 3 until the value of the distance function equals the previous value
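
Steps 1 to 4 can be sketched as a simple loop. The slides do not define the distance function, so this sketch assumes it is the sum of distances from each point to its cluster centre. It also assumes the centre of gravity is computed from planar coordinates. The function name `optimise` is hypothetical.

```python
def optimise(points, dist, labels):
    """Repeat Steps 1 to 3 until the distance function stops decreasing.

    points: list of (x, y) coordinates
    dist:   full matrix of pairwise distances
    labels: labels[p] is the index of the centre that point p is assigned to
    Returns the final labels and the final value of the distance function.
    """
    def total(lab):
        # assumed distance function: sum of point-to-centre distances
        return sum(dist[p][c] for p, c in enumerate(lab))

    previous = total(labels)
    while True:
        centres = []
        for c in sorted(set(labels)):
            members = [p for p, lab in enumerate(labels) if lab == c]
            gx = sum(points[p][0] for p in members) / len(members)
            gy = sum(points[p][1] for p in members) / len(members)
            # Step 1: the member closest to the centre of gravity becomes the centre
            centres.append(min(
                members,
                key=lambda p: (points[p][0] - gx) ** 2 + (points[p][1] - gy) ** 2,
            ))
        # Step 2: re-assign every point to its closest new centre
        new_labels = [min(centres, key=lambda c: dist[p][c]) for p in range(len(points))]
        # Step 3: recalculate the distance function
        current = total(new_labels)
        if current >= previous:  # Step 4: no further improvement, so stop
            return labels, previous
        labels, previous = new_labels, current


points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
dist = [[abs(a[0] - b[0]) for b in points] for a in points]
labels, cost = optimise(points, dist, [0, 0, 0, 3, 3, 3])
```

The loop stops as soon as an iteration fails to lower the total, which matches the stopping rule on the slide and also guards against any increase.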

- Deterministic
- Can handle obstacles
- Monotonic convergence of the distance function to a stable point

- AUTOCLUST
- Definitions

- Definitions II

- Phase 1:
- finding boundaries

- Phase 2:
- restoring and re-attaching

- Phase 3:
- detecting second-order inconsistency

- Finding boundaries
- Calculate
- Delaunay Diagram
- for each point pi
- ShortEdges(pi)
- LongEdges(pi)
- OtherEdges(pi)

- Remove
- ShortEdges(pi) and LongEdges(pi)
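
The edge classification in this phase can be sketched as below. The definitions themselves did not survive in the slides, so the thresholds here are an assumption based on the usual AUTOCLUST formulation: an edge at pi is short if its length is below LocalMean(pi) - MeanStDev(P), long if above LocalMean(pi) + MeanStDev(P), and "other" in between. The edge list is taken as input from any Delaunay triangulation routine, and `classify_edges` is a hypothetical name.

```python
import math
from statistics import mean, pstdev


def classify_edges(points, edges):
    """Split the edges incident to each point into short / long / other groups."""
    incident = {i: [] for i in range(len(points))}
    for a, b in edges:
        length = math.dist(points[a], points[b])
        incident[a].append((length, (a, b)))
        incident[b].append((length, (a, b)))
    incident = {i: es for i, es in incident.items() if es}

    # assumed definitions: per-point mean and standard deviation of edge lengths,
    # and MeanStDev(P) as the average of the per-point standard deviations
    local_mean = {i: mean([l for l, _ in es]) for i, es in incident.items()}
    local_sd = {i: pstdev([l for l, _ in es]) for i, es in incident.items()}
    mean_st_dev = mean(local_sd.values())

    groups = {}
    for i, es in incident.items():
        lo, hi = local_mean[i] - mean_st_dev, local_mean[i] + mean_st_dev
        groups[i] = {
            "short": [e for l, e in es if l < lo],
            "long": [e for l, e in es if l > hi],
            "other": [e for l, e in es if lo <= l <= hi],
        }
    return groups


# two tight triangles joined by one long bridge edge (1, 3)
points = [(0, 0), (1, 0), (0, 1), (10, 0), (11, 0), (10, 1)]
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (1, 3)]
groups = classify_edges(points, edges)
```

In this small example the bridge edge ends up in the long group of both of its endpoints, so removing short and long edges separates the two triangles.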

- Restoring and re-attaching
- for each point pi where ShortEdges(pi) ≠ ∅
- Determine a candidate connected component C for pi
- If there are two edges ej = (pi,pj) and ek = (pi,pk) in ShortEdges(pi) with CC[pj] ≠ CC[pk], then
- Compute, for each edge e = (pi,pj) ∈ ShortEdges(pi), the size ||CC[pj]||, and let M = max ||CC[pj]|| over all e = (pi,pj) ∈ ShortEdges(pi)
- Let C be the class label of the largest connected component (if two different connected components have cardinality M, let C be the one with the shortest edge to pi)

- Restoring and re-attaching
- for each point pi where ShortEdges(pi) ≠ ∅
- Determine a candidate connected component C for pi
- If …
- Otherwise, let C be the label of the connected component that all edges e ∈ ShortEdges(pi) connect pi to
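
The choice of the candidate component C can be sketched as a small function. This is an illustration under assumed data structures: `short_edges` holds (neighbour, edge length) pairs for a non-empty ShortEdges(pi), `cc` maps a point to its component label, and `cc_size` maps a label to its number of points. All three names, and `candidate_component` itself, are hypothetical.

```python
def candidate_component(short_edges, cc, cc_size):
    """Return the label of the candidate connected component C for one point."""
    labels = {cc[q] for q, _ in short_edges}
    if len(labels) == 1:
        # "Otherwise" case: all short edges lead to the same component
        return labels.pop()
    # "If" case: the short edges reach different components, take the largest one
    largest = max(cc_size[label] for label in labels)
    candidates = [(length, cc[q]) for q, length in short_edges
                  if cc_size[cc[q]] == largest]
    # tie between equally large components: prefer the shortest edge to the point
    return min(candidates, key=lambda item: item[0])[1]


cc = {1: "A", 2: "A", 3: "B", 4: "B", 5: "C"}
cc_size = {"A": 2, "B": 2, "C": 1}
chosen = candidate_component([(1, 2.0), (3, 1.5), (5, 0.5)], cc, cc_size)
```

Here components A and B are equally large, so the shorter of the two edges decides, even though the edge to the smaller component C is shorter still.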

- Restoring and re-attaching
- for each point pi where ShortEdges(pi) ≠ ∅
- Determine a candidate connected component C for pi
- If the edges in OtherEdges(pi) connect to a connected component different from C, remove them. Note that
- all edges in OtherEdges(pi) are removed, and
- only in this case will pi swap connected components

- Add all edges e ∈ ShortEdges(pi) that connect to C

- Detecting second-order inconsistency
- compute the LocalMean over the 2-neighbourhood of each point pi
- remove all edges in the 2-neighbourhood N_{2,G}(pi) that are long edges

- No user supplied arguments
- eliminates expensive human-based exploration time for finding best-fit arguments

- Robust to noise, outliers, bridges and type of distribution
- Able to detect clusters with arbitrary shapes, different sizes and different densities
- Can handle multiple bridges
- O(n log n)

- AUTOCLUST+
- Construct Delaunay Diagram
- Calculate MeanStDev(P)
- For all edges e, remove e if it intersects some obstacles
- Apply the 3 phases of AUTOCLUST to the planar graph resulting from the previous steps
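
The obstacle step can be sketched with a standard segment-intersection test. This is a minimal illustration that assumes obstacles are given as line segments and that merely touching an obstacle endpoint does not count as intersecting. The function names are hypothetical.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1, or 0 if collinear."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def segments_cross(p, q, a, b):
    """True if segment pq properly crosses segment ab."""
    return (_orient(p, q, a) * _orient(p, q, b) < 0
            and _orient(a, b, p) * _orient(a, b, q) < 0)


def drop_blocked_edges(points, edges, obstacles):
    """Keep only the edges that cross no obstacle segment."""
    return [
        (i, j) for i, j in edges
        if not any(segments_cross(points[i], points[j], a, b) for a, b in obstacles)
    ]


points = [(0, 0), (2, 0), (0, 2)]
edges = [(0, 1), (0, 2), (1, 2)]
wall = [((1, -1), (1, 0.5))]  # a short wall cutting the edge between points 0 and 1
kept = drop_blocked_edges(points, edges, wall)
```

The remaining edges form the planar graph that the three AUTOCLUST phases are then applied to.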

- 3D Boundary-based Clustering
- Benefits from 3D Clustering
- more accurate spatial analysis
- distinguish
- positive clusters:
- clusters in higher dimensions but not in lower dimensions

- negative clusters:
- clusters in lower dimensions but not in higher dimensions

- Based on AUTOCLUST
- Uses Delaunay Tetrahedrizations
- Definitions:
- ej potential inter-cluster edge if:

- Phase I
- For all pi ∈ P, classify each edge ej incident to pi into one of three groups
- ShortEdges(pi) when the length of ej is less than the range in AI(pi)
- LongEdges(pi) when the length of ej is greater than the range in AI(pi)
- OtherEdges(pi) when the length of ej is within AI(pi)

- For all pi ∈ P, remove all edges in ShortEdges(pi) and LongEdges(pi)

- Phase II
- Recuperate ShortEdges(pi) incident to border points using connected component analysis

- Phase III
- Remove exceptionally long edges in local regions

- SNN
- Clustering in higher dimensions
- Distances or similarities between points become more uniform, making clustering more difficult
- Also, similarity between points can be misleading
- i.e. a point can be more similar to a point that “actually” belongs to a different cluster

- Solution
- Shared nearest neighbor approach to similarity

- Euclidean distance
- most common distance metric used
- while useful in low dimensions, it doesn’t work well in high dimensions

- Define similarity in terms of their shared nearest neighbours
- the similarity of the points is “confirmed” by their common shared nearest neighbours

- SNN similarity, with the k-nearest neighbour approach
- if the k-th nearest neighbour of a point, with respect to SNN similarity, is close, then we say that there is a high density at this point
- since it reflects the local configuration of the points in the data space, it is relatively insensitive to variations in density and the dimensionality of the space
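
Shared nearest neighbour similarity can be sketched as follows. The slides do not give the exact definition, so this sketch assumes a common one: two points are similar only if each appears in the other's k-nearest-neighbour list, and the similarity is the number of neighbours the two lists share. The brute-force neighbour search and the function names are illustrative choices.

```python
import math


def knn_lists(points, k):
    """Indices of the k nearest neighbours of each point (Euclidean, brute force)."""
    result = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        result.append(set(others[:k]))
    return result


def snn_similarity(knn, i, j):
    """Number of shared neighbours, counted only if i and j list each other."""
    if i in knn[j] and j in knn[i]:
        return len(knn[i] & knn[j])
    return 0


points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
knn = knn_lists(points, k=2)
close_pair = snn_similarity(knn, 0, 1)  # both in the left group
far_pair = snn_similarity(knn, 2, 3)    # one point from each group
```

Points from different groups never appear in each other's lists here, so their similarity is zero regardless of how the raw distances compare.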

- Compute the similarity matrix
- corresponds to a similarity graph with data points for nodes and edges whose weights are the similarities between data points

- Sparsify the similarity matrix by keeping only the k most similar neighbours
- corresponds to keeping only the k strongest links of the similarity graph

- Construct the shared nearest neighbour graph from the sparsified similarity matrix
- Find the SNN density of each point
- Find the core points
- Form clusters from the core points
- Discard all noise points
- Assign all non-noise, non-core points to clusters
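
The last steps of the algorithm can be sketched on top of a precomputed SNN similarity matrix. The thresholds are not named in the slides, so `eps` and `min_pts` are assumptions: SNN density is taken as the number of points with similarity of at least `eps`, core points have density of at least `min_pts`, and a non-core point is noise if no core point is similar enough to it. The function name `snn_cluster` is hypothetical.

```python
def snn_cluster(sim, eps, min_pts):
    """Cluster from a symmetric SNN similarity matrix; returns labels, -1 for noise."""
    n = len(sim)
    # SNN density: number of other points with similarity of at least eps
    density = [sum(1 for j in range(n) if j != i and sim[i][j] >= eps)
               for i in range(n)]
    # core points: density of at least min_pts
    core = [i for i in range(n) if density[i] >= min_pts]

    # form clusters: core points linked by similarity >= eps share a label
    labels = [-1] * n
    next_label = 0
    for start in core:
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in core:
                if labels[j] == -1 and sim[i][j] >= eps:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1

    # non-core points join their most similar core point, or stay noise (-1)
    for i in range(n):
        if labels[i] == -1 and core:
            best = max(core, key=lambda c: sim[i][c])
            if sim[i][best] >= eps:
                labels[i] = labels[best]
    return labels


sim = [
    [0, 3, 3, 0, 0],
    [3, 0, 3, 2, 0],
    [3, 3, 0, 0, 0],
    [0, 2, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
labels = snn_cluster(sim, eps=2, min_pts=2)
```

The number of clusters is never passed in: it falls out of how the core points connect, which is the sense in which the method detects it automatically.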

- Finds clusters of varying shapes, sizes, and densities, even in the presence of noise and outliers
- Handles data of high dimensionality and varying densities
- Automatically detects the # of clusters