Clustering

An overview of clustering algorithms

Dènis de Keijzer

GIA 2004

Overview
• Algorithms
• GRAVIclust
• AUTOCLUST
• AUTOCLUST+
• 3D Boundary-based Clustering
• SNN
Gravity based spatial clustering
• GRAVIclust
• Initialisation Phase
• calculate the initial centre clusters
• Optimisation Phase
• improve the position of the cluster centres so as to achieve a solution which minimizes the distance function
GRAVIclust: Initialisation Phase
• Input:
• set of points P
• matrix of distances between all pairs of points
• assumption: actual access path distances are available
• these exist in GIS maps
• e.g. http://www.transinfo.qld.gov.au
• very versatile
• footpath
• rail map
GRAVIclust: Initialisation Phase
• Input:
• set of points P
• matrix of distances between all pairs of points
• # of required clusters k
GRAVIclust: Initialisation Phase
• Step 1:
• calculate first initial centre
• the point with the largest number of points within radius r
• remove first initial centre & all points within radius r from further consideration
• Step 2:
• repeat Step 1 until k initial centres have been chosen
• Step 3:
• create initial clusters by assigning all points to the closest cluster centre
• Radius r:
• calculated based on the area of the region considered for clustering
• based on the assumption that all clusters are of the same size
• recalculated after each initial cluster centre is chosen (in the dynamic variant)
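The initialisation phase above can be sketched in Python. This is a minimal toy version with a fixed (static) radius; the function name, the 1-D toy data and the distance matrix are ours for illustration, not from the slides:

```python
def initial_centres(points, dist, k, r):
    """GRAVIclust-style initialisation (static variant): repeatedly pick
    the point with the most neighbours within radius r as a centre,
    discard it and those neighbours, then assign every point to the
    closest centre. dist[i][j] may be a road-network distance."""
    remaining = set(range(len(points)))
    centres = []
    while len(centres) < k and remaining:
        # Step 1: the densest remaining point becomes the next centre
        best = max(remaining, key=lambda i: sum(1 for j in remaining
                                                if j != i and dist[i][j] <= r))
        centres.append(best)
        remaining -= {j for j in remaining if dist[best][j] <= r}
    # Step 3: initial clusters = points assigned to their closest centre
    clusters = {c: [] for c in centres}
    for i in range(len(points)):
        clusters[min(centres, key=lambda c: dist[i][c])].append(i)
    return clusters

# toy data: two well-separated groups on a line
pts = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
d = [[abs(a - b) for b in pts] for a in pts]
clusters = initial_centres(pts, d, k=2, r=2.5)
```

With these inputs the two groups end up in separate initial clusters; a dynamic variant would recompute r after each chosen centre.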
GRAVIclust: Static vs. Dynamic
• Static
• reduced computation
• # points within a radius r has to be calculated only once
• not suitable for problems where the points are separated by large empty areas
• Dynamic
• increased computation time
• the two variants produce different results only when the point distribution is non-uniform
GRAVIclust: Optimisation Phase
• Step 1:
• for each cluster, calculate new centre
• based on the point closest to the cluster's centre of gravity
• Step 2:
• re-assign points to new cluster centres
• Step 3:
• recalculate distance function
• its value is never greater than the previous one
• Step 4:
• repeat Steps 1 to 3 until the value of the distance function equals the previous value
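The four optimisation steps can be sketched as a loop (again a 1-D toy; names and data are ours). The cost never increases, so stopping when it stops decreasing terminates:

```python
def optimise(points, dist, clusters):
    """GRAVIclust-style optimisation: move each centre to the member
    closest to the cluster's centre of gravity, re-assign all points,
    and stop once the distance function stops decreasing."""
    def cost(cl):
        return sum(dist[i][c] for c, members in cl.items() for i in members)

    prev = float('inf')
    while True:
        # Step 1: new centre = member nearest the centre of gravity
        new_centres = []
        for members in clusters.values():
            g = sum(points[i] for i in members) / len(members)
            new_centres.append(min(members, key=lambda i: abs(points[i] - g)))
        # Step 2: re-assign every point to the nearest new centre
        clusters = {c: [] for c in new_centres}
        for i in range(len(points)):
            clusters[min(new_centres, key=lambda c: dist[i][c])].append(i)
        # Steps 3-4: recompute the distance function; stop at the fixed point
        current = cost(clusters)
        if current >= prev:
            return clusters, current
        prev = current

pts = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
d = [[abs(a - b) for b in pts] for a in pts]
final, total = optimise(pts, d, {0: [0, 1, 2], 3: [3, 4, 5]})
```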
GRAVIclust
• Deterministic
• Can handle obstacles
• Monotonic convergence of the distance function to a stable point
AUTOCLUST
• Definitions
• given as formulas on the original slides: LocalMean(pi), MeanStDev(P), and the edge classes ShortEdges(pi), LongEdges(pi) and OtherEdges(pi) derived from them
AUTOCLUST
• Phase 1:
• finding boundaries
• Phase 2:
• restoring and re-attaching
• Phase 3:
• detecting second-order inconsistency
AUTOCLUST: Phase 1
• Finding boundaries
• Calculate
• Delaunay Diagram
• for each point pi
• ShortEdges(pi)
• LongEdges(pi)
• OtherEdges(pi)
• Remove
• ShortEdges(pi) and LongEdges(pi)
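A minimal sketch of Phase 1, assuming the Delaunay graph has already been built (e.g. with scipy.spatial.Delaunay); the edge list, coordinates and function name are ours. An edge survives only if it is in OtherEdges for both endpoints. Note that near a bridge, some genuinely short edges are also stripped; Phase 2 restores them:

```python
import math
from collections import defaultdict

def phase1(edges, coords):
    """AUTOCLUST Phase 1 on a pre-built Delaunay graph: keep an edge only
    if, for both endpoints p, its length is within MeanStDev(P) of
    LocalMean(p); the Short/Long edges outside that band are removed."""
    def length(e):
        (x1, y1), (x2, y2) = coords[e[0]], coords[e[1]]
        return math.hypot(x1 - x2, y1 - y2)

    incident = defaultdict(list)
    for e in edges:
        incident[e[0]].append(e)
        incident[e[1]].append(e)

    local_mean = {p: sum(map(length, es)) / len(es)
                  for p, es in incident.items()}
    local_sd = {p: math.sqrt(sum((length(e) - local_mean[p]) ** 2
                                 for e in es) / len(es))
                for p, es in incident.items()}
    mean_st_dev = sum(local_sd.values()) / len(local_sd)  # MeanStDev(P)

    return [e for e in edges
            if all(abs(length(e) - local_mean[p]) <= mean_st_dev
                   for p in e)]

# two small triangles joined by one long "bridge" edge
coords = {0: (0, 0), 1: (1, 0), 2: (0.5, 1),
          3: (10, 0), 4: (11, 0), 5: (10.5, 1)}
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (1, 3)]
kept = phase1(edges, coords)  # the bridge (1, 3) is removed
```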
AUTOCLUST: Phase 2
• Restoring and re-attaching
• for each point pi where ShortEdges(pi) ≠ ∅
• Determine a candidate connected component C for pi
• If there are 2 edges ej = (pi, pj) and ek = (pi, pk) in ShortEdges(pi) with CC[pj] ≠ CC[pk], then
• Compute, for each edge e = (pi, pj) ∈ ShortEdges(pi), the size ||CC[pj]|| and let M = max over all e = (pi, pj) ∈ ShortEdges(pi) of ||CC[pj]||
• Let C be the class label of the largest connected component (if there are two different connected components with cardinality M, we let C be the one with the shortest edge to pi)
• Otherwise, let C be the label of the connected component that all edges e ∈ ShortEdges(pi) connect pi to
• If the edges in OtherEdges(pi) connect to a connected component different from C, remove them. Note that
• all edges in OtherEdges(pi) are removed, and
• only in this case will pi swap connected components
• Add all edges e ∈ ShortEdges(pi) that connect to C
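The re-attachment rule can be sketched as follows (our own simplified version: it picks the candidate component C and restores short edges, but omits the OtherEdges-removal case; all names and the toy data are ours):

```python
import math
from collections import defaultdict

def components(nodes, edges):
    """Label connected components of an undirected graph."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    label, current = {}, 0
    for n in nodes:
        if n in label:
            continue
        stack = [n]
        while stack:
            v = stack.pop()
            if v not in label:
                label[v] = current
                stack.extend(adj[v] - label.keys())
        current += 1
    return label

def phase2(nodes, kept, short_edges, length):
    """For each point p with non-empty ShortEdges(p): choose C as the
    largest component reached by p's short edges (shortest edge breaks
    ties) and restore the short edges that connect p to C."""
    label = components(nodes, kept)
    size = defaultdict(int)
    for lab in label.values():
        size[lab] += 1
    restored = list(kept)
    for p, es in short_edges.items():
        if not es:
            continue
        best = max(es, key=lambda e: (size[label[e[1]]], -length(e)))
        C = label[best[1]]
        restored += [e for e in es if label[e[1]] == C]
    return restored

# continuing the two-triangles example: Phase 1 kept only (0,2) and (4,5)
coords = {0: (0, 0), 1: (1, 0), 2: (0.5, 1),
          3: (10, 0), 4: (11, 0), 5: (10.5, 1)}
def length(e):
    (x1, y1), (x2, y2) = coords[e[0]], coords[e[1]]
    return math.hypot(x1 - x2, y1 - y2)

kept = [(0, 2), (4, 5)]
short = {1: [(1, 0), (1, 2)], 3: [(3, 4), (3, 5)]}
graph = phase2(range(6), kept, short, length)
```

After restoration the two triangles come back as two complete clusters, while the bridge stays removed.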
AUTOCLUST: Phase 3
• Detecting second-order inconsistency
• compute the LocalMean for 2-neighbourhoods
• remove all edges in N2,G(pi) that are long edges
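A rough sketch of Phase 3 under our own simplifying assumptions: the 2-neighbourhood is approximated as all edges between nodes within two hops of pi, and MeanStDev(P) is passed in as a constant (in the real algorithm it comes from Phase 1's computation):

```python
import math
from collections import defaultdict

def phase3(edges, coords, mean_st_dev):
    """Second-order cleaning: for each point p, gather the edges of its
    2-neighbourhood N2(p) and remove those longer than the local mean
    edge length plus the global MeanStDev(P)."""
    def length(e):
        (x1, y1), (x2, y2) = coords[e[0]], coords[e[1]]
        return math.hypot(x1 - x2, y1 - y2)

    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    doomed = set()
    for p in list(adj):
        # nodes reachable within two hops of p, and the edges among them
        near = {p} | adj[p] | {w for v in adj[p] for w in adj[v]}
        n2 = [e for e in edges if e[0] in near and e[1] in near]
        local_mean = sum(map(length, n2)) / len(n2)
        doomed |= {e for e in n2 if length(e) > local_mean + mean_st_dev}
    return [e for e in edges if e not in doomed]

# a chain with one over-long link in the middle
coords = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (8, 0), 4: (9, 0)}
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
cleaned = phase3(edges, coords, mean_st_dev=1.0)
```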
AUTOCLUST
• No user supplied arguments
• eliminates expensive human-based exploration time for finding best-fit arguments
• Robust to noise, outliers, bridges and type of distribution
• Able to detect clusters with arbitrary shapes, different sizes and different densities
• Can handle multiple bridges
• O(n log n)
AUTOCLUST+
• Construct Delaunay Diagram
• Calculate MeanStDev(P)
• For all edges e, remove e if it intersects an obstacle
• Apply the 3 phases of AUTOCLUST to the planar graph resulting from the previous steps
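The obstacle step reduces to a segment-intersection test over the Delaunay edges. A minimal sketch, with obstacles modelled as line segments (function names and data are ours):

```python
def ccw(a, b, c):
    """Twice the signed area of triangle a-b-c."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def crosses(p1, p2, q1, q2):
    """True if segments p1-p2 and q1-q2 properly intersect."""
    return (ccw(q1, q2, p1) * ccw(q1, q2, p2) < 0 and
            ccw(p1, p2, q1) * ccw(p1, p2, q2) < 0)

def drop_obstacle_edges(edges, coords, obstacles):
    """AUTOCLUST+ pre-step: delete every Delaunay edge that crosses an
    obstacle segment; the three AUTOCLUST phases then run on the rest."""
    return [e for e in edges
            if not any(crosses(coords[e[0]], coords[e[1]], a, b)
                       for a, b in obstacles)]

coords = {0: (0, 0), 1: (2, 0), 2: (0, 2)}
edges = [(0, 1), (0, 2)]
wall = [((1, -1), (1, 1))]   # a vertical obstacle between points 0 and 1
remaining = drop_obstacle_edges(edges, coords, wall)
```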
3D Boundary-based Clustering
• Benefits from 3D Clustering
• more accurate spatial analysis
• distinguish
• positive clusters:
• clusters in higher dimensions but not in lower dimensions
• negative clusters:
• clusters in lower dimensions but not in higher dimensions
3D Boundary-based Clustering
• Based on AUTOCLUST
• Uses Delaunay Tetrahedrizations
• Definitions:
• ej is a potential inter-cluster edge if: (the criterion was given as a formula on the original slide)
3D Boundary-based Clustering
• Phase I
• For all the piP, classify each edge ej incident to pi into one of three groups
• ShortEdges(pi) when the length of ej is less than the range in AI(pi)
• LongEdges(pi) when the length of ej is greater than the range in AI(pi)
• OtherEdges(pi) when the length of ej is within AI(pi)
• For all the piP, remove all edges in ShortEdges(pi) and LongEdges(pi)
3D Boundary-based Clustering
• Phase II
• Recover the ShortEdges(pi) incident to border points using connected component analysis
• Phase III
• Remove exceptionally long edges in local regions
Shared Nearest Neighbour
• Clustering in higher dimensions
• Distances or similarities between points become more uniform, making clustering more difficult
• Also, similarity between points can be misleading
• i.e. a point can be more similar to a point that “actually” belongs to a different cluster than to points in its own cluster
• Solution
• Shared nearest neighbor approach to similarity
SNN: An alternative definition of similarity
• Euclidean distance
• most common distance metric used
• while useful in low dimensions, it doesn’t work well in high dimensions
SNN: An alternative definition of similarity
• Define the similarity of two points in terms of their shared nearest neighbours
• the similarity of the points is “confirmed” by their common shared nearest neighbours
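A minimal sketch of this similarity, assuming the common convention that two points get a non-zero score only when each is in the other's k-NN list (function name and toy data are ours):

```python
def snn_similarity(data, k, dist):
    """SNN similarity: number of neighbours shared by two points'
    k-nearest-neighbour lists, counted only when each point is also
    in the other's list."""
    n = len(data)
    knn = [set(sorted((j for j in range(n) if j != i),
                      key=lambda j: dist(data[i], data[j]))[:k])
           for i in range(n)]
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and j in knn[i] and i in knn[j]:
                sim[i][j] = len(knn[i] & knn[j])
    return sim

data = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
sim = snn_similarity(data, k=2, dist=lambda a, b: abs(a - b))
# points 0 and 1 share a neighbour; points 0 and 3 share none
```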
SNN: An alternative definition of density
• SNN similarity, with the k-nearest neighbour approach
• if the k-nearest neighbours of a point, with respect to SNN similarity, are close, then we say that there is a high density at this point
• since it reflects the local configuration of the points in the data space, it is relatively insensitive to variations in density and the dimensionality of the space
SNN: Algorithm
• Compute the similarity matrix
• corresponds to a similarity graph with data points for nodes and edges whose weights are the similarities between data points
• Sparsify the similarity matrix by keeping only the k most similar neighbours
• corresponds to keeping only the k strongest links of the similarity graph
• Construct the shared nearest neighbour graph from the sparsified similarity matrix
• Find the SNN density of each point
• Find the core points
• Form clusters from the core points
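The steps above can be sketched end to end. This is our own toy version; the eps/min_pts thresholds follow the usual Ertöz-Steinbach-Kumar formulation and are not named on the slides:

```python
from itertools import combinations

def snn_cluster(data, k, eps, min_pts, dist):
    """SNN clustering sketch: sparsified k-NN similarity -> SNN graph ->
    SNN density -> core points -> clusters as connected components of
    core points over strong links."""
    n = len(data)
    # steps 1-2: similarity matrix, sparsified to the k nearest neighbours
    knn = [set(sorted((j for j in range(n) if j != i),
                      key=lambda j: dist(data[i], data[j]))[:k])
           for i in range(n)]
    # step 3: SNN graph, weight = shared neighbours (mutual k-NN only)
    w = {(i, j): len(knn[i] & knn[j])
         for i, j in combinations(range(n), 2)
         if j in knn[i] and i in knn[j]}
    # step 4: SNN density = number of strong (weight >= eps) links at a point
    density = [sum(1 for e, wt in w.items() if i in e and wt >= eps)
               for i in range(n)]
    # step 5: core points have density >= min_pts
    core = {i for i in range(n) if density[i] >= min_pts}
    # step 6: clusters = connected components of core points
    label, current = {}, 0
    for c in sorted(core):
        if c in label:
            continue
        stack, label[c] = [c], current
        while stack:
            v = stack.pop()
            for (a, b), wt in w.items():
                if wt >= eps and v in (a, b):
                    u = b if v == a else a
                    if u in core and u not in label:
                        label[u] = current
                        stack.append(u)
        current += 1
    return label

data = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
labels = snn_cluster(data, k=2, eps=1, min_pts=2,
                     dist=lambda a, b: abs(a - b))
```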