# Clustering




### Clustering

An overview of clustering algorithms

Dènis de Keijzer

GIA 2004

### Overview

• Algorithms

• GRAVIclust

• AUTOCLUST

• AUTOCLUST+

• 3D Boundary-based Clustering

• SNN

### Gravity based spatial clustering

• GRAVIclust

• Initialisation Phase

• calculate the initial centre clusters

• Optimisation Phase

• improve the positions of the cluster centres so as to reach a solution that minimizes the distance function

### GRAVIclust: Initialisation Phase

• Input:

• set of points P

• matrix of distances between all pairs of points

• assumption: actual access-path distance

• exists in GIS maps

• e.g. http://www.transinfo.qld.gov.au

• very versatile

• footpath maps

• rail maps

• # of required clusters k

### GRAVIclust: Initialisation Phase

• Step 1:

• calculate first initial centre

• the point with the largest number of points within radius r

• remove first initial centre & all points within radius r from further consideration

• Step 2:

• repeat Step 1 until k initial centres have been chosen

• Step 3:

• create initial clusters by assigning all points to the closest cluster centre

• the radius r is calculated based on the area of the region considered for clustering

• based on the assumption that all clusters are of the same size

• in the dynamic variant, recalculated after each initial cluster centre is chosen
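The steps above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the function name `graviclust_init` is made up, and plain Euclidean distance stands in for the access-path distance matrix the algorithm actually takes as input.

```python
import math

def graviclust_init(points, k, r):
    """Sketch of the GRAVIclust initialisation phase (static radius).

    Steps 1-2: repeatedly pick the remaining point with the most
    neighbours within radius r as the next centre, then drop that
    neighbourhood.  Step 3: assign every point to its closest centre.
    """
    # The paper uses an access-path distance matrix; Euclidean
    # distance stands in for it in this sketch.
    dist = math.dist

    remaining = list(points)
    centres = []
    for _ in range(k):
        if not remaining:
            break
        # Point with the largest number of points within radius r.
        best = max(remaining,
                   key=lambda p: sum(dist(p, q) <= r for q in remaining))
        centres.append(best)
        # Remove the new centre and all points within radius r of it.
        remaining = [q for q in remaining if dist(best, q) > r]
    # Initial clusters: each point joins its closest centre.
    clusters = {c: [] for c in centres}
    for p in points:
        clusters[min(centres, key=lambda c: dist(p, c))].append(p)
    return centres, clusters
```

With two well-separated groups and r on the order of a group's radius, the greedy choice lands one centre in each group.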

### GRAVIclust: Static vs. Dynamic

• Static

• reduced computation

• the number of points within radius r has to be calculated only once

• not suitable for problems where the points are separated by large empty areas

• Dynamic

• increases computation time

• differs from the static variant only when the point distribution is non-uniform

### GRAVIclust: Optimisation Phase

• Step 1:

• for each cluster, calculate new centre

• based on the point closest to the cluster's centre of gravity

• Step 2:

• re-assign points to new cluster centres

• Step 3:

• recalculate distance function

• never greater than the previous value

• Step 4:

• repeat Steps 1 to 3 until the value of the distance function equals the previous value
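A matching sketch of the optimisation loop, under the same illustrative assumptions (Euclidean distance, made-up name `graviclust_optimise`):

```python
import math

def graviclust_optimise(points, centres):
    """Sketch of the GRAVIclust optimisation phase.

    Each iteration moves every centre to the member point nearest the
    cluster's centre of gravity, reassigns all points, and stops as
    soon as the total-distance objective no longer decreases."""
    dist = math.dist

    def assign(centres):
        clusters = {c: [] for c in centres}
        for p in points:
            clusters[min(centres, key=lambda c: dist(p, c))].append(p)
        return clusters

    def objective(clusters):
        # Sum of distances from each point to its cluster centre.
        return sum(dist(c, p) for c, members in clusters.items()
                   for p in members)

    clusters = assign(centres)
    best = objective(clusters)
    while True:
        new_centres = []
        for c, members in clusters.items():
            if not members:
                new_centres.append(c)
                continue
            # Centre of gravity of the cluster ...
            gx = sum(p[0] for p in members) / len(members)
            gy = sum(p[1] for p in members) / len(members)
            # ... and the member point closest to it is the new centre.
            new_centres.append(min(members, key=lambda p: dist(p, (gx, gy))))
        new_clusters = assign(new_centres)
        value = objective(new_clusters)
        if value >= best:  # the objective is monotone; stop at a fixed point
            return clusters
        clusters, best = new_clusters, value
```

The `value >= best` check mirrors the convergence criterion on the slide: the distance function is never greater than in the previous iteration, so the loop terminates when it stops strictly improving.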

### GRAVIclust

• Deterministic

• Can handle obstacles

• Monotonic convergence of the distance function to a stable point

### AUTOCLUST

• Definitions

• Definitions II

### AUTOCLUST

• Phase 1:

• finding boundaries

• Phase 2:

• restoring and re-attaching

• Phase 3:

• detecting second-order inconsistency

### AUTOCLUST: Phase 1

• Finding boundaries

• Calculate

• Delaunay Diagram

• for each point pi

• ShortEdges(pi)

• LongEdges(pi)

• OtherEdges(pi)

• Remove

• ShortEdges(pi) and LongEdges(pi)
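Assuming the Delaunay diagram is already available as an edge list (e.g. from `scipy.spatial.Delaunay`), the Phase 1 classification can be sketched as below. It follows the usual AUTOCLUST criteria, with Short/Long cut-offs at LocalMean(pi) ± MeanStDev(P); the function name is illustrative.

```python
import math
from statistics import mean, stdev

def classify_edges(points, edges):
    """AUTOCLUST Phase 1 sketch: classify each point's incident
    Delaunay edges as short, long, or other.  `edges` is assumed to be
    the edge set of a precomputed Delaunay diagram, given as pairs of
    indices into `points`."""
    length = {e: math.dist(points[e[0]], points[e[1]]) for e in edges}
    incident = {i: [e for e in edges if i in e] for i in range(len(points))}
    # LocalMean(pi): mean length of the edges incident to pi.
    local_mean = {i: mean(length[e] for e in es) if es else 0.0
                  for i, es in incident.items()}
    # MeanStDev(P): average over all points of the local st. deviation.
    local_sd = [stdev([length[e] for e in es]) if len(es) > 1 else 0.0
                for es in incident.values()]
    mean_st_dev = mean(local_sd)
    groups = {}
    for i, es in incident.items():
        short = [e for e in es if length[e] < local_mean[i] - mean_st_dev]
        long_ = [e for e in es if length[e] > local_mean[i] + mean_st_dev]
        other = [e for e in es if e not in short and e not in long_]
        groups[i] = {"short": short, "long": long_, "other": other}
    return groups
```

Removing `groups[i]["short"]` and `groups[i]["long"]` for every point then yields the boundary-respecting graph that Phases 2 and 3 refine.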

### AUTOCLUST: Phase 2

• Restoring and re-attaching

• for each point pi where ShortEdges(pi) ≠ ∅

• Determine a candidate connected component C for pi

• If there are 2 edges ej = (pi, pj) and ek = (pi, pk) in ShortEdges(pi) with CC[pj] ≠ CC[pk], then

• Compute, for each edge e = (pi, pj) ∈ ShortEdges(pi), the size ||CC[pj]||, and let M = max over all e = (pi, pj) ∈ ShortEdges(pi) of ||CC[pj]||

• Let C be the class label of the largest connected component (if there are two different connected components with cardinality M, let C be the one with the shortest edge to pi)

• Otherwise, let C be the label of the connected component that all edges e ∈ ShortEdges(pi) connect pi to

• If the edges in OtherEdges(pi) connect to a connected component different from C, remove them. Note that

• all edges in OtherEdges(pi) are removed, and

• only in this case will pi swap connected components

• Add all edges e ∈ ShortEdges(pi) that connect pi to C

### AUTOCLUST: Phase 3

• Detecting second-order inconsistency

• compute the LocalMean for 2-neighbourhoods

• remove all edges in N2,G(pi) that are long edges

### AUTOCLUST

• No user supplied arguments

• eliminates expensive human-based exploration time for finding best-fit arguments

• Robust to noise, outliers, bridges and type of distribution

• Able to detect clusters with arbitrary shapes, different sizes and different densities

• Can handle multiple bridges

• runs in O(n log n) time

### AUTOCLUST+

• Construct Delaunay Diagram

• Calculate MeanStDev(P)

• For all edges e, remove e if it intersects some obstacles

• Apply the 3 phases of AUTOCLUST to the planar graph resulting from the previous steps

### 3D Boundary-based Clustering

• Benefits from 3D Clustering

• more accurate spatial analysis

• distinguish

• positive clusters:

• clusters in higher dimensions but not in lower dimensions

• negative clusters:

• clusters in lower dimensions but not in higher dimensions

### 3D Boundary-based Clustering

• Based on AUTOCLUST

• Uses Delaunay Tetrahedrizations

• Definitions:

• ej potential inter-cluster edge if:

### 3D Boundary-based Clustering

• Phase I

• For all the piP, classify each edge ej incident to pi into one of three groups

• ShortEdges(pi) when the length of ej is less than the range in AI(pi)

• LongEdges(pi) when the length of ej is greater than the range in AI(pi)

• OtherEdges(pi) when the length of ej is within AI(pi)

• For all the piP, remove all edges in ShortEdges(pi) and LongEdges(pi)

### 3D Boundary-based Clustering

• Phase II

• Recover ShortEdges(pi) incident to border points using connected-component analysis

• Phase III

• Remove exceptionally long edges in local regions

### Shared Nearest Neighbour

• Clustering in higher dimensions

• Distances or similarities between points become more uniform, making clustering more difficult

• Also, similarity between points can be misleading

• e.g. a point can be more similar to a point that “actually” belongs to a different cluster

• Solution

• Shared nearest neighbor approach to similarity

### SNN: An alternative definition of similarity

• Euclidean distance

• most common distance metric used

• while useful in low dimensions, it doesn’t work well in high dimensions

### SNN: An alternative definition of similarity

• Define similarity in terms of their shared nearest neighbours

• the similarity of the points is “confirmed” by their common shared nearest neighbours
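A minimal sketch of this idea: build k-nearest-neighbour lists and count shared entries. Brute-force Euclidean k-NN is assumed here, and the function names are made up for illustration.

```python
import math

def knn_lists(points, k):
    """k-nearest-neighbour index lists by Euclidean distance
    (brute force, O(n^2))."""
    lists = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        lists.append(others[:k])
    return lists

def snn_similarity(nn, i, j):
    """SNN similarity of points i and j: how many nearest neighbours
    their two k-NN lists share."""
    return len(set(nn[i]) & set(nn[j]))
```

Because only neighbour-list overlap matters, the raw distances drop out, which is what makes the measure usable when distances themselves have become uninformative.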

### SNN: An alternative definition of density

• SNN similarity, with the k-nearest neighbour approach

• if the k-nearest neighbours of a point, with respect to SNN similarity, are close, then we say that there is a high density at this point

• since it reflects the local configuration of the points in the data space, it is relatively insensitive to variations in density and to the dimensionality of the space

### SNN: Algorithm

• Compute the similarity matrix

• corresponds to a similarity graph with data points for nodes and edges whose weights are the similarities between data points

• Sparsify the similarity matrix by keeping only the k most similar neighbours

• corresponds to keeping only the k strongest links of the similarity graph

• Construct the shared nearest neighbour graph from the sparsified similarity matrix

• Find the SNN density of each point

• Find the core points

• Form clusters from the core points
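The pipeline above can be sketched end-to-end. This is one plausible reading, not the canonical implementation: `eps` (minimum SNN similarity for a "strong" link) and `min_pts` (minimum density for a core point) are illustrative DBSCAN-style parameters, and clusters are grown as connected components of core points over strong links.

```python
import math

def snn_cluster(points, k, eps, min_pts):
    """End-to-end SNN sketch.

    1. brute-force k-NN lists (the sparsified similarity graph)
    2. SNN similarity = size of the overlap of two k-NN lists
    3. SNN density  = number of strong links (similarity >= eps)
    4. core points  = points with density >= min_pts
    5. clusters     = components of core points over strong links
    Non-core points are left unlabelled (noise) in this sketch."""
    n = len(points)
    nn = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        nn.append(set(order[:k]))
    # Strong-link candidates: pairs where j is in i's k-NN list.
    sim = {}
    for i in range(n):
        for j in nn[i]:
            if i < j:
                sim[(i, j)] = len(nn[i] & nn[j])
    density = [sum(1 for (a, b), s in sim.items()
                   if s >= eps and i in (a, b)) for i in range(n)]
    core = {i for i in range(n) if density[i] >= min_pts}
    # Flood-fill clusters over strong links between core points.
    label, next_label = {}, 0
    for c in core:
        if c in label:
            continue
        label[c], stack = next_label, [c]
        while stack:
            u = stack.pop()
            for (a, b), s in sim.items():
                if s >= eps and u in (a, b):
                    v = b if u == a else a
                    if v in core and v not in label:
                        label[v] = next_label
                        stack.append(v)
        next_label += 1
    return label
```

On two well-separated groups of four points with k=3, every point's neighbour list stays inside its own group, so the strong-link graph splits into exactly two components.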