
Clustering

An overview of clustering algorithms

Dènis de Keijzer

GIA 2004


Overview

  • Algorithms

    • GRAVIclust

    • AUTOCLUST

    • AUTOCLUST+

    • 3D Boundary-based Clustering

    • SNN


Gravity based spatial clustering

  • GRAVIclust

    • Initialisation Phase

      • calculate the initial cluster centres

    • Optimisation Phase

      • improve the position of the cluster centres so as to achieve a solution which minimizes the distance function


GRAVIclust: Initialisation Phase

  • Input:

    • set of points P

    • matrix of distances between all pairs of points

      • assumption: actual access path distance

      • exists in GIS maps

        • e.g. http://www.transinfo.qld.gov.au

      • very versatile

        • footpath

        • road map

        • rail map

    • # of required clusters k


GRAVIclust: Initialisation Phase

  • Step 1:

    • calculate first initial centre

      • the point with the largest number of points within radius r

      • remove first initial centre & all points within radius r from further consideration

  • Step 2:

    • repeat Step 1 until k initial centres have been chosen

  • Step 3:

    • create initial clusters by assigning all points to the closest cluster centre
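The three steps above can be sketched as follows; the function name and the use of a precomputed pairwise distance matrix D are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch of the GRAVIclust initialisation phase.
import numpy as np

def graviclust_init(D, k, r):
    """Pick k initial centres greedily: each centre is the remaining point
    with the most remaining points within radius r; then assign every
    point to its closest centre. D is the full pairwise distance matrix."""
    n = D.shape[0]
    active = np.ones(n, dtype=bool)
    centres = []
    for _ in range(k):
        # Count, for each point, the still-active points within radius r.
        counts = ((D <= r) & active).sum(axis=1)
        counts[~active] = -1            # never pick an already-removed point
        c = int(np.argmax(counts))
        centres.append(c)
        # Remove the new centre and all points within radius r of it.
        active &= D[c] > r
        active[c] = False
    # Step 3: initial clusters = nearest-centre assignment for all points.
    labels = np.argmin(D[:, centres], axis=1)
    return centres, labels
```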


GRAVIclust: radius calculation

  • Radius r

    • calculated based on the area of the region considered for clustering

    • static radius

      • based on the assumption that all clusters are of the same size

    • dynamic radius

      • recalculated after each initial cluster centre is chosen


GRAVIclust: Static vs. Dynamic

  • Static

    • reduced computation

    • # points within a radius r has to be calculated only once

    • not suitable for problems where the points are separated by large empty areas

  • Dynamic

    • increases computation time

    • ensures the radius is adjusted as the points are removed

  • The two differ only when the point distribution is non-uniform
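As a rough sketch of the two radius policies: the slides say only that r is calculated from the area of the clustering region, so the formula below, r = sqrt(area / (k·π)), which tiles the region with k equal circles, is an assumption.

```python
import math

def static_radius(area, k):
    # Computed once, assuming all k clusters cover equal areas.
    return math.sqrt(area / (k * math.pi))

def dynamic_radius(area_remaining, k_remaining):
    # Recomputed after each initial centre is chosen, from what is left.
    return math.sqrt(area_remaining / (k_remaining * math.pi))
```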


GRAVIclust: Optimisation Phase

  • Step 1:

    • for each cluster, calculate new centre

      • based on the point closest to the cluster's centre of gravity

  • Step 2:

    • re-assign points to new cluster centres

  • Step 3:

    • recalculate distance function

      • never greater than previous

  • Step 4:

    • repeat Steps 1 to 3 until the value of the distance function equals the previous value
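A minimal sketch of the optimisation loop, assuming the distance function is the sum of distances from each point to its cluster centre (the slides do not define it explicitly):

```python
import numpy as np

def graviclust_optimise(points, D, centres, labels):
    """Iteratively move each centre to the member point nearest the
    cluster's centre of gravity, re-assign, and stop when the distance
    function no longer decreases. D is the pairwise distance matrix."""
    def total_distance(cs, ls):
        return sum(D[i, cs[ls[i]]] for i in range(len(ls)))
    prev = total_distance(centres, labels)
    while True:
        # Step 1: new centre = member point closest to the centre of gravity.
        new_centres = []
        for c in range(len(centres)):
            members = np.where(labels == c)[0]
            gravity = points[members].mean(axis=0)
            nearest = members[np.argmin(
                np.linalg.norm(points[members] - gravity, axis=1))]
            new_centres.append(int(nearest))
        # Step 2: re-assign every point to its new closest centre.
        new_labels = np.argmin(D[:, new_centres], axis=1)
        # Steps 3-4: recalculate; stop once the value stops improving.
        cur = total_distance(new_centres, new_labels)
        if cur >= prev:
            return centres, labels
        centres, labels, prev = new_centres, new_labels, cur
```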


GRAVIclust

  • Deterministic

  • Can handle obstacles

  • Monotonic convergence of the distance function to a stable point


AUTOCLUST

  • Definitions


AUTOCLUST

  • Definitions II


AUTOCLUST

  • Phase 1:

    • finding boundaries

  • Phase 2:

    • restoring and re-attaching

  • Phase 3:

    • detecting second-order inconsistency


AUTOCLUST: Phase 1

  • Finding boundaries

    • Calculate

      • Delaunay Diagram

      • for each point pi

        • ShortEdges(pi)

        • LongEdges(pi)

        • OtherEdges(pi)

    • Remove

      • ShortEdges(pi) and LongEdges(pi)
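Phase 1 can be sketched with scipy's Delaunay triangulation. The slides' Definitions were given as figures that did not survive the transcript; the cut-offs below, LocalMean(pi) ± MeanStDev(P), follow the usual AUTOCLUST criterion and should be checked against the original paper.

```python
import numpy as np
from scipy.spatial import Delaunay

def classify_edges(points):
    """For each point, split incident Delaunay edges into short / long /
    other, using LocalMean(p_i) +/- MeanStDev(P) as the assumed interval."""
    tri = Delaunay(points)
    # Collect the unique Delaunay edges from the triangle list.
    edges = set()
    for s in tri.simplices:
        for a in range(3):
            for b in range(a + 1, 3):
                edges.add(tuple(sorted((s[a], s[b]))))
    length = {e: np.linalg.norm(points[e[0]] - points[e[1]]) for e in edges}
    incident = {i: [] for i in range(len(points))}
    for e in edges:
        incident[e[0]].append(e)
        incident[e[1]].append(e)
    local_mean = {i: np.mean([length[e] for e in es])
                  for i, es in incident.items()}
    # MeanStDev(P): mean over all points of the st. dev. of incident lengths.
    mean_st_dev = np.mean([np.std([length[e] for e in es])
                           for es in incident.values()])
    short, long_, other = {}, {}, {}
    for i, es in incident.items():
        short[i] = [e for e in es if length[e] < local_mean[i] - mean_st_dev]
        long_[i] = [e for e in es if length[e] > local_mean[i] + mean_st_dev]
        other[i] = [e for e in es if e not in short[i] and e not in long_[i]]
    return short, long_, other
```

Removing every edge in ShortEdges(pi) and LongEdges(pi) then leaves the connected components used by Phase 2.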


AUTOCLUST: Phase 2

  • Restoring and re-attaching

    • for each point pi where ShortEdges(pi) ≠ ∅

      • Determine a candidate connected component C for pi

        • If there are 2 edges ej = (pi,pj) and ek = (pi,pk) in ShortEdges(pi) with CC[pj] ≠ CC[pk], then

          • Compute, for each edge e = (pi,pj) ∈ ShortEdges(pi), the size ||CC[pj]||, and let M = max over e = (pi,pj) ∈ ShortEdges(pi) of ||CC[pj]||

          • Let C be the class label of the largest connected component (if there are two different connected components with cardinality M, we let C be the one with the shortest edge to pi)

        • Otherwise, let C be the label of the connected component that all edges e ∈ ShortEdges(pi) connect pi to

      • If the edges in OtherEdges(pi) connect to a connected component different from C, remove them. Note that

        • all edges in OtherEdges(pi) are removed, and

        • only in this case will pi swap connected components

      • Add all edges e ∈ ShortEdges(pi) that connect to C


AUTOCLUST: Phase 3

  • Detecting second-order inconsistency

    • compute the LocalMean for 2-neighbourhoods

    • remove all edges in N2,G(pi) that are long edges
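A sketch of Phase 3 under assumed representations (adjacency sets plus an edge-length map): N2,G(pi) is taken to be the edges within two hops of pi in the current graph G, and "long" means longer than the 2-neighbourhood's LocalMean plus MeanStDev(P).

```python
def phase3(adj, length, mean_st_dev):
    """adj: dict point -> set of neighbours; length: dict (a, b) -> edge
    length with a < b. Removes second-order inconsistent (long) edges."""
    to_remove = set()
    for p in adj:
        # Edges of the 2-neighbourhood: incident to p or to a neighbour of p.
        near = {p} | adj[p]
        edges = {tuple(sorted((a, b))) for a in near for b in adj[a]}
        if not edges:
            continue
        local_mean = sum(length[e] for e in edges) / len(edges)
        for e in edges:
            if length[e] > local_mean + mean_st_dev:
                to_remove.add(e)
    for a, b in to_remove:
        adj[a].discard(b)
        adj[b].discard(a)
    return to_remove
```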


AUTOCLUST

  • No user supplied arguments

    • eliminates expensive human-based exploration time for finding best-fit arguments

  • Robust to noise, outliers, bridges and type of distribution

  • Able to detect clusters with arbitrary shapes, different sizes and different densities

  • Can handle multiple bridges

  • O(n log n)


AUTOCLUST+

  • Construct Delaunay Diagram

  • Calculate MeanStDev(P)

  • For all edges e, remove e if it intersects an obstacle

  • Apply the 3 phases of AUTOCLUST to the planar graph resulting from the previous steps
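The obstacle step can be sketched with a standard orientation-based segment-intersection test; the edge and obstacle representations are assumptions, and for simplicity only proper crossings are handled (touching and collinear overlaps are ignored).

```python
def segments_intersect(p1, p2, q1, q2):
    """True if segment p1-p2 properly crosses segment q1-q2."""
    def orient(a, b, c):
        v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
        return (v > 0) - (v < 0)
    return (orient(p1, p2, q1) != orient(p1, p2, q2)
            and orient(q1, q2, p1) != orient(q1, q2, p2))

def remove_obstructed(edges, points, obstacles):
    """Keep only the Delaunay edges that cross no obstacle segment."""
    return [e for e in edges
            if not any(segments_intersect(points[e[0]], points[e[1]], a, b)
                       for a, b in obstacles)]
```

AUTOCLUST's three phases are then run on the planar graph that remains.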


3D Boundary-based Clustering

  • Benefits from 3D Clustering

    • more accurate spatial analysis

    • distinguish

      • positive clusters:

        • clusters in higher dimensions but not in lower dimensions

      • negative clusters:

        • clusters in lower dimensions but not in higher dimensions


3D Boundary-based Clustering

  • Based on AUTOCLUST

  • Uses Delaunay Tetrahedrizations

  • Definitions:

    • ej is a potential inter-cluster edge if:


3D Boundary-based Clustering

  • Phase I

    • For all the piP, classify each edge ej incident to pi into one of three groups

      • ShortEdges(pi) when the length of ej is less than the range in AI(pi)

      • LongEdges(pi) when the length of ej is greater than the range in AI(pi)

      • OtherEdges(pi) when the length of ej is within AI(pi)

    • For all the piP, remove all edges in ShortEdges(pi) and LongEdges(pi)


3D Boundary-based Clustering

  • Phase II

    • Recuperate ShortEdges(pi) incident to border points using connected component analysis

  • Phase III

    • Remove exceptionally long edges in local regions


Shared Nearest Neighbour

  • Clustering in higher dimensions

    • Distances or similarities between points become more uniform, making clustering more difficult

    • Also, similarity between points can be misleading

      • e.g. a point can be more similar to a point that “actually” belongs to a different cluster

    • Solution

      • Shared nearest neighbor approach to similarity


SNN: An alternative definition of similarity

  • Euclidean distance

    • most common distance metric used

    • while useful in low dimensions, it doesn’t work well in high dimensions
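A small numeric illustration, not from the slides, of why Euclidean distance degrades in high dimensions: for uniformly random points, the nearest and farthest neighbours of a point become nearly equidistant, so the contrast that clustering relies on vanishes.

```python
import numpy as np

def distance_contrast(dim, n=200, seed=0):
    """Ratio of farthest to nearest Euclidean distance from one point
    to n-1 uniform random points in [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    X = rng.random((n, dim))
    d = np.linalg.norm(X - X[0], axis=1)[1:]   # distances from point 0
    return d.max() / d.min()

# The contrast ratio shrinks towards 1 as the dimension grows.
```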


SNN: An alternative definition of similarity

  • Define similarity in terms of their shared nearest neighbours

    • the similarity of the points is “confirmed” by their common shared nearest neighbours


SNN: An alternative definition of density

  • SNN similarity, with the k-nearest neighbour approach

    • if the k nearest neighbours of a point, with respect to SNN similarity, are close, then we say that there is a high density at this point

    • since it reflects the local configuration of the points in the data space, it is relatively insensitive to variations in density and the dimensionality of the space


SNN: Algorithm

  • Compute the similarity matrix

    • corresponds to a similarity graph with data points for nodes and edges whose weights are the similarities between data points

  • Sparsify the similarity matrix by keeping only the k most similar neighbours

    • corresponds to keeping only the k strongest links of the similarity graph

  • Construct the shared nearest neighbour graph from the sparsified similarity matrix

  • Find the SNN density of each point

  • Find the core points

  • Form clusters from the core points

  • Discard all noise points

  • Assign all non-noise, non-core points to clusters
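The whole pipeline can be sketched DBSCAN-style; the parameter names eps (minimum SNN similarity for a link) and min_pts (minimum SNN density for a core point), and the Euclidean k-NN lists, are assumptions, not from the slides.

```python
import numpy as np

def snn_cluster(X, k, eps, min_pts):
    """Label each point with a cluster id; noise points stay at -1."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    knn = [set(np.argsort(D[i])[1:k + 1]) for i in range(n)]
    # Sparsified SNN graph: links only between mutual k-NN,
    # weighted by the number of shared neighbours.
    S = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in knn[i]:
            if i in knn[j]:
                S[i, j] = S[j, i] = len(knn[i] & knn[j])
    density = (S >= eps).sum(axis=1)      # SNN density of each point
    core = density >= min_pts             # core points
    # Form clusters: connected components of core points linked by S >= eps.
    labels = -np.ones(n, dtype=int)
    cid = 0
    for i in range(n):
        if core[i] and labels[i] == -1:
            stack, labels[i] = [i], cid
            while stack:
                p = stack.pop()
                for q in range(n):
                    if core[q] and labels[q] == -1 and S[p, q] >= eps:
                        labels[q] = cid
                        stack.append(q)
            cid += 1
    # Assign non-core points to the closest linked core cluster;
    # anything with no such link is discarded as noise (-1).
    for i in range(n):
        if labels[i] == -1:
            for q in np.argsort(D[i]):
                if core[q] and S[i, q] >= eps:
                    labels[i] = labels[q]
                    break
    return labels
```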


Shared Nearest Neighbour

  • Finds clusters of varying shapes, sizes, and densities, even in the presence of noise and outliers

  • Handles data of high dimensionality and varying densities

  • Automatically detects the # of clusters

