Efficient Algorithms for Non-Parametric Clustering With Clutter

Weng-Keen Wong

Andrew Moore


Problems From the Physical Sciences

  • Minefield detection (Dasgupta and Raftery 1998)

  • Earthquake faults (Byers and Raftery 1998)


Problems From the Physical Sciences

(Pereira 2002)

(Sloan Digital Sky Survey 2000)


A Simplified Example


Clustering with Traditional Algorithms

Single Linkage Clustering

Mixture of Gaussians with a Uniform Background Component


Clustering with CFF

Original Dataset

Cuevas-Febrero-Fraiman


Related Work

(Dasgupta and Raftery 1998)

  • Mixture model approach: mixture of Gaussians for features, Poisson process for clutter

(Byers and Raftery 1998)

  • K-nearest neighbour distances for all points modeled as a mixture of two gamma distributions, one for clutter and one for the features

  • Classify each data point based on which component it was most likely generated from


Outline

1. Introduction: Clustering and Clutter

2. The Cuevas-Febrero-Fraiman Algorithm

3. Optimizing Step One of CFF

4. Optimizing Step Two of CFF

5. Results


The CFF Algorithm Step One

Find the high-density datapoints


The CFF Algorithm Step Two

  • Cluster the high density points using Single Linkage Clustering

  • Stop when link length > ε
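
A minimal brute-force sketch of this step, not the optimized method developed later in the talk. It assumes the link-length cutoff is passed as `eps`; the function name `single_linkage_cutoff` is illustrative. Single linkage stopped at a cutoff yields the connected components of the graph that links every pair of points at distance <= `eps`, so a union-find over those pairs is enough.

```python
import numpy as np

def single_linkage_cutoff(points, eps):
    """Label points so that two points share a label iff a chain of links,
    each of length <= eps, connects them (brute force, O(N^2) time)."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        # Distances from point i to every later point.
        d = np.linalg.norm(points[i + 1:] - points[i], axis=1)
        for j in np.nonzero(d <= eps)[0]:
            ri, rj = find(i), find(i + 1 + int(j))
            if ri != rj:
                parent[rj] = ri
    return [find(i) for i in range(n)]

# Hypothetical data: a chain of three points and a separate pair.
labels = single_linkage_cutoff(
    [[0, 0], [0, 1], [0, 2], [10, 10], [10, 11]], eps=1.5)
```

The chain of three points receives one label and the distant pair another, even though the two ends of the chain are more than `eps` apart.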


The CFF Algorithm

  • Originally intended to estimate the number of clusters

  • Can also be used to find clusters against a noisy background


Step One: Non-Parametric Density Estimator

A datapoint is a high-density datapoint if the number of datapoints within a hypersphere of radius h is > threshold c
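
A brute-force sketch of this test, for illustration only; the talk's own speedups (dual trees, early cutoff) are not shown. The name `high_density_mask` is hypothetical, and two details are assumptions because the slide does not state them: a point does not count itself, and "within" means distance <= h.

```python
import numpy as np

def high_density_mask(points, h, c):
    """Return a boolean array: True where more than c other points lie
    within distance h of the point (assumption: self is excluded)."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # all pairwise distances
    counts = (dist <= h).sum(axis=1) - 1       # drop the zero self-distance
    return counts > c

# Hypothetical data: three nearby points and one isolated point.
mask = high_density_mask([[0, 0], [0.1, 0], [0, 0.1], [5, 5]], h=0.5, c=1)
```

This materializes the full distance matrix, so it is only practical for small inputs.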


Speeding up the Non-Parametric Density Estimator

  • Addressed in a separate paper (Gray and Moore 2001)

  • Two basic ideas:

    1. Use a dual tree algorithm (Gray and Moore 2000)

    2. Cut search off early without computing exact densities (Moore 2000)


Step Two: Euclidean Minimum Spanning Trees (EMSTs)

  • Traditional MST algorithms assume you are given all the distances

  • Implies O(N²) memory usage

  • Want to use a Euclidean Minimum Spanning Tree algorithm
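
To make the memory concern concrete, here is a small arithmetic sketch. The helper name and the 8-byte-per-distance figure are assumptions chosen for illustration.

```python
def pairwise_distance_bytes(n_points, bytes_per_distance=8):
    """Memory needed to store one distance per unordered pair of points."""
    n_pairs = n_points * (n_points - 1) // 2
    return n_pairs * bytes_per_distance

# One million points with 8-byte floats: roughly 4 terabytes of distances.
bytes_needed = pairwise_distance_bytes(1_000_000)
```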


Optimizing Clustering Step

  • Exploit recent results in computational geometry for efficient EMSTs

  • Involves a modification to the GeoMST2 algorithm (Narasimhan et al. 2000)

  • GeoMST2 is based on Well-Separated Pairwise Decompositions (WSPDs) (Callahan 1995)

  • Our optimizations gain an order of magnitude speedup, especially in higher dimensions


Outline for Optimizing Step Two

1. High level overview of GeoMST2

2. Example of a WSPD

3. More detailed description of GeoMST2

4. Our optimizations


Intuition behind GeoMST2


High Level Overview of GeoMST2

1. Create the Well-Separated Pairwise Decomposition: (A1,B1), (A2,B2), ..., (Am,Bm)

Each pair (Ai,Bi) represents a possible edge in the MST


High Level Overview of GeoMST2

1. Create the Well-Separated Pairwise Decomposition: (A1,B1), (A2,B2), ..., (Am,Bm)

2. Take the pair (Ai,Bi) that corresponds to the shortest edge

3. If the vertices of that edge are not in the same connected component, add the edge to the MST. Repeat Step 2.


A Well-Separated Pair (Callahan 1995)

  • Let A and B be point sets in ℝ^d

  • Let RA and RB be their respective bounding hyper-rectangles

  • Define MargDistance(A,B) to be the minimum distance between RA and RB


A Well-Separated Pair (Cont)

The point sets A and B are considered to be well-separated if:

MargDistance(A,B) ≥ max{Diam(RA), Diam(RB)}
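
A small sketch of this test on axis-aligned bounding boxes. Two points are assumptions rather than statements from the slides: the comparison is taken to be >=, and Diam is taken to be the length of the bounding rectangle's diagonal. All function names are illustrative.

```python
import numpy as np

def bounding_box(points):
    """Axis-aligned bounding hyper-rectangle as (lower corner, upper corner)."""
    pts = np.asarray(points, dtype=float)
    return pts.min(axis=0), pts.max(axis=0)

def diam(box):
    """Diagonal length of the box (assumed meaning of Diam)."""
    lo, hi = box
    return float(np.linalg.norm(hi - lo))

def marg_distance(box_a, box_b):
    """Minimum distance between two boxes; 0 if they touch or overlap."""
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    # Per-dimension gap between the boxes, clamped at zero.
    gap = np.maximum(0.0, np.maximum(lo_a - hi_b, lo_b - hi_a))
    return float(np.linalg.norm(gap))

def well_separated(a, b):
    ra, rb = bounding_box(a), bounding_box(b)
    return marg_distance(ra, rb) >= max(diam(ra), diam(rb))

# Hypothetical point sets on a line.
A = [[0, 0], [1, 0]]
far = well_separated(A, [[5, 0], [6, 0]])       # gap 4, diameters 1
near = well_separated(A, [[1.5, 0], [2.5, 0]])  # gap 0.5, diameters 1
```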


A Well-Separated Pairwise Decomposition

Pair #1: ([0],[1])

Pair #2: ([0,1],[2])

Pair #3: ([0,1,2],[3,4])

Pair #4: ([3],[4])

The set of pairs {([0],[1]), ([0,1],[2]), ([0,1,2],[3,4]), ([3],[4])} forms a Well-Separated Pairwise Decomposition.
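
One property of a WSPD can be checked from the index sets alone: every unordered pair of distinct points must be covered by exactly one pair (Ai,Bi). The sketch below verifies this for the five-point example; the well-separation condition itself depends on the point coordinates from the slide's figure, which the transcript does not preserve, so it is not checked here.

```python
from itertools import combinations

# The four pairs from the example, over points numbered 0..4.
wspd = [([0], [1]), ([0, 1], [2]), ([0, 1, 2], [3, 4]), ([3], [4])]

def covered_pairs(decomposition):
    """Count how many (A, B) pairs cover each unordered point pair."""
    counts = {}
    for a, b in decomposition:
        for p in a:
            for q in b:
                key = frozenset((p, q))
                counts[key] = counts.get(key, 0) + 1
    return counts

counts = covered_pairs(wspd)
all_pairs = {frozenset(pq) for pq in combinations(range(5), 2)}
covers_each_pair_once = (set(counts) == all_pairs
                         and all(v == 1 for v in counts.values()))
```

The four pairs cover 1 + 2 + 6 + 1 = 10 point pairs, which is every pair among five points.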


The Size of a WSPD

A WSPD: (A1,B1), (A2,B2), ..., (Am,Bm)

If there are n points, a WSPD with O(n) pairs can be constructed using a fair split tree (Callahan 1995)


Bichromatic Closest Pair Distance

Given two sets (Ai,Bi), the Bichromatic Closest Pair (BCP) Distance is the smallest distance between a point in Ai and a point in Bi
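
A brute-force sketch of the definition, comparing every point of one set with every point of the other. The function name `bcp` and its return shape are illustrative choices, not taken from GeoMST2.

```python
import numpy as np

def bcp(a, b):
    """Return (distance, index into a, index into b) of the closest
    cross pair between point sets a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # dist[i, j] is the distance from a[i] to b[j].
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmin(dist), dist.shape)
    return float(dist[i, j]), int(i), int(j)

# Hypothetical sets on a line: the closest cross pair is (1,0) and (3,0).
result = bcp([[0, 0], [1, 0]], [[3, 0], [5, 0]])
```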


High Level Overview of GeoMST2

1. Create the Well-Separated Pairwise Decomposition: (A1,B1), (A2,B2), ..., (Am,Bm)

2. Take the pair (Ai,Bi) with the shortest BCP distance

3. If Ai and Bi are not already connected, add the edge to the MST. Repeat Step 2.
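
The loop in steps 2 and 3 can be sketched as follows. This is a simplified illustration, not the GeoMST2 implementation: it uses the trivial decomposition into singleton pairs ({i},{j}) in place of a WSPD built from a fair split tree, computes every BCP distance up front by brute force, and tests connectivity on the endpoints of the BCP edge. The name `mst_from_pairs` is hypothetical.

```python
import heapq
import math
from itertools import combinations

def bcp(points, a, b):
    """Brute-force closest cross pair between index sets a and b."""
    return min((math.dist(points[i], points[j]), i, j) for i in a for j in b)

def mst_from_pairs(points, pairs):
    """Pop pairs in order of BCP distance and keep each BCP edge that
    joins two different connected components."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    heap = [bcp(points, a, b) for a, b in pairs]
    heapq.heapify(heap)
    edges = []
    while heap and len(edges) < len(points) - 1:
        d, i, j = heapq.heappop(heap)
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            edges.append((i, j, d))
    return edges

# Hypothetical points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
pairs = [([i], [j]) for i, j in combinations(range(len(points)), 2)]
edges = mst_from_pairs(points, pairs)
```

With singleton pairs this reduces to Kruskal's algorithm on the complete graph; the point of a real WSPD is that it needs only O(n) pairs instead of all of them.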


GeoMST2 Example Start

Current MST


GeoMST2 Example Iteration 1

Current MST


GeoMST2 Example Iteration 2

Current MST


GeoMST2 Example Iteration 3

Current MST


GeoMST2 Example Iteration 4

Current MST


High Level Overview of GeoMST2

1. Create the Well-Separated Pairwise Decomposition: (A1,B1), (A2,B2), ..., (Am,Bm)

2. Take the pair (Ai,Bi) with the shortest BCP distance

3. If Ai and Bi are not already connected, add the edge to the MST. Repeat Step 2.

Modification for CFF: If BCP distance > ε, terminate


Optimizations

  • We don’t need the EMST

  • We just need to cluster all points that are within distance ε of each other

  • This allows two optimizations to the GeoMST2 code


High Level Overview of GeoMST2

1. Create the Well-Separated Pairwise Decomposition: (A1,B1), (A2,B2), ..., (Am,Bm)

Optimizations take place in Step 1

2. Take the pair (Ai,Bi) with the shortest BCP distance

3. If Ai and Bi are not already connected, add the edge to the MST. Repeat Step 2.


Optimization 1 Illustration


Optimization 1

Ignore all links that are > ε

  • Every pair (Ai, Bi) in the WSPD becomes an edge unless it joins two already connected components

  • If MargDistance(Ai,Bi) > ε, then an edge of length ≤ ε cannot exist between a point in Ai and a point in Bi

  • Don’t include such a pair in the WSPD
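
A sketch of this pruning rule applied to an already-built list of pairs. In the talk the test is applied while the WSPD is being created; filtering afterwards, as here, is a simplification. The cutoff is passed as `eps`, and the function names are illustrative.

```python
import numpy as np

def marg_distance(a, b):
    """Minimum distance between the bounding boxes of point sets a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    lo_a, hi_a = a.min(axis=0), a.max(axis=0)
    lo_b, hi_b = b.min(axis=0), b.max(axis=0)
    gap = np.maximum(0.0, np.maximum(lo_a - hi_b, lo_b - hi_a))
    return float(np.linalg.norm(gap))

def prune_far_pairs(pairs, eps):
    """Drop pairs whose boxes are more than eps apart: the box distance is
    a lower bound on every cross distance, so no link <= eps can exist."""
    return [(a, b) for a, b in pairs if marg_distance(a, b) <= eps]

# Hypothetical pairs: one whose boxes are 0.5 apart, one far apart.
A = [[0, 0], [1, 1]]
kept = prune_far_pairs([(A, [[1.5, 0]]), (A, [[10, 10]])], eps=1.0)
```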


Optimization 2 Illustration


Optimization 2

  • Join all elements that are within ε distance of each other

  • If the max distance separating the bounding hyper-rectangles of Ai and Bi is ≤ ε, then join all the points in Ai and Bi if they are not already connected

  • Do not add such a pair (Ai,Bi) to the WSPD
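
A sketch of the test behind this rule. The maximum distance between two bounding boxes is an upper bound on every cross distance, so when it is at most the cutoff, every point of Ai links to every point of Bi and the two sets form a single cluster. The cutoff is passed as `eps`, and the function names are illustrative.

```python
import numpy as np

def max_box_distance(a, b):
    """Largest possible distance between a point in the bounding box of a
    and a point in the bounding box of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    lo_a, hi_a = a.min(axis=0), a.max(axis=0)
    lo_b, hi_b = b.min(axis=0), b.max(axis=0)
    # Per dimension, the farthest the two intervals can be from each other.
    span = np.maximum(hi_a - lo_b, hi_b - lo_a)
    return float(np.linalg.norm(span))

def can_merge_whole_pair(a, b, eps):
    """True when all of a and b can be joined without any BCP computation."""
    return max_box_distance(a, b) <= eps

# Hypothetical sets on a line spanning [0, 1] and [2, 3].
A, B = [[0, 0], [1, 0]], [[2, 0], [3, 0]]
merge_at_3 = can_merge_whole_pair(A, B, eps=3.0)
merge_at_2_5 = can_merge_whole_pair(A, B, eps=2.5)
```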


Implications of the optimizations

  • Reduce the amount of time spent in creating the WSPD

  • Reduce the number of pairs in the WSPD, thereby speeding up the GeoMST2 algorithm by reducing the size of the priority queue


Results

  • Ran step two algorithms on subsets of the Sloan Digital Sky Survey

  • Compared Kruskal, GeoMST2, and ε-clustering

  • 7 attributes – 4 colors, 2 sky coordinates, 1 redshift value


Results (GeoMST2 vs ε-Clustering vs Kruskal in 4D)


Results (GeoMST2 vs ε-Clustering in 3D)


Results (GeoMST2 vs ε-Clustering in 4D)


Results (Change in Time as ε changes for 4D data)


Results (Increasing Dimensions vs Time)


Conclusions

  • ε-clustering outperforms GeoMST2 by nearly an order of magnitude in higher dimensions

  • Combining the optimizations in both steps will yield an efficient algorithm for clustering against clutter on massive data sets

