Remaining Lectures in 2009
  • Advanced Clustering and Outlier Detection
  • Advanced Classification and Prediction
  • Top Ten Data Mining Algorithms (short)
  • Course Summary (short)
  • Assignment5 Student Presentations
Clustering Part2: Advanced Clustering and Outlier Detection
  • Hierarchical Clustering
  • More on Density-based Clustering: DENCLUE
  • [EM Top10-DM-Alg]
  • Cluster Evaluation Measures
  • Outlier Detection
More on Clustering
  • Hierarchical Clustering to be discussed on Nov. 11
  • DBSCAN will be used in programming project
Hierarchical Clustering
  • Produces a set of nested clusters organized as a hierarchical tree
  • Can be visualized as a dendrogram
    • A tree-like diagram that records the sequences of merges or splits
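
As a quick illustration (a minimal sketch assuming SciPy and Matplotlib, which the slides do not mention), a dendrogram for a few toy 2-D points can be drawn as follows:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = [[0, 0], [0, 1], [5, 5], [5, 6], [9, 9]]   # toy 2-D points
Z = linkage(X, method='single')                # sequence of merges (MIN linkage)
dendrogram(Z)                                  # tree-like diagram of the merges
plt.show()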
Agglomerative Clustering Algorithm
  • More popular hierarchical clustering technique
  • Basic algorithm is straightforward (a code sketch follows this list):
    1. Compute the proximity matrix
    2. Let each data point be a cluster
    3. Repeat:
    4.   Merge the two closest clusters
    5.   Update the proximity matrix
    6. Until only a single cluster remains
  • Key operation is the computation of the proximity of two clusters
    • Different approaches to defining the distance between clusters distinguish the different algorithms
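
A minimal sketch of the basic algorithm above, using MIN (single link) as the inter-cluster proximity; the point set and helper names are illustrative, and a brute-force search over cluster pairs stands in for maintaining an explicit proximity matrix:

import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def min_link(c1, c2):
    # MIN: proximity of two clusters = distance of their closest pair of points
    return min(dist(p, q) for p in c1 for q in c2)

def agglomerative(points, num_clusters=1):
    clusters = [[p] for p in points]            # each data point is a cluster
    while len(clusters) > num_clusters:         # repeat until enough merges done
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: min_link(clusters[ab[0]], clusters[ab[1]]))
        clusters[i] += clusters.pop(j)          # merge the two closest clusters
    return clusters

print(agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)], num_clusters=2))

Stopping at num_clusters=1 reproduces the full hierarchy of merges described above.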
Starting Situation
  • Start with clusters of individual points and a proximity matrix

[Figure: points p1…p5, each its own cluster, and the corresponding proximity matrix]
Intermediate Situation
  • After some merging steps, we have some clusters

[Figure: clusters C1…C5 and their proximity matrix]
Intermediate Situation
  • We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1…C5 with C2 and C5 about to be merged, and their proximity matrix]
After Merging
  • The question is “How do we update the proximity matrix?”

[Figure: proximity matrix after merging C2 and C5, with the row and column for C2 ∪ C5 marked “?”]
How to Define Inter-Cluster Similarity
  • MIN
  • MAX
  • Group Average
  • Distance Between Centroids
  • Other methods driven by an objective function
    • Ward’s Method uses squared error

[Figure: two clusters of points p1…p5 with candidate similarity links and the proximity matrix]

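In symbols (standard textbook definitions; the slide gives only the names), writing d for the point-level distance and m_i for the centroid of cluster C_i:

\[
\begin{aligned}
d_{\mathrm{MIN}}(C_i, C_j)  &= \min_{p \in C_i,\; q \in C_j} d(p, q) \\
d_{\mathrm{MAX}}(C_i, C_j)  &= \max_{p \in C_i,\; q \in C_j} d(p, q) \\
d_{\mathrm{avg}}(C_i, C_j)  &= \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q) \\
d_{\mathrm{cent}}(C_i, C_j) &= d(m_i, m_j)
\end{aligned}
\]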
Cluster Similarity: Group Average
  • Proximity of two clusters is the average of the pairwise proximities between points in the two clusters.
  • Need to use average connectivity for scalability, since total (summed) proximity favors large clusters

[Figure: example points 1…5 illustrating group-average proximity]
2009 Teaching of Clustering
  • Clustering Part1: Basics (September/October)
    • What is Clustering?
    • Partitioning/Representative-based Clustering
      • K-means
      • K-medoids
    • Density-based Clustering centering on DBSCAN
    • Region Discovery
    • Grid-based Clustering
    • Similarity Assessment
  • Clustering Part2: Advanced Topics (November)
DBSCAN (http://www2.cs.uh.edu/~ceick/7363/Papers/dbscan.pdf )
  • DBSCAN is a density-based algorithm (a labeling sketch follows this list).
    • Density = number of points within a specified radius (Eps)
    • Input parameters: MinPts and Eps
    • A point is a core point if it has more than a specified number of points (MinPts) within Eps
      • These are points in the interior of a cluster
    • A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
    • A noise point is any point that is not a core point or a border point.
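
A minimal sketch (helper names are illustrative, not from the paper) that labels each 2-D point by these definitions, using the common convention that a core point has at least MinPts points, itself included, within Eps:

import math

def eps_neighborhood(points, p, eps):
    # all points within radius Eps of p (including p itself)
    return [q for q in points if math.hypot(p[0] - q[0], p[1] - q[1]) <= eps]

def label_points(points, eps, min_pts):
    core = {p for p in points if len(eps_neighborhood(points, p, eps)) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in eps_neighborhood(points, p, eps)):
            labels[p] = "border"    # not core, but in a core point's neighborhood
        else:
            labels[p] = "noise"
    return labels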
DBSCAN Algorithm (simplified view for teaching)
  1. Create a graph whose nodes are the points to be clustered
  2. For each core point c, create an edge from c to every point p in the Eps-neighborhood of c
  3. Set N to the nodes of the graph
  4. If N does not contain any core points, terminate
  5. Pick a core point c in N
  6. Let X be the set of nodes that can be reached from c by going forward;
    • create a cluster containing X ∪ {c}
    • N = N \ (X ∪ {c})
  7. Continue with step 4

Remarks: points that are not assigned to any cluster are outliers.

http://www2.cs.uh.edu/~ceick/7363/Papers/dbscan.pdf gives a more efficient implementation that performs steps 2 and 6 in parallel.
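
A minimal sketch of the simplified view above (names illustrative; not the paper's implementation): steps 1–2 build the graph, and step 6's "going forward" becomes a traversal that only expands edges leaving core points:

import math

def simplified_dbscan(points, eps, min_pts):
    # steps 1-2: edges go from each core point to its Eps-neighborhood
    nbrs = {p: [q for q in points
                if math.hypot(p[0] - q[0], p[1] - q[1]) <= eps] for p in points}
    core = {p for p in points if len(nbrs[p]) >= min_pts}
    edges = {p: (nbrs[p] if p in core else []) for p in points}
    N = set(points)                        # step 3
    clusters = []
    while N & core:                        # step 4: terminate if no core points
        c = next(iter(N & core))           # step 5: pick a core point c in N
        X, stack = set(), [c]              # step 6: forward reachability from c
        while stack:
            u = stack.pop()
            if u not in X:
                X.add(u)
                stack.extend(v for v in edges[u] if v in N)
        clusters.append(X)                 # the cluster X ∪ {c} (c is in X)
        N -= X                             # N = N \ (X ∪ {c})
    return clusters, N                     # points left in N are the outliers

# Example: simplified_dbscan([(0, 0), (0, 1), (9, 9)], eps=2, min_pts=2)
# yields one cluster {(0, 0), (0, 1)} and the outlier (9, 9).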

DBSCAN: Core, Border and Noise Points

[Figure: original points vs. point types (core, border, noise) for Eps = 10, MinPts = 4]
When DBSCAN Works Well

[Figure: original points and the clusters found by DBSCAN]

  • Resistant to Noise
  • Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well

[Figure: original points and DBSCAN results for (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.12)]

Problems with:
  • Varying densities
  • High-dimensional data
Assignment3 Dataset: Complex9

http://www2.cs.uh.edu/~ml_kdd/Complex&Diamond/2DData.htm

Dataset: http://www2.cs.uh.edu/~ml_kdd/Complex&Diamond/Complex9.txt

[Figure: K-Means in Weka vs. DBSCAN in Weka on the Complex9 dataset]

DBSCAN: Determining EPS and MinPts
  • Idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance
  • Noise points have their kth nearest neighbor at a farther distance
  • So, plot the sorted distance of every point to its kth nearest neighbor

Run DBSCAN for MinPts=4 and Eps=5

[Figure: sorted kth-nearest-neighbor distance plot, with core points separated from non-core points by the chosen Eps]
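
A minimal sketch of the heuristic (helper name illustrative; k = 4 to match MinPts = 4): compute every point's distance to its kth nearest neighbor, sort the values, and read Eps off the "knee" of the resulting curve:

import math

def sorted_kdist(points, k=4):
    kdists = []
    for i, p in enumerate(points):
        dists = sorted(math.hypot(p[0] - q[0], p[1] - q[1])
                       for j, q in enumerate(points) if j != i)
        kdists.append(dists[k - 1])        # distance to the kth nearest neighbor
    return sorted(kdists)

# Usage: plot sorted_kdist(points, k=4); points left of the knee lie in
# cluster interiors, those right of it are candidates for noise.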