Clustering methods
  • Partitional clustering in which clusters are represented by their centroids (proc FASTCLUS)
  • Agglomerative hierarchical clustering in which the closest clusters are repeatedly merged (proc CLUSTER)
  • Density-based clustering in which core points and associated border points are clustered (proc MODECLUS)

Data mining and statistical learning - lecture 14

Proc FASTCLUS
  • Select k initial centroids
  • Repeat the following until the clusters remain unchanged:
    • Form k clusters by assigning each point to its nearest centroid
    • Update the centroid of each cluster
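
The loop above can be sketched in a few lines of Python. This is a minimal illustration of the generic k-means iteration, not the FASTCLUS implementation; the function name `kmeans`, the toy points, and the choice of initial centroids are assumptions made for the example.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Generic k-means loop on a list of equal-length numeric tuples."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # clusters unchanged, so the iteration has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Toy data (made up for illustration): two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
```

The result depends on the initial centroids, which is why different starting points can lead to different partitions.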

Identification of water samples with incorrect total nitrogen levels

Identification of water samples with incorrect total nitrogen levels: 2-means clustering

Initialization problems?

Limitations of K-means clustering
  • Difficult to detect clusters with non-spherical shapes
  • Difficult to detect clusters of widely different sizes
  • Difficult to detect clusters of different densities

Proc MODECLUS
  • Use a smoother to estimate the (local) density of the given dataset
  • A cluster is loosely defined as a region surrounding a local maximum of the probability density function
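
The idea can be sketched as follows. This is a simplified mode-seeking scheme written for illustration, not the algorithm used by any particular MODECLUS method: density is estimated by counting neighbours within a fixed radius (a uniform-kernel smoother), and each point is linked uphill to its densest neighbour until a local maximum is reached. The function name `mode_clusters`, the tie-breaking rule, and the toy data are assumptions.

```python
import math


def mode_clusters(points, radius):
    """Label each point by the local density maximum it climbs to."""
    n = len(points)
    # Neighbours within the smoothing radius (each point counts itself).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= radius]
        for i in range(n)
    ]
    # Uniform-kernel density estimate: number of neighbours in the ball.
    density = [len(nb) for nb in neighbours]
    # Link each point to its densest neighbour; ties go to the lowest index,
    # so every link moves strictly uphill and the climb must terminate.
    parent = [max(nb, key=lambda j: (density[j], -j)) for nb in neighbours]

    def climb(i):
        while parent[i] != i:
            i = parent[i]
        return i

    return [climb(i) for i in range(n)]


# Toy one-dimensional data (made up): two groups far apart.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
small_radius = mode_clusters(points, radius=1.5)
large_radius = mode_clusters(points, radius=20.0)
```

With the small radius the two groups keep separate modes; with the large radius the density estimate is oversmoothed and every point climbs to the same mode.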

Identification of water samples with incorrect total nitrogen levels: proc MODECLUS, R = 1000

What will happen if R is increased?

Identification of water samples with incorrect total nitrogen levels: proc MODECLUS, R = 4000

Identification of water samples with incorrect total nitrogen levels: proc MODECLUS, method 6

Why did the clustering fail?

Limitations of density-based clustering
  • Difficult to control (requires repeated runs)
  • Collapses in high dimensions
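
One way to see the high-dimensional problem, offered here as an added illustration rather than the lecture's own argument: if the data were spread uniformly over a unit hypercube, the share of the cube covered by a fixed-radius ball shrinks rapidly as the dimension grows, so a fixed-radius density estimate finds almost no neighbours. The function name `ball_fraction` and the radius of 0.5 are assumptions for the example.

```python
import math


def ball_fraction(d, radius=0.5):
    """Volume of a d-dimensional ball relative to the unit hypercube (volume 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * radius ** d


# Fraction of the unit cube covered by the inscribed ball, by dimension.
fractions = {d: ball_fraction(d) for d in (1, 2, 5, 10, 20)}
for d, f in fractions.items():
    print(f"d = {d:2d}: {f:.2e}")
```

In two dimensions the inscribed ball covers about 79% of the square; in ten dimensions it covers well under 1% of the cube.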

Strength of density-based clustering

Given a sufficiently large sample, nonparametric density-based clustering methods can detect clusters of unequal size and dispersion, including clusters with highly irregular shapes.

Identification of water samples with incorrect total nitrogen levels: transformed data

Identification of water samples with incorrect total nitrogen levels: proc MODECLUS, R = 2000, transformed data

Preprocessing
  • Standardization
  • Linear transformation
  • Dimension reduction
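
Standardization, the first item above, can be sketched as a z-score transform so that variables measured on very different scales contribute comparably to Euclidean distances. This is a minimal sketch; the function name `standardize` and the toy rows are assumptions, and the other two preprocessing steps are not shown.

```python
import statistics


def standardize(rows):
    """Rescale each column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # A constant column has standard deviation 0; leave it centred but unscaled.
    sds = [statistics.stdev(col) or 1.0 for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(row, means, sds))
        for row in rows
    ]


# Toy data (made up): the second variable is on a much larger scale.
rows = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 2000.0)]
z = standardize(rows)
```

Without this step, the second variable would dominate every distance computed by the clustering procedure.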

Postprocessing
  • Split a cluster
    • Usually, the cluster with the largest SSE is split
  • Introduce a new cluster centroid
    • Often the point that is farthest from any cluster center is chosen
  • Disperse a cluster
    • Remove one centroid and reassign the points to other clusters
  • Merge two clusters
    • Typically, the clusters with the closest centroids are chosen
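
The selection rules above can be sketched as small helper functions. All names here are hypothetical, the data are made up, and "farthest from any cluster center" is interpreted as farthest from its nearest centroid, which is an assumption about the intended meaning.

```python
import math
from itertools import combinations


def cluster_sse(points, labels, centroids):
    """Sum of squared distances from each point to its own centroid."""
    sse = [0.0] * len(centroids)
    for p, lab in zip(points, labels):
        sse[lab] += math.dist(p, centroids[lab]) ** 2
    return sse


def split_candidate(points, labels, centroids):
    """Index of the cluster with the largest SSE."""
    sse = cluster_sse(points, labels, centroids)
    return max(range(len(sse)), key=sse.__getitem__)


def merge_candidates(centroids):
    """Pair of cluster indices whose centroids are closest."""
    pairs = combinations(range(len(centroids)), 2)
    return min(pairs, key=lambda ij: math.dist(centroids[ij[0]], centroids[ij[1]]))


def new_centroid_candidate(points, centroids):
    """Point that lies farthest from its nearest centroid."""
    return max(points, key=lambda p: min(math.dist(p, c) for c in centroids))


# Toy clustering (made up): one loose cluster and two tight, adjacent ones.
points = [(0.0, 0.0), (0.0, 4.0), (0.0, 5.0),
          (10.0, 0.0), (10.0, 1.0), (11.0, 0.0), (11.0, 1.0)]
labels = [0, 0, 0, 1, 1, 2, 2]
centroids = [(0.0, 3.0), (10.0, 0.5), (11.0, 0.5)]

to_split = split_candidate(points, labels, centroids)
to_merge = merge_candidates(centroids)
seed = new_centroid_candidate(points, centroids)
```

Here the loose cluster 0 is the split candidate, clusters 1 and 2 are the merge candidates, and the point at the origin is the candidate for a new centroid.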

Profiling website visitors
  • A total of 296 pages at a Microsoft website are grouped into 13 homogeneous categories
    • Initial
    • Support
    • Entertainment
    • Office
    • Windows
    • Othersoft
    • Download
    • …
  • For each of 32711 visitors we have recorded how many times they have visited the different categories of pages
  • We would like to make a behavioural segmentation of the users (a cluster analysis) that can be used in future marketing decisions

Profiling website visitors: the dataset

Why is it necessary to group the pages into categories?

Profiling website visitors: 10-means clustering

Profiling website visitors: cluster proximities

Profiling website visitors: profiles

Profiling website visitors: Kohonen map of cluster frequencies

Profiling website visitors: Kohonen maps of means by variable and grid cell

Characteristics of Kohonen maps
  • The centroids vary smoothly over the map
    • Clusters with unusually large (or small) values of a given variable tend to form connected spatial patterns
  • Clusters with similar centroids need not be close to each other in a Kohonen map
  • The sizes of the clusters in Kohonen maps tend to be less variable than those obtained by K-means clustering
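
For readers unfamiliar with how a Kohonen map is fitted, the following is a minimal sketch of a self-organizing map with units arranged in a one-dimensional chain. It is not the procedure used to produce the maps in this lecture: the Gaussian neighbourhood function, the linear decay of the learning rate and neighbourhood width, the function name `train_som`, and the toy data are all assumptions made for illustration.

```python
import math
import random


def train_som(data, n_units, epochs=50, lr0=0.5, seed=0):
    """Fit a 1-D chain of units to the data with a basic SOM update rule."""
    rng = random.Random(seed)
    sigma0 = n_units / 2
    # Initialise each unit's centroid at a randomly chosen data point.
    weights = [list(rng.choice(data)) for _ in range(n_units)]
    for epoch in range(epochs):
        remaining = 1 - epoch / epochs
        lr = lr0 * remaining                    # learning rate decays linearly
        sigma = max(sigma0 * remaining, 0.5)    # neighbourhood width shrinks
        for x in rng.sample(data, len(data)):   # visit points in random order
            # Best-matching unit: the unit whose centroid is nearest to x.
            bmu = min(range(n_units), key=lambda u: math.dist(x, weights[u]))
            for u in range(n_units):
                # Units close to the BMU on the map are pulled more strongly,
                # which is what makes centroids vary smoothly over the map.
                h = math.exp(-((u - bmu) ** 2) / (2 * sigma ** 2))
                weights[u] = [w + lr * h * (xi - w) for w, xi in zip(weights[u], x)]
    return weights


# Toy data (made up): three small groups along a diagonal.
data = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (2.0, 2.0), (2.1, 1.9)]
weights = train_som(data, n_units=4)
```

The key difference from k-means is the neighbourhood term `h`: each observation updates not only its nearest centroid but also the centroids of neighbouring grid cells.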
