
Clustering and Visual Data Analysis

Ata Kaban

The University of Birmingham

2005

The Clustering Problem

Unsupervised Learning

Data (input)

‘Interesting’ structure (output)

  • Interesting:
  • contains essential characteristics
  • discards unessential details
  • provides a summary of the data (e.g. to visualise on the screen)
  • compact
  • interpretable for humans
  • etc.

Objective function that expresses our notion of interestingness for this data

One reason for clustering data
  • Here is some data
  • Assume you transmit the coordinates of points drawn randomly from this data set
  • You are only allowed to send a small number of bits (say 2 or 3) per point
  • So it will be a lossy transmission
  • Loss = sum of squared errors between the original and the decoded coordinates
  • Which encoder/decoder will lose the least information? (See the sketch below.)
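As a concrete illustration (a minimal sketch in Python with NumPy; none of this code is from the original slides): a 2-bit budget lets us send one of K = 4 codebook vectors per point, and the loss is the sum of squared errors.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))          # some 2-D data points

# A 2-bit budget allows K = 2**2 = 4 codebook vectors (cluster centres),
# chosen here at random rather than optimised.
K = 2 ** 2
codebook = data[rng.choice(len(data), size=K, replace=False)]

# Encoder: send only the index of the nearest codebook vector (2 bits).
codes = np.argmin(((data[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1)

# Decoder: reconstruct each point as its codebook vector.
decoded = codebook[codes]

# Loss = sum of squared errors between original and decoded coordinates.
loss = ((data - decoded) ** 2).sum()
print(f"distortion with a random codebook: {loss:.1f}")
```

The question on the slide is then: which choice of codebook, and hence which encoder/decoder pair, minimises this loss? K-means, formalised next, answers exactly that.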
Formalising
  • What objective does K-means optimise?
  • Given an encoder function ENC: ℝ^T → {1, …, K}

(T is the dimension of the data, K is the number of clusters)

  • Given a decoder function DEC: {1, …, K} → ℝ^T
  • DISTORTION := Σ_n ‖x_n − DEC[ENC(x_n)]‖²

where DEC(k) = μ_k are the centres of the clusters, k = 1, …, K

So DISTORTION = Σ_n ‖x_n − μ_ENC(x_n)‖², where n goes from 1 to N, the number of points.
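This formalism translates almost line for line into code (again a sketch in NumPy; note that the indices run 0, …, K−1 rather than 1, …, K):

```python
import numpy as np

def ENC(x, centres):
    """Encoder ENC: R^T -> {0, ..., K-1}: index of the nearest centre."""
    return int(np.argmin(((centres - x) ** 2).sum(axis=1)))

def DEC(k, centres):
    """Decoder DEC: {0, ..., K-1} -> R^T, with DEC(k) = mu_k."""
    return centres[k]

def distortion(X, centres):
    """DISTORTION = sum_n ||x_n - DEC(ENC(x_n))||^2."""
    return sum(((x - DEC(ENC(x, centres), centres)) ** 2).sum() for x in X)
```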

The minimal distortion

DISTORTION = Σ_n ‖x_n − μ_ENC(x_n)‖²

Suppose this is minimised.

What properties must μ_1, …, μ_K satisfy for that?

1) Each point x_n must be encoded by its nearest centre; otherwise DISTORTION could be reduced by replacing ENC(x_n) with the index of the centre nearest to x_n.

2) Each μ_k must be the centroid of its own points (the centroid is the point that minimises the sum of squared distances to a set of points).
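Property 2 is easy to verify numerically: among all candidate centres, the centroid gives the smallest sum of squared distances to the points assigned to it (a small sketch in NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
cluster = rng.normal(loc=3.0, size=(50, 2))   # points assigned to one cluster

centroid = cluster.mean(axis=0)

def sse(points, mu):
    # Sum of squared distances from the points to a candidate centre mu.
    return ((points - mu) ** 2).sum()

# Any perturbation of the centroid increases the within-cluster error.
perturbed = centroid + rng.normal(scale=0.5, size=2)
print(sse(cluster, centroid) <= sse(cluster, perturbed))   # True: centroid is optimal
```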

If N is the known number of points and K the desired number of clusters, the K-means algorithm is:

Begin

initialise μ_1, μ_2, …, μ_K (randomly selected)

do classify the N samples according to the nearest μ_i

recompute each μ_i

until no change in μ_i

return μ_1, μ_2, …, μ_K

End
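A runnable version of this pseudocode (a sketch in NumPy; the `max_iter` cap is an added safeguard, not part of the slide's algorithm):

```python
import numpy as np

def k_means(X, K, rng=None, max_iter=100):
    """K-means for an (N, T) data array X and K desired clusters."""
    if rng is None:
        rng = np.random.default_rng()
    # initialise mu_1, ..., mu_K as K randomly selected data points
    centres = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # classify the N samples according to the nearest mu_i
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # recompute each mu_i as the centroid of its own points
        # (keep the old centre if a cluster has become empty)
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centres[k] for k in range(K)])
        if np.allclose(new, centres):      # no change in mu_i: converged
            break
        centres = new
    return centres, labels
```

Usage, reusing `data` from the earlier sketch: `centres, labels = k_means(data, K=4)`.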

Other forms of clustering
  • Often, clusters are not disjoint: a cluster may have subclusters, which in turn have sub-subclusters.
  • Hierarchical clustering
Given any two samples x and x′, they will be grouped together at some level, and if they are grouped at level k, they remain grouped at all higher levels
  • Hierarchical clustering → a tree representation called a dendrogram
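For illustration, SciPy can build and draw such a dendrogram directly (a sketch, assuming `scipy` and `matplotlib` are installed):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Two well-separated groups of 2-D points.
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

Z = linkage(X, method='single')   # single-linkage agglomerative clustering
dendrogram(Z)                     # tree of the successive merges
plt.show()
```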
The similarity values may help to determine whether the groupings are natural or forced, but if they are evenly distributed, no information can be gained
  • Another representation is based on sets, e.g. Venn diagrams
Hierarchical clustering can be divided into agglomerative and divisive methods.
  • Agglomerative (bottom-up, clumping): start with n singleton clusters and form the sequence by successively merging clusters
  • Divisive (top-down, splitting): start with all of the samples in one cluster and form the sequence by successively splitting clusters
Agglomerative hierarchical clustering
  • The procedure terminates when the specified number of clusters has been obtained, and returns the clusters as sets of points, rather than a mean or representative vector for each cluster (see the sketch below)
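A minimal sketch of this agglomerative procedure (pure NumPy, using single-linkage distance as one possible choice of inter-cluster distance; the slides do not fix this choice):

```python
import numpy as np

def agglomerative(X, n_clusters):
    # start with n singleton clusters, each a list of point indices
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        # find the pair of clusters with the smallest single-linkage distance
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)    # merge the closest pair
    # return the clusters as sets of points, not mean vectors
    return [X[idx] for idx in clusters]
```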
The problem of the number of clusters
  • Typically, the number of clusters is known.
  • When it is not, we face a hard problem called model selection. There are several ways to proceed.
  • A common approach is to repeat the clustering with K = 1, K = 2, K = 3, etc., as sketched below.
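A sketch of that approach, reusing the `data`, `k_means` and `distortion` names defined in the earlier sketches: run the clustering for increasing K and look for the point where the distortion stops dropping sharply (the "elbow").

```python
# Assumes `data`, `k_means` and `distortion` from the sketches above.
for K in range(1, 7):
    centres, _ = k_means(data, K)
    print(K, round(distortion(data, centres), 1))
# Plotted against K, the distortion curve typically shows an "elbow";
# the K where it flattens is a common heuristic choice.
```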
What did we learn today?
  • Data clustering
  • K-means algorithm in detail
  • How K-means can get stuck and how to take care of that
  • An outline of hierarchical clustering methods
Find out more: R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification (2nd ed.), John Wiley & Sons, 2000.
