Partitional clustering



PARTITIONAL CLUSTERING

ACM Student Chapter,

Heritage Institute of Technology

17th February, 2012

SIGKDD Presentation by

Megha Nangia

J. M. Mansa

Koustav Mullick


Why do we cluster?

Clustering results are used:

As a stand-alone tool to get insight into data distribution

Visualization of clusters may unveil important information

As a preprocessing step for other algorithms

Efficient indexing or compression often relies on clustering



What is Cluster Analysis?

Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more “similar” (in some sense or another) to each other than to those in other clusters.

Cluster analysis itself is not one specific algorithm; the general task of forming similar groups can be achieved by various algorithms.


How do we define “similarity”?

Recall that the goal is to group together “similar” data – but what does this mean?

No single answer – it depends on what we want to find or emphasize in the data; this is one reason why clustering is an “art”

The similarity measure is often more important than the clustering algorithm used – don’t overlook this choice!



Clustering:

Minimize Intra-cluster distance

Maximize Inter-cluster distance



Applications:

  • Clustering is a main task of exploratory data mining, often used to reduce the size of large data sets. It is a common technique for statistical data analysis used in many fields, including:

  • Machine learning

  • Pattern recognition

  • Image analysis

  • Information retrieval

  • Bioinformatics.

  • Web applications such as social network analysis, grouping of shopping items, search-result grouping, etc.


Requirements of Clustering in Data Mining

Scalability

Ability to deal with different types of attributes

Discovery of clusters with arbitrary shape

Ability to deal with noise and outliers

Insensitive to order of input records

High dimensionality

Interpretability and usability


Notion of clustering:

How many clusters? The same set of points could reasonably be grouped into two, four, or six clusters.


Clustering Algorithms:

Clustering algorithms can be categorized by the cluster model they use. Some of the major categories are:

Hierarchical or connectivity based clustering

Partitional clustering (K-means or centroid-based clustering)

Density based

Grid based

Model based


(Figure: mammals.)



Partitional Clustering:

  • In statistics and data mining, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells.

    • A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.


Partitional Clustering:

(Figure: the original points and a partitional clustering of them.)



Hierarchical Clustering:

Connectivity based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away.

As such, these algorithms connect "objects" to form "clusters" based on their distance. At different distances, different clusters will form, which can be represented using a dendrogram.

These algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances.

A set of nested clusters organized as a hierarchical tree


Hierarchical Clustering:

(Figure: traditional and non-traditional hierarchical clusterings with their corresponding dendrograms.)


(Figure: a hierarchical clustering compared with a partitional clustering.)



Partitioning Algorithms:

Partitioning method: Construct a partition of n objects into a set of K clusters

Given: a set of objects and the number K

Find: a partition of K clusters that optimizes the chosen partitioning criterion

Effective heuristic methods: K-means and K-medoids algorithms



Common choices for similarity/distance measures:

Euclidean distance:

City block or Manhattan distance:

Cosine similarity:

Jaccard similarity:
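The formulas for these measures appeared only as images in the original slides; for reference, the standard definitions for two m-dimensional points x and y (and, for the Jaccard case, two sets A and B) are

$$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}, \qquad d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{m} |x_i - y_i|,$$

$$\cos(x, y) = \frac{\sum_{i=1}^{m} x_i y_i}{\sqrt{\sum_{i=1}^{m} x_i^2}\,\sqrt{\sum_{i=1}^{m} y_i^2}}, \qquad J(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$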


K-means Clustering:

Partitional clustering approach

Each cluster is associated with a centroid (center point)

Each point is assigned to the cluster with the closest centroid

Number of clusters, K, must be specified

The basic algorithm is very simple


K-Means Algorithm:

1. Select K points as the initial centroids.

2. Repeat:

3. Form K clusters by assigning every point to its closest centroid.

4. Re-compute the centroid of each cluster.

5. Until: the centroids don't change.
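As a concrete illustration of this loop (not part of the original slides), here is a minimal K-means sketch in Python with NumPy, assuming Euclidean distance; names such as kmeans, data and max_iter are illustrative only:

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Minimal K-means: `data` is an (n, m) array; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: select K points as the initial centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: re-compute the centroid of each cluster (keep the old one if empty).
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage: two obvious groups in 2-D.
# labels, centers = kmeans(np.array([[1., 1.], [1.5, 2.], [8., 8.], [9., 8.5]]), k=2)
```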


Time Complexity

Assume computing distance between two instances is O(m) where m is the dimensionality of the vectors.

Reassigning clusters: O(kn) distance computations, or O(knm).

Computing centroids: Each instance vector gets added once to some centroid: O(nm).

Assume these two steps are each done once for I iterations: O(Iknm).
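For a sense of scale (the numbers here are hypothetical, not from the slides): with I = 10 iterations, k = 3 clusters, n = 10,000 points and m = 2 dimensions, the bound amounts to about

$$I \cdot k \cdot n \cdot m = 10 \times 3 \times 10\,000 \times 2 = 600\,000$$

elementary distance operations.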



K-means Clustering: Step 1

Algorithm: k-means, Distance Metric: Euclidean Distance



K-means Clustering: Step 2

Algorithm: k-means, Distance Metric: Euclidean Distance



K-means Clustering: Step 3

Algorithm: k-means, Distance Metric: Euclidean Distance



K-means Clustering: Step 4

Algorithm: k-means, Distance Metric: Euclidean Distance



K-means Clustering: Step 5

Algorithm: k-means, Distance Metric: Euclidean Distance


K-Means Clustering: Example 2


Importance of Choosing Initial Centroids …


Two different K-means Clusterings

(Figure: the same original points yield an optimal or a sub-optimal clustering, depending on the initial centroids.)


Solutions to Initial Centroids Problem

Multiple runs

Helps, but probability is not on your side

Sample and use hierarchical clustering to determine initial centroids

Select more than k initial centroids and then select among these initial centroids

Select most widely separated

Postprocessing

Bisecting K-means

Not as susceptible to initialization issues


Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)

For each point, the error is its distance to the nearest cluster centroid

To get SSE, we square these errors and sum them.

x is a data point in cluster Ci and mi is the representative point for cluster Ci

One can show that mi corresponds to the center (mean) of the cluster

Given two clusters, we can choose the one with the smallest error

One easy way to reduce SSE is to increase K, the number of clusters

A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
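Written out with the symbols defined above (the formula itself appeared only as an image in the original slide), the standard SSE is

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \operatorname{dist}(m_i, x)^2 .$$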



  • Strength

    • Relatively efficient: O(ikn), where n is # objects, k is # clusters, and i is # iterations. Normally, k, i << n.

    • Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.

  • Weakness

    • Applicable only when mean is defined, then what about categorical data?

    • Need to specify k, the number of clusters, in advance

    • Unable to handle noisy data and outliers

    • Not suitable to discover clusters with non-convex shapes

    • May also give rise to empty clusters.


Outliers

Outliers are objects that do not belong to any cluster or form clusters of very small cardinality.

(Figure: a cluster and a few outlier points.)



Bisecting K-Means:

A variant of k-means that can produce either a partitional or a hierarchical clustering.

Which cluster should be picked for bisection?

We can pick the largest cluster, or

the cluster with the lowest average similarity, or

the cluster with the largest SSE.


Bisecting K-Means Algorithm:

1. Initialize the list of clusters.

2. Repeat:

3. Select a cluster from the list of clusters.

4. For i = 1 to number_of_iterations:

5. Bisect the selected cluster using the k-means algorithm.

6. End for.

7. Select the two clusters from the trial bisections having the lowest total SSE.

8. Add these two clusters to the list of clusters.

9. Until: the list contains K clusters.
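A minimal Python sketch of this procedure (not from the original slides), reusing the illustrative kmeans() function from the earlier K-means sketch and choosing the cluster with the largest SSE for bisection; names such as bisecting_kmeans and n_trials are made up for the example:

```python
import numpy as np

def sse(points, centroid):
    """Sum of squared Euclidean distances from `points` to `centroid`."""
    return float(((points - centroid) ** 2).sum())

def bisecting_kmeans(data, k, n_trials=5):
    """Split one cluster at a time with 2-means until k clusters remain."""
    clusters = [data]                      # Step 1: the list starts with a single cluster.
    while len(clusters) < k:               # Step 9: stop once the list holds k clusters.
        # Step 3: pick the cluster to split -- here, the one with the largest SSE.
        idx = max(range(len(clusters)),
                  key=lambda j: sse(clusters[j], clusters[j].mean(axis=0)))
        cluster = clusters.pop(idx)
        best_parts, best_total = None, np.inf
        # Steps 4-6: bisect several times with 2-means (kmeans() defined earlier).
        for trial in range(n_trials):
            labels, centers = kmeans(cluster, k=2, seed=trial)
            parts = [cluster[labels == 0], cluster[labels == 1]]
            total = sum(sse(p, c) for p, c in zip(parts, centers))
            # Step 7: keep the pair of clusters with the lowest total SSE.
            if all(len(p) > 0 for p in parts) and total < best_total:
                best_parts, best_total = parts, total
        if best_parts is None:             # degenerate case: cluster could not be split
            clusters.append(cluster)
            break
        # Step 8: add the two clusters from the best bisection back to the list.
        clusters.extend(best_parts)
    return clusters
```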


Bisecting K-means: Example


  • Why does bisecting K-means work better than regular K-means?

    • Bisecting K-means tends to produce clusters of relatively uniform size.

    • Regular K-means tends to produce clusters of widely different sizes.

    • Bisecting K-means beats regular K-means on entropy measurements.


Limitations of K-means:

K-means has problems when clusters are of differing

Sizes

Densities

Non-globular shapes

K-means has problems when the data contains outliers.


Limitations of K-means: Differing Sizes

(Figure: original points vs. the K-means result with 3 clusters.)


Limitations of K-means: Differing Density

(Figure: original points vs. the K-means result with 3 clusters.)


Limitations of K-means: Non-globular Shapes

(Figure: original points vs. the K-means result with 2 clusters.)


Overcoming K-means Limitations

(Figure: original points vs. K-means clusters when many small clusters are used.)

  • One solution is to use many clusters.

    • K-means then finds parts of the natural clusters, which need to be put back together.


Overcoming K-means Limitations

(Figure: original points vs. K-means clusters.)


Overcoming K-means Limitations

(Figure: original points vs. K-means clusters.)



K-Medoids Algorithm

What is a medoid?

A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e., it is the most centrally located point in the cluster.

In contrast to the k-means algorithm, k-medoids chooses actual data points as centers (medoids or exemplars).

The most common realisation of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm.


Partitioning Around Medoids (PAM) Algorithm

1. Initialize: randomly select k of the n data points as the medoids.

2. Associate each data point with the closest medoid.

3. For each medoid m and each non-medoid data point o: swap m and o and compute the total cost of the configuration.

4. Select the configuration with the lowest cost.

5. Repeat steps 2 to 4 until there is no change in the medoids.
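A minimal Python sketch along the lines of these steps (not from the original slides); the names pam, manhattan and total_cost are illustrative, and city-block distance is used for concreteness, matching the cost figures in the demonstration that follows:

```python
import random

def manhattan(p, q):
    """City-block distance between two points given as tuples."""
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids, dist=manhattan):
    """Each point contributes its distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist=manhattan, seed=0):
    random.seed(seed)
    # Step 1: randomly select k of the n data points as the medoids.
    medoids = random.sample(points, k)
    improved = True
    while improved:                                   # Step 5: repeat until no change.
        improved = False
        # Steps 3-4: try swapping each medoid with each non-medoid point and
        # keep any configuration with a lower total cost.
        for m in list(medoids):
            for o in points:
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                if total_cost(points, candidate, dist) < total_cost(points, medoids, dist):
                    medoids, improved = candidate, True
    # Step 2 (final pass): associate each point with its closest medoid.
    labels = [min(range(k), key=lambda j: dist(p, medoids[j])) for p in points]
    return medoids, labels
```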


Demonstration of PAM

Cluster the following set of ten objects into two clusters, i.e. k = 2.

Consider a data set of ten objects: (3,4), (2,6), (3,8), (4,7), (7,4), (6,2), (6,4), (7,3), (8,5), (7,6).



Distribution of the data


Step 1

Initialize k centres. Let us assume c1 = (3,4) and c2 = (7,4), so c1 and c2 are selected as the medoids.

Calculate the distance from each data object to each medoid, so as to associate every object with its nearest medoid.


The clusters then become:

Cluster 1 = {(3,4), (2,6), (3,8), (4,7)}

Cluster 2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}

The total cost involved is 20.
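The stated total of 20 is consistent with the city-block (Manhattan) distance; a quick check under that assumption (not part of the original slides):

```python
# Verify the cost of 20, assuming city-block (Manhattan) distance.
c1, c2 = (3, 4), (7, 4)                       # the two chosen medoids
others = [(2, 6), (3, 8), (4, 7),             # the remaining eight points
          (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)]

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# Each point contributes its distance to the nearer medoid (the medoids contribute 0).
cost = sum(min(manhattan(p, c1), manhattan(p, c2)) for p in others)
print(cost)  # -> 20
```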


Cluster after step 1

Next, for each medoid we choose a non-medoid point, swap the two, and re-compute the total cost. If the cost improves, we keep the swap and make that point the new medoid; we proceed in this way until the medoids no longer change.


Comments on the PAM Algorithm

PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.

PAM works well for small data sets but does not scale well to large data sets.



Conclusion:

Partitional clustering is a very efficient and easy to implement clustering method.

It converges quickly to a local optimum, though not necessarily the global one.

Some of the heuristic approaches involve the K-means and K-medoid algorithms.

However partitional clustering also suffers from a number of shortcomings:

The performance of the algorithm depends on the initial centroids, so the algorithm gives no guarantee of an optimal solution.

Choosing poor initial centroids may lead to the generation of empty clusters as well.

The number of clusters needs to be determined beforehand.

Does not work well with non-globular clusters.

Some of the above-stated drawbacks can be addressed using other popular clustering approaches, such as hierarchical or density-based clustering. Nevertheless, the importance of partitional clustering cannot be denied.

