
Chapter 5: Clustering

Searching for groups

  • Clustering is unsupervised or undirected.

  • Unlike classification, clustering uses no pre-classified data.

  • Search for groups or clusters of data points (records) that are similar to one another.

  • Similar points may mean, for example, similar customers or products that will behave in similar ways.

Group similar points together

  • Group points into classes using some distance measures.

    • Within-cluster distance and between-cluster distance

  • Applications:

    • As a stand-alone tool to get insight into data distribution

    • As a preprocessing step for other algorithms

An Illustration

Examples of Clustering Applications

  • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

  • Insurance: Identifying groups of motor insurance policy holders with some interesting characteristics.

  • City-planning: Identifying groups of houses according to their house type, value, and geographical location

Concepts of Clustering

  • Clusters

  • Different ways of representing clusters

    • Division with boundaries

    • Spheres

    • Probabilistic

    • Dendrograms

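[Figure: cluster representation example; only the labels 1, 2, 3 and the values 0.5, 0.2, 0.3 survive from the original slide.]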


  • Clustering quality

    • Inter-cluster distance → maximized

    • Intra-cluster distance → minimized

  • The quality of a clustering result depends on both the similarity measure used by the method and its application.

  • The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

  • Clustering vs. classification

    • Which one is more difficult? Why?

    • There are a huge number of clustering techniques.

Dissimilarity/Distance Measure

  • Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function d(i, j), which is typically a metric

  • The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.

  • Weights should be associated with different variables based on applications and data semantics.

  • It is hard to define “similar enough” or “good enough”. The answer is typically highly subjective.

Types of data in clustering analysis

  • Interval-scaled variables

  • Binary variables

  • Nominal, ordinal, and ratio variables

  • Variables of mixed types

Interval-valued variables

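  • Continuous measurements on a roughly linear scale, e.g., weight, height, temperature, etc.

  • Standardize data (depending on applications)

    • Calculate the mean absolute deviation of variable f over the n objects:

      $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$, where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$


    • Calculate the standardized measurement (z-score):

      $z_{if} = \frac{x_{if} - m_f}{s_f}$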
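
The standardization step can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual definitions: the mean absolute deviation is the average of |x - mean|, and the z-score is (x - mean) divided by that deviation. The function name `standardize` and the sample heights are made up for the example.

```python
def standardize(values):
    """Return z-scores that use the mean absolute deviation as the spread."""
    n = len(values)
    mean = sum(values) / n
    # Mean absolute deviation: average distance of each value from the mean.
    mad = sum(abs(x - mean) for x in values) / n
    if mad == 0:
        # All values are identical, so there is nothing to scale.
        return [0.0 for _ in values]
    return [(x - mean) / mad for x in values]


heights_cm = [150.0, 160.0, 170.0, 180.0]  # hypothetical measurements
z_scores = standardize(heights_cm)
print(z_scores)
```

The mean absolute deviation is used here instead of the standard deviation because it does not square the deviations, so a single extreme value inflates the spread less.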

Similarity Between Objects

  • Distance: Measure the similarity or dissimilarity between two data objects

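  • Some popular ones include the Minkowski distance:

    $d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}$

    where $(x_{i1}, x_{i2}, \ldots, x_{ip})$ and $(x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer

  • If q = 1, d is the Manhattan distance:

    $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$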

Similarity Between Objects (Cont.)

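  • If q = 2, d is the Euclidean distance:

    $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

    • Properties

      • d(i,j) ≥ 0

      • d(i,i) = 0

      • d(i,j) = d(j,i)

      • d(i,j) ≤ d(i,k) + d(k,j)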

  • Also, one can use weighted distance, and many other similarity/distance measures.
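
As a rough sketch, the Minkowski family of distances can be written as one Python function, with q = 1 giving the Manhattan distance and q = 2 the Euclidean distance. The function name `minkowski` and the sample points are illustrative assumptions, not part of the original slides.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length points."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical 2-dimensional points.
p1 = (0.0, 0.0)
p2 = (3.0, 4.0)
p3 = (3.0, 0.0)

manhattan = minkowski(p1, p2, 1)   # |3| + |4|
euclidean = minkowski(p1, p2, 2)   # sqrt(3^2 + 4^2)
print(manhattan, euclidean)

# Triangle inequality check: d(p1, p2) <= d(p1, p3) + d(p3, p2)
print(euclidean <= minkowski(p1, p3, 2) + minkowski(p3, p2, 2))
```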

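Binary Variables

  • A contingency table for binary data: over the p binary variables describing objects i and j, let a be the number of variables equal to 1 for both objects, b the number equal to 1 for i and 0 for j, c the number equal to 0 for i and 1 for j, and d the number equal to 0 for both, so that p = a + b + c + d

  • Simple matching coefficient (invariant, if the binary variable is symmetric):

    $d(i,j) = \frac{b + c}{a + b + c + d}$

  • Jaccard coefficient (noninvariant if the binary variable is asymmetric):

    $d(i,j) = \frac{b + c}{a + b + c}$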
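
The two binary dissimilarities can be sketched as follows, assuming the standard 2x2 contingency counts: a (both 1), b (1 for i, 0 for j), c (0 for i, 1 for j), and d (both 0). Simple matching is taken as (b + c) / (a + b + c + d) and the Jaccard-based dissimilarity as (b + c) / (a + b + c). The function names and the two example vectors are hypothetical.

```python
def contingency(i, j):
    """Count the four agreement/disagreement cells for two 0/1 vectors."""
    a = sum(1 for u, v in zip(i, j) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(i, j) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(i, j) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(i, j) if u == 0 and v == 0)
    return a, b, c, d


def simple_matching(i, j):
    """Dissimilarity for symmetric binary variables: 0-0 matches count."""
    a, b, c, d = contingency(i, j)
    return (b + c) / (a + b + c + d)


def jaccard(i, j):
    """Dissimilarity for asymmetric binary variables: 0-0 matches ignored."""
    a, b, c, _ = contingency(i, j)
    if a + b + c == 0:
        return 0.0  # neither object has any 1, so treat them as identical
    return (b + c) / (a + b + c)


# Hypothetical objects described by six asymmetric binary attributes.
obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]
print(simple_matching(obj_i, obj_j), jaccard(obj_i, obj_j))
```

The Jaccard version gives a larger dissimilarity here because the three shared zeros are not counted as evidence of similarity.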

Dissimilarity of Binary Variables

  • Example

    • gender is a symmetric attribute (not used below)

    • the remaining attributes are asymmetric attributes

    • let the values Y and P be set to 1, and the value N be set to 0

Nominal Variables

  • A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green, etc

  • Method 1: Simple matching

    • $d(i,j) = \frac{p - m}{p}$, where m is the number of matches and p is the total number of variables

  • Method 2: use a large number of binary variables

    • creating a new binary variable for each of the M nominal states
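
Both methods can be sketched briefly. The first function assumes the simple matching dissimilarity is (p - m) / p, where m counts matching variables out of p; the second shows the one-binary-variable-per-state encoding. All names and example values are illustrative.

```python
def nominal_dissimilarity(i, j):
    """Method 1: fraction of nominal variables on which i and j differ."""
    p = len(i)
    m = sum(1 for u, v in zip(i, j) if u == v)  # number of matches
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one binary variable per nominal state."""
    return [1 if value == s else 0 for s in states]


# Hypothetical objects described by three nominal variables.
item_a = ("red", "small", "round")
item_b = ("red", "large", "round")
print(nominal_dissimilarity(item_a, item_b))

colors = ["red", "yellow", "blue", "green"]
print(one_hot("blue", colors))
```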

Ordinal Variables

  • An ordinal variable can be discrete or continuous

  • Order is important, e.g., rank

  • Can be treated like interval-scaled (f is a variable)

    • replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$

    • map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$

    • compute the dissimilarity using methods for interval-scaled variables
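
A small sketch of the rank mapping, assuming the common normalization z = (r - 1) / (M - 1), where r is the rank of a value and M is the number of ordered states. The function name and the medal example are made up for illustration.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map ordinal values onto [0, 1] through their ranks."""
    m = len(ordered_states)
    if m < 2:
        raise ValueError("need at least two ordered states")
    # Ranks start at 1 for the lowest state.
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (m - 1) for v in values]


medals = ["bronze", "silver", "gold"]          # lowest to highest
observed = ["gold", "bronze", "silver"]        # hypothetical records
scaled = ordinal_to_unit_interval(observed, medals)
print(scaled)
```

Once mapped, the values can be compared with any interval-scaled distance, for example the absolute difference.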

Ratio-Scaled Variables

  • Ratio-scaled variable: a measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$, e.g., growth of a bacteria population.

  • Methods:

    • treat them like interval-scaled variables—not a good idea! (why?—the scale can be distorted)

    • apply logarithmic transformation

      yif = log(xif)

    • treat them as continuous ordinal data and then treat their ranks as interval-scaled

Variables of Mixed Types

  • A database may contain all six types of variables

    • symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio

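  • One may use a weighted formula to combine their effects:

    $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$

    where $\delta_{ij}^{(f)}$ is 0 when variable f should not contribute to the comparison (for example, a value is missing, or both values are 0 for an asymmetric binary variable) and 1 otherwise

    • f is binary or nominal:

      $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ o.w.

    • f is interval-based: use the normalized distance

    • f is ordinal or ratio-scaled

      • compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$

      • and treat $z_{if}$ as interval-scaled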
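
The combination of per-variable dissimilarities can be sketched as an average over the variables that are actually comparable. This is a simplified illustration with several assumptions: missing values are represented by `None` and are skipped, nominal variables contribute 0 or 1, and interval variables contribute |x - y| divided by a supplied value range. Ordinal variables are assumed to have been mapped onto [0, 1] beforehand, so they are handled as interval variables with range 1. All names and sample values are hypothetical.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Average the per-variable dissimilarities over comparable variables."""
    total = 0.0
    count = 0
    for f, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:
            continue  # missing value: this variable does not contribute
        if kinds[f] == "nominal":
            d = 0.0 if a == b else 1.0
        elif kinds[f] == "interval":
            d = abs(a - b) / ranges[f]  # normalized by the variable's range
        else:
            raise ValueError("unknown variable kind: " + kinds[f])
        total += d
        count += 1
    if count == 0:
        raise ValueError("no comparable variables")
    return total / count


kinds = ["nominal", "interval", "interval"]
ranges = [None, 50.0, 1.0]            # assumed max - min of each variable
customer_a = ("red", 30.0, 0.5)
customer_b = ("blue", 40.0, 1.0)
d_ab = mixed_dissimilarity(customer_a, customer_b, kinds, ranges)
print(d_ab)  # (1 + 10/50 + 0.5) / 3
```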

Major Clustering Techniques

  • Partitioning algorithms: Construct various partitions and then evaluate them by some criterion

  • Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion

  • Density-based: based on connectivity and density functions

  • Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to that model.

Partitioning Algorithms: Basic Concept

  • Partitioning method: Construct a partition of a database D of n objects into a set of k clusters

  • Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion

    • Global optimal: exhaustively enumerate all partitions

    • Heuristic methods: k-means and k-medoids algorithms

    • k-means: Each cluster is represented by the center of the cluster

    • k-medoids or PAM (Partition around medoids): Each cluster is represented by one of the objects in the cluster

The K-Means Clustering

  • Given k, the k-means algorithm is as follows:

    1) Choose k cluster centers to coincide with k randomly-chosen points

    2) Assign each data point to the closest cluster center

    3) Recompute the cluster centers using the current cluster memberships.

    4) If a convergence criterion is not met, go to 2).

       Typical convergence criteria are: no (or minimal) reassignment of data points to new cluster centers, or minimal decrease in squared error

       $E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$

       where p is a point and $m_i$ is the mean of cluster $C_i$


  • For simplicity, consider 1-dimensional data and k = 2.

  • data: 1, 2, 5, 6, 7

  • K-means:

    • Randomly select 5 and 6 as initial centroids;

    • => Two clusters {1,2,5} and {6,7}; meanC1=8/3, meanC2=6.5

    • => {1,2}, {5,6,7}; meanC1=1.5, meanC2=6

    • => no change.

    • Aggregate dissimilarity = 0.5^2 + 0.5^2 + 1^2 + 0^2 + 1^2 = 2.5
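
The worked example can be replayed with a short Python sketch of k-means for 1-dimensional data. To make the run reproducible, the initial centers 5 and 6 are passed in explicitly instead of being drawn at random; the function name `kmeans_1d` and the iteration cap are assumptions of this sketch.

```python
def kmeans_1d(data, initial_centers, max_iter=100):
    """k-means on 1-D data; returns clusters, centers, and squared error."""
    centers = list(initial_centers)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in centers]
        for p in data:
            nearest = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]  # keep old center if empty
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: centers (and hence assignments) are stable
        centers = new_centers
    # Squared error of the final assignment.
    sse = sum((p - centers[i]) ** 2 for i, c in enumerate(clusters) for p in c)
    return clusters, centers, sse


clusters, centers, sse = kmeans_1d([1, 2, 5, 6, 7], [5, 6])
print(clusters)  # [[1, 2], [5, 6, 7]]
print(centers)   # [1.5, 6.0]
print(sse)       # 2.5
```

Starting from other initial centers may lead to a different final partition, which is the sensitivity to initial seeds noted in the comments on k-means.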

Comments on K-Means

  • Strength: efficient, O(tkn), where n is # data points, k is # clusters, and t is # iterations. Normally, k, t << n.

  • Comment: Often terminates at a local optimum. Techniques such as deterministic annealing and genetic algorithms can be used to search for the global optimum

  • Weakness

    • Applicable only when mean is defined, difficult for categorical data

    • Need to specify k, the number of clusters, in advance

    • Sensitive to noisy data and outliers

    • Not suitable to discover clusters with non-convex shapes

    • Sensitive to initial seeds

Variations of the K-Means Method

  • A few variants of the k-means which differ in

    • Selection of the initial k seeds

    • Dissimilarity measures

    • Strategies to calculate cluster means

  • Handling categorical data: k-modes

    • Replacing means of clusters with modes

    • Using new dissimilarity measures to deal with categorical objects

    • Using a frequency based method to update modes of clusters

k-Medoids clustering method

  • The k-means algorithm is sensitive to outliers

    • An object with an extremely large value may substantially distort the distribution of the data.

  • Medoid – the most centrally located point in a cluster, as a representative point of the cluster.

  • An example

  • In contrast, a centroid (the mean of a cluster) is not necessarily one of the points in the cluster.

Initial Medoids

Partition Around Medoids

  • PAM:

    1) Given k

    2) Randomly pick k instances as initial medoids

    3) Assign each data point to the nearest medoid x

    4) Calculate the objective function

       • the sum of dissimilarities of all points to their nearest medoids. (squared-error criterion)

    5) Randomly select a point y

    6) Swap x with y if the swap reduces the objective function

    7) Repeat (3-6) until no change
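
A rough sketch of the swap idea for 1-dimensional data follows. It deviates from the steps above in one respect, which is an assumption made for reproducibility: instead of picking the candidate point y at random, it tries every non-medoid point in turn and stops when no swap lowers the objective. Absolute difference is used as the dissimilarity, and the names `total_cost` and `pam_1d` are hypothetical.

```python
def total_cost(data, medoids):
    """Objective: sum of each point's distance to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in data)


def pam_1d(data, initial_medoids):
    """Swap-based k-medoids; tries all swaps instead of random ones."""
    medoids = list(initial_medoids)
    best = total_cost(data, medoids)
    improved = True
    while improved:
        improved = False
        for idx in range(len(medoids)):
            for y in data:
                if y in medoids:
                    continue  # y is already a medoid
                candidate = medoids[:idx] + [y] + medoids[idx + 1:]
                cost = total_cost(data, candidate)
                if cost < best:
                    # Keep the swap only if it reduces the objective.
                    medoids, best = candidate, cost
                    improved = True
    return sorted(medoids), best


medoids, cost = pam_1d([1, 2, 5, 6, 7], [1, 2])
print(medoids, cost)  # [2, 6] 3
```

Each pass evaluates every (medoid, non-medoid) pair against all points, which is why the method becomes expensive as the data set grows.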

Comments on PAM

Outlier (100 unit away)

  • PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean (why?)

  • PAM works well for small data sets but does not scale well for large data sets.

    • O(k(n-k)^2) for each change

      where n is # of data, k is # of clusters
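
One way to see why a medoid resists outliers is to compare it with the mean on a tiny data set before and after adding an extreme value. The sketch below is only an illustration with made-up numbers; the medoid is taken as the data point with the smallest total absolute distance to all other points.

```python
def mean(values):
    """Arithmetic mean: every value, including an outlier, pulls on it."""
    return sum(values) / len(values)


def medoid(values):
    """The data point with the smallest total distance to all points."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))


clean = [1, 2, 3, 4, 5]
noisy = [1, 2, 3, 4, 105]   # last point moved 100 units away

print(mean(clean), medoid(clean))   # 3.0 3
print(mean(noisy), medoid(noisy))   # 23.0 3
```

The outlier drags the mean far outside the bulk of the data, while the medoid must remain an actual data point and stays in the dense region.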

CLARA: Clustering Large Applications

  • CLARA: Built in statistical analysis packages, such as S+

  • It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output

  • Strength: deals with larger data sets than PAM

  • Weakness:

    • Efficiency depends on the sample size

    • A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

  • There are other scale-up methods e.g., CLARANS

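[Figure: hierarchical clustering of five objects a, b, c, d, e over Steps 0 to 4. The clusters shown are {a, b}, {d, e}, {c, d, e}, and {a, b, c, d, e}. One axis numbers the steps from 0 to 4 and the other numbers them in reverse, from 4 to 0.]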

Hierarchical Clustering

  • Use distance matrix for clustering. This method does not require the number of clusters k as an input, but needs a termination condition

Agglomerative Clustering

  • At the beginning, each data point forms a cluster (also called a node).

  • Merge nodes/clusters that have the least dissimilarity.

  • Go on merging

  • Eventually all nodes belong to the same cluster

A Dendrogram Shows How the Clusters are Merged Hierarchically

  • Decompose data objects into several levels of nested partitioning (tree of clusters), called a dendrogram.

  • A clustering of the data objects is obtained by cutting the dendrogram at the desired level, then each connected component forms a cluster.
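
A compact sketch of agglomerative merging on 1-dimensional data is given below. Two choices in it are assumptions, since the slides do not fix them: the dissimilarity between clusters is single linkage (the smallest distance between any two members), and the termination condition is a target number of clusters, which plays the role of cutting the dendrogram at a chosen level.

```python
def single_linkage(c1, c2):
    """Smallest distance between any member of c1 and any member of c2."""
    return min(abs(a - b) for a in c1 for b in c2)


def agglomerative_1d(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain."""
    clusters = [[p] for p in points]   # every point starts as its own cluster
    merges = []                        # record of (left, right, distance)
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


final_clusters, merge_log = agglomerative_1d([1, 2, 5, 6, 7], 2)
print(sorted(sorted(c) for c in final_clusters))  # [[1, 2], [5, 6, 7]]
for left, right, dist in merge_log:
    print(left, "+", right, "at distance", dist)
```

The merge log is the information a dendrogram displays: which clusters were joined and at what dissimilarity.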

Divisive Clustering

  • Inverse order of agglomerative clustering

  • Eventually each node forms a cluster on its own

More on Hierarchical Methods

  • Major weakness of agglomerative clustering methods

    • do not scale well: time complexity at least O(n^2), where n is the total number of objects

    • can never undo what was done previously

  • Integration of hierarchical with distance-based clustering to scale-up these clustering methods

    • BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters

    • CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction


  • Cluster analysis groups objects based on their similarity and has wide applications

  • Measure of similarity can be computed for various types of data

  • Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, etc

  • Clustering can also be used for outlier detection, which is useful for fraud detection

  • What is the best clustering algorithm?

Other Data Mining Methods

Sequence analysis

  • Market basket analysis analyzes things that happen at the same time.

  • What about things that happen over time?

    E.g., If a customer buys a bed, he/she is likely to come to buy a mattress later

  • Sequential analysis needs

    • A time stamp for each data record

    • customer identification
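
A minimal sketch of why both fields are needed: with a customer identifier and a time stamp on each record, one can count the customers who bought one item and then a second item at a later time. The record layout `(customer_id, timestamp, item)`, the function name, and the sample records are all hypothetical.

```python
def count_sequences(records, first, second):
    """Count customers who bought `first` and, strictly later, `second`."""
    by_customer = {}
    for customer_id, timestamp, item in records:
        by_customer.setdefault(customer_id, []).append((timestamp, item))

    count = 0
    for purchases in by_customer.values():
        first_times = [ts for ts, item in purchases if item == first]
        if not first_times:
            continue  # this customer never bought the first item
        earliest = min(first_times)
        if any(item == second and ts > earliest for ts, item in purchases):
            count += 1
    return count


# Hypothetical purchase records: (customer_id, timestamp, item)
records = [
    (1, 1, "bed"), (1, 5, "mattress"),   # bed, then mattress
    (2, 2, "mattress"), (2, 3, "bed"),   # wrong order
    (3, 1, "bed"),                       # never bought a mattress
]
bed_then_mattress = count_sequences(records, "bed", "mattress")
print(bed_then_mattress)
```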

Sequence analysis (cont …)

  • The analysis shows which items come before, after or at the same time as other items.

  • Sequential patterns can be used for analyzing cause and effect.

Other applications

  • Finding cycles in association rules

    • Some association rules hold strongly in certain periods of time

    • E.g., every Monday people buy item X and Y together

  • Stock market predicting

  • Predicting possible failure in network, etc

Discovering holes in data

  • Holes are empty (sparse) regions in the data space that contain few or no data points. Holes may represent impossible value combinations in the application domain.

  • E.g., in a disease database, we may find that certain test values and/or symptoms do not go together, or that when a certain medicine is used, some test values never go beyond a certain range.

  • Such information could lead to significant discovery: a cure to a disease or some biological law.

Data and pattern visualization

  • Data visualization: Use computer graphics to reveal the patterns in data, e.g.,

    2-D, 3-D scatter plots, bar charts, pie charts, line plots, animation, etc.

  • Pattern visualization: Use good interface and graphics to present the results of data mining.

    Rule visualizer, cluster visualizer, etc

Scaling up data mining algorithms

  • Adapt data mining algorithms to work on very large databases.

    • Data reside on hard disk (too large to fit in main memory)

    • Make fewer passes over the data

  • Quadratic algorithms are too expensive

    • Many data mining algorithms are quadratic, especially clustering algorithms.
