Chapter 5: Clustering

Searching for groups
  • Clustering is unsupervised or undirected.
  • Unlike classification, clustering starts with no pre-classified data.
  • Search for groups or clusters of data points (records) that are similar to one another.
  • Similar points may mean similar customers or products that will behave in similar ways.
Group similar points together
  • Group points into classes using some distance measures.
    • Within-cluster distance and between-cluster distance
  • Applications:
    • As a stand-alone tool to get insight into data distribution
    • As a preprocessing step for other algorithms
Examples of Clustering Applications
  • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
  • Insurance: Identifying groups of motor insurance policy holders with some interesting characteristics.
  • City-planning: Identifying groups of houses according to their house type, value, and geographical location
Concepts of Clustering
  • Clusters
  • Different ways of representing clusters
    • Division with boundaries
    • Spheres
    • Probabilistic
    • Dendrograms

[Figure: probabilistic cluster representation, in which each instance I1, I2, …, In has membership probabilities over clusters 1, 2, 3, e.g., 0.5, 0.2, 0.3]

Clustering
  • Clustering quality
    • Inter-cluster distance → maximized
    • Intra-cluster distance → minimized
  • The quality of a clustering result depends on both the similarity measure used by the method and its application.
  • The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
  • Clustering vs. classification
    • Which one is more difficult? Why?
    • There are a huge number of clustering techniques.
Dissimilarity/Distance Measure
  • Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, which is typically a metric: d(i, j)
  • The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
  • Weights should be associated with different variables based on applications and data semantics.
  • It is hard to define “similar enough” or “good enough”. The answer is typically highly subjective.
Types of data in clustering analysis
  • Interval-scaled variables
  • Binary variables
  • Nominal, ordinal, and ratio variables
  • Variables of mixed types
Interval-valued variables
  • Continuous measurements in a roughly linear scale, e.g., weight, height, temperature, etc
  • Standardize data (depending on applications)
    • Calculate the mean absolute deviation:

      sf = (|x1f - mf| + |x2f - mf| + … + |xnf - mf|) / n

      where mf = (x1f + x2f + … + xnf) / n

    • Calculate the standardized measurement (z-score):

      zif = (xif - mf) / sf
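The standardization above fits in a few lines of Python; this is a minimal sketch, and the function name and sample values are illustrative rather than part of the original slides.

```python
# Mean absolute deviation and z-scores for one interval-scaled variable f.
def standardize(values):
    n = len(values)
    m_f = sum(values) / n                              # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n        # mean absolute deviation
    return [(x - m_f) / s_f for x in values]           # z-scores zif = (xif - mf) / sf

print(standardize([160.0, 170.0, 180.0, 190.0]))       # [-1.5, -0.5, 0.5, 1.5]
```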
Similarity Between Objects
  • Distance: Measure the similarity or dissimilarity between two data objects
  • Some popular ones include the Minkowski distance:

d(i, j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + … + |xip - xjp|^q)^(1/q)

where (xi1, xi2, …, xip) and (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer

  • If q = 1, d is the Manhattan distance:

d(i, j) = |xi1 - xj1| + |xi2 - xj2| + … + |xip - xjp|
Similarity Between Objects (Cont.)
  • If q = 2, d is the Euclidean distance:

d(i, j) = sqrt(|xi1 - xj1|^2 + |xi2 - xj2|^2 + … + |xip - xjp|^2)

    • Properties
      • d(i, j) ≥ 0
      • d(i, i) = 0
      • d(i, j) = d(j, i)
      • d(i, j) ≤ d(i, k) + d(k, j)
  • Also, one can use weighted distance, and many other similarity/distance measures.
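A rough sketch of the distances above; the function name and the sample points are made up for illustration.

```python
# Minkowski distance between two p-dimensional points; q = 1 gives the
# Manhattan distance and q = 2 the Euclidean distance.
def minkowski(x, y, q=2):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

xi, xj = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(xi, xj, q=1))   # Manhattan distance: 7.0
print(minkowski(xi, xj, q=2))   # Euclidean distance: 5.0
```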
Binary Variables

Object j

  • A contingency table for binary data
  • Simple matching coefficient (invariant, if the binary variable is symmetric):
  • Jaccard coefficient (noninvariant if the binary variable is asymmetric):

Object i

Dissimilarity of Binary Variables
  • Example
    • gender is a symmetric attribute (not used below)
    • the remaining attributes are asymmetric attributes
    • let the values Y and P be set to 1, and the value N be set to 0
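A sketch of both coefficients on 0/1-coded records; the two vectors below are hypothetical (Y/P coded as 1, N as 0), since the slide's example table is not reproduced here.

```python
# Compute the contingency counts a, b, c, d for two 0/1-coded objects and
# derive the simple matching and Jaccard dissimilarities defined above.
def binary_dissimilarity(i, j):
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    simple_matching = (b + c) / (a + b + c + d)   # symmetric binary variables
    jaccard = (b + c) / (a + b + c)               # asymmetric: 0-0 matches ignored
    return simple_matching, jaccard

obj_i = [1, 0, 1, 0, 0, 0]   # hypothetical records, Y/P coded as 1 and N as 0
obj_j = [1, 0, 1, 0, 1, 0]
print(binary_dissimilarity(obj_i, obj_j))   # approximately (0.167, 0.333)
```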
Nominal Variables
  • A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green, etc
  • Method 1: Simple matching
    • d(i, j) = (p - m) / p, where m: # of matches, p: total # of variables
  • Method 2: use a large number of binary variables
    • creating a new binary variable for each of the M nominal states
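Method 1 amounts to a one-liner; a minimal sketch with hypothetical attribute values follows.

```python
# Simple matching dissimilarity for nominal variables: d(i, j) = (p - m) / p.
def nominal_dissimilarity(i, j):
    p = len(i)                                     # total number of variables
    m = sum(1 for x, y in zip(i, j) if x == y)     # number of matches
    return (p - m) / p

print(nominal_dissimilarity(["red", "small", "round"],
                            ["red", "large", "round"]))   # 1/3
```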
Ordinal Variables
  • An ordinal variable can be discrete or continuous
  • Order is important, e.g., rank
  • Can be treated like interval-scaled (f is a variable)
    • replace xif by their ranks rif ∈ {1, …, Mf}
    • map the range of each variable onto [0, 1] by replacing the rank of the i-th object in the f-th variable by zif = (rif - 1) / (Mf - 1)
    • compute the dissimilarity using methods for interval-scaled variables
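A small sketch of the rank-and-rescale treatment described above; the level names are illustrative.

```python
# Treat an ordinal variable as interval-scaled: replace each value by its rank
# rif in {1, ..., Mf} and map it onto [0, 1] via zif = (rif - 1) / (Mf - 1).
def ordinal_to_interval(values, ordered_levels):
    M_f = len(ordered_levels)
    rank = {level: r + 1 for r, level in enumerate(ordered_levels)}
    return [(rank[v] - 1) / (M_f - 1) for v in values]

print(ordinal_to_interval(["fair", "good", "excellent", "good"],
                          ["fair", "good", "excellent"]))   # [0.0, 0.5, 1.0, 0.5]
```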
Ratio-Scaled Variables
  • Ratio-scaled variable: a measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(-Bt), e.g., the growth of a bacteria population.
  • Methods:
    • treat them like interval-scaled variables—not a good idea! (why?—the scale can be distorted)
    • apply logarithmic transformation

yif = log(xif)

    • treat them as continuous ordinal data and then treat their ranks as interval-scaled
Variables of Mixed Types
  • A database may contain all six types of variables
    • symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
  • One may use a weighted formula to combine their effects (a code sketch follows this list):

d(i, j) = Σf δij(f) dij(f) / Σf δij(f), where δij(f) = 0 if xif or xjf is missing and 1 otherwise
    • f is binary or nominal:

dij(f) = 0 if xif = xjf , or dij(f) = 1 o.w.

    • f is interval-based: use the normalized distance
    • f is ordinal or ratio-scaled
      • compute ranks rif and zif = (rif - 1) / (Mf - 1)
      • treat zif as interval-scaled
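A sketch of the weighted combination referenced above; it assumes the per-variable dissimilarities dij(f) and the indicator weights have already been computed by the type-specific rules.

```python
# Combine per-variable dissimilarities with the weighted formula
# d(i, j) = sum_f w_f * d_ij(f) / sum_f w_f. The inputs are assumed to come
# from the type-specific rules above (binary/nominal, interval, ordinal/ratio).
def mixed_dissimilarity(d_per_variable, weights):
    return (sum(w * d for w, d in zip(weights, d_per_variable))
            / sum(weights))

# e.g. one binary, one normalized interval, and one ordinal variable
print(mixed_dissimilarity([1.0, 0.25, 0.5], [1, 1, 1]))   # about 0.583
```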
Major Clustering Techniques
  • Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
  • Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
  • Density-based: based on connectivity and density functions
  • Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of the model to the data.
Partitioning Algorithms: Basic Concept
  • Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
  • Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
    • Global optimal: exhaustively enumerate all partitions
    • Heuristic methods: k-means and k-medoids algorithms
    • k-means: Each cluster is represented by the center of the cluster
    • k-medoids or PAM (Partition around medoids): Each cluster is represented by one of the objects in the cluster
The K-Means Clustering
  • Given k, the k-means algorithm is as follows:
    1) Choose k cluster centers to coincide with k randomly chosen points
    2) Assign each data point to the closest cluster center
    3) Recompute the cluster centers using the current cluster memberships
    4) If a convergence criterion is not met, go to 2)

Typical convergence criteria are: no (or minimal) reassignment of data points to new cluster centers, or minimal decrease in squared error.

Squared error: E = Σi=1..k Σp∈Ci |p - mi|^2, where p is a point and mi is the mean of cluster Ci
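The steps above translate into a short NumPy sketch; this is a minimal illustration rather than the slides' own code, and it assumes no cluster ever becomes empty.

```python
import numpy as np

def kmeans(X, k, init_centers=None, max_iter=100, seed=0):
    """Minimal k-means sketch: assign points to the nearest center, recompute
    centers, and stop when the centers no longer move."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = (np.asarray(init_centers, dtype=float) if init_centers is not None
               else X[rng.choice(len(X), size=k, replace=False)])   # step 1
    for _ in range(max_iter):
        # step 2: assign each point to the closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute centers from current memberships (assumes none empty)
        new_centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centers, centers):   # step 4: convergence check
            break
        centers = new_centers
    sse = sum(np.sum((X[labels == c] - centers[c]) ** 2) for c in range(k))
    return labels, centers, sse   # sse is the squared error E above
```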

Example
  • For simplicity, 1 dimensional data and k=2.
  • data: 1, 2, 5, 6, 7
  • K-means:
    • Randomly select 5 and 6 as initial centroids
    • => Two clusters {1, 2, 5} and {6, 7}; mean C1 = 8/3, mean C2 = 6.5
    • => {1, 2}, {5, 6, 7}; mean C1 = 1.5, mean C2 = 6
    • => no change
    • Aggregate dissimilarity = 0.5^2 + 0.5^2 + 1^2 + 1^2 = 2.5 (reproduced in the sketch below)
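Assuming the kmeans sketch defined earlier, the worked example can be reproduced by passing in the initial centroids 5 and 6 explicitly.

```python
import numpy as np

# Reproduce the 1-dimensional example with the kmeans sketch above.
X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])
labels, centers, sse = kmeans(X, k=2, init_centers=[[5.0], [6.0]])
print(labels)    # [0 0 1 1 1]  -> clusters {1, 2} and {5, 6, 7}
print(centers)   # cluster means 1.5 and 6.0
print(sse)       # 2.5, the aggregate dissimilarity from the slide
```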
Comments on K-Means
  • Strength: efficient: O(tkn), where n is # data points, k is # clusters, and t is # iterations. Normally, k, t << n.
  • Comment: Often terminates at a local optimum. The global optimum may be found using techniques such as: deterministic annealing and genetic algorithms
  • Weakness
    • Applicable only when mean is defined, difficult for categorical data
    • Need to specify k, the number of clusters, in advance
    • Sensitive to noisy data and outliers
    • Not suitable to discover clusters with non-convex shapes
    • Sensitive to initial seeds
Variations of the K-Means Method
  • A few variants of k-means differ in
    • Selection of the initial k seeds
    • Dissimilarity measures
    • Strategies to calculate cluster means
  • Handling categorical data: k-modes
    • Replacing means of clusters with modes
    • Using new dissimilarity measures to deal with categorical objects
    • Using a frequency based method to update modes of clusters
The k-Medoids Clustering Method
  • The k-means algorithm is sensitive to outliers
    • since an object with an extremely large value may substantially distort the distribution of the data
  • Medoid – the most centrally located point in a cluster, used as a representative point of the cluster
  • In contrast, a centroid is not necessarily inside a cluster

[Figure: example data set with the initial medoids marked]
Partition Around Medoids
  • PAM (a code sketch follows this list):
    1) Given k
    2) Randomly pick k instances as initial medoids
    3) Assign each data point to its nearest medoid x
    4) Calculate the objective function: the sum of dissimilarities of all points to their nearest medoids (squared-error criterion)
    5) Randomly select a non-medoid point y
    6) Swap x with y if the swap reduces the objective function
    7) Repeat steps 3)–6) until no change
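A sketch of the swap procedure above; for clarity it tries every (medoid, non-medoid) swap instead of a random one and uses Euclidean dissimilarity, both of which are assumptions rather than the slide's exact procedure.

```python
import itertools
import numpy as np

def pam(X, k, seed=0):
    """Minimal PAM sketch: start from random medoids and keep swapping a
    medoid with a non-medoid point whenever the swap lowers the objective."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))

    def cost(meds):
        # sum of dissimilarities of all points to their nearest medoid
        d = np.linalg.norm(X[:, None, :] - X[meds][None, :, :], axis=2)
        return d.min(axis=1).sum()

    best, improved = cost(medoids), True
    while improved:                          # repeat until no change
        improved = False
        for m, y in itertools.product(range(k), range(len(X))):
            if y in medoids:
                continue
            candidate = medoids.copy()
            candidate[m] = y                 # tentative swap of medoid x with point y
            c = cost(candidate)
            if c < best:                     # keep the swap if it reduces the objective
                medoids, best, improved = candidate, c, True
    return medoids, best
```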
Comments on PAM

[Figure: example cluster with an outlier 100 units away]

  • Pam is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean (why?)
  • Pam works well for small data sets but does not scale well for large data sets.
    • O(k(n - k)^2) for each change, where n is # of data points and k is # of clusters

CLARA: Clustering Large Applications
  • CLARA: built into statistical analysis packages, such as S+
  • It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
  • Strength: deals with larger data sets than PAM
  • Weakness:
    • Efficiency depends on the sample size
    • A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
  • There are other scale-up methods, e.g., CLARANS (a sketch of the CLARA sampling idea follows)
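A rough sketch of the sampling idea; it reuses the pam sketch above, and the default sample size of 40 is an arbitrary choice for illustration.

```python
import numpy as np

def clara(X, k, n_samples=5, sample_size=40, seed=0):
    """CLARA idea: run PAM on several random samples and keep the medoid set
    that scores best on the whole data set."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        sample_medoids, _ = pam(X[idx], k)          # cluster the sample only
        medoids = idx[np.array(sample_medoids)]     # map back to full-data indices
        d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
        cost = d.min(axis=1).sum()                  # evaluate on all of the data
        if cost < best_cost:
            best_medoids, best_cost = list(medoids), cost
    return best_medoids, best_cost
```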
[Figure: agglomerative clustering merges the points a, b, c, d, e step by step into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}; divisive clustering proceeds in the reverse order]
Hierarchical Clustering
  • Use distance matrix for clustering. This method does not require the number of clusters k as an input, but needs a termination condition
Agglomerative Clustering
  • At the beginning, each data point forms a cluster (also called a node).
  • Merge nodes/clusters that have the least dissimilarity.
  • Go on merging
  • Eventually all nodes belong to the same cluster
A Dendrogram Shows How the Clusters are Merged Hierarchically
  • Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
  • A clustering of the data objects is obtained by cutting the dendrogram at the desired level: each connected component then forms a cluster (see the sketch below).
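A sketch using SciPy's hierarchical clustering, assuming SciPy is available; the toy data reuses the 1-D example from the k-means slide.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Agglomerative clustering: every point starts as its own cluster, the least
# dissimilar clusters are merged step by step, and cutting the dendrogram at a
# chosen distance yields the flat clusters.
X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])
Z = linkage(X, method="single")                    # the merge history (dendrogram)
labels = fcluster(Z, t=2.0, criterion="distance")  # cut the tree at distance 2.0
print(labels)                                      # e.g. [1 1 2 2 2] -> {1, 2} and {5, 6, 7}
```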
Divisive Clustering
  • Inverse order of agglomerative clustering
  • Eventually each node forms a cluster on its own
More on Hierarchical Methods
  • Major weakness of agglomerative clustering methods
    • do not scale well: time complexity at least O(n^2), where n is the total number of objects
    • can never undo what was done previously
  • Integration of hierarchical with distance-based clustering to scale-up these clustering methods
    • BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
    • CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
Summary
  • Cluster analysis groups objects based on their similarity and has wide applications
  • Measure of similarity can be computed for various types of data
  • Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, etc
  • Clustering can also be used for outlier detection, which is useful for fraud detection
  • What is the best clustering algorithm?
Sequence analysis
  • Market basket analysis analyzes things that happen at the same time.
  • What about things that happen over time?

E.g., If a customer buys a bed, he/she is likely to come to buy a mattress later

  • Sequential analysis needs
    • A time stamp for each data record
    • customer identification
Sequence analysis (cont …)
  • The analysis shows which items come before, after, or at the same time as other items.
  • Sequential patterns can be used for analyzing cause and effect.

Other applications

  • Finding cycles in association rules
    • Some association rules hold strongly in certain periods of time
    • E.g., every Monday people buy item X and Y together
  • Stock market prediction
  • Predicting possible failures in networks, etc.
Discovering holes in data
  • Holes are empty (sparse) regions in the data space that contain few or no data points. Holes may represent impossible value combinations in the application domain.
  • E.g., in a disease database, we may find that certain test values and/or symptoms do not go together, or that when a certain medicine is used, some test values never go beyond a certain range.
  • Such information could lead to significant discovery: a cure to a disease or some biological law.
Data and pattern visualization
  • Data visualization: use computer graphics to reveal the patterns in data, e.g., 2-D and 3-D scatter plots, bar charts, pie charts, line plots, animation, etc.

  • Pattern visualization: use a good interface and graphics to present the results of data mining, e.g., rule visualizers, cluster visualizers, etc.
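As a small illustration of the 2-D scatter-plot idea, the snippet below colours points by cluster label; the data and labels are synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt

# A 2-D scatter plot with points coloured by cluster label, the simplest form
# of the data visualization described above.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0.0, 2.0)])
labels = np.repeat([0, 1], 50)

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.title("Clusters in a 2-D scatter plot")
plt.show()
```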

Scaling up data mining algorithms
  • Adapt data mining algorithms to work on very large databases.
    • Data reside on hard disk (too large to fit in main memory)
    • Make fewer passes over the data
  • Quadratic algorithms are too expensive
    • Many data mining algorithms are quadratic, especially, clustering algorithms.