
Chapter 5: Clustering


- Clustering is unsupervised or undirected.
- Unlike classification, clustering uses no pre-classified data.
- Search for groups or clusters of data points (records) that are similar to one another.
- Similar points may represent, for example, similar customers or products that will behave in similar ways.

- Group points into classes using some distance measures.
- Within-cluster distance, and between cluster distance

- Applications:
- As a stand-alone tool to get insight into data distribution
- As a preprocessing step for other algorithms

- Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Insurance: Identifying groups of motor insurance policy holders with some interesting characteristics.
- City-planning: Identifying groups of houses according to their house type, value, and geographical location

- Clusters
- Different ways of representing clusters
- Division with boundaries
- Spheres
- Probabilistic
- Dendrograms
- …

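[Figure: probabilistic cluster representation, a table with instances I1, I2, …, In as rows and clusters 1, 2, 3 as columns; the example row shows membership probabilities 0.5, 0.2, 0.3]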

- Clustering quality
- Inter-clusters distance maximized
- Intra-clusters distance minimized

- The quality of a clustering result depends on both the similarity measure used by the method and its application.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
- Clustering vs. classification
- Which one is more difficult? Why?
- There are a huge number of clustering techniques.

- Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function d(i, j), which is typically a metric
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define “similar enough” or “good enough”. The answer is typically highly subjective.

- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types

- Continuous measurements in a roughly linear scale, e.g., weight, height, temperature, etc
- Standardize data (depending on applications)
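- Calculate the mean absolute deviation:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
where
$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$

- Calculate the standardized measurement (z-score):
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$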
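The standardization step can be sketched in a few lines of Python. This is a minimal illustration, assuming a single variable held in a plain list; the function name `standardize` and the sample heights are made up for the example.

```python
def standardize(values):
    """Return z-scores that use the mean absolute deviation as the spread."""
    n = len(values)
    mean = sum(values) / n
    # Mean absolute deviation: the average distance of a value from the mean.
    mad = sum(abs(x - mean) for x in values) / n
    if mad == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - mean) / mad for x in values]


heights_cm = [150.0, 160.0, 170.0, 180.0]  # hypothetical sample values
print(standardize(heights_cm))
```

Because the deviations are not squared, extreme values inflate the mean absolute deviation less than they inflate the standard deviation, so outliers remain visible after standardizing.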

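- Distance: Measure the similarity or dissimilarity between two data objects
- Some popular ones include: Minkowski distance:
$$d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}$$
where $(x_{i1}, x_{i2}, \ldots, x_{ip})$ and $(x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer

- If q = 1, d is Manhattan distance:
$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

- If q = 2, d is Euclidean distance:
$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
- Properties
- d(i,j) ≥ 0
- d(i,i) = 0
- d(i,j) = d(j,i)
- d(i,j) ≤ d(i,k) + d(k,j)

- Also, one can use weighted distance, and many other similarity/distance measures.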
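A short Python sketch of the Minkowski distance may help here. The function name `minkowski` and the sample points are illustrative assumptions; q = 1 gives the Manhattan distance and q = 2 the Euclidean distance.

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of dimensions")
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


origin, point = (0, 0), (3, 4)        # made-up 2-dimensional objects
print(minkowski(origin, point, q=1))  # Manhattan distance
print(minkowski(origin, point, q=2))  # Euclidean distance
```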

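- A contingency table for binary data

| Object i \ Object j | 1 | 0 | sum |
|---|---|---|---|
| 1 | a | b | a + b |
| 0 | c | d | c + d |
| sum | a + c | b + d | p |

- Simple matching coefficient (invariant, if the binary variable is symmetric):
$$d(i,j) = \frac{b + c}{a + b + c + d}$$
- Jaccard coefficient (noninvariant if the binary variable is asymmetric):
$$d(i,j) = \frac{b + c}{a + b + c}$$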
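The two binary dissimilarities can be sketched in Python as follows. The sketch assumes the usual contingency counts: a is the number of variables where both objects are 1, b where object i is 1 and object j is 0, c where i is 0 and j is 1, and d where both are 0. The function names and the two sample vectors are made up for illustration.

```python
def contingency(i, j):
    """Count the four agreement/disagreement cases for two 0/1 vectors."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    return a, b, c, d


def simple_matching_dissimilarity(i, j):
    """Symmetric case: 1-1 and 0-0 matches both count as agreement."""
    a, b, c, d = contingency(i, j)
    return (b + c) / (a + b + c + d)


def jaccard_dissimilarity(i, j):
    """Asymmetric case: 0-0 matches are ignored."""
    a, b, c, _ = contingency(i, j)
    if a + b + c == 0:
        return 0.0  # assumption: two all-zero objects are treated as identical
    return (b + c) / (a + b + c)


obj_i = [1, 0, 1, 0, 0, 0]  # hypothetical asymmetric binary attributes
obj_j = [1, 0, 1, 0, 1, 0]
print(simple_matching_dissimilarity(obj_i, obj_j))
print(jaccard_dissimilarity(obj_i, obj_j))
```

The Jaccard-based value is larger here because the three shared zeros no longer count as evidence of similarity.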

- Example
- gender is a symmetric attribute (not used below)
- the remaining attributes are asymmetric attributes
- let the values Y and P be set to 1, and the value N be set to 0

- A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green, etc
- Method 1: Simple matching
- m: # of matches, p: total # of variables
$$d(i,j) = \frac{p - m}{p}$$

- Method 2: use a large number of binary variables
- creating a new binary variable for each of the M nominal states
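Both methods for nominal variables can be sketched briefly in Python. The helper names and the example records are assumptions made for illustration.

```python
def nominal_dissimilarity(i, j):
    """Method 1: fraction of variables on which the two objects disagree."""
    p = len(i)                                   # total number of variables
    m = sum(1 for a, b in zip(i, j) if a == b)   # number of matches
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one binary variable per nominal state."""
    return [1 if value == s else 0 for s in states]


item_i = ["red", "small", "round"]   # hypothetical nominal records
item_j = ["red", "large", "round"]
print(nominal_dissimilarity(item_i, item_j))
print(one_hot("blue", ["red", "yellow", "blue", "green"]))
```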

- An ordinal variable can be discrete or continuous
- Order is important, e.g., rank
- Can be treated like interval-scaled (f is a variable)
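- replace $x_{if}$ by their ranks $r_{if} \in \{1, \ldots, M_f\}$, where $M_f$ is the number of ordered states of variable f
- map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$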
- compute the dissimilarity using methods for interval-scaled variables
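The rank mapping for ordinal variables might look like this in Python. It is a sketch that assumes the ordered list of states is known in advance; the function name and the medal example are hypothetical.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Replace each ordinal value by (rank - 1) / (M - 1), which lies in [0, 1]."""
    m = len(ordered_states)
    if m < 2:
        raise ValueError("need at least two ordered states")
    # Ranks start at 1 for the lowest state.
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (m - 1) for v in values]


medals = ["bronze", "gold", "silver"]                      # made-up data
print(ordinal_to_unit_interval(medals, ["bronze", "silver", "gold"]))
```

The resulting values can then be passed to any interval-scaled distance, such as the Minkowski distance.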

- Ratio-scaled variable: a measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$, e.g., growth of a bacteria population.
- Methods:
- treat them like interval-scaled variables: not a good idea! (why? the scale can be distorted)
- apply logarithmic transformation
$$y_{if} = \log(x_{if})$$

- treat them as continuous ordinal data and then treat their ranks as interval-scaled
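A small Python sketch shows why the logarithmic transformation helps. The bacteria counts are hypothetical and simply double at each time step.

```python
import math


def log_transform(values):
    """Map ratio-scaled values x to y = log(x); requires x > 0."""
    if any(v <= 0 for v in values):
        raise ValueError("logarithm is only defined for positive values")
    return [math.log(v) for v in values]


# Hypothetical counts that double at every step (exponential growth).
counts = [100 * 2 ** t for t in range(5)]
logged = log_transform(counts)
# After the transform, consecutive differences are all about log(2),
# so equal time steps correspond to equal distances.
steps = [b - a for a, b in zip(logged, logged[1:])]
print(steps)
```

On the raw scale the gap between the last two counts (800) dwarfs the gap between the first two (100), which is the distortion the slide warns about.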

- A database may contain all six types of variables
- symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio

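- One may use a weighted formula to combine their effects
$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
where $\delta_{ij}^{(f)} = 0$ if $x_{if}$ or $x_{jf}$ is missing, or if $x_{if} = x_{jf} = 0$ and f is asymmetric binary; otherwise $\delta_{ij}^{(f)} = 1$
- f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise

- f is interval-based: use the normalized distance
- f is ordinal or ratio-scaled
- compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
- and treat $z_{if}$ as interval-scaled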
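The combination of variable types can be sketched in Python as follows. Several choices here are assumptions rather than something the slides specify: missing values are encoded as `None` and skipped, interval variables are normalized by a known value range, and ordinal variables are expected to be mapped onto [0, 1] beforehand and passed as interval variables with range 1.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Average per-variable dissimilarity over the variables present in both objects."""
    total = 0.0
    counted = 0
    for f, kind in enumerate(kinds):
        if x[f] is None or y[f] is None:
            continue  # variable f is missing, so it does not contribute
        if kind == "nominal":
            d = 0.0 if x[f] == y[f] else 1.0
        elif kind == "interval":
            d = abs(x[f] - y[f]) / ranges[f]  # assumed range normalization
        else:
            raise ValueError(f"unknown variable kind: {kind}")
        total += d
        counted += 1
    if counted == 0:
        raise ValueError("no variable is present in both objects")
    return total / counted


kinds = ("nominal", "interval", "interval")
ranges = (None, 40.0, 1.0)               # hypothetical max - min per variable
obj_i = ("red", 10.0, 0.0)
obj_j = ("blue", 20.0, 0.5)
print(mixed_dissimilarity(obj_i, obj_j, kinds, ranges))
```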

- Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
- Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
- Density-based: based on connectivity and density functions
- Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to that model.

- Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
- Global optimal: exhaustively enumerate all partitions
- Heuristic methods: k-means and k-medoids algorithms
- k-means: Each cluster is represented by the center of the cluster
- k-medoids or PAM (Partition around medoids): Each cluster is represented by one of the objects in the cluster

- Given k, the k-means algorithm is as follows:
1) Choose k cluster centers to coincide with k randomly-chosen points
2) Assign each data point to the closest cluster center
3) Recompute the cluster centers using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
Typical convergence criteria are: no (or minimal) reassignment of data points to new cluster centers, or minimal decrease in squared error:
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$$
where p is a point and $m_i$ is the mean of cluster $C_i$

- For simplicity, consider 1-dimensional data and k = 2.
- data: 1, 2, 5, 6, 7
- K-means:
- Randomly select 5 and 6 as initial centroids;
- => Two clusters {1,2,5} and {6,7}; meanC1 = 8/3, meanC2 = 6.5
- => {1,2}, {5,6,7}; meanC1 = 1.5, meanC2 = 6
- => no change.
- Aggregate dissimilarity (sum of squared distances to the cluster means) = 0.5^2 + 0.5^2 + 1^2 + 0^2 + 1^2 = 2.5
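The k-means loop and the worked example above can be reproduced with a short Python sketch. It is a simplified 1-dimensional version: the initial centroids are passed in explicitly (5 and 6, as in the example) instead of being chosen at random, and the function name `kmeans_1d` is an assumption.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Plain k-means on 1-D data; stops when no point changes cluster."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for every point.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # convergence: no reassignment
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    squared_error = sum((p - centroids[a]) ** 2 for p, a in zip(points, assignment))
    return centroids, assignment, squared_error


centroids, assignment, squared_error = kmeans_1d([1, 2, 5, 6, 7], [5, 6])
print(centroids, assignment, squared_error)
```

Starting from other seeds can lead to a different final partition, which is why k-means is usually run several times with different initial centroids.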

- Strength: efficient, with time complexity O(tkn), where n is # data points, k is # clusters, and t is # iterations. Normally, k, t << n.
- Comment: Often terminates at a local optimum. The global optimum may be found using techniques such as: deterministic annealing and genetic algorithms
- Weakness
- Applicable only when mean is defined, difficult for categorical data
- Need to specify k, the number of clusters, in advance
- Sensitive to noisy data and outliers
- Not suitable to discover clusters with non-convex shapes
- Sensitive to initial seeds

- A few variants of the k-means which differ in
- Selection of the initial k seeds
- Dissimilarity measures
- Strategies to calculate cluster means

- Handling categorical data: k-modes
- Replacing means of clusters with modes
- Using new dissimilarity measures to deal with categorical objects
- Using a frequency based method to update modes of clusters
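The two ingredients that k-modes swaps in can be sketched in Python. This is not the full k-modes algorithm, only an illustration of a mismatch-count dissimilarity and a frequency-based mode update; the function names and records are made up.

```python
from collections import Counter


def mismatch_count(a, b):
    """Dissimilarity for categorical objects: number of differing attributes."""
    return sum(1 for x, y in zip(a, b) if x != y)


def cluster_mode(records):
    """Mode of a cluster: the most frequent value of each attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


cluster = [("red", "small"), ("red", "large"), ("blue", "small")]  # hypothetical
mode = cluster_mode(cluster)
print(mode)
print([mismatch_count(record, mode) for record in cluster])
```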

- The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.

- Medoid – the most centrally located point in a cluster, as a representative point of the cluster.
- An example
- In contrast, a centroid is not necessarily inside a cluster.

[Figure: initial medoids]

- PAM:
1) Given k
2) Randomly pick k instances as initial medoids
3) Assign each data point to the nearest medoid x
4) Calculate the objective function: the sum of dissimilarities of all points to their nearest medoids (squared-error criterion)
5) Randomly select a point y
6) Swap x by y if the swap reduces the objective function
7) Repeat (3-6) until no change
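A Python sketch of the swap-based search follows. It departs from the steps above in two labeled ways: the initial medoids are passed in rather than drawn at random, and candidate points y are tried in a systematic sweep rather than sampled, so the loop ends deterministically once no swap lowers the objective. The function names are assumptions.

```python
def total_cost(points, medoids, dist):
    """Objective: sum of dissimilarities of all points to their nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def pam(points, initial_medoids, dist):
    """Swap medoids with non-medoids while a swap reduces the objective."""
    medoids = list(initial_medoids)
    cost = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        for i in range(len(medoids)):
            for y in points:
                if y in medoids:
                    continue
                # Candidate solution: replace the i-th medoid by y.
                candidate = medoids[:i] + [y] + medoids[i + 1:]
                candidate_cost = total_cost(points, candidate, dist)
                if candidate_cost < cost:
                    medoids, cost = candidate, candidate_cost
                    improved = True
    return medoids, cost


data = [1, 2, 5, 6, 7]
medoids, cost = pam(data, [5, 6], dist=lambda a, b: abs(a - b))
print(medoids, cost)
```

Every call to `total_cost` scans all points against all medoids, which is where the poor scaling of PAM on large data sets comes from.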

[Figure: a cluster with an outlier 100 units away]

- PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean (why?)
- PAM works well for small data sets but does not scale well for large data sets.
- $O(k(n-k)^2)$ for each change, where n is # of data points, k is # of clusters
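One way to see why a medoid is less affected by outliers is to compare it with the mean on a tiny data set. The values below are made up, and the medoid is computed by brute force as the point with the smallest total distance to all others.

```python
def mean(values):
    return sum(values) / len(values)


def medoid(values):
    """The data point with the smallest total distance to all other points."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))


cluster = [1, 2, 3, 4, 5]
with_outlier = cluster + [100]   # one extreme value far from the rest

print(mean(cluster), medoid(cluster))
print(mean(with_outlier), medoid(with_outlier))
```

The mean is dragged far toward the outlier, while the medoid must be an actual data point and stays inside the dense part of the cluster.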

- CLARA (Clustering LARge Applications): built into statistical analysis packages, such as S+
- It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
- Strength: deals with larger data sets than PAM
- Weakness:
- Efficiency depends on the sample size
- A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

- There are other scale-up methods e.g., CLARANS

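[Figure: agglomerative vs. divisive hierarchical clustering of objects a, b, c, d, e. Agglomerative clustering runs from Step 0 to Step 4, forming {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}; divisive clustering runs the same steps in the opposite direction, from the single cluster {a, b, c, d, e} back to individual objects.]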

- Use distance matrix for clustering. This method does not require the number of clusters k as an input, but needs a termination condition

- At the beginning, each data point forms a cluster (also called a node).
- Merge nodes/clusters that have the least dissimilarity.
- Go on merging
- Eventually all nodes belong to the same cluster
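The merging procedure can be sketched in pure Python. The slides do not say how the dissimilarity between two clusters is measured, so this sketch assumes single linkage (the distance between the closest pair of points); the `stop_at` parameter is a hypothetical termination condition giving the number of clusters at which merging stops.

```python
def agglomerative(points, dist, stop_at=1):
    """Repeatedly merge the two least dissimilar clusters (single linkage)."""
    clusters = [[p] for p in points]   # every point starts as its own cluster
    merges = []                        # history of (cluster_a, cluster_b, distance)
    while len(clusters) > stop_at:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters, merges


clusters, merges = agglomerative([1.0, 2.0, 6.0, 9.0], lambda p, q: abs(p - q))
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The nested loops over all cluster pairs make each merge at least quadratic in the number of clusters, which is the scaling weakness of this family of methods.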

A Dendrogram Shows How the Clusters are Merged Hierarchically

- Decompose data objects into several levels of nested partitioning (tree of clusters), called a dendrogram.
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level, then each connected component forms a cluster.

- Inverse order of agglomerative clustering
- Eventually each node forms a cluster on its own

- Major weakness of agglomerative clustering methods
- do not scale well: time complexity at least $O(n^2)$, where n is the total number of objects
- can never undo what was done previously

- Integration of hierarchical with distance-based clustering to scale-up these clustering methods
- BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
- CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction

- Cluster analysis groups objects based on their similarity and has wide applications
- Measure of similarity can be computed for various types of data
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, etc
- Clustering can also be used for outlier detection, which is useful for fraud detection
- What is the best clustering algorithm?

Other Data Mining Methods

- Market basket analysis analyzes things that happen at the same time.
- How about things that happen over time?
E.g., if a customer buys a bed, he/she is likely to come back to buy a mattress later

- Sequential analysis needs
- A time stamp for each data record
- customer identification

- The analysis shows which items come before, after, or at the same time as other items.
- Sequential patterns can be used for analyzing cause and effect.
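A very small Python sketch illustrates the kind of counting that sequential analysis relies on. It assumes each record is a (customer id, time stamp, item) triple and counts, per ordered item pair, how many customers bought the first item strictly before the second. The function name and the purchase records are invented for the example; practical sequential pattern mining handles longer sequences and support thresholds.

```python
from collections import defaultdict
from itertools import combinations


def sequential_pair_counts(records):
    """Count customers who bought item A strictly before item B."""
    by_customer = defaultdict(list)
    for customer, timestamp, item in records:
        by_customer[customer].append((timestamp, item))

    counts = defaultdict(int)
    for purchases in by_customer.values():
        purchases.sort()  # order each customer's purchases by time stamp
        seen = set()
        for (t1, a), (t2, b) in combinations(purchases, 2):
            if t1 < t2 and a != b:
                seen.add((a, b))
        for pair in seen:  # each customer counts once per pair
            counts[pair] += 1
    return dict(counts)


records = [  # hypothetical (customer id, time stamp, item) triples
    (1, 1, "bed"), (1, 5, "mattress"),
    (2, 2, "bed"), (2, 3, "lamp"), (2, 9, "mattress"),
    (3, 4, "mattress"),
]
counts = sequential_pair_counts(records)
print(counts)
```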
Other applications

- Finding cycles in association rules
- Some association rules hold strongly in certain periods of time
- E.g., every Monday people buy items X and Y together

- Stock market prediction
- Predicting possible failures in networks, etc.

- Holes are empty (sparse) regions in the data space that contain few or no data points. Holes may represent impossible value combinations in the application domain.
- E.g., in a disease database, we may find that certain test values and/or symptoms do not go together, or that when a certain medicine is used, some test values never go beyond a certain range.
- Such information could lead to significant discovery: a cure to a disease or some biological law.
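The slides do not name a method for finding holes, so the following Python sketch uses one simple assumption: divide a 2-D data space into a regular grid and report the cells that contain no points. The function name, the grid resolution, and the sample points are all hypothetical.

```python
from itertools import product


def empty_cells(points, bins, lo, hi):
    """Return the grid cells of a bins x bins grid over [lo, hi]^2 with no points."""
    width = (hi - lo) / bins
    occupied = set()
    for x, y in points:
        # Clamp so that points lying exactly on the upper border fall in the last cell.
        cx = min(int((x - lo) / width), bins - 1)
        cy = min(int((y - lo) / width), bins - 1)
        occupied.add((cx, cy))
    all_cells = set(product(range(bins), repeat=2))
    return sorted(all_cells - occupied)


points = [(0.1, 0.1), (0.9, 0.1), (0.1, 0.9)]   # made-up test values
print(empty_cells(points, bins=2, lo=0.0, hi=1.0))
```

An empty cell here would correspond to a combination of two test values that never occurs together in the data.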

- Data visualization: Use computer graphics effects to reveal the patterns in data, e.g., 2-D and 3-D scatter plots, bar charts, pie charts, line plots, animation, etc.

- Pattern visualization: Use a good interface and graphics to present the results of data mining, e.g., rule visualizer, cluster visualizer, etc.

- Adapt data mining algorithms to work on very large databases.
- Data reside on hard disk (too large to fit in main memory)
- Make fewer passes over the data

- Quadratic algorithms are too expensive
- Many data mining algorithms are quadratic, especially clustering algorithms.