Chapter 5: Clustering

Searching for groups • Clustering is unsupervised or undirected. • Unlike classification, clustering has no pre-classified data. • Search for groups or clusters of data points (records) that are similar to one another. • Similar points may mean similar customers or products that will behave in similar ways.

Group similar points together • Group points into classes using some distance measure. • Within-cluster distance and between-cluster distance • Applications: • As a stand-alone tool to get insight into data distribution • As a preprocessing step for other algorithms

An Illustration

Examples of Clustering Applications • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs • Insurance: Identifying groups of motor insurance policy holders with some interesting characteristics. • City-planning: Identifying groups of houses according to their house type, value, and geographical location

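Concepts of Clustering • Clusters • Different ways of representing clusters • Division with boundaries • Spheres • Probabilistic • Dendrograms • … • [Figure: probabilistic representation, a table of instances I1, I2, …, In against clusters 1, 2, 3 with membership probabilities such as 0.5, 0.2, 0.3]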

Clustering • Clustering quality • Inter-cluster distance: maximized • Intra-cluster distance: minimized • The quality of a clustering result depends on both the similarity measure used by the method and its application. • The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns • Clustering vs. classification • Which one is more difficult? Why? • There are a huge number of clustering techniques.

Dissimilarity/Distance Measure • Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function d(i, j), which is typically a metric • The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables. • Weights should be associated with different variables based on applications and data semantics. • It is hard to define “similar enough” or “good enough”. The answer is typically highly subjective.

Types of data in clustering analysis • Interval-scaled variables • Binary variables • Nominal, ordinal, and ratio variables • Variables of mixed types

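Interval-valued variables • Continuous measurements on a roughly linear scale, e.g., weight, height, temperature, etc. • Standardize data (depending on the application) • Calculate the mean absolute deviation: $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$, where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$ • Calculate the standardized measurement (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$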
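The standardization step can be sketched in a few lines of Python. This is a minimal illustration, assuming the mean absolute deviation is the average of |x_if − m_f| over all n objects; the function name `standardize` and the sample heights are hypothetical.

```python
def standardize(values):
    """Return z-scores, using the mean absolute deviation s_f as the spread."""
    n = len(values)
    m_f = sum(values) / n                          # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    if s_f == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m_f) / s_f for x in values]


heights_cm = [150.0, 160.0, 170.0, 180.0]  # hypothetical measurements
z = standardize(heights_cm)
print(z)
```

After standardization the variable no longer depends on its unit of measurement, so heights in centimetres and heights in inches give the same z-scores.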

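Similarity Between Objects • Distance: Measure the similarity or dissimilarity between two data objects • Some popular ones include: Minkowski distance: $d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q\right)^{1/q}$, where $(x_{i1}, x_{i2}, \dots, x_{ip})$ and $(x_{j1}, x_{j2}, \dots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer • If q = 1, d is the Manhattan distance: $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$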

Similarity Between Objects (Cont.) • If q = 2, d is the Euclidean distance: $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$ • Properties • $d(i,j) \ge 0$ • $d(i,i) = 0$ • $d(i,j) = d(j,i)$ • $d(i,j) \le d(i,k) + d(k,j)$ • Also, one can use weighted distance, and many other similarity/distance measures.
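A minimal Python sketch of the Minkowski distance follows, with q = 1 and q = 2 giving the Manhattan and Euclidean cases. The function name `minkowski` and the two sample objects are assumptions made for illustration.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of dimensions")
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


obj_i = (1.0, 2.0)   # hypothetical 2-dimensional objects
obj_j = (4.0, 6.0)
manhattan = minkowski(obj_i, obj_j, 1)   # |1-4| + |2-6|
euclidean = minkowski(obj_i, obj_j, 2)   # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

The same function can be used to spot-check the listed properties, for example symmetry by swapping the two arguments.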

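Binary Variables • A contingency table for binary data, where a, b, c and d count the variables on which objects i and j take each combination of values:

                   Object j
                   1      0      sum
    Object i   1   a      b      a+b
               0   c      d      c+d
             sum   a+c    b+d    p

• Simple matching coefficient (invariant, if the binary variable is symmetric): $d(i,j) = \frac{b + c}{a + b + c + d}$ • Jaccard coefficient (noninvariant if the binary variable is asymmetric): $d(i,j) = \frac{b + c}{a + b + c}$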

Dissimilarity of Binary Variables • Example • gender is a symmetric attribute (not used below) • the remaining attributes are asymmetric attributes • let the values Y and P be set to 1, and the value N be set to 0
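The encoding and the two coefficients can be sketched as follows. This is an illustration only: the two patient records are hypothetical, and the counts a, b, c, d are assumed to follow the usual convention (a = both 1, b = i is 1 and j is 0, c = i is 0 and j is 1, d = both 0).

```python
def encode(record):
    """Map Y and P to 1 and N to 0, as described above."""
    return [1 if value in ("Y", "P") else 0 for value in record]


def contingency(x, y):
    """Return the counts (a, b, c, d) for two binary vectors."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d


def simple_matching_dissimilarity(x, y):
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)


def jaccard_dissimilarity(x, y):
    a, b, c, _ = contingency(x, y)   # negative matches (d) are ignored
    if a + b + c == 0:
        return 0.0                   # assumption: two all-zero records count as identical
    return (b + c) / (a + b + c)


# Hypothetical asymmetric attributes for two patients (gender left out).
patient_1 = encode(["Y", "N", "P", "N", "N", "N"])
patient_2 = encode(["Y", "N", "P", "N", "P", "N"])
print(jaccard_dissimilarity(patient_1, patient_2))
print(simple_matching_dissimilarity(patient_1, patient_2))
```

Because the attributes are asymmetric, the Jaccard version is the one that matches the slide's intent: shared negatives say little about how alike two patients are.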

Nominal Variables • A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green, etc. • Method 1: Simple matching: $d(i,j) = \frac{p - m}{p}$ • m: # of matches, p: total # of variables • Method 2: use a large number of binary variables • create a new binary variable for each of the M nominal states
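Both methods are short enough to sketch in Python. The sketch assumes Method 1 computes the fraction of mismatching variables, (p − m) / p; the function names and the sample objects are hypothetical.

```python
def nominal_dissimilarity(x, y):
    """Method 1: fraction of variables on which the two objects differ."""
    p = len(x)                                     # total number of variables
    m = sum(1 for a, b in zip(x, y) if a == b)     # number of matches
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one binary variable per nominal state."""
    return [1 if value == state else 0 for state in states]


obj_i = ("red", "small", "round")   # hypothetical objects with 3 nominal variables
obj_j = ("red", "large", "round")
print(nominal_dissimilarity(obj_i, obj_j))
print(one_hot("blue", ["red", "yellow", "blue", "green"]))
```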

Ordinal Variables • An ordinal variable can be discrete or continuous • Order is important, e.g., rank • Can be treated like interval-scaled (f is a variable) • replace $x_{if}$ by its rank $r_{if} \in \{1, \dots, M_f\}$ • map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$ • compute the dissimilarity using methods for interval-scaled variables
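The rank mapping can be sketched as below, assuming ranks run from 1 to M_f and are rescaled as (r − 1) / (M_f − 1). The function name and the medal states are hypothetical.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Replace each ordinal value by its rank, then rescale the rank onto [0, 1]."""
    m_f = len(ordered_states)
    if m_f < 2:
        raise ValueError("need at least two ordered states")
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (m_f - 1) for v in values]


states = ["bronze", "silver", "gold"]          # hypothetical ordering, lowest first
z = ordinal_to_unit_interval(["gold", "bronze", "silver"], states)
print(z)
```

The rescaled values can then be fed to any interval-scaled distance, so that variables with many states do not outweigh variables with few.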

Ratio-Scaled Variables • Ratio-scaled variable: a measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$, e.g., growth of a bacteria population. • Methods: • treat them like interval-scaled variables: not a good idea! (why? the scale can be distorted) • apply logarithmic transformation $y_{if} = \log(x_{if})$ • treat them as continuous ordinal data and then treat their ranks as interval-scaled
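A small Python sketch shows why the logarithmic transformation helps. The growth parameters A = 100 and B = 0.5 are hypothetical; the point is only that equal time steps give unequal gaps on the raw scale but equal gaps after taking logs.

```python
import math


def log_transform(values):
    """Apply y = log(x) to ratio-scaled (strictly positive) values."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform needs strictly positive values")
    return [math.log(v) for v in values]


# Hypothetical population A * e^(B t) sampled at t = 0, 1, 2, 3.
population = [100.0 * math.exp(0.5 * t) for t in range(4)]
logged = log_transform(population)

raw_gaps = [b - a for a, b in zip(population, population[1:])]
log_gaps = [b - a for a, b in zip(logged, logged[1:])]
print(raw_gaps)   # gaps grow with t
print(log_gaps)   # gaps are (approximately) constant and equal to B
```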

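Variables of Mixed Types • A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio • One may use a weighted formula to combine their effects: $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$, where the weight $\delta_{ij}^{(f)}$ is 0 if variable f should be ignored for this pair (e.g., a missing value) and 1 otherwise • f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise • f is interval-based: use the normalized distance • f is ordinal or ratio-scaled: compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled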
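One way to combine the per-variable dissimilarities is sketched below. It is an illustration under stated assumptions, not a definitive formula: each variable contributes a value in [0, 1], a variable is skipped when either value is missing (`None`) or when an asymmetric binary variable is 0 for both objects, and interval variables are normalized by a known (min, max) range. Ordinal variables are assumed to be already mapped onto [0, 1]. All names are hypothetical.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Average the per-variable dissimilarities over the variables that count."""
    total, counted = 0.0, 0
    for f, kind in enumerate(kinds):
        xf, yf = x[f], y[f]
        if xf is None or yf is None:
            continue                                   # missing value: weight 0
        if kind == "asymmetric_binary" and xf == 0 and yf == 0:
            continue                                   # shared negative: weight 0
        if kind in ("nominal", "binary", "asymmetric_binary"):
            d_f = 0.0 if xf == yf else 1.0
        elif kind == "interval":
            low, high = ranges[f]
            d_f = abs(xf - yf) / (high - low)          # normalized distance
        else:
            raise ValueError(f"unknown variable kind: {kind}")
        total += d_f
        counted += 1
    if counted == 0:
        raise ValueError("no comparable variables")
    return total / counted


kinds = ("nominal", "asymmetric_binary", "interval")
ranges = {2: (20.0, 60.0)}                  # hypothetical range of the interval variable
obj_i = ("red", 1, 30.0)
obj_j = ("blue", 1, 50.0)
print(mixed_dissimilarity(obj_i, obj_j, kinds, ranges))
```

In this example the nominal variable contributes 1, the binary variable 0, and the interval variable 20/40, so the combined value is their average.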

Major Clustering Techniques • Partitioning algorithms: Construct various partitions and then evaluate them by some criterion • Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion • Density-based: based on connectivity and density functions • Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to that model.

Partitioning Algorithms: Basic Concept • Partitioning method: Construct a partition of a database D of n objects into a set of k clusters • Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion • Global optimum: exhaustively enumerate all partitions • Heuristic methods: k-means and k-medoids algorithms • k-means: Each cluster is represented by the center of the cluster • k-medoids or PAM (Partitioning Around Medoids): Each cluster is represented by one of the objects in the cluster

The K-Means Clustering • Given k, the k-means algorithm is as follows: 1) Choose k cluster centers to coincide with k randomly chosen points 2) Assign each data point to the closest cluster center 3) Recompute the cluster centers using the current cluster memberships 4) If a convergence criterion is not met, go to 2) • Typical convergence criteria are: no (or minimal) reassignment of data points to new cluster centers, or minimal decrease in squared error $E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$, where p is a point and $m_i$ is the mean of cluster $C_i$

Example • For simplicity, one-dimensional data and k = 2. • Data: 1, 2, 5, 6, 7 • K-means: • Randomly select 5 and 6 as initial centroids; • => Two clusters {1,2,5} and {6,7}; meanC1 = 8/3, meanC2 = 6.5 • => {1,2}, {5,6,7}; meanC1 = 1.5, meanC2 = 6 • => no change. • Aggregate dissimilarity (sum of squared distances to the cluster means) = 0.5^2 + 0.5^2 + 1^2 + 0^2 + 1^2 = 2.5
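The worked example can be replayed with a short one-dimensional k-means sketch. The function name `kmeans_1d` is hypothetical, and the initial centers are passed in explicitly (5 and 6, as in the example) rather than drawn at random, so the run is reproducible.

```python
def kmeans_1d(points, initial_centers, max_iter=100):
    """Plain k-means on 1-D data; stops when no point changes cluster."""
    centers = list(initial_centers)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the closest center for every point.
        new_assignment = [
            min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break                                  # converged: no reassignment
        assignment = new_assignment
        # Update step: each center becomes the mean of its members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                            # keep the old center if a cluster is empty
                centers[c] = sum(members) / len(members)
    squared_error = sum((p - centers[a]) ** 2 for p, a in zip(points, assignment))
    return centers, assignment, squared_error


centers, assignment, squared_error = kmeans_1d([1, 2, 5, 6, 7], [5, 6])
print(centers)         # [1.5, 6.0]
print(assignment)      # [0, 0, 1, 1, 1]
print(squared_error)   # 2.5
```

Starting from other initial centers may end in a different partition, which is the sensitivity to initial seeds that k-means is known for.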

Comments on K-Means • Strength: efficient, O(tkn), where n is # data points, k is # clusters, and t is # iterations. Normally, k, t << n. • Comment: Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms • Weakness • Applicable only when the mean is defined; difficult for categorical data • Need to specify k, the number of clusters, in advance • Sensitive to noisy data and outliers • Not suitable for discovering clusters with non-convex shapes • Sensitive to initial seeds

Variations of the K-Means Method • A few variants of the k-means which differ in • Selection of the initial k seeds • Dissimilarity measures • Strategies to calculate cluster means • Handling categorical data: k-modes • Replacing means of clusters with modes • Using new dissimilarity measures to deal with categorical objects • Using a frequency based method to update modes of clusters
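The two ingredients that k-modes swaps into k-means can be sketched as follows: a mismatch count in place of a numeric distance, and a per-attribute mode in place of a mean. This is a minimal sketch of those two pieces only, not a full k-modes implementation; the function names and records are hypothetical.

```python
from collections import Counter


def matching_dissimilarity(x, y):
    """Number of attributes on which two categorical objects differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def cluster_mode(records):
    """Most frequent value of each attribute across the cluster's records."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


cluster = [("red", "small"), ("red", "large"), ("blue", "small")]  # hypothetical cluster
mode = cluster_mode(cluster)
print(mode)
print([matching_dissimilarity(record, mode) for record in cluster])
```

Plugging these two functions into the assign and update steps of k-means gives the frequency-based variant described above.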

k-Medoids clustering method • The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data. • Medoid: the most centrally located point in a cluster, used as a representative point of the cluster. • In contrast, a centroid is not necessarily inside a cluster. • [Figure: an example showing initial medoids]

Partition Around Medoids • PAM: 1) Given k 2) Randomly pick k instances as initial medoids 3) Assign each data point to the nearest medoid x 4) Calculate the objective function: the sum of dissimilarities of all points to their nearest medoids (squared-error criterion) 5) Randomly select a point y 6) Swap x with y if the swap reduces the objective function 7) Repeat (3-6) until no change
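A compact sketch of the swap search follows. One deliberate difference from the steps above: instead of picking the candidate y at random, this version tries every (medoid, non-medoid) swap in turn and keeps any swap that lowers the objective, which makes the outcome easier to reproduce. The names `pam` and `total_cost` are hypothetical, and the objective is the plain sum of dissimilarities.

```python
import random


def total_cost(points, medoids, dist):
    """Sum of dissimilarities of all points to their nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def pam(points, k, dist, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)            # random initial medoids
    cost = total_cost(points, medoids, dist)
    improved = True
    while improved:                            # repeat until no swap helps
        improved = False
        for i in range(k):
            for y in points:
                if y in medoids:
                    continue
                candidate = medoids[:i] + [y] + medoids[i + 1:]
                candidate_cost = total_cost(points, candidate, dist)
                if candidate_cost < cost:      # keep the swap only if it is strictly better
                    medoids, cost, improved = candidate, candidate_cost, True
    return medoids, cost


data = [1, 2, 5, 6, 7]
medoids, cost = pam(data, k=2, dist=lambda a, b: abs(a - b))
print(sorted(medoids), cost)
```

The loop terminates because every accepted swap strictly lowers the cost and there are only finitely many medoid sets.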

Comments on PAM • PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean (why?) • PAM works well for small data sets but does not scale well for large data sets. • $O(k(n-k)^2)$ for each change, where n is the number of data points and k is the number of clusters • [Figure: a cluster with an outlier 100 units away]
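The "why?" can be explored numerically. The sketch below compares the mean and the medoid of a small one-dimensional cluster before and after adding a point 100 units away; the data values and function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def medoid(values):
    """The data point with the smallest total distance to all other points."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))


cluster = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = cluster + [105.0]       # 100 units beyond the largest member

print(mean(cluster), medoid(cluster))
print(mean(with_outlier), medoid(with_outlier))
```

The mean is dragged far outside the original cluster by the single extreme value, while the medoid must be an actual data point and stays among the original members.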

CLARA: Clustering Large Applications • CLARA: built into statistical analysis packages, such as S+ • It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output • Strength: deals with larger data sets than PAM • Weakness: • Efficiency depends on the sample size • A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased • There are other scale-up methods, e.g., CLARANS
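The sampling idea can be sketched as below. To keep the example short, an exhaustive search over medoid sets stands in for PAM on each sample (feasible only because the samples are tiny); the function names, sample sizes and data are hypothetical. The key point is that each sample's medoids are scored against the whole data set.

```python
import random
from itertools import combinations


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid (1-D data)."""
    return sum(min(abs(p - m) for m in medoids) for p in points)


def best_medoids(sample, k):
    """Stand-in for PAM: try every k-subset of a small sample."""
    return min(combinations(sample, k), key=lambda ms: total_cost(sample, ms))


def clara(points, k, n_samples=5, sample_size=10, seed=0):
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = best_medoids(sample, k)
        cost = total_cost(points, medoids)     # evaluate on the WHOLE data set
        if cost < best_cost:
            best, best_cost = medoids, cost
    return list(best), best_cost


data = list(range(0, 20)) + list(range(100, 120))   # two hypothetical groups
medoids, cost = clara(data, k=2)
print(sorted(medoids), cost)
```

If a sample happens to miss one of the groups, its medoids score badly on the full data set and are discarded in favour of a better sample, which is why several samples are drawn.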

Hierarchical Clustering • Use distance matrix for clustering. This method does not require the number of clusters k as an input, but needs a termination condition • [Figure: agglomerative clustering merges objects a, b, c, d, e step by step from Step 0 to Step 4; divisive clustering proceeds in the opposite direction, splitting the single cluster back into individual objects]

Agglomerative Clustering • At the beginning, each data point forms a cluster (also called a node). • Merge nodes/clusters that have the least dissimilarity. • Go on merging • Eventually all nodes belong to the same cluster
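The merge loop can be sketched in Python for one-dimensional data. The slide does not say how the dissimilarity between two clusters is measured, so this sketch assumes single linkage (distance between the closest pair of members); the function name and the `target_clusters` stopping condition are also assumptions.

```python
def agglomerative(points, target_clusters=1):
    """Repeatedly merge the two least dissimilar clusters (single linkage)."""
    clusters = [[p] for p in points]           # every point starts as its own cluster
    merges = []                                # record of (cluster_a, cluster_b, distance)
    while len(clusters) > target_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(x - y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters, merges


clusters, merges = agglomerative([1, 2, 5, 6, 7], target_clusters=2)
print(sorted(sorted(c) for c in clusters))
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

Running with `target_clusters=1` merges everything into one cluster, and the recorded merges are exactly the information a dendrogram displays.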

A Dendrogram Shows How the Clusters are Merged Hierarchically • Decompose data objects into several levels of nested partitioning (tree of clusters), called a dendrogram. • A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

Divisive Clustering • Inverse order of agglomerative clustering • Eventually each node forms a cluster on its own

More on Hierarchical Methods • Major weaknesses of agglomerative clustering methods • do not scale well: time complexity at least $O(n^2)$, where n is the total number of objects • can never undo what was done previously • Integration of hierarchical with distance-based clustering to scale up these clustering methods • BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters • CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction

Summary • Cluster analysis groups objects based on their similarity and has wide applications • Measures of similarity can be computed for various types of data • Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, etc. • Clustering can also be used for outlier detection, which is useful for fraud detection • What is the best clustering algorithm?

Other Data Mining Methods

Sequence analysis • Market basket analysis analyzes things that happen at the same time. • What about things that happen over time? E.g., if a customer buys a bed, he/she is likely to come back to buy a mattress later • Sequential analysis needs • A time stamp for each data record • Customer identification

Sequence analysis (cont …) • The analysis shows which items come before, after or at the same time as other items. • Sequential patterns can be used for analyzing cause and effect. Other applications • Finding cycles in association rules • Some association rules hold strongly in certain periods of time • E.g., every Monday people buy items X and Y together • Stock market prediction • Predicting possible failures in a network, etc.
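The bed-then-mattress pattern can be checked with a small Python sketch that uses exactly the two ingredients named above, a customer identifier and a time stamp per record. The record layout, the function name and the sample transactions are all hypothetical.

```python
from collections import defaultdict


def sequential_support(transactions, first_item, later_item):
    """Fraction of customers who buy first_item and, strictly later, later_item."""
    by_customer = defaultdict(list)
    for customer, timestamp, item in transactions:
        by_customer[customer].append((timestamp, item))
    hits = 0
    for events in by_customer.values():
        first_times = [t for t, item in events if item == first_item]
        later_times = [t for t, item in events if item == later_item]
        # Pattern holds if some purchase of later_item follows some purchase of first_item.
        if first_times and later_times and min(first_times) < max(later_times):
            hits += 1
    return hits / len(by_customer)


# (customer id, time stamp, item): hypothetical records
transactions = [
    ("c1", 1, "bed"), ("c1", 5, "mattress"),
    ("c2", 2, "mattress"), ("c2", 3, "bed"),
    ("c3", 1, "bed"), ("c3", 2, "pillow"),
    ("c4", 4, "bed"), ("c4", 9, "mattress"),
]
print(sequential_support(transactions, "bed", "mattress"))
```

Customer c2 buys both items but in the opposite order, so only c1 and c4 count toward the pattern; a market basket analysis without time stamps could not tell these cases apart.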

Discovering holes in data • Holes are empty (sparse) regions in the data space that contain few or no data points. Holes may represent impossible value combinations in the application domain. • E.g., in a disease database, we may find that certain test values and/or symptoms do not go together, or that when a certain medicine is used, some test value never goes beyond a certain range. • Such information could lead to a significant discovery: a cure for a disease or some biological law.
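One simple way to look for holes, offered here as an assumption rather than as the method the slide has in mind, is to lay a grid over the data space and list the cells that contain no points. The sketch below does this for two-dimensional data on a square region; the function name, grid size and points are hypothetical.

```python
from itertools import product


def empty_cells(points, bins, low, high):
    """Return the grid cells (column, row) that contain no data point."""
    width = (high - low) / bins
    occupied = set()
    for x, y in points:
        # Clamp so that a value equal to `high` falls into the last cell.
        col = min(int((x - low) / width), bins - 1)
        row = min(int((y - low) / width), bins - 1)
        occupied.add((col, row))
    all_cells = set(product(range(bins), repeat=2))
    return sorted(all_cells - occupied)


points = [(0.1, 0.1), (0.2, 0.8), (0.9, 0.9)]     # hypothetical 2-D records
print(empty_cells(points, bins=2, low=0.0, high=1.0))
```

Here the cell with a high first value and a low second value is empty, which would be a candidate "impossible combination" to show to a domain expert. A finer grid finds smaller holes but needs more data to avoid reporting cells that are empty by chance.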

Data and pattern visualization • Data visualization: Use computer graphics to reveal the patterns in data: 2-D and 3-D scatter plots, bar charts, pie charts, line plots, animation, etc. • Pattern visualization: Use a good interface and graphics to present the results of data mining: rule visualizer, cluster visualizer, etc.

Scaling up data mining algorithms • Adapt data mining algorithms to work on very large databases. • Data reside on hard disk (too large to fit in main memory) • Make fewer passes over the data • Quadratic algorithms are too expensive • Many data mining algorithms are quadratic, especially clustering algorithms.
