  1. DSCI 4520/5240 (DATA MINING), Lecture 9: Clustering Analysis. Some slide material taken from: Marakas 2003, Han & Kamber, Olson & Shi, SAS Education

  2. Objectives • Overview of K-Means Unsupervised Clustering • Measures of Similarity and Distance • Clustering using SAS Enterprise Miner

  3. Introduction to Clustering • Cluster: a collection of data objects • High similarity among objects in the same cluster • Dissimilarity among objects in different clusters • Clustering is an Unsupervised Classification technique: no pre-determined classes • Typical applications of clustering: • As a stand-alone analysis, to gain insight into the data • As a pre-processing step for other predictive models

  4. Unsupervised Classification • Unsupervised Classification (Clustering) has an UNKNOWN TARGET • Training data before clustering: case 1: inputs, ? • case 2: inputs, ? • case 3: inputs, ? • case 4: inputs, ? • case 5: inputs, ? • Training data after clustering: case 1: inputs, cluster 1 • case 2: inputs, cluster 3 • case 3: inputs, cluster 2 • case 4: inputs, cluster 1 • case 5: inputs, cluster 2 • New cases are then assigned to the discovered clusters

  5. Clustering Applications • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs • Land use: Identification of areas of similar land use in an earth observation database • Insurance: Identifying groups of motor insurance policy holders with a high average claim cost • City planning: Identifying groups of houses according to their house type, value, and geographical location • Earthquake studies: Observed earthquake epicenters should be clustered along continental faults

  6. Types of data in clustering analysis • Interval-scaled variables • Binary variables • Nominal, ordinal, and ratio variables • Variables of mixed types

  7. Interval-valued variables • Standardize the data • Calculate the mean absolute deviation: s_f = (1/n)(|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|), where m_f = (1/n)(x_1f + x_2f + … + x_nf) • Calculate the standardized measurement (z-score): z_if = (x_if − m_f) / s_f • Using the mean absolute deviation is more robust than using the standard deviation, because deviations from the mean are not squared, so outliers have less influence
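
A minimal Python sketch of this standardization; the function name and sample values are illustrative, not from the slides:

```python
def standardize(values):
    """Standardize interval-scaled values using the mean absolute
    deviation, which is more robust to outliers than the standard
    deviation because deviations are not squared."""
    n = len(values)
    m = sum(values) / n                         # mean m_f
    s = sum(abs(x - m) for x in values) / n     # mean absolute deviation s_f
    return [(x - m) / s for x in values]        # z-scores z_if

# the outlier 50 inflates s less than it would inflate a standard deviation
print(standardize([10, 12, 11, 13, 50]))
```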

  8. Similarity and Dissimilarity Between Objects • Distances are normally used to measure the similarity or dissimilarity between two data objects • A popular one is the Minkowski distance: d(i,j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q), where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and q is a positive integer • For q = 1, we get the MANHATTAN DISTANCE • For q = 2, we get the EUCLIDEAN DISTANCE
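
A short Python sketch of the Minkowski distance as defined above; the function name and test points are illustrative, not from the slides:

```python
def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional points x and y."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

x, y = (1.0, 2.0), (4.0, 6.0)
print(minkowski(x, y, 1))  # q = 1, Manhattan: |1-4| + |2-6| = 7.0
print(minkowski(x, y, 2))  # q = 2, Euclidean: sqrt(9 + 16) = 5.0
```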

  9. Similarity and Dissimilarity Between Objects • If q = 1, d is the Manhattan distance • For two points (U1, V1) and (U2, V2) in the plane: L1 = |U1 − U2| + |V1 − V2| • [Figure: the L1 distance is the length of the axis-parallel path between the two points]

  10. Similarity and Dissimilarity Between Objects (Cont.) • If q = 2, d is the Euclidean distance • For two points (U1, V1) and (U2, V2) in the plane: L2 = ((U1 − U2)^2 + (V1 − V2)^2)^(1/2) • Properties of a distance metric: • d(i,j) ≥ 0 • d(i,i) = 0 • d(i,j) = d(j,i) • d(i,j) ≤ d(i,k) + d(k,j)

  11. Binary Variables • A contingency table for binary data, counting over all binary variables how often objects i and j take each pair of values:

                  Object j
                   1      0
   Object i   1    a      b
              0    c      d

  • Simple matching coefficient (invariant, if the binary variable is symmetric): d(i,j) = (b + c) / (a + b + c + d) • Jaccard coefficient (noninvariant if the binary variable is asymmetric): d(i,j) = (b + c) / (a + b + c)

  12. Dissimilarity between Binary Variables • Example: patient records • gender is a symmetric attribute • the remaining attributes (test results) are asymmetric binary • let the values Y and P be set to 1, and the value N be set to 0 • the dissimilarity between patients is then the Jaccard coefficient computed over the asymmetric attributes
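
To make the encoding concrete, here is a small Python sketch with hypothetical patient records; the names and test values are illustrative, not the original slide's table:

```python
# Y (yes) and P (positive) map to 1, N (no/negative) maps to 0
ENCODE = {"Y": 1, "P": 1, "N": 0}

# hypothetical asymmetric attributes only (gender, being symmetric, is excluded)
jack = [ENCODE[v] for v in ("Y", "N", "P", "N", "N", "N")]
mary = [ENCODE[v] for v in ("Y", "N", "P", "N", "P", "N")]

def jaccard_dissimilarity(i, j):
    """d(i,j) = (b + c) / (a + b + c); negative matches (d) are ignored,
    which is appropriate for asymmetric binary variables."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    return (b + c) / (a + b + c)

print(jaccard_dissimilarity(jack, mary))  # (0 + 1) / (2 + 0 + 1) ≈ 0.33
```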

  13. Nominal Variables • A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green • Method 1: simple matching: d(i,j) = (p − m) / p, where m = # of matches and p = total # of variables • Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states
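
A minimal Python sketch of Method 1 (simple matching); the function name and example states are illustrative, not from the slides:

```python
def nominal_dissimilarity(i, j):
    """d(i,j) = (p - m) / p, where m counts matching variables."""
    p = len(i)
    m = sum(1 for x, y in zip(i, j) if x == y)
    return (p - m) / p

print(nominal_dissimilarity(("red", "small", "round"),
                            ("red", "large", "round")))  # (3 - 2) / 3 ≈ 0.33
```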

  14. Ordinal Variables • An ordinal variable can be discrete or continuous • Order is important, e.g., rank • Can be treated like interval-scaled variables: • replace x_if by its rank r_if ∈ {1, …, M_f} • map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by z_if = (r_if − 1) / (M_f − 1) • compute the dissimilarity using methods for interval-scaled variables
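
A short Python sketch of the rank-to-[0, 1] mapping above; the function name and the 4-level rating scale are illustrative, not from the slides:

```python
def ordinal_to_interval(ranks, M):
    """Map ranks r_if in {1, ..., M} onto [0, 1] via z = (r - 1) / (M - 1)."""
    return [(r - 1) / (M - 1) for r in ranks]

# e.g., ratings on a 1-4 ordinal scale (poor < fair < good < excellent)
print(ordinal_to_interval([1, 2, 3, 4], M=4))  # [0.0, 0.33..., 0.66..., 1.0]
```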

  15. Similarity between documents: The Vector Space Model (VSM) • Document (text) classification is made possible through calculations of VSM-based document similarities • The same similarity metric is used by search engines to calculate similarity between query texts and retrieved documents • Every document is represented as a sum vector of its index terms • The cosine of the angle between vectors determines relevance: sim(d1, d2) = (d1 · d2) / (||d1|| ||d2||) • [Figure: documents plotted as vectors from the origin in a space whose axes are terms i and j; similar documents point in similar directions]
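
A minimal Python sketch of VSM cosine similarity; the term-count vectors are illustrative, not from the slides:

```python
import math

def cosine_similarity(d1, d2):
    """cos(theta) = (d1 . d2) / (||d1|| * ||d2||) for term-frequency vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

# counts of the same three index terms in two documents
doc1 = [3, 0, 1]
doc2 = [2, 1, 1]
print(cosine_similarity(doc1, doc2))  # ≈ 0.90: close to 1, so similar documents
```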

  16. The K-Means Clustering Method • Given k, the k-means algorithm is implemented in the following steps (Olson & Shi, p. 75): • 1. Select the desired number of clusters k • 2. Select k initial observations as seeds • 3. Calculate average cluster values (cluster centroids) over each variable (for the initial iteration, these are simply the seed observations) • 4. Assign each of the other training observations to the cluster with the nearest centroid • 5. Recalculate cluster centroids (averages) based on the assignments from step 4 • 6. Iterate between steps 4 and 5, stopping when no observations change cluster (a minimal sketch follows below)
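
The following Python sketch implements these six steps under stated assumptions: numeric input vectors, Euclidean distance, and randomly sampled seed observations. It illustrates the algorithm only; it is not the SAS Enterprise Miner implementation, and all names are ours:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance (sufficient for nearest-centroid tests)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(cluster):
    """Per-variable average of a cluster's points: the centroid."""
    n = len(cluster)
    return tuple(sum(dim) / n for dim in zip(*cluster))

def kmeans(points, k, max_iter=100):
    centroids = random.sample(points, k)          # steps 1-3: k seed observations
    for _ in range(max_iter):
        # step 4: assign each observation to the cluster with the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[nearest].append(p)
        # step 5: recalculate centroids from the new assignments
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # step 6: stop once centroids stop moving (no new assignments)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

random.seed(1)  # reproducible illustration
data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5)]
centroids, clusters = kmeans(data, k=2)
print(centroids)
```

As the next slide notes, the result depends on the initial seeds: the algorithm converges to a local optimum, so it is common practice to rerun it with different seeds.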

  17. The K-Means Clustering Method • Example

  18. Comments on the K-Means Method • Strength • Relatively efficient: O(tkn), where n is the # of objects, k the # of clusters, and t the # of iterations; normally k, t << n • Often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms • Weakness • Applicable only when a mean is defined, which is problematic for categorical data • Need to specify k, the number of clusters, in advance • Unable to handle noisy data and outliers well • Not suitable for discovering clusters with non-convex shapes

  19. Clustering in SAS Enterprise Miner (EM)

  20. The Scenario • The goal is to segment potential customers based on geographic and demographic attributes. • Known attributes include such things as age, income, marital status, gender, and home ownership.

  21. PROSPECT: The scenario • A catalog company periodically purchases lists of prospects from outside sources. They want to design a test mailing to evaluate the potential response rates for several different products. Based on their experience, they know that customer preference for their product depends on geographic and demographic factors. Consequently, they want to segment the prospects into groups that are similar to each other with respect to these attributes. • After the prospects have been segmented, a random sample of prospects within each segment will be mailed one of several offers. The results of the test campaign will allow the analyst to evaluate the potential profit for each segment.

  22. PROSPECT data set
