
Clustering of Non-Quantitative Data


Presentation Transcript


  1. Nanjing University of Science & Technology Clustering of Non-Quantitative Data Lonnie C. Ludeman New Mexico State University Las Cruces, New Mexico, USA Nov 23, 2005

  2. Like to Thank Nanjing University of Science & Technology Department of Computer Science especially Lu Jian-feng Yang Jing-yu Wang Han for inviting me to come to NUST

  3. Also like to Thank two special students Wang Qiong Wang Huan for their helpfulness and kindness during my tenure at NUST

  4. Consider the problem of clustering a standard American deck of playing cards. Two Possible Solutions

  5. Two More Possible Solutions Clustering of Non-Quantitative Data

  6. Clustering is the art of grouping together pattern vectors that in some sense belong together because they have similar characteristics and are somehow different from other pattern vectors. In the most general problem the number of clusters or subgroups is unknown, as are the properties that make them similar.

  7. Lecture Topics • General Concept of Clustering • Define Types of Data • Review the K-Means Clustering Algorithm • Describe Clustering Method for non-Quantitative Data • Present two examples illustrating the method • Discuss advantages and disadvantages of the method

  8. Mathematical Formalization of the Clustering Problem. Given a set S of N_S n-dimensional pattern vectors: S = { x_j ; j = 1, 2, ..., N_S }. Clustering is the process of partitioning S into M subsets Cl_k, k = 1, 2, ..., M, called clusters, that satisfy the following conditions.

  9. Properties of a Clustering Partition. 1. The members in each subset are in some sense similar to one another and not similar to members of the other subsets. 2. Cl_k ≠ Φ (not empty), where Φ is the null set. 3. Cl_k ∩ Cl_j = Φ for k ≠ j (pairwise disjoint). 4. ∪_{k=1}^{M} Cl_k = S (exhaustive).

  10. Illustration of Clusters and Cluster centers

  11. For quantitative data we use measures of similarity and dissimilarity between pattern samples and clusters. The Euclidean distance between two pattern vectors x and y is d(x, y) = ( Σ_i (x_i − y_i)² )^(1/2). The smaller the distance, the larger the similarity.
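As a minimal sketch of the distance just described (function name our own):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two pattern vectors (sequences of numbers)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

For example, `euclidean((0, 0), (3, 4))` returns 5.0.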

  12. A Few Methods for Clustering of Quantitative Data: 1. K-Means Clustering Algorithm 2. Hierarchical Clustering Algorithm 3. ISODATA Clustering Algorithm 4. Fuzzy Clustering Algorithm

  13. K-Means Clustering Algorithm: Basic Procedure. Randomly select K cluster centers from the pattern space or the data set. Distribute the set of patterns among the cluster centers using minimum distance. Compute a new cluster center for each cluster. Continue this process until the cluster centers do not change.
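The steps above can be sketched in pure Python as follows. This is an illustrative sketch, not code from the lecture; function and variable names are our own, and points are tuples of floats.

```python
import random

def kmeans(points, k, max_iter=100):
    """Minimal K-means sketch over a list of numeric tuples."""
    centers = random.sample(points, k)  # initial centers drawn from the data set
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assign each point to its nearest center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[idx].append(p)
        # Recompute each center as the mean of its cluster (keep old center if empty).
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # centers unchanged: converged
            break
        centers = new_centers
    return centers, clusters
```

Note that here the centers are averages, so they need not be members of the data set; the modification for qualitative data described later changes exactly this step.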

  14. Three Main Types of Data 1. Quantitative 2. Qualitative or categorical 3. Mixed Quantitative and Qualitative The first two types can be further broken down into special categories as follows

  15. Quantitative Data

  16. Qualitative or Categorical Data

  17. Nominal Non-Binary: Examples. Color of eyes: { blue, brown, black, green, gray } Car owned: { Ford, Toyota, Kia, Renault, Mercedes } Nominal Binary: Examples. Answer to a true/false question: { True, False } Position of a switch: { on, off }

  18. Linearly Ordered (Ordinal) Qualitative. Answer to a question about a person’s health: { excellent, very good, average, fair, poor } Hierarchically or Structurally Ordered Qualitative. Answer to type of figure: { rectangle, triangle, hexagon, circle, ellipse }

  19. Hierarchical Structured Qualitative Data- Answer to type of figure

  20. Lattice Structurally Ordered Qualitative Answer to type of education { Elementary School, High school, Apprenticeship, Undergraduate school, on the job training, Graduate school, post graduate school }

  21. Measure of Performance for Clustering of Categorical Data. The overall performance measure J for a given set of clusters Cl_k, k = 1, 2, ..., K, is J = Σ_{k=1}^{K} Σ_{x_i ∈ Cl_k} ||| x_i − M_k |||, where M_k is the kth cluster representative vector and ||| · ||| is the measure of distance for data vectors.

  22. We wish to minimize J by the selection of the representative element of each cluster and the members of each cluster (the partition of the data set). This overall performance measure J can be minimized by a two-stage iterative process similar to the steps of the standard K-means algorithm.

  23. Proposed Modified K-Means Clustering for Qualitative Data: Basic Procedure. Randomly select K cluster centers from the pattern space or the data set. Distribute the set of patterns among the cluster centers using minimum distance. Compute a new cluster center for each cluster. Continue this process until the cluster centers do not change.

  24. Proposed Modified K-Means Algorithm (Step 0) Selection of Initial Cluster Centers. There are many ways to select the initial cluster centers. Perhaps the simplest way is to select a set of sequences in the data set randomly.

  25. (Step 1) Redistribution of Sequences to Cluster Centers We have chosen to redistribute each sequence to the cluster center that is its nearest neighbor. Thus, each vector is assigned to the closest cluster center where closest is with respect to some predefined distance measure.

  26. (Step 2) Selection of Cluster Centers. Choose the sequence in the cluster that has the smallest sum of distances to all other sequences in the cluster; resolve ties randomly. Because the cluster center is selected in this way, it is always a member of the data set. This contrasts with the standard K-means algorithm for numerical data, where the cluster center, being the average of the points in each cluster, is not necessarily a member of the original data set. Steps (1) and (2) are repeated until the cluster centers do not change.
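Steps (0) through (2) can be sketched as follows. This is an illustrative implementation, not the lecture's own code: `dist` stands for any predefined distance measure on the data, names are our own, and ties in Step 2 are broken by first occurrence here rather than randomly.

```python
import random

def modified_kmeans(data, k, dist, max_iter=100, seed=None):
    """Modified K-means sketch for qualitative data: every cluster
    center is a member of the data set, chosen as the element with
    the smallest sum of distances to the rest of its cluster."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)          # Step 0: random initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 1: assign each sequence to its nearest cluster center.
        clusters = [[] for _ in range(k)]
        for x in data:
            clusters[min(range(k), key=lambda i: dist(x, centers[i]))].append(x)
        # Step 2: new center = member with the smallest row sum of distances
        # (ties broken by first occurrence; keep the old center if a cluster empties).
        new_centers = [
            min(cl, key=lambda x: sum(dist(x, y) for y in cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:          # centers unchanged: stop
            break
        centers = new_centers
    return centers, clusters
```

A usage sketch: with `dist` a simple symbol-mismatch count over equal-length strings, `modified_kmeans(["AAA", "AAB", "BBB", "BBA"], 2, dist)` returns two centers, each guaranteed to be one of the four input strings.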

  27. Examples Using the Proposed Clustering Algorithm. Example 1: Structural data with missing components. Example 2: Archaeological sequential data.

  28. Example 1: Sequential data with missing components. Letting b denote an unknown symbol, the data above can be rewritten accordingly. Use the modified K-means clustering algorithm to obtain two clusters of the data set.

  29. Solution: First define a measure of distance between members of the data set (a subjective assignment). Next, randomly select cluster centers.
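The distance table on this slide is a subjective assignment and is not reproduced in the transcript. As one hypothetical sketch of such a measure, a positionwise distance in which any comparison involving the unknown symbol b contributes a half unit might look like this (the half-unit cost and names are our assumptions, not the lecture's):

```python
def seq_distance(x, y, unknown="b", unknown_cost=0.5):
    """Positionwise distance between equal-length symbol sequences.
    A position where either symbol is the unknown marker contributes
    unknown_cost; a plain mismatch contributes 1; a match, 0.
    Illustrative assumption only -- the lecture's table is subjective."""
    total = 0.0
    for a, c in zip(x, y):
        if a == unknown or c == unknown:
            total += unknown_cost
        elif a != c:
            total += 1.0
    return total
```

For example, `seq_distance("RLb", "RRb")` gives 0 + 1 + 0.5 = 1.5 under these assumptions.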

  30. Redistribute the samples to the cluster centers. Using the defined distance measure, assign each sample to the nearest cluster center. This yields the following new trial clusters.

  31. Determine the new cluster center for Cluster 1 using the minimum row sum. Since row three has the smallest sum, the new Cluster 1 center becomes

  32. Determine the new cluster center for Cluster 2 using the minimum row sum. Since row two has the smallest sum, the new Cluster 2 center becomes

  33. Redistribute to obtain

  34. Determine the new cluster center for Cluster 1 using the minimum row sum. Since row three has the smallest sum, the new Cluster 1 center becomes

  35. Determine the new cluster center for Cluster 2 using the minimum row sum. These are the same cluster centers as in the previous iteration; thus the final clustering becomes

  36. Tree of possible sequences for Example 1

  37. Example 2: Clustering of Archaeological Data. Depositional sequences for two samples (top to bottom): Sample #1: sherds, general fill, whole pots, roof fall, wall fall, floor artifacts. Sample #2: general fill, sherds, roof fall, wall fall, whole pots, floor artifacts.

  38. Archaeological Categorical Sequential Data. Euclidean distance is no longer a meaningful, or for that matter a computable, measure of distance. Thus, new intra-set distance measures must be defined, as well as different methods for selecting representative elements or cluster centers. Techniques for clustering different-size vectors, or vectors containing sequential relationships, have not received much attention from researchers, perhaps because software that conveniently handles this type of problem is limited.

  39. The following code was set up for this example:
  N = floors with few or no artifacts (<10)
  M = floors with many artifacts (>10)
  U = layer of unburned roofing material
  B = layer of burned roofing
  T = refuse
  S = deposits of aeolian sand
  D = detritus from the cave roof
  ? = unknown deposits

  40. Given the Broken Flute Cave Strata Data Set Find four clusters that characterize the data

  41. Solution: Define a distance measure as the minimum weighted number of changes or steps required to transform one stratigraphic sequence into another. Transformation rules: (1) addition or deletion of a stratum, (2) change in kind of stratum, and (3) reversal of order of strata. Use differentially weighted measures for the transformations as follows.

  42. Weights for various transformations. 1) The addition or deletion of a stratum, e.g., adding sand (S) or deleting trash (T), was weighted the least; such transformations were assigned a distance of 1 unit. 2) Changes in kind, e.g., burned roofing (B) vs. unburned roofing (U) strata, were more heavily weighted; each was assigned a distance of 1.5 units.

  43. 3) Had we encountered reversals of order, e.g., burned roofing over sand (B S) vs. sand deposited over burned roofing (S B), we would have weighted them heaviest, assigning each a distance of 2 units. Reversals are weighted most heavily because they represent a significantly different behavioral and depositional sequence.
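Under these weights, the distance between two stratigraphic sequences can be sketched as a weighted edit distance. This is an illustrative implementation, not the lecture's code; with addition/deletion cost 1 and change-in-kind cost 1.5, an adjacent reversal of order comes out at the stated 2 units, as one deletion plus one insertion.

```python
def strat_distance(s1, s2, indel=1.0, change=1.5):
    """Weighted edit distance between stratigraphic sequences
    (strings of stratum codes). Addition/deletion of a stratum
    costs `indel`; a change in kind (substitution) costs `change`."""
    m, n = len(s1), len(s2)
    # dp[i][j] = minimum cost of transforming s1[:i] into s2[:j]
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * indel
    for j in range(1, n + 1):
        dp[0][j] = j * indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (0.0 if s1[i - 1] == s2[j - 1] else change)
            dp[i][j] = min(sub, dp[i - 1][j] + indel, dp[i][j - 1] + indel)
    return dp[m][n]
```

For example, `strat_distance("NSBT", "NSB")` is 1.0 (delete one stratum), `strat_distance("NSB", "NSU")` is 1.5 (change in kind), and `strat_distance("BS", "SB")` is 2.0 (reversal of order).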

  44. Distances Between Stratigraphic Sequences at Broken Flute Cave

  45. Selection of Initial Cluster Centers We chose the four initial cluster centers randomly as x11, x5 , x9, x8a .

  46. Continuing with the iterations until convergence gives the final four clusters as Cl1 = { x11 } = { NSUT} Cl2 = { x5, x8a } = { NSBT, NSB} Cl3 = { x6, x7, x8, x9 } = { MBT, MBT, MBT, MBTD} Cl4 = { x1, x2, x3, x4, x12 } = {NB, NB, NBT, NBT, NBT} We will not take time to discuss the archaeological significance of this result.

  47. Advantages of the proposed method. 1. Can obtain clusters for non-quantitative data typical of archaeological data obtained in field work. 2. Since the new center of each cluster is always a member of the data set, distances between samples need be computed only once and recalled from memory when needed, reducing computation. 3. Rerunning the algorithm provides different interpretations of the data. 4. Some structural information is provided by the resulting cluster centers.
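Advantage 2 can be realized by computing the N(N-1)/2 pairwise distances once and looking them up thereafter. A minimal sketch (names are our own, not from the lecture):

```python
from itertools import combinations

def pairwise_distances(data, dist):
    """Precompute all N(N-1)/2 pairwise distances once, keyed by
    ordered index pair (i, j) with i < j."""
    return {(i, j): dist(data[i], data[j])
            for i, j in combinations(range(len(data)), 2)}

def lookup(table, i, j):
    """Symmetric lookup into the precomputed table."""
    if i == j:
        return 0.0
    return table[(i, j)] if i < j else table[(j, i)]
```

This is also the storage cost cited under disadvantage 5 below: the table holds exactly N(N-1)/2 entries.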

  48. Disadvantages of the proposed method. 1. Can converge to a local minimum. 2. Must be run several times with different random initial cluster centers. 3. Results depend on the subjective distance measures used. 4. Will always find clusters, whether or not physical clusters exist. 5. A very large data set (N data points) requires storage of N(N-1)/2 distances or recalculation of distances.

  49. Lecture Summary • Discussed the General Concept of Clustering • Presented definitions of Types of Data • Reviewed the K-Means Clustering Algorithm • Described a Clustering Method for non-Quantitative Data • Presented two examples illustrating the method • Discussed advantages and disadvantages of the proposed clustering method

  50. Thank you for your attention and I am happy to answer any questions you might have regarding this presentation.
