ROCK: A ROBUST CLUSTERING ALGORITHM FOR CATEGORICAL ATTRIBUTES




    1. ROCK: A ROBUST CLUSTERING ALGORITHM FOR CATEGORICAL ATTRIBUTES
    Presentation by: Toriola Olusegun, Panagiotis Tsiatsis, George Veioglanis, Richard Tomsett

    2. Outline
    Introduction
    Related Work
    Clustering Paradigm
    The ROCK Clustering Algorithm
    Experimental Results
    Conclusion & Remarks

    3. Introduction
    Clustering: the grouping of objects into different sets or, more precisely, the partitioning of a data set into subsets (clusters) so that the data in each subset (ideally) share some common trait, often proximity according to some defined distance measure.
    Types (traditional clustering algorithms):
    - Hierarchical: find successive clusters using previously established clusters, either agglomerative ("bottom-up") or divisive ("top-down")
    - Partitional: determine all clusters at once
    Common distance functions: Euclidean distance, Manhattan distance, Mahalanobis distance

    4. Market Basket Data
    A transaction represents one customer, and each transaction contains the set of items purchased by that customer.
    Used to cluster customers so that customers with similar buying patterns fall in the same cluster.
    Uses: characterizing different customer groups; targeted marketing; predicting the buying patterns of new customers from their profiles.
    A market basket database is a scenario where the attributes of data points are non-numeric: each transaction is viewed as a record with Boolean attributes, one per item (TRUE if the transaction contains the item, FALSE otherwise). Boolean attributes are a special case of categorical attributes; a sketch of this encoding follows.
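A minimal sketch of the Boolean encoding just described; the item universe, the transactions (taken from slide 6), and the helper name `to_boolean` are all illustrative:

```python
# Encode market basket transactions as Boolean attribute vectors.
items = [1, 2, 3, 4, 5, 6]                                # item universe
transactions = [{1, 2, 3, 5}, {2, 3, 4, 5}, {1, 4}, {6}]  # one set per customer

def to_boolean(txn, items):
    """One Boolean attribute per item: True iff the transaction contains it."""
    return [item in txn for item in items]

vectors = [to_boolean(t, items) for t in transactions]
# [[True, True, True, False, True, False], [False, True, True, True, True, False], ...]
```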

    5. Criterion Function
    Given n data points in a d-dimensional space, a clustering algorithm partitions the data points into k clusters.
    Partitional algorithms divide the point space into k clusters that optimize a certain criterion function.
    A commonly used criterion function for metric spaces is the squared Euclidean distance of points to their cluster centroids:

    E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2

    where m_i is the centroid of cluster C_i and d(p, m_i) is the Euclidean distance between p and m_i.
    The criterion function E attempts to minimize the distance of every point from the mean of the cluster to which the point belongs.
    Another approach is an iterative hill-climbing technique.
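A sketch of computing E for a given partition; the helpers `centroid`, `euclidean` and `criterion_E` are my own illustrative names, not from the slides:

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def criterion_E(clusters):
    """Sum over clusters of squared point-to-centroid distances."""
    total = 0.0
    for c in clusters:
        m = centroid(c)
        total += sum(euclidean(p, m) ** 2 for p in c)
    return total
```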

    6. Shortcomings of Traditional Clustering Algorithms
    Consider 4 transactions over the items 1, 2, 3, 4, 5 and 6: (a) {1,2,3,5}, (b) {2,3,4,5}, (c) {1,4}, (d) {6}.
    View the transactions as points with Boolean (0/1) attributes corresponding to false/true.
    Representation: (1,1,1,0,1,0), (0,1,1,1,1,0), (1,0,0,1,0,0) and (0,0,0,0,0,1).
    The distance between the first two points is √2, the smallest distance between any pair of points, so a centroid-based hierarchical algorithm merges them.
    The centroid of the new merged cluster is (0.5,1,1,0.5,1,0).
    In the next step, the third and fourth points are merged, since the distance between them, √3, is less than the distance of the centroid of the merged cluster from each of the other points.
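A quick numeric check of this example (illustrative sketch):

```python
import math
from itertools import combinations

points = [(1,1,1,0,1,0), (0,1,1,1,1,0), (1,0,0,1,0,0), (0,0,0,0,0,1)]

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Pairwise distances: points 0 and 1 are closest (sqrt(2) ~ 1.41).
for (i, p), (j, q) in combinations(enumerate(points), 2):
    print(i, j, round(dist(p, q), 2))

# Centroid after merging points 0 and 1:
centroid = tuple((a + b) / 2 for a, b in zip(points[0], points[1]))
print(centroid)                                # (0.5, 1, 1, 0.5, 1, 0)
print(round(dist(points[2], points[3]), 2))    # sqrt(3) ~ 1.73, merged next
```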

    7. Market Basket Analysis (1)
    As clusters merge, the number of attributes appearing in the mean goes up, while the value of each entry in the mean goes down.
    It becomes difficult to distinguish between two points that differ on a few attributes and two points that differ on every attribute by small amounts.
    Example: consider the means of two clusters, (1/3,1/3,1/3,0,0,0) and (0,0,0,1/3,1/3,1/3), each with roughly the same number of points.
    Though the clusters have no attributes in common, the distance between the two means is less than the distance of the point (1,1,1,0,0,0) to the mean of the first cluster.

    8. Market Basket Analysis (2)
    This is undesirable, since the point shares common attributes with the first cluster.
    An oblivious method based on distance will merge the two clusters, generating a new cluster with mean (1/6,1/6,1/6,1/6,1/6,1/6).
    Interestingly, the distance of the point (1,1,1,0,0,0) to the new cluster is even larger!
    Such centres tend to spread out over all the attribute values and lose the information about the points in the cluster that they represent.
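The distances in this two-slide example are easy to verify (sketch):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

mean1 = (1/3, 1/3, 1/3, 0, 0, 0)
mean2 = (0, 0, 0, 1/3, 1/3, 1/3)
merged = (1/6,) * 6
point = (1, 1, 1, 0, 0, 0)

print(round(dist(mean1, mean2), 3))   # ~0.816: the two means look "close"
print(round(dist(point, mean1), 3))   # ~1.155: farther than the means are apart
print(round(dist(point, merged), 3))  # ~1.472: even farther after the merge
```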

    9. Market Basket Analysis (3)
    For document clustering, the Jaccard coefficient (JC) has been used as the similarity measure instead of Euclidean distance: JC(T_1, T_2) = |T_1 ∩ T_2| / |T_1 ∪ T_2|.
    Consider a market basket database over the items 1 through 7, and the following two transaction clusters, the first defined by 5 items and the second by 4:
    <1,2,3,4,5>
    <1,2,6,7>

    10. Market Basket Analysis (4)
    The JC between an arbitrary pair of transactions belonging to the first cluster ranges from 0.2 (e.g. {1,2,3} and {3,4,5}) to 0.5 (e.g. {1,2,3} and {1,2,4}).
    Note that although {1,2,3} and {1,2,7} share common items and have a high JC of 0.5, they belong to different clusters.
    Contrast this with {1,2,3} and {3,4,5}, which have the lower JC of 0.2 but belong to the same cluster.
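A sketch of the coefficient and the slide's three comparisons:

```python
def jaccard(t1, t2):
    """Jaccard coefficient: |intersection| / |union| of two item sets."""
    return len(t1 & t2) / len(t1 | t2)

print(jaccard({1, 2, 3}, {3, 4, 5}))  # 0.2 -- same cluster, low similarity
print(jaccard({1, 2, 3}, {1, 2, 4}))  # 0.5 -- same cluster, high similarity
print(jaccard({1, 2, 3}, {1, 2, 7}))  # 0.5 -- different clusters, equally high
```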

    11. Related Work for Numeric Data
    Traditional clustering algorithms are suited to numeric attributes rather than categorical attributes.
    CLARANS employs a randomized search to find the k best cluster medoids.
    BIRCH first pre-clusters the data and then uses a centroid-based hierarchical algorithm to cluster the partial clusters.
    The CURE algorithm uses a combination of random sampling and partitioning to handle large databases.
    DBSCAN, a density-based algorithm, grows clusters by including the dense neighbourhoods of points already in the cluster. This approach, however, may be prone to errors if the clusters are not well separated.

    17. Criterion Function
    Each cluster has to have a high degree of connectivity: maximize the sum of link(p_q, p_r) for data point pairs p_q, p_r belonging to a single cluster and, at the same time, minimize the sum of link(p_q, p_s) for p_q, p_s in different clusters.
    Heuristic: the total number of links in a cluster of size n is estimated as n^{1 + 2 f(\theta)}, and the criterion function normalizes each cluster's links by this quantity:

    E_l = \sum_{i=1}^{k} n_i \cdot \frac{\sum_{p_q, p_r \in C_i} link(p_q, p_r)}{n_i^{1 + 2 f(\theta)}}

    Errors in the estimation of f(\theta) affect all the clusters similarly thanks to this normalization.
    Assumption: each point in the cluster has n^{f(\theta)} neighbours, so it contributes n^{2 f(\theta)} links.
    Usually f(\theta) is set to (1 - \theta) / (1 + \theta); the right choice depends on the dataset.
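A sketch of the neighbour and link notions this heuristic rests on, using Jaccard similarity as the similarity function; the threshold theta and the helper names (`neighbors`, `links`, `expected_links`) are illustrative:

```python
from itertools import combinations

def jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

def neighbors(points, theta):
    """p and q are neighbours iff sim(p, q) >= theta."""
    nbrs = {i: set() for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        if jaccard(points[i], points[j]) >= theta:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs

def links(nbrs, i, j):
    """link(i, j) = number of common neighbours of points i and j."""
    return len(nbrs[i] & nbrs[j])

def f(theta):
    return (1 - theta) / (1 + theta)   # the usual choice; dataset-dependent

def expected_links(n, theta):
    """Heuristic estimate of the total links inside a cluster of size n."""
    return n ** (1 + 2 * f(theta))
```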

    18. ROCK Algorithm (1)
    Three phases: random sampling, clustering with links, and labelling the data on disk.
    First, draw a random sample from the database; then cluster the sample with the link technique; after clustering the sample, label the remaining data on the disk.

    19. ROCK (2): Random Sampling
    Databases usually hold a large number of data points; sampling enables ROCK to reduce the number of points considered and hence the complexity.
    Clusters are generated from the sample points.
    With an appropriate sample size, the quality of the clustering is not affected.

    20. ROCK (3)
    A goodness measure determines the best pair of clusters to merge at each step:

    g(C_k, C_j) = \frac{link[C_k, C_j]}{(n_k + n_j)^{1 + 2 f(\theta)} - n_k^{1 + 2 f(\theta)} - n_j^{1 + 2 f(\theta)}}

    link[C_k, C_j] is the number of cross links between the two clusters; the denominator is the expected number of cross links between pairs of points in C_k and C_j (a normalization factor and penalty).
    The pair with maximum g is the best pair to merge.
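A direct transcription of the measure as a sketch, reusing the `links()` and `f()` helpers from the previous snippet; clusters here are assumed to be lists of point indices:

```python
def goodness(ci, cj, nbrs, theta):
    """Cross links between two clusters, normalized by their expected number."""
    cross = sum(links(nbrs, p, q) for p in ci for q in cj)
    e = 1 + 2 * f(theta)
    ni, nj = len(ci), len(cj)
    return cross / ((ni + nj) ** e - ni ** e - nj ** e)
```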

    21. ROCK (4)

```
procedure cluster(S, k)
begin
    link := compute_links(S)
    for each s ∈ S do
        q[s] := build_local_heap(link, s)
    Q := build_global_heap(S, q)
    while size(Q) > k do {
        u := extract_max(Q)
        v := max(q[u])
        delete(Q, v)
        w := merge(u, v)
        for each x ∈ q[u] ∪ q[v] do {
            link[x, w] := link[x, u] + link[x, v]
            delete(q[x], u); delete(q[x], v)
            insert(q[x], w, g(x, w)); insert(q[w], x, g(x, w))
            update(Q, x, q[x])
        }
        insert(Q, w, q[w])
        deallocate(q[u]); deallocate(q[v])
    }
end
```
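A simplified, runnable sketch of the same merging loop, assuming the `neighbors()`, `links()`, `f()` and `goodness()` helpers sketched earlier; it rescans all cluster pairs for the best goodness instead of maintaining the local and global heaps, so it is asymptotically slower than the procedure above, but easier to follow:

```python
from itertools import combinations

def rock_cluster(points, k, theta):
    """Agglomerative ROCK-style clustering down to k clusters (naive sketch)."""
    nbrs = neighbors(points, theta)               # theta-neighbour sets per point
    clusters = [[i] for i in range(len(points))]  # start with singleton clusters
    while len(clusters) > k:
        # Merge the pair of clusters with the maximum goodness measure.
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: goodness(clusters[ij[0]], clusters[ij[1]],
                                           nbrs, theta))
        clusters[a] = clusters[a] + clusters.pop(b)  # b > a, so index a stays valid
    return clusters
```

The random sample from the previous phase would be passed in as `points`; the heap-based procedure above achieves the same merges without the quadratic rescans.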

    22. ROCK (5): Complexity
    ROCK's complexity is O(max(n^2 m_a, n^2 log n)), dominated by:
    - the computation of links, O(n^2 m_a), where m_a is the average number of neighbours of a point
    - the updates of the local heaps with the new cluster w, O(n^2 log n)

    23. ROCK (6): Labelling Data on Disk
    Assigns the remaining data points in the database to the clusters generated from the sample (see the sketch below):
    - draw a set of points L_i from each cluster i
    - for each remaining point p, compute N_i, the number of its neighbours in L_i
    - p is assigned to the cluster i for which N_i is maximum
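A sketch of this labelling pass, reusing `jaccard()` from earlier; `L` is an assumed mapping from cluster id to its representative sample points:

```python
def label(p, L, theta):
    """Assign leftover point p to the cluster with the most neighbours in L_i."""
    def n_i(reps):
        return sum(1 for r in reps if jaccard(p, r) >= theta)
    return max(L, key=lambda i: n_i(L[i]))
```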

    39. Other Things to Consider
    Online categorical clustering instead of sampling (at least until a reasonable number of clusters is formed); sampling can be guaranteed to give good representatives only if certain conditions hold.
    Complexity: O(n). Online one-pass clustering is a bit slower, but more reliable, and helps retain the structure of the data.
    We can set it to find a large enough number of clusters, after which an offline algorithm (e.g. ROCK) can continue.
    Weighted links for repetitive transactions.
    An adaptive theta as the clustering procedure evolves.
    Other algorithms for clustering categorical data: CACTUS, STIRR, SCLOPE (online), QROCK.

    41. Any questions?
