Clustering Techniques

Presentation Transcript

  1. Clustering Techniques

  2. Clustering Outline Goal: Provide an overview of the clustering problem and introduce some of the basic algorithms • Clustering Problem Overview • Clustering Techniques • Hierarchical Algorithms • Partitional Algorithms • Genetic Algorithm • Clustering Large Databases

  3. Clustering Examples • Segment customer database based on similar buying patterns. • Group houses in a town into neighborhoods based on similar features. • Identify new plant species • Identify similar Web usage patterns

  4. Clustering Example

  5. Clustering Houses • Size Based • Geographic Distance Based

  6. Clustering vs. Classification • No prior knowledge of the number of clusters or the meaning of clusters • Unsupervised learning

  7. Clustering Issues • Outlier handling • Dynamic data • Interpreting results • Evaluating results • Number of clusters • Data to be used • Scalability

  8. Impact of Outliers on Clustering

  9. Clustering Problem • Given a database D={t1,t2,…,tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f: D → {1,…,k} where each ti is assigned to one cluster Kj, 1<=j<=k. • A cluster, Kj, contains precisely those tuples mapped to it. • Unlike in the classification problem, the clusters are not known a priori.

  10. Types of Clustering • Hierarchical – Nested set of clusters created. • Partitional – One set of clusters created. • Incremental – Each element handled one at a time. • Simultaneous – All elements handled together. • Overlapping/Non-overlapping

  11. Clustering Approaches • Hierarchical: Agglomerative, Divisive • Partitional • Categorical • Large DB: Sampling, Compression

  12. Cluster Parameters

  13. Distance Between Clusters • Single Link: smallest distance between points in the two clusters • Complete Link: largest distance between points in the two clusters • Average Link: average distance between points in the two clusters • Centroid: distance between the centroids of the two clusters
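
The four inter-cluster distances can be sketched in a few lines. This is a minimal illustration, assuming clusters are lists of numeric tuples and Euclidean distance between points; the function names and the sample clusters `A` and `B` are made up for the example and are not from the slides.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def single_link(a, b):
    # Smallest distance between any point of a and any point of b.
    return min(dist(p, q) for p, q in product(a, b))


def complete_link(a, b):
    # Largest distance between any point of a and any point of b.
    return max(dist(p, q) for p, q in product(a, b))


def average_link(a, b):
    # Average over all cross-cluster pairs.
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid(cluster):
    # Coordinate-wise mean of the points in the cluster.
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def centroid_distance(a, b):
    return dist(centroid(a), centroid(b))


# Hypothetical clusters on a line, for illustration only.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(A, B), complete_link(A, B),
      average_link(A, B), centroid_distance(A, B))
```

Single link and complete link bracket the other two measures: here they give 2.0 and 5.0, while average link and centroid distance both come out at 3.5.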

  14. Hierarchical Clustering • Clusters are created in levels, producing a set of clusters at each level. • Agglomerative (bottom up): initially each item is in its own cluster; clusters are iteratively merged together. • Divisive (top down): initially all items are in one cluster; large clusters are successively divided.

  15. Hierarchical Algorithms • Single Link • MST Single Link • Complete Link • Average Link

  16. Dendrogram • Dendrogram: a tree data structure which illustrates hierarchical clustering techniques. • Each level shows clusters for that level. • Leaf – individual clusters (one item each) • Root – one cluster containing all items • A cluster at level i is the union of its child clusters at level i+1.

  17. Levels of Clustering

  18. Agglomerative Example [Figure: graph on items A, B, C, D, E and the corresponding dendrogram over A, B, C, D, E at thresholds 1 through 5]

  19. MST Example [Figure: minimum spanning tree on items A, B, C, D, E]

  20. Agglomerative Algorithm

  21. Single Link • View all items with links (distances) between them. • Finds maximal connected components in this graph. • Two clusters are merged if there is at least one edge which connects them. • Uses threshold distances at each level. • Could be agglomerative or divisive.
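
One way to read the bullets above: at a given threshold, keep only the edges whose distance is at most the threshold and report the connected components as clusters. The sketch below does this with a small union-find structure. The function name, the five items, and all distances are hypothetical illustration data (the distances behind the slides' figures are not recoverable from the transcript).

```python
def single_link_clusters(items, distances, threshold):
    """Connected components of the graph with edges of length <= threshold."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent pointers to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        if d <= threshold:
            parent[find(a)] = find(b)  # merge the two components

    groups = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical pairwise distances between five items.
items = ["A", "B", "C", "D", "E"]
distances = {
    ("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 2, ("A", "E"): 3,
    ("B", "C"): 2, ("B", "D"): 4, ("B", "E"): 3,
    ("C", "D"): 1, ("C", "E"): 5, ("D", "E"): 3,
}
for threshold in (0, 1, 2, 3):
    print(threshold, single_link_clusters(items, distances, threshold))
```

Raising the threshold level by level yields the nested clusterings an agglomerative run would produce; with this data the items merge into {A,B} and {C,D} at threshold 1, into {A,B,C,D} at 2, and into a single cluster at 3.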

  22. MST Single Link Algorithm

  23. Single Link Clustering

  24. Partitional Clustering • Nonhierarchical • Creates clusters in one step as opposed to several steps. • Since only one set of clusters is output, the user normally has to input the desired number of clusters, k. • Usually deals with static sets.

  25. Partitional Algorithms • MST • Squared Error • K-Means • Nearest Neighbor • PAM • BEA • GA

  26. MST Algorithm

  27. Squared Error • Minimized squared error
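
The formula on this slide did not survive in the transcript. As a hedged sketch, the code below assumes the usual definition: the squared error of a clustering is the sum, over all clusters, of the squared distances from each item to its cluster mean. The function name and the one-dimensional sample clusters are assumptions for illustration.

```python
def squared_error(clusters):
    """Sum of squared distances from each item to the mean of its cluster."""
    total = 0.0
    for cluster in clusters:
        mean = sum(cluster) / len(cluster)
        total += sum((x - mean) ** 2 for x in cluster)
    return total


# Two hypothetical 1-D clusterings of the same nine numbers.
good = [[2, 3, 4, 10, 11, 12], [20, 25, 30]]
worse = [[2, 3], [4, 10, 11, 12, 20, 25, 30]]
print(squared_error(good))   # 100 + 50 = 150.0
print(squared_error(worse))
```

A squared-error algorithm searches for the assignment that makes this quantity as small as it can; the first clustering above scores lower than the second.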

  28. Squared Error Algorithm

  29. K-Means • Initial set of clusters randomly chosen. • Iteratively, items are moved among sets of clusters until the desired set is reached. • High degree of similarity among elements in a cluster is obtained. • Given a cluster Ki={ti1,ti2,…,tim}, the cluster mean is mi = (1/m)(ti1 + … + tim)

  30. K-Means Example • Given: {2,4,10,12,3,20,30,11,25}, k=2 • Randomly assign means: m1=3, m2=4 • K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5, m2=16 • K1={2,3,4}, K2={10,12,20,30,11,25}, m1=3, m2=18 • K1={2,3,4,10}, K2={12,20,30,11,25}, m1=4.75, m2=19.6 • K1={2,3,4,10,11,12}, K2={20,30,25}, m1=7, m2=25 • Stop, because reassigning the items with these means produces the same clusters.
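
The trace above can be reproduced with a short one-dimensional K-means loop. This is a minimal sketch, not the algorithm from the slides (which is not in the transcript): the function name `kmeans_1d` is made up, items are plain numbers, distance is absolute difference, and the loop stops when the means no longer change.

```python
def kmeans_1d(items, means, max_iter=100):
    """Repeat assign-to-nearest-mean and recompute-means until stable."""
    means = list(means)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for x in items:
            # Index of the closest mean (ties go to the lower index).
            nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[nearest].append(x)
        # An empty cluster keeps its previous mean.
        new_means = [sum(c) / len(c) if c else m
                     for c, m in zip(clusters, means)]
        if new_means == means:
            break
        means = new_means
    return clusters, means


data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(data, [3, 4])
print(clusters, means)
```

Starting from m1=3 and m2=4, the loop passes through the same intermediate means as the slide and ends with means 7 and 25.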

  31. K-Means Algorithm

  32. Nearest Neighbor • Items are iteratively merged into the existing clusters that are closest. • Incremental • Threshold, t, used to determine if items are added to existing clusters or a new cluster is created.
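
A rough sketch of the incremental idea, assuming one-dimensional items, absolute difference as the distance, and that an item joins the cluster of its nearest already-clustered item when that distance is at most t. The function name, the threshold value, and the choice of "at most" rather than "less than" are assumptions.

```python
def nearest_neighbor_clustering(items, t):
    """Process items one at a time; join the closest cluster or open a new one."""
    clusters = []
    for x in items:
        best, best_d = None, None
        for c in clusters:
            # Distance from x to its nearest neighbor inside cluster c.
            d = min(abs(x - y) for y in c)
            if best_d is None or d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= t:
            best.append(x)
        else:
            clusters.append([x])
    return clusters


result = nearest_neighbor_clustering([2, 4, 10, 12, 3, 20, 30, 11, 25], t=3)
print(result)  # [[2, 4, 3], [10, 12, 11], [20], [30], [25]]
```

Because items are handled in arrival order, a different input order or a different t can give a different clustering, and the number of clusters is a result rather than an input.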

  33. Nearest Neighbor Algorithm

  34. PAM • Partitioning Around Medoids (PAM) (K-Medoids) • Handles outliers well. • Ordering of input does not impact results. • Does not scale well. • Each cluster represented by one item, called the medoid. • Initial set of k medoids randomly chosen.

  35. PAM

  36. PAM Cost Calculation • At each step in algorithm, medoids are changed if the overall cost is improved. • Cjih – cost change for an item tj associated with swapping medoid ti with non-medoid th.
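
The cost change can be sketched as follows, assuming one-dimensional items, absolute difference as the distance, and that each item is charged the distance to its nearest medoid. The per-item change plays the role of Cjih, and summing it over all items gives the total change for the swap. Function names and data are hypothetical.

```python
def nearest_distance(t_j, medoids):
    # Distance from item t_j to its closest medoid.
    return min(abs(t_j - m) for m in medoids)


def item_cost_change(t_j, medoids, t_i, t_h):
    """Change in cost for item t_j when medoid t_i is swapped with non-medoid t_h."""
    swapped = [t_h if m == t_i else m for m in medoids]
    return nearest_distance(t_j, swapped) - nearest_distance(t_j, medoids)


def total_swap_cost(items, medoids, t_i, t_h):
    # A negative total means the swap improves the clustering.
    return sum(item_cost_change(t_j, medoids, t_i, t_h) for t_j in items)


items = [2, 3, 4, 10, 11, 12, 20, 25, 30]
medoids = [3, 4]
print(total_swap_cost(items, medoids, t_i=4, t_h=20))  # 40 - 85 = -45
```

In this example replacing medoid 4 with item 20 lowers the total distance from 85 to 40, so a PAM step would accept the swap.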

  37. PAM Algorithm

  38. BEA • Bond Energy Algorithm • Database design (physical and logical) • Vertical fragmentation • Determine affinity (bond) between attributes based on common usage. • Algorithm outline: • Create affinity matrix • Convert to BOND matrix • Create regions of close bonding

  39. BEA Modified from [OV99]

  40. Genetic Algorithm Example • Items: {A,B,C,D,E,F,G,H} • Randomly choose an initial solution: {A,C,E} {B,F} {D,G,H}, encoded as the bit strings 10101000, 01000100, 00010011 • Suppose crossover occurs at point four between the 1st and 3rd individuals: 10100011, 01000100, 00011000 • What should the termination criteria be?
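
The encoding and the crossover step from this example can be checked with a short sketch. The helper names (`encode`, `crossover`, `is_partition`) are made up for illustration; each cluster is a bit string with one position per item, and single-point crossover swaps the tails of two strings.

```python
ITEMS = "ABCDEFGH"


def encode(cluster):
    # One bit per item: 1 if the item belongs to the cluster.
    return "".join("1" if x in cluster else "0" for x in ITEMS)


def crossover(a, b, point):
    # Single-point crossover: swap everything after the given position.
    return a[:point] + b[point:], b[:point] + a[point:]


def is_partition(bitstrings):
    # Every item must appear in exactly one cluster.
    return all(sum(int(s[i]) for s in bitstrings) == 1
               for i in range(len(ITEMS)))


solution = [encode({"A", "C", "E"}), encode({"B", "F"}), encode({"D", "G", "H"})]
first, third = crossover(solution[0], solution[2], 4)
offspring = [first, solution[1], third]
print(solution)   # ['10101000', '01000100', '00010011']
print(offspring)  # ['10100011', '01000100', '00011000']
```

Swapping tails between two clusters of the same solution moves items between those clusters without duplicating or dropping any, so the offspring here is still a valid partition.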

  41. GA Algorithm

  42. Clustering Large Databases • Most clustering algorithms assume a large data structure which is memory resident. • Clustering may be performed first on a sample of the database then applied to the entire database. • Algorithms • BIRCH • DBSCAN • CURE

  43. Desired Features for Large Databases • One scan (or less) of DB • Online • Suspendable, stoppable, resumable • Incremental • Work with limited main memory • Different techniques to scan (e.g. sampling) • Process each tuple once

  44. BIRCH • Balanced Iterative Reducing and Clustering using Hierarchies • Incremental, hierarchical, one scan • Save clustering information in a tree • Each entry in the tree contains information about one cluster • New nodes inserted in closest entry in tree

  45. Clustering Feature • CF Triple: (N,LS,SS) • N: Number of points in cluster • LS: Sum of points in the cluster • SS: Sum of squares of points in the cluster • CF Tree • Balanced search tree • Node has CF triple for each child • Leaf node represents cluster and has CF value for each subcluster in it. • Subcluster has maximum diameter
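
The (N, LS, SS) triple is useful because it is additive: the summary of a merged cluster is the component-wise sum of the two summaries, so points never need to be revisited. The sketch below assumes SS is stored as a single scalar (the sum of squared coordinates over all points); the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CF:
    n: int        # N: number of points
    ls: tuple     # LS: coordinate-wise sum of the points
    ss: float     # SS: sum of squared coordinates (scalar, by assumption)

    @classmethod
    def of_point(cls, point):
        return cls(1, tuple(point), float(sum(x * x for x in point)))

    def merge(self, other):
        # Additivity: summaries combine without access to the raw points.
        ls = tuple(a + b for a, b in zip(self.ls, other.ls))
        return CF(self.n + other.n, ls, self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)


cf = CF.of_point((1, 2)).merge(CF.of_point((3, 4)))
print(cf)             # CF(n=2, ls=(4, 6), ss=30.0)
print(cf.centroid())  # (2.0, 3.0)
```

Quantities such as the centroid can be derived from the triple alone, which is what lets a tree of such summaries be built in one scan with limited memory.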

  46. BIRCH Algorithm

  47. Improve Clusters

  48. DBSCAN • Density Based Spatial Clustering of Applications with Noise • Outliers will not affect the creation of clusters. • Input • MinPts – minimum number of points in a cluster • Eps – for each point in a cluster there must be another point in it less than this distance away.

  49. DBSCAN Density Concepts • Eps-neighborhood: Points within Eps distance of a point. • Core point: Eps-neighborhood dense enough (MinPts) • Directly density-reachable: A point p is directly density-reachable from a point q if the distance between them is small (Eps) and q is a core point. • Density-reachable: A point is density-reachable from another point if there is a path from one to the other consisting of only core points.
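
The first three definitions translate almost directly into code. This sketch assumes Euclidean distance, that "within Eps" means a distance of at most Eps, and that a point counts as a member of its own neighborhood; the function names and the four sample points are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def eps_neighborhood(points, p, eps):
    # All points within eps of p (p itself included, by assumption).
    return [q for q in points if dist(p, q) <= eps]


def is_core(points, p, eps, min_pts):
    return len(eps_neighborhood(points, p, eps)) >= min_pts


def directly_density_reachable(points, p, q, eps, min_pts):
    # p is directly density-reachable from q: close enough, and q is core.
    return dist(p, q) <= eps and is_core(points, q, eps, min_pts)


pts = [(0, 0), (1, 0), (2, 0), (10, 0)]
eps, min_pts = 1.5, 3
print(is_core(pts, (1, 0), eps, min_pts))   # True: neighbors (0,0), (1,0), (2,0)
print(is_core(pts, (0, 0), eps, min_pts))   # False: only (0,0) and (1,0)
print(directly_density_reachable(pts, (0, 0), (1, 0), eps, min_pts))  # True
print(directly_density_reachable(pts, (1, 0), (0, 0), eps, min_pts))  # False
```

The last two lines show that direct density-reachability is not symmetric: a border point can be reached from a core point, but not the other way around. The isolated point (10, 0) is reachable from nothing and would be treated as noise.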

  50. Density Concepts