
DATA MINING



Presentation Transcript


  1. DATA MINING CLUSTERING K-Means

  2. Clustering Definition • Techniques used to divide data objects into groups • A form of classification in that it labels objects with class (cluster) labels; the labels are derived from the data itself • Cluster analysis is categorized as unsupervised classification • When you have no idea how to define the groups in advance, clustering methods can be useful

  3. Types of Clustering • Hierarchical vs Partitional • Hierarchical  nested clusters, organized as a tree • Partitional  fully non-overlapping • Exclusive vs Overlapping vs Fuzzy • Exclusive  each object is assigned to a single cluster • Overlapping  an object can simultaneously belong to more than one cluster • Fuzzy  every object belongs to every cluster with a membership weight between 0 and 1 • Complete vs Partial • Complete  assigns every object to a cluster • Partial  not all objects are assigned

  4. Types of Clusters • Well-separated • Prototype-based • Graph-based • Density-based • Shared-property (Conceptual Clusters)

  5. K-Means • Partitional clustering • Prototype-based • One level

  6. Basic K-Means • k, the number of clusters to be formed, must be decided before beginning • Step 1 • Select k data points to act as the seeds (initial cluster centroids) • Step 2 • Assign each record to the nearest centroid, thus forming a cluster • Step 3 • Recalculate the centroids of the new clusters, then go back to Step 2 • Repeat until the centroids (and therefore the cluster assignments) no longer change
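The three steps above can be sketched in a few lines of Python. This is a minimal illustration of the basic (Lloyd's) algorithm on tuples of numbers, not the exact implementation behind the slides; the function and parameter names are my own.

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two points (tuples of numbers)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, max_iters=100, seed=0):
    """Basic k-means: returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: select k records to act as the initial centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iters):
        # Step 2: assign each record to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: euclidean(p, centroids[j]))
                      for p in points]
        # Step 3: recompute each centroid as the mean of its cluster.
        for j in range(k):
            members = [p for p, l in zip(points, new_labels) if l == j]
            if members:  # keep the old centroid if the cluster came up empty
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
        if new_labels == labels:  # converged: assignments unchanged
            break
        labels = new_labels
    return centroids, labels
```

With two well-separated groups of points, the two returned labels line up with the groups regardless of which seeds are drawn.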

  7. Basic K-Means (cont.) [Figure: determine cluster boundaries, assign each record to the nearest centroid, calculate the new centroids]

  8. Choosing Initial Centroids • Randomly chosen initial centroids • Can give poor results • Can produce empty clusters • To limit the effects of random initialization: perform multiple runs, each with a different set of randomly chosen centroids, then select the set of clusters with the minimum SSE

  9. Similarity, Association, and Distance • The method just described assumes that each record can be described as a point in a metric space • This is not easily done for many data sets (e.g., categorical and some numeric variables) • Pre-processing is often necessary • Records in a cluster should have a natural association. A measure of similarity is required. • Euclidean distance is often used, but it is not always suitable • Euclidean distance treats changes in each dimension equally, but changes in one field may be more important than changes in another • and changes of the same “size” in different fields can have very different significance • e.g. a 1 metre difference in height vs. a $1 difference in annual income

  10. Measures of Similarity • Euclidean distance between vectors X and Y: d(X, Y) = √( Σᵢ (xᵢ − yᵢ)² ) • Weighting: give each field i a weight wᵢ reflecting its importance, d(X, Y) = √( Σᵢ wᵢ (xᵢ − yᵢ)² )
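A sketch of the weighted distance, with the height-vs-income example from the previous slide: down-weighting the income field keeps a $1 change from dominating a 1 m change. The weight values below are illustrative only.

```python
import math

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance: fields with larger weights
    contribute more to the distance."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))

# Records as (height_m, income_$). Unweighted distance is dominated by
# income; weighting income by a tiny factor restores balance.
a, b = (1.70, 50_000), (1.71, 50_100)
plain    = weighted_euclidean(a, b, (1, 1))
weighted = weighted_euclidean(a, b, (1, 1e-8))  # illustrative weights
```

Equivalently, one can rescale (standardize) each field before clustering and then use the plain distance.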

  11. Redefine Cluster Centroids • Sum of the Squared Error for data in Euclidean space: SSE = Σᵢ Σ_{x ∈ Cᵢ} dist(cᵢ, x)² • The centroid (mean) of the i-th cluster is defined as: cᵢ = (1 / mᵢ) Σ_{x ∈ Cᵢ} x, where mᵢ is the number of objects in cluster Cᵢ • Other cases: for non-Euclidean proximity measures, a different objective function and a different prototype definition are used
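The two formulas above translate directly into code. A minimal sketch (function names are mine):

```python
def centroid(cluster):
    """Mean of the points in one cluster: c_i = (1/m_i) * sum of members."""
    m = len(cluster)
    return tuple(sum(c) / m for c in zip(*cluster))

def sse(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its
    assigned centroid -- the objective k-means minimizes."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, centroids[l]))
               for p, l in zip(points, labels))
```

Using the cluster mean as the prototype is exactly what makes each Step 3 update non-increasing in SSE.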

  12. Bisecting K-means • Basic idea: • Split the set of all points into two clusters • Select one of these clusters to split • And so on, until K clusters have been produced • Choosing the cluster to split: • The cluster with the largest SSE • The cluster with the largest size • Both, or another criterion • Bisecting K-means is less susceptible to initialization problems
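The bisecting loop can be sketched as follows, using largest SSE as the splitting criterion. This is a self-contained illustration with a compact 2-means inlined; all names are mine, and it assumes each cluster chosen for splitting has at least two distinct points.

```python
import random

def two_means(points, seed=0):
    """Split one set of points into two clusters with plain 2-means."""
    rng = random.Random(seed)
    cents = rng.sample(points, 2)
    for _ in range(100):
        halves = ([], [])
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in cents]
            halves[d[1] < d[0]].append(p)  # index 1 iff nearer second centroid
        new = [tuple(sum(c) / len(h) for c in zip(*h)) if h else cents[i]
               for i, h in enumerate(halves)]
        if new == cents:  # converged
            break
        cents = new
    return [h for h in halves if h]

def cluster_sse(cluster):
    """SSE of one cluster around its own centroid."""
    cent = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return sum(sum((a - b) ** 2 for a, b in zip(p, cent)) for p in cluster)

def bisecting_kmeans(points, k):
    """Repeatedly bisect the cluster with the largest SSE until k remain."""
    clusters = [points]
    while len(clusters) < k:
        worst = max(clusters, key=cluster_sse)  # splitting criterion: largest SSE
        clusters.remove(worst)
        clusters.extend(two_means(worst))
    return clusters
```

In practice each bisection is often repeated several times and the split with the lowest SSE kept, which is where the reduced sensitivity to initialization comes from.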

  13. Strengths and Weaknesses • Strengths • Simple, and can be used for a wide variety of data types • Computationally efficient • Weaknesses • Not suitable for all types of data • Sensitive to outliers, which should be removed beforehand • Restricted to data for which there is a notion of a center (centroid)
