
DATA MINING



Presentation Transcript


  1. DATA MINING CLUSTERING K-Means

  2. Clustering Definition • Techniques used to divide data objects into groups • A form of classification in that it labels objects with class (cluster) labels; the labels are derived from the data itself • Cluster analysis is categorized as unsupervised classification • When you have no idea how to define the groups in advance, clustering methods can be useful

  3. Types of Clustering • Hierarchical vs Partitional • Hierarchical  nested clusters, organized as a tree • Partitional  fully non-overlapping • Exclusive vs Overlapping vs Fuzzy • Exclusive  each object is assigned to a single cluster • Overlapping  an object can simultaneously belong to more than one cluster • Fuzzy  every object belongs to every cluster with a membership weight between 0 and 1 • Complete vs Partial • Complete  assigns every object to a cluster • Partial  not all objects are assigned

  4. Types of Clusters • Well-separated • Prototype-based • Graph-based • Density-based • Shared-property (Conceptual Clusters)

  5. K-Means • Partitional clustering • Prototype-based • One level

  6. Basic K-Means • k, the number of clusters to be formed, must be decided before beginning • Step 1 • Select k data points to act as the seeds (initial cluster centroids) • Step 2 • Assign each record to the nearest centroid, thus forming a cluster • Step 3 • Recalculate the centroids of the new clusters, then go back to Step 2 • Repeat until the centroids (and therefore the cluster assignments) no longer change
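The three steps above can be sketched in a few lines of Python. This is a minimal illustration of the basic (Lloyd's) algorithm on tuples of numbers, not the exact implementation behind the slides; the function and parameter names are my own.

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two points (tuples of numbers)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, max_iters=100, seed=0):
    """Basic k-means: returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: select k records to act as the initial centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iters):
        # Step 2: assign each record to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: euclidean(p, centroids[j]))
                      for p in points]
        # Step 3: recompute each centroid as the mean of its cluster.
        for j in range(k):
            members = [p for p, l in zip(points, new_labels) if l == j]
            if members:  # keep the old centroid if the cluster came up empty
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
        if new_labels == labels:  # converged: assignments unchanged
            break
        labels = new_labels
    return centroids, labels
```

With two well-separated groups of points, the two returned labels line up with the groups regardless of which seeds are drawn.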

  7. Basic K-Means (cont.) [Figure: determine cluster boundaries, assign each record to the nearest centroid, calculate the new centroids]

  8. Choosing Initial Centroids • Randomly chosen initial centroids • Can give poor results • Can produce empty clusters • To limit the effects of random initialization: perform multiple runs, each with a different set of randomly chosen centroids, then select the set of clusters with the minimum SSE

  9. Similarity, Association, and Distance • The method just described assumes that each record can be described as a point in a metric space • This is not easily done for many data sets (e.g., categorical and some numeric variables) • Pre-processing is often necessary • Records in a cluster should have a natural association. A measure of similarity is required. • Euclidean distance is often used, but it is not always suitable • Euclidean distance treats changes in each dimension equally, but changes in one field may be more important than changes in another • and changes of the same “size” in different fields can have very different significance • e.g. a 1 metre difference in height vs. a $1 difference in annual income

  10. Measures of Similarity • Euclidean distance between vectors X and Y: d(X, Y) = √( Σᵢ (xᵢ − yᵢ)² ) • Weighting: give each field i a weight wᵢ reflecting its importance, d(X, Y) = √( Σᵢ wᵢ (xᵢ − yᵢ)² )
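A sketch of the weighted distance, with the height-vs-income example from the previous slide: down-weighting the income field keeps a $1 change from dominating a 1 m change. The weight values below are illustrative only.

```python
import math

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance: fields with larger weights
    contribute more to the distance."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))

# Records as (height_m, income_$). Unweighted distance is dominated by
# income; weighting income by a tiny factor restores balance.
a, b = (1.70, 50_000), (1.71, 50_100)
plain    = weighted_euclidean(a, b, (1, 1))
weighted = weighted_euclidean(a, b, (1, 1e-8))  # illustrative weights
```

Equivalently, one can rescale (standardize) each field before clustering and then use the plain distance.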

  11. Redefine Cluster Centroids • Sum of the Squared Error for data in Euclidean space: SSE = Σᵢ Σ_{x ∈ Cᵢ} dist(cᵢ, x)² • The centroid (mean) of the i-th cluster is defined as: cᵢ = (1 / mᵢ) Σ_{x ∈ Cᵢ} x, where mᵢ is the number of objects in cluster Cᵢ • Other cases: for non-Euclidean proximity measures, a different objective function and a different prototype definition are used
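The two formulas above translate directly into code. A minimal sketch (function names are mine):

```python
def centroid(cluster):
    """Mean of the points in one cluster: c_i = (1/m_i) * sum of members."""
    m = len(cluster)
    return tuple(sum(c) / m for c in zip(*cluster))

def sse(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its
    assigned centroid -- the objective k-means minimizes."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, centroids[l]))
               for p, l in zip(points, labels))
```

Using the cluster mean as the prototype is exactly what makes each Step 3 update non-increasing in SSE.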

  12. Bisecting K-means • Basic idea: • Split the set of all points into two clusters • Select one of these clusters to split • And so on, until K clusters have been produced • Choosing the cluster to split: • The cluster with the largest SSE • The cluster with the largest size • Both, or another criterion • Bisecting K-means is less susceptible to initialization problems
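The bisecting loop can be sketched as follows, using largest SSE as the splitting criterion. This is a self-contained illustration with a compact 2-means inlined; all names are mine, and it assumes each cluster chosen for splitting has at least two distinct points.

```python
import random

def two_means(points, seed=0):
    """Split one set of points into two clusters with plain 2-means."""
    rng = random.Random(seed)
    cents = rng.sample(points, 2)
    for _ in range(100):
        halves = ([], [])
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in cents]
            halves[d[1] < d[0]].append(p)  # index 1 iff nearer second centroid
        new = [tuple(sum(c) / len(h) for c in zip(*h)) if h else cents[i]
               for i, h in enumerate(halves)]
        if new == cents:  # converged
            break
        cents = new
    return [h for h in halves if h]

def cluster_sse(cluster):
    """SSE of one cluster around its own centroid."""
    cent = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return sum(sum((a - b) ** 2 for a, b in zip(p, cent)) for p in cluster)

def bisecting_kmeans(points, k):
    """Repeatedly bisect the cluster with the largest SSE until k remain."""
    clusters = [points]
    while len(clusters) < k:
        worst = max(clusters, key=cluster_sse)  # splitting criterion: largest SSE
        clusters.remove(worst)
        clusters.extend(two_means(worst))
    return clusters
```

In practice each bisection is often repeated several times and the split with the lowest SSE kept, which is where the reduced sensitivity to initialization comes from.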

  13. Strengths and Weaknesses • Strengths • Simple, and can be used for a wide variety of data types • Computationally efficient • Weaknesses • Not suitable for all types of data • Sensitive to outliers, which should be removed beforehand • Restricted to data for which there is a notion of a center (centroid)
