
A Genetic Algorithm Approach to K-Means Clustering



  1. A Genetic Algorithm Approach to K-Means Clustering Craig Stanek CS401 November 17, 2004

  2. What Is Clustering? • “partitioning the data being mined into several groups (or clusters) of data instances, in such a way that: • Each cluster has instances that are very similar (or “near”) to each other, and • The instances in each cluster are very different (or “far away”) from the instances in the other clusters” • --Alex A. Freitas, “Data Mining and Knowledge Discovery with Evolutionary Algorithms”

  3. Why Cluster? Segmentation and Differentiation

  4. Why Cluster? Outlier Detection

  5. Why Cluster? Classification

  6. K-Means Clustering • Specify K clusters • Randomly initialize K “centroids” • Assign each data instance to the closest cluster according to its distance from each centroid • Recalculate cluster centroids • Repeat steps (3) and (4) until no data instances move to a different cluster
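The five steps above map directly onto a short loop. The following is a minimal sketch of that loop in Python/NumPy, written for illustration only; the function name, seeding, and `max_iter` cap are assumptions, not details from the slides:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Illustrative K-means loop following the steps on this slide."""
    rng = np.random.default_rng(seed)
    # (2) randomly initialize K centroids by picking K data instances
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # (3) assign each instance to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # (5) stop when no instance moves
            break
        labels = new_labels
        # (4) recalculate each centroid as the mean of its cluster
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```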

  7. Drawbacks of the K-Means Algorithm • Finds a local rather than global optimum • Sensitive to the initial choice of centroids • K must be chosen a priori • Minimizes intra-cluster distance but does not consider inter-cluster distance

  8. Problem Statement • Can a Genetic Algorithm approach do better than standard K-means Algorithm? • Is there an alternative fitness measure that can take into account both intra-cluster similarity and inter-cluster differentiation? • Can a GA be used to find the optimum number of clusters for a given data set?

  9. Representation of Individuals • Randomly generated number of clusters • Medoid-based integer string (each gene is a distinct data instance) • Example: 58 244 23 162 113
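As a concrete, hypothetical illustration of this encoding, an individual can be built as a variable-length list of distinct instance indices; the cluster-count bounds below are assumptions rather than values from the slides:

```python
import random

def random_individual(n_instances, k_min=2, k_max=10):
    """One individual = a medoid string: K distinct data-instance indices."""
    k = random.randint(k_min, k_max)              # randomly generated number of clusters
    return random.sample(range(n_instances), k)   # e.g. [58, 244, 23, 162, 113]
```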

  10. Genetic Algorithm Approach Why Medoids?

  11. Genetic Algorithm Approach Why Medoids?

  12. Genetic Algorithm Approach Why Medoids?
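The figures on slides 10-12 are not reproduced in the transcript. One practical appeal of medoids is that they are actual data instances, so a string of instance indices always decodes to a valid clustering with no averaging step. A small sketch of that decoding, again an illustration rather than the author's code:

```python
import numpy as np

def assign_to_medoids(X, medoid_idx):
    """Assign every instance to its nearest medoid; labels index into medoid_idx."""
    medoids = X[np.asarray(medoid_idx)]   # medoids are real instances, no averaging required
    dists = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
    return dists.argmin(axis=1)
```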

  13. Recombination • Two parent medoid strings (Parent #1, Parent #2) exchange genes to produce Child #1 and Child #2 • Worked example in the original figure used medoid indices 5, 6, 36, 80, 82, 108, 147
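The slide's exact recombination operator appears only in the figure, so the sketch below assumes a one-point crossover adapted to variable-length medoid strings, with duplicates filtered so every gene stays a distinct instance; treat it as illustrative, not as the author's operator:

```python
import random

def crossover(parent1, parent2):
    """Assumed one-point crossover of two medoid index strings (each of length >= 2)."""
    c1 = random.randint(1, len(parent1) - 1)
    c2 = random.randint(1, len(parent2) - 1)
    child1 = parent1[:c1] + [g for g in parent2[c2:] if g not in parent1[:c1]]
    child2 = parent2[:c2] + [g for g in parent1[c1:] if g not in parent2[:c2]]
    return child1, child2
```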

  14. Fitness Function Let r_ij represent the jth data instance of the ith cluster and M_i be the medoid of the ith cluster • Let X = the intra-cluster distance term (distances from instances r_ij to their medoid M_i) • Let Y = the inter-cluster distance term (distances between medoids) • Fitness = Y / X
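The X and Y sums on this slide appear only as images, so their exact form is not recoverable. The sketch below assumes X is the total instance-to-medoid distance and Y the total pairwise distance between medoids, which matches the intra- versus inter-cluster reading suggested by slides 8 and 24:

```python
import numpy as np
from itertools import combinations

def fitness(X_data, medoid_idx, labels):
    """Assumed form of the slide's Fitness = Y / X (Y: inter-cluster, X: intra-cluster)."""
    medoids = X_data[np.asarray(medoid_idx)]
    # X: sum of distances from every instance r_ij to its cluster medoid M_i
    intra = sum(np.linalg.norm(X_data[labels == i] - medoids[i], axis=1).sum()
                for i in range(len(medoid_idx)))
    # Y: sum of pairwise distances between medoids (assumed inter-cluster term)
    inter = sum(np.linalg.norm(medoids[i] - medoids[j])
                for i, j in combinations(range(len(medoid_idx)), 2))
    return inter / intra
```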

  15. Experimental Setup Iris Plant Data (UCI Repository) • 150 data instances • 4 dimensions • Known classifications • 3 classes • 50 instances of each
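For reference, the same Iris data is bundled with scikit-learn, so an experiment like this can be set up in a couple of lines (the UCI file itself works equally well):

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target   # 150 instances, 4 features, 3 classes of 50 each
```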

  16. Experimental Setup Iris Data Set

  17. Experimental Setup Iris Data Set

  18. Standard K-Means vs. Medoid-Based EA

  19. Standard K-Means Clustering Iris Data Set

  20. Medoid-Based EA Iris Data Set

  21. Standard Fitness EA vs. Proposed Fitness EA

  22. Fixed vs. Variable Number of Clusters EA

  23. Variable Number of Clusters EA Iris Data Set

  24. Conclusions • GA better at obtaining globally optimal solution • Proposed fitness function shows promise • Difficulty letting GA determine “correct” number of clusters on its own

  25. Future Work • Other data sets • Alternative fitness function • Scalability • GA comparison to simulated annealing
