
k-medoid clustering with genetic algorithm




  1. k-medoid clustering with genetic algorithm Wei-Ming Chen 2012.12.06

  2. Outline k-medoids clustering: famous works GCA: clustering with the aid of a genetic algorithm Clustering genetic algorithm: one that also judges the number of clusters Conclusion

  3. k-medoids clustering: famous works GCA: clustering with the aid of a genetic algorithm Clustering genetic algorithm: one that also judges the number of clusters Conclusion

  4. What is k-medoid clustering? Proposed in 1987 (L. Kaufman and P.J. Rousseeuw) There are N points in the space k points are chosen as centers (medoids) The remaining points are classified into the k groups Which k points should be chosen to minimize the sum of distances from each point to its medoid?
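The objective on this slide fits in a few lines of code. A minimal Python sketch on toy 1-D data (the function name and data are illustrative, not from the paper):

```python
def kmedoids_cost(points, medoids, dist):
    """Total distance from every point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

# Toy 1-D data: two obvious clusters.
pts = [1, 2, 3, 10, 11, 12]
d = lambda a, b: abs(a - b)
print(kmedoids_cost(pts, [2, 11], d))  # → 4
```

The k-medoid problem is to choose the k data points (here [2, 11]) that minimize this sum.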

  5. Difficulty NP-hard Genetic algorithms can be applied

  6. k-medoids clustering: famous works GCA: clustering with the aid of a genetic algorithm Clustering genetic algorithm: one that also judges the number of clusters Conclusion

  7. Partitioning Around Medoids (PAM) Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. New York: Wiley Groups the N data points into k sets In every iteration, consider every pair (Oi, Oj), where Oi is a medoid and Oj is not; if replacing Oi with Oj would reduce the total distance, perform the swap Computation time: O(k(N−k)²) [one iteration]
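The PAM swap step can be sketched as follows. This version evaluates every (medoid, non-medoid) swap and keeps the single best one per pass; the slide's variant applies any improving swap immediately, so treat this as an illustrative steepest-descent form (names are my own):

```python
def pam_iteration(points, medoids, dist):
    """One PAM pass: evaluate every (medoid, non-medoid) swap and keep
    the single best improvement over the current medoid set."""
    def cost(meds):
        return sum(min(dist(p, m) for m in meds) for p in points)

    best_meds, best_cost = list(medoids), cost(medoids)
    for i in range(len(medoids)):
        for o in points:
            if o in medoids:
                continue
            trial = list(medoids)
            trial[i] = o                  # swap medoid i for candidate o
            c = cost(trial)
            if c < best_cost:
                best_meds, best_cost = trial, c
    return best_meds, best_cost

pts = [1, 2, 3, 10, 11, 12]
d = lambda a, b: abs(a - b)
meds, c = pam_iteration(pts, [1, 3], d)   # best swap gives medoids {3, 11}, cost 5
```

Each pass tries k(N−k) swaps, matching the O(k(N−k)²) per-iteration cost when each cost evaluation is O(N).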

  8. Clustering LARge Applications (CLARA) Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. New York: Wiley Reduces the computation time Works on a sample of only s of the original N data points s = 40 + 2k seems a good choice Computation time: O(ks² + k(N−k)) [one iteration]
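CLARA's sampling idea, sketched in Python under the assumption that a PAM-style search is run to convergence on each random sample and candidates are then scored on the full data set (all names and parameter defaults are illustrative):

```python
import random

def clara(points, k, dist, samples=5, s=None, seed=0):
    """CLARA sketch: run a PAM-style search on small random samples,
    then score each candidate medoid set on the FULL data, keep the best."""
    rng = random.Random(seed)
    s = s if s is not None else min(len(points), 40 + 2 * k)

    def cost(meds, data):
        return sum(min(dist(p, m) for m in meds) for p in data)

    best_meds, best_cost = None, float("inf")
    for _ in range(samples):
        sample = rng.sample(points, s)
        meds = rng.sample(sample, k)
        improved = True
        while improved:                     # PAM, restricted to the sample
            improved = False
            for i in range(k):
                for o in sample:
                    if o in meds:
                        continue
                    trial = meds[:i] + [o] + meds[i + 1:]
                    if cost(trial, sample) < cost(meds, sample):
                        meds, improved = trial, True
        c = cost(meds, points)              # evaluate on the whole data set
        if c < best_cost:
            best_meds, best_cost = meds, c
    return best_meds, best_cost
```

Since PAM only ever sees s points per sample, the per-iteration work drops from O(k(N−k)²) to roughly O(ks²) plus the O(k(N−k)) full-data evaluation.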

  9. Clustering Large Applications based upon RANdomized Search (CLARANS) Ng, R., & Han, J. (1994). Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th international conference on very large databases, Santiago, Chile (pp. 144–155) Does not try all pairs (Oi, Oj) Tries max(0.0125 · k(N−k), 250) different Oj for each Oi Computation time: O(N²) [one iteration]
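The randomized neighbour search behind CLARANS might look like this sketch; `numlocal` (number of restarts) and `maxneighbor` (failed tries before declaring a local optimum) are the usual CLARANS parameters, everything else here is an illustrative choice:

```python
import random

def clarans(points, k, dist, numlocal=2, maxneighbor=50, seed=0):
    """CLARANS sketch: several random restarts; from the current medoid
    set, test random single-swap neighbours, move on any improvement, and
    declare a local optimum after maxneighbor consecutive failed tries."""
    rng = random.Random(seed)

    def cost(meds):
        return sum(min(dist(p, m) for m in meds) for p in points)

    best, best_cost = None, float("inf")
    for _ in range(numlocal):
        meds = rng.sample(points, k)
        c = cost(meds)
        fails = 0
        while fails < maxneighbor:
            i = rng.randrange(k)            # random medoid position
            o = rng.choice(points)          # random replacement candidate
            if o in meds:
                continue
            trial = meds[:i] + [o] + meds[i + 1:]
            tc = cost(trial)
            if tc < c:
                meds, c, fails = trial, tc, 0
            else:
                fails += 1
        if c < best_cost:
            best, best_cost = meds, c
    return best, best_cost
```

Instead of scanning all k(N−k) swaps like PAM, only a bounded number of random neighbours is examined per move.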

  10. k-medoids clustering: famous works GCA: clustering with the aid of a genetic algorithm Clustering genetic algorithm: one that also judges the number of clusters Conclusion

  11. GCA Lucasius, C. B., Dane, A. D., & Kateman, G. (1993). On k-medoid clustering of large data sets with the aid of a genetic algorithm: Background, feasibility and comparison. Analytica Chimica Acta, 282, 647–669.

  12. Chromosome encoding N data points, clustered into k groups Problem size = k (the number of groups) Each position of the string is an integer (1~N), i.e. a medoid

  13. Initialization Each string in the population uniquely encodes a candidate solution of the target problem The candidates are chosen at random
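A possible initialization for this encoding, as a sketch; sorting each string is my choice here, so that uniqueness does not depend on gene order (the slide only says candidates are random and unique):

```python
import random

def init_population(N, k, pop_size, seed=0):
    """Random initial population for the k-medoid encoding: each
    chromosome is k distinct medoid indices in 1..N."""
    rng = random.Random(seed)
    pop, seen = [], set()
    while len(pop) < pop_size:
        chrom = tuple(sorted(rng.sample(range(1, N + 1), k)))
        if chrom not in seen:             # every string is a distinct candidate
            seen.add(chrom)
            pop.append(chrom)
    return pop

pop = init_population(N=1000, k=15, pop_size=10)
```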

  14. Selection Select the M worst individuals in the population and discard them

  15. Crossover Select some individuals to reproduce M new offspring Building-block-like crossover Mutation

  16. Crossover • For example, k = 3, p1 = 2 3 7, p2 = 4 8 2 (subscripts mark which parent each gene came from) • 1. Mix p1 and p2: Q = 2₁ 3₁ 7₁ 4₂ 8₂ 2₂, then randomly scramble: Q = 4₂ 2₂ 2₁ 8₂ 7₁ 3₁ • 2. Add new material: the first k elements may be changed: Q = 5 2₂ 7 8₂ 7₁ 3₁ • 3. Randomly scramble again: Q = 2₂ 7₁ 7 3₁ 5 8₂ • 4. The offspring are read off from the left and from the right, skipping duplicate values: C1 = 2 7 3, C2 = 8 5 3
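The four steps above can be sketched as follows; the probability of injecting new material in step 2 is an assumption, since the slide does not state it:

```python
import random

def gca_crossover(p1, p2, N, rng, p_new=0.5):
    """Sketch of the building-block-like crossover.
    p_new, the injection probability in step 2, is assumed."""
    k = len(p1)
    q = list(p1) + list(p2)               # 1. mix both parents
    rng.shuffle(q)                        #    and scramble
    for i in range(k):                    # 2. first k genes may get new material
        if rng.random() < p_new:
            q[i] = rng.randrange(1, N + 1)
    rng.shuffle(q)                        # 3. scramble again

    def pick(genes):                      # 4. read off up to k distinct medoids
        child = []
        for g in genes:
            if g not in child:
                child.append(g)
            if len(child) == k:
                break
        return child

    return pick(q), pick(list(reversed(q)))

rng = random.Random(1)
c1, c2 = gca_crossover([2, 3, 7], [4, 8, 2], N=10, rng=rng)
```

Reading one child from the left and one from the right lets both offspring inherit building blocks from the mixed pool while staying duplicate-free.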

  17. Experiment Under a budget of NFE < 100000 (number of function evaluations) N = 1000, k = 15

  18. Experiment GCA versus Random search

  19. Experiment GCA versus CLARA (k = 15)

  20. Experiment GCA versus CLARA (k = 50)

  21. Experiment

  22. Paper’s conclusion GCA can handle both large and small values of k GCA outperforms CLARA, especially when k is large GCA lends itself excellently to parallelization GCA can be combined with CLARA to obtain a hybrid search system with better performance.

  23. k-medoids clustering: famous works GCA: clustering with the aid of a genetic algorithm Clustering genetic algorithm: one that also judges the number of clusters Conclusion

  24. Motivation In some cases we do not actually know the number of clusters What if we only know an upper limit?

  25. Hruschka, E.R. and N.F.F. Ebecken. (2003). “A Genetic Algorithm for Cluster Analysis.” Intelligent Data Analysis 7, 15–25.

  26. Fitness function a(i) : the average distance from individual i to the other individuals in the same cluster d(i, C) : the average distance from individual i to the individuals of another cluster C b(i) : the smallest of the d(i, C)

  27. Fitness function • Silhouette • fitness = average over all individuals of s(i) = (b(i) − a(i)) / max(a(i), b(i)) • This value will be high when… • a(i) values are small • b(i) values are high
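The silhouette fitness can be computed directly from the definitions of a(i) and b(i); a minimal sketch, assuming at least two clusters:

```python
def silhouette_fitness(points, labels, dist):
    """Average silhouette s(i) = (b(i) - a(i)) / max(a(i), b(i)).
    Assumes at least two clusters are present."""
    n = len(points)
    total = 0.0
    for i in range(n):
        # a(i): average distance to the other members of i's own cluster
        same = [j for j in range(n) if j != i and labels[j] == labels[i]]
        a = (sum(dist(points[i], points[j]) for j in same) / len(same)
             if same else 0.0)
        # b(i): smallest average distance d(i, C) to any other cluster C
        b = float("inf")
        for lab in set(labels):
            if lab == labels[i]:
                continue
            other = [j for j in range(n) if labels[j] == lab]
            b = min(b, sum(dist(points[i], points[j]) for j in other) / len(other))
        total += (b - a) / max(a, b)
    return total / n
```

A good labelling of two well-separated 1-D clusters scores close to 1, while a deliberately shuffled labelling goes negative.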

  28. Chromosome encoding • N data points, clustered into at most k groups • Problem size = N + 1 (the last gene stores the number of clusters) • Each of the first N positions is an integer (1~k): the cluster the object belongs to • Genotype1: 22345123453321454552 5 • To avoid the following problem (two encodings of the same partition produce bad children): • Genotype2: 2|2222|11111 33333 44444 4 • Genotype3: 4|4444|33333 55555 11111 4 • Child2: 2 4444 11111 33333 44444 4 • Child3: 4 2222 33333 55555 11111 5 • Consistent algorithm (rename labels in order of first appearance): 11234512342215343441 5
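One renumbering rule that reproduces the slide's "consistent algorithm" example is renaming cluster labels in order of first appearance; a sketch:

```python
def renumber(genotype):
    """Rename cluster labels in order of first appearance, so that
    equivalent partitions get identical strings; the last gene (the
    number of clusters) is kept as-is."""
    labels, last = genotype[:-1], genotype[-1]
    mapping, out = {}, []
    for g in labels:
        if g not in mapping:
            mapping[g] = len(mapping) + 1
        out.append(mapping[g])
    return out + [last]

g1 = [2, 2, 3, 4, 5, 1, 2, 3, 4, 5, 3, 3, 2, 1, 4, 5, 4, 5, 5, 2, 5]
canonical = renumber(g1)   # same partition, labels renamed consistently
```

Applied to Genotype1 above, this yields exactly the string 11234512342215343441 5 from the slide, and applying it twice changes nothing.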

  29. Initialization Population size = 20 The first genotype represents two clusters, the second three, the third four, …, and the last one 21 clusters

  30. Selection Roulette wheel selection The silhouette fitness lies in [−1, 1] and is first normalized to a non-negative range
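Roulette-wheel selection with shifted silhouette weights might be sketched as below; the +1 shift is my assumption for the unstated normalization (it maps [−1, 1] onto [0, 2]):

```python
import random

def roulette_select(population, fitnesses, n, rng):
    """Roulette-wheel selection with silhouette fitnesses in [-1, 1];
    the +1 shift (an assumed normalization) makes every weight >= 0."""
    weights = [f + 1.0 for f in fitnesses]
    total = sum(weights)
    chosen = []
    for _ in range(n):
        r = rng.random() * total          # spin the wheel
        acc = 0.0
        for ind, w in zip(population, weights):
            acc += w
            if acc >= r:
                chosen.append(ind)
                break
    return chosen

rng = random.Random(0)
picks = roulette_select(["a", "b", "c"], [0.9, -0.9, 0.0], 1000, rng)
```

Individuals with silhouette near 1 are drawn far more often than those near −1, but no individual's probability is exactly zero.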

  31. Crossover Uniform crossover does not work Use the Grouping Genetic Algorithm (GGA), proposed by Falkenauer (1998) First, two strings are selected A − 1123245125432533424 B − 1212332124423221321 Randomly select groups of A to preserve (for example, groups 2 and 3)

  32. Crossover A − 1123245125432533424 B − 1212332124423221321 C − 0023200020032033020 Check which groups of B are untouched and place them in C C − 0023200024432033020 The other child is formed from the groups of B (minus those already placed in C) D − 1212332120023221321

  33. Crossover A − 1123245125432533424 B − 1212332124423221321 C − 0023200024432033020 The other child is formed from the groups of B (minus those already placed in C) D − 1212332120023221321 Check which groups of A are untouched and place them in D The remaining objects (whose alleles are zeros) are assigned to the nearest cluster
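The group-copying part of the GGA crossover above can be sketched as follows; the final step, assigning the remaining zero alleles to the nearest cluster, needs the actual data points and is omitted here:

```python
def gga_crossover(a, b, groups_from_a):
    """GGA crossover sketch: copy the chosen groups of A into child C,
    then add every group of B that does not collide with them. Zero
    alleles would afterwards be assigned to the nearest cluster."""
    n = len(a)
    # keep the selected groups of A, zero everything else
    c = [a[i] if a[i] in groups_from_a else 0 for i in range(n)]
    # add each group of B whose objects are all still unassigned
    for lab in sorted(set(b)):
        members = [i for i in range(n) if b[i] == lab]
        if all(c[i] == 0 for i in members):
            for i in members:
                c[i] = lab
    return c

a = [int(ch) for ch in "1123245125432533424"]
b = [int(ch) for ch in "1212332124423221321"]
c = gga_crossover(a, b, {2, 3})
```

On the slide's example this reproduces C = 0023200024432033020: only group 4 of B survives, because every other group of B overlaps an object already claimed by A's groups 2 and 3.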

  34. Mutation Two mutation operators 1. Randomly choose a group and place all of its objects in the remaining cluster with the nearest centroid 2. Divide a randomly selected group into two new ones Both change the genotype in the smallest possible way
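Both mutation operators, sketched for 1-D data; the rule for splitting a cluster around its centroid is an illustrative choice, as the slide does not specify how the division is made:

```python
import random

def mutate_merge(labels, points, rng):
    """Mutation 1: dissolve a random cluster and move its objects to
    the remaining cluster with the nearest centroid (1-D sketch)."""
    labs = list(labels)
    groups = sorted(set(labs))
    if len(groups) <= 2:
        return labs                       # keep at least two clusters
    victim = rng.choice(groups)
    centroids = {g: sum(points[i] for i in range(len(labs)) if labs[i] == g)
                    / labs.count(g)
                 for g in groups if g != victim}
    for i in range(len(labs)):
        if labs[i] == victim:
            labs[i] = min(centroids, key=lambda g: abs(points[i] - centroids[g]))
    return labs

def mutate_split(labels, points, rng):
    """Mutation 2: divide a random cluster into two new ones, here by
    splitting its members around the cluster centroid."""
    labs = list(labels)
    groups = sorted(set(labs))
    victim = rng.choice(groups)
    members = [i for i in range(len(labs)) if labs[i] == victim]
    if len(members) < 2:
        return labs
    centroid = sum(points[i] for i in members) / len(members)
    new_lab = max(groups) + 1
    for i in members:
        if points[i] > centroid:          # one simple splitting rule
            labs[i] = new_lab
    return labs
```

Merging removes exactly one cluster and splitting adds exactly one, so each operator changes the genotype in the smallest possible way.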

  35. Experiment 4 test problems (N = 75, 200, 699, 150)

  36. Experiment Ruspini data (N = 75)

  37. Paper’s conclusion Does not need to know the number of groups Successfully finds the answer on four different test problems Uses only a small population size

  38. k-medoids clustering: famous works GCA: clustering with the aid of a genetic algorithm Clustering genetic algorithm: one that also judges the number of clusters Conclusion

  39. Conclusion Genetic algorithms are an acceptable method for clustering problems Crossover needs to be designed carefully Maybe EDAs can be applied Some theses? Or final projects!
