Presenter: Cheng-Han Tsai. Authors: Liang Bai, Jiye Liang, Chuangyin Dang. KBS, 2011.

An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data.

Presentation Transcript


  1. An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data. Presenter: Cheng-Han Tsai. Authors: Liang Bai, Jiye Liang, Chuangyin Dang. KBS, 2011

  2. Outline • Motivation • Objectives • Methodology • Experiments • Conclusions • Comments

  3. Motivation The k-modes algorithm is sensitive to the initial cluster centers and requires the number of clusters to be specified in advance. There is no guarantee that the number of clusters we select is the best one.

  4. Objectives • To propose an initialization method that finds both the initial cluster centers and the number of clusters. • The method should handle large categorical data sets efficiently, in linear time.

  5. Methodology [Flow chart] • Data set • Construct a potential exemplar set S • Set the estimated number of clusters • K-modes-type algorithm • The clustering result

  6. Methodology The k-modes algorithm • Hamming distance: the number of positions in which two codes differ (computed with XOR) • Example: 10001001 XOR 10110001 = 00111000 → Hamming distance = 3
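The Hamming distance on slide 6 and its role in k-modes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`hamming_bits`, `matching_dissimilarity`, `k_modes`) and the toy data are assumptions made for the example, and the loop is the plain k-modes procedure with user-supplied initial centers, not the initialization method proposed in the paper.

```python
from collections import Counter


def hamming_bits(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings, computed with XOR."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return bin(int(a, 2) ^ int(b, 2)).count("1")


def matching_dissimilarity(x, y) -> int:
    """Number of attributes on which two categorical objects differ."""
    return sum(xi != yi for xi, yi in zip(x, y))


def k_modes(data, centers, max_iter=100):
    """Plain k-modes: assign each object to its nearest center, then
    replace each center by the attribute-wise mode of its members."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(len(centers)),
                key=lambda j: matching_dissimilarity(x, centers[j]))
            for x in data
        ]
        new_centers = []
        for j, old in enumerate(centers):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if not members:
                new_centers.append(old)  # keep the center of an empty cluster
                continue
            new_centers.append(
                tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
            )
        if new_centers == centers:  # converged: no center changed
            break
        centers = new_centers
    return labels, centers


print(hamming_bits("10001001", "10110001"))  # → 3

# Hypothetical toy data with two categorical attributes.
toy = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
labels, modes = k_modes(toy, centers=[("a", "x"), ("b", "y")])
```

Because `k_modes` takes both the initial centers and, implicitly, their number as inputs, the result depends on those two choices, which is the sensitivity the motivation slide refers to.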

  7. Methodology • New cluster centers initialization method • Finding the number of clusters

  8. Methodology New cluster centers initialization method.

  9. Methodology

  10. Methodology

  11. Methodology

  12. Methodology • Finding the number of clusters • We need to input a value k’, which is an estimated number of clusters • If k’ can’t be determined, we set k’ = |S|

  13. Methodology

  14. Methodology

  15. Methodology • More than 1 knee point of the function P(k) • More than 1 peak of the function C(k)

  16. Experiments • Performance analysis • Soybean data (4 diseases) • Lung cancer data (3 classes) • Zoo data (7 classes: 3 big clusters and 4 small clusters) • Mushroom data (2 classes) • Scalability analysis

  17. Experiments Performance analysis

  18. Experiments

  19. Experiments • Scalability analysis • 67557 data points and 42 categorical attributes

  20. Conclusions • The proposed method is effective and efficient for obtaining good initial cluster centers and the number of clusters • The time complexity analysis shows that the method runs in linear time

  21. Comments • Advantages • Improves on earlier methods for setting the two parameters (the initial cluster centers and the number of clusters) • Applications • Data clustering
