
Evaluation of Clustering Techniques on DMOZ Data




  1. Evaluation of Clustering Techniques on DMOZ Data • Alper Rifat Uluçınar • Rıfat Özcan • Mustafa Canım

  2. Outline • What is DMOZ and why do we use it? • What is our aim? • Evaluation of partitioning clustering algorithms • Evaluation of hierarchical clustering algorithms • Conclusion

  3. What is DMOZ and why do we use it? • www.dmoz.org • Another name for ODP, the Open Directory Project • The largest human-edited directory on the Internet • 5,300,000 sites • 72,000 editors • 590,000 categories

  4. What is our aim? • Evaluating clustering algorithms is not easy • We will use DMOZ as a reference point (an ideal cluster structure) • Run our own clustering algorithms on the same data • Finally, compare the results.

  5. [Diagram: all DMOZ documents (websites) are partitioned in two ways, by human evaluation (the DMOZ clusters) and by clustering algorithms such as C3M and K-Means; the question is how the two clusterings compare.]

  6. A) Evaluation of Partitioning Clustering Algorithms • 20,000 documents from DMOZ • Flat partitioned data (214 folders) • We applied HTML parsing, stemming, and stop-word elimination • We will apply two clustering algorithms: • C3M • K-Means

  7. Before applying HTML parsing, stemming, and stop-word elimination

  8. After applying HTML parsing, stemming, and stop-word elimination
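
The slides do not name the tools used for this preprocessing step; a minimal sketch of such a pipeline, assuming Python's built-in HTML parser plus NLTK's English stop-word list and Porter stemmer, might look like this:

```python
# Sketch of the preprocessing described above: HTML parsing, stop-word
# elimination, and stemming. Tool choices (html.parser, NLTK) are
# illustrative assumptions, not the tools actually used in the study.
import re
from html.parser import HTMLParser
from nltk.corpus import stopwords          # requires: nltk.download('stopwords')
from nltk.stem import PorterStemmer

class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, skipping tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def preprocess(html: str) -> list[str]:
    extractor = TextExtractor()
    extractor.feed(html)
    text = " ".join(extractor.chunks).lower()
    tokens = re.findall(r"[a-z]+", text)   # keep alphabetic tokens only
    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens if t not in stop]

print(preprocess("<html><body><h1>Clustering web pages</h1></body></html>"))
# -> ['cluster', 'web', 'page']
```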

  9. [Diagram: the 20,000 DMOZ documents yield 214 clusters via human evaluation (the DMOZ categories) and 642 clusters when C3M is applied.]

  10. DMOZ Clusters vs. C3M Clusters • DMOZ: 214 clusters • C3M: 642 clusters • How to compare the DMOZ clusters with the C3M clusters? • Answer: Corrected Rand

  11. Validation of Partitioning Clustering • Comparison of two clustering structures • N documents • Clustering structure 1: • R clusters • Clustering structure 2: • C clusters • Metrics [1]: • Rand Index • Jaccard Coefficient • Corrected Rand Coefficient

  12. Validation of Partitioning Clustering • Each document pair falls into one of four types: • Type I, frequency a: same cluster in both structures • Type II, frequency b: same cluster in structure 1, different clusters in structure 2 • Type III, frequency c: different clusters in structure 1, same cluster in structure 2 • Type IV, frequency d: different clusters in both structures

  13. Validation of Partitioning Clustering • Rand Index = (a+d) / (a+b+c+d) • Jaccard Coefficient = a / (a+b+c) • Corrected Rand Coefficient • Accounts for chance agreement • Normalizes the Rand index so that it is 0 when the partitions are selected by chance and 1 when a perfect match is achieved • CR = (R − E(R)) / (1 − E(R))

  14. Validation of Partitioning Clustering • Example: • Docs: d1, d2, d3, d4, d5, d6 • Clustering Structure 1: • C1: d1, d2, d3 • C2: d4, d5, d6 • Clustering Structure 2: • D1: d1, d2 • D2: d3, d4 • D3: d5, d6

  15. Validation of Partitioning Clustering • Contingency of pair types: • a: (d1, d2), (d5, d6) • b: (d1, d3), (d2, d3), (d4, d5), (d4, d6) • c: (d3, d4) • d: the remaining 8 pairs (15 − 7) • Rand Index = (2+8)/15 = 0.67 • Jaccard Coeff. = 2/(2+4+1) = 0.29 • Corrected Rand = 0.24
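
These indices are easy to verify by brute-force pair counting. The sketch below (function and variable names are ours) reproduces the example above, computing the Corrected Rand value in its standard Hubert–Arabie adjusted-Rand form, consistent with [1]:

```python
# Pair-counting validation of two partitions, reproducing the example
# above. Brute force over all N*(N-1)/2 document pairs; fine for small N.
from itertools import combinations

def pair_counts(p1, p2):
    """p1, p2 map each document to its cluster label. Returns (a, b, c, d)."""
    a = b = c = d = 0
    for x, y in combinations(p1.keys(), 2):
        same1, same2 = p1[x] == p1[y], p2[x] == p2[y]
        if same1 and same2:       a += 1   # Type I
        elif same1 and not same2: b += 1   # Type II
        elif not same1 and same2: c += 1   # Type III
        else:                     d += 1   # Type IV
    return a, b, c, d

docs = ["d1", "d2", "d3", "d4", "d5", "d6"]
struct1 = dict(zip(docs, ["C1", "C1", "C1", "C2", "C2", "C2"]))
struct2 = dict(zip(docs, ["D1", "D1", "D2", "D2", "D3", "D3"]))

a, b, c, d = pair_counts(struct1, struct2)
n_pairs = a + b + c + d
rand = (a + d) / n_pairs                       # 10/15 = 0.67
jaccard = a / (a + b + c)                      # 2/7  = 0.29

# Corrected (adjusted) Rand: subtract the pair count expected under
# random labelling and rescale so a perfect match gives 1.
expected_a = (a + b) * (a + c) / n_pairs
max_a = ((a + b) + (a + c)) / 2
corrected_rand = (a - expected_a) / (max_a - expected_a)   # 0.8/3.3 = 0.24

print(rand, jaccard, corrected_rand)
```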

  16. Results • Low Corrected Rand and Jaccard values, ≈ 0.01 • Rand index ≈ 0.77 • Possible reasons: • Noise in the data, e.g., about 300 "Document Not Found" pages • The problem is difficult, e.g., the Homepages category

  17. B) Evaluation of Hierarchical Clustering Algorithms • Obtain a partitioning of DMOZ • Determine a depth (experimentally?) • Collect the documents at that depth or deeper • Documents at shallower depths? Ignore them…

  18. Hierarchical Clustering: Steps • Obtain the hierarchical clusters using: • Single linkage • Average linkage • Complete linkage • Obtain a partitioning from the hierarchical clustering…

  19. Hierarchical Clustering: Steps • One way: treat the DMOZ clusters as "queries": • For each selected DMOZ cluster • Find the number of "target clusters" in the computed partitioning • Take the average • Check whether N_t < N_tr • If not, either the choice of partitioning or the hierarchical clustering did not perform well…
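
The slide leaves N_t and N_tr undefined; on one plausible reading, N_t is the average number of computed clusters over which a DMOZ cluster's documents are scattered, compared against some threshold N_tr. A hypothetical sketch of that count (all names and data here are illustrative, not from the study):

```python
# Hypothetical sketch of the "target clusters" measure: treat each DMOZ
# cluster as a query and count how many computed clusters its documents
# land in; fewer target clusters suggests better agreement.
def avg_target_clusters(dmoz_clusters, computed):
    """dmoz_clusters: dict cluster_id -> set of doc ids.
    computed: dict doc id -> computed cluster label."""
    counts = [len({computed[doc] for doc in docs})
              for docs in dmoz_clusters.values()]
    return sum(counts) / len(counts)

dmoz = {"Arts": {"d1", "d2", "d3"}, "Science": {"d4", "d5"}}
computed = {"d1": 0, "d2": 0, "d3": 1, "d4": 2, "d5": 2}
print(avg_target_clusters(dmoz, computed))   # (2 + 1) / 2 = 1.5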

  20. Hierarchical Clustering: Steps • Another way: • Compare the two partitions using an index, e.g., Corrected Rand…

  21. Choice of Partition: Outline • Obtain the dendrogram using: • Single linkage • Complete linkage • Group average linkage • Ward's method

  22. Choice of Partition: Outline • How to convert a hierarchical cluster structure into a partition? • Visually inspect the dendrogram? • Use tools from statistics?

  23. Choice of Partition: Inconsistency Coefficient • At each fusion level: • Calculate the "inconsistency coefficient" • It utilizes statistics from the previous fusion levels • Choose the fusion level at which the inconsistency coefficient is at its maximum.

  24. Choice of Partition: Inconsistency Coefficient • Inconsistency coefficient (I.C.) at fusion level i: • I.C.(i) = (h_i − mean(h)) / std(h), where h_i is the fusion height of level i and the mean and sample standard deviation are taken over the fusion heights of the links within the comparison depth below level i (I.C. = 0 when the standard deviation is 0)
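
SciPy ships this statistic as scipy.cluster.hierarchy.inconsistent (comparison depth defaults to 2). A small sketch, using toy data of our own rather than the slides' eight-object example, that prints the coefficient per fusion level:

```python
# Sketch: inconsistency coefficient per fusion level with SciPy.
# The two well-separated point groups are illustrative toy data.
import numpy as np
from scipy.cluster.hierarchy import linkage, inconsistent

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (4, 2)),    # one tight group
               rng.normal(3, 0.3, (4, 2))])   # a second, well-separated group

Z = linkage(X, method="single")               # also: "average", "complete"
R = inconsistent(Z, d=2)                      # columns: mean, std, count, I.C.
for level, row in enumerate(R, start=1):
    print(f"Level {level}: I.C. = {row[3]:.4f}")
```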

  25. Choice of Partition: I.C. Hands-on, Objects • Plot of the objects • Distance measure: Euclidean distance

  26. Choice of Partition: I.C. Hands-on, Single Linkage

  27. Choice of Partition: I.C. Single Linkage Results • Level 1 → 0 • Level 2 → 0 • Level 3 → 0 • Level 4 → 0 • Level 5 → 0 • Level 6 → 1.1323 • Level 7 → 0.6434 • ⇒ Cut the dendrogram at a height between levels 5 and 6

  28. Choice of Partition: I.C. Single Linkage Results

  29. Choice of Partition: I.C. Hands-on, Average Linkage

  30. Choice of Partition: I.C. Average Linkage Results • Level 1 → 0 • Level 2 → 0 • Level 3 → 0.7071 • Level 4 → 0 • Level 5 → 0.7071 • Level 6 → 1.0819 • Level 7 → 0.9467 • ⇒ Cut the dendrogram at a height between levels 5 and 6

  31. Choice of Partition: I.C. Hands-on, Complete Linkage

  32. Choice of Partition: I.C. Complete Linkage Results • Level 1 → 0 • Level 2 → 0 • Level 3 → 0.7071 • Level 4 → 0 • Level 5 → 0.7071 • Level 6 → 1.0340 • Level 7 → 1.0116 • ⇒ Cut the dendrogram at a height between levels 5 and 6
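
Once the most inconsistent fusion level is found, the cut itself can be automated with SciPy's fcluster. A sketch (again on toy data of our own) that cuts just below the maximum I.C. for each of the three linkage methods used above:

```python
# Sketch: cutting the dendrogram at the most inconsistent fusion level
# to obtain a flat partition, for each linkage method.
import numpy as np
from scipy.cluster.hierarchy import linkage, inconsistent, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (4, 2)), rng.normal(3, 0.3, (4, 2))])

for method in ("single", "average", "complete"):
    Z = linkage(X, method=method)
    max_ic = inconsistent(Z, d=2)[:, 3].max()
    # A threshold just under the maximum I.C. cuts the tree at that link.
    labels = fcluster(Z, t=max_ic - 1e-9, criterion="inconsistent", depth=2)
    print(method, labels)   # cluster label per object, e.g. [1 1 1 1 2 2 2 2]
```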

  33. Conclusion • Our aim is to evaluate clustering techniques on DMOZ data. • Analysis of partitioning and hierarchical clustering algorithms. • If the experiments are successful, we will apply the same experiments to larger DMOZ data after downloading it. • Otherwise, we will try other methodologies to improve our experimental results.

  34. References • [1] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988. • [2] T. Korenius, J. Laurikkala, M. Juhola, and K. Järvelin. Hierarchical clustering of a Finnish newspaper article collection with graded relevance assessments. Information Retrieval, 9(1). Kluwer Academic Publishers, 2006. • www.dmoz.org
