
A matrix density based algorithm to hierarchically co-cluster documents and words

Advisor: Dr. Hsu. Graduate: Keng-Wei Chang. Authors: Bhushan Mandhani, Sachindra Joshi, Krishna Kummamuru.


Presentation Transcript


  1. A matrix density based algorithm to hierarchically co-cluster documents and words • Advisor: Dr. Hsu • Graduate: Keng-Wei Chang • Authors: Bhushan Mandhani, Sachindra Joshi, Krishna Kummamuru

  2. Outline • Motivation • Objective • Introduction • Background • Rowset Partitioning and Submatrix Agglomeration (RPSA) • Experimental results • Conclusions • Personal Opinion

  3. Motivation • With the explosion of unstructured information, it has become increasingly important to organize that information in a comprehensible and navigable manner.

  4. Objective • A hierarchical arrangement of documents is very useful for browsing a document collection, as shown by the popularity of Yahoo and Google. • This paper proposes an algorithm that hierarchically clusters documents to address this problem.

  5. Introduction • 1990s: 100 thousand pages • 2002: 2 billion pages • It has become increasingly important to organize the information • Manual organization is accurate, but not always feasible • Tools are needed to automatically arrange documents into labeled hierarchies • Proposed: RPSA, a two-step partitional-agglomerative algorithm

  6. Background • Vector Model for Documents • Evaluation of Clustering Quality • Evaluation of Hierarchical Clustering

  7. Vector Model for Documents • Unitized TF-IDF • We have d documents • Document i is represented by [formula not preserved in transcript] • [symbol not preserved] is the number of occurrences of word j in document i • Term Frequency (TF) • Inverse Document Frequency (IDF)
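
The slide's formulas did not survive in the transcript, so the following is only a minimal sketch of one common reading of "unitized TF-IDF". It assumes the weight of word j in document i is its raw count times log(d / df_j), and that "unitized" means scaling each vector to unit Euclidean length. The function name `unit_tfidf` and the sample documents are hypothetical, not taken from the paper.

```python
import math
from collections import Counter

def unit_tfidf(docs):
    """Return one unit-length TF-IDF vector (as a dict) per tokenized document.

    Assumed weighting (not confirmed by the slides): count * log(d / df).
    """
    d = len(docs)
    # df[w] = number of documents that contain word w at least once
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # occurrences of each word in this document
        weights = {w: n * math.log(d / df[w]) for w, n in counts.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        # Scale to unit Euclidean length; an all-zero vector is left unchanged.
        vectors.append({w: v / norm for w, v in weights.items()} if norm else weights)
    return vectors

docs = [
    ["matrix", "density", "matrix"],
    ["density", "cluster"],
    ["cluster", "word", "word"],
]
vectors = unit_tfidf(docs)
```

In this toy input, "matrix" appears twice in the first document and nowhere else, so it receives a larger weight there than "density", which is shared with the second document.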

  8. Evaluation of Clustering Quality • 1. Purity: • 2. Entropy:
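
The definitions of purity and entropy are missing from the transcript. As a hedged illustration, the sketch below uses the standard textbook forms: purity is the fraction of documents that belong to the majority class of their cluster, and entropy is the size-weighted average of each cluster's class-label entropy. The paper may normalize these differently. The function names and the sample labels are hypothetical, and clusters are assumed to be non-empty.

```python
import math
from collections import Counter

def purity(clusters):
    """clusters: list of non-empty lists of true class labels, one list per cluster."""
    n = sum(len(c) for c in clusters)
    # Count the majority-class members of every cluster, then divide by the total.
    return sum(max(Counter(c).values()) for c in clusters) / n

def entropy(clusters):
    """Size-weighted average of each cluster's class-label entropy, in bits."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        h = -sum((k / len(c)) * math.log2(k / len(c)) for k in Counter(c).values())
        total += (len(c) / n) * h
    return total

clusters = [["sports", "sports", "news"], ["news", "news"]]
p = purity(clusters)   # (2 + 2) / 5
e = entropy(clusters)  # only the first, mixed cluster contributes
```

Higher purity and lower entropy both indicate clusters that agree more closely with the reference classes.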

  9. Evaluation of Hierarchical Clustering

  10. Rowset Partitioning and Submatrix Agglomeration (RPSA) • Two-step partitional-agglomerative algorithm • 1st step: The Partitioning Step • 2nd step: The Agglomerative Step

  11. The Partitioning Step • Define the density of submatrices • Notation: a row r, a column c, a set R of rows, a set C of columns
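
The density formula itself is not in the transcript. The sketch below assumes that the density of the submatrix selected by a row set R and a column set C is the average of its entries; a single row or column is then the special case where R or C has one element. The function `density` and the matrix `M` are hypothetical illustrations.

```python
def density(matrix, rows, cols):
    """Average entry of the submatrix picked out by `rows` and `cols`.

    Assumption: density = mean entry value (the slide's formula is not preserved).
    """
    if not rows or not cols:
        return 0.0
    total = sum(matrix[r][c] for r in rows for c in cols)
    return total / (len(rows) * len(cols))

# Rows are documents, columns are words (toy document-word weights).
M = [
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [4.0, 0.0, 0.0],
]

sub = density(M, {0, 2}, {0, 2})      # entries 1, 2, 4, 0
row0 = density(M, {0}, {0, 1, 2})     # a single row r with all columns
```

Under this reading, a dense submatrix corresponds to a group of documents that share a group of heavily weighted words, which is the kind of document-word co-cluster the title refers to.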

  12. The Partitioning Step • Generating a Leaf Cluster

  13. The Partitioning Step • Choice of Leader Documents • A document's length is the sum of the TF-IDF vector representing that document • Documents with relatively large lengths were observed to be better leader documents for the algorithm above
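
A minimal sketch of the leader heuristic described on this slide, assuming a document's length is the sum of the weights in its TF-IDF row and that longer documents are tried first as leaders. The function name and the toy matrix are hypothetical. How the algorithm actually uses a leader is not shown in the transcript.

```python
def rank_leader_candidates(matrix):
    """Return row indices ordered by decreasing sum of TF-IDF weights."""
    lengths = [sum(row) for row in matrix]
    return sorted(range(len(matrix)), key=lambda i: lengths[i], reverse=True)

# Toy document-word weights: document 1 has the largest row sum.
weights = [
    [0.2, 0.1],
    [0.9, 0.8],
    [0.5, 0.0],
]
order = rank_leader_candidates(weights)
```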

  14. The Partitioning Step • The Complete Partitioning Algorithm

  15. The Partitioning Step • Complexity Analysis • The time complexity is O(mz) • The space complexity is O(z)

  16. The Agglomerative Step • Reduce the number of clusters • The similarity measure between two clusters for merging • Flat Clustering • Hierarchical Clustering
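
The transcript does not preserve the similarity measure used for merging. The sketch below therefore shows generic greedy agglomeration with a pluggable similarity function, and uses cosine similarity between cluster centroids purely as a stand-in. Stopping at k clusters gives a flat clustering, and the recorded merge order can be read as a hierarchy. Whether RPSA derives its flat and hierarchical outputs this way is an assumption, and all names here are hypothetical.

```python
import math
from itertools import combinations

def centroid(cluster):
    """Component-wise mean of the vectors in a cluster."""
    dims = len(cluster[0])
    return [sum(v[j] for v in cluster) / len(cluster) for j in range(dims)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid_cosine(a, b):
    # Stand-in similarity; the paper's own measure is not in the transcript.
    return cosine(centroid(a), centroid(b))

def agglomerate(clusters, k, similarity=centroid_cosine):
    """Merge the most similar pair repeatedly until k clusters remain."""
    clusters = [list(c) for c in clusters]
    merges = []  # merge history, earliest merge first
    while len(clusters) > k:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: similarity(clusters[p[0]], clusters[p[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges

# Three single-document leaf clusters; the first two point in similar directions.
leaves = [[[1.0, 0.0]], [[0.9, 0.1]], [[0.0, 1.0]]]
flat, merges = agglomerate(leaves, k=2)
```

Each iteration of this sketch compares every pair of clusters, so it is meant only to illustrate the merging idea, not the complexity stated on the slides.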

  17. The Agglomerative Step • Complexity Analysis • The time complexity is O( ) • The space complexity is O( )

  18. Experimental results-Flat Clustering • Data Sets

  19. Experimental results-Flat Clustering • Results

  20. Experimental results-Flat Clustering

  21. Experimental results-Hierarchical Clustering • Data Sets

  22. Experimental results-Hierarchical Clustering • Data Sets

  23. Experimental results-Hierarchical Clustering • Results

  24. Conclusions • RPSA is comparable with or better than the best k-means run • Its performance does not degrade on small data sets • Its purity on hierarchical clustering is acceptable

  25. Personal Opinion
