Document Clustering with Cluster Refinement and Model Selection Capabilities

Advisor: Dr. Hsu. Presenter: Shu-Ya Li. Authors: Xin Liu, Yihong Gong, Wei Xu, Shenghuo Zhu. SIGIR 2002, pages 191-198.

Presentation Transcript


  1. Document Clustering with Cluster Refinement and Model Selection Capabilities • Advisor: Dr. Hsu • Presenter: Shu-Ya Li • Authors: Xin Liu, Yihong Gong, Wei Xu, Shenghuo Zhu • SIGIR 2002, pages 191-198

  2. Outline • Motivation • Objective • Method • Experimental Result • Conclusion • Personal Opinions

  3. Motivation • The problems and limitations of traditional text search engines: • The user must formulate the query using keywords. • The search is narrowly specified: it returns only documents matching the user’s query. • A query often returns hundreds, or even thousands, of hits.

  4. Objective • We propose a document clustering method that strives to achieve: • a high accuracy of document clustering • the capability of estimating the number of clusters in the document corpus (i.e. the model selection capability)

  5. Method • Feature Set • Term frequencies (TF) • Named entities (NE) • Term pairs (TP) • Example: the documents reporting the Clinton-Lewinsky scandal • Common named entities: "Clinton", "Lewinsky", "Ken Starr", "Linda Tripp", etc. • Word pairs: "grand jury", "independent counsel", "supreme court"
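The term-frequency and term-pair features on this slide can be illustrated with a minimal sketch. The function names `term_frequencies` and `term_pairs` are hypothetical, and treating a term pair as two adjacent words is an assumption; the slide does not say how pairs are formed. Named-entity extraction needs an external tagger and is left out.

```python
from collections import Counter


def term_frequencies(tokens):
    """Count how often each term occurs in one document."""
    return Counter(tokens)


def term_pairs(tokens):
    """Count adjacent word pairs, e.g. ('grand', 'jury').

    Adjacency is an assumption made for this sketch only.
    """
    return Counter(zip(tokens, tokens[1:]))


# Toy document built from the word pairs quoted on the slide.
doc = "the grand jury heard the independent counsel before the grand jury adjourned"
tokens = doc.lower().split()

tf = term_frequencies(tokens)
tp = term_pairs(tokens)

print(tf.most_common(3))
print(tp[("grand", "jury")], tp[("independent", "counsel")])
```

In a full pipeline each document would probably become one vector that concatenates these counts with the named-entity counts, but that layout is also an assumption.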

  6. Method - self-refinement process • GMM + EM algorithm • Apply the iterative voting scheme to refine the document clusters.

  7. Method - self-refinement process • Identify discriminative features F = {f1, f2, . . . , fΛ} along with cluster labels S = {σ1, σ2, . . . , σΛ} • Define the discriminative feature metric DFM(fi) • Compare the new document cluster set with C. • If the result converges → terminate the process. • Otherwise → set C to the new cluster set, and go to Step 2.
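The refinement loop can be sketched roughly as follows. The transcript does not give the formula for DFM(fi), so the score below (log ratio of smoothed in-cluster to out-of-cluster document frequency) is an assumed stand-in, not the authors' metric. The names `discriminative_features` and `refine`, the `threshold` value, and the majority-vote rule are likewise hypothetical.

```python
import math
from collections import Counter


def discriminative_features(docs, labels, k, threshold=1.0):
    """Map each discriminative feature to the cluster it points at.

    ASSUMPTION: the score is a smoothed log ratio of in-cluster to
    out-of-cluster document frequency, standing in for DFM(f_i).
    """
    sizes = Counter(labels)
    df = [Counter() for _ in range(k)]  # per-cluster document frequency
    for doc, lab in zip(docs, labels):
        df[lab].update(set(doc))
    vocab = set().union(*(set(d) for d in docs))
    best = {}  # feature -> (cluster, score)
    for f in vocab:
        for c in range(k):
            n_in, n_out = sizes[c], len(docs) - sizes[c]
            if n_in == 0 or n_out == 0:
                continue
            out_count = sum(df[j][f] for j in range(k) if j != c)
            g_in = (df[c][f] + 0.5) / (n_in + 1)
            g_out = (out_count + 0.5) / (n_out + 1)
            score = math.log(g_in / g_out)
            if score > threshold and (f not in best or score > best[f][1]):
                best[f] = (c, score)
    return {f: c for f, (c, _) in best.items()}


def refine(docs, labels, k, max_iter=20):
    """Re-label documents by feature voting until the labels stop changing."""
    labels = list(labels)
    for _ in range(max_iter):
        feats = discriminative_features(docs, labels, k)
        new_labels = []
        for doc, old in zip(docs, labels):
            votes = Counter(feats[f] for f in set(doc) if f in feats)
            # Keep the old label when no discriminative feature votes.
            new_labels.append(votes.most_common(1)[0][0] if votes else old)
        if new_labels == labels:  # converged
            break
        labels = new_labels
    return labels


docs = [
    ["clinton", "lewinsky", "jury"],
    ["clinton", "starr", "jury"],
    ["clinton", "lewinsky", "starr"],
    ["rocket", "launch", "nasa"],
    ["nasa", "launch", "orbit"],
    ["rocket", "orbit", "nasa"],
]
initial = [0, 0, 1, 1, 1, 1]  # third document starts in the wrong cluster
refined = refine(docs, initial, k=2)
print(refined)
```

In this toy run the initial labels are made up by hand; in the method on the slides they would come from the GMM + EM clustering.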

  8. Method - Model Selection • Measure the similarity between two clustering results C and C’. • The model selection algorithm: (1) Guess a range (Rl, Rh) for the possible number of document clusters in the data. (2) Set k = Rl. (3) Cluster the document corpus into k clusters. (4) Compute the similarity between each pair of the results, and take the average over all pairs. (5) If k < Rh, set k = k + 1 and go to Step 3; otherwise, go to Step 6. (6) Select the k that yields the largest average similarity.
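The selection loop might look like the sketch below. Several parts are assumptions: the similarity formula is missing from the transcript, so normalized mutual information is used as a stand-in; the corpus is clustered `runs` times per k so that there are pairs of results to compare; and `toy_cluster` is a hypothetical deterministic stub standing in for the real GMM + EM clustering.

```python
import math
from collections import Counter
from itertools import combinations


def normalized_mi(a, b):
    """Mutual information of two labelings divided by max(H(a), H(b)).

    ASSUMPTION: stands in for the similarity measure between C and C'.
    """
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(c / n * math.log(c * n / (pa[x] * pb[y]))
             for (x, y), c in pab.items())
    ha = -sum(c / n * math.log(c / n) for c in pa.values())
    hb = -sum(c / n * math.log(c / n) for c in pb.values())
    denom = max(ha, hb)
    return mi / denom if denom > 0 else 1.0


def select_k(data, cluster, r_low, r_high, runs=5):
    """Return the k in [r_low, r_high] whose repeated runs agree the most.

    `cluster(data, k, seed)` must return one label per item; runs >= 2.
    """
    best_k, best_score = None, -1.0
    for k in range(r_low, r_high + 1):
        results = [cluster(data, k, seed) for seed in range(runs)]
        pairs = list(combinations(results, 2))
        score = sum(normalized_mi(a, b) for a, b in pairs) / len(pairs)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


def toy_cluster(data, k, seed):
    """Hypothetical stub: stable for k == 2, seed-dependent otherwise."""
    if k == 2:
        flip = seed % 2  # same partition, labels swapped on odd seeds
        return [(0 if x < 5 else 1) ^ flip for x in data]
    return [(i * (seed + 1)) % k for i in range(len(data))]


data = [1, 2, 3, 7, 8, 9]
k, score = select_k(data, toy_cluster, 2, 4)
print(k, round(score, 3))
```

The idea behind the loop is that when k matches the structure of the data, independent runs tend to produce the same partition, so their average pairwise similarity is high.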

  9. Experimental Result - Document Clustering Evaluation • GMM + EM algorithm • ABC+CNN-01-13-18-32-48-70-71-77-86 [news organization - news event category - number of reports]

  10. Experimental Result - Model Selection Evaluation • Compared with the BIC-based model selection method

  11. Conclusion • The given document corpus is clustered accurately by using the GMM together with the EM algorithm. • The model selection capability is achieved by guessing a value C for the number of clusters N.

  12. Personal Opinions • Advantage • high accuracy of document clustering • the model selection capability • Drawback • … • Application • …
