Presentation Transcript

  1. ENHANCING CLUSTER LABELING USING WIKIPEDIA David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab SIGIR’09

  2. Document Clustering • A method of grouping a set of documents such that: • Documents within a cluster are as similar as possible. • Documents from different clusters are as dissimilar as possible. [Diagram: three separate document clusters]

  3. Cluster Labeling • Assign each cluster a human-readable label that best represents the cluster. • The traditional method picks the label from the important terms within the cluster. • Statistically significant terms may not make good labels. • A good label may not occur directly in the text. [Diagram: three clusters labeled Electronics, Bowling, and Ice Hockey]

  4. Approach • Utilize an external resource to help with cluster labeling. • Besides the important terms extracted from the cluster, Wikipedia metadata such as titles and categories serve as candidate labels.

  5. A General Framework [Pipeline diagram: indexing → clustering → important terms extraction → candidate label extraction → label judgment → recommended labels]

  6. Step 1: Indexing • Documents are parsed and tokenized. • Term weights are determined by tf-idf. • Lucene is used to build a search index so that the tf and idf values of a term t can be accessed quickly.
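A minimal Python sketch of the tf-idf weighting this index supports; the whitespace tokenizer and the log idf form are assumptions, not necessarily the paper's exact choices:

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; the paper's parsing is not specified.
    return text.lower().split()

def build_index(docs):
    """Per-document term frequencies plus collection document frequencies."""
    tfs = [Counter(tokenize(d)) for d in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())
    return tfs, df

def tf_idf(term, doc_id, tfs, df, n_docs):
    """tf-idf weight of a term in a document."""
    idf = math.log(n_docs / df[term]) if df[term] else 0.0
    return tfs[doc_id][term] * idf
```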

  7. Step 2: Clustering • Given the document collection D, return a set of document clusters C = {C1, C2, …, Cn}. • A cluster is represented by the centroid of its documents. • The term weights of the centroid are a slightly modified version of the average tf-idf weights of the cluster's documents.
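A sketch of the centroid representation, reusing tf_idf from the indexing sketch above; plain averaging is shown, without the slide's modification:

```python
from collections import defaultdict

def centroid(cluster_doc_ids, tfs, df, n_docs):
    """Represent a cluster by the average tf-idf vector of its documents."""
    acc = defaultdict(float)
    for doc_id in cluster_doc_ids:
        for term in tfs[doc_id]:
            acc[term] += tf_idf(term, doc_id, tfs, df, n_docs)
    return {t: w / len(cluster_doc_ids) for t, w in acc.items()}
```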

  8. Step 3: Important Terms Extraction • Given a cluster, find a list of important terms ordered by their estimated importance. • This can be achieved by: • Selecting the top-weighted terms from the cluster centroid. • Using the Jensen-Shannon divergence (JSD) to measure the distance between the cluster and the collection.
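A sketch of JSD-based term scoring: each term is ranked by its contribution to the Jensen-Shannon divergence between the cluster's term distribution and the collection's. The maximum-likelihood estimates (no smoothing) are an assumption:

```python
import math

def jsd_term_scores(cluster_counts, collection_counts):
    """Score each cluster term by its contribution to the Jensen-Shannon
    divergence between cluster and collection term distributions."""
    c_total = sum(cluster_counts.values())
    d_total = sum(collection_counts.values())
    scores = {}
    for t, c in cluster_counts.items():
        p = c / c_total                             # P(t | cluster)
        q = collection_counts.get(t, 0) / d_total   # P(t | collection)
        m = 0.5 * (p + q)
        contrib = 0.5 * p * math.log(p / m)
        if q > 0:
            contrib += 0.5 * q * math.log(q / m)
        scores[t] = contrib
    return sorted(scores.items(), key=lambda kv: -kv[1])
```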

  9. Step 4: Label Extraction • One way is to use the top k important terms directly as labels. • The other way is to query Wikipedia with the top k important terms; the titles and the sets of categories of the returned Wikipedia documents serve as candidate labels.
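The paper queries a local Wikipedia index; purely as an illustration, the same idea against the public MediaWiki search API (the endpoint and parameters below are standard MediaWiki, but this is not the authors' setup):

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def wikipedia_candidates(important_terms, k=5):
    """Search Wikipedia with the top important terms and collect the
    titles and categories of the top-k hits as candidate labels."""
    search = requests.get(API, params={
        "action": "query", "list": "search",
        "srsearch": " ".join(important_terms),
        "srlimit": k, "format": "json",
    }).json()
    candidates = []
    for hit in search["query"]["search"]:
        title = hit["title"]
        candidates.append(title)
        cats = requests.get(API, params={
            "action": "query", "prop": "categories",
            "titles": title, "cllimit": "max", "format": "json",
        }).json()
        page = next(iter(cats["query"]["pages"].values()))
        candidates.extend(c["title"] for c in page.get("categories", []))
    return candidates
```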

  10. Step 5: Output the Recommended Labels from the Candidate Labels • MI (Mutual Information) judge • Scores each candidate label by its pointwise mutual information with the cluster's important terms. • SP (Score Propagation) judge • Propagates document scores to the candidate labels. • A document's score can be the original score from the IR system or the reciprocal rank, rank(d)^-1. • Score aggregation • A linear combination combines the two judges. • The recommended labels are the top-ranked labels.
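A sketch of the MI judge's pointwise-mutual-information score over document occurrence counts, and the linear combination of the two judges; the estimator and the mixing weight lam are assumptions:

```python
import math

def pmi(n_label, n_term, n_both, n_docs):
    """Pointwise mutual information between a candidate label and an
    important term, estimated from document occurrence counts."""
    if n_both == 0:
        return float("-inf")
    return math.log((n_both / n_docs) / ((n_label / n_docs) * (n_term / n_docs)))

def combined_score(mi_score, sp_score, lam=0.5):
    # Linear combination of the MI and SP judges; lam is a tunable weight.
    return lam * mi_score + (1 - lam) * sp_score
```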

  11. Data Collection • 20 Newsgroups: 20 clusters × 1000 documents per cluster. • Open Directory Project (ODP): 100 clusters × 100 documents per cluster. • The ground truth for a cluster consists of: • The correct label itself. • The correct label's inflections. • The correct label's WordNet synonyms.
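A sketch of the matching rule this ground truth implies, using NLTK's WordNet; lemmatization stands in for full inflection handling, which is an approximation:

```python
from nltk.corpus import wordnet as wn          # requires nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

_lemmatizer = WordNetLemmatizer()

def synonyms(phrase):
    """WordNet synonyms of a (possibly multiword) label."""
    key = phrase.lower().replace(" ", "_")
    return {l.name().replace("_", " ") for s in wn.synsets(key) for l in s.lemmas()}

def is_correct(label, truth):
    """True if the label matches the ground truth, an inflection of it
    (approximated by lemmatization), or one of its WordNet synonyms."""
    label, truth = label.lower(), truth.lower()
    return (label == truth
            or _lemmatizer.lemmatize(label) == _lemmatizer.lemmatize(truth)
            or label in synonyms(truth))
```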

  12. Evaluation Metrics • Match@K: the fraction of clusters for which a correct label appears among the top K recommended labels. Ex: Match@4 = 1/2 = 0.5 when only one of two clusters has a correct label in its top 4. • Mean Reciprocal Rank (MRR@K): the reciprocal rank of the highest-ranked correct label within the top K, averaged over clusters. Ex: MRR@4 = ((1/2) + (1/3)) / 2 ≈ 0.416 when the first correct labels appear at ranks 2 and 3. [Diagram: two clusters c1, c2 with ranked labels label1–label4 and correct labels marked]
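A sketch of both metrics over per-cluster ranked label lists; is_correct is the ground-truth matcher (e.g., the WordNet sketch above):

```python
def match_at_k(ranked_labels_per_cluster, is_correct, k):
    """Fraction of clusters with at least one correct label in the top k."""
    hits = sum(
        any(is_correct(c, l) for l in labels[:k])
        for c, labels in ranked_labels_per_cluster.items()
    )
    return hits / len(ranked_labels_per_cluster)

def mrr_at_k(ranked_labels_per_cluster, is_correct, k):
    """Mean reciprocal rank of the first correct label within the top k."""
    total = 0.0
    for c, labels in ranked_labels_per_cluster.items():
        for rank, l in enumerate(labels[:k], start=1):
            if is_correct(c, l):
                total += 1.0 / rank
                break
    return total / len(ranked_labels_per_cluster)
```

On the slide's MRR example (two clusters, first correct labels at ranks 2 and 3), mrr_at_k returns ((1/2) + (1/3)) / 2 ≈ 0.416.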

  13. Parameters • The important-term selection method (JSD, ctf-cdf-idf, MI, chi-square). • The number of important terms used to query Wikipedia. • The number of Wikipedia results used for label extraction. • The judges used for candidate evaluation.

  14. Evaluation 1 • The effectiveness of using Wikipedia to enhance cluster labeling.

  15. Evaluation 2 • Candidate label extraction

  16. Evaluation 3 • Judge effectiveness

  17. Evaluation 4.1 • The Effect of Clusters' Coherency on Label Quality • Testing on "noisy clusters": • For a noise level p (in [0, 1]), each document in a cluster is swapped, with probability p, with a document from another cluster.
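A sketch of this noising procedure; how the swap partner is chosen (uniformly at random, at least two clusters assumed) is an assumption:

```python
import random

def add_noise(clusters, p):
    """With probability p, swap each document with a randomly chosen
    document from a randomly chosen other cluster."""
    clusters = [list(c) for c in clusters]
    for i, cluster in enumerate(clusters):
        for j in range(len(cluster)):
            if random.random() < p:
                other = random.choice([k for k in range(len(clusters)) if k != i])
                m = random.randrange(len(clusters[other]))
                cluster[j], clusters[other][m] = clusters[other][m], cluster[j]
    return clusters
```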

  18. Evaluation 4.2 • The Effect of Clusters' Coherency on Label Quality

  19. Conclusion • Proposed a general framework for solving the cluster labeling problem. • Wikipedia metadata can boost the performance of cluster labeling. • The proposed method shows good resiliency to noisy clusters.