

ENHANCING CLUSTER LABELING USING WIKIPEDIA

David Carmel, Haggai Roitman, Naama Zwerdling

IBM Research Lab

SIGIR’09

Document Clustering
  • A method of grouping a set of documents such that:
    • Documents within a cluster are as similar as possible.
    • Documents from different clusters are dissimilar.

[Figure: documents grouped into Cluster 1, Cluster 2, and Cluster 3]

Cluster Labeling
  • Assign each cluster a human-readable label that best represents it.
  • The traditional method picks the label from the important terms within the cluster.
    • Statistically significant terms may not make a good label.
    • A good label may not occur directly in the text.

[Figure: Cluster 1, Cluster 2, and Cluster 3 with the example labels Electronics, Bowling, and Ice Hockey]

Approach
  • Use an external resource, Wikipedia, to help label the clusters.
  • Besides the important terms extracted from the cluster, Wikipedia metadata such as page titles and categories serve as candidate labels.
Step 1: Indexing
  • Documents are parsed and tokenized.
  • Term weights are determined by tf-idf.
  • Lucene is used to build a search index so that the tf and idf values of a term t can be accessed quickly.
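
The tf-idf weighting in this step can be sketched in a few lines of Python. This is an illustration only: the slides use Lucene, and the exact tf and idf variants are not stated, so the raw-count tf, the logarithmic idf, the whitespace tokenizer, and all function names below are assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def build_index(docs):
    """Return per-document term counts (tf) and a collection-wide idf table."""
    tfs = [Counter(tokenize(d)) for d in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # each document counts once per term
    n = len(docs)
    # Assumed idf variant: log(1 + N / df).
    idf = {t: math.log(1 + n / df_t) for t, df_t in df.items()}
    return tfs, idf

def tfidf(tf, idf):
    """Weight every term of one document by tf * idf."""
    return {t: count * idf[t] for t, count in tf.items()}

docs = ["ice hockey puck", "hockey stick", "bowling ball pins"]
tfs, idf = build_index(docs)
weights = tfidf(tfs[0], idf)
```

In this toy collection "puck" occurs in one document and "hockey" in two, so "puck" receives the higher weight in the first document.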
Step 2: Clustering
  • Given the document collection D, return a set of document clusters C={C1,C2,…,Cn}.
  • Each cluster is represented by the centroid of its documents.
  • The term weights of the cluster centroid are slightly modified from plain tf-idf.
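
A minimal sketch of a cluster centroid, assuming documents are sparse term-weight dictionaries and the centroid is their plain average. The modified centroid weighting mentioned on the slide is not reproduced here; the function name and sample weights are hypothetical.

```python
from collections import defaultdict

def centroid(doc_vectors):
    """Average a list of sparse term-weight vectors (dicts of term -> weight)."""
    total = defaultdict(float)
    for vec in doc_vectors:
        for term, weight in vec.items():
            total[term] += weight
    n = len(doc_vectors)
    # Terms missing from a document contribute a weight of zero.
    return {term: weight / n for term, weight in total.items()}

cluster = [
    {"hockey": 2.0, "puck": 1.0},
    {"hockey": 1.0, "stick": 3.0},
]
c = centroid(cluster)
```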
Step 3: Important Terms Extraction
  • Given a cluster C, find a list of important terms ordered by their estimated importance.
  • This can be achieved by either:
    • Selecting the top-weighted terms from the cluster centroid.
    • Using the Jensen-Shannon divergence (JSD) to measure the distance between the cluster and the collection, and keeping the terms that contribute most to it.
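
The JSD option can be sketched as follows. Each term contributes 0.5·p·log(p/m) + 0.5·q·log(q/m) to the divergence, where p and q are the term's relative frequencies in the cluster and in the collection and m is their average. Two assumptions are made here that the slides do not state: unsmoothed maximum-likelihood frequencies are used, and only terms that are more frequent in the cluster than in the collection (p > q) are kept, so that terms absent from the cluster cannot be selected. The function name and counts are hypothetical.

```python
import math

def important_terms(cluster_counts, collection_counts, k=5):
    """Rank cluster terms by their contribution to JSD(cluster || collection)."""
    c_total = sum(cluster_counts.values())
    d_total = sum(collection_counts.values())
    scored = []
    for term, count in cluster_counts.items():
        p = count / c_total                           # frequency in the cluster
        q = collection_counts.get(term, 0) / d_total  # frequency in the collection
        if p <= q:
            continue  # assumption: keep only terms over-represented in the cluster
        m = 0.5 * (p + q)
        contrib = 0.5 * p * math.log(p / m)
        if q > 0:
            contrib += 0.5 * q * math.log(q / m)
        scored.append((contrib, term))
    scored.sort(reverse=True)
    return [term for _, term in scored[:k]]

cluster_counts = {"hockey": 8, "the": 10, "puck": 2}
collection_counts = {"hockey": 10, "the": 100, "puck": 3, "bowling": 20, "ball": 17}
top = important_terms(cluster_counts, collection_counts)
```

Here "the" is dropped because it is relatively rarer in the cluster than in the collection, while "hockey" ranks first.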
Step 4: Label Extraction
  • One way is to use the top k important terms directly.
  • The other way is to use the top k important terms to query Wikipedia. The titles and categories of the returned Wikipedia pages serve as candidate labels.
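
The two options above can be combined into one candidate list, roughly as sketched below. The search function is a toy stand-in that ranks in-memory pages by term overlap; it is not the Wikipedia search used by the authors, and the page data, field names, and function names are all hypothetical.

```python
def search_wikipedia(query_terms, pages, top_n=2):
    """Toy stand-in for a search engine: rank pages by query-term overlap."""
    scored = []
    for page in pages:
        words = set(page["text"].lower().split())
        score = len(words & set(query_terms))
        if score > 0:
            scored.append((score, page["title"], page))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [page for _, _, page in scored[:top_n]]

def candidate_labels(important_terms, pages, top_n=2):
    """Collect the important terms plus titles and categories of matching pages."""
    candidates = list(important_terms)  # option 1: the terms themselves
    for page in search_wikipedia(important_terms, pages, top_n):
        for label in [page["title"]] + page["categories"]:  # option 2
            if label not in candidates:
                candidates.append(label)
    return candidates

pages = [
    {"title": "Ice hockey", "categories": ["Team sports", "Winter sports"],
     "text": "ice hockey is played with a puck and a stick on ice"},
    {"title": "Ten-pin bowling", "categories": ["Bowling"],
     "text": "bowling uses a ball to knock down pins"},
]
labels = candidate_labels(["puck", "stick", "goalie"], pages)
```

Note that "Ice hockey" becomes a candidate even though the query terms never mention it, which is the case the traditional method cannot handle.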
Step 5: Output the Recommended Labels from Candidate Labels
  • MI (Mutual Information) Judge
    • Score each candidate label by its pointwise mutual information with the cluster's important terms.
  • SP (Score Propagation) Judge
    • Propagate each retrieved document's score to its candidate labels.
      • The document score can be the original score from the IR system or the reciprocal rank, rank(d)^-1.
  • Score Aggregation
    • Use a linear combination to combine the two judges above.
    • The recommended labels are the top-ranked labels.
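
A simplified sketch of the two judges and their linear combination follows. It makes several assumptions that are not in the slides: the SP judge simply sums 1/rank(d) over the result documents that list a label, pointwise mutual information is estimated from document co-occurrence counts in a small reference corpus, unseen label-term pairs score 0, and the mixing weight `beta` is a hypothetical parameter name. Raw scores are combined without normalization, which a real system would likely need.

```python
import math

def sp_scores(ranked_docs):
    """Score Propagation: each label receives 1/rank(d) from every result
    document d that lists it as its title or as a category."""
    scores = {}
    for rank, doc in enumerate(ranked_docs, start=1):
        for label in {doc["title"], *doc["categories"]}:
            scores[label] = scores.get(label, 0.0) + 1.0 / rank
    return scores

def pmi(label, term, corpus):
    """Pointwise mutual information from document co-occurrence counts."""
    n = len(corpus)
    n_l = sum(1 for d in corpus if label in d)
    n_t = sum(1 for d in corpus if term in d)
    n_lt = sum(1 for d in corpus if label in d and term in d)
    if n_lt == 0:
        return 0.0  # assumption: unseen pairs are treated as neutral
    return math.log((n_lt * n) / (n_l * n_t))

def mi_scores(labels, terms, corpus):
    """Average PMI of each label with the cluster's important terms."""
    return {l: sum(pmi(l, t, corpus) for t in terms) / len(terms) for l in labels}

def combine(mi, sp, beta=0.5):
    """Linear combination of the two judges."""
    return {l: beta * mi.get(l, 0.0) + (1 - beta) * sp.get(l, 0.0)
            for l in set(mi) | set(sp)}

ranked_docs = [
    {"title": "Ice hockey", "categories": ["Team sports"]},
    {"title": "Hockey puck", "categories": ["Ice hockey equipment", "Ice hockey"]},
]
corpus = [  # each reference document is a set of phrases it mentions
    {"Ice hockey", "puck", "stick"},
    {"Ice hockey", "puck"},
    {"Team sports", "stick", "ball"},
    {"Team sports", "ball"},
]
sp = sp_scores(ranked_docs)
mi = mi_scores(list(sp), ["puck", "stick"], corpus)
combined = combine(mi, sp)
best = max(combined, key=combined.get)
```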
Data Collection
  • 20 Newsgroups
    • 20 clusters × 1,000 documents per cluster
  • Open Directory Project (ODP)
    • 100 clusters × 100 documents per cluster
  • The ground truth: a suggested label is considered correct if it is
    • The correct label itself.
    • An inflection of the correct label.
    • A WordNet synonym of the correct label.
Evaluation Metrics

[Figure: ranked candidate labels (label1 to label4) for clusters c1 and c2, with the correct label marked in each list]
Parameters
  • The important term selection method (JSD, ctf-cdf-idf, MI, chi-square).
  • The number of important terms for querying Wikipedia.
  • The number of Wikipedia results to be used for label extraction.
  • The judges used for candidate evaluation.
Evaluation 1
  • The effectiveness of using Wikipedia to enhance cluster labeling.
Evaluation 2
  • Candidate label extraction
Evaluation 3
  • Judge effectiveness
Evaluation 4.1
  • The effect of cluster coherency on label quality
  • Testing on "noisy clusters":
    • For a noise level p in [0, 1], each document in a cluster is swapped with a document from another cluster with probability p.
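
The noise procedure can be sketched as below. The slides do not say how the swap partner is chosen, so picking a uniformly random document from a uniformly random other cluster is an assumption, as are the function name and the fixed seed. Clusters are assumed to be non-empty.

```python
import random

def add_noise(clusters, p, seed=0):
    """Swap each document with a random document from another cluster
    with probability p. Cluster sizes are preserved."""
    rng = random.Random(seed)
    noisy = [list(c) for c in clusters]  # copy so the input stays untouched
    for i, cluster in enumerate(noisy):
        for pos in range(len(cluster)):
            if len(noisy) > 1 and rng.random() < p:
                # Assumption: uniform choice of partner cluster and document.
                j = rng.choice([x for x in range(len(noisy)) if x != i])
                other_pos = rng.randrange(len(noisy[j]))
                cluster[pos], noisy[j][other_pos] = noisy[j][other_pos], cluster[pos]
    return noisy

clusters = [["a1", "a2", "a3"], ["b1", "b2", "b3"], ["c1", "c2", "c3"]]
clean = add_noise(clusters, p=0.0)
noisy = add_noise(clusters, p=0.5)
```

At p = 0 the clusters are returned unchanged; at higher p the cluster sizes stay the same but their contents mix.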
Evaluation 4.2
  • The effect of cluster coherency on label quality
Conclusion
  • A general framework for the cluster labeling problem is proposed.
  • Wikipedia metadata can boost the performance of cluster labeling.
  • The proposed method is resilient to noisy clusters.