Enhancing cluster labeling using wikipedia

ENHANCING CLUSTER LABELING USING WIKIPEDIA

David Carmel, Haggai Roitman, Naama Zwerdling

IBM Research Lab

SIGIR’09


Document Clustering

  • A method of grouping a set of documents such that:

    • Documents within a cluster are as similar as possible.

    • Documents from different clusters are as dissimilar as possible.

[Figure: documents grouped into three clusters (Cluster 1, Cluster 2, Cluster 3).]


Cluster Labeling

  • Assign each cluster a human-readable label that best represents it.

  • The traditional method picks the label from the important terms within the cluster.

    • The statistically significant terms may not be a good label.

    • A good label may not occur directly in the text.

[Figure: Cluster 1, Cluster 2, and Cluster 3 with the example labels Electronics, Bowling, and Ice Hockey.]


Approach

  • Use an external resource (Wikipedia) to help with cluster labeling.

  • Besides the important terms extracted from the cluster, Wikipedia metadata such as page titles and categories serve as candidate labels.


A General Framework



Step 1: Indexing

  • Documents are parsed and tokenized.

  • Term weights are determined by tf-idf.

  • Lucene is used to generate a search index so that the tf and idf values of a term t can be accessed quickly.
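
The indexing step can be illustrated with a minimal in-memory sketch. It does not use Lucene; the whitespace tokenizer, the smoothed idf variant `log(N / df) + 1`, and the names `tokenize`, `build_index`, and `tfidf` are assumptions made for illustration only.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return text.lower().split()

def build_index(docs):
    """Return per-document term frequencies and collection-wide idf values."""
    tfs = [Counter(tokenize(d)) for d in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # each document counts once per term
    n = len(docs)
    # Assumed idf variant; Lucene's own formula differs in detail.
    idf = {t: math.log(n / df_t) + 1.0 for t, df_t in df.items()}
    return tfs, idf

def tfidf(tfs, idf, doc_id, term):
    """tf-idf weight of `term` in document `doc_id` (0 if the term is absent)."""
    return tfs[doc_id][term] * idf.get(term, 0.0)

# Toy collection, invented for this example.
docs = ["ice hockey puck", "hockey stick and puck", "bowling ball and pins"]
tfs, idf = build_index(docs)
```

Once the index is built, both `tfs[doc_id][term]` and `idf[term]` are constant-time lookups, which is the property the slide asks of the search index.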


Step 2: Clustering

  • Given the document collection D, return a set of document clusters C={C1,C2,…,Cn}.

  • Each cluster is represented by the centroid of its documents.

  • The term weight of the cluster's centroid is slightly modified:


Step 3: Important Terms Extraction

  • Given a cluster, find a list of important terms ordered by their estimated importance.

  • This can be achieved by

    • Selecting the top-weighted terms from the cluster centroid.

    • Using the Jensen-Shannon divergence (JSD) to measure the distance between the cluster and the collection.
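
One way to turn the JSD option into a term ranking is sketched below. It assumes that a term's importance is its individual contribution to the JSD between the cluster's term distribution and the collection's, and that only terms over-represented in the cluster are kept; both choices, and all function names, are assumptions of this sketch rather than details given in the slides.

```python
import math
from collections import Counter

def distribution(token_lists):
    """Maximum-likelihood term distribution over a list of tokenized documents."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def jsd_contributions(p, q):
    """Per-term contribution to JSD(p, q) with base-2 logs; the values sum to the JSD."""
    contrib = {}
    for t in set(p) | set(q):
        pt, qt = p.get(t, 0.0), q.get(t, 0.0)
        m = 0.5 * (pt + qt)
        c = 0.0
        if pt > 0:
            c += 0.5 * pt * math.log2(pt / m)
        if qt > 0:
            c += 0.5 * qt * math.log2(qt / m)
        contrib[t] = c
    return contrib

def important_terms(cluster_docs, collection_docs, k):
    """Top-k terms that are over-represented in the cluster, ranked by contribution."""
    p = distribution(cluster_docs)
    q = distribution(collection_docs)
    contrib = jsd_contributions(p, q)
    over = [t for t in p if p[t] > q.get(t, 0.0)]
    return sorted(over, key=lambda t: (-contrib[t], t))[:k]

# Toy data, invented for this example.
cluster = [["hockey", "puck", "the"], ["hockey", "ice", "the"]]
collection = cluster + [["bowling", "ball", "the"], ["bowling", "pins", "the"]]
top = important_terms(cluster, collection, 1)
```

In this toy data the word "the" is equally frequent in the cluster and in the collection, so it contributes nothing to the divergence and is never selected, while "hockey" ranks first.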


Step 4: Label Extraction

  • One way is to use the top k important terms directly as candidate labels.

  • The other way is to use the top k important terms as a query against Wikipedia. The titles and categories of the returned Wikipedia documents serve as candidate labels.
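
The two sources of candidate labels can be sketched as follows. The `WIKI` dictionary is a hypothetical three-page stand-in for a real Wikipedia index, and `search` uses a toy term-overlap score instead of a real retrieval model; neither comes from the slides.

```python
# Hypothetical stand-in for a Wikipedia index: title -> (text, categories).
WIKI = {
    "Ice hockey": ("ice hockey puck stick rink team",
                   ["Team sports", "Winter sports"]),
    "Hockey puck": ("puck rubber disk ice hockey",
                    ["Ice hockey equipment"]),
    "Ten-pin bowling": ("bowling ball pins lane strike",
                        ["Bowling"]),
}

def search(query_terms, top_n):
    """Rank pages by how many query terms occur in their text (toy scoring)."""
    scored = []
    for title, (text, _categories) in WIKI.items():
        words = set(text.split())
        score = sum(1 for t in query_terms if t in words)
        if score > 0:
            scored.append((score, title))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _score, title in scored[:top_n]]

def candidate_labels(important_terms, top_n=2, use_wikipedia=True):
    """Important terms, optionally extended with titles and categories of the hits."""
    candidates = list(important_terms)
    if use_wikipedia:
        for title in search(important_terms, top_n):
            for label in [title] + WIKI[title][1]:
                if label not in candidates:
                    candidates.append(label)
    return candidates

labels = candidate_labels(["hockey", "puck", "ice"])
```

Here "Ice hockey" and "Team sports" become candidates even though neither string is one of the query terms, which is the case the Wikipedia route is meant to cover.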


Step 5: Output the Recommended Labels from Candidate Labels

  • MI (Mutual Information) Judge

    • Score each candidate label by its pointwise mutual information with the cluster's important terms.

  • SP (Score Propagation) Judge

    • Propagate the document score to the candidate label.

      • The document score can be the original score from the IR system or the reciprocal rank, rank(d)^-1.

  • Score Aggregation

    • Use a linear combination to combine the two judges above.

    • The recommended labels are the top-ranked labels.
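
A simplified sketch of score propagation and aggregation follows. It assumes each retrieved document passes its reciprocal-rank score directly to the labels extracted from it, that MI scores are already available as input, and that both judges are max-normalized before mixing with a weight `beta`. The normalization, `beta`, and all names are assumptions of this sketch.

```python
def sp_scores(ranked_docs):
    """ranked_docs: one list of labels per retrieved document, best document first.
    Each label accumulates 1 / rank(d) from every document d it was extracted from."""
    scores = {}
    for rank, labels in enumerate(ranked_docs, start=1):
        for label in set(labels):
            scores[label] = scores.get(label, 0.0) + 1.0 / rank
    return scores

def normalize(scores):
    """Scale scores so the best one is 1.0 (assumed, to make the judges comparable)."""
    top = max(scores.values(), default=0.0)
    return {k: (v / top if top > 0 else 0.0) for k, v in scores.items()}

def aggregate(mi, sp, beta=0.5):
    """Rank labels by beta * MI + (1 - beta) * SP; ties are broken alphabetically."""
    mi, sp = normalize(mi), normalize(sp)
    combined = {
        label: beta * mi.get(label, 0.0) + (1.0 - beta) * sp.get(label, 0.0)
        for label in set(mi) | set(sp)
    }
    return sorted(combined, key=lambda label: (-combined[label], label))

# Toy inputs, invented for this example.
ranked_docs = [["Ice hockey", "Team sports"], ["Hockey puck", "Team sports"]]
sp = sp_scores(ranked_docs)
mi = {"Ice hockey": 2.0, "Team sports": 0.5, "Hockey puck": 1.0}
recommended = aggregate(mi, sp, beta=0.5)
```

With `beta=0.0` only the SP judge counts and "Team sports" ranks first because two documents support it; with equal weights the MI judge moves "Ice hockey" to the top.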


Data Collection

  • 20 Newsgroups

    • 20 clusters × 1000 documents per cluster

  • Open Directory Project (ODP)

    • 100 clusters × 100 documents per cluster

  • Ground truth: a candidate label counts as correct if it is

    • the correct label itself,

    • an inflection of the correct label, or

    • a WordNet synonym of the correct label.


Evaluation Metrics

  • Match@K: the fraction of clusters for which at least one of the top K recommended labels is correct.

    • Ex: Match@K = 1/2 = 0.5

  • Mean Reciprocal Rank (MRR@K): the reciprocal rank of the first correct label among the top K (0 if there is none), averaged over all clusters.

    • Ex: MRR@K = ((1/2) + (1/3))/2 = 0.416…

[Figure: ranked candidate labels label1 to label4 for two clusters, c1 and c2, with the correct labels marked.]
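
The two metrics, Match@K and MRR@K, can be sketched in a few lines. The example data assumes that the first correct label sits at rank 2 for c1 and at rank 3 for c2, which is one configuration consistent with the arithmetic on the slide; the values of K used below are likewise chosen for illustration.

```python
def match_at_k(ranked, correct, k):
    """Fraction of clusters with at least one correct label among the top k."""
    hits = sum(
        1 for c, labels in ranked.items()
        if any(label in correct[c] for label in labels[:k])
    )
    return hits / len(ranked)

def mrr_at_k(ranked, correct, k):
    """Mean over clusters of 1 / rank of the first correct label in the top k."""
    total = 0.0
    for c, labels in ranked.items():
        for rank, label in enumerate(labels[:k], start=1):
            if label in correct[c]:
                total += 1.0 / rank
                break  # only the first correct label counts
    return total / len(ranked)

# Assumed configuration: c1 is correct at rank 2, c2 at rank 3.
ranked = {
    "c1": ["label1", "label2", "label3", "label4"],
    "c2": ["label1", "label2", "label3", "label4"],
}
correct = {"c1": {"label2"}, "c2": {"label3"}}

m2 = match_at_k(ranked, correct, 2)   # 0.5: only c1 has a correct label in its top 2
mrr4 = mrr_at_k(ranked, correct, 4)   # ((1/2) + (1/3)) / 2
```

Match@K only asks whether a correct label appears at all, while MRR@K also rewards placing it higher in the list.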


Parameters

  • The important term selection method (JSD, ctf-cdf-idf, MI, chi-square).

  • The number of important terms for querying Wikipedia.

  • The number of Wikipedia results to be used for label extraction.

  • The judges used for candidate evaluation.


Evaluation 1

  • The effectiveness of using Wikipedia to enhance cluster labeling.


Evaluation 2

  • Candidate label extraction


Evaluation 3

  • Judge effectiveness


Evaluation 4.1

  • The Effect of Clusters' Coherency on Label Quality

  • Testing on a "noisy cluster":

    • For a noise level p in [0, 1], each document in a cluster is swapped, with probability p, with a document from another cluster.
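
A minimal sketch of this noise procedure is shown below. The slides do not say how the swap partner is chosen, so picking a uniformly random document from a uniformly random other cluster is an assumption, as are the function name `add_noise` and the `seed` parameter.

```python
import random

def add_noise(clusters, p, seed=0):
    """Return a copy of `clusters` in which each document is swapped, with
    probability p, with a random document from a different cluster."""
    rng = random.Random(seed)
    noisy = [list(c) for c in clusters]
    for i in range(len(noisy)):
        for a in range(len(noisy[i])):
            if rng.random() >= p:
                continue
            # Other non-empty clusters that can supply a swap partner.
            others = [j for j in range(len(noisy)) if j != i and noisy[j]]
            if not others:
                continue
            j = rng.choice(others)
            b = rng.randrange(len(noisy[j]))
            noisy[i][a], noisy[j][b] = noisy[j][b], noisy[i][a]
    return noisy

# Toy clusters of document ids, invented for this example.
clusters = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
noisy = add_noise(clusters, p=0.3, seed=42)
```

Because documents are swapped rather than moved, every cluster keeps its original size, so label quality at different noise levels is compared on clusters of equal size.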


Evaluation 4.2

  • The Effect of Clusters' Coherency on Label Quality


Conclusion

  • Proposed a general framework for solving the cluster labeling problem.

  • The metadata of Wikipedia can boost the performance of cluster labeling.

  • The proposed method has good resiliency to noisy clusters.

