
Classifying Tags Using Open Content Resources

Classifying Tags Using Open Content Resources. Simon Overell, Borkur Sigurbjornsson & Roelof van Zwol. WSDM ’09. Motivation: classify tags in Flickr as broad categories such as what, where, when and who, for easier indexing and navigation.


Presentation Transcript


  1. Classifying Tags Using Open Content Resources • Simon Overell, Borkur Sigurbjornsson & Roelof van Zwol • WSDM ’09

  2. Motivation • Classify tags in Flickr as broad categories such as what, where, when and who • Easier indexing and navigation • WordNet is usually used for classification but has limited coverage

  3. Example

  4. The ClassTag System

  5. Classifying Wikipedia Articles • Using only metadata (i.e. Categories and Templates), hence high scalability • Supervised classifier: articles as objects; WordNet noun semantic categories as classification classes; Categories and Templates as features; Support Vector Machine (SVM) as classifier
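
As a rough illustration of the setup on this slide, the sketch below encodes one article's category and template names as a sparse binary feature vector. The article data, the `cat:`/`tpl:` prefixes and the function name are hypothetical. The slides do not show the actual feature encoding, and the SVM training step is omitted.

```python
from typing import Dict, List


def article_features(categories: List[str], templates: List[str]) -> Dict[str, float]:
    """Encode one article as a sparse feature vector (feature name -> weight)."""
    features: Dict[str, float] = {}
    for name in categories:
        features["cat:" + name] = 1.0  # prefix is an assumed naming scheme
    for name in templates:
        features["tpl:" + name] = 1.0
    return features


# Hypothetical article metadata; the label would be a WordNet noun category.
example = article_features(
    categories=["Capitals in Europe", "Cities in England"],
    templates=["Infobox settlement"],
)
label = "noun.location"
print(sorted(example), label)
```

Pairs of such vectors and labels would form the training data for the supervised classifier.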

  6. Categories and Templates

  7. Categories and Templates

  8. Supervised Classification • Ground truth: all Wikipedia articles that match WordNet nouns • Data sparsity: WordNet categories underrepresented (10 out of 25); articles have very few features

  9. Reducing Data Sparsity • Using category and template network transclusion • … but noise is added

  10. System Optimization • Number of arcs traversed in: the category network; the template network • Choice of weighting function: Term Frequency (tf); Term Frequency – Inverse Document Frequency (tf-idf); Term Frequency – Inverse Layer (tf-il)
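
The two tuning knobs on this slide can be sketched together: walk the category network a fixed number of arcs and weight each category reached. The slides do not define tf-il, so the weighting below is an assumption: each occurrence of a feature contributes 1 divided by the layer (number of arcs from the article) at which it is reached. The graph data and function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def tf_il_features(
    article_categories: List[str],
    parents: Dict[str, List[str]],
    max_arcs: int,
) -> Dict[str, float]:
    """Collect categories up to max_arcs away, weighted by 1 / layer (assumed tf-il)."""
    weights: Dict[str, float] = defaultdict(float)
    frontier = list(article_categories)  # layer 1: categories on the article itself
    for layer in range(1, max_arcs + 1):
        next_frontier: List[str] = []
        for category in frontier:
            weights[category] += 1.0 / layer
            next_frontier.extend(parents.get(category, []))
        frontier = next_frontier
    return dict(weights)


# Hypothetical fragment of a category network (child -> parent categories).
parents = {
    "Cities in England": ["Cities in the United Kingdom", "England"],
    "Capitals in Europe": ["Capitals", "Europe"],
    "Cities in the United Kingdom": ["United Kingdom"],
    "England": ["United Kingdom"],
}
weights = tf_il_features(["Cities in England", "Capitals in Europe"], parents, max_arcs=3)
print(weights)
```

Bounding the walk by `max_arcs` also guarantees termination if the category network contains cycles.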

  11. Example

  12. Fine Tuning • Partitioned the ground truth into training and test sets • Criteria: at least 80% precision; maximum possible recall • Resulting optimal values: category arcs: 3, template arcs: 3, weighting function: tf-il • Precision: 87%, F1-measure: 0.696
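
The slide reports precision and F1 but not recall. Assuming the standard definition F1 = 2PR / (P + R), the implied recall can be recovered as below. The slide's figures are rounded, so the result is approximate.

```python
def recall_from_f1(precision: float, f1: float) -> float:
    """Solve F1 = 2 * P * R / (P + R) for R."""
    return f1 * precision / (2 * precision - f1)


recall = recall_from_f1(precision=0.87, f1=0.696)
print(round(recall, 2))  # 0.58
```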

  13. SVM Threshold • SVM outputs confidence with which an article is correctly classified as a member of a category • Training experiment with 250 Wikipedia articles (1 assessor)
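
A minimal sketch of thresholding on classifier confidence, assuming one confidence score per candidate category. The scores, threshold values and function name are hypothetical; the slides do not give the thresholds actually used.

```python
from typing import Dict, Optional, Tuple


def classify_with_threshold(
    scores: Dict[str, float], threshold: float
) -> Optional[Tuple[str, float]]:
    """Return the best-scoring category, or None if its confidence is too low."""
    if not scores:
        return None
    category, confidence = max(scores.items(), key=lambda item: item[1])
    if confidence < threshold:
        return None  # leave the article unclassified
    return category, confidence


# Hypothetical confidence scores for one article.
scores = {"noun.location": 0.91, "noun.person": 0.12}
lenient = classify_with_threshold(scores, threshold=0.5)
strict = classify_with_threshold(scores, threshold=0.95)
print(lenient, strict)
```

Raising the threshold leaves more articles unclassified in exchange for fewer wrong labels, which appears to be the trade-off between ClassTag and ClassTag+ summarised on slide 16.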

  14. SVM Threshold

  15. SVM Threshold

  16. Summary • Optimised for Recall (ClassTag): 39% of articles classified (664,770 Wikipedia articles) • Optimised for Precision (ClassTag+): 21% of articles classified (338,061 Wikipedia articles)

  17. Comparison with DBpedia • Experimental setup: 300 pooled articles; 3 assessors; blind assessments; 50 articles overlap • Partial agreement: 86% • Total agreement: 78%

  18. Results

  19. Classification of Flickr Tags • Tag -> Anchor Text: string matching • Anchor Text -> Wikipedia Article: number of times an anchor refers to a Wikipedia article • Wikipedia Article -> Category: output of SVM decision
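
The three mapping steps on this slide can be sketched as a small pipeline. Everything concrete here is an assumption for illustration: the anchor counts and category labels are made up, and string matching is approximated by lower-casing and removing whitespace, which the slides do not specify.

```python
from collections import Counter
from typing import Dict, Optional


def normalise(text: str) -> str:
    """Lower-case and drop whitespace (assumed matching rule)."""
    return "".join(text.lower().split())


def build_index(anchor_counts: Dict[str, Counter]) -> Dict[str, Counter]:
    """Merge link counts of all anchors that normalise to the same string."""
    index: Dict[str, Counter] = {}
    for anchor, targets in anchor_counts.items():
        index.setdefault(normalise(anchor), Counter()).update(targets)
    return index


def classify_tag(
    tag: str, index: Dict[str, Counter], article_category: Dict[str, str]
) -> Optional[str]:
    """Tag -> anchor text -> most frequently linked article -> category."""
    targets = index.get(normalise(tag))
    if not targets:
        return None
    article, _count = targets.most_common(1)[0]
    return article_category.get(article)


# Hypothetical data: how often each anchor text links to each article.
anchor_counts = {
    "London": Counter({"London": 950, "London, Ontario": 50}),
    "Big Ben": Counter({"Big Ben": 120}),
}
article_category = {
    "London": "noun.location",
    "London, Ontario": "noun.location",
    "Big Ben": "noun.artifact",
}
index = build_index(anchor_counts)
print(classify_tag("bigben", index, article_category))
```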

  20. Ambiguity • Tag -> Anchor Text: some ambiguity because tags are often lower case with no white space • Anchor Text -> Wikipedia Article: 13.4% of Anchor Text -> Wikipedia Article mappings are ambiguous; 4% of Anchor Text -> Category mappings are ambiguous • Example: George Bush -> George W. Bush, George Bush Senior; George Bush -> Person • Wikipedia Article -> Category: 5.7% of classified articles result in multiple classifications
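
The George Bush example shows why category-level ambiguity is lower than article-level ambiguity: an anchor can point to several articles that all share one category. The sketch below measures both rates on a toy mapping. Apart from the George Bush entry, the data and category labels are hypothetical.

```python
from typing import Dict, Set, Tuple


def ambiguity_rates(
    anchor_targets: Dict[str, Set[str]], article_category: Dict[str, str]
) -> Tuple[float, float]:
    """Fraction of anchors with more than one article, and with more than one category."""
    if not anchor_targets:
        return 0.0, 0.0
    article_ambiguous = 0
    category_ambiguous = 0
    for articles in anchor_targets.values():
        if len(articles) > 1:
            article_ambiguous += 1
        categories = {article_category[a] for a in articles if a in article_category}
        if len(categories) > 1:
            category_ambiguous += 1
    total = len(anchor_targets)
    return article_ambiguous / total, category_ambiguous / total


anchor_targets = {
    "George Bush": {"George W. Bush", "George Bush Senior"},
    "Java": {"Java (island)", "Java (programming language)"},  # hypothetical
    "Paris": {"Paris"},  # hypothetical
    "Flickr": {"Flickr"},  # hypothetical
}
article_category = {
    "George W. Bush": "noun.person",
    "George Bush Senior": "noun.person",
    "Java (island)": "noun.location",
    "Java (programming language)": "noun.communication",
    "Paris": "noun.location",
    "Flickr": "noun.artifact",
}
article_rate, category_rate = ambiguity_rates(anchor_targets, article_category)
print(article_rate, category_rate)
```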

  21. Example

  22. Evaluation • WordNet classification extended vocabulary coverage by 115% • Taking tag frequency into account: ClassTag classified 69.2% of Flickr tags, 22% more than the WordNet baseline
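
The slide distinguishes vocabulary coverage (distinct tags) from coverage that takes tag frequency into account. A minimal sketch of the two measures, using made-up tag counts and a hypothetical function name:

```python
from typing import Dict, Set, Tuple


def coverage(tag_counts: Dict[str, int], classified: Set[str]) -> Tuple[float, float]:
    """Return (vocabulary coverage, frequency-weighted coverage)."""
    vocabulary = sum(1 for tag in tag_counts if tag in classified) / len(tag_counts)
    total_uses = sum(tag_counts.values())
    weighted = sum(n for tag, n in tag_counts.items() if tag in classified) / total_uses
    return vocabulary, weighted


# Hypothetical tag usage counts and classifier output.
tag_counts = {"london": 600, "wedding": 300, "dsc0001": 60, "xyzzy": 40}
classified = {"london", "wedding"}
vocabulary_cov, weighted_cov = coverage(tag_counts, classified)
print(vocabulary_cov, weighted_cov)  # 0.5 0.9
```

Because frequently used tags are more likely to be classified, the weighted figure can be much higher than the vocabulary figure.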

  23. Tag distribution

  24. Multilanguage Classification • 80% of tags in English, 7% in German and 6% in Dutch • A portion of the unclassified tags may be non-English • Possible alternate language classification: run ClassTag using an alternate-language Wikipedia and a corresponding lexicon; or translate the English classification using Wikipedia’s interlanguage links
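
The second option, translating the English classification through interlanguage links, could look roughly like the sketch below. The link table, labels and function name are hypothetical. The slides present this only as a possibility, not as something that was implemented.

```python
from typing import Dict, Tuple


def translate_classification(
    english_labels: Dict[str, str],
    interlanguage_links: Dict[str, Dict[str, str]],
) -> Dict[Tuple[str, str], str]:
    """Copy each English article's category to its linked articles in other languages."""
    translated: Dict[Tuple[str, str], str] = {}
    for en_title, category in english_labels.items():
        for lang, title in interlanguage_links.get(en_title, {}).items():
            translated[(lang, title)] = category
    return translated


# Hypothetical classification output and interlanguage link table.
english_labels = {"London": "noun.location", "Dog": "noun.animal"}
interlanguage_links = {
    "London": {"de": "London", "nl": "Londen"},
    "Dog": {"de": "Haushund", "nl": "Hond"},
}
translated = translate_classification(english_labels, interlanguage_links)
print(translated[("nl", "Londen")])
```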

  25. Contributions • Classifying open content resources using their structural patterns • Presenting ClassTag, a system for classifying tags • ClassTag extends the WordNet lexicon using the structural patterns of Wikipedia

  26. Conclusion • Tuneable system for classifying Wikipedia pages • ClassTag: Nearly 40% of articles classified with a precision of 72% • ClassTag+: 21% of articles classified with a precision of 86% (equal to assessor agreement) • Nearly 70% of Flickr tags matched to WordNet categories
