
Web Taxonomy Integration through Co-Bootstrapping

Dell Zhang and Wee Sun Lee, National University of Singapore. SIGIR'04.




Presentation Transcript


  1. Web Taxonomy Integration through Co-Bootstrapping Dell Zhang, National University of Singapore; Wee Sun Lee, National University of Singapore. SIGIR'04

  2. Introduction

  3. Problem Statement • First taxonomy: • Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home, EverQuest Addict, Warcraft III Clan • Games > Strategy: Shogun: Total War, Warcraft III Clan • Second taxonomy: • Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home • Games > Strategy: Shogun: Total War • Games > Online: EverQuest Addict, Warcraft III Clan • Games > Single-Player: Warcraft III Clan

  4. Possible Approach: Train, then Classify • Train on the master categories: • Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home • Games > Strategy: Shogun: Total War • Classify: EverQuest Addict, Warcraft III Clan • Drawback: this ignores the original Yahoo! categories

  5. Another Approach (1/2) • Use the Yahoo! categories directly • Advantage: the two taxonomies have similar categories • Potential problem: the structures differ, so categories do not match exactly

  6. Another Approach (2/2) • Example: Crayon Shin-chan • Yahoo!: Entertainment > Comics and Animation > Animation > Anime > Titles > Crayon Shin-chan • Google: Arts > Animation > Anime > Titles > C > Crayon Shin-chan

  7. This Paper’s Approach • Weak Learner (as opposed to Naïve Bayes) • Boosting to combine Weak Hypotheses • New Idea: Co-Bootstrapping to exploit source categories

  8. Assumptions • Multi-category data are reduced to binary data: • Totoro Fan: Cartoon > My Neighbor Totoro, Toys > My Neighbor Totoro • is converted into two examples: (Totoro Fan, Cartoon > My Neighbor Totoro) and (Totoro Fan, Toys > My Neighbor Totoro) • Hierarchies are ignored: • Console > Sega and Console > Sega > Dreamcast are treated as unrelated
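The reduction above is mechanical; a minimal Python sketch (hypothetical helper, not from the paper):

```python
# Hypothetical sketch of the multi-category-to-binary reduction: a site
# listed under several categories becomes one binary training example per
# (site, category) pair; the category hierarchy is not used.
def flatten_multilabel(docs):
    """docs: list of (doc, [categories]) -> list of (doc, category) pairs."""
    return [(doc, cat) for doc, cats in docs for cat in cats]

pairs = flatten_multilabel([
    ("Totoro Fan", ["Cartoon > My Neighbor Totoro", "Toys > My Neighbor Totoro"]),
])
# pairs now holds two separate binary examples for "Totoro Fan"
```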

  9. Weak Learner • Boosting • Co-Bootstrapping Weak Learner

  10. Weak Learner • A type of classifier similar to Naïve Bayes • + = accept, − = reject • A term may be a word, an n-gram, etc. • After training, the weak learner outputs a weak hypothesis (a term-based classifier)

  11. Weak Hypothesis Example • contains "Crayon Shin-chan" ⇒ in "Comics > Crayon Shin-chan", not in "Education > Early Childhood" • does not contain "Crayon Shin-chan" ⇒ not in "Comics > Crayon Shin-chan", in "Education > Early Childhood"

  12. Weak Learner Inputs (1/2) • Training data are of the form [x1, y1], [x2, y2], …, [xm, ym] • xi is a document • yi is a category • [xi, yi] means document xi is in category yi • D(x, y) is a distribution over all combinations of xi and yj • D(xi, yj) indicates the "importance" of the pair (xi, yj) • w is the term (found automatically)

  13. Weak Learner Algorithm • For the current term w and each possible category y, compute four values (weight sums under D): • W+^{+1}(y) = Σ D(xi, y) over documents xi that contain w and are in y • W+^{-1}(y) = Σ D(xi, y) over documents xi that contain w and are not in y • W−^{+1}(y) = Σ D(xi, y) over documents xi that do not contain w and are in y • W−^{-1}(y) = Σ D(xi, y) over documents xi that do not contain w and are not in y • Note: a pair (xi, y) with greater D(xi, y) has more influence.

  14. Weak Hypothesis h(x, y) • Given an unclassified document x and a category y • If x contains w: h(x, y) = ½ ln( W+^{+1}(y) / W+^{-1}(y) ) • Else: h(x, y) = ½ ln( W−^{+1}(y) / W−^{-1}(y) )

  15. Weak Learner Comments • If sign[h(x, y)] = +, then x is predicted to be in y • |h(x, y)| is the confidence • The term w is found as follows: • Run the weak learner for every candidate term w • Choose the run with the smallest value of Z = 2 Σy [ √(W+^{+1}(y) W+^{-1}(y)) + √(W−^{+1}(y) W−^{-1}(y)) ] as the model • Boosting: minimizes the probability of h(x, y) having the wrong sign
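Slides 13-15 together describe a confidence-rated decision stump. A runnable single-category sketch in Python, assuming the weight sums above (the smoothing constant EPS is an added assumption to avoid division by zero; function and variable names are hypothetical):

```python
import math

EPS = 1e-8  # smoothing to avoid log(0); an assumption, not from the slides

def train_weak_learner(examples, D, vocabulary):
    """Sketch of the term-based weak learner for one fixed category y.

    examples: list of (terms_set, label) with label +1/-1 for that category
    D: list of weights D(xi, y), same length as examples
    vocabulary: iterable of candidate terms w
    Returns (best_term, h_present, h_absent) minimizing Z.
    """
    best = None
    for w in vocabulary:
        # Four weight sums: term present/absent x label +1/-1
        W = {("+", +1): 0.0, ("+", -1): 0.0, ("-", +1): 0.0, ("-", -1): 0.0}
        for (terms, label), d in zip(examples, D):
            key = "+" if w in terms else "-"
            W[(key, label)] += d
        # Real-valued predictions; |h| acts as the confidence
        h_present = 0.5 * math.log((W[("+", +1)] + EPS) / (W[("+", -1)] + EPS))
        h_absent = 0.5 * math.log((W[("-", +1)] + EPS) / (W[("-", -1)] + EPS))
        # Z: the quantity boosting minimizes; smaller is better
        Z = 2 * (math.sqrt(W[("+", +1)] * W[("+", -1)])
                 + math.sqrt(W[("-", +1)] * W[("-", -1)]))
        if best is None or Z < best[0]:
            best = (Z, w, h_present, h_absent)
    return best[1], best[2], best[3]
```

On toy data where only one term separates the classes, the learner picks that term and its two predictions carry opposite signs, matching the example on slide 11.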

  16. Weak Learner • Boosting • Co-Bootstrapping Boosting (AdaBoost.MH)

  17. Boosting Idea • Train the weak learner on different Dt(x, y) distributions • After each run, adjust Dt(x, y) by putting more weight on the most often misclassified training data • Output the final hypothesis as a linear combination of weak hypotheses

  18. Boosting Algorithm
  Given: [x1, y1], [x2, y2], …, [xm, ym], where xi ∈ X and yi ∈ Y
  Initialize D1(x, y) = 1/(mk)
  for t = 1, …, T do
    Pass distribution Dt to the weak learner
    Get weak hypothesis ht(x, y)
    Choose αt ∈ ℝ
    Update Dt+1(x, y) = Dt(x, y) exp(−αt Y[y] ht(x, y)) / Zt
    (Y[y] = +1 if x is in y, −1 otherwise; Zt is a normalization factor)
  end for
  Output the final hypothesis H(x, y) = Σt αt ht(x, y)

  19. Boosting Algorithm Initialization Given: [x1, y1], [x2, y2], …, [xm, ym] Initialize D(x, y) = 1/(mk) • m = number of training documents • k = total number of categories • i.e. a uniform distribution

  20. Boosting Algorithm Loop for t = 1, …, T do • Run the weak learner using distribution D • Get weak hypothesis ht(x, y) • For each pair (x, y) in the training data: if ht(x, y) guesses incorrectly, increase D(x, y) end for • Return the final hypothesis (a weighted combination of the weak hypotheses)
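The loop of slides 18-20 can be sketched as follows (Python; the `train_weak` callable stands in for the weak learner, and αt = 1 is assumed, as is common for confidence-rated weak hypotheses; all names are hypothetical):

```python
import math

def adaboost_mh(examples, categories, train_weak, T):
    """Sketch of the AdaBoost.MH loop from the slides.

    examples: list of (x, labels) where labels[y] is +1/-1 for each y
    train_weak: callable(D) -> h, with h(x, y) a real-valued prediction
    Returns H(x, y) = sum_t h_t(x, y), taking alpha_t = 1.
    """
    m, k = len(examples), len(categories)
    D = {(i, y): 1.0 / (m * k) for i in range(m) for y in categories}
    hs = []
    for t in range(T):
        h = train_weak(D)
        hs.append(h)
        # Re-weight: misclassified (x, y) pairs gain relative weight
        for i, (x, labels) in enumerate(examples):
            for y in categories:
                D[(i, y)] *= math.exp(-labels[y] * h(x, y))
        Z = sum(D.values())
        for key in D:
            D[key] /= Z  # renormalize back to a distribution
    return lambda x, y: sum(h(x, y) for h in hs)
```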

  21. Weak Learner • Boosting • Co-Bootstrapping Co-Bootstrapping

  22. Co-Bootstrapping Idea • We want to use Yahoo! categories to increase classification accuracy

  23. Recall Example Problem • Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home • Games > Strategy: Shogun: Total War • Games > Online: EverQuest Addict, Warcraft III Clan • Games > Single-Player: Warcraft III Clan

  24. Co-Bootstrapping Algorithm (1/4) • 1. Run AdaBoost on Yahoo! sites → get classifier Y1 • 2. Run AdaBoost on Google sites → get classifier G1 • 3. Run Y1 on Google sites → get predicted Yahoo! categories for Google sites • 4. Run G1 on Yahoo! sites → get predicted Google categories for Yahoo! sites

  25. Co-Bootstrapping Algorithm (2/4) • 5. Run AdaBoost on Yahoo! sites, including the predicted Google categories as features → get classifier Y2 • 6. Run AdaBoost on Google sites, including the predicted Yahoo! categories as features → get classifier G2 • 7. Run Y2 on the original Google sites → get more accurate Yahoo! categories for Google sites • 8. Run G2 on the original Yahoo! sites → get more accurate Google categories for Yahoo! sites

  26. Co-Bootstrapping Algorithm (3/4) • 9. Run AdaBoost on Yahoo! sites, including the predicted Google categories as features → get classifier Y3 • 10. Run AdaBoost on Google sites, including the predicted Yahoo! categories as features → get classifier G3 • 11. Run Y3 on the original Google sites → get even more accurate Yahoo! categories for Google sites • 12. Run G3 on the original Yahoo! sites → get even more accurate Google categories for Yahoo! sites

  27. Co-Bootstrapping Algorithm (4/4) • Repeat, repeat, and repeat… • Hopefully, the classification will become more accurate after each iteration…
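The alternating scheme of slides 24-27 reduces to a short loop. A hypothetical control-flow sketch (the `run_adaboost` callable and the stub classifier are assumptions, not the paper's code):

```python
def co_bootstrap(google_docs, yahoo_docs, run_adaboost, iterations=3):
    """Each round trains one boosted classifier per taxonomy, augmenting
    features with the other taxonomy's current predicted categories
    (empty on the first pass), then cross-predicts for the next round."""
    yahoo_preds, google_preds = {}, {}  # predicted cross-taxonomy labels
    for _ in range(iterations):
        G = run_adaboost(google_docs, extra_features=yahoo_preds)
        Y = run_adaboost(yahoo_docs, extra_features=google_preds)
        yahoo_preds = {d: Y.predict(d) for d in google_docs}
        google_preds = {d: G.predict(d) for d in yahoo_docs}
    return G, Y

# Tiny stub to exercise the control flow (a real run would train
# AdaBoost.MH here):
class _ConstantClassifier:
    def predict(self, doc):
        return "some-category"

def _stub_adaboost(docs, extra_features):
    return _ConstantClassifier()

G, Y = co_bootstrap(["g1"], ["y1"], _stub_adaboost, iterations=2)
```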

  28. Enhanced Naïve Bayes (Benchmark)

  29. Enhanced Naïve Bayes (1/2) • Given: a document x and its source category S • Predict the master category C • In NB: Pr[C | x] ∝ Pr[C] ∏w∈x (Pr[w | C])^n(x,w) • w: a word • n(x, w): number of occurrences of w in x • In ENB: Pr[C | x, S] ∝ Pr[C | S] ∏w∈x (Pr[w | C])^n(x,w)

  30. Enhanced Naïve Bayes (2/2) • Pr[C] is estimated as the fraction of training documents in C • Pr[C | S] is estimated (up to normalization) from |C ∩ S| raised to a tuning exponent ω • |C ∩ S|: number of docs in S that are classified into C by the NB classifier
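A minimal sketch of scoring a category with ENB, assuming the probability tables Pr[w | C] and Pr[C | S] have already been estimated (the function name, table layout, and the 1e-9 floor are hypothetical):

```python
import math

def enb_log_score(doc_terms, C, S, pr_w_given_c, pr_c_given_s):
    """Log of the ENB score from slides 29-30, up to a constant:
    log Pr[C | S] + sum_w n(x, w) * log Pr[w | C].

    doc_terms: {word: n(x, w)} counts for document x
    pr_w_given_c: {(word, category): probability}
    pr_c_given_s: {(category, source_category): probability}
    """
    score = math.log(pr_c_given_s[(C, S)])
    for w, n in doc_terms.items():  # n = n(x, w)
        # Floor unseen words at a tiny probability to avoid log(0)
        score += n * math.log(pr_w_given_c.get((w, C), 1e-9))
    return score
```

Working in log space avoids underflow from the product of many small word probabilities.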

  31. Experiment

  32. Datasets

  33. Number of Categories*/Dataset (1/2) *Top level categories only

  34. Number of Categories*/Dataset (2/2) • Example (Book): Horror, Science Fiction, Non-fiction • Biography and History are merged into Non-fiction

  35. Number of Websites

  36. Method (1/2) • Classify Yahoo! Book websites into Google Book categories (G ← Y) • Find G ∩ Y for Book (sites listed in both directories) • Hide the Google categories of the sites in G ∩ Y • Test set: G ∩ Y ⊆ Yahoo! Book • Training set: randomly take |G ∩ Y| sites from G − Y ⊆ Google Book

  37. Method (2/2) • For each dataset, do G ← Y five times and Y ← G five times • macro F-score: calculate the F-score for each category, then average over all categories • micro F-score: calculate the F-score on the entire dataset • recall = 100%? • Doesn't say anything about multi-category ENB
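The macro/micro distinction can be made concrete with a small helper (hypothetical, using per-category true-positive/false-positive/false-negative counts):

```python
def f1(tp, fp, fn):
    """F-score from raw counts; 0.0 when precision/recall are undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_micro_f(counts):
    """counts: {category: (tp, fp, fn)}. Returns (macro_F, micro_F).

    macro: average the per-category F-scores (every category weighs equally)
    micro: pool all counts, then compute one F-score (large categories dominate)
    """
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return macro, f1(tp, fp, fn)
```

A rare category with poor recall drags the macro score down much more than the micro score, which is why the paper reports both.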

  38. Results (1/3) • Co-Bootstrapping-AdaBoost > AdaBoost [tables: macro-averaged and micro-averaged F-scores]

  39. Results (2/3) • Co-Bootstrapping-AdaBoost iteratively improves AdaBoost [figure: F-scores per iteration on the Book dataset]

  40. Results (3/3) • Co-Bootstrapping-AdaBoost > Enhanced Naïve Bayes [tables: macro-averaged and micro-averaged F-scores]

  41. Contribution • Co-Bootstrapping improves Boosting performance • Does not require the tuning parameter ω as in ENB
