1. Improving Web Page Classification by Label-propagation over Click Graphs
Soo-Min Kim, Patrick Pantel, Lei Duan and Scott Gaffney (Yahoo! Labs), CIKM 2009
Presenter: Lung-Hao Lee (李龍豪), January 7, 2010 @ Room 309

2. Outline
• Introduction
• Calculating Page Similarity
• Finding Similar Pages
• Click Data Model (CDM)
• Query Constraint (QC) algorithm
• Experimental Results
• Discussion
• Conclusion

3. Introduction
• Manually annotating training data carries a large labor cost
• Aggregated click data across many users over time provides valuable information
• Leverage click logs to augment training data by propagating class labels to unlabeled similar documents

4. Hypothesis
• "Two pages that tend to be clicked by the same user queries tend to be topically similar"
• Example (from the slide's diagram): pages A and B are both clicked for queries such as "how to tie a neck tie knots", "how to tie a tie", and "tying a tie". A is labeled "Positive" (class "How-to") and B is unlabeled, so the label propagates and B is also labeled "Positive".

5. Calculating Page Similarity (1/3)
• A page is represented as a node in the similarity graph
• Normalize all URLs, e.g. the following four URLs are treated as the same:
• "http://www.acm.org"
• "www.acm.org"
• "www.acm.org/"
• "http://www.acm.org/"
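A minimal sketch of the normalization step above. The paper does not spell out its exact rules; defaulting a missing scheme to "http", lowercasing the host, and dropping the trailing slash are assumptions that reproduce the slide's four-way example.

```python
from urllib.parse import urlparse

def normalize_url(url: str) -> str:
    """Map equivalent URL spellings to one canonical form."""
    if "://" not in url:
        url = "http://" + url          # assumption: a missing scheme is http
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path.rstrip("/")     # "/" and "" become the same page
    return f"http://{host}{path}"

# All four spellings from the slide collapse to the same node:
for u in ["http://www.acm.org", "www.acm.org",
          "www.acm.org/", "http://www.acm.org/"]:
    assert normalize_url(u) == "http://www.acm.org"
```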

6. Calculating Page Similarity (2/3)
• Each URL is represented as a vector of queries that users issued and clicked through to the page (following Pantel & Lin, 2002)
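A hedged sketch of that page representation: each URL becomes a vector whose features are the queries that led to clicks on it. The weighting used here (raw click counts) is an assumption; the transcript does not give the exact scheme.

```python
from collections import Counter, defaultdict

def build_click_vectors(click_log):
    """click_log: iterable of (query, clicked_url) pairs from a search log."""
    vectors = defaultdict(Counter)
    for query, url in click_log:
        vectors[url][query] += 1       # feature = query, value = click count
    return vectors

log = [("how to tie a tie", "pageA"), ("tying a tie", "pageA"),
       ("how to tie a tie", "pageB")]
vectors = build_click_vectors(log)
# vectors["pageA"] == Counter({"how to tie a tie": 1, "tying a tie": 1})
```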

7. Calculating Page Similarity (3/3)
• Compute the similarity between two pages as the cosine similarity of their respective feature vectors
• sim(p1,p2) > sim(p1,p3) and sim(p1,p2) > sim(p2,p3), because p1 and p2 share more common queries than either shares with p3
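The cosine computation on this slide, applied to the query-click vectors built above:

```python
import math
from collections import Counter

def cosine(v1: Counter, v2: Counter) -> float:
    dot = sum(v1[q] * v2[q] for q in v1.keys() & v2.keys())
    norm = math.sqrt(sum(x * x for x in v1.values())) * \
           math.sqrt(sum(x * x for x in v2.values()))
    return dot / norm if norm else 0.0

p1 = Counter({"how to tie a tie": 3, "tying a tie": 2})
p2 = Counter({"how to tie a tie": 1, "tying a tie": 4})
p3 = Counter({"neck tie history": 5})
assert cosine(p1, p2) > cosine(p1, p3)   # p1 and p2 share queries; p3 shares none
```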

8. Finding Similar Pages Given Seed Sets
• What is a "seed set"? A set of labeled data
• Two algorithms for seed set expansion:
• Click Data Model (CDM)
• Query Constraints (QC) algorithm

9. Click Data Model
• Two phases: an updating-score phase and a filtering phase
• Input: S1 (positive seed set), S2 (negative seed set), G (click graph)
• Output: E1 (expanded positive set), E2 (expanded negative set)
• Thresholds: 0.1 < T1 < 0.6 and 0.6 < T2 < 1.2
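A minimal sketch of the two CDM phases. The transcript does not give the actual update rule, so the scoring below (summing edge similarity to already-labeled neighbors) is an assumption; likewise, the slide lists two thresholds in the ranges above, while this sketch collapses them into a single cutoff for brevity. Only the inputs, outputs, and two-phase structure follow the slide.

```python
def expand_seed_sets(S1, S2, G, T1=0.5):
    """S1/S2: positive/negative seed URLs. G maps each URL to a dict of
    {neighbor_url: cosine_similarity} edges in the click graph."""
    E1, E2 = set(), set()
    for page, neighbors in G.items():
        if page in S1 or page in S2:
            continue
        # updating-score phase: accumulate similarity to labeled neighbors
        pos = sum(w for n, w in neighbors.items() if n in S1)
        neg = sum(w for n, w in neighbors.items() if n in S2)
        # filtering phase: keep only pages whose score clears the threshold
        if pos > T1 and pos > neg:
            E1.add(page)
        elif neg > T1 and neg > pos:
            E2.add(page)
    return E1, E2
```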

10. Query Constraints
• An additional module that checks whether the common queries between two nodes contain certain term patterns
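A hedged sketch of that QC check: before propagating a label between two pages, verify that the queries they share contain the class-specific term pattern (e.g. "review" for the Review classifier, "how to" for the How-to classifier, per the later slides). The exact pattern syntax is an assumption.

```python
import re

def passes_query_constraint(queries_a, queries_b, pattern=r"\bhow to\b"):
    """True if any query common to both pages matches the term pattern."""
    common = set(queries_a) & set(queries_b)
    return any(re.search(pattern, q, re.IGNORECASE) for q in common)

a = {"how to tie a tie", "tying a tie"}
b = {"how to tie a tie"}
assert passes_query_constraint(a, b)                       # shares a "how to" query
assert not passes_query_constraint(a, b, r"\breviews?\b")  # no "review" query shared
```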

11. Active Learning
• Reduce the amount of human annotation effort by leveraging the click data
• Build an expansion model with labeled training data and use it to select the next round of training data
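A minimal sketch of one round of this loop: expand the current seed over the click graph, send the expansion to annotators, and fold their labels into the next seed. The annotate callable is a hypothetical stand-in for the manual labeling step; expand_seed_sets is the CDM sketch above.

```python
def active_learning_round(seed_pos, seed_neg, G, annotate):
    """annotate: callable mapping a set of URLs to {url: 'positive'|'negative'};
    hypothetical stand-in for human annotation."""
    E1, E2 = expand_seed_sets(seed_pos, seed_neg, G)   # CDM expansion
    labels = annotate(E1 | E2)                         # human annotation step
    seed_pos |= {u for u, y in labels.items() if y == "positive"}
    seed_neg |= {u for u, y in labels.items() if y == "negative"}
    return seed_pos, seed_neg   # feed into the next training round
```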

12. Experimental Setup (1/3)
• Click data: collected during December 2008 from the Yahoo! search engine
• Only the top 10 URLs per query are considered
• URLs with fewer than 10 clicks are excluded
• Three classification tasks: How-to, Adult, Review
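A sketch of the two filtering rules above: keep only clicks on top-10 results and drop URLs with fewer than 10 clicks overall. The log record format (query, url, rank) is an assumption.

```python
from collections import Counter

def filter_click_log(log, max_rank=10, min_clicks=10):
    """log: iterable of (query, url, rank) click records."""
    top = [(q, u) for q, u, rank in log if rank <= max_rank]
    counts = Counter(u for _, u in top)           # clicks per URL after rank cut
    return [(q, u) for q, u in top if counts[u] >= min_clicks]
```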

13. Experimental Setup (2/3)
• Training sets: 10,000 manually labeled positive and negative examples
• For the "Review" classifier, queries such as "digital camera reviews" or "baby swing reviews"
• For the "How-to" classifier, queries such as "how to clean uggs" or "best way to loose weight"
• Testing sets

14. Experimental Setup (3/3)
• Classifier: Gradient Boosting Decision Tree (GBDT)
• Features: textual, link, URL, HTML, and other features
• Metrics: Area Under the ROC Curve (AUC) (Fawcett, 2003), F-score, and accuracy
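The three metrics from this slide can be computed as follows; the paper does not say which toolkit was used, so scikit-learn here is purely illustrative, with toy labels and scores.

```python
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score

y_true  = [1, 0, 1, 1, 0]              # toy gold labels
y_score = [0.9, 0.2, 0.7, 0.4, 0.3]    # classifier scores, e.g. from GBDT
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]

print("AUC      :", roc_auc_score(y_true, y_score))
print("F-score  :", f1_score(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
```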

15. Label Propagation Analysis: Adult
• The biggest improvement from CDM is observed with a model using 5,000 labeled examples as the seed set (+1.07% in F-score, +0.81% in accuracy, and +0.25% in AUC)

16. Label Propagation Analysis: Review
• Reduces the manual labeling effort by 50%
• QC (excluding pages whose common queries do not contain "review") is useful when the labeled data set is small

17. Label Propagation Analysis: How-to
• With 1,000 and 2,000 human-labeled examples, CDM performs worse than the baseline
• QC (excluding pages whose common queries do not contain "how to")

18. A Comparison Between Baseline and CDM
• Baseline: Type A
• CDM: Type C

19. Evaluating the Active Learning Approach
• From the "How-to" classifier
• Pipeline (reconstructed from the slide's diagram): Seed 1 → Expand 1 → Seed 2 (human labels from Expand 1) → Expand 2

20. Intrinsic Analysis of CDM-Expanded Data
• A random sample of 50 positive and 50 negative examples from the "How-to" classifier
• The positive class has 82.3% precision, whereas the negative class has 83.6% precision

21. Discussion
• Is the proposed method always useful for web page classification?
• How can we improve the quality of automatically labeled data derived from unlabeled data?

22. Conclusion
• Presents a method for improving web page classification by leveraging click data to augment training data
• Augments manually labeled data by modeling the similarity between pages in a click graph

23. The End
• Thank you very much
• Questions & Answers
