
Joint Optimization of Wrapper Generation and Template Detection

Shuyi Zheng, Di Wu, Ruihua Song, Ji-Rong Wen. Microsoft Research Asia. SIGKDD-2007, San Jose, California, USA.


Presentation Transcript


  1. Joint Optimization of Wrapper Generation and Template Detection Shuyi Zheng, Di Wu, Ruihua Song, Ji-Rong Wen Microsoft Research Asia SIGKDD-2007, San Jose, California, USA

  2. Outline • Introduction • Our approach • Experiments • Demo • Conclusion

  3. Motivations (diagram labels: Page Generation Script (e.g., ASP, PHP, JSP); Encoding; Database; Decoding; Wrapper)

  4. Related Work • Several automatic or semi-automatic wrapper learning methods have been proposed • e.g., WIEN[12], SoftMealy[11], Stalker[17], RoadRunner[6], EXALG[2], TTAG[4], the work in [18], ViNTs[21], etc. • Page clustering for wrapper induction is considered a trivial task • Manual: most previous work • Automatic but isolated from wrapper generation: RoadRunner[6,7] and [18]

  5. Problems (cont.) • Dynamic URLs • With the popularity of dynamic URLs, detecting templates by URL is no longer as effective as it once was

  6. (a): www.amazon.com/gp/product/B000BNLGJA/ (b): www.amazon.com/gp/product/B00007J8SC/ (c): www.amazon.com/gp/product/B0000DD95R/ (d): www.amazon.com/gp/product/B0000A1AT9/
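As a rough illustration of the URL-based grouping that these slides argue is no longer sufficient, the following Python sketch collapses ID-like path segments into a wildcard and groups the four example URLs by the resulting pattern. The helper names `url_pattern` and `group_by_url`, and the rule that any path segment containing a digit counts as an ID, are assumptions made for this example and do not come from the talk.

```python
import re
from collections import defaultdict


def url_pattern(url):
    """Replace path segments that contain a digit (assumed to be IDs) with '*'."""
    parts = url.strip("/").split("/")
    return "/".join("*" if re.search(r"\d", part) else part for part in parts)


def group_by_url(urls):
    """Group URLs whose wildcard patterns are identical."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_pattern(url)].append(url)
    return dict(groups)


urls = [
    "www.amazon.com/gp/product/B000BNLGJA/",
    "www.amazon.com/gp/product/B00007J8SC/",
    "www.amazon.com/gp/product/B0000DD95R/",
    "www.amazon.com/gp/product/B0000A1AT9/",
]
groups = group_by_url(urls)
for pattern, members in groups.items():
    print(pattern, len(members))
```

All four URLs collapse to the same pattern, so this kind of grouping says nothing about whether the pages behind them share one layout.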

  7. Problems • Dynamic URLs • With the popularity of dynamic URLs, detecting templates by URL is no longer as effective as it once was • Complex Templates • Even if URLs can group pages that share a template, generating only one wrapper for a complex template is sometimes far from optimal

  8. (c): www.amazon.com/gp/product/B0000DD95R/ (d): www.amazon.com/gp/product/B0000A1AT9/

  9. Our Proposed Approach • Main ideas • Similarity-based templates, instead of ground-truth templates • Advantages • More stable • Optimizes the number of wrappers

  10. Outline • Introduction • Our approach • Experiments • Demo • Conclusion

  11. Problem Definition

  12. System Overview

  13. Wrapper Generation [6, 4, 18]

  14. Wrapper-DOM Distance • Distance between a wrapper and a DOM tree • Tree alignment • Cost calculation
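One way to picture a wrapper-DOM distance built from tree alignment and cost calculation is sketched below. It treats both the wrapper and the page as ordered tag trees, aligns children top-down with dynamic programming, and normalizes the alignment cost by the total node count. The `Node` class, the unit insert/delete cost per node, and the normalization are all assumptions for illustration; the slide does not spell out the cost function actually used.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal tree node: a tag name plus ordered children."""
    tag: str
    children: list = field(default_factory=list)


def size(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def align_cost(a, b):
    """Top-down alignment cost: unmatched subtrees cost one unit per node."""
    if a.tag != b.tag:
        # Roots cannot be matched: delete all of `a`, insert all of `b`.
        return size(a) + size(b)
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a.children[i - 1]),   # drop a's child
                dp[i][j - 1] + size(b.children[j - 1]),   # add b's child
                dp[i - 1][j - 1]
                + align_cost(a.children[i - 1], b.children[j - 1]),
            )
    return dp[m][n]


def distance(a, b):
    """Alignment cost normalized to the range [0, 1]."""
    return align_cost(a, b) / (size(a) + size(b))


wrapper = Node("div", [Node("h1"), Node("img"), Node("span")])
page = Node("div", [Node("h1"), Node("span")])
print(align_cost(wrapper, page), distance(wrapper, page))
```

In this toy case the only mismatch is the missing `img` node, so the alignment cost is 1 over 7 nodes in total.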

  15. Wrapper-Oriented Page Clustering (WPC) (a) Level-1 Wrapper (b) Level-2 Wrapper (c) Level-3 Wrapper (d) Level-4 Wrapper
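The slide gives only the panel titles, so the following is a loose sketch of what a threshold-driven, incremental page clustering could look like, not the WPC algorithm itself. Each page joins the closest existing cluster when the distance is within a threshold and otherwise starts a new cluster. The function `cluster_pages`, the use of the first page as each cluster's representative (a real system would presumably refine a wrapper as pages join), and the Jaccard distance over tag sets are assumptions chosen to keep the example self-contained.

```python
def cluster_pages(pages, distance, threshold):
    """Greedy incremental clustering; the first page of a cluster represents it."""
    clusters = []
    for page in pages:
        best, best_d = None, None
        for cluster in clusters:
            d = distance(cluster[0], page)
            if best_d is None or d < best_d:
                best, best_d = cluster, d
        if best is not None and best_d <= threshold:
            best.append(page)          # close enough: reuse that cluster
        else:
            clusters.append([page])    # too far from everything: new cluster
    return clusters


def jaccard_distance(a, b):
    """Stand-in distance: 1 minus the Jaccard similarity of two tag sets."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)


pages = [
    ("div", "h1", "img", "span"),
    ("div", "h1", "span"),
    ("table", "tr", "td"),
]
loose = cluster_pages(pages, jaccard_distance, threshold=0.3)
strict = cluster_pages(pages, jaccard_distance, threshold=0.1)
print(len(loose), len(strict))
```

A looser threshold merges the two similar pages into one cluster, while a stricter one keeps all three apart, which is the kind of trade-off a threshold setting controls.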

  16. Outline • Introduction • Our approach • Experiments • Demo • Conclusion

  17. Experiments • Data • 1700 product pages from Amazon.com (Amazon) • 1000 mixed pages from 10 shopping sites (M10) • Target product records: (name, image, price) • Settings • 2-fold cross-validation • Evaluation measures: Precision, Recall and F1
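For reference, the three evaluation measures can be computed as in the Python sketch below, which compares extracted (name, image, price) records against labeled ones. Requiring an exact match on the whole record, the function name `prf1`, and the sample records are assumptions for this example; the slide does not say how a correct extraction was judged.

```python
def prf1(extracted, gold):
    """Return (precision, recall, F1) for two collections of records."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical records in the (name, image, price) form used on the slide.
gold = [
    ("Camera A", "a.jpg", "199.00"),
    ("Camera B", "b.jpg", "249.00"),
    ("Camera C", "c.jpg", "299.00"),
    ("Camera D", "d.jpg", "349.00"),
]
extracted = [
    ("Camera A", "a.jpg", "199.00"),
    ("Camera B", "b.jpg", "249.00"),
    ("Camera C", "c.jpg", "0.00"),   # wrong price, so counted as incorrect
]
precision, recall, f1 = prf1(extracted, gold)
print(precision, recall, f1)
```

Here two of three extracted records are correct out of four labeled ones, giving precision 2/3, recall 1/2 and F1 4/7.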

  18. Effectiveness Test • Amazon: 44 wrappers, F1: 94.88% vs. 78% • M10:

  19. WPC with Different Thresholds

  20. Stability Test • Objective • Evaluate how the choice of initial training page impacts the performance of WPC

  21. Outline • Introduction • Our approach • Experiments • Demo • Conclusion

  22. Demo! Microsoft Office Excel 2007 Web Data Add-In is coming soon! Please give it a try in two weeks! http://blogs.msdn.com/xaw

  23. Outline • Introduction • Our approach • Experiments • Demo • Conclusion

  24. Conclusion • Our system • Takes a miscellaneous training set as input • Conducts template detection and wrapper generation in a single step • Can achieve a joint optimization under the criterion of extraction accuracy • In the near future • We will extend the approach to handle templates containing content strings

  25. Thanks! Contacts: Ruihua Song (rsong@microsoft.com) Shuyi Zheng (shzheng@cse.psu.edu)

  26. Poster No. 11 • Looking forward to talking with you at Poster Reception II this evening!

  27. Backup Slides

  28. Labeling Cost • Goal: show how many training pages are required to learn wrappers that achieve an F1 higher than 95%
