Joint Optimization of Wrapper Generation and Template Detection

Presentation Transcript

  1. Joint Optimization of Wrapper Generation and Template Detection • Shuyi Zheng, Di Wu, Ruihua Song, Ji-Rong Wen • Microsoft Research Asia • SIGKDD-2007, San Jose, California, USA

  2. Outline • Introduction • Our approach • Experiments • Demo • Conclusion

  3. Motivations (diagram labels: Page Generation Script (e.g., ASP, PHP, JSP); Encoding; Database; Decoding; Wrapper)

  4. Related Work • Several automatic or semi-automatic wrapper learning methods have been proposed • e.g., WIEN[12], SoftMealy[11], Stalker[17], RoadRunner[6], EXALG[2], TTAG[4], the work in [18], ViNTs[21], etc. • Page clustering for wrapper induction is considered a trivial task • Manual: most previous work • Automatic but isolated from wrapper generation: RoadRunner[6,7] and [18]

  5. Problems (cont.) • Dynamic URLs • With the popularity of dynamic URLs, detecting templates by URL is no longer as effective as it used to be

  6. (a): …/gp/product/B000BNLGJA/ (b): …/gp/product/B00007J8SC/ (c): …/gp/product/B0000DD95R/ (d): …/gp/product/B0000A1AT9/ (a): www.amazon.com/gp/product/B000BNLGJA/ (b): www.amazon.com/gp/product/B00007J8SC/ (c): www.amazon.com/gp/product/B0000DD95R/ (d): www.amazon.com/gp/product/B0000A1AT9/

  7. Problems • Dynamic URLs • With the popularity of dynamic URLs, detecting templates by URL is no longer as effective as it used to be • Complex Templates • Even if URLs can group pages that share a template, generating only one wrapper for a complex template is sometimes far from optimal

  8. (c): www.amazon.com/gp/product/B0000DD95R/ (d): www.amazon.com/gp/product/B0000A1AT9/

  9. Our Proposed Approach • Main ideas • Similarity-based templates, instead of ground-truth templates • Advantages • More stable • Optimizes the number of wrappers

  10. Outline • Introduction • Our approach • Experiments • Demo • Conclusion

  11. Problem Definition

  12. System Overview

  13. Wrapper Generation [6, 4, 18]

  14. Wrapper-DOM Distance • Distance between a wrapper and a DOM tree • Tree alignment • Cost calculation
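
The slide names tree alignment and cost calculation but gives no formula, so the following is only a minimal sketch of one common way to score an alignment between two ordered trees. It is not the authors' distance. The tree encoding `(tag, children)`, the unit cost per unaligned node, and the names `size`, `align_cost`, and `distance` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical tree encoding: (tag, list of child trees).
Tree = Tuple[str, list]


def size(node: Tree) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    _tag, children = node
    return 1 + sum(size(child) for child in children)


def align_cost(a: Tree, b: Tree) -> int:
    """Top-down alignment cost: each node left unaligned costs 1 (assumed)."""
    if a[0] != b[0]:
        # Roots with different tags cannot be aligned; both subtrees are unmatched.
        return size(a) + size(b)
    xs: List[Tree] = a[1]
    ys: List[Tree] = b[1]
    # dp[i][j] = cheapest alignment of the first i children of a
    # with the first j children of b (sequence edit distance over subtrees).
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        dp[i][0] = dp[i - 1][0] + size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        dp[0][j] = dp[0][j - 1] + size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(xs[i - 1]),      # skip a child of a
                dp[i][j - 1] + size(ys[j - 1]),      # skip a child of b
                dp[i - 1][j - 1] + align_cost(xs[i - 1], ys[j - 1]),  # align the pair
            )
    return dp[-1][-1]


def distance(a: Tree, b: Tree) -> float:
    """Alignment cost normalized to [0, 1] by the total node count (assumed)."""
    return align_cost(a, b) / (size(a) + size(b))


wrapper_tree = ("html", [("body", [("div", []), ("span", [])])])
page_tree = ("html", [("body", [("div", [])])])
cost = align_cost(wrapper_tree, page_tree)
```

In this toy input the only unaligned node is the extra `span`, so the two trees are close; a real wrapper would also carry wildcards and optional nodes, which this sketch leaves out.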

  15. Wrapper-Oriented Page Clustering (WPC) (a) Level-1 Wrapper (b) Level-2 Wrapper (c) Level-3 Wrapper (d) Level-4 Wrapper
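
The transcript does not spell out the WPC procedure, so the sketch below only illustrates the general idea of clustering pages around wrappers with a distance threshold. It is a plain greedy threshold clustering, not the authors' algorithm. Pages are modeled as sets of tag paths, a wrapper as the features shared by every page in its cluster, and the distance as Jaccard distance; all of these, along with the names `jaccard_distance` and `cluster_pages`, are illustrative assumptions.

```python
from typing import List, Set, Tuple


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |a ∩ b| / |a ∪ b|; stands in for a wrapper-to-page distance."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_pages(
    pages: List[Set[str]], threshold: float
) -> Tuple[List[Set[str]], List[List[int]]]:
    """Assign each page to its nearest wrapper, or start a new wrapper."""
    wrappers: List[Set[str]] = []   # one toy wrapper per cluster
    clusters: List[List[int]] = []  # page indices per cluster
    for idx, page in enumerate(pages):
        best_k, best_d = None, None
        for k, wrapper in enumerate(wrappers):
            d = jaccard_distance(wrapper, page)
            if best_d is None or d < best_d:
                best_k, best_d = k, d
        if best_k is not None and best_d <= threshold:
            # Refine the wrapper: keep only what all member pages share.
            wrappers[best_k] = wrappers[best_k] & page
            clusters[best_k].append(idx)
        else:
            wrappers.append(set(page))
            clusters.append([idx])
    return wrappers, clusters


pages = [
    {"html/body/h1", "html/body/img", "html/body/span.price"},
    {"html/body/h1", "html/body/img", "html/body/span.price", "html/body/div.ad"},
    {"html/body/table", "html/body/ul"},
]
wrappers, clusters = cluster_pages(pages, threshold=0.3)
```

A lower threshold yields more, tighter wrappers and a higher one yields fewer, more general wrappers, which is the trade-off that a threshold study would explore.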

  16. Outline • Introduction • Our approach • Experiments • Demo • Conclusion

  17. Experiments • Data • 1700 product pages from Amazon.com (Amazon) • Mixed 1000 pages from 10 shopping sites (M10) • Target product records: (name, image, price) • Settings • 2-fold cross-validation • Evaluation measures: Precision, Recall and F1
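
For reference, the three evaluation measures can be computed as in the sketch below. It assumes that an extracted (name, image, price) record counts as correct only if it matches a labeled record exactly; the slides do not state the matching rule, and the sample records are made up for illustration.

```python
from typing import Set, Tuple

Record = Tuple[str, str, str]  # (name, image, price)


def precision_recall_f1(
    extracted: Set[Record], gold: Set[Record]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 under exact record matching (assumed)."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up records: C has the wrong price, D is missed entirely.
gold = {
    ("A", "a.jpg", "$1"),
    ("B", "b.jpg", "$2"),
    ("C", "c.jpg", "$3"),
    ("D", "d.jpg", "$4"),
}
extracted = {
    ("A", "a.jpg", "$1"),
    ("B", "b.jpg", "$2"),
    ("C", "c.jpg", "$9"),
}
p, r, f1 = precision_recall_f1(extracted, gold)
```

Here two of three extracted records are correct (precision 2/3) and two of four labeled records are found (recall 1/2), giving an F1 of 4/7.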

  18. Effectiveness Test • Amazon: 44 wrappers, F1: 94.88% vs. 78% • M10:

  19. WPC with Different Thresholds

  20. Stability Test • Objective • Evaluate how the choice of the initial training page affects the performance of WPC

  21. Outline • Introduction • Our approach • Experiments • Demo • Conclusion

  22. Demo! • Microsoft Office Excel 2007 Web Data Add-In is coming soon! • Please give it a try in two weeks! • http://blogs.msdn.com/xaw

  23. Outline • Introduction • Our approach • Experiments • Demo • Conclusion

  24. Conclusion • Our system • Takes a miscellaneous training set as input • Conducts template detection and wrapper generation in a single step • Can achieve a joint optimization under the criterion of extraction accuracy • In the near future • We will extend the approach to handle templates that contain content strings

  25. Thanks! Contacts: Ruihua Song (rsong@microsoft.com), Shuyi Zheng (shzheng@cse.psu.edu)

  26. Poster No. 11 • Looking forward to talking with you at Poster Reception II this evening!

  27. Backup Slides

  28. Labeling Cost • Shows how many training pages are required to learn wrappers that achieve an F1 higher than 95%