
Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums

Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums. Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma Web Search & Mining Group Microsoft Research Asia 2009-04. Web Forum Data. An important information resource with a lot of human knowledge.


Presentation Transcript


  1. Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums • Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma • Web Search & Mining Group, Microsoft Research Asia • 2009-04

  2. Web Forum Data • An important information resource containing a great deal of human knowledge. • Topics include recreation, sports, games, computers, art, society, science, home, and health. • 20% of the pages in search results come from forums.

  3. Understanding Forum • Stages: Crawling, Data Extraction, Quality Assessment • WWW’08: iRobot: An Intelligent Crawler for Web Forums • SIGIR’08: Exploring Traversal Strategy • KDD’09: Incremental Crawling • WWW’09: Automation Data Extraction • SIGIR’09: Quality Assessment

  4. Challenge • Leverage more site-level knowledge

  5. Forum Sitemap • A sitemap is a directed graph consisting of a set of vertices and the links between them. • Rui Cai, Jiangming Yang, Wei Lai, Yida Wang and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of the WWW 2008 Conference.
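As a rough illustration of the sitemap definition on this slide, the sketch below models vertices as groups of similar pages and edges as the links between them. The class name `Sitemap`, the vertex labels, and the URL patterns are hypothetical and are not taken from the iRobot paper.

```python
from collections import defaultdict


class Sitemap:
    """Directed graph: vertices are page groups, edges are links between them."""

    def __init__(self):
        self.vertices = set()
        # source vertex -> {target vertex -> set of link labels}
        self.edges = defaultdict(lambda: defaultdict(set))

    def add_link(self, source, target, label):
        """Record a directed link from one vertex to another."""
        self.vertices.update((source, target))
        self.edges[source][target].add(label)

    def successors(self, vertex):
        """Return the vertices reachable from `vertex` in one step."""
        return sorted(self.edges[vertex])


# Hypothetical forum layout: board list -> thread list -> post pages.
sitemap = Sitemap()
sitemap.add_link("board_list", "thread_list", "forumdisplay.php?f=<id>")
sitemap.add_link("thread_list", "post_page", "showthread.php?t=<id>")
sitemap.add_link("post_page", "post_page", "showthread.php?t=<id>&page=<n>")
```

The self-loop on `post_page` stands for pagination inside a thread, one reason a directed graph (rather than a tree) is a natural fit here.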

  6. Page Clustering • Forum pages are generated from a database and templates. • Layout is a robust way to describe a template. • Layout can be characterized by the HTML elements in different DOM paths.
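One way to picture "HTML elements in different DOM paths" as a layout feature is to count how often each root-to-element tag path occurs in a page. The sketch below does this with Python's standard `html.parser`; the cosine similarity used to compare two pages is an assumption for illustration, not necessarily the measure used by the authors, and the sample pages are made up.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag and must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DomPathCounter(HTMLParser):
    """Counts occurrences of each tag path, e.g. 'html/body/table/tr'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def dom_path_features(html):
    parser = DomPathCounter()
    parser.feed(html)
    parser.close()
    return parser.paths


def layout_similarity(a, b):
    """Cosine similarity between two DOM-path count vectors."""
    dot = sum(a[p] * b[p] for p in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


row = "<tr><td><a href='#'>thread</a></td></tr>"
list_a = dom_path_features(f"<html><body><table>{row * 2}</table></body></html>")
list_b = dom_path_features(f"<html><body><table>{row * 3}</table></body></html>")
post = dom_path_features("<html><body><div><p>hi</p></div><div><p>yo</p></div></body></html>")
```

Two list pages with different numbers of rows still score much closer to each other than to the post page, which is the property a layout-based clustering relies on.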

  7. Page Clustering • DOM Path • Feature Discovery • Clustering by Virtual Tables

  8. Link Analysis • A Link = URL Pattern + Location
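The "URL Pattern + Location" idea can be sketched as a pair: a generalized form of the URL plus the DOM path where the anchor appears. The generalization rules below (digits in the path become `<num>`, query values become `*`) and the function names are assumptions chosen for illustration; the slide does not specify how patterns are derived.

```python
import re
from urllib.parse import parse_qsl, urlsplit


def url_pattern(url):
    """Generalize a URL: mask numbers in the path and values in the query."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "<num>", parts.path)
    keys = sorted(k for k, _ in parse_qsl(parts.query, keep_blank_values=True))
    query = "&".join(f"{k}=*" for k in keys)
    return f"{path}?{query}" if query else path


def link_signature(url, dom_path):
    """A link is described by its URL pattern and its location in the page."""
    return (url_pattern(url), dom_path)


# Hypothetical links found at the same location on two different pages.
location = "html/body/table/tr/td/a"
sig_1 = link_signature("http://example.com/showthread.php?t=123&page=2", location)
sig_2 = link_signature("http://example.com/showthread.php?page=7&t=999", location)
```

Both links collapse to the same signature, so they can be treated as the same kind of link even though the concrete thread IDs differ.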

  9. Inner-Page Features • The inclusion relation. Data records usually have inclusion relations with one another. • The alignment relation. Since data is generated from a database and rendered via templates, data records with the same label may appear repeatedly in a page. • Time order. Since post records are generated sequentially along a timeline, post times should appear in ascending or descending order.
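The time-order feature can be sketched as a simple monotonicity check over the date-like values found in aligned positions of a page. The timestamp format and the sample data below are made up for illustration; the point is that post times pass the check while other date-like fields, such as author join dates, typically do not.

```python
from datetime import datetime


def parse_times(strings, fmt="%Y-%m-%d %H:%M"):
    """Parse timestamp strings; the format is an assumed example."""
    return [datetime.strptime(s, fmt) for s in strings]


def is_time_ordered(timestamps):
    """True if the sequence is sorted ascending or descending."""
    pairs = list(zip(timestamps, timestamps[1:]))
    ascending = all(a <= b for a, b in pairs)
    descending = all(a >= b for a, b in pairs)
    return ascending or descending


# Hypothetical values extracted from two aligned columns of a post page.
post_times = parse_times(["2009-04-01 09:15", "2009-04-01 10:02", "2009-04-02 08:30"])
join_dates = parse_times(["2007-06-11 00:00", "2005-01-20 00:00", "2008-12-03 00:00"])
```

Here `is_time_ordered(post_times)` holds while `is_time_ordered(join_dates)` does not, so the feature helps separate the post-time field from look-alike date fields.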

  10. Inner-vertex Features

  11. Inter-vertex Features

  12. Problem Setting • Author • Title • Content

  13. Formulas of list page • Formulas for identifying list record • Formulas for identifying list title

  14. Formulas of post page • Formulas for identifying post record • Formulas for identifying post author

  15. Formulas of post page • Formulas for identifying post time • Formulas for identifying post content

  16. Markov Logic Networks • An MLN can be viewed as a template for constructing Markov Random Fields. • With a set of formulas and constants, MLNs define a Markov network with one node per ground atom and one feature per ground formula. The probability of a state x in such a network is given by:
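The equation on this slide did not survive in the transcript. For reference, the standard form of the MLN distribution from the Markov logic literature is given below; the notation (weight w_i for formula i, n_i(x) for the number of true groundings of formula i in state x, and Z for the normalizing constant) is the conventional one and may differ from the symbols used on the original slide.

```latex
P(X = x) = \frac{1}{Z} \exp\Big( \sum_{i} w_i \, n_i(x) \Big),
\qquad
Z = \sum_{x'} \exp\Big( \sum_{i} w_i \, n_i(x') \Big)
```

A formula with a larger weight makes states that satisfy more of its groundings exponentially more probable, which is how soft constraints such as the features on the earlier slides can be combined in one model.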

  17. Markov Logic Networks • Divide DOM tree elements into three categories: text elements, hyperlink elements, and inner elements. • Benefits: reduces the number of possible groundings in inference; reduces ambiguity and achieves better performance.
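The effect of typing elements on the number of groundings can be seen with a toy count. The sketch below enumerates argument tuples for a hypothetical two-argument formula, first over all elements and then restricted to an (inner element, hyperlink element) pair; the element names and the formula's argument types are invented for illustration.

```python
from itertools import product

# Toy DOM: element id -> category (text, hyperlink, or inner).
elements = {
    "e1": "inner", "e2": "inner",
    "e3": "hyperlink", "e4": "text",
    "e5": "hyperlink", "e6": "text",
}


def groundings(arg_types, elements):
    """Enumerate argument tuples; a type of None accepts any element."""
    domains = [
        [e for e, kind in elements.items() if wanted is None or kind == wanted]
        for wanted in arg_types
    ]
    return list(product(*domains))


# Untyped: every pair of elements is a candidate grounding.
untyped = groundings([None, None], elements)
# Typed: first argument must be an inner element, second a hyperlink element.
typed = groundings(["inner", "hyperlink"], elements)
```

With six elements the untyped formula has 36 groundings and the typed one has 4; on real pages with hundreds of elements the gap grows accordingly.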

  18. Experiments • List Pages • Post Pages

  19. Experiments

  20. Experiments

  21. Experiments

  22. Future Work • http://discussions.apple.com/

  23. Conclusion • A template-independent approach to extracting structured data from web forum sites. • We can leverage the power of site-level information, such as the mutual information among pages and the inner-vertex and inter-vertex features of the sitemap. • http://research.microsoft.com/people/jmyang/
