Thresher: Automating the Unwrapping of Semantic Content from the World Wide Web


  1. Thresher: Automating the Unwrapping of Semantic Content from the World Wide Web
     Andrew Hogue (Google, MIT CSAIL)
     WWW 2005, Chiba, Japan

  2. Acknowledgments
     • David Karger (karger@csail.mit.edu)
     • Haystack Group (http://haystack.csail.mit.edu)

  3. Agenda
     • Overview
     • Demo
     • Details
     • Induction
     • Matching
     • Semantics
     • Heuristics

  5. Unwrapping the Web
     • Majority of semantic content in “deep web”
     • Transformed into human-readable HTML by scripts
     • HTML is difficult for automated agents to understand
     • Little incentive for content providers to provide RDF markup
     • How to “unwrap” this content?

  6. Thresher
     • Simple UI for wrapper induction on structured web content
     • “Demonstrate” examples of objects
     • Induce wrapper, or pattern, based on DOM
     • User may also label properties with RDF

  7. Thresher
     • Built on Haystack Semantic Web client
     • Everything is RDF
     • Everything has context menus
     • Thresher brings RDF into the web browser
     • Wrappers reify web objects for full interaction

  8. Thresher
     • Underlying wrapper algorithm based on tree edit distance
     • Align user’s examples
     • Keep aligned nodes (layout elements)
     • Wildcard non-aligned nodes (content)
     • Pattern matching is also alignment

  11. Wrapper Induction
      • Wrapper: pattern created from examples
      • User provides positive examples
      • Generalize examples into reusable pattern
      • Existing techniques:
        • head-left-right-tail (HLRT) descriptors
        • Hidden Markov models
        • Support Vector Machines
        • Other Machine Learning

  12. Wrapper Induction
      • Our approach: take advantage of hierarchical structure of HTML
      • Each example picks out a subtree of DOM
      • Calculate tree edit distance between examples
      • Least-cost edit distance gives best mapping
      • Remove unmapped nodes to make pattern
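The generalization step can be sketched roughly as follows. This is only an illustration, not Thresher's implementation: it aligns children by tag using Python's `difflib.SequenceMatcher` as a stand-in for the least-cost tree edit mapping, and it follows the earlier "wildcard non-aligned nodes" bullet by replacing unaligned content with a `*` wildcard. The tree encoding (`(tag, children)` tuples, plain strings for text), the function names, and the sample rows are all assumptions made for the example.

```python
from difflib import SequenceMatcher

WILDCARD = "*"


def key(node):
    """Alignment key: the tag for elements, a fixed marker for text nodes."""
    return "#text" if isinstance(node, str) else node[0]


def generalize(a, b):
    """Merge two example subtrees into a single pattern."""
    # Two text nodes: keep shared text (layout), wildcard differing text (content).
    if isinstance(a, str) and isinstance(b, str):
        return a if a == b else WILDCARD
    # Text vs. element, or elements with different tags: cannot be aligned.
    if isinstance(a, str) or isinstance(b, str) or a[0] != b[0]:
        return WILDCARD
    kids_a, kids_b = a[1], b[1]
    matcher = SequenceMatcher(
        None, [key(k) for k in kids_a], [key(k) for k in kids_b], autojunk=False
    )
    pattern = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Aligned children are generalized pairwise.
            pattern += [generalize(x, y) for x, y in zip(kids_a[i1:i2], kids_b[j1:j2])]
        elif not pattern or pattern[-1] != WILDCARD:
            # Unaligned children collapse into one wildcard.
            pattern.append(WILDCARD)
    return (a[0], pattern)


# Two made-up table rows that share layout but differ in content.
ex1 = ("tr", [("td", ["Series:"]), ("td", [("b", ["Talk A"])]), ("td", ["1:00 PM"])])
ex2 = ("tr", [("td", ["Series:"]), ("td", [("b", ["Talk B"])]), ("td", ["2:30 PM"])])
pattern = generalize(ex1, ex2)
print(pattern)
```

The shared label `Series:` survives as layout, while the differing title and time become wildcards.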

  13. Tree Edit Distance
      • Calculate the cost of the sequence of operations that transforms one tree into the other
      • Operations: insert, delete, change a node
      • Cost of an operation = size of subtree it affects
      • Least-cost set of operations gives best mapping between elements
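A minimal sketch of this cost model, under stated assumptions: it computes a simplified top-down edit distance in which the two roots are always mapped to each other and children are aligned left to right, which is more restrictive than a general tree edit distance. Inserting or deleting a node costs the size of its subtree, as on the slide; charging 1 for changing a single node's tag is an assumption. The `Node` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    tag: str
    children: tuple = ()


def size(t):
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def distance(a, b):
    """Cost of turning tree a into tree b with their roots mapped together."""
    change = 0 if a.tag == b.tag else 1  # assumed cost of changing one node
    return change + forest_distance(a.children, b.children)


def forest_distance(xs, ys):
    """Sequence edit distance over two child lists, with subtree-sized costs."""
    m, n = len(xs), len(ys)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(xs[i - 1])  # delete whole subtree
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(ys[j - 1])  # insert whole subtree
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(xs[i - 1]),
                d[i][j - 1] + size(ys[j - 1]),
                d[i - 1][j - 1] + distance(xs[i - 1], ys[j - 1]),
            )
    return d[m][n]


a = Node("tr", (Node("td", (Node("b"),)), Node("td")))
b = Node("tr", (Node("td", (Node("b"),)), Node("td", (Node("a"),)), Node("td")))
print(distance(a, b))  # 2: the two extra nodes of b have to be inserted
```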

  14-16. Mapping Examples (figures)

  18. Pattern Matching
      • Look for document subtrees with similar structure
      • Find alignments of wrapper in tree
      • Require every node in wrapper be mapped to some node in document subtree
      • Wildcards match zero or more times
      • Each valid alignment is a match
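The matching rules above might be sketched like this. It is a simplified, boolean version written for illustration: a `*` in a child list absorbs zero or more sibling nodes, every other wrapper node must map to a document node, and, as an extra assumption not stated on the slide, every document child must also be consumed. Trees are hypothetical `(tag, children)` tuples with plain strings for text.

```python
WILDCARD = "*"


def matches(pat, node):
    """True if the pattern aligns with the subtree rooted at node."""
    if pat == WILDCARD:
        return True
    if isinstance(pat, str) or isinstance(node, str):
        return pat == node  # literal text must match exactly
    return pat[0] == node[0] and match_seq(pat[1], node[1])


def match_seq(pats, nodes):
    """Align a list of pattern children with a list of document children."""
    if not pats:
        return not nodes
    if pats[0] == WILDCARD:
        # The wildcard absorbs zero or more leading document nodes.
        return any(match_seq(pats[1:], nodes[k:]) for k in range(len(nodes) + 1))
    return bool(nodes) and matches(pats[0], nodes[0]) and match_seq(pats[1:], nodes[1:])


def find_all(pat, node):
    """Collect every document subtree that the pattern matches."""
    found = [node] if matches(pat, node) else []
    if not isinstance(node, str):
        for child in node[1]:
            found += find_all(pat, child)
    return found


pattern = ("tr", [("td", [WILDCARD]), ("td", [WILDCARD])])
doc = ("table", [
    ("tr", [("td", ["a"]), ("td", ["b"])]),
    ("tr", [("td", ["c"]), ("td", [])]),  # empty cell: wildcard matches zero nodes
    ("tr", [("th", ["header"])]),         # different structure: no match
])
print(len(find_all(pattern, doc)))
```

Each subtree returned by `find_all` corresponds to one valid alignment, that is, one match.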

  19. Matching Example (figure)

  21. Adding Semantics
      • How to tie wrappers to semantic content?
      • Assert RDF statements about unwrapped objects
      • Tied to wrapper structure
      • Classes bound to wrappers
      • Properties bound to wildcards

  22. Semantic Labels (figure)

  23-24. Semantic Matching (figures)

  25. Semantic Matching
      [ <rdf:type> <TalkAnnouncement> ;
        <series> “Dertouzos Lect…” ;
        <dc:title> “Distributed Hash…” ;
        <time> “3:30 PM” ]
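One way to picture how a class bound to a wrapper and properties bound to wildcards yield statements like the ones above is the sketch below. It is an assumption-laden illustration rather than Thresher's code: the `Wild` class, `extract`, `to_triples`, and the blank-node label `_:match` are hypothetical, each labeled wildcard is assumed to capture exactly one text node, and the row values are made up. Only the class and property names come from the slide.

```python
class Wild:
    """A wildcard in the wrapper, labeled with the RDF property it feeds."""

    def __init__(self, prop):
        self.prop = prop


def extract(pat, node):
    """Return {property: text} if the wrapper matches the node, else None."""
    if isinstance(pat, Wild):
        return {pat.prop: node} if isinstance(node, str) else None
    if isinstance(pat, str) or isinstance(node, str):
        return {} if pat == node else None
    tag, kids = pat
    if tag != node[0] or len(kids) != len(node[1]):
        return None
    bindings = {}
    for p, n in zip(kids, node[1]):
        sub = extract(p, n)
        if sub is None:
            return None
        bindings.update(sub)
    return bindings


def to_triples(rdf_class, bindings, subject="_:match"):
    """Turn the wrapper's class and wildcard bindings into triples."""
    triples = [(subject, "rdf:type", rdf_class)]
    triples += [(subject, prop, value) for prop, value in bindings.items()]
    return triples


wrapper = ("tr", [
    ("td", [Wild("series")]),
    ("td", [("b", [Wild("dc:title")])]),
    ("td", [Wild("time")]),
])
row = ("tr", [("td", ["Example Series"]), ("td", [("b", ["Example Talk"])]), ("td", ["1:00 PM"])])
bindings = extract(wrapper, row)
triples = to_triples("TalkAnnouncement", bindings)
for t in triples:
    print(t)
```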

  27. Automatically Adding Examples
      • Find additional examples automatically
      • Consider nodes neighboring the example
      • Require low normalized cost
      • Often allows us to create wrappers with a single example
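A rough sketch of this heuristic follows. The transcript does not preserve the slide's normalization formula, so the one used here (edit cost divided by the combined size of the two trees) and the 0.2 threshold are assumptions chosen purely for illustration. The edit cost is a simplified top-down distance with subtree-sized insert and delete costs, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    tag: str
    children: tuple = ()


def size(t):
    return 1 + sum(size(c) for c in t.children)


def distance(a, b):
    """Simplified top-down edit cost; changing one tag costs 1 (assumed)."""
    xs, ys = a.children, b.children
    d = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        d[i][0] = d[i - 1][0] + size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        d[0][j] = d[0][j - 1] + size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            d[i][j] = min(
                d[i - 1][j] + size(xs[i - 1]),
                d[i][j - 1] + size(ys[j - 1]),
                d[i - 1][j - 1] + distance(xs[i - 1], ys[j - 1]),
            )
    return (0 if a.tag == b.tag else 1) + d[len(xs)][len(ys)]


def normalized_cost(a, b):
    # Assumed normalization: not taken from the slide.
    return distance(a, b) / (size(a) + size(b))


def similar_siblings(parent, index, threshold=0.2):
    """Indices of siblings structurally close to the user's example."""
    example = parent.children[index]
    return [
        i for i, sib in enumerate(parent.children)
        if i != index and normalized_cost(example, sib) <= threshold
    ]


text = Node("#text")  # text content is ignored; only structure is compared
row = Node("tr", (Node("td", (text,)), Node("td", (text,))))
separator = Node("tr", (Node("td", (Node("hr"),)),))
table = Node("table", (row, row, separator))
print(similar_siblings(table, 0))  # [1]
```

Here the second row has the same structure as the example and is picked up, while the separator row (normalized cost 3/8 under the assumed formula) is rejected.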

  28. Automatically Adding Examples (figure)

  29. List Collapse
      • Current wrappers generalize well for single elements
      • Will not recognize variable length lists
      • Collapse neighboring nodes with low normalized cost
      • For matching, allow nodes to match more than once
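The idea can be illustrated with a deliberately simplified sketch. Exact structural equality stands in for "low normalized cost", and a collapsed node is allowed to match one or more consecutive document nodes. The `(tag, children)` tuple encoding and the names `collapse` and `match_children` are assumptions for the example.

```python
def collapse(children):
    """Merge runs of identical neighboring nodes into one repeatable node.

    Returns a list of (repeatable, node) pairs.
    """
    out = []
    for child in children:
        if out and out[-1][1] == child:
            out[-1] = (True, child)  # neighbor is the same: mark as repeatable
        else:
            out.append((False, child))
    return out


def match_children(pats, nodes):
    """Match collapsed pattern children against document children."""
    if not pats:
        return not nodes
    repeatable, pat = pats[0]
    if not nodes or nodes[0] != pat:
        return False
    # Either this pattern node is done, or (if repeatable) it matches again.
    if match_children(pats[1:], nodes[1:]):
        return True
    return repeatable and match_children(pats, nodes[1:])


li = ("li", ())
pattern = collapse([li, li])  # the user's example had two list items
print(pattern)                # [(True, ('li', ()))]
print(match_children(pattern, [li] * 5))  # True: variable-length list
```

Without the collapse step, the two-item pattern would only match lists of exactly two items.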

  30. Wrapper Wrap-up
      • Gather user example(s)
      • Automatically find additional examples
      • Generalize examples using best mapping
      • Add semantic labels
      • Match by finding alignments
      • Overlay objects on the page for interaction

  31. Additional Tools
      • Wrapper Sharing
      • RSS
      • Web Operations

  32. Our Contributions
      • End-user wrapper induction
      • Few examples required
      • Bring object interaction into the browser
      • Wrappers bridge syntactic-semantic gap

  33. Future Work and Applications
      • Document-level classes
      • Page reformatting
      • Autonomous agent interaction
      • Negative examples
      • Automatic wrapper induction

  34. ahogue@google.com
      http://haystack.csail.mit.edu
