Implementing Automatic Value Extraction from Structured Web Pages

Implementing Automatic Value Extraction from Structured Web Pages. Varun Ganapathi, Jonathan Pines, Josh Wiseman. Problem. Context: Many web pages are generated by applying a template to structured data. Goal: Given a set of pages generated from a template, infer the template.


Presentation Transcript


  1. Implementing Automatic Value Extraction from Structured Web Pages Varun Ganapathi, Jonathan Pines, Josh Wiseman

  2. Problem
     • Context:
       • Many web pages are generated by applying a template to structured data
     • Goal:
       • Given a set of pages generated from a template, infer the template.
       • Extract values from previously unseen pages generated from the template
     • Why?
       • The template encodes structure that usually has semantic meaning.
       • The structured values that back a page are all the important information in the page.

  3. What is a Template?
     • It is a special case of a context-free grammar
       • Tuples (fixed-length ordered lists)
       • Sets (arbitrary-length lists denoted by separators)
     • Example of an instantiated template:
         <elem>Ethan Hunt comes face to face with a dangerous and … </elem>
         <elem>6.8</elem>
         <set>
           <tuple><elem>Tom Cruise</elem><elem>Ethan Hunt</elem></tuple>
           <tuple><elem>Ving Rhames</elem><elem>Luther Stickell</elem></tuple>
         </set>
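The tuple/set/elem structure on this slide can be modelled as a small tree type. The sketch below is only an illustration: the class names `ElemNode`, `TupleNode`, `SetNode` and the `render` helper are hypothetical and do not come from the presentation. It shows how a rating plus a one-entry cast list would serialize into the tag notation used in the slide's example.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ElemNode:
    """A leaf slot holding one extracted string value (hypothetical name)."""
    value: str


@dataclass
class TupleNode:
    """A fixed-length ordered list of child nodes (hypothetical name)."""
    items: List["Node"]


@dataclass
class SetNode:
    """An arbitrary-length list of child nodes (hypothetical name)."""
    items: List["Node"]


Node = Union[ElemNode, TupleNode, SetNode]


def render(node: Node) -> str:
    """Serialize a node into the <elem>/<tuple>/<set> notation of the slide."""
    if isinstance(node, ElemNode):
        return f"<elem>{node.value}</elem>"
    tag = "tuple" if isinstance(node, TupleNode) else "set"
    inner = "".join(render(child) for child in node.items)
    return f"<{tag}>{inner}</{tag}>"


# A rating followed by a cast list with a single (actor, character) pair.
page_values = [
    ElemNode("6.8"),
    SetNode([TupleNode([ElemNode("Tom Cruise"), ElemNode("Ethan Hunt")])]),
]
rendered = "".join(render(node) for node in page_values)
print(rendered)
```

Keeping tuples and sets as separate types mirrors the slide's distinction between fixed-length and arbitrary-length lists, even though both serialize the same way here.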

  4. Learning Templates
     • Use the following observations:
       • When tokens occur frequently together, it might be because they are derived from the same template
       • The strings derived from templates have certain properties
         • Ordered
         • Nested
     • Loop
       • Find equivalence classes of differentiated tokens
       • Increase partial template
       • Differentiate tokens based on partial template
     • Construct Template using Patterns
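One way to make "tokens that occur frequently together" concrete is to group tokens whose per-page occurrence counts are identical. The slide does not say how the equivalence classes are computed, so the sketch below is an assumption: the functions `occurrence_vectors` and `equivalence_classes`, the `min_size` and `min_support` thresholds, and the two toy pages are all hypothetical. It covers only the "find equivalence classes" step, not token differentiation or template construction.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def occurrence_vectors(pages: List[List[str]]) -> Dict[str, Tuple[int, ...]]:
    """Map each token to its count in every page, in page order."""
    counts = [Counter(page) for page in pages]
    vocabulary = set().union(*pages)
    return {tok: tuple(c[tok] for c in counts) for tok in vocabulary}


def equivalence_classes(
    pages: List[List[str]], min_size: int = 2, min_support: int = 2
) -> Dict[Tuple[int, ...], List[str]]:
    """Group tokens with identical occurrence vectors.

    A class is kept only if it has at least `min_size` tokens and its
    tokens appear in at least `min_support` pages (assumed thresholds).
    """
    groups = defaultdict(list)
    for token, vector in occurrence_vectors(pages).items():
        groups[vector].append(token)
    return {
        vector: sorted(tokens)
        for vector, tokens in groups.items()
        if len(tokens) >= min_size
        and sum(1 for n in vector if n > 0) >= min_support
    }


# Two toy pages, tokenized on whitespace (made-up input for illustration).
pages = [
    "<html> <b> Rating: </b> 6.8 <ul> <li> Tom Cruise </li> "
    "<li> Ving Rhames </li> </ul> </html>".split(),
    "<html> <b> Rating: </b> 7.1 <ul> <li> Jane Doe </li> </ul> </html>".split(),
]
classes = equivalence_classes(pages)
for vector, tokens in sorted(classes.items()):
    print(vector, tokens)
```

In this toy input the page skeleton tokens share the vector (1, 1) and the list-item tags share (2, 1), while data tokens such as names appear in only one page and are filtered out by the support threshold.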

  5. Evaluation
     • We manually extracted “interesting” data from several IMDB movie pages.
         <elem>Ethan Hunt comes face to face with a dangerous and … </elem>
         <elem>6.8</elem>
         <set>
           <tuple><elem>Tom Cruise</elem><elem>Ethan Hunt</elem></tuple>
           <tuple><elem>Ving Rhames</elem><elem>Luther Stickell</elem></tuple>
         </set>
     • Some attributes: title, writers, directors, plot summary, rating, actors, languages, trivia, …
     • Each attribute was graded as one of:
       • Correct: the system extracted exactly the expected data.
       • Partially correct: the system extracted the expected data plus some extra content.
       • Incorrect: the system missed some data.
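The three grades can be expressed as a comparison between the values the system extracted for an attribute and the manually extracted values. The sketch below assumes both are available as sets of strings; the `grade` function and this set-based comparison are hypothetical and may differ from how the authors actually scored attributes.

```python
from typing import Set


def grade(extracted: Set[str], gold: Set[str]) -> str:
    """Grade one attribute against manually extracted (gold) values."""
    if extracted == gold:
        return "correct"
    if gold <= extracted:
        # Everything expected is present, plus extra content.
        return "partially correct"
    # At least one expected value is missing.
    return "incorrect"


gold_actors = {"Tom Cruise", "Ving Rhames"}
print(grade({"Tom Cruise", "Ving Rhames"}, gold_actors))
print(grade({"Tom Cruise", "Ving Rhames", "6.8"}, gold_actors))
print(grade({"Tom Cruise"}, gold_actors))
```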

  6. Results

  7. Results

  8. Results
     • Attributes:
       • 5 correct
       • 5 partially correct
       • 6 incorrect
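The counts on this slide add up to 16 graded attributes. As a small worked calculation (the variable names are illustrative only), the share of each grade follows directly from those counts:

```python
# Attribute counts as reported on the Results slide.
counts = {"correct": 5, "partially correct": 5, "incorrect": 6}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {counts[label]}/{total} = {share:.2%}")
```

Under these counts, fewer than a third of the attributes were fully correct, and counting partially correct attributes as well gives 10 of 16.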
