Character-Level Analysis of Semi-Structured Documents for Set Expansion

Richard C. Wang and William W. Cohen, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. Summary: We illustrated… the construction of character-based wrappers used in SEAL

Presentation Transcript


  1. Character-Level Analysis of Semi-Structured Documents for Set Expansion. Richard C. Wang and William W. Cohen, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA

  2. Summary. We illustrated the construction of character-based wrappers used in SEAL, and a method to extend SEAL to learn binary relational concepts. We showed that character-based wrappers perform better than HTML-based wrappers, and that binary SEAL has good performance.

  3. Background: SEAL (Set Expander for Any Language; Wang & Cohen, ICDM 2007). An example of set expansion: given an input query (seeds) { survivor, amazing race }, the output answer is { american idol, big brother, ... }.

  4. Features: independent of human and markup language; supports seeds in English, Chinese, Japanese, ...; accepts documents in HTML, XML, SGML, TeX, …; does not require pre-annotated training data; utilizes a readily available corpus, the World Wide Web. Research contributions: automatically constructs wrappers for extracting candidate items; ranks candidates using random walk.

  5. SEAL’s Architecture. Fetcher: downloads web pages containing all seeds. Extractor: learns and constructs wrappers. Ranker: ranks candidate items using random walk. [Figure: example with camera brands: Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung, …, Canon, Nikon, Olympus]

  6. Wrapper Learner (WL) • The current WL learns only unary relations, e.g., x is a mayor • A unary wrapper consists of a pair of left (L) and right (R) context strings • It extracts all strings between L and R • The extended WL learns binary relations, e.g., x is the mayor of city y • A binary wrapper has an additional middle (M) context string • It extracts string pairs between L and M, and between M and R
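The extraction step described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not SEAL's implementation: the function names `extract_unary` and `extract_binary`, the regex-based matching, and the sample documents are all hypothetical.

```python
import re


def extract_unary(document, left, right):
    """Return every string found between the contexts `left` and `right`."""
    # Contexts are literal character strings, so escape them for the regex.
    pattern = re.escape(left) + r"(.+?)" + re.escape(right)
    return re.findall(pattern, document, flags=re.DOTALL)


def extract_binary(document, left, middle, right):
    """Return (x, y) pairs: x lies between L and M, y between M and R."""
    pattern = (re.escape(left) + r"(.+?)" + re.escape(middle)
               + r"(.+?)" + re.escape(right))
    return re.findall(pattern, document, flags=re.DOTALL)


# Hypothetical documents, made up for this sketch.
cars_doc = "<li>Ford</li><li>Nissan</li><li>Toyota</li>"
mayors_doc = ("<tr><td>Alice</td><td>Springfield</td></tr>"
              "<tr><td>Bob</td><td>Shelbyville</td></tr>")

print(extract_unary(cars_doc, "<li>", "</li>"))
print(extract_binary(mayors_doc, "<tr><td>", "</td><td>", "</td></tr>"))
```

Because the contexts are plain character strings, the same two functions would apply unchanged to documents in any markup language.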

  7. Unary Relation Wrapper Construction

  8. Real Unary Wrappers. Given seeds: Ford, Nissan, Toyota. Examples of wrappers and extractions:

  9. Mock Unary Example. Given seeds: Ford, Nissan, Toyota. Example document written in an unknown mark-up language:

  10. Context tries for mock example: Constructed unary wrappers:
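The tries and the mock document from these slides are not preserved in the transcript. As a rough stand-in, the sketch below finds character-level wrappers by brute force rather than with tries: for each way of choosing one occurrence per seed, it takes the longest left context and the longest right context shared by all chosen occurrences. The function names, the brute-force strategy, and the mock document are assumptions made for illustration and may differ from the procedure shown on the slides.

```python
from itertools import product
from os.path import commonprefix


def occurrences(document, seed):
    """Yield (left_context, right_context) for each occurrence of `seed`."""
    start = document.find(seed)
    while start != -1:
        yield document[:start], document[start + len(seed):]
        start = document.find(seed, start + 1)


def common_suffix(strings):
    """Longest string that every element of `strings` ends with."""
    return commonprefix([s[::-1] for s in strings])[::-1]


def learn_unary_wrappers(document, seeds):
    """Return (L, R) pairs that bracket one occurrence of every seed."""
    per_seed = [list(occurrences(document, seed)) for seed in seeds]
    wrappers = set()
    # Try every combination of one occurrence per seed.
    for combo in product(*per_seed):
        left = common_suffix([l for l, _ in combo])
        right = commonprefix([r for _, r in combo])
        if left and right:  # both contexts must be non-empty
            wrappers.add((left, right))
    return wrappers


# Hypothetical document in a made-up markup language.
mock_doc = "item=(Ford) item=(Honda) item=(Nissan) item=(Toyota) end"
print(learn_unary_wrappers(mock_doc, ["Ford", "Nissan", "Toyota"]))
```

On this mock document the single learned wrapper also brackets Honda, which was not a seed; that is how a wrapper learned from a few seeds yields new candidate items. The combination loop grows quickly with the number of seed occurrences, which is one reason a trie-based construction is preferable in practice.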

  11. Unary SEAL Evaluation. Metric: Mean Average Precision. Dataset: 36 datasets (Wang & Cohen, ICDM 2007). Evaluated on 5 types of wrappers: Type 1 is the least strict (SEAL’s default); Type 5 is the most strict, yet still less strict than any HTML wrapper. Result: stricter wrappers perform worse.
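For readers unfamiliar with the metric, the following is a minimal sketch of mean average precision in its common textbook form. Whether the evaluation used exactly this variant is an assumption; the function names and the toy ranking are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average of precision@rank over the ranks where a new correct item appears."""
    hits = 0
    total = 0.0
    seen = set()
    for rank, item in enumerate(ranked, start=1):
        if item in relevant and item not in seen:
            seen.add(item)          # count each correct item only once
            hits += 1
            total += hits / rank    # precision at this rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Toy ranking, made up for this sketch: correct items at ranks 1 and 3.
ranked = ["american idol", "weather report", "big brother"]
relevant = {"american idol", "big brother"}
print(average_precision(ranked, relevant))  # (1/1 + 2/3) / 2
```

Dividing by the number of correct items, rather than the number retrieved, penalizes a ranking that misses part of the target set.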

  12. Binary Wrapper Construction • Keep track of all middle contexts: • In the unary code, replace Intersect with:

  13. Real Binary Wrappers

  14. Binary SEAL Evaluation • Relational datasets • Surveyed more than a dozen • Randomly selected five: • Bootstrap results ten times using iSEAL (an iterative version of SEAL; Wang & Cohen, ICDM 2008)
