

  1. Initial Results on Wrapping Semistructured Web Pages with Finite-State Transducers and Contextual Rules
  Chun-Nan Hsu, Arizona State University

  2. Introduction
  • Based on the problem of wrapper generation (which extracts from structured text only)
  • Attempts to generate wrappers for unstructured data as well:
    • Missing attributes
    • Multiple attribute values
    • Variant attribute permutations
    • Exceptions and typos (in the extracted items themselves, or in the contextual items?)
  • Fairly high-level overview

  3. Extraction Method
  • HTML tokenizer; token classes include:
    • All-caps strings
    • Strings beginning with a capital letter
    • Lowercase strings
    • HTML tags
    • etc.
  • Finite-state transducer (an FSA with output instructions)
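A minimal sketch of such a tokenizer, assuming regex-based token classes; the class names and patterns below are illustrative rather than the paper's exact definitions.

```python
import re

# Illustrative token classes loosely following the slide's categories;
# the names and regexes are assumptions, not the paper's exact definitions.
TOKEN_CLASSES = [
    ("Html",   re.compile(r"<[^>]+>")),          # HTML tags such as <LI>, </A>
    ("CAlph",  re.compile(r"[A-Z][A-Z0-9]+")),   # all-caps strings
    ("C1Alph", re.compile(r"[A-Z][a-z]\w*")),    # strings beginning with a capital letter
    ("0Alph",  re.compile(r"[a-z]\w*")),         # lowercase strings
    ("Num",    re.compile(r"\d+")),              # digit strings
    ("Punc",   re.compile(r"[^\w\s<]")),         # punctuation
]

def tokenize(text):
    """Yield (class, lexeme) pairs for an HTML fragment."""
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for cls, pattern in TOKEN_CLASSES:
            m = pattern.match(text, pos)
            if m:
                yield cls, m.group(0)
                pos = m.end()
                break
        else:
            pos += 1  # skip characters no class matches

print(list(tokenize('<LI> <A HREF="..."> Mani Chandy </A>')))
# e.g. [('Html', '<LI>'), ('Html', '<A HREF="...">'), ('C1Alph', 'Mani'),
#       ('C1Alph', 'Chandy'), ('Html', '</A>')]
```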

  4. Contextual Rules
  • Describe "separators" between fields
    • May or may not be physical characters (e.g. "CA90210" → "CA" and "90210")
  • Appear before and after each extraction field (h and t, respectively)
  • Composed of a left context (hL or tL) and a right context (hR or tR)
  • Used to describe transitions between states in the FST
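One way to picture a contextual rule is as a pair of token patterns around a candidate separator position, operating on the (class, lexeme) tokens from the sketch above. This is an assumed, simplified representation, not the paper's exact rule syntax.

```python
# Simplified, assumed representation of a contextual rule: the left context (hL/tL)
# and right context (hR/tR) are sequences of token classes or literal lexemes.

def token_matches(pattern, token):
    cls, lexeme = token
    return pattern == cls or pattern == lexeme   # class name or exact string

def rule_matches(left_ctx, right_ctx, tokens, i):
    """True if a separator sits between tokens[i-1] and tokens[i]."""
    if i < len(left_ctx) or i + len(right_ctx) > len(tokens):
        return False
    left = tokens[i - len(left_ctx):i]
    right = tokens[i:i + len(right_ctx)]
    return all(token_matches(p, t) for p, t in zip(left_ctx, left)) and \
           all(token_matches(p, t) for p, t in zip(right_ctx, right))

# Hypothetical head separator for the Name field in the faculty-page example:
# left context ['Html'] (the <A HREF=...> tag) and right context ['C1Alph']
# (a capitalized word starting the name).
```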

  5. Finite-State Transducer
  Example application: faculty web pages
    <LI> <A HREF="…"> Mani Chandy </A>, <I>Professor of Computer Science</I> and <I>Executive Officer for Computer Science</I> …
    <LI> Fred Thompson, <I>Professor Emeritus of Applied Philosophy and Computer Science</I>
  Key:
  • ? : wildcard
  • U : state to extract URL; _U : state to skip over tokens until we reach N
  • N : state to extract Name; _N : state to skip over tokens until we reach A
  • s<X,Y> : separator rule for the separator of states X and Y
  • etc.
  [FST diagram in the original slide; its states include b, U, _U, N, _N and its transitions are labeled input / output, e.g. s<b,U> / "U=" + next_token, s<b,N> / "N=" + next_token, s<U,N> / "N=" + next_token, s<U,U> / ε, s<N,N> / ε, ? / next_token, ? / ε]
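To make the transducer concrete, here is a minimal hand-built sketch in the spirit of the diagram: separator tests decide state changes, and wildcard transitions either emit or skip tokens. The separator tests, the "field=value" output format, and the example URL are assumptions made for this sketch, not taken from the slide.

```python
import re

# Hypothetical input; the slide elides the actual HREF value.
FACULTY_HTML = (
    '<LI> <A HREF="http://example.edu/chandy"> Mani Chandy </A>, '
    "<I>Professor of Computer Science</I>"
)

def run_fst(tokens):
    state, output = "b", []
    for tok in tokens:
        if state == "b" and tok.startswith("<A HREF="):   # separator into the URL field
            url = re.search(r'HREF="([^"]*)"', tok).group(1)
            output.append("U=" + url)                      # "U=" + next_token
            state = "N"                                     # following tokens belong to Name
        elif state == "N":
            if tok == "</A>":                               # closing tag ends the Name field
                state = "_N"                                # dummy state: skip the rest
            else:
                output.append("N=" + tok)                   # ? / next_token inside Name
        else:
            pass                                            # ? / ε : skip tokens outside fields
    return output

tokens = re.findall(r"<[^>]+>|[^\s<]+", FACULTY_HTML)
print(run_fst(tokens))   # ['U=http://example.edu/chandy', 'N=Mani', 'N=Chandy']
```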

  6. Induction Algorithm
  • Calculate the permutation of extraction fields (e.g. U, U, N, N, M) and add transitions to the graph if necessary
  • Not every permutation has to appear in the training data!! (Somewhat incorrect info in [eikvil99])
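The permutation point can be illustrated with a small sketch: collect the field orderings seen in labeled tuples and add a transition for every consecutive field pair, so an ordering can be accepted even if no single training tuple exhibited it as a whole. The "b"/"end" markers and function names here are illustrative assumptions.

```python
# Build the transition skeleton from observed field orderings.
def build_transitions(labeled_tuples):
    transitions = set()
    for fields in labeled_tuples:          # fields = ordering of labeled attributes in one tuple
        path = ("b", *fields, "end")       # assumed start/end markers
        transitions.update(zip(path, path[1:]))
    return transitions

edges = build_transitions([("U", "N", "M"), ("N", "U", "M")])
print(sorted(edges))
# The ordering U, M (path b -> U -> M -> end) is covered even though no training
# tuple had exactly that ordering: each of its individual transitions was observed.
```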

  7. Induction Algorithm
  For each extraction field in the training set:
  • Generate the left and right separators, and add them to the corresponding contextual-rule lists
  • Align tokens into columns
    • Heuristic: align word tokens together, with non-word tokens starting closest to the separator boundary
  • Attempt generalization with other rules:
    • Replace related tokens with their least common ancestor in the taxonomy tree
    • Generalize whitespace tokens
  • Remove any duplicate rules
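A minimal sketch of the least-common-ancestor step, assuming a small token-class taxonomy; the tree and class names below are illustrative, not the paper's actual hierarchy. Two rules that differ at a position are merged by replacing the differing tokens with the most specific class covering both.

```python
# Illustrative token-class taxonomy: child -> parent. Not the paper's actual tree.
PARENT = {
    "CAlph": "Word", "C1Alph": "Word", "0Alph": "Word",
    "Word": "Token", "Num": "Token", "Punc": "Token", "Html": "Token",
}

def ancestors(cls):
    chain = [cls]
    while cls in PARENT:
        cls = PARENT[cls]
        chain.append(cls)
    return chain

def least_common_ancestor(a, b):
    """Most specific class covering both token classes."""
    a_chain = ancestors(a)
    return next(c for c in ancestors(b) if c in a_chain)

def generalize(rule_a, rule_b):
    """Merge two contextual rules position by position."""
    return [x if x == y else least_common_ancestor(x, y) for x, y in zip(rule_a, rule_b)]

print(generalize(["Html", "C1Alph"], ["Html", "CAlph"]))   # ['Html', 'Word']
```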

  8. Comparison with the DEG Method
  • Similarities:
    • FSA ≈ regular expressions (equivalent in expressive power)
  • Differences:
    • Focuses on separators between extraction fields, whereas DEG focuses on patterns of the field itself
    • Designed to generate wrappers (for a specific website) rather than general-purpose extraction rules

  9. Results
  Notes:
  • Training tuples used = the number of tuples labeled by the user needed to cover the total tuples in the training pages
  • Recall after 10 pages: 60/69 = 87%
  • Precision = …100%?
  • What does "Total unseen tuples covered" mean…?
  • No comparison with other algorithms

  10. Conclusions
  • SoftMealy uses an FST to construct wrappers for structured and semistructured text.
  • The FST structure is based on contextual rules that describe what separates each extraction field.
  • The rules are learned from training documents that the user marks up interactively.
  • This could be an interesting approach, but a more complete analysis is needed.
