

  1. Initial Results on Wrapping Semistructured Web Pages with Finite-State Transducers and Contextual Rules
  Chun-Nan Hsu, Arizona State University

  2. Introduction
  • Based on the problem of wrapper generation (which extracts from structured text only)
  • Attempts to generate wrappers for unstructured data as well:
    • Missing attributes
    • Multiple attribute values
    • Variant attribute permutations
    • Exceptions and typos (in the extracted items themselves, or in the contextual items?)
  • Fairly high-level overview

  3. Extraction Method
  • HTML tokenizer; token classes include:
    • All-caps strings
    • Strings beginning with a capital letter
    • Lowercase strings
    • HTML tags
    • etc.
  • Finite-state transducer (an FSA with output instructions)
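A minimal sketch of such a tokenizer, assuming regex-based token classes; the class names and patterns below are illustrative rather than the paper's exact definitions.

```python
import re

# Illustrative token classes loosely following the slide's categories;
# the names and regexes are assumptions, not the paper's exact definitions.
TOKEN_CLASSES = [
    ("Html",   re.compile(r"<[^>]+>")),          # HTML tags such as <LI>, </A>
    ("CAlph",  re.compile(r"[A-Z][A-Z0-9]+")),   # all-caps strings
    ("C1Alph", re.compile(r"[A-Z][a-z]\w*")),    # strings beginning with a capital letter
    ("0Alph",  re.compile(r"[a-z]\w*")),         # lowercase strings
    ("Num",    re.compile(r"\d+")),              # digit strings
    ("Punc",   re.compile(r"[^\w\s<]")),         # punctuation
]

def tokenize(text):
    """Yield (class, lexeme) pairs for an HTML fragment."""
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for cls, pattern in TOKEN_CLASSES:
            m = pattern.match(text, pos)
            if m:
                yield cls, m.group(0)
                pos = m.end()
                break
        else:
            pos += 1  # skip characters no class matches

print(list(tokenize('<LI> <A HREF="..."> Mani Chandy </A>')))
# e.g. [('Html', '<LI>'), ('Html', '<A HREF="...">'), ('C1Alph', 'Mani'),
#       ('C1Alph', 'Chandy'), ('Html', '</A>')]
```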

  4. Contextual Rules
  • Describe "separators" between fields
    • May or may not be physical characters (e.g. "CA90210" → "CA" and "90210")
  • Appear before and after each extraction field (h and t, respectively)
  • Composed of a left context (hL or tL) and a right context (hR or tR)
  • Used to describe transitions between states in the FST
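One way to picture a contextual rule is as a pair of token patterns around a candidate separator position, operating on the (class, lexeme) tokens from the sketch above. This is an assumed, simplified representation, not the paper's exact rule syntax.

```python
# Simplified, assumed representation of a contextual rule: the left context (hL/tL)
# and right context (hR/tR) are sequences of token classes or literal lexemes.

def token_matches(pattern, token):
    cls, lexeme = token
    return pattern == cls or pattern == lexeme   # class name or exact string

def rule_matches(left_ctx, right_ctx, tokens, i):
    """True if a separator sits between tokens[i-1] and tokens[i]."""
    if i < len(left_ctx) or i + len(right_ctx) > len(tokens):
        return False
    left = tokens[i - len(left_ctx):i]
    right = tokens[i:i + len(right_ctx)]
    return all(token_matches(p, t) for p, t in zip(left_ctx, left)) and \
           all(token_matches(p, t) for p, t in zip(right_ctx, right))

# Hypothetical head separator for the Name field in the faculty-page example:
# left context ['Html'] (the <A HREF=...> tag) and right context ['C1Alph']
# (a capitalized word starting the name).
```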

  5. Finite-State Transducer
  Example application: faculty web pages
    <LI> <A HREF="…"> Mani Chandy </A>, <I>Professor of Computer Science</I> and <I>Executive Officer for Computer Science</I> …
    <LI> Fred Thompson, <I>Professor Emeritus of Applied Philosophy and Computer Science</I>
  Key:
  • ? : wildcard
  • U : state to extract URL; _U : state to skip over tokens until we reach N
  • N : state to extract Name; _N : state to skip over tokens until we reach A
  • s<X,Y> : separator rule for the separator of states X and Y
  • etc.
  [FST diagram in the original slide; its states include b, U, _U, N, _N and its transitions are labeled input / output, e.g. s<b,U> / "U=" + next_token, s<b,N> / "N=" + next_token, s<U,N> / "N=" + next_token, s<U,U> / ε, s<N,N> / ε, ? / next_token, ? / ε]
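To make the transducer concrete, here is a minimal hand-built sketch in the spirit of the diagram: separator tests decide state changes, and wildcard transitions either emit or skip tokens. The separator tests, the "field=value" output format, and the example URL are assumptions made for this sketch, not taken from the slide.

```python
import re

# Hypothetical input; the slide elides the actual HREF value.
FACULTY_HTML = (
    '<LI> <A HREF="http://example.edu/chandy"> Mani Chandy </A>, '
    "<I>Professor of Computer Science</I>"
)

def run_fst(tokens):
    state, output = "b", []
    for tok in tokens:
        if state == "b" and tok.startswith("<A HREF="):   # separator into the URL field
            url = re.search(r'HREF="([^"]*)"', tok).group(1)
            output.append("U=" + url)                      # "U=" + next_token
            state = "N"                                     # following tokens belong to Name
        elif state == "N":
            if tok == "</A>":                               # closing tag ends the Name field
                state = "_N"                                # dummy state: skip the rest
            else:
                output.append("N=" + tok)                   # ? / next_token inside Name
        else:
            pass                                            # ? / ε : skip tokens outside fields
    return output

tokens = re.findall(r"<[^>]+>|[^\s<]+", FACULTY_HTML)
print(run_fst(tokens))   # ['U=http://example.edu/chandy', 'N=Mani', 'N=Chandy']
```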

  6. Induction Algorithm
  • Calculate the permutation of extraction fields (e.g. U, U, N, N, M) and add transitions to the graph if necessary
  • Not every permutation has to appear in the training data!! (Somewhat incorrect info in [eikvil99])
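The permutation point can be illustrated with a small sketch: collect the field orderings seen in labeled tuples and add a transition for every consecutive field pair, so an ordering can be accepted even if no single training tuple exhibited it as a whole. The "b"/"end" markers and function names here are illustrative assumptions.

```python
# Build the transition skeleton from observed field orderings.
def build_transitions(labeled_tuples):
    transitions = set()
    for fields in labeled_tuples:          # fields = ordering of labeled attributes in one tuple
        path = ("b", *fields, "end")       # assumed start/end markers
        transitions.update(zip(path, path[1:]))
    return transitions

edges = build_transitions([("U", "N", "M"), ("N", "U", "M")])
print(sorted(edges))
# The ordering U, M (path b -> U -> M -> end) is covered even though no training
# tuple had exactly that ordering: each of its individual transitions was observed.
```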

  7. Induction Algorithm
  For each extraction field in the training set:
  • Generate the left and right separators, and add them to the corresponding contextual-rule lists
  • Align tokens into columns
    • Heuristic: align word tokens together, with non-word tokens starting closest to the separator boundary
  • Attempt generalization with other rules:
    • Replace related tokens with their least common ancestor in the taxonomy tree
    • Generalize whitespace tokens
  • Remove any duplicate rules
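A minimal sketch of the least-common-ancestor step, assuming a small token-class taxonomy; the tree and class names below are illustrative, not the paper's actual hierarchy. Two rules that differ at a position are merged by replacing the differing tokens with the most specific class covering both.

```python
# Illustrative token-class taxonomy: child -> parent. Not the paper's actual tree.
PARENT = {
    "CAlph": "Word", "C1Alph": "Word", "0Alph": "Word",
    "Word": "Token", "Num": "Token", "Punc": "Token", "Html": "Token",
}

def ancestors(cls):
    chain = [cls]
    while cls in PARENT:
        cls = PARENT[cls]
        chain.append(cls)
    return chain

def least_common_ancestor(a, b):
    """Most specific class covering both token classes."""
    a_chain = ancestors(a)
    return next(c for c in ancestors(b) if c in a_chain)

def generalize(rule_a, rule_b):
    """Merge two contextual rules position by position."""
    return [x if x == y else least_common_ancestor(x, y) for x, y in zip(rule_a, rule_b)]

print(generalize(["Html", "C1Alph"], ["Html", "CAlph"]))   # ['Html', 'Word']
```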

  8. Comparison with the DEG Method
  • Similarities:
    • FSA ≈ regular expressions (equivalent in expressive power)
  • Differences:
    • Focuses on separators between extraction fields, whereas DEG focuses on patterns of the field itself
    • Designed to generate wrappers (for a specific website) rather than general-purpose extraction rules

  9. Results
  Notes:
  • Training tuples used = the number of tuples labeled by the user needed to cover the total tuples in the training pages
  • Recall after 10 pages: 60/69 = 87%
  • Precision = …100%?
  • What does "Total unseen tuples covered" mean…?
  • No comparison with other algorithms

  10. Conclusions
  • SoftMealy uses an FST to construct wrappers for structured and semistructured text.
  • The FST structure is based on contextual rules that describe what separates each extraction field.
  • The rules are learned from training documents that the user marks up interactively.
  • This could be an interesting approach, but a more complete analysis is needed.
