
Wrapper Construction


Presentation Transcript


  1. Wrapper Construction Charis Ermopoulos, Qian Yang, Yong Yang, Hengzhi Zhong

  2. Background • HUGE amount of information on the web • It cannot be easily accessed and manipulated • Information is intended to be browsed by humans, not computers • Information Extraction is difficult • A wrapper is a procedure for extracting a particular resource’s content • Hand-coding wrappers is tedious, and the result is difficult to maintain

  3. Virtual Integration Architecture • User queries are posed against a mediated schema • The mediator (reformulation engine, optimizer, data source catalog, execution engine) sends each query through a wrapper to the underlying data sources • Sources can be relational, hierarchical (IMS), structured files, or web sites
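To make the role of the wrapper layer concrete, here is a minimal sketch in Python; all class and method names are hypothetical and only illustrate where wrappers sit between the mediator and the sources.

```python
# A minimal sketch (all names hypothetical) of where wrappers sit in a
# virtual integration architecture: the mediator answers queries over the
# mediated schema by pushing them through one wrapper per data source.
from abc import ABC, abstractmethod

class Wrapper(ABC):
    """Translates mediated-schema queries for one underlying source."""
    @abstractmethod
    def extract(self, query: dict) -> list[dict]:
        ...

class WebSiteWrapper(Wrapper):
    def __init__(self, url_template: str):
        self.url_template = url_template

    def extract(self, query: dict) -> list[dict]:
        # Fetch the page for this query and apply extraction rules;
        # how those rules are obtained is the topic of the rest of the talk.
        return []

class Mediator:
    def __init__(self, wrappers: list[Wrapper]):
        self.wrappers = wrappers

    def answer(self, query: dict) -> list[dict]:
        # Reformulation and optimization would happen here; then the query
        # is executed against every source and the results are merged.
        return [t for w in self.wrappers for t in w.extract(query)]
```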

  4. Wrapper Construction • Two major approaches: • Machine learning: typically requires some hand-labeled data • Data-intensive, completely automatic approaches

  5. Overview • RoadRunner: Towards Automatic Data Extraction from Large Web Sites (Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo) • Wrapper Induction for Information Extraction (Nicholas Kushmerick) • Web Data Extraction Based on Partial Tree Alignment

  6. RoadRunner • NO user interaction during the generation process • NO a priori knowledge about page organization • Compares 2 HTML pages at a time and uses mismatches to identify structures

  7. RoadRunner • Basic Idea: • Input: a set of data-intensive, regularly structured HTML pages • Generates a wrapper efficiently and automatically by inferring a Union-Free Regular Expression (UFRE) grammar for the HTML code

  8. Input HTML Page • Data are stored in a DBMS • HTML pages are produced by scripts • PHP, Perl, etc. • Pages that are generated by the same script have similar structure
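As a sketch of that last point, the following Python snippet plays the role of such a script: a fixed template with database values plugged in. The template and book titles are made up for illustration and are not taken from the paper.

```python
# Sketch of why pages from the same script share structure: the script is a
# fixed HTML template and only the database values change. The template and
# book titles here are hypothetical.
PAGE_TEMPLATE = ("<HTML><BODY><B>{author}</B>"
                 "<UL>{items}</UL></BODY></HTML>")
ITEM_TEMPLATE = "<LI><I>Title:</I>{title}</LI>"

def render_author_page(author: str, titles: list[str]) -> str:
    items = "".join(ITEM_TEMPLATE.format(title=t) for t in titles)
    return PAGE_TEMPLATE.format(author=author, items=items)

page_w = render_author_page("Paul Jones", ["DB Primer", "XML at Work"])
page_s = render_author_page("John Smith", ["HTML Scripting", "Javascript"])
# page_w and page_s differ only where database values were substituted,
# which is exactly the regularity RoadRunner's matching exploits.
```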

  9. Input HTML Page www.csbooks.com/author?Paul+Jones

  10. Input HTML Page www.csbooks.com/author?John+Smith

  11. Input HTML Page • Two input HTML pages: • One is used as the wrapper, w • The other is used as the sample, s • Generalize w by matching it against s

  12. Union Free Regular Expressions • Union-Free Regular Expressions (UFREs) are defined over an alphabet of symbols (tokens); every element of the alphabet is itself an expression • Operators: • ? matches 0 or 1 occurrence • * zero or more occurrences • + one or more occurrences • a|b matches a or b • Disjunction (a|b) is not handled: the expressions are union-free

  13. Union Free Regular Expressions • Close correspondence between nested types and UFREs • Straightforward mapping from UFREs to nested types • #PCDATA → string fields • + → lists • ? → nullable fields (optional fields)

  14. Union Free Regular Expression • (A, B, C, …)+ → non-empty list of tuples (A: string, B: string, C: string, …) • (A, B, C, …)* → possibly empty list
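As an illustration of this mapping, here is a hedged sketch that writes the UFRE for author pages like the ones above as a Python regular expression and extracts the corresponding nested type (author: string, books: non-empty list of strings). RoadRunner itself infers the grammar over token lists rather than emitting regexes, and the page layout is assumed for the example.

```python
import re

# Hedged sketch: a UFRE for the author pages, written as a Python regex over
# raw HTML purely for illustration.
# '?' -> optional field, '+' -> non-empty list, captured text -> #PCDATA.
UFRE = re.compile(
    r"<HTML><BODY>"
    r"<B>(?P<author>.*?)</B>"
    r"<UL>(?P<books>(?:<LI><I>Title:</I>.*?</LI>)+)</UL>"
    r"</BODY></HTML>",
    re.S,
)
BOOK = re.compile(r"<LI><I>Title:</I>(.*?)</LI>", re.S)

def extract(page: str) -> dict:
    m = UFRE.search(page)
    # Corresponding nested type: (author: string, books: list of strings)
    return {"author": m.group("author"),
            "books": BOOK.findall(m.group("books"))}

sample = ("<HTML><BODY><B>Paul Jones</B>"
          "<UL><LI><I>Title:</I>DB Primer</LI>"
          "<LI><I>Title:</I>XML at Work</LI></UL></BODY></HTML>")
print(extract(sample))
# {'author': 'Paul Jones', 'books': ['DB Primer', 'XML at Work']}
```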

  15. Union Free Regular Expression

  16. Drawbacks • Limited expressive power • Non-regular languages cannot be expressed • Example: a list of folders nested within folders to arbitrary depth, a (folders)^n structure, is non-regular

  17. Drawbacks • Limited expressive power • Regular languages that require unions • Example: reviews on amazon.com • RoadRunner is not able to factorize such a list, i.e. it cannot discover the repeated pattern

  18. Extraction Process • Find the minimal UFRE • Iteratively find the least upper bounds on the RE lattice to generate a wrapper for the input HTML pages • Computing the least upper bound of 2 UFREs → the matching algorithm

  19. The Matching Technique • The matching algorithm consists of parsing the sample using the wrapper. • A mismatch happens when some token in the sample does not comply with the grammar specified by the wrapper.

  20. Two types of Mismatches • String mismatches: mismatches that happen when different strings occur in corresponding positions of the wrapper and sample • Tag mismatches: mismatches between different tags on the wrapper and the sample, or between one tag and one string.

  21. String Mismatches • If the two pages belong to the same class, string mismatches may be due only to different values of a database field. • These mismatches are used to discover fields (i.e., #PCDATA).
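A simplified sketch of this rule follows, assuming both pages have already been tokenized into equal-length lists; a real matcher also has to cope with length differences, which is what the tag-mismatch rules on the next slides are for.

```python
import re

# Simplified sketch of string-mismatch handling: when the wrapper and the
# sample disagree on a text token at corresponding positions, that position
# is generalized to a #PCDATA field. (Assumes equal-length token lists.)
def tokenize(html: str) -> list[str]:
    return re.findall(r"<[^>]+>|[^<]+", html)

def generalize_strings(wrapper: list[str], sample: list[str]) -> list[str]:
    out = []
    for w_tok, s_tok in zip(wrapper, sample):
        if w_tok == s_tok:
            out.append(w_tok)
        elif not w_tok.startswith("<") and not s_tok.startswith("<"):
            out.append("#PCDATA")      # string mismatch -> database field
        else:
            raise ValueError("tag mismatch: handled by iterators/optionals")
    return out

print(generalize_strings(tokenize("<B>Paul Jones</B>"),
                         tokenize("<B>John Smith</B>")))
# ['<B>', '#PCDATA', '</B>']
```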

  22. Tag mismatches • Tag mismatches are used to discover iterators (list of items) and optionals (items appearing conditionally). • First look for repeated patterns (i.e., patterns under an iterator), and then, if this attempt fails, try to identify an optional pattern.

  23. Tag mismatches: Optionals • Once the optional pattern has been identified, we may generalize the wrapper accordingly and then resume the parsing. • In this case, the wrapper is generalized by introducing one pattern of the form (<IMG src=.../>)?

  24. Tag mismatches: Iterators • Assume the mismatch is caused by repeated elements in a list. • Match candidate squares (possible list items) against earlier squares. • Generalize the wrapper by finding all contiguous repeated occurrences • (<li><i>Title:</i>#PCDATA</li>)+
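The sketch below illustrates the square-matching idea on a token list, under the simplifying assumption that a square is a fixed-length block ending at the mismatch point; RoadRunner's actual search over candidate squares and its backtracking are more involved.

```python
# Simplified sketch of iterator discovery: take a candidate "square" (one
# possible list item) ending at the mismatch point and check whether it
# repeats contiguously just before itself; if so, the repeated occurrences
# can be collapsed under a '+'.
def find_iterator(tokens: list[str], end: int, max_len: int = 10):
    """Return (start, square) if tokens[:end] ends with >= 2 copies of a square."""
    for length in range(1, max_len + 1):
        square = tokens[end - length:end]
        prev = tokens[end - 2 * length:end - length]
        if len(prev) == length and prev == square:
            start = end - 2 * length
            # Extend backwards over every contiguous occurrence of the square.
            while start - length >= 0 and tokens[start - length:start] == square:
                start -= length
            return start, square
    return None

toks = ["<UL>",
        "<LI>", "<I>", "Title:", "</I>", "#PCDATA", "</LI>",
        "<LI>", "<I>", "Title:", "</I>", "#PCDATA", "</LI>",
        "</UL>"]
print(find_iterator(toks, end=13))
# (1, ['<LI>', '<I>', 'Title:', '</I>', '#PCDATA', '</LI>'])
```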

  25. Extracted Result

  26. Recursive Example

  27. Matching as an AND-OR Tree

  28. Complexity • Exponential time complexity with respect to the input length (number of tokens in the pages) • Lowering the complexity: • Bounds on the fan-out of OR nodes • Limited backtracking • Delimiters

  29. Limitations • Assumptions: • Pages are well-structured • The structure can be modeled by a union-free regular expression • The search space for explaining mismatches is huge • Pruning the search space also prunes away possible wrappers

  30. Experimental Result

  31. Comparison with Other Works

  32. Overview • RoadRunner: Towards Automatic Data Extraction from Large Web Sites (Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo) • Wrapper Induction for Information Extraction (Nicholas Kushmerick) • Web Data Extraction Based on Partial Tree Alignment

  33. Wrapper Induction for Information Extraction Nicholas Kushmerick’s Dissertation

  34. Wrapper Induction • Wrapper Induction: automatic construction of wrappers • Automatic programming in general is very difficult • Particular classes of programs are feasible • Technique: inductive learning • Attributes are identified by delimiters • Input: a set of examples {…, <Pn, Ln>, …} • Output: a wrapper w ∈ W such that w(Pn) = Ln for every <Pn, Ln>

  35. A Simple Wrapper: LR Wrapper • Uses left- and right-hand delimiters for each attribute, e.g. l1 = <B>, r1 = </B>, l2 = <I>, r2 = </I> • Delimiter candidates: prefixes and suffixes of the text surrounding the attribute instances • Candidate validity: 4 constraints (e.g. not a substring of any attribute value) • Search policy: for each delimiter, start with the shortest prefix & suffix and stop when a valid candidate is found • Candidates for l1 = {‘</I><BR><B>’, ‘/I><BR><B>’, ‘I><BR><B>’, …, ’B>’, ‘>’} • Example page: <HTML><BODY> <B>Congo</B> <I>242</I><BR> <B>Spain</B> <I>34</I> </BODY></HTML> • Extracted tuples: (Congo, 242), (Spain, 34)
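Below is a sketch of how an LR wrapper with these delimiters executes against the slide's country/code page; learning the wrapper then amounts to picking, for each l_k and r_k, the shortest candidate that stays valid on all labeled examples.

```python
# Sketch of executing an LR wrapper on the country/code page from the slide:
# attribute k is extracted by scanning forward for l_k, then reading up to r_k.
PAGE = ("<HTML><BODY>"
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Spain</B> <I>34</I>"
        "</BODY></HTML>")

def lr_execute(page: str, delims: list[tuple[str, str]]) -> list[tuple]:
    """delims = [(l1, r1), (l2, r2), ...]; returns the extracted tuples."""
    tuples, pos = [], 0
    while True:
        row = []
        for left, right in delims:
            start = page.find(left, pos)
            if start == -1:
                return tuples          # no more tuples on the page
            start += len(left)
            end = page.find(right, start)
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))

print(lr_execute(PAGE, [("<B>", "</B>"), ("<I>", "</I>")]))
# [('Congo', '242'), ('Spain', '34')]
```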

  36. Beyond LR • HLRT: uses head and tail delimiters to locate the interesting area (2K+2 delimiters) • OCLR: uses open and close delimiters to mark each tuple (2K+2 delimiters) • HOCLRT: combines HLRT and OCLR • Nested documents: N-LR, N-HLRT • Example document (nested structure, missing attributes): Name: John, address: 12 Main St, Phone: 123-4567, phone: 444-5555; Name: Fred, address: 9 Maple Lane, Phone: 666-7777; Name: Jane

  37. Expressiveness (1) • 30 resources in total, 10 sample pages each

  38. Expressiveness (2)

  39. Generating Examples • One simple way: ask a person • Automatic labeling: • Using domain-specific heuristics • Primitive recognizers: regular expressions (e.g. 1?[0-9]:[0-9][0-9] for times) • NLP • Asking an already-wrapped resource • Why is wrapper induction still needed? • Performance • It tolerates a high rate of noise
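The sketch below shows domain-specific heuristic labeling with the slide's time-of-day regular expression; the sample page text is invented for the example.

```python
import re

# Sketch of heuristic auto-labeling: a domain-specific recognizer (the
# slide's time-of-day regular expression) finds attribute occurrences in a
# page, and those spans can serve as labels for wrapper induction.
TIME = re.compile(r"\b1?[0-9]:[0-9][0-9]\b")

def label_times(page: str) -> list[str]:
    """Return the substrings the recognizer believes are times."""
    return [m.group() for m in TIME.finditer(page)]

print(label_times("<TD>Departs 9:45</TD><TD>Arrives 12:30</TD>"))
# ['9:45', '12:30']
```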

  40. How many examples are enough? • For each wrapper class: how many examples N are needed to ensure, with high probability (p1), that the learned wrapper makes a mistake only rarely (p2)? • Example: K = 4 attributes per tuple, shortest example page of length R = 10,000 • p1 > 0.95 and p2 < 0.05 if N > 1534

  41. Summary • Advantages • Fast to learn and extract • Drawbacks • Cannot handle permutations • Cannot handle missing items

  42. Overview • RoadRunner: Towards Automatic Data Extraction from Large Web Sites (Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo) • Wrapper Induction for Information Extraction (Nicholas Kushmerick) • Web Data Extraction Based on Partial Tree Alignment

  43. Extracting data records from the web • Wrapper induction: a set of manually labeled positive and negative examples; supervised learning of data extraction rules • Automatic extraction: pattern discovery based on heuristic rules (repeating tags, ontology matching) • Partial tree alignment: automatic extraction with no assumption about contiguous data records; two steps: (1) identifying data records in a page, (2) aligning and extracting data items from the data records

  44. Identifying data records • Building an HTML tag tree from (1) the nested structure of HTML tags and (2) visual information

  45. Identifying data records • Mining data regions: compare the tag strings of nodes; similar sibling nodes are clustered into a generalized node • Identify data records from the generalized nodes
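A heavily simplified sketch of that clustering step follows: sibling subtrees whose tag strings are similar enough are grouped into one candidate data region. The real algorithm also tries combinations of adjacent siblings and uses visual cues; the threshold, similarity measure and tree encoding here are assumptions for illustration.

```python
from difflib import SequenceMatcher

# Simplified sketch of data-region mining: sibling subtrees whose tag strings
# are similar to the first sibling's are grouped into a generalized node.
def tag_string(node) -> str:
    """node = (tag, [children]); concatenate tags in depth-first order."""
    tag, children = node
    return tag + "".join(tag_string(c) for c in children)

def mine_region(siblings, threshold=0.9):
    region, base = [], tag_string(siblings[0])
    for node in siblings:
        sim = SequenceMatcher(None, base, tag_string(node)).ratio()
        if sim >= threshold:
            region.append(node)
    return region

rows = [("tr", [("td", []), ("td", [])]),
        ("tr", [("td", []), ("td", [])]),
        ("tr", [("td", []), ("td", []), ("td", [])])]
print(len(mine_region(rows)))   # the first two rows form the data region -> 2
```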

  46. Data extraction • Two steps: (1) build a rooted tag tree for each data record, (2) partial tree alignment

  47. Partial tree alignment • Tree operations: node removal, insertion and replacement • Tree edit distance: the cost associated with the minimum set of operations needed to transform tree A into tree B • The minimum-cost mapping between two trees is computed by dynamic programming

  48. Partial tree alignment • Simple tree matching (STM) • No node replacement and no level crossing are allowed
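A sketch of simple tree matching is given below: it counts the largest set of matching node pairs between two trees when matches must preserve labels, parent-child relationships and sibling order, with the child lists aligned by a classic dynamic program. The tuple-based tree encoding is an assumption for the example.

```python
# Sketch of Simple Tree Matching (STM): count the maximum number of matching
# node pairs between two trees when neither replacement nor level crossing is
# allowed; children are aligned with a DP over the two child lists.
def stm(a, b) -> int:
    """a, b = (label, [children]); returns the size of the best mapping."""
    if a[0] != b[0]:
        return 0
    ka, kb = a[1], b[1]
    m, n = len(ka), len(kb)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i][j - 1],
                          M[i - 1][j],
                          M[i - 1][j - 1] + stm(ka[i - 1], kb[j - 1]))
    return M[m][n] + 1     # +1 for the matched roots

t1 = ("tr", [("td", []), ("td", []), ("td", [])])
t2 = ("tr", [("td", []), ("td", [])])
print(stm(t1, t2))   # 3: the root pair plus two aligned <td> pairs
```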

  49. Partial tree alignment • Progressively grow a seed tree Ts • Ts is initialized as the tree with the maximum number of data fields • A node is inserted only if its insertion location in Ts can be determined • Mismatched nodes that are not inserted into Ts are reprocessed at later stages
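The sketch below flattens the idea to a single level (each record is just a list of field labels under a common root) to show the seed-growing policy: insert a new field only when its position between two already-matched neighbours is unambiguous, and defer the rest. Field names are invented, labels are assumed unique within a record, and the full algorithm aligns whole subtrees and re-runs deferred records once the seed has grown.

```python
# Heavily simplified, one-level sketch of partial tree alignment: the seed
# starts as the record with the most fields; a new label is inserted only
# when its slot in the seed is unambiguous, otherwise the record is deferred.
def align_into_seed(seed: list[str], record: list[str]) -> bool:
    matched = [seed.index(x) for x in record if x in seed]
    if matched != sorted(matched):
        return False                       # order conflict: skip this record
    inserted_all = True
    for k, label in enumerate(record):
        if label in seed:
            continue
        prev_ = next((x for x in reversed(record[:k]) if x in seed), None)
        next_ = next((x for x in record[k + 1:] if x in seed), None)
        if (prev_ is not None and next_ is not None
                and seed.index(next_) - seed.index(prev_) == 1):
            seed.insert(seed.index(next_), label)   # unique slot between them
        else:
            inserted_all = False                    # ambiguous: defer record
    return inserted_all

records = [["title", "author", "price", "isbn"],
           ["author", "year", "price"],
           ["title", "publisher", "author"]]
seed = max(records, key=len).copy()
pending = [r for r in records if not align_into_seed(seed, r)]
print(seed)      # ['title', 'publisher', 'author', 'year', 'price', 'isbn']
print(pending)   # [] (every insertion position was unambiguous here)
```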

  50. Experimental Results • Number of sites used: 49 • Total number of pages used: 72 (randomly collected) • Data records are extracted with high accuracy
