Schema-Guided Wrapper Maintenance for Web-Data Extraction

Xiaofeng Meng, Dongdong Hu (Renmin University of China, Beijing, China); Chen Li (University of California, Irvine, CA, USA). Wrappers for Web sources extract information from Web pages and are used in many Web-based applications.


Presentation Transcript


  1. Schema-Guided Wrapper Maintenance for Web-Data Extraction Xiaofeng Meng, Dongdong Hu Renmin University of China, Beijing, China Chen Li University of California, Irvine, CA, USA

  2. Wrappers for Web Sources • Extract information from Web pages • Used in many Web-based applications [Figure: wrappers sit between data sources (HTML documents, XML, RDBMS, programs) and an application, e.g., data integration]

  3. Problem • The Web is very dynamic: both contents and page structures change • Original wrappers can stop working because they rely on Web page structures • Re-generating wrappers is not easy: it places a heavy workload on system developers [Figure: run on changed documents, the original wrapper extracts nothing, returns incomplete results, or returns incorrect results]

  4. Example The original wrapper fails due to the structure change.

  5. Problems • Wrapper verification: is a wrapper operating correctly? • Several studies have addressed the verification problem, e.g., computing the similarity between a wrapper’s expected and observed output (“regression test”) • Wrapper maintenance: how to automatically modify a wrapper when the pages have changed? → Focus of this work

  6. Outline • Motivation • System Overview (current section) • Schema-Guided Wrapper Maintenance • Experiments • Related Work and Conclusion

  7. The SG-WRAM System [Architecture figure; components: Documents, Changed Documents, Wrapper Generator, Wrapper Executor, Wrapper Repository (Schema, Rule), XML output, and a Wrapper Maintainer with four modules: Data Feature Discovery, Data Item Recovery, Block Configuration, Rule Re-induction]

  8. User-Defined Schema • The user provides a schema (a DTD) for the target data: <!ELEMENT VideoList (Video+)> <!ELEMENT Video (Name, Director, Actors, Price)> <!ELEMENT Name (#PCDATA)> <!ELEMENT Director (#PCDATA)> <!ELEMENT Actors (#PCDATA)> <!ELEMENT Price (VHSPrice, DVDPrice)> <!ELEMENT VHSPrice (#PCDATA)> <!ELEMENT DVDPrice (#PCDATA)>

  9. Schema-Guided Wrapper Generation • Using a GUI toolkit, users map data items in HTML pages to elements in the DTD [Figure: HTML page alongside the DTD tree]

  10. Schema-Guided Wrapper Generation • Internally, the system computes the mappings from the corresponding HTML tree to the DTD tree • It then generates the extraction rule [Figure: HTML tree mapped to the DTD tree]

  11. Expressing Extraction Rules in XQuery • Each rule is an FLWR XQuery expression • The FOR clause holds the path to the data items; the LET clause holds the value of a data item • Example: FOR $vedio IN $vedioList/body/div[0]/table[4]/tr[0]/td[2]/table/tr[0]/td[1] RETURN <vedio> { LET $name = $vedio/span[0]/b[0]/a[0]/text()[0] RETURN <name> $name </name> } </vedio>

  12. Annotations for data items • Describe the semantic meaning of a data item • Indicate the location of the data item • Specified by the user using the GUI • Recorded in the XPath function “contains(pathToAnnotation, annotationValue)” • Example: /body/div[0]/table[4]/tr[0]/td[2]/table[1]/tr[0]/td[1]/text()[0][contains(null,"directed by")]

  13. Outline • Motivation • System Overview • Wrapper Maintenance (current section; four steps): • Data-Feature Discovery • Item Recovery • Block Configuration • Rule Re-induction • Experiments • Related Work and Conclusion

  14. Intuition of the approach • The page structure could change • Observation: many “features” of data items are more static, e.g.: • Hyperlink • Annotation • Pattern • These features can help us find the new places of the old data items

  15. Step 1: Data-feature discovery • Compute features of the data items in the original page

  16. Data-Pattern Feature • A syntactic feature • Represented as a regular expression • E.g., $15.38 → [$][0-9]{0,}[0-9](.)[0-9]{2} • Can be extracted using existing technologies, e.g., [Brin98], [GHQR98], [LM00]
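
As a small illustration of how such a data pattern might be checked against a candidate string, the sketch below applies the price pattern from this slide with Python's `re` module. The helper name `matches_pattern` is hypothetical and not part of SG-WRAM; requiring the whole string to match is also an assumption of this sketch.

```python
import re

# Price pattern copied from the slide, e.g. for "$15.38".
PRICE_PATTERN = re.compile(r"[$][0-9]{0,}[0-9](.)[0-9]{2}")


def matches_pattern(pattern: re.Pattern, text: str) -> bool:
    """Return True when the whole (trimmed) string fits the data pattern."""
    return pattern.fullmatch(text.strip()) is not None


print(matches_pattern(PRICE_PATTERN, "$15.38"))  # True
print(matches_pattern(PRICE_PATTERN, "15.38"))   # False: no leading "$"
print(matches_pattern(PRICE_PATTERN, "$15x38"))  # True: "(.)" matches any character
```

The last call shows that the unescaped `(.)` in the slide's pattern accepts any separator character, so the pattern is slightly more permissive than a strict decimal-price check.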

  17. Annotations and Hyperlinks • Get annotation and hyperlink information from the original page by checking the XQuery-based extraction rule • Hyperlink: a step of “…/a/…” in the path • Annotation: the function “contains()”, whose arguments give the path from the data item to the annotation and the annotation value • Example with a hyperlink indication: { LET $name = $vedio/span[0]/b[0]/a[0]/text()[0] RETURN <name> $name </name> } • Example with an annotation: { LET $actors = $vedio/text()[contains( /preceding-sibling::b[0] ,"Featuring")] RETURN <actors> $actors </actors> }

  18. Step 2: Data-Item Recovery • Traverse the new HTML tree following the depth-first traversal order • Use the old features to identify potential data items using 3 matching conditions: • Hyperlink • Annotation • Data pattern
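
The recovery step above can be sketched roughly as follows. This is a simplified illustration, not the SG-WRAM implementation: it assumes a well-formed page parsed with `xml.etree.ElementTree`, records paths without positional indices, and treats the annotation as the text node immediately preceding the value. The `Feature` and `Mapping` classes, the sample page, and the actor names are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class Feature:
    schema_path: str                  # DTD element the feature belongs to
    pattern: str                      # data-pattern regular expression
    hyperlink: bool = False           # was the old item inside an <a> element?
    annotation: Optional[str] = None  # label expected just before the value


@dataclass
class Mapping:
    value: str
    html_path: str
    schema_path: str


def text_nodes(elem: ET.Element, path: str = "",
               inside_link: bool = False) -> Iterator[Tuple[str, str, bool]]:
    """Yield (text, path, inside_link) in depth-first document order."""
    here = f"{path}/{elem.tag}"
    linked = inside_link or elem.tag == "a"
    if elem.text and elem.text.strip():
        yield elem.text.strip(), here, linked
    for child in elem:
        yield from text_nodes(child, here, linked)
        if child.tail and child.tail.strip():
            # Tail text sits after the child but belongs to this element.
            yield child.tail.strip(), here, linked


def recover(root: ET.Element, features: List[Feature]) -> List[Mapping]:
    mappings: List[Mapping] = []
    previous = ""  # most recent text node, used as the candidate annotation
    for text, path, linked in text_nodes(root):
        for f in features:
            if not re.fullmatch(f.pattern, text):
                continue  # data-pattern condition failed
            if f.hyperlink and not linked:
                continue  # hyperlink condition failed
            if f.annotation and f.annotation not in previous:
                continue  # annotation condition failed
            mappings.append(Mapping(text, path, f.schema_path))
            break
        previous = text
    return mappings


PAGE = """<div><span><b><a href="/v/1">May</a></b></span>
<b>Featuring</b> Actor One, Actor Two
<i>DVD</i> $15.38</div>"""

FEATURES = [
    Feature("VideoList/Video/Name", r"[A-Z][a-z]{0,}", hyperlink=True),
    Feature("VideoList/Video/Actors", r".+", annotation="Featuring"),
    Feature("VideoList/Video/Price/DVDPrice",
            r"[$][0-9]{0,}[0-9](.)[0-9]{2}", annotation="DVD"),
]

for m in recover(ET.fromstring(PAGE), FEATURES):
    print(m.schema_path, "<-", repr(m.value), "at", m.html_path)
```

Note that the label "Featuring" itself matches the Name pattern but is rejected because it is not inside a hyperlink, which is the kind of disambiguation the combined matching conditions are meant to provide.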

  19. Example [Flowchart with two recognition routes: (a) check hyperlink → check data pattern, e.g., [A-Z][a-z]{0,} → recognize a data item; (b) find annotation → find value starting from annotation → check data pattern, e.g., [$][0-9]{0,}[0-9](.)[0-9]{2} → recognize a data item]

  20. Results of Data Item Recovery • A mapping list including all the recognized data items • Each mapping contains: • Value of the data item • Path to it in the HTML tree • Path of the corresponding DTD element • A sample mapping: M1’ (D: “May”, HP: …/table[0]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0], SP: VideoList/Video/Name)

  21. Step 3: Block Configuration • Observation: data items are located in semantic blocks • Data items are grouped in semantic blocks, and each block conforms to the user-defined schema • A candidate block can be a partial match, a full match, or an over-match

  22. Computing “Full Match” Blocks • Identify the block level in a top-down manner • Check each level by recursively considering the matches between candidate blocks and the schema [Figure: “full match” blocks]

  23. Results of Block Configuration • A set of blocks that can fully match the DTD • Each of them is represented as a list of mappings [Figure: example blocks]
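
One way to picture the top-down search for full-match blocks is sketched below. The slides do not define the three match kinds precisely, so this sketch assumes: a block is a full match when every leaf element of the schema is recognized exactly once, an over-match when some element occurs more than once (the block spans several records), and a partial match otherwise. The `Node` class and function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List

# Leaf elements of the Video schema from the user-defined DTD.
REQUIRED = ["Name", "Director", "Actors", "VHSPrice", "DVDPrice"]


@dataclass
class Node:
    """A node of the new HTML tree, annotated with recovered data items."""
    items: List[str] = field(default_factory=list)  # schema elements found directly here
    children: List["Node"] = field(default_factory=list)

    def all_items(self) -> List[str]:
        found = list(self.items)
        for child in self.children:
            found.extend(child.all_items())
        return found


def classify(items: List[str], required: List[str] = REQUIRED) -> str:
    counts = Counter(items)
    if any(counts[name] > 1 for name in required):
        return "over"      # more than one record inside this candidate
    if all(counts[name] == 1 for name in required):
        return "full"      # exactly one complete record
    return "partial"       # at least one schema element is missing


def full_match_blocks(node: Node, required: List[str] = REQUIRED) -> List[Node]:
    """Descend from the root until candidates stop over-matching."""
    kind = classify(node.all_items(), required)
    if kind == "full":
        return [node]
    if kind == "over":
        blocks: List[Node] = []
        for child in node.children:
            blocks.extend(full_match_blocks(child, required))
        return blocks
    return []


def video_row() -> Node:
    return Node(items=list(REQUIRED))


# A table holding two complete video rows and one stray row with only a name.
table = Node(children=[video_row(), video_row(), Node(items=["Name"])])
print(classify(table.all_items()))    # over
print(len(full_match_blocks(table)))  # 2
```

Stopping at partial matches keeps the sketch short; a fuller treatment would also need to decide what to do with records whose items were only partly recovered.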

  24. Step 4: Rule Re-Induction • Semantic blocks contain mappings from data items in HTML to DTD elements • Induce a new extraction rule by calling the induction algorithm in the wrapper generator • Refine the rule so that it covers all other semantic blocks • Generalization is necessary
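
The slides do not spell out how the rule is generalized. As one plausible illustration only, the sketch below generalizes the HTML paths of the same data item across several semantic blocks by dropping a positional index wherever the blocks disagree. The function `generalize` and the sample paths are hypothetical, and the sketch assumes all paths have the same depth and the same tags.

```python
import re
from typing import List

# One path step: a tag (or text()) with an optional positional index.
STEP = re.compile(r"^(?P<tag>[^\[\]]+)(\[(?P<index>\d+)\])?$")


def generalize(paths: List[str]) -> str:
    """Merge paths step by step, keeping an index only if all paths agree on it."""
    split = [p.strip("/").split("/") for p in paths]
    if len({len(steps) for steps in split}) != 1:
        raise ValueError("this sketch only handles paths of equal depth")
    result = []
    for steps in zip(*split):
        parsed = [STEP.match(step) for step in steps]
        if any(m is None for m in parsed):
            raise ValueError(f"unsupported path step in {steps}")
        tags = {m.group("tag") for m in parsed}
        if len(tags) != 1:
            raise ValueError(f"tags differ at the same depth: {sorted(tags)}")
        tag = tags.pop()
        indexes = {m.group("index") for m in parsed}
        if len(indexes) == 1 and None not in indexes:
            result.append(f"{tag}[{indexes.pop()}]")  # all blocks agree
        else:
            result.append(tag)                        # blocks disagree: drop index
    return "/" + "/".join(result)


block_paths = [
    "/body/table[4]/tr[0]/td[1]/a[0]/text()[0]",
    "/body/table[4]/tr[1]/td[1]/a[0]/text()[0]",
]
print(generalize(block_paths))  # /body/table[4]/tr/td[1]/a[0]/text()[0]
```

Here only the row index differs between the two blocks, so the generalized path iterates over all rows while keeping the remaining positions fixed.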

  25. Outline • Motivation • System Overview • Wrapper Maintenance (four steps): • Data-Feature Discovery • Item Recovery • Block Configuration • Rule Re-induction • Experiments (current section) • Related Work and Conclusion

  26. Web Sources • From October 2002 to May 2003 • Collected Web page changes • From 16 data-intensive sites • Using the site search engine or from the same URL • All the pages have complex table structures • Observed changes: • Data items (add, delete, modify) • Table structure → non-table structure • Complex table structure re-arrangement

  27. Experiment Procedures [Figure: three-step pipeline; labels: Original Web Docs, Wrapper Generator, Original Wrappers, Wrapper Repository (step 1); New Web Docs, Check Extraction Results (step 2); Changed pages, Wrapper Maintainer, Repaired Wrappers (step 3)]

  28. Experiment Metrics • Recall (R): proportion of correctly extracted data items among all the data items that should be extracted • Precision (P): proportion of correctly extracted data items among all the data items that have been extracted
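
For concreteness, the two metrics can be computed as in the sketch below. The function name and the sample items are hypothetical, and representing each data item as an (element, value) pair is an assumption of this sketch rather than something stated in the slides.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall(extracted: Iterable[Hashable],
                     expected: Iterable[Hashable]) -> Tuple[float, float]:
    """Return (precision, recall) for one wrapper run."""
    extracted_set, expected_set = set(extracted), set(expected)
    correct = len(extracted_set & expected_set)
    # Guard against empty sets, e.g. a broken wrapper that extracts nothing.
    precision = correct / len(extracted_set) if extracted_set else 0.0
    recall = correct / len(expected_set) if expected_set else 0.0
    return precision, recall


# Hypothetical items: four should be extracted, three were, two of them correctly.
expected = [("Name", "May"), ("Director", "D1"), ("VHSPrice", "$9.99"), ("DVDPrice", "$15.38")]
extracted = [("Name", "May"), ("DVDPrice", "$15.38"), ("Director", "wrong value")]

p, r = precision_recall(extracted, expected)
print(f"P = {p:.2f}, R = {r:.2f}")  # P = 0.67, R = 0.50
```

A wrapper that extracts nothing after a page change scores zero on both metrics under this convention.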

  29. Original wrappers after changes

  30. New wrappers (after item recovery)

  31. New Wrappers (final)

  32. Related Work on Wrapper Maintenance • [Kushmerick 99] • Using simple numeric features of the extracted strings • [Lerman K., Minton S. 00] • Using the starting and ending strings as the description of the data fields • [Chidlovskii B. 01] • Syntactic features of data items to be extracted, and semantic features: URL, time strings, entities…

  33. Comparison • These approaches rely heavily on the syntactic features of the data items and may not recognize data items precisely.

  34. Conclusion • SG-WRAM: a wrapper-maintenance system • Intuition: use features that are more stable • Pattern • Hyperlink • Annotation • Four steps of the approach: • Data-Feature Discovery • Item Recovery • Block Configuration • Rule Re-induction • Experiments showed that it is effective

  35. Thank you!
