Schema-Guided Wrapper Maintenance for Web-Data Extraction

Xiaofeng Meng, Dongdong Hu (Renmin University of China, Beijing, China); Chen Li (University of California, Irvine, CA, USA)

Wrappers for Web Sources • Extract information from Web pages • Used in many Web-based applications [Figure: wrappers sit between the sources (HTML documents, XML, RDBMS, programs) and an application (e.g., data integration)]

Problem • The Web is very dynamic: both contents and page structures change • Original wrappers can stop working because they rely on Web page structures • Re-generating wrappers is not easy: it puts a heavy workload on system developers [Figure: on changed documents, the original wrapper extracts nothing, incomplete results, or incorrect results]

Example The original wrapper fails due to the structure change.

Problems • Wrapper verification: is a wrapper operating correctly? • Several studies have addressed the verification problem, e.g., computing the similarity between a wrapper’s expected and observed output (“regression test”) • Wrapper maintenance: how to automatically modify a wrapper when the pages have changed? → Focus of this work

Outline • Motivation • → System Overview • Schema-Guided Wrapper Maintenance • Experiments • Related Work and Conclusion

The SG-WRAM System [Figure: system architecture with the components Wrapper Generator, Wrapper Executor, and Wrapper Maintainer (Data Feature Discovery, Data Item Recovery, Block Configuration, Rule Re-induction), operating on Documents, Changed Documents, Schema, Rule, Wrapper, and the XML Repository]

User-Defined Schema • The user provides a schema (DTD) for the target data:
<!ELEMENT VideoList (Video+)>
<!ELEMENT Video (Name, Director, Actors, Price)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Director (#PCDATA)>
<!ELEMENT Actors (#PCDATA)>
<!ELEMENT Price (VHSPrice, DVDPrice)>
<!ELEMENT VHSPrice (#PCDATA)>
<!ELEMENT DVDPrice (#PCDATA)>

Schema-Guided Wrapper Generation • Using a GUI toolkit, users map data items in HTML pages to elements in the DTD [Figure: HTML page alongside the DTD tree]

Schema-Guided Wrapper Generation • Internally, the system computes the mappings from the corresponding HTML tree to the DTD tree • It then generates the extraction rule [Figure: HTML tree mapped to the DTD tree]

Expressing Extraction Rule in XQuery • Each rule is an FLWR XQuery expression • Callouts in the example: paths to the data items; value of the data item
Example:
FOR $vedio IN $vedioList/body/div[0]/table[4]/tr[0]/td[2]/table/tr[0]/td[1]
RETURN <vedio> {
  LET $name = $vedio/span[0]/b[0]/a[0]/text()[0]
  RETURN <name> $name </name>
} </vedio>

Annotations for data items • Describe the semantic meaning of a data item • Indicate the location of the data item • Specified by the user through the GUI • Recorded in the rule with the XPath function contains(pathToAnnotation, annotationValue), for example:
/body/div[0]/table[4]/tr[0]/td[2]/table[1]/tr[0]/td[1]/text()[0][contains(null,"directed by")]

Outline • Motivation • System Overview • → Wrapper Maintenance (four steps): • Data-Feature Discovery • Item Recovery • Block Configuration • Rule Re-induction • Experiments • Related Work and Conclusion

Intuition of the approach • The page structure can change • Observation: many “features” of data items are more stable than the structure, e.g.: • Hyperlink • Annotation • Pattern • These features can help locate the old data items at their new positions

Step 1: Data-feature discovery • Compute features of the data items in the original page

Data-Pattern Feature • A syntactic feature • Represented as a regular expression • E.g., $15.38 → [$][0-9]{0,}[0-9](.)[0-9]{2} • Can be extracted using existing technologies, e.g., [Brin98], [GHQR98], [LM00]
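As a rough illustration of how a data-pattern feature could be checked, the sketch below applies the regular expression from the slide with Python's `re` module. The helper name `matches_pattern` and the stricter variant `STRICT_PRICE` are assumptions made for this example and are not part of SG-WRAM.

```python
import re

# Pattern copied from the slide. Note that the unescaped "(.)" accepts
# any single character, not only a decimal point.
PRICE_PATTERN = re.compile(r"[$][0-9]{0,}[0-9](.)[0-9]{2}")

# Hypothetical stricter variant (an assumption, not from the slides):
# the decimal point is matched literally.
STRICT_PRICE = re.compile(r"[$][0-9]+[.][0-9]{2}")


def matches_pattern(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole string, ignoring outer whitespace, fits the pattern."""
    return pattern.fullmatch(text.strip()) is not None


print(matches_pattern(PRICE_PATTERN, "$15.38"))     # a price fits the slide's pattern
print(matches_pattern(PRICE_PATTERN, "Featuring"))  # ordinary text does not
print(matches_pattern(PRICE_PATTERN, "$1538"))      # also accepted, because "(.)" is loose
print(matches_pattern(STRICT_PRICE, "$1538"))       # rejected by the stricter variant
```

The last two calls show why a loosely induced pattern can over-accept, which is one reason the pattern is combined with the hyperlink and annotation features rather than used alone.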

Annotations and Hyperlinks • Get annotation and hyperlink information from the original page by checking the XQuery-based extraction rule • Hyperlink: a step of “…/a/…” in the path • Annotation: the function contains() • Callouts in the examples: hyperlink indication; annotation value; path from data item to annotation
Hyperlink example:
{ LET $name = $vedio/span[0]/b[0]/a[0]/text()[0]
  RETURN <name> $name </name> }
Annotation example:
{ LET $actors = $vedio/text()[contains( /preceding-sibling::b[0] ,"Featuring")]
  RETURN <actors> $actors </actors> }
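The two checks described on this slide can be sketched as simple string inspection of a rule's path expression. This is a minimal sketch under the assumption that rules are available as plain strings; the function names `has_hyperlink` and `annotation_of` are hypothetical and do not come from the SG-WRAM system.

```python
import re
from typing import Optional, Tuple

# contains(<path to annotation>, "<annotation value>") as written in the rules
CONTAINS_RE = re.compile(r'contains\(\s*([^,]*?)\s*,\s*"([^"]*)"\s*\)')
# A path step named "a", optionally with a positional predicate such as a[0]
HYPERLINK_STEP_RE = re.compile(r"/a(\[\d+\])?(/|$)")


def has_hyperlink(path: str) -> bool:
    """Hyperlink feature: the path to the data item passes through an <a> step."""
    return HYPERLINK_STEP_RE.search(path) is not None


def annotation_of(path: str) -> Optional[Tuple[str, str]]:
    """Annotation feature: (path to annotation, annotation value), or None."""
    match = CONTAINS_RE.search(path)
    return (match.group(1), match.group(2)) if match else None


name_path = "$vedio/span[0]/b[0]/a[0]/text()[0]"
actors_path = '$vedio/text()[contains( /preceding-sibling::b[0] ,"Featuring")]'

print(has_hyperlink(name_path), annotation_of(name_path))
print(has_hyperlink(actors_path), annotation_of(actors_path))
```

A real implementation would more likely walk the parsed XQuery expression than use regular expressions; string matching is used here only to keep the example short.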

Step 2: Data-Item Recovery • Traverse the new HTML tree following the depth-first traversal order • Use the old features to identify potential data items using 3 matching conditions: • Hyperlink • Annotation • Data pattern
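The recovery step above can be sketched as a depth-first walk that tests each text node against the old features. Everything below is a simplified illustration: the `Node` and `Feature` classes, the rule that an annotation is the text of the immediately preceding sibling, and the pattern used for actors are assumptions made for this example, not the system's actual data structures.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    tag: str                                   # element name, or "text()" for a text node
    text: str = ""
    children: List["Node"] = field(default_factory=list)


@dataclass
class Feature:
    schema_path: str                           # e.g. "VideoList/Video/Name"
    pattern: str                               # data-pattern regular expression
    hyperlink: bool = False                    # old item sat under an <a> element
    annotation: Optional[str] = None           # old annotation value, if any


def collect_text(node: Node) -> str:
    """Concatenate all text found in a subtree."""
    if node.tag == "text()":
        return node.text
    return "".join(collect_text(child) for child in node.children)


def recover(root: Node, features: List[Feature]) -> List[Tuple[str, str, str]]:
    """Depth-first walk returning (value, html_path, schema_path) mappings."""
    mappings: List[Tuple[str, str, str]] = []

    def visit(node: Node, path: str, under_link: bool, label: str) -> None:
        if node.tag == "text()":
            value = node.text.strip()
            for feat in features:
                if feat.hyperlink != under_link:
                    continue                   # hyperlink condition fails
                if feat.annotation is not None and feat.annotation not in label:
                    continue                   # annotation condition fails
                if re.fullmatch(feat.pattern, value):
                    mappings.append((value, path, feat.schema_path))
                    break
            return
        counts: Dict[str, int] = {}
        previous_text = ""                     # text of the preceding sibling (assumed annotation)
        for child in node.children:
            index = counts.get(child.tag, 0)
            counts[child.tag] = index + 1
            visit(child, f"{path}/{child.tag}[{index}]",
                  under_link or child.tag == "a", previous_text)
            previous_text = collect_text(child)

    visit(root, root.tag, root.tag == "a", "")
    return mappings


features = [
    Feature("VideoList/Video/Name", r"[A-Z][a-z]{0,}", hyperlink=True),
    Feature("VideoList/Video/Actors", r"[A-Za-z ,.]+", annotation="Featuring"),  # pattern is hypothetical
]
page = Node("td", children=[
    Node("span", children=[Node("b", children=[Node("a", children=[Node("text()", "May")])])]),
    Node("b", children=[Node("text()", "Featuring")]),
    Node("text()", " Tom Hanks "),
])
mappings = recover(page, features)
for mapping in mappings:
    print(mapping)
```

Each returned tuple mirrors the three parts of a mapping: the value, its path in the HTML tree, and the path of the corresponding DTD element.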

Example [Figure: flowcharts for recognizing a data item. Hyperlink case: check hyperlink (ok), check data pattern [A-Z][a-z]{0,} (ok), recognize a data item. Annotation case: find annotation (yes), find value starting from annotation, check data pattern [$][0-9]{0,}[0-9](.)[0-9]{2}, recognize a data item]

Results of Data Item Recovery • A mapping list including all the recognized data items • Each mapping contains: • Value of the data item (D) • Path to it in the HTML tree (HP) • Path of the corresponding DTD element (SP)
A sample mapping: M1’ (D: “May”, HP: …/table[0]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0], SP: VideoList/Video/Name)

Step 3: Block Configuration • Observation: data items are located and grouped in semantic blocks • A semantic block conforms to the user-defined schema • A candidate block can be a Partial-Match, a Full-Match, or an Over-Match of the schema

Computing “Full Match” Blocks • Identify the level in a top-down manner • Check the level by recursively considering the matches between candidate blocks and the schema [Figure: “full match” blocks]

Results of Block Configuration • A set of blocks that fully match the DTD • Each block is represented as a list of mappings [Figure: example blocks]
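The slides do not define the three match kinds precisely, so the sketch below rests on an assumed reading: a block is a full match when every leaf element of one Video record occurs exactly once, an over-match when some leaf occurs more than once, and a partial match otherwise. The `Block` class and all function names are hypothetical. The top-down search descends into over-matching blocks until full matches are found.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable, List

# Leaf elements of one Video record in the DTD shown earlier
VIDEO_LEAVES = ["Name", "Director", "Actors", "VHSPrice", "DVDPrice"]


def classify_block(found: Iterable[str], required: List[str] = VIDEO_LEAVES) -> str:
    """Classify a candidate block by the schema elements recovered inside it."""
    counts = Counter(found)
    if any(counts[name] > 1 for name in required):
        return "over-match"        # assumed: more than one record's worth of data
    if all(counts[name] == 1 for name in required):
        return "full-match"        # assumed: exactly one complete record
    return "partial-match"         # assumed: some schema element is missing


@dataclass
class Block:
    items: List[str] = field(default_factory=list)        # elements recovered directly here
    children: List["Block"] = field(default_factory=list)


def all_items(block: Block) -> List[str]:
    """All schema elements recovered in a block and its descendants."""
    result = list(block.items)
    for child in block.children:
        result.extend(all_items(child))
    return result


def full_match_blocks(block: Block) -> List[Block]:
    """Top-down search: descend while a block over-matches, keep full matches."""
    kind = classify_block(all_items(block))
    if kind == "full-match":
        return [block]
    if kind == "over-match":
        return [b for child in block.children for b in full_match_blocks(child)]
    return []


row1 = Block(items=list(VIDEO_LEAVES))
row2 = Block(items=list(VIDEO_LEAVES))
broken_row = Block(items=["Name", "Actors"])
table = Block(children=[row1, row2, broken_row])
print(classify_block(all_items(table)), len(full_match_blocks(table)))
```

Under these assumptions the table as a whole over-matches, its two complete rows are returned as full-match blocks, and the incomplete row is left out.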

Step 4: Rule Re-Induction • Semantic blocks contain mappings from data items in HTML to DTD elements • Induce a new extraction rule by calling the induction algorithm of the wrapper generator • Refine the rule so that it also covers all the other semantic blocks • Generalization is necessary
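One simple way to picture the generalization step is to compare the HTML paths of the same DTD element across several semantic blocks and drop the positional index wherever the blocks disagree. This is only an illustrative assumption about how a rule might be generalized; the slides do not describe the actual refinement algorithm, and the function name `generalize` is hypothetical.

```python
import re
from typing import List

# Splits a step such as "tr[0]" into its name ("tr") and optional index ("[0]")
STEP_RE = re.compile(r"^(.*?)(\[\d+\])?$")


def generalize(paths: List[str]) -> str:
    """Merge paths of equal length, removing indices that differ between them."""
    split = [path.split("/") for path in paths]
    if len({len(steps) for steps in split}) != 1:
        raise ValueError("paths must have the same number of steps")
    merged = []
    for steps in zip(*split):
        if len(set(steps)) == 1:
            merged.append(steps[0])            # identical step: keep the index
            continue
        names = {STEP_RE.match(step).group(1) for step in steps}
        if len(names) != 1:
            raise ValueError("paths differ in element names, not only in indices")
        merged.append(names.pop())             # same element, different index: drop it
    return "/".join(merged)


block_paths = [
    "table[0]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]",
    "table[0]/tr[1]/td[1]/span[0]/b[0]/a[0]/text()[0]",
]
print(generalize(block_paths))
```

A step without a positional predicate selects every matching child, so the merged path covers the item in both blocks.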

Outline • Motivation • System Overview • Wrapper Maintenance (four steps): • Data-Feature Discovery • Item Recovery • Block Configuration • Rule Re-induction • → Experiments • Related Work and Conclusion

Web Sources • Collected Web page changes from October 2002 to May 2003 • From 16 data-intensive sites • Using the site search engine or the same URL • All the pages have complex table structures • Observed changes: • Data items (add, delete, modify) • Table structure → non-table structure • Complex table structure re-arrangement

Experiment Procedures [Figure: three-step procedure (step1, step2, step3) connecting Original Web Docs, Wrapper Generator, Original Wrappers, Wrapper Repository, New Web Docs, Changed pages, Check Extraction Results, Wrapper Maintainer, and Repaired Wrappers]

Experiment Metrics • Recall (R): the proportion of correctly extracted data items among all the data items that should be extracted • Precision (P): the proportion of correctly extracted data items among all the data items that have been extracted
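The two metrics can be computed as in the short sketch below. Treating the extracted and expected items as sets of strings is a simplifying assumption for this example, and the sample values are made up for illustration rather than taken from the experiments.

```python
from typing import Iterable, Tuple


def precision_recall(extracted: Iterable[str], expected: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one wrapper run."""
    extracted_set, expected_set = set(extracted), set(expected)
    correct = len(extracted_set & expected_set)
    # Precision: correct items over everything the wrapper extracted
    precision = correct / len(extracted_set) if extracted_set else 0.0
    # Recall: correct items over everything that should have been extracted
    recall = correct / len(expected_set) if expected_set else 0.0
    return precision, recall


# Illustrative values only
extracted = ["May", "$15.38", "Featuring"]
expected = ["May", "$15.38", "Tom Hanks", "$19.99"]
p, r = precision_recall(extracted, expected)
print(f"P = {p:.2f}, R = {r:.2f}")
```

In this made-up run, two of the three extracted items are correct and two of the four expected items are found.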

Original wrappers after changes

New wrappers (after item recovery)

New wrappers (final)

Related Work on Wrapper Maintenance • [Kushmerick 99] • Using simple numeric features of the extracted strings • [Lerman K., Minton S. 00] • Using the starting and ending strings as the description of the data fields • [Chidlovskii B. 01] • Syntactic features of data items to be extracted, and semantic features: URL, time strings, entities…

Comparison • These approaches rely heavily on the syntactic features of the data items, and may not recognize data items precisely.

Conclusion • SG-WRAM: a wrapper-maintenance system • Intuition: use features that are more stable • Pattern • Hyperlink • Annotation • Four steps of the approach: • Data-Feature Discovery • Item Recovery • Block Configuration • Rule Re-induction • Experiments showed that it is effective

Thank you!
