Schema-Guided Wrapper Maintenance for Web-Data Extraction. Xiaofeng Meng, Dongdong Hu Renmin University of China, Beijing, China Chen Li University of California, Irvine, CA, USA. Wrappers for Web Sources. Extract information from Web pages Used in many Web-based applications. XML.
Schema-Guided Wrapper Maintenance for Web-Data Extraction Xiaofeng Meng, Dongdong Hu Renmin University of China, Beijing, China Chen Li University of California, Irvine, CA, USA
Wrappers for Web Sources • Extract information from Web pages • Used in many Web-based applications [Figure: wrappers take HTML documents and deliver the extracted data as XML, to an RDBMS, or to programs, for use by an application (e.g., data integration)]
Problem • The Web is very dynamic: both contents and page structures change • Original wrappers can stop working because they rely on Web page structures • Re-generating wrappers is not easy: heavy workload for system developers [Figure: applied to changed documents, the original wrapper extracts nothing, returns incomplete results, or returns incorrect results]
Example The original wrapper fails due to the structure change.
Problems • Wrapper verification: is a wrapper operating correctly? • Several studies have been conducted on the verification problem: • E.g., computing the similarity between a wrapper’s expected and observed output (“regression test”) • Wrapper maintenance: how to automatically modify a wrapper when the pages have changed? (Focus of this work)
Outline • Motivation • System overview • Schema-Guided Wrapper Maintenance • Experiments • Related Work and Conclusion
The SG-WRAM System • Inputs: Documents, Changed Documents • Components: Wrapper Generator, Wrapper Executor, Wrapper Repository (Schema, Rule), Wrapper Maintainer • Wrapper Maintainer steps: Data Feature Discovery, Data Item Recovery, Block Configuration, Rule Re-induction • Output: XML
User-Defined Schema User provides schema for the target data <!ELEMENT VideoList (Video+)> <!ELEMENT Video (Name, Director, Actors, Price)> <!ELEMENT Name (#PCDATA)> <!ELEMENT Director (#PCDATA)> <!ELEMENT Actors (#PCDATA)> <!ELEMENT Price (VHSPrice, DVDPrice)> <!ELEMENT VHSPrice (#PCDATA)> <!ELEMENT DVDPrice (#PCDATA)>
Schema-Guided Wrapper Generation • Using a GUI toolkit, users can map data items in HTML pages to elements in the DTD [Figure: HTML page mapped to the DTD tree]
Schema-Guided Wrapper Generation • Internally, the system computes the mappings from the corresponding HTML tree to the DTD tree • It then generates the extraction rule [Figure: HTML tree mapped to the DTD tree]
Expressing Extraction Rule in XQuery • Each rule is an FLWR XQuery expression • Example: FOR $vedio IN $vedioList/body/div[0]/table[4]/tr[0]/td[2]/table/tr[0]/td[1] RETURN <vedio> { LET $name = $vedio/span[0]/b[0]/a[0]/text()[0] RETURN <name> $name </name> } </vedio> • The FOR and LET clauses give the paths to the data items; the bound variable ($name) holds the value of the data item
Annotations for data items • Describe the semantic meaning of a data item • Indicate the location of the data item • Specified by the user through the GUI • Recorded with the XPath function “contains(pathToAnnotation, annotationValue)”, e.g.: /body/div[0]/table[4]/tr[0]/td[2]/table[1]/tr[0]/td[1]/text()[0][contains(null,"directed by")]
Outline • Motivation • System Overview • Wrapper Maintenance (four steps): • Data-Feature Discovery • Item Recovery • Block Configuration • Rule Re-induction • Experiments • Related Work and Conclusion
Intuition of the approach • The page structure could change • Observation: many “features” of data items are more static, e.g.: • Hyperlink • Annotation • Pattern • These features can help us find the new places of the old data items
Step 1: Data-feature discovery • Compute features of the data items in the original page
Data-Pattern Feature • A syntactic feature • Represented as a regular expression • E.g., the value $ 15.38 has the pattern [$][0-9]{0,}[0-9](.)[0-9]{2} • Can be extracted using existing technologies, e.g., [Brin98], [GHQR98], [LM00]
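A minimal sketch of how such a data pattern might be checked against a candidate value, using Python's `re` module. The two patterns are the ones shown on the slides; the helper name `matches_pattern` is hypothetical. The slide prints the price as "$ 15.38", but the pattern as written does not allow a space after the dollar sign, so this sketch assumes the space is a transcription artifact and tests "$15.38". Note also that `(.)` matches any single character, not only a literal dot.

```python
import re

# Patterns copied from the slides.
PRICE_PATTERN = r"[$][0-9]{0,}[0-9](.)[0-9]{2}"
NAME_PATTERN = r"[A-Z][a-z]{0,}"


def matches_pattern(pattern, value):
    """Return True if the whole (whitespace-trimmed) value fits the pattern."""
    return re.fullmatch(pattern, value.strip()) is not None


if __name__ == "__main__":
    for candidate in ["$15.38", "$15.3", "15.38"]:
        print(candidate, matches_pattern(PRICE_PATTERN, candidate))
    for candidate in ["May", "may"]:
        print(candidate, matches_pattern(NAME_PATTERN, candidate))
```

Using `fullmatch` rather than `search` is a design choice here: a data pattern is meant to describe the entire data item, not a substring of a longer text node.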
Annotations and Hyperlinks • Get annotation and hyperlink information from the original page by checking the XQuery-based extraction rule • Hyperlink: a step of “…/a/…” in the path • Annotation: the function “contains()” • Example with a hyperlink indication: { LET $name = $vedio/span[0]/b[0]/a[0]/text()[0] RETURN <name> $name </name> } • Example with an annotation value and the path from the data item to the annotation: { LET $actors = $vedio/text()[contains( /preceding-sibling::b[0] ,"Featuring")] RETURN <actors> $actors </actors> }
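The feature lookup described on this slide can be sketched as a small string analysis of a rule path. This is only an illustration under stated assumptions: the function name `discover_features` and the returned dictionary layout are hypothetical, and the rule paths are treated as plain strings rather than parsed XQuery.

```python
import re

# Matches contains(pathToAnnotation, "annotationValue") inside a rule path.
CONTAINS = re.compile(r'contains\(\s*([^,]*?)\s*,\s*"([^"]*)"\s*\)')


def discover_features(rule_path):
    """Return the hyperlink and annotation features found in one rule path."""
    match = CONTAINS.search(rule_path)
    annotation = None
    if match:
        annotation = {"path": match.group(1), "value": match.group(2)}
    # Drop the contains() predicate so its inner "/" is not read as a step.
    bare_path = re.sub(r"\[contains\(.*?\)\]", "", rule_path)
    steps = [s for s in bare_path.split("/") if s]
    # A step named "a" (with or without an index) indicates a hyperlink.
    hyperlink = any(re.fullmatch(r"a(\[\d+\])?", s) for s in steps)
    return {"hyperlink": hyperlink, "annotation": annotation}


# Rule paths taken from the slide examples.
name_rule = "$vedio/span[0]/b[0]/a[0]/text()[0]"
actors_rule = '$vedio/text()[contains( /preceding-sibling::b[0] ,"Featuring")]'

if __name__ == "__main__":
    print(discover_features(name_rule))
    print(discover_features(actors_rule))
```

For the name rule the sketch reports a hyperlink and no annotation; for the actors rule it reports the annotation value "Featuring" together with the path from the data item to the annotation.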
Step 2: Data-Item Recovery • Traverse the new HTML tree following the depth-first traversal order • Use the old features to identify potential data items using 3 matching conditions: • Hyperlink • Annotation • Data pattern
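The recovery step can be sketched as a depth-first walk that tests each text node against the stored features. Everything below is an assumption made for illustration: the `Node` class, the `FEATURES` table, the actor pattern, the sample tree, and the simplified annotation test (the previously visited text node must contain the annotation value). The source does not spell out the exact matching procedure.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str                      # element name, or "text()" for a text node
    text: str = ""
    children: list = field(default_factory=list)


# Hypothetical features from step 1, keyed by the DTD element path.
FEATURES = {
    "VideoList/Video/Name": {
        "hyperlink": True, "annotation": None, "pattern": r"[A-Z][a-z]{0,}"},
    "VideoList/Video/Actors": {
        "hyperlink": False, "annotation": "Featuring",
        "pattern": r"[A-Za-z ,]+"},  # assumed pattern, not from the slides
}


def recover(root, features=FEATURES):
    """Depth-first traversal returning (value, HTML path, schema path) mappings."""
    mappings = []
    state = {"prev_text": ""}

    def visit(node, path, in_link):
        if node.tag == "text()":
            value = node.text.strip()
            for schema_path, f in features.items():
                if f["hyperlink"] != in_link:
                    continue  # hyperlink condition fails
                if f["annotation"] and f["annotation"] not in state["prev_text"]:
                    continue  # annotation condition fails
                if not re.fullmatch(f["pattern"], value):
                    continue  # data-pattern condition fails
                mappings.append((value, path, schema_path))
                break
            state["prev_text"] = value
            return
        seen = {}  # per-tag sibling counters, giving span[0], span[1], ...
        for child in node.children:
            index = seen.get(child.tag, 0)
            seen[child.tag] = index + 1
            visit(child, f"{path}/{child.tag}[{index}]",
                  in_link or child.tag == "a")

    visit(root, "/" + root.tag, root.tag == "a")
    return mappings


# Made-up page fragment for illustration only.
page = Node("body", children=[
    Node("span", children=[Node("b", children=[
        Node("a", children=[Node("text()", "May")])])]),
    Node("b", children=[Node("text()", "Featuring")]),
    Node("text()", "Actor One, Actor Two"),
])

if __name__ == "__main__":
    for mapping in recover(page):
        print(mapping)
```

Each returned tuple has the three parts of a mapping: the value of the data item, its path in the HTML tree, and the path of the corresponding DTD element.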
Example • First item: check hyperlink (ok), check data pattern [A-Z][a-z]{0,} (ok), recognize a data item • Second item: find annotation (yes), find value starting from annotation, check data pattern [$][0-9]{0,}[0-9](.)[0-9]{2}, recognize a data item
Results of Data Item Recovery • A mapping list including all the recognized data items • Each mapping contains • Value of the data item • Path to it in the HTML tree • Path of the corresponding DTD element A sample mapping: M1’ (D: “May”, HP: …/table[0]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0], SP: VideoList/Video/Name )
Step 3: Block Configuration • Observation: data items are grouped in semantic blocks, and each block conforms to the user-defined schema • A candidate block can be a Partial-Match, a Full-Match, or an Over-Match with the schema
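The slides name the three match types without defining them. The sketch below assumes one plausible reading: a block is a partial match if some schema leaf element is missing, an over-match if every leaf is present but at least one appears more than once, and a full match if every leaf appears exactly once. The function name `classify_block` and this definition are assumptions, not the authors' stated criteria; the leaf elements come from the example DTD.

```python
from collections import Counter

# Leaf elements of the example Video DTD.
SCHEMA_LEAVES = ["Name", "Director", "Actors", "VHSPrice", "DVDPrice"]


def classify_block(elements, schema_leaves=SCHEMA_LEAVES):
    """Classify a candidate block by the DTD elements of its recovered items."""
    counts = Counter(elements)
    if any(counts[leaf] == 0 for leaf in schema_leaves):
        return "partial-match"   # at least one schema element is missing
    if any(counts[leaf] > 1 for leaf in schema_leaves):
        return "over-match"      # the block spans more than one record
    return "full-match"          # exactly one item per schema element


if __name__ == "__main__":
    print(classify_block(["Name", "Director"]))
    print(classify_block(SCHEMA_LEAVES))
    print(classify_block(SCHEMA_LEAVES + SCHEMA_LEAVES))
```

Under this reading, a top-down search would descend from an over-matching block toward its children until the candidate blocks stop over-matching.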
Computing “Full Match” Blocks • Identify the level in a top-down manner • Check the level by recursively considering the matches between candidate blocks and the schema “Full match” blocks
Results of Block Configuration • A set of blocks that can fully match with the DTD • Each of them is represented as a list of mappings Examples
Step 4: Rule Re-Induction • Semantic blocks contain mappings from data items in HTML to DTD elements • Induce a new extraction rule by calling the induction algorithm in the wrapper generator • Refine the rule so that it covers all other semantic blocks • Generalization is necessary
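The source does not describe how the generalization is performed. One plausible illustration, and only that, is to compare the paths of the same DTD element across several semantic blocks and replace every step index that differs with a wildcard. The function name `generalize` and the `[*]` notation are hypothetical.

```python
import re

# Splits a step such as "tr[0]" into its name ("tr") and optional index.
STEP = re.compile(r"^(.*?)(?:\[(\d+)\])?$")


def generalize(paths):
    """Merge absolute paths of equal length into one path with [*] wildcards."""
    split = [p.strip("/").split("/") for p in paths]
    if len({len(steps) for steps in split}) != 1:
        raise ValueError("paths must have the same number of steps")
    merged = []
    for steps in zip(*split):
        names = {STEP.match(s).group(1) for s in steps}
        if len(names) != 1:
            raise ValueError("paths differ in element names, not only indices")
        if len(set(steps)) == 1:
            merged.append(steps[0])            # identical step, keep as is
        else:
            merged.append(f"{names.pop()}[*]")  # indices differ, generalize
    return "/" + "/".join(merged)


if __name__ == "__main__":
    block_paths = [
        "/body/table[0]/tr[0]/td[1]",
        "/body/table[0]/tr[1]/td[1]",
    ]
    print(generalize(block_paths))
```

With the two sample paths, only the `tr` index differs, so the merged path keeps `table[0]` and `td[1]` and generalizes the row step.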
Outline • Motivation • System Overview • Wrapper Maintenance (four steps): • Data-Feature Discovery • Item Recovery • Block Configuration • Rule Re-induction • Experiments • Related Work and Conclusion
Web Sources • From October 2002 to May 2003 • Collected Web page changes • From 16 data-intensive sites • Using the site search engine or from the same URL • All the pages have complex table structures • Observed changes • Data items (add, delete, modify) • Table structure to non-table structure • Complex table structure re-arrangement
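Experiment Procedures • Step 1: the Wrapper Generator builds the original wrappers from the original Web docs and stores them in the Wrapper Repository • Step 2: check the extraction results of the original wrappers on the new Web docs • Step 3: for changed pages, the Wrapper Maintainer produces repaired wrappers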
Experiment Metrics • Recall (R) • Proportion of correctly extracted data items among all the data items that should be extracted • Precision (P) • Proportion of correctly extracted data items among all the data items that have been extracted
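The two metrics can be written out directly from their definitions. In this small sketch the function name `precision_recall` and the sample item sets are made up for illustration; items are compared by exact equality, which is an assumption about how "correctly extracted" is judged.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for a set of extracted data items."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    # Precision: correct items among everything that was extracted.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: correct items among everything that should be extracted.
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    extracted_items = {"item-a", "item-b", "wrong-item"}
    expected_items = {"item-a", "item-b", "item-c", "item-d"}
    p, r = precision_recall(extracted_items, expected_items)
    print(f"P = {p:.2f}, R = {r:.2f}")
```

Here two of the three extracted items are correct (precision 2/3), and two of the four expected items were found (recall 1/2).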
Related Work on Wrapper Maintenance • [Kushmerick 99] • Using simple numeric features of the extracted strings • [Lerman K., Minton S. 00] • Using the starting and ending strings as the description of the data fields • [Chidlovskii B. 01] • Syntactic features of data items to be extracted, and semantic features: URL, time strings, entities…
Comparison • These approaches rely heavily on the syntactic features of the data items and may not recognize data items precisely.
Conclusion • SG-WRAM: a wrapper-maintenance system • Intuition: use features that are more stable • Pattern • Hyperlink • Annotation • Four steps of the approach: • Data-Feature Discovery • Item Recovery • Block Configuration • Rule Re-induction • Experiments showed that it is effective
Thank you!