
Harvesting Relational Tables from Lists on the Web



Presentation Transcript


  1. Harvesting Relational Tables from Lists on the Web Hazem Elmeleegy Purdue University Jayant Madhavan and Alon Halevy Google Inc.

  2. Outline • Introduction • The ListExtract Approach • Experiments • Conclusion

  3. Lists on the Web

  4. Lists on the Web

  5. Lists on the Web

  6. Lists on the Web
     • Our Goal: extract tabular data from all such lists in an unsupervised and domain-independent manner.
     • Not the typical wrapper generation problem.

  7. Cartoons Example: Easy for Humans, Confusing for Machines
     • A period (“.”) is used both as a delimiter and to terminate abbreviations.
     • A slash (“/”) is used both as a delimiter and as part of the text.
     • The slash (“/”) delimiter is missing (along with the production year).

  8. Key Contributions
     • Developed the ListExtract system, which extracts tables from lists in an unsupervised and domain-independent manner.
     • Introduced the use of external sources of information, such as a large collection of tables gathered from the Web and a language model, to guide the splitting decisions.
     • Conducted a large-scale experimental study, which suggests that tens of millions of high-quality lists can be exploited on the Web.

  9. Outline • Introduction • The ListExtract Approach • Experiments • Conclusion

  10. ListExtract Approach
      • Independent Splitting Phase: Splitting Lines into Records
      • Alignment Phase: Deciding the Number of Columns; Re-Splitting Long Records; Aligning Short Records (Null Insertion)
      • Refinement Phase: Detecting Inconsistent Fields; Re-Splitting Detected Field Streaks; Re-Aligning Detected Field Streaks (Null Insertion)

  11. Intermediate Outputs (Independent Splitting Phase)

  12. Intermediate Outputs (Re-Splitting Long Records) Number of Columns = 4

  13. Intermediate Outputs (Alignment Phase)

  14. Final Output (Refinement Phase)

  15. ListExtract Approach
      • Independent Splitting Phase: Splitting Lines into Records
      • Alignment Phase: Deciding the Number of Columns; Re-Splitting Long Records; Aligning Short Records (Null Insertion)
      • Refinement Phase: Detecting Inconsistent Fields; Re-Splitting Detected Field Streaks; Re-Aligning Detected Field Streaks (Null Insertion)

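  16. Line Splitting Algorithm
      [Figure: an input line and its output fields. Pre-processing removes delimiters; candidate subsequences are listed with their FQ scores, and the selected ones are marked with a check (√).]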

  17. Field Quality (FQ) Score
      • Linear combination of multiple score components
      • Each component corresponds to a source of evidence
      • Score Components
         • Data Type
            • Regular expressions capture different data types (e.g. dates, emails, currencies, etc.)
            • Score: 1 if a match is found, 0 otherwise
         • Table Corpus
            • Check whether the candidate sequence exists as a field in the table corpus
            • Score: 1 if it exists, 0 otherwise
         • Language Model
            • Measure the likelihood that the candidate sequence occurs in free text, and the unlikelihood that overlapping sequences occur in free text
            • Score: a combination of the probabilities capturing both the likelihood and the unlikelihood
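
The slide describes the FQ score only as a linear combination of evidence components. A minimal Python sketch of that idea follows. The weights, the three regular expressions, and the names `fq_score`, `data_type_score`, and `table_corpus_score` are illustrative assumptions, not values from the presentation, and the language-model component is passed in as a precomputed number because the slides do not give its formula.

```python
import re

# Hypothetical weights: the slides only say the score is a linear combination.
WEIGHTS = {"data_type": 0.4, "table_corpus": 0.4, "language_model": 0.2}

# A few illustrative data-type patterns (date, email, currency).
TYPE_PATTERNS = [
    re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}"),
    re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),
    re.compile(r"[$€£]\d+(\.\d{2})?"),
]


def data_type_score(candidate):
    """1 if the whole candidate matches a known data type, 0 otherwise."""
    return 1.0 if any(p.fullmatch(candidate) for p in TYPE_PATTERNS) else 0.0


def table_corpus_score(candidate, corpus_fields):
    """1 if the candidate appears as a field in the table corpus, 0 otherwise."""
    return 1.0 if candidate.lower() in corpus_fields else 0.0


def fq_score(candidate, corpus_fields, lm_score):
    """Weighted sum of the three components; lm_score is assumed to be in [0, 1]."""
    parts = {
        "data_type": data_type_score(candidate),
        "table_corpus": table_corpus_score(candidate, corpus_fields),
        "language_model": lm_score,
    }
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)
```

Here `corpus_fields` stands in for the table corpus as a set of lowercased field strings; a real corpus lookup would be far larger and is outside what the slides specify.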

  18. ListExtract Approach
      • Independent Splitting Phase: Splitting Lines into Records
      • Alignment Phase: Decide on the Number of Columns (majority voting across all records); Re-Splitting Long Records; Aligning Short Records (Null Insertion)
      • Refinement Phase: Detecting Inconsistent Fields; Re-Splitting Detected Field Streaks; Re-Aligning Detected Field Streaks (Null Insertion)
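
Majority voting over the independently split records can be sketched in a few lines of Python. The function name `decide_num_columns` and the tie-breaking behavior (the field count seen first wins) are assumptions; the slide states only that the number of columns is chosen by majority vote across all records.

```python
from collections import Counter


def decide_num_columns(split_records):
    """Return the most common number of fields among the split records."""
    counts = Counter(len(record) for record in split_records)
    # most_common(1) yields a single (field_count, frequency) pair.
    return counts.most_common(1)[0][0]
```

Records with more fields than this value would then go to re-splitting, and records with fewer fields to null insertion.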

  19. ListExtract Approach
      • Independent Splitting Phase: Splitting Lines into Records
      • Alignment Phase: Decide on the Number of Columns; Re-Splitting Long Records; Aligning Short Records (Null Insertion)
      • Refinement Phase: Detecting Inconsistent Fields; Re-Splitting Detected Field Streaks; Re-Aligning Detected Field Streaks (Null Insertion)

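  20. Re-Splitting Long Records
      • Maximum Number of Output Fields = 3
      [Figure: an input record and its output fields. Pre-processing removes delimiters; candidate subsequences are listed with their FQ scores, and the selected ones are marked with a check (√).]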

  21. ListExtract Approach
      • Independent Splitting Phase: Splitting Lines into Records
      • Alignment Phase: Decide on the Number of Columns; Re-Splitting Long Records; Aligning Short Records (Null Insertion)
      • Refinement Phase: Detecting Inconsistent Fields; Re-Splitting Detected Field Streaks; Re-Aligning Detected Field Streaks (Null Insertion)

  22. Aligning Short Records (Null Insertion)
      [Figure: independently split records with their average FQ scores: 0.88, 0.79, 0.49, 0.62, 0.73, 0.92, 0.86]

  23. Aligning Short Records (Null Insertion)
      • Step 1: Sorting
      • Step 2: Iterative Alignment
      [Figure: independently split records with their average FQ scores, in the order 0.92, 0.86, 0.79, 0.62, 0.88, 0.73, 0.49, next to the output table]

  24. Aligning Short Records (Null Insertion)
      • To align each record, we use the classical Needleman-Wunsch sequence alignment algorithm [NW, J. of Molecular Biology, 1970].
      • The two sequences:
         • Sequence #1: Table columns
         • Sequence #2: Fields of a short record
      • Design a Field-to-Field Consistency (F2FC) Score.
      • Use the average F2FC Score as the similarity measure for the alignment algorithm.
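
The alignment step can be illustrated with a simplified variant of the Needleman-Wunsch recurrence in which gaps (nulls) are allowed only in the short record, since every field must land in some column. This is a sketch under that assumption, not the presenters' implementation. The names `align_short_record`, `toy_sim`, and `gap_penalty` are hypothetical; `sim(column_index, field)` stands in for the average F2FC score between a field and the values already placed in a column.

```python
def align_short_record(fields, num_columns, sim, gap_penalty=0.0):
    """Place the fields, in order, into num_columns slots; unused slots become None."""
    m = len(fields)
    if m > num_columns:
        raise ValueError("record has more fields than columns")
    neg_inf = float("-inf")
    # best[i][j]: best score for aligning the first i columns with the first j fields.
    best = [[neg_inf] * (m + 1) for _ in range(num_columns + 1)]
    best[0][0] = 0.0
    for i in range(1, num_columns + 1):
        for j in range(0, min(i, m) + 1):
            skip = best[i - 1][j] - gap_penalty  # column i-1 receives a null
            take = neg_inf
            if j > 0:  # column i-1 receives field j-1
                take = best[i - 1][j - 1] + sim(i - 1, fields[j - 1])
            best[i][j] = max(skip, take)
    # Trace back from the full alignment to recover the placement.
    row = [None] * num_columns
    j = m
    for i in range(num_columns, 0, -1):
        if j > 0 and best[i][j] == best[i - 1][j - 1] + sim(i - 1, fields[j - 1]):
            row[i - 1] = fields[j - 1]
            j -= 1
    return row


def toy_sim(col, field):
    """Toy similarity: column 2 holds years, column 0 titles, column 1 anything textual."""
    if col == 2:
        return 1.0 if field.isdigit() else 0.0
    if col == 0:
        return 0.0 if field.isdigit() else 1.0
    return 0.0 if field.isdigit() else 0.5


aligned = align_short_record(["Toy Story", "1995"], 3, toy_sim)
```

With the toy similarity above, the two-field record is placed in the first and third columns and the middle column receives the null.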

  25. Field-to-Field Consistency (F2FC) Score
      • Linear combination of multiple score components
      • Each component corresponds to a source of evidence
      • Score Components
         • Data Type: check whether the data types of the two fields are consistent
         • Table Corpus: check whether the two fields co-occur in the same column of a table in the corpus
         • Syntax: measure the consistency of the syntax of the two fields (e.g. length, % of upper/lower case letters, digits, spaces, etc.)
         • Delimiters: measure the consistency between the delimiters on both sides of the two fields
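
As an illustration of the Syntax component only, the sketch below compares two fields by length and by the share of upper-case letters, lower-case letters, digits, and spaces. The feature set follows the examples on the slide, but the way the differences are combined into one number, and the names `syntax_features`, `syntax_consistency`, and `avg_consistency`, are assumptions.

```python
def syntax_features(field):
    """Fractions of upper-case, lower-case, digit, and whitespace characters."""
    n = max(len(field), 1)
    return [
        sum(c.isupper() for c in field) / n,
        sum(c.islower() for c in field) / n,
        sum(c.isdigit() for c in field) / n,
        sum(c.isspace() for c in field) / n,
    ]


def syntax_consistency(a, b):
    """1.0 for syntactically identical fields, lower as they diverge."""
    fa, fb = syntax_features(a), syntax_features(b)
    char_diff = sum(abs(x - y) for x, y in zip(fa, fb)) / len(fa)
    len_diff = abs(len(a) - len(b)) / max(len(a), len(b), 1)
    return 1.0 - (char_diff + len_diff) / 2


def avg_consistency(field, column_fields):
    """Average consistency of a field against the values already in a column."""
    if not column_fields:
        return 0.0
    return sum(syntax_consistency(field, f) for f in column_fields) / len(column_fields)
```

An average of this kind is what the alignment step would use as its similarity measure, with the other three components folded in through their own weights.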

  26. ListExtract Approach
      • Independent Splitting Phase: Splitting Lines into Records
      • Alignment Phase: Decide on the Number of Columns; Re-Splitting Long Records; Aligning Short Records (Null Insertion)
      • Refinement Phase: Detecting Inconsistent Fields; Re-Splitting Detected Field Streaks; Re-Aligning Detected Field Streaks (Null Insertion)

  27. Refinement Phase Output Table

  28. Refinement Phase Output Table Detect Inconsistent Fields

  29. Refinement Phase Output Table Detect Inconsistent Fields Consider streaks only

  30. Refinement Phase Output Table Detect Inconsistent Fields Consider streaks only Re-merge

  31. Refinement Phase
      • Output Table
      • Detect Inconsistent Fields
      • Consider streaks only
      • Re-merge
      • Re-split (and re-align if needed)
      • Use extended FQ score
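
The "consider streaks only" and "re-merge" steps can be sketched as follows. This assumes a streak is a run of at least two adjacent fields in one record that were flagged as inconsistent; the slides do not define the minimum length, so `min_len=2` is a guess, and `find_streaks` and `remerge` are hypothetical helper names.

```python
from itertools import groupby


def find_streaks(flags, min_len=2):
    """Return (start, end) index pairs for runs of True flags of length >= min_len."""
    streaks = []
    pos = 0
    for is_inconsistent, run in groupby(flags):
        length = len(list(run))
        if is_inconsistent and length >= min_len:
            streaks.append((pos, pos + length))
        pos += length
    return streaks


def remerge(fields, start, end, sep=" "):
    """Join fields[start:end] back into one string, skipping nulls."""
    merged = sep.join(f for f in fields[start:end] if f is not None)
    return fields[:start] + [merged] + fields[end:]
```

The merged text would then be re-split with the extended FQ score and, if the result has fewer fields than the streak spans, re-aligned with nulls.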

  32. Field Quality (FQ) Score [Revisited]
      • Linear combination of multiple score components
      • Each component corresponds to a source of evidence
      • Score Components
         • Data Type
         • Table Corpus
         • Language Model
         • List Support: favors candidates that are more consistent with the columns spanned by the streak

  33. ListExtract Approach
      • Independent Splitting Phase: Splitting Lines into Records
      • Alignment Phase: Decide on the Number of Columns; Re-Splitting Long Records; Aligning Short Records (Null Insertion)
      • Refinement Phase: Detecting Inconsistent Fields; Re-Splitting Detected Field Streaks; Re-Aligning Detected Field Streaks (Null Insertion)

  34. Table Extraction (TE) Score
      • Average FQ Score over all fields in the extracted table
      • Used to compare and rank the extracted tables by their extraction quality
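
A TE score of this form reduces to a mean over cells. In the sketch below, `te_score` and `rank_tables` are hypothetical names, `fq` is any callable that returns a field's FQ score, and skipping null cells is an assumption the slide does not address.

```python
def te_score(table, fq):
    """Average FQ score over all non-null cells of an extracted table."""
    scores = [fq(cell) for row in table for cell in row if cell is not None]
    return sum(scores) / len(scores) if scores else 0.0


def rank_tables(tables, fq):
    """Order extracted tables from highest to lowest TE score."""
    return sorted(tables, key=lambda t: te_score(t, fq), reverse=True)
```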

  35. Outline • Introduction • The ListExtract Approach • Experiments • Conclusion

  36. Overall Performance for WLists and TDLists
      • WLists: a set of 20 manually collected HTML lists spanning 20 different domains.
      • TDLists: a set of 100 lists derived from randomly selected HTML tables.

  37. Effect of the Refinement Phase (WLists)

  38. Large-Scale Experiment
      • A crawl of 100K web pages
      • 100K extracted lists
      • 32K lists after filtering
      • 11K extracted tables with multiple columns
      • Figure annotations: (0.45, ~10,300 tables) and (0.65, ~1,000 tables)

  39. Outline • Introduction • The ListExtract Approach • Experiments • Conclusion

  40. Conclusion
      • Our work is a continuation of the efforts to extract structured data from the Web.
      • Our system, ListExtract, is completely unsupervised and does not assume any domain knowledge. It uses multiple sources of information to make its decisions.
      • Our results validate the quality of table extraction and suggest that a large number of high-quality lists can be exploited on the Web.

  41. Thank you Questions?
