
Bootstrapping information extraction from semi-structured web pages

Bootstrapping information extraction from semi-structured web pages. Andrew Carson and Charles Schafer. Abstract: a system that requires no human supervision for the target web site. Previous work required significant human effort. Their solution requires 2-5 annotated pages from 4-6 web sites to train the model.



Presentation Transcript


  1. Bootstrapping information extraction from semi-structured web pages • Andrew Carson and Charles Schafer

  2. Abstract • A system that requires no human supervision • Previous work: • Required significant human effort • Their solution: • Requires 2-5 annotated pages from 4-6 web sites to train the model • No human supervision for the target web site • Result: • Accuracy of 83.8% and 91.1% for different sites.

  3. Introduction • Extracting structured records from detail pages of semi-structured web pages

  4. Introduction • Why the semi-structured web? • A great source of information • Attribute/value structure: useful for downstream learning or querying systems

  5. Related Work • Problems with previous work • No labeling of example pages, but manual labeling of the output • Irrelevant fields (20 data fields and 7 schema columns) • DeLa system (automatically labels extracted data) • Problems with labeling detected data fields • A data field may not have a label • Multiple fields may share the same data type

  6. Methods • Terms: • Domain schema: a set of attributes • Schema column: a single attribute • Detail page: a page that corresponds to a single data record • Data field: a location within a template for that site • Data value: an instance of that data field

  7. Methods • Detecting Data Fields • Partial Tree Alignment Algorithm

  8. Methods • Classifying Data Fields • Assign a score to each schema column • c: data values => training data for the schema column • f: data fields => contexts from the training data • Compute the score: • Use a classifier to map data fields to schema columns • Use a model • K different feature types

  9. Methods • Feature Types • Precontext character 3-grams • Lowercase value tokens • Lowercase value character 3-grams • Value token types
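
As a rough illustration of the four feature types listed on this slide, the sketch below extracts them from a single data value and the text that precedes it. The tokenizer, the token-type inventory (`NUM`, `CAP`, `LOWER`, `ALNUM`, `PUNCT`), and the function names are assumptions made for this example; the slides do not specify them.

```python
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def token_type(token):
    """Map a token to a coarse type (hypothetical type inventory)."""
    if token.isdigit():
        return "NUM"
    if token.isalpha():
        return "CAP" if token[0].isupper() else "LOWER"
    if token.isalnum():
        return "ALNUM"
    return "PUNCT"


def extract_features(precontext, value):
    """Return one Counter of feature values per feature type."""
    # Assumed tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", value)
    return {
        "precontext_3grams": Counter(char_ngrams(precontext)),
        "value_tokens": Counter(t.lower() for t in tokens),
        "value_3grams": Counter(char_ngrams(value.lower())),
        "value_token_types": Counter(token_type(t) for t in tokens),
    }


features = extract_features("Price: ", "$1,200 per week")
print(features["value_token_types"])
```

Each feature type yields a bag of feature values, which is the form needed to compare distributions between a data field and a schema column.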

  10. Methods • Comparing Distributions of Feature Values • Advantages • Similar data values • Avoids over-fitting when • feature spaces are high-dimensional • the number of training examples is small

  11. Methods • KL-Divergence • Smoothed version • Skew Similarity Score
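
The formulas from this slide did not survive in the transcript. As a hedged sketch, the code below implements the standard KL divergence and the commonly used skew divergence, which smooths the second distribution by mixing in a small share of the first. The value `alpha=0.99` and the function names are assumptions, and how the presentation turns the divergence into its "skew similarity score" is not shown here.

```python
import math
from collections import Counter


def normalize(counts):
    """Turn raw feature counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)).

    Infinite when q assigns zero probability to something p has seen.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total


def skew_divergence(p, q, alpha=0.99):
    """KL(p || alpha*q + (1-alpha)*p): finite for any alpha < 1."""
    support = set(p) | set(q)
    mixed = {x: alpha * q.get(x, 0.0) + (1 - alpha) * p.get(x, 0.0)
             for x in support}
    return kl_divergence(p, mixed)


field = normalize(Counter(["per", "week", "per", "night"]))
column = normalize(Counter(["per", "night", "month"]))
print(kl_divergence(field, column))    # inf: "week" is unseen in `column`
print(skew_divergence(field, column))  # finite thanks to the smoothing
```

A lower divergence means the data field's feature distribution looks more like the training distribution for that schema column.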

  12. Methods • Combining Skew Similarity Scores • Combine the skew similarity scores for the different feature types using a linear regression model • Stacked classifier model • Labeling the Target Site • For each schema column c, choose the data field with the highest score
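
A minimal sketch of the combination and labeling steps, assuming the regression weights have already been learned. The weights, field identifiers, and scores below are made-up placeholders, and reading the labeling rule as "pick the highest-scoring data field per schema column" is an interpretation of the slide rather than something the transcript states outright.

```python
def combined_score(skew_scores, weights, bias=0.0):
    """Linear combination of the per-feature-type similarity scores."""
    return bias + sum(weights[name] * s for name, s in skew_scores.items())


def label_target_site(scores_by_field, weights, bias=0.0):
    """Pick, for each schema column, the data field with the highest score.

    scores_by_field: {field_id: {column: {feature_type: score}}}
    """
    best = {}
    for field_id, by_column in scores_by_field.items():
        for column, skew_scores in by_column.items():
            score = combined_score(skew_scores, weights, bias)
            if column not in best or score > best[column][1]:
                best[column] = (field_id, score)
    return {column: field_id for column, (field_id, _) in best.items()}


# Hypothetical weights and scores, for illustration only.
weights = {"value_tokens": 0.6, "value_3grams": 0.4}
scores = {
    "field_1": {
        "price": {"value_tokens": 0.9, "value_3grams": 0.8},
        "location": {"value_tokens": 0.1, "value_3grams": 0.2},
    },
    "field_2": {
        "price": {"value_tokens": 0.2, "value_3grams": 0.3},
        "location": {"value_tokens": 0.7, "value_3grams": 0.6},
    },
}
labels = label_target_site(scores, weights)
print(labels)
```

In a real setup the weights would be fitted on the annotated training sites, one weight per feature type, before being applied to the target site.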

  13. Evaluation • Accuracy of automatically labeling new sites • How well it makes recommendations to human annotators • Input: a collection of annotated sites for a domain • Method: cross-validation
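
The slide only says "cross-validation" over a collection of annotated sites. One plausible reading, sketched below, is leave-one-site-out: each site in turn is treated as the unlabeled target while the rest serve as training data. The per-site accuracy measure and the `train_and_label` callback are assumptions for this example.

```python
def leave_one_site_out(sites):
    """Yield (training_sites, held_out_site) pairs, one per annotated site."""
    for i, held_out in enumerate(sites):
        yield sites[:i] + sites[i + 1:], held_out


def cross_validate(sites, train_and_label, gold_labels):
    """Average labeling accuracy over all held-out sites.

    train_and_label(training_sites, target_site) -> {field_id: column}
    gold_labels[site] -> {field_id: column}
    """
    accuracies = []
    for training, target in leave_one_site_out(sites):
        predicted = train_and_label(training, target)
        gold = gold_labels[target]
        correct = sum(predicted.get(f) == c for f, c in gold.items())
        accuracies.append(correct / len(gold))
    return sum(accuracies) / len(accuracies)


# Toy data: a dummy labeler that always maps field "f1" to "price".
sites = ["site_a", "site_b", "site_c"]
gold = {
    "site_a": {"f1": "price"},
    "site_b": {"f1": "title"},
    "site_c": {"f1": "price"},
}
accuracy = cross_validate(sites, lambda train, target: {"f1": "price"}, gold)
print(accuracy)
```

Holding out whole sites, rather than individual pages, matches the stated goal of labeling a new site with no annotations of its own.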

  14. Results by Site

  15. Results by Schema Column

  16. Identifying Missing Schema Columns • Vacation rentals: 80.0% • Job sites: 49.3%

  17. Conclusion
