Bootstrapping information extraction from semi-structured web pages

Andrew Carson and Charles Schafer

Abstract • A system that requires no human supervision on the target site • Previous work: • Required significant human effort • Their solution: • Requires 2-5 annotated pages from 4-6 web sites to train the model • No human supervision for the target web site • Result: • 83.8% and 91.1% for different sites.

Introduction • Extracting structured records from detail pages of semi-structured web pages

Introduction • Why the semi-structured web? • Great source of information • Attribute/value structure: useful for downstream learning or querying systems

Related Work • Problems of Previous Work • No labeling of example pages, but manual labeling of the output • Irrelevant fields (20 data fields and 7 schema columns) • DeLa system (automatically labels extracted data) • Problems of labeling detected data fields • A data field may not have a label • Multiple fields of the same data type

Methods • Terms: • Domain schema: a set of attributes • Schema column: a single attribute • Detail page: a page that corresponds to a single data record • Data field: a location within a template for that site • Data value: an instance of that data field

Methods • Detecting Data Fields • Partial Tree Alignment Algorithm
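The slide names Partial Tree Alignment but gives no details, so the following is not that algorithm. It is a much simpler stand-in that illustrates the goal of the step: given several detail pages from one site, find the template locations whose text varies from page to page. The page representation (a dict from tag path to text) and the function name `detect_data_fields` are assumptions made for illustration.

```python
from collections import defaultdict


def detect_data_fields(pages):
    """Return tag paths whose text differs across pages (hypothetical helper).

    pages: list of dicts mapping a tag path to the text found at that path.
    A path with more than one distinct value is treated as a data field;
    a path with a single value is treated as template text.
    """
    values_by_path = defaultdict(set)
    for page in pages:
        for path, text in page.items():
            values_by_path[path].add(text.strip())
    return sorted(p for p, vals in values_by_path.items() if len(vals) > 1)


# Two toy detail pages from the same (made-up) job site.
pages = [
    {"html/body/h1": "Software Engineer",
     "html/body/div/b": "Location:",
     "html/body/div/span": "Boston, MA"},
    {"html/body/h1": "Data Analyst",
     "html/body/div/b": "Location:",
     "html/body/div/span": "Austin, TX"},
]
fields = detect_data_fields(pages)
```

Here the constant label "Location:" is treated as template text, while the heading and the span are reported as data fields. A real alignment algorithm also has to cope with optional and repeated nodes, which this sketch ignores.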

Methods • Classifying Data Fields • Assign a score to each schema column • c: Data values => data for training schema column • f: data fields => contexts from the training data • Compute the score: • Use a classifier to map data fields to schema columns • Use a model with K different feature types

Methods • Feature Types • Precontext character 3-grams • Lowercase value tokens • Lowercase value character 3-grams • Value token types
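The four feature types listed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tokenizer, the token-type inventory (`NUM`, `ALLCAPS`, `CAP`, `LOWER`, `ALNUM`, `OTHER`), and the function names are all assumptions.

```python
import re
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def token_type(token):
    """Coarse token class (assumed inventory, for illustration only)."""
    if token.isdigit():
        return "NUM"
    if token.isalpha():
        if token.isupper():
            return "ALLCAPS"
        if token[0].isupper():
            return "CAP"
        return "LOWER"
    if token.isalnum():
        return "ALNUM"
    return "OTHER"


def extract_features(precontext, value):
    """Count features of each type for one data value and its preceding text."""
    tokens = re.findall(r"\w+|[^\w\s]", value)
    return {
        "precontext_char_3grams": Counter(char_ngrams(precontext)),
        "value_tokens_lower": Counter(t.lower() for t in tokens),
        "value_char_3grams_lower": Counter(char_ngrams(value.lower())),
        "value_token_types": Counter(token_type(t) for t in tokens),
    }


feats = extract_features("Location: ", "Boston, MA")
```

Each feature type yields a bag of counts, so the values of a data field can later be summarized as one distribution per feature type.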

Methods • Comparing Distributions of Feature Values • Advantages • Similar data values • Avoids over-fitting • with high-dimensional feature spaces • with a small number of training examples

Methods • KL-Divergence • Smoothed version • Skew Similarity Score
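The formulas on this slide did not survive, so the sketch below uses the standard definition of KL-divergence and one common form of skew divergence, in which the second distribution is mixed with a little of the first so the result stays finite. Negating the divergence to get a similarity, the value `alpha = 0.99`, and all function names are assumptions and may differ from what the authors use.

```python
import math
from collections import Counter


def normalize(counts):
    """Turn a non-empty bag of counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)).

    Infinite when q gives zero probability to something p can produce.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total


def skew_divergence(p, q, alpha=0.99):
    """KL(p || alpha * q + (1 - alpha) * p), finite for alpha < 1."""
    support = set(p) | set(q)
    mixed = {x: alpha * q.get(x, 0.0) + (1 - alpha) * p.get(x, 0.0)
             for x in support}
    return kl_divergence(p, mixed)


def skew_similarity(p, q, alpha=0.99):
    """Assumed similarity score: the negated skew divergence."""
    return -skew_divergence(p, q, alpha)


p = normalize(Counter("aab"))   # a: 2/3, b: 1/3
q = normalize(Counter("abc"))   # a, b, c: 1/3 each
```

With these toy distributions, `kl_divergence(q, p)` is infinite because `p` never produces "c", while `skew_divergence(q, p)` stays finite. That is the practical reason for smoothing when feature distributions are estimated from few examples.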

Methods • Combining Skew Similarity Scores • Combine the skew similarity scores for the different feature types using a linear regression model • Stacked classifier model • Labeling the Target Site • For each schema column c, pick the data field with the highest score
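The combination and labeling steps can be sketched as a weighted sum followed by a per-column argmax. The weights below are made-up placeholders standing in for coefficients that a linear regression would fit on the training sites; the feature-type keys, field names, and function names are also hypothetical.

```python
def combined_score(skew_scores, weights, bias=0.0):
    """Linear combination of per-feature-type skew similarity scores."""
    return bias + sum(weights[k] * s for k, s in skew_scores.items())


def label_site(scores_by_field, weights, bias=0.0):
    """For each schema column, pick the data field with the highest score.

    scores_by_field: {field: {column: {feature_type: skew similarity}}}
    Returns {column: best field}.
    """
    best = {}
    for field, by_column in scores_by_field.items():
        for column, skew_scores in by_column.items():
            score = combined_score(skew_scores, weights, bias)
            if column not in best or score > best[column][1]:
                best[column] = (field, score)
    return {column: field for column, (field, _) in best.items()}


# Placeholder weights; in practice these would be learned, not hand-set.
weights = {"tokens": 0.6, "char3": 0.4}
scores_by_field = {
    "field_1": {"title": {"tokens": -0.2, "char3": -0.3},
                "location": {"tokens": -2.0, "char3": -1.5}},
    "field_2": {"title": {"tokens": -1.8, "char3": -1.2},
                "location": {"tokens": -0.1, "char3": -0.4}},
}
labels = label_site(scores_by_field, weights)
```

In this toy input, `field_1` wins the "title" column and `field_2` wins "location". Nothing here stops one field from winning two columns; whether the original system enforces a one-to-one mapping is not stated on the slide.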

Evaluation • Accuracy of automatically labeling new sites • How well it makes recommendations to human annotators • Input: a collection of annotated sites for a domain • Method: cross-validation
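One plausible reading of the cross-validation setup is to hold out one annotated site at a time, train on the rest, and score the labels predicted for the held-out site. The slide does not say how the folds are formed, so the split by site, the accuracy definition, and the `label_fn` interface below are assumptions.

```python
def leave_one_site_out(sites):
    """Yield (training_sites, held_out_site) pairs, one per site."""
    for held_out in sites:
        training = [s for s in sites if s != held_out]
        yield training, held_out


def cross_validated_accuracy(sites, label_fn, gold):
    """Fraction of gold (column, field) pairs that label_fn predicts correctly.

    label_fn(training_sites, target_site) -> {column: field}  (hypothetical)
    gold: {site: {column: correct field}}
    """
    correct = total = 0
    for training, target in leave_one_site_out(sites):
        predicted = label_fn(training, target)
        for column, field in gold[target].items():
            total += 1
            correct += predicted.get(column) == field
    return correct / total if total else 0.0


# Toy check with a labeler that always answers "f1" for the title column.
sites = ["a", "b", "c"]
gold = {"a": {"title": "f1"}, "b": {"title": "f2"}, "c": {"title": "f1"}}
accuracy = cross_validated_accuracy(
    sites, lambda training, target: {"title": "f1"}, gold)
```

Splitting by site rather than by page matters here, because the target site is meant to be labeled without any of its own annotations.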

Results by Site

Results by Schema Column

Identifying Missing Schema Columns • Vacation rentals: 80.0% • Job sites: 49.3%

Conclusion
