Schema Matching and Data Extraction over HTML Tables

Cui Tao

Data Extraction Research Group

Department of Computer Science

Brigham Young University

supported by NSF


Introduction

  • Many tables on the Web

  • How to integrate data stored in different tables?

    • Detect the table of interest

    • Form attribute-value pairs (adjust if necessary)

    • Do extraction

    • Infer mappings from extraction patterns


Problem: Detecting the Table of Interest


Problem: Different Schemas

  • Different source table schemas

    • {Run #, Yr, Make, Model, Tran, Color, Dr}

    • {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}

    • {Vehicle, Distance, Price, Mileage}

    • {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}

  • Target database schema

    • {Car, Year, Make, Model, Mileage, Price, PhoneNr}

    • {Car, Feature}


Problem: Attribute is Value


Problem: Attribute-Value is Value


Problem: Value is not Value


Problem: Factored Values


Problem: Split Values


Problem: Merged Values


Problem: Information Behind Links

  • Table extending over several pages

  • Single-column table (formatted as list)


Solution

  • Detect the table of interest

  • Form attribute-value pairs (adjust if necessary)

  • Do extraction

  • Infer mappings from extraction patterns


Solution: Detect the Table of Interest

  • ‘Real’ table test

    • Same number of values

    • Table size

  • Attribute test

  • Density measure test

    density = (# of ontology-extracted values) / (total # of values in the table)
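
The 'real' table test and the density measure test above could be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function names, the regular expressions standing in for ontology value recognizers, and the thresholds are all assumptions, and the attribute test is left out.

```python
import re

# Hypothetical stand-ins for the ontology's value recognizers (assumption).
RECOGNIZERS = {
    "Year": re.compile(r"^(19|20)\d{2}$"),
    "Price": re.compile(r"^\$\d[\d,]*$"),
    "Make": re.compile(r"^(Honda|Ford|Toyota)$", re.IGNORECASE),
}


def is_real_table(rows, min_rows=2, min_cols=2):
    """'Real' table test: large enough, and every row has the same number of values."""
    if len(rows) < min_rows:
        return False
    width = len(rows[0])
    return width >= min_cols and all(len(row) == width for row in rows)


def density(rows):
    """# of ontology-extracted values / total # of values in the table."""
    values = [value.strip() for row in rows for value in row]
    if not values:
        return 0.0
    hits = sum(
        1 for value in values
        if any(pattern.match(value) for pattern in RECOGNIZERS.values())
    )
    return hits / len(values)


def is_table_of_interest(rows, threshold=0.5):
    """Combine both tests; the 0.5 threshold is an arbitrary illustrative value."""
    return is_real_table(rows) and density(rows) >= threshold
```

For a row such as `["Honda", "Civic EX", "1995", "$6300"]`, three of the four values are recognized, so the density is 0.75.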


Solution: Remove Factoring

(Figure: table with the factored year values 2001, 2000, and 1999 repeated in each row.)
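
One way to picture removing factoring is a fill-down over the factored column. The sketch below assumes that a factored value appears once and that the cells beneath it are empty; the slides do not say how factoring shows up in the HTML, so that representation, the function name, and the sample rows are all hypothetical.

```python
def remove_factoring(rows, column):
    """Copy the last non-empty value of `column` into the empty cells below it."""
    filled = []
    current = None
    for row in rows:
        row = list(row)  # copy so the caller's rows are not modified
        if row[column] not in (None, ""):
            current = row[column]      # a new factored value starts here
        elif current is not None:
            row[column] = current      # repeat the factored value in this row
        filled.append(row)
    return filled


# Illustrative data only: the year is given once per group of rows.
rows = [
    ["2001", "Honda"],
    ["", "Ford"],
    ["2000", "Toyota"],
    ["", "Honda"],
]
unfactored = remove_factoring(rows, column=0)
```

After the call, every row carries its own year, so each row can later be treated as one complete record.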


Solution: Replace Boolean Values


Solution: Form Attribute-Value Pairs

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>,

<Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
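
The pairs above can be reproduced with a small sketch that replaces Boolean values and then pairs each header with its cell. The raw cell values ("Yes"/"No"), the marker sets, and the function names are assumptions; the slides show only the resulting pairs.

```python
# Assumed Boolean markers; the original table's markers are not shown in the slides.
BOOLEAN_TRUE = {"yes", "y"}
BOOLEAN_FALSE = {"no", "n"}


def replace_boolean(attribute, value):
    """Turn a true marker into the attribute name; return None for a false marker."""
    normalized = value.strip().lower()
    if normalized in BOOLEAN_TRUE:
        return attribute
    if normalized in BOOLEAN_FALSE:
        return None
    return value


def form_pairs(header, row):
    """Pair each attribute with its value, dropping features the row does not have."""
    pairs = []
    for attribute, value in zip(header, row):
        value = replace_boolean(attribute, value)
        if value is not None:
            pairs.append((attribute, value))
    return pairs


header = ["Make", "Model", "Year", "Colour", "Price", "Auto", "Air Cond.", "AM/FM", "CD"]
row = ["Honda", "Civic EX", "1995", "White", "$6300", "Yes", "Yes", "Yes", "No"]
pairs = form_pairs(header, row)
```

With this input, `pairs` contains the eight pairs listed above, and the `CD` column produces no pair because its value is a false marker.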


Solution: Adjust Attribute-Value Pairs

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>,

<Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>


Solution: Add Information Hidden Behind Links

  • Unstructured and semi-structured: concatenate

  • Single attribute-value pairs: pair them together

    <Price, $7,988>, <Mileage, 63,168 miles>, <Body Type, Car>, <Body Style, 4 DR Sedan>, <Transmission, Automatic>, <Engine, 3.0 L V-6>, <Doors, 4>, <Fuel Type, Gas>, <Stock Number, 22764>, <VIN, 1FAFP52U2WA139879>

  • List: mark the beginning and the end
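
The three cases for linked pages might be handled along these lines. Everything here is a hypothetical sketch: the slides do not describe the input format, so the `kind` labels, the `Attribute: value` line format, and the list markers are invented for illustration.

```python
def merge_linked_content(kind, content):
    """Prepare content found behind a link, depending on its (assumed) kind."""
    if kind == "text":
        # Unstructured or semi-structured text: concatenate the pieces.
        return " ".join(part.strip() for part in content)
    if kind == "pairs":
        # Single attribute-value pairs: pair attribute and value together.
        # Assumes each line looks like "Attribute: value".
        pairs = []
        for line in content:
            attribute, value = line.split(":", 1)
            pairs.append((attribute.strip(), value.strip()))
        return pairs
    if kind == "list":
        # List: mark the beginning and the end (marker strings are made up).
        return ["<BEGIN-LIST>"] + [item.strip() for item in content] + ["<END-LIST>"]
    raise ValueError(f"unknown kind: {kind!r}")


linked_pairs = merge_linked_content(
    "pairs", ["Price: $7,988", "Mileage: 63,168 miles", "Doors: 4"]
)
```

Here `linked_pairs` holds `("Price", "$7,988")`, `("Mileage", "63,168 miles")`, and `("Doors", "4")`, which can then be added to the pairs formed from the table row itself.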


Solution: Inferred Mapping Creation

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}


Each row is a car.
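
A rough sketch of inferring mappings from extraction patterns: for each source column, count which target attribute's recognizer matches its values and map the column to the most frequent match. The recognizer patterns, the `min_share` threshold, and the sample table are assumptions for illustration, not the method as implemented by the authors.

```python
import re
from collections import Counter

# Hypothetical recognizers for a few target attributes (assumption).
TARGET_RECOGNIZERS = {
    "Year": re.compile(r"^(19|20)\d{2}$"),
    "Price": re.compile(r"^\$\d[\d,]*$"),
    "Mileage": re.compile(r"^\d[\d,]*\s*(miles|mi)$", re.IGNORECASE),
}


def infer_mappings(header, rows, min_share=0.5):
    """Map each source attribute to the target attribute most of its values match."""
    mappings = {}
    for index, source_attribute in enumerate(header):
        column = [row[index].strip() for row in rows]
        hits = Counter()
        for value in column:
            for target, pattern in TARGET_RECOGNIZERS.items():
                if pattern.match(value):
                    hits[target] += 1
        if hits:
            target, count = hits.most_common(1)[0]
            # Keep the mapping only if enough of the column's values agree.
            if count / len(column) >= min_share:
                mappings[source_attribute] = target
    return mappings


# Illustrative source table whose header names differ from the target schema.
header = ["Yr", "Distance", "Cost"]
rows = [
    ["1995", "63,168 miles", "$7,988"],
    ["2001", "12,000 mi", "$6300"],
]
mappings = infer_mappings(header, rows)
```

Because the mapping is driven by the extracted values rather than by the header names, a column labelled `Yr` or `Cost` can still be matched to `Year` or `Price` in the target schema.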



Experimental Results

Car Advertisement Application domain

  • 10 “training” tables

    • 100% of the 57 mappings (no false mappings)

    • 94.6% precision of the values in linked pages (5.4% false declarations)

  • 50 test tables

    • 94.7% of the 300 mappings (no false mappings)

    • On the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision


Other Applications

  • Cell Phone Plan Application domain

  • Soccer Player Application domain


Contribution

  • Provides an approach to extract information automatically from HTML tables

  • Suggests a different way to solve the problem of schema matching

