
Learning to Map Between Schemas and Ontologies

Alon Halevy

University of Washington

Joint work with Anhai Doan and Pedro Domingos


Agenda

  • Ontology mapping is a key problem in many applications:

    • Data integration

    • Semantic web

    • Knowledge management

    • E-commerce

  • LSD:

    • Solution that uses multi-strategy learning.

    • We’ve started with schema matching (i.e., very simple ontologies)

    • Currently extending to more expressive ontologies.

    • Experiments show the approach is very promising!


The Structure Mapping Problem

  • Types of structures:

    • Database schemas, XML DTDs, ontologies, …

  • Input:

    • Two (or more) structures, S1 and S2

    • Data instances for S1 and S2

    • Background knowledge

  • Output:

    • A mapping between S1 and S2

      • Should enable translating between data instances.

    • Semantics of mapping?


Semantic Mappings between Schemas

  • Source schemas = XML DTDs

house

address

contact-info

num-baths

agent-name agent-phone

1-1 mapping

non 1-1 mapping

house

location contact

full-baths

half-baths

name phone


Motivation

  • Database schema integration

    • A problem as old as databases themselves.

    • database merging, data warehouses, data migration

  • Data integration / information gathering agents

    • On the WWW, in enterprises, large science projects

  • Model management:

    • Model matching: key operator in an algebra where models and mappings are first-class objects.

    • See [Bernstein et al., 2000] for more.

  • The Semantic Web

    • Ontology mapping.

  • System interoperability

    • E-services, application integration, B2B applications, …


Desiderata from Proposed Solutions

  • Accuracy, efficiency, ease of use.

  • Realistic expectations:

    • Unlikely to be fully automated. Need user in the loop.

  • Some notion of semantics for mappings.

  • Extensibility:

    • Solution should exploit additional background knowledge.

  • “Memory”, knowledge reuse:

    • System should exploit previous manual or automatically generated matchings.

    • Key idea behind LSD.


LSD Overview

  • L(earning) S(ource) D(escriptions)

  • Problem: generating semantic mappings between mediated schema and a large set of data source schemas.

  • Key idea: generate the first mappings manually, and learn from them to generate the rest.

  • Technique: multi-strategy learning (extensible!)

  • Step 1:

    • [SIGMOD, 2001]: 1-1 mappings between XML DTDs.

  • Current focus:

    • Complex mappings

    • Ontology mapping.


Outline

  • Overview of structure mapping

  • Data integration and source mappings

  • LSD architecture and details

  • Experimental results

  • Current work.


Data Integration

Find houses with four bathrooms priced under $500,000

mediated schema

Query reformulation and optimization.

source schema 1

source schema 2

source schema 3

wrappers

realestate.com

homeseekers.com

homes.com

Applications: WWW, enterprises, science projects

Techniques: virtual data integration, warehousing, custom code.


Semantic Mappings between Schemas

  • Source schemas = XML DTDs

house

address

contact-info

num-baths

agent-name agent-phone

1-1 mapping

non 1-1 mapping

house

location contact

full-baths

half-baths

name phone


Semantics (preliminary)

  • Semantics of mappings has received no attention.

  • Semantics of 1-1 mappings –

  • Given:

    • R(A1,…,An) and S(B1,…,Bm)

    • 1-1 mappings (Ai,Bj)

  • Then, we postulate the existence of a relation W, s.t.:

    • π_{C1,…,Ck}(W) = π_{A1,…,Ak}(R)

    • π_{C1,…,Ck}(W) = π_{B1,…,Bk}(S)

    • W also includes the unmatched attributes of R and S.

  • In English: R and S are projections on some universal relation W, and the mappings specify the projection variables and correspondences.
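The universal-relation postulate can be sketched in a few lines of Python. The relations, attribute names, and data below are hypothetical, chosen only to illustrate that R and S are projections of a shared W:

```python
# Hypothetical toy illustration of the universal-relation postulate: R and S
# are projections of a single relation W, and the 1-1 mapping identifies
# which column of W they share.

def project(rows, cols):
    """Relational projection of a list of dict-rows onto the given columns."""
    return [{c: r[c] for c in cols} for r in rows]

# W holds the matched attribute (num-baths) plus the unmatched
# attributes of both R and S.
W = [
    {"num-baths": 2, "address": "12 Elm St", "name": "Gail Murphy"},
    {"num-baths": 3, "address": "5 Oak Ave", "name": "Mike Brown"},
]

R = project(W, ["num-baths", "address"])   # R(A1, A2)
S = project(W, ["num-baths", "name"])      # S(B1, B2)
```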


Why Matching is Difficult

  • Aims to identify same real-world entity

    • using names, structures, types, data values, etc

  • Schemas represent same entity differently

    • different names => same entity:

      • area & address => location

    • same names => different entities:

      • area => location or square-feet

  • Schema & data never fully capture semantics!

    • not adequately documented, not sufficiently expressive

  • Intended semantics is typically subjective!

    • IBM Almaden Lab = IBM?

  • Cannot be fully automated. Often hard for humans. Committees are required!


Current State of Affairs

  • Finding semantic mappings is now the bottleneck!

    • largely done by hand

    • labor intensive & error prone

    • GTE: 4 hours/element for 27,000 elements [Li&Clifton00]

  • Will only be exacerbated

    • data sharing & XML become pervasive

    • proliferation of DTDs

    • translation of legacy data

    • reconciling ontologies on semantic web

  • Need semi-automatic approaches to scale up!


Outline

  • Overview of structure mapping

  • Data integration and source mappings

  • LSD architecture and details

  • Experimental results

  • Current work.


The LSD Approach

  • User manually maps a few data sources to the mediated schema.

  • LSD learns from the mappings, and proposes mappings for the rest of the sources.

  • Several types of knowledge are used in learning:

    • Schema elements, e.g., attribute names

    • Data elements: ranges, formats, word frequencies, value frequencies, length of texts.

    • Proximity of attributes

    • Functional dependencies, number of attribute occurrences.

  • One learner does not fit all. Use multiple learners and combine with meta-learner.


Example

Mediated schema

address price agent-phone description

location listed-price phone comments

Learned hypotheses

If “phone” occurs in the name => agent-phone

Schema of realestate.com

location

Miami, FL

Boston, MA

...

listed-price

$250,000

$110,000

...

phone

(305) 729 0831

(617) 253 1429

...

comments

Fantastic house

Great location

...

realestate.com

If “fantastic” & “great” occur frequently in data values => description

homes.com

price

$550,000

$320,000

...

contact-phone

(278) 345 7215

(617) 335 2315

...

extra-info

Beautiful yard

Great beach

...


Multi-Strategy Learning

  • Use a set of base learners:

    • Name learner, Naïve Bayes, Whirl, XML learner

  • And a set of recognizers:

    • County name, zip code, phone numbers.

  • Each base learner produces a prediction weighted by confidence score.

  • Combine base learners with a meta-learner, using stacking.


Base Learners

  • Name Learner

(contact-info,office-address)

(contact,agent-phone)

(contact-phone, ? )

(phone,agent-phone)

(listed-price,price)

  • contact-phone => (agent-phone,0.7), (office-address,0.3)

  • Naive Bayes Learner [Domingos & Pazzani 97]

    • “Kent, WA” => (address,0.8), (name,0.2)

  • Whirl Learner [Cohen & Hirsh 98]

  • XML Learner

    • exploits hierarchical structure of XML data
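A minimal sketch of the Naive Bayes base learner's idea — classify a data value by the tokens it contains — might look as follows. This is a generic multinomial Naive Bayes with add-one smoothing over toy training data, not LSD's actual implementation:

```python
from collections import Counter, defaultdict
import math

class NaiveBayesLearner:
    """Sketch of the Naive Bayes base learner's idea: multinomial Naive
    Bayes with add-one smoothing over value tokens. Illustrative only."""

    def fit(self, examples):
        # examples: list of (data value, mediated-schema element) pairs
        self.token_counts = defaultdict(Counter)
        self.label_counts = Counter()
        for text, label in examples:
            self.label_counts[label] += 1
            self.token_counts[label].update(text.lower().split())
        self.vocab = {t for c in self.token_counts.values() for t in c}

    def predict(self, text):
        # Return the most probable mediated-schema element for a value.
        total = sum(self.label_counts.values())
        scores = {}
        for label, n in self.label_counts.items():
            logp = math.log(n / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in text.lower().split():
                logp += math.log((self.token_counts[label][tok] + 1) / denom)
            scores[label] = logp
        return max(scores, key=scores.get)

nb = NaiveBayesLearner()
nb.fit([("Miami, FL", "address"), ("Boston, MA", "address"),
        ("Fantastic house", "description"), ("Great location", "description")])
nb.predict("Kent, MA")  # unseen city, but the state token still suggests address
```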


Training the Base Learners

Mediated schema

address price agent-phone description

location listed-price phone comments

Schema of realestate.com

Name Learner

<location> Miami, FL </>

<listed-price> $250,000</>

<phone> (305) 729 0831</>

<comments> Fantastic house </>

(location, address)

(listed-price, price)

(phone, agent-phone)

...

realestate.com

Naive Bayes Learner

<location> Boston, MA </>

<listed-price> $110,000</>

<phone> (617) 253 1429</>

<comments> Great location </>

(“Miami, FL”, address)

(“$ 250,000”, price)

(“(305) 729 0831”, agent-phone)

...


Entity Recognizers

  • Use pre-programmed knowledge to identify specific types of entities

    • date, time, city, zip code, name, etc

    • house-area (30 X 70, 500 sq. ft.)

    • county-name recognizer

  • Recognizers often have nice characteristics

    • easy to construct

    • many off-the-shelf research & commercial products

    • applicable across many domains

    • help with special cases that are hard to learn
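Recognizers of this kind are often just small pattern matchers. A sketch with illustrative regexes for US-style phone numbers and zip codes (the actual LSD recognizers, e.g. county-name, are pre-programmed modules; these patterns are assumptions for illustration):

```python
import re

# Illustrative entity recognizers: map a data value to an entity type,
# or None if no recognizer fires. The patterns are assumptions, not LSD's.
RECOGNIZERS = {
    "phone":    re.compile(r"^\(\d{3}\)\s*\d{3}[ -]?\d{4}$"),
    "zip-code": re.compile(r"^\d{5}(-\d{4})?$"),
}

def recognize(value):
    """Return the recognized entity type for a value, or None."""
    for entity_type, pattern in RECOGNIZERS.items():
        if pattern.match(value.strip()):
            return entity_type
    return None

recognize("(305) 729 0831")  # "phone"
```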


Meta-Learner: Stacking

  • Training of meta-learner produces a weight for every pair of:

    • (base-learner, mediated-schema element)

    • weight(Name-Learner,address) = 0.1

    • weight(Naive-Bayes,address) = 0.9

  • Combining predictions of meta-learner:

    • computes weighted sum of base-learner confidence scores

Name Learner

Naive Bayes

(address,0.6)

(address,0.8)

<area>Seattle, WA</>

Meta-Learner

(address, 0.6*0.1 + 0.8*0.9 = 0.78)
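The combination step on this slide is just a weighted sum of base-learner confidences, short enough to state directly (the learner and element names are from the slide's example):

```python
# Meta-learner combination step: weighted sum of base-learner confidence
# scores for a given mediated-schema element.

def meta_score(predictions, weights, element):
    """predictions: {learner: {element: confidence}};
    weights: {(learner, element): weight} learned by stacking."""
    return sum(weights[(learner, element)] * scores.get(element, 0.0)
               for learner, scores in predictions.items())

predictions = {"name-learner": {"address": 0.6}, "naive-bayes": {"address": 0.8}}
weights = {("name-learner", "address"): 0.1, ("naive-bayes", "address"): 0.9}
meta_score(predictions, weights, "address")  # 0.6*0.1 + 0.8*0.9 = 0.78
```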


Training the Meta-Learner

  • For address

Name Learner

Naive Bayes

True Predictions

Extracted XML Instances

<location> Miami, FL</>

<listed-price> $250,000</>

<area> Seattle, WA </>

<house-addr>Kent, WA</>

<num-baths>3</>

...

0.5 0.8 1

0.4 0.3 0

0.3 0.9 1

0.6 0.8 1

0.3 0.3 0

... ... ...

Least-Squares Linear Regression

Weight(Name-Learner,address) = 0.1

Weight(Naive-Bayes,address) = 0.9
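The training step can be sketched as an ordinary least-squares fit over the slide's (base-learner score, true label) rows. Note the slide's weights 0.1 and 0.9 are illustrative; running the regression on just these five toy rows yields different numbers, so the sketch only checks that the fitted weights separate the true from the false instances:

```python
# Sketch of meta-learner training: fit per-learner weights for "address"
# by least squares over (name-learner score, naive-bayes score, true label).

def fit_weights(rows):
    """Solve the 2x2 normal equations for y ~ w1*x1 + w2*x2 (no intercept)."""
    s11 = sum(x1 * x1 for x1, _, _ in rows)
    s12 = sum(x1 * x2 for x1, x2, _ in rows)
    s22 = sum(x2 * x2 for _, x2, _ in rows)
    b1 = sum(x1 * y for x1, _, y in rows)
    b2 = sum(x2 * y for _, x2, y in rows)
    det = s11 * s22 - s12 * s12
    return (b1 * s22 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det

# Training rows from the slide: Name Learner, Naive Bayes, true prediction.
rows = [(0.5, 0.8, 1), (0.4, 0.3, 0), (0.3, 0.9, 1),
        (0.6, 0.8, 1), (0.3, 0.3, 0)]
w_name, w_bayes = fit_weights(rows)
```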


Applying the Learners

Mediated schema

Schema of homes.com

address price agent-phone description

area day-phone extra-info

Name Learner

Naive Bayes

<area>Seattle, WA</>

<area>Kent, WA</>

<area>Austin, TX</>

(address,0.8), (description,0.2)

(address,0.6), (description,0.4)

(address,0.7), (description,0.3)

Meta-Learner

Name Learner

Naive Bayes

Meta-Learner

(address,0.7), (description,0.3)

<day-phone>(278) 345 7215</>

<day-phone>(617) 335 2315</>

<day-phone>(512) 427 1115</>

(agent-phone,0.9), (description,0.1)

(description,0.8), (address,0.2)

<extra-info>Beautiful yard</>

<extra-info>Great beach</>

<extra-info>Close to Seattle</>


The Constraint Handler

  • Extends learning to incorporate constraints

    • hard constraints

      • a = address & b = address ⇒ a = b

      • a = house-id ⇒ a is a key

      • a = agent-info & b = agent-name ⇒ b is nested in a

    • soft constraints

      • a = agent-phone & b = agent-name ⇒ a & b are usually close to each other

    • user feedback = hard or soft constraints

  • Details in [Doan et al., SIGMOD 2001]


The Current LSD System

Training Phase

Matching Phase

Mediated schema

Source schemas

Domain

Constraints

Data listings

User Feedback

Constraint Handler

Base-Learner1

Base-Learnerk

Meta-Learner

Mappings


Outline

  • Overview of structure mapping

  • Data integration and source mappings

  • LSD architecture and details

  • Experimental results

  • Current work.


Empirical Evaluation

  • Four domains

    • Real Estate I & II, Course Offerings, Faculty Listings

  • For each domain

    • create mediated DTD & domain constraints

    • choose five sources

    • extract & convert data listings into XML (faithful to schema!)

    • mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48

  • Ten runs for each experiment - in each run:

    • manually provide 1-1 mappings for 3 sources

    • ask LSD to propose mappings for remaining 2 sources

    • accuracy = % of 1-1 mappings correctly identified


Matching Accuracy

Average Matching Accuracy (%)

LSD’s accuracy: 71 - 92%

Best single base learner: 42 - 72%

+ Meta-learner: + 5 - 22%

+ Constraint handler: + 7 - 13%

+ XML learner: + 0.8 - 6%


Sensitivity to Amount of Available Data

Average matching accuracy (%)

Number of data listings per source (Real Estate I)


Contribution of Schema vs. Data

LSD with only schema info.

LSD with only data info.

Complete LSD

Average matching accuracy (%)

  • More experiments in the paper [Doan et al. 01]


Reasons for Incorrect Matching

  • Unfamiliarity

    • suburb

    • solution: add a suburb-name recognizer

  • Insufficient information

    • correctly identified general type, failed to pinpoint exact type

    • <agent-name>Richard Smith</><phone> (206) 234 5412 </>

    • solution: add a proximity learner

  • Subjectivity

    • house-style = description?


Outline

  • Overview of structure mapping

  • Data integration and source mappings

  • LSD architecture and details

  • Experimental results

  • Current work.


Moving Up the Expressiveness Ladder

  • Schemas are very simple ontologies.

  • More expressive power = More domain constraints.

  • Mappings become more complex, but constraints provide more to learn from.

  • Non 1-1 mappings:

    • F1(A1,…,Am) = F2(B1,…,Bm)

  • Ontologies (of various flavors):

    • Class hierarchy (i.e., containment on unary relations)

    • Relationships between objects

    • Constraints on relationships


Finding Non 1-1 Mappings (Current Work)

  • Given two schemas, find

    • 1-many mappings: address = concat(city,state)

    • many-1: half-baths + full-baths = num-baths

    • many-many: concat(addr-line1,addr-line2) = concat(street,city,state)

  • 1-many mappings

    • expressed as query

      • value correspondence expression: room-rate = rate * (1 + tax-rate)

      • relationship: state of tax-rate = state of hotel that has rate

    • special case: 1-many mappings between two relational tables

Mediated schema

Source schema

address description num-baths

city state comments half-baths full-baths


Brute-Force Solution

  • Define a set of operators

    • concat, +, -, *, /, etc

  • For each set of mediated-schema columns

    • enumerate all possible mappings

    • evaluate & return best mapping

Mediated-schema columns

Source-schema columns

compute similarity

using all base learners

m1

m1, m2, ..., mk


Search-Based Solution

  • States = columns

    • goal state: mediated-schema column

    • initial states: all source-schema columns

      • use 1-1 matching to reduce the set of initial states

  • Operators: concat, +, -, *, /, etc

  • Column-similarity:

    • use all base learners + recognizers
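The search can be sketched with a small beam search. Everything here is illustrative: the only operator is pairwise concat, and column similarity is a cheap token-set Jaccard score standing in for the base learners and recognizers:

```python
# Beam-search sketch of the search-based solution. States are candidate
# columns (lists of values); the goal is the mediated-schema column.

def jaccard(xs, ys):
    """Cheap column similarity: Jaccard overlap of token sets."""
    a = {t for x in xs for t in x.split()}
    b = {t for y in ys for t in y.split()}
    return len(a & b) / len(a | b)

def beam_search(source_cols, target_vals, beam_width=2, max_ops=2):
    """Expand the best candidate columns by concatenating each source
    column onto them, keeping the top beam_width states per round."""
    beam = sorted(((jaccard(vals, target_vals), name, vals)
                   for name, vals in source_cols.items()), reverse=True)
    best = beam[0]
    for _ in range(max_ops):
        for _, expr, vals in list(beam[:beam_width]):
            for name, svals in source_cols.items():
                new_vals = [f"{a} {b}" for a, b in zip(vals, svals)]
                beam.append((jaccard(new_vals, target_vals),
                             f"concat({expr},{name})", new_vals))
        beam = sorted(beam, reverse=True)[:beam_width]
        best = max(best, beam[0])
    return best[1]

# Toy goal: mediated-schema "address" = concat of city and state.
source = {"city": ["Miami,", "Boston,"], "state": ["FL", "MA"],
          "comments": ["Fantastic house", "Great location"]}
best = beam_search(source, ["Miami, FL", "Boston, MA"])
```

Since the token-set similarity cannot distinguish word order, the search may return the concat in either order.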


Multi-Strategy Search

  • Use a set of expert modules: L1, L2, ..., Ln

  • Each module

    • applies to only certain types of mediated-schema column

    • searches a small subspace

    • uses a cheap similarity measure to compare columns

  • Example

    • L1: text; concat; TF/IDF

    • L2: numeric; +, -, *, /; [Ho et al. 2000]

    • L3: address; concat; Naive Bayes

  • Search techniques

    • beam search as default

    • specialized, do not have to materialize columns


Multi-Strategy Search (cont’d)

  • Combine modules’ predictions & select the best one

  • Apply all applicable expert modules

L1: m11, m12, m13, ..., m1x

L2: m21, m22, m23, ..., m2y

L3: m31, m32, m33, ..., m3z

compute similarity

using all base learners

m11

m11, m12,

m21, m22,

m31,m32


Related Work

Recognizers + Schema + 1-1 Matching

Single Learner + 1-1 Matching

TRANSCM [Milo&Zohar98]

ARTEMIS [Castano&Antonellis99]

[Palopoli et al. 98]

CUPID [Madhavan et al. 01]

SEMINT [Li&Clifton94]

ILA [Perkowitz&Etzioni95]

DELTA [Clifton et al. 97]

Hybrid + 1-1 Matching

DELTA [Clifton et al. 97]

Multi-Strategy Learning

Learners + Recognizers

Schema + Data

1-1 + non 1-1 Matching

Schema + Data

1-1 + non 1-1 Matching

Sophisticated Data-Driven User Interaction

CLIO [Miller et al. 00], [Yan et al. 01]

LSD [Doan et al. 2000, 2001]

?


Summary

  • LSD:

    • uses multi-strategy learning to semi-automatically generate semantic mappings.

    • LSD is extensible and incorporates domain and user knowledge, and previous techniques.

    • Experimental results show the approach is very promising.

  • Future work and issues to ponder:

    • Accommodating more expressive languages: ontologies

    • Reuse of learned concepts from related domains.

    • Semantics?

  • Data management is a fertile area for Machine Learning research!



Mapping Maintenance

  • Ten months later ...

    • are the mappings still correct?

Mediated-schema M

Source-schema S

m1

m2

m3

Mediated-schema M’

Source-schema S’

m1

m2

m3


Information Extraction from Text

  • Extract data fragments from text documents

    • date, location, & victim’s name from a news article

  • Intensive research on free-text documents

  • Many documents do have substantial structure

    • XML pages, name card, tables, list

  • Each such document = a data source

    • structure forms a schema

    • only one data value per schema element

    • “real” data source has many data values per schema element

  • Ongoing research in the IE community


Contribution of Each Component

Average Matching Accuracy (%)

Without Name Learner

Without Naive Bayes

Without Whirl Learner

Without Constraint Handler

The complete LSD system


Exploiting Hierarchical Structure

  • Existing learners flatten out all structures

  • Developed XML learner

    • similar to the Naive Bayes learner

      • input instance = bag of tokens

    • differs in one crucial aspect

      • consider not only text tokens, but also structure tokens

<contact>

<name> Gail Murphy </name>

<firm> MAX Realtors </firm>

</contact>

<description>

Victorian house with a view. Name your price!

To see it, contact Gail Murphy at MAX Realtors.

</description>
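The key idea — represent an instance as a bag of both text tokens and structure tokens — can be sketched with a small tokenizer (the regex and token format are illustrative assumptions, not the XML learner's actual representation):

```python
import re

# Sketch of the XML learner's representation: a bag of text tokens plus
# structure (tag) tokens, so "Gail Murphy" inside <contact><name> is
# distinguishable from the same words in free text.

def xml_tokens(xml):
    tokens = []
    for m in re.finditer(r"<(/?)(\w[-\w]*)>|([^<>]+)", xml):
        closing, tag, text = m.group(1), m.group(2), m.group(3)
        if tag and not closing:
            tokens.append(f"<{tag}>")       # structure token
        elif text:
            tokens.extend(text.split())     # text tokens
    return tokens

doc = "<contact><name> Gail Murphy </name><firm> MAX Realtors </firm></contact>"
toks = xml_tokens(doc)
# ["<contact>", "<name>", "Gail", "Murphy", "<firm>", "MAX", "Realtors"]
```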


Domain Constraints

  • Impose semantic regularities on sources

    • verified using schema or data

  • Examples

    • a = address & b = address ⇒ a = b

    • a = house-id ⇒ a is a key

    • a = agent-info & b = agent-name ⇒ b is nested in a

  • Can be specified up front

    • when creating mediated schema

    • independent of any actual source schema


The Constraint Handler

  • Can specify arbitrary constraints

  • User feedback = domain constraint

    • ad-id = house-id

  • Extended to handle domain heuristics

    • a = agent-phone & b = agent-name ⇒ a & b are usually close to each other

Predictions from Meta-Learner

Domain Constraints

a = address & b = address ⇒ a = b

area: (address,0.7), (description,0.3)

contact-phone: (agent-phone,0.9), (description,0.1)

extra-info: (address,0.6), (description,0.4)

Candidate assignments scored by the product of meta-learner confidences:

area: address, contact-phone: agent-phone, extra-info: address → 0.7 × 0.9 × 0.6 = 0.378 (violates the address constraint)

area: address, contact-phone: agent-phone, extra-info: description → 0.7 × 0.9 × 0.4 = 0.252 (best valid assignment)

area: description, contact-phone: description, extra-info: description → 0.3 × 0.1 × 0.4 = 0.012
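This selection step can be sketched as a constraint-filtered enumeration over the slide's predictions: drop assignments where two columns both map to address, then keep the assignment with the highest product of confidences (exhaustive enumeration here; the real handler searches more cleverly):

```python
from itertools import product

# Predictions from the meta-learner, as on the slide.
predictions = {
    "area":          {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info":    {"address": 0.6, "description": 0.4},
}

def best_assignment(predictions):
    """Enumerate full assignments, enforce the hard constraint that two
    columns cannot both map to address, and maximize the score product."""
    cols = list(predictions)
    best, best_score = None, 0.0
    for labels in product(*(predictions[c] for c in cols)):
        if list(labels).count("address") > 1:   # a = address & b = address => a = b
            continue
        score = 1.0
        for col, label in zip(cols, labels):
            score *= predictions[col][label]
        if score > best_score:
            best, best_score = dict(zip(cols, labels)), score
    return best, best_score

assignment, score = best_assignment(predictions)
# score = 0.7 * 0.9 * 0.4 = 0.252, the best constraint-satisfying assignment
```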

