learning to map between structured representations of data n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Learning to Map between Structured Representations of Data PowerPoint Presentation
Download Presentation
Learning to Map between Structured Representations of Data

Loading in 2 Seconds...

play fullscreen
1 / 48

Learning to Map between Structured Representations of Data - PowerPoint PPT Presentation


  • 127 Views
  • Uploaded on

Learning to Map between Structured Representations of Data. AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002. Data Integration Challenge. Find houses with 2 bedrooms priced under 200K. New faculty member. realestate.com. homeseekers.com.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Learning to Map between Structured Representations of Data' - kaz


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
learning to map between structured representations of data

Learning to Map between Structured Representations of Data

AnHai Doan

Database & Data Mining Group

University of Washington, Seattle

Spring 2002

data integration challenge
Data Integration Challenge

Find houses with

2 bedrooms

priced under

200K

New faculty

member

realestate.com

homeseekers.com

homes.com

architecture of data integration system
Architecture of Data Integration System

Find houses with 2 bedrooms

priced under 200K

mediated schema

source schema 1

source schema 2

source schema 3

realestate.com

homeseekers.com

homes.com

semantic mappings between schemas
Semantic Mappings between Schemas

Mediated-schema

price agent-name address

1-1 mapping

complex mapping

homes.com

listed-price contact-name city state

320K Jane Brown Seattle WA

240K Mike Smith Miami FL

schema matching is ubiquitous
Schema Matching is Ubiquitous!
  • Fundamental problem in numerous applications
  • Databases
    • data integration
    • data translation
    • schema/view integration
    • data warehousing
    • semantic query processing
    • model management
    • peer data management
  • AI
    • knowledge bases, ontology merging, information gathering agents, ...
  • Web
    • e-commerce
    • marking up data using ontologies (Semantic Web)
why schema matching is difficult
Why Schema Matching is Difficult
  • Schema & data never fully capture semantics!
    • not adequately documented
  • Must rely on clues in schema & data
    • using names, structures, types, data values, etc.
  • Such clues can be unreliable
    • same names => different entities: area => location or square-feet
    • different names => same entity: area & address => location
  • Intended semantics can be subjective
    • house-style = house-description?
  • Cannot be fully automated, needs user feedback!
current state of affairs
Current State of Affairs
  • Finding semantic mappings is now a key bottleneck!
    • largely done by hand
    • labor intensive & error prone
    • data integration at GTE [Li&Clifton, 2000]
      • 40 databases, 27000 elements, estimated time: 12 years
  • Will only be exacerbated
    • data sharing becomes pervasive
    • translation of legacy data
  • Need semi-automatic approaches to scale up!
  • Many current research projects
    • Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U of Leipzig, ...
    • AI: Stanford, Karlsruhe University, NEC Japan, ...
goals and contributions
Goals and Contributions
  • Vision for schema-matching tools
    • learn from previous matching activities
    • exploit multiple types of information
    • incorporate domain integrity constraints
    • handle user feedback
  • My contributions: solution for semi-automatic schema matching
    • can match relational schemas, DTDs, ontologies, ...
    • discovers both 1-1 & complex mappings
    • highly modular & extensible
    • achieves high matching accuracy (66 -- 97%) on real-world data
road map
Road Map
  • Introduction
  • Schema matching [SIGMOD-01]
    • 1-1 mappings for data integration
    • LSD(Learning Source Description) system
      • learns from previous matching activities
      • employs multi-strategy learning
      • exploits domain constraints & user feedback
  • Creating complex mappings [Tech. Report-02]
  • Ontology matching [WWW-02]
  • Conclusions
schema matching for data integration the lsd approach
Schema Matching for Data Integration:the LSD Approach

Suppose user wants to integrate 100 data sources

1. User

  • manually creates mappings for a few sources, say 3
  • shows LSD these mappings

2. LSDlearns from the mappings

3. LSDpredicts mappings for remaining 97 sources

learning from the manual mappings
Learning from the Manual Mappings

Mediated schema

price agent-name agent-phone office-phone description

If “office” occurs in name

=> office-phone

listed-pricecontact-namecontact-phoneofficecomments

Schema of realestate.com

realestate.com

listed-price contact-name contact-phone office comments

$250K James Smith (305) 729 0831 (305) 616 1822 Fantastic house

$320K Mike Doan (617) 253 1429 (617) 112 2315 Great location

If “fantastic” & “great” occur frequently in data instances

=> description

homes.com

sold-at contact-agent extra-info

$350K (206) 634 9435 Beautiful yard

$230K (617) 335 4243 Close to Seattle

must exploit multiple types of information
Must Exploit Multiple Types of Information!

Mediated schema

price agent-name agent-phone office-phone description

If “office” occurs in name

=> office-phone

listed-pricecontact-namecontact-phoneofficecomments

Schema of realestate.com

realestate.com

listed-price contact-name contact-phone office comments

$250K James Smith (305) 729 0831 (305) 616 1822 Fantastic house

$320K Mike Doan (617) 253 1429 (617) 112 2315 Great location

If “fantastic” & “great” occur frequently in data instances

=> description

homes.com

sold-at contact-agent extra-info

$350K (206) 634 9435 Beautiful yard

$230K (617) 335 4243 Close to Seattle

multi strategy learning
Multi-Strategy Learning
  • Use a set of base learners
    • each exploits well certain types of information
  • To match a schema element of a new source
    • apply base learners
    • combine their predictions using a meta-learner
  • Meta-learner
    • uses training sources to measure base learner accuracy
    • weighs each learner based on its accuracy
base learners

Observed label

(X1,C1)

(X2,C2)

...

(Xm,Cm)

Object

Classification model

(hypothesis)

Training

examples

Base Learners
  • Training
  • Matching
  • Name Learner
    • training: (“location”, address) (“contactname”, name)
    • matching: agent-name => (name,0.7),(phone,0.3)
  • Naive Bayes Learner
    • training: (“Seattle, WA”,address) (“250K”,price)
    • matching: “Kent, WA” => (address,0.8),(name,0.2)

labels weighted by confidence score

X

the lsd architecture
The LSD Architecture

Training Phase

Matching Phase

Mediated schema

Source schemas

Training data

for base learners

Base-Learner1 .... Base-Learnerk

Meta-Learner

Base-Learner1

Base-Learnerk

Predictions for instances

Hypothesis1

Hypothesisk

Prediction Combiner

Domain

constraints

Predictions for elements

Constraint Handler

Weights for

Base Learners

Meta-Learner

Mappings

training the base learners

Naive Bayes Learner

Name Learner

(“Miami, FL”, address)

(“$250K”, price)

(“James Smith”, agent-name)

(“(305) 729 0831”, agent-phone)

(“(305) 616 1822”, office-phone)

(“Fantastic house”, description)

(“Boston,MA”, address)

(“location”, address)

(“price”, price)

(“contact name”, agent-name)

(“contact phone”, agent-phone)

(“office”, office-phone)

(“comments”, description)

Training the Base Learners

Mediated schema

addressprice agent-name agent-phone office-phone description

realestate.com

location price contact-name contact-phone office comments

Miami, FL $250K James Smith (305) 729 0831 (305) 616 1822 Fantastic house

Boston, MA $320K Mike Doan (617) 253 1429 (617) 112 2315 Great location

meta learner stacking wolpert 92 ting witten99
Meta-Learner: Stacking[Wolpert 92,Ting&Witten99]
  • Training
    • uses training data to learn weights
    • one for each (base-learner,mediated-schema element) pair
    • weight (Name-Learner,address) = 0.2
    • weight (Naive-Bayes,address) = 0.8
  • Matching: combine predictions of base learners
    • computes weighted average of base-learner confidence scores

area

Name Learner

Naive Bayes

(address,0.4)

(address,0.9)

Seattle, WA

Kent, WA

Bend, OR

Meta-Learner

(address, 0.4*0.2 + 0.9*0.8 = 0.8)

the lsd architecture1
The LSD Architecture

Training Phase

Matching Phase

Mediated schema

Source schemas

Training data

for base learners

Base-Learner1 .... Base-Learnerk

Meta-Learner

Base-Learner1

Base-Learnerk

Predictions for instances

Hypothesis1

Hypothesisk

Prediction Combiner

Domain

constraints

Predictions for elements

Constraint Handler

Weights for

Base Learners

Meta-Learner

Mappings

applying the learners
Applying the Learners

homes.com schema

area sold-at contact-agent extra-info

area

Name Learner

Naive Bayes

(address,0.8), (description,0.2)

(address,0.6), (description,0.4)

(address,0.7), (description,0.3)

Meta-Learner

Seattle, WA

Kent, WA

Bend, OR

Name Learner

Naive Bayes

Meta-Learner

Prediction-Combiner

homes.com

(address,0.7), (description,0.3)

sold-at

(price,0.9), (agent-phone,0.1)

contact-agent

(agent-phone,0.9), (description,0.1)

extra-info

(address,0.6), (description,0.4)

domain constraints
Domain Constraints
  • Encode user knowledge about domain
  • Specified by examining mediated schema
  • Examples
    • at most one source-schema element can match address
    • if a source-schema element matches house-id then it is a key
    • avg-value(price) > avg-value(num-baths)
  • Given a mapping combination
    • can verify if it satisfies a given constraint

area: address

sold-at: price

contact-agent: agent-phone

extra-info: address

the constraint handler
The Constraint Handler
  • Searches space of mapping combinations efficiently
  • Can handle arbitrary constraints
  • Also used to incorporate user feedback
    • sold-at does not match price

Predictions from Prediction Combiner

Domain Constraints

At most one element matches address

area: (address,0.7), (description,0.3)

sold-at: (price,0.9), (agent-phone,0.1)

contact-agent: (agent-phone,0.9), (description,0.1)

extra-info: (address,0.6), (description,0.4)

0.3

0.1

0.1

0.4

0.0012

area: address

sold-at: price

contact-agent: agent-phone

extra-info: address

0.7

0.9

0.9

0.6

0.3402

area: address

sold-at: price

contact-agent: agent-phone

extra-info: description

0.7

0.9

0.9

0.4

0.2268

the current lsd system
The Current LSD System
  • Can also handle data in XML format
    • matches XML DTDs
  • Base learners
    • Naive Bayes[Duda&Hart-93, Domingos&Pazzani-97]
      • exploits frequencies of words & symbols
    • WHIRL Nearest-Neighbor Classifier[Cohen&Hirsh KDD-98]
      • employs information-retrieval similarity metric
    • Name Learner [SIGMOD-01]
      • matches elements based on their names
    • County-Name Recognizer [SIGMOD-01]
      • stores all U.S. county names
    • XML Learner [SIGMOD-01]
      • exploits hierarchical structure of XML data
empirical evaluation
Empirical Evaluation
  • Four domains
    • Real Estate I & II, Course Offerings, Faculty Listings
  • For each domain
    • created mediated schema & domain constraints
    • chose five sources
    • extracted & converted data into XML
    • mediated schemas: 14 - 66 elements, source schemas: 13 - 48
  • Ten runs for each domain, in each run:
    • manually provided 1-1 mappings for 3 sources
    • asked LSD to propose mappings for remaining 2 sources
    • accuracy = % of 1-1 mappings correctly identified
high matching accuracy
High Matching Accuracy

Average Matching Acccuracy (%)

LSD’s accuracy: 71 - 92%

Best single base learner: 42 - 72%

+ Meta-learner: + 5 - 22%

+ Constraint handler: + 7 - 13%

+ XML learner: + 0.8 - 6%

contribution of schema vs data
Contribution of Schema vs. Data

LSD with only schema info.

LSD with only data info.

Complete LSD

Average matching accuracy (%)

More experiments in [Doan et al. SIGMOD-01]

lsd summary
LSD Summary
  • LSD
    • learns from previous matching activities
    • exploits multiple types of information
      • by employing multi-strategy learning
    • incorporates domain constraints & user feedback
    • achieves high matching accuracy
  • LSD focuses on 1-1 mappings
  • Next challenge: discover more complex mappings!
    • COMAP (Complex Mapping) system
the comap approach

320K 53211 2 1 Seattle 98105

240K 11578 1 1 Miami 23591

The COMAP Approach
  • For each mediated-schema element
    • searches space of all mappings
    • finds a small set of likely mapping candidates
    • uses LSD to evaluate them
  • To search efficiently
    • employs a specialized searcher for each element type
    • Text Searcher, Numeric Searcher, Category Searcher, ...

Mediated-schema

price num-baths address

homes.com

listed-price agent-id full-baths half-baths city zipcode

the comap architecture doan et al 02
The COMAP Architecture [Doan et al., 02]

Mediated schema

Source schema + data

Searcher1

Searcher2

Searcherk

Mapping candidates

Base-Learner1 .... Base-Learnerk

Meta-Learner

Prediction Combiner

Domain

constraints

Constraint Handler

LSD

Mappings

an example text searcher
An Example: Text Searcher
  • Beam search in space of all concatenation mappings
  • Example: find mapping candidates for address
  • Best mapping candidates for address
    • (agent-id,0.7), (concat(agent-id,city),0.75), (concat(city,zipcode),0.9)

Mediated-schema

price num-baths address

homes.com

listed-price agent-id full-baths half-baths city zipcode

320K 532a 2 1 Seattle 98105

240K 115c 1 1 Miami 23591

concat(agent-id,city)

concat(agent-id,zipcode)

concat(city,zipcode)

532a Seattle

115c Miami

532a 98105

115c 23591

Seattle 98105

Miami 23591

empirical evaluation1
Empirical Evaluation
  • Current COMAP system
    • eight searchers
  • Three real-world domains
    • in real estate & product inventory
    • mediated schema: 6 -- 26 elements, source schema: 16 -- 31
  • Accuracy: 62 -- 97%
  • Sample discovered mappings
    • agent-name = concat(first-name,last-name)
    • area = building-area / 43560
    • discount-cost = (unit-price * quantity) * (1 - discount)
road map1
Road Map
  • Introduction
  • Schema matching
    • LSD system
  • Creating complex mappings
    • COMAP system
  • Ontology matching
    • GLUE system
  • Conclusions
ontology matching
Ontology Matching
  • Increasingly critical for
    • knowledge bases, Semantic Web
  • An ontology
    • concepts organized into a taxonomy tree
    • each concept has
      • a set of attributes
      • a set of instances
    • relations among concepts
  • Matching
    • concepts
    • attributes
    • relations

CS Dept. US

Entity

Undergrad

Courses

Grad

Courses

People

Faculty

Staff

Assistant

Professor

Associate

Professor

Professor

name: Mike Burns

degree: Ph.D.

matching taxonomies of concepts

CS Dept. US

Entity

Undergrad

Courses

Grad

Courses

People

Faculty

Staff

Assistant

Professor

Associate

Professor

Professor

Matching Taxonomies of Concepts

CS Dept. Australia

Entity

Courses

Staff

AcademicStaff

TechnicalStaff

Senior

Lecturer

Lecturer

Professor

constraints in taxonomy matching
Constraints in Taxonomy Matching
  • Domain-dependent
    • at most one node matches department-chair
    • a node that matches professor can not be a child of a node that matches assistant-professor
  • Domain-independent
    • two nodes match if parents & children match
    • if all children of X matches Y, then X also matches Y
    • Variations have been exploited in many restricted settings[Melnik&Garcia-Molina,ICDE-02], [Milo&Zohar,VLDB-98],[Noy et al., IJCAI-01], [Madhavan et al., VLDB-01]
  • Challenge: find a general & efficient approach
solution relaxation labeling
Solution: Relaxation Labeling
  • Relaxation labeling [Hummel&Zucker, 83]
    • applied to graph labeling in vision, NLP, hypertext classification
    • finds best label assignment, given a set of constraints
    • starts with initial label assignment
    • iteratively improves labels, using constraints
  • Standard relax. labeling not applicable
    • extended it in many ways [Doan et al., W W W-02]
  • Experiments
    • three real-world domains in course catalog & company listings
    • 30 -- 300 nodes / taxonomy
    • accuracy 66 -- 97% vs. 52 -- 83% of best base learner
    • relaxation labeling very fast (under few seconds)
related work
Related Work

Hand-crafted rules

Exploit schema 1-1 mapping

Single learner

Exploit data

1-1 mapping

TRANSCM [Milo&Zohar98]

ARTEMIS [Castano&Antonellis99]

[Palopoli et al. 98]

CUPID [Madhavan et al. 01]

PROMPT [Noy et al. 00]

SEMINT [Li&Clifton94]

ILA [Perkowitz&Etzioni95]

DELTA [Clifton et al. 97]

Learners + rules, use multi-strategy learning

Exploit schema + data

1-1 + complex mapping

Exploit domain constraints

Rules

Exploit data

1-1 + complex mapping

CLIO [Miller et. al., 00]

[Yan et al. 01]

LSD [Doan et al., SIGMOD-01]

COMAP [Doan et al. 2002, submitted]

GLUE [Doan et al., WWW-02]

future work
Future Work
  • Learning source descriptions
    • formal semantics for mapping
    • query capabilities, source schema, scope, reliability of data, ...
  • Dealing with changes in source description
  • Matching objects across sources
  • More sophisticated user feedback
  • Focus on distributed information management systems
    • data integration, web-service integration, peer data management
    • goal: significantly reduce complexity of construction & maintenance
conclusions
Conclusions
  • Efficiently creating semantic mappings is critical
  • Developed solution for semi-automatic schema matching
    • learns from previous matching activities
    • can match relational schemas, DTDs, ontologies, ...
    • discovers both 1-1 & complex mappings
    • highly modular & extensible
    • achieves high matching accuracy
  • Made contributions to machine learning
    • developed novel method to classify XML data
    • extended relaxation labeling
training the meta learner
Training the Meta-Learner
  • For address

Name Learner

Naive Bayes

True Predictions

Extracted XML Instances

<location> Miami, FL</>

<listed-price> $250,000</>

<area> Seattle, WA </>

<house-addr>Kent, WA</>

<num-baths>3</>

...

0.5 0.8 1

0.4 0.3 0

0.3 0.9 1

0.6 0.8 1

0.3 0.3 0

... ... ...

Least-SquaresLinear Regression

Weight(Name-Learner,address) = 0.1

Weight(Naive-Bayes,address) = 0.9

sensitivity to amount of available data
Sensitivity to Amount of Available Data

Average matching accuracy (%)

Number of data listings per source (Real Estate I)

contribution of each component
Contribution of Each Component

Average Matching Acccuracy (%)

Without Name Learner

Without Naive Bayes

Without Whirl Learner

Without Constraint Handler

The complete LSD system

exploiting hierarchical structure
Exploiting Hierarchical Structure
  • Existing learners flatten out all structures
  • Developed XML learner
    • similar to the Naive Bayes learner
      • input instance = bag of tokens
    • differs in one crucial aspect
      • consider not only text tokens, but also structure tokens

<contact>

<name> Gail Murphy </name>

<firm> MAX Realtors </firm>

</contact>

<description>

Victorian house with a view. Name your price!

To see it, contact Gail Murphy at MAX Realtors.

</description>

reasons for incorrect matchings
Reasons for Incorrect Matchings
  • Unfamiliarity
    • suburb
    • solution: add a suburb-name recognizer
  • Insufficient information
    • correctly identified general type, failed to pinpoint exact type
    • agent-name phoneRichard Smith(206) 234 5412
    • solution: add a proximity learner
  • Subjectivity
    • house-style = description?Victorian Beautiful neo-gothic houseMexican Great location
evaluate mapping candidates
Evaluate Mapping Candidates
  • For address, Text Searcher returns
    • (agent-id,0.7)
    • (concat(agent-id,city),0.8)
    • (concat(city,zipcode),0.75)
  • Employ multi-strategy learning to evaluate mappings
  • Example: (concat(agent-id,city),0.8)
    • Naive Bayes Learner: 0.8
    • Name Learner: “address” vs. “agent id city” 0.3
    • Meta-Learner: 0.8 * 0.7 + 0.3 * 0.3 = 0.65
  • Meta-Learner returns
    • (agent-id,0.59)
    • (concat(agent-id,city),0.65)
    • (concat(city,zipcode),0.70)
relaxation labeling
Relaxation Labeling
  • Applied to similar problems in
    • vision, NLP, hypertext classification

People

Dept U.S.

Dept Australia

Courses

Courses

Courses

Courses

People

Staff

Faculty

Staff

Acad.Staff

Tech.Staff

Faculty

Staff

relaxation labeling for taxonomy matching
Relaxation Labeling for Taxonomy Matching
  • Must define
    • neighborhood of a node
    • k features of neighborhood
    • how to combineinfluence of features
  • Algorithm
    • init: for each pair <N,L>, compute
    • loop: for each pair <N,L>, re-compute

Acad. Staff: Faculty

Tech. Staff: Staff

Staff = People

Neighborhood configuration

relaxation labeling for taxonomy matching1
Relaxation Labeling for Taxonomy Matching
  • Huge number of neighborhood configurations!
    • typically neighborhood = immediate nodes
    • here neighborhood can be entire graph100 nodes, 10 labels => configurations
  • Solution
    • label abstraction + dynamic programming
    • guarantee quadratic time for a broad range of domain constraints
  • Empirical evaluation
    • GLUE system [Doan et. al., WWW-02]
    • three real-world domains
    • 30 -- 300 nodes / taxonomy
    • high accuracy 66 -- 97% vs. 52 -- 83% of best base learner
    • relaxation labeling very fast, finished in several seconds