
iMAP: Discovering Complex Semantic Matches Between Database Schemas

Ohad Edry

January 2009

Seminar in Databases


Motivation

  • Consider a union of the databases of two banks.

  • We need to generate a mapping between the two schemas.

[Figure: Bank A tables and Bank B tables]


Introduction

  • Semantic mappings specify the relationships between data stored in disparate sources.

    • A mapping relates attributes of the target schema to attributes of the source schema, according to their semantics.



Motivation – Example continue

[Figure: Bank A tables and Bank B tables]



Motivation – Example continue

[Figure: the semantic mapping between Bank A tables and Bank B tables]


Introduction

  • Most of the work in this field has focused on the matching process.

  • Matches can be split into two types:

    • 1 – 1 matching.

    • Complex matching – a combination of attributes in one schema corresponds to a combination in the other schema.

  • Match candidate – any matching of attributes from the source and target schemas.



Motivation – Example continue

[Figure: the semantic mapping between Bank A tables and Bank B tables, showing a 1-1 matching candidate and a complex matching candidate]



Introduction - examples:

  • Example 1:

    1 – 1 matching: Name = Student, Address = Location, Phone = Cellular.

  • Example 2:

    Complex matching: Product Price = Price*(1-Discount).

[Figure: Company A and Company B tables]


Difficulties in Generating Matchings

  • Finding the matches is difficult because:

    • Finding complex matches is not trivial at all.

      • How should the system know that Product Price = Price*(1-Discount)?

    • The number of candidates for complex matches is large.

    • Sometimes tables must be joined to obtain a match such as Product Price = Price*(1-Discount).


Main Parts of the iMAP System

  • Generating match candidates

  • Pruning match candidates

    • by exploiting domain knowledge

  • Explaining match predictions

    • provides an explanation for each selected predicted match

    • makes the system semi-automatic.


iMAP System Architecture

  • Consists of three main modules (a pipeline sketch follows):

    • Match Generator – generates the match candidates, using specialized searchers, for each target-schema attribute over the source schema.

    • Similarity Estimator – generates a matrix that stores the similarity score of each (target attribute, match candidate) pair.

    • Match Selector – examines the score matrix and outputs the best matches under certain conditions.
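
To make the data flow concrete, here is a minimal Python sketch of the three-module pipeline. All names (the searcher/evaluator objects and their methods) are illustrative assumptions, not iMAP's actual API:

```python
# A hedged sketch of iMAP's three-module pipeline; method names
# (search/score) and data shapes are assumptions for illustration only.

def imap_pipeline(target_attrs, source_schema, searchers, evaluators, constraints):
    # 1. Match Generator: every searcher proposes candidates for each target attribute.
    candidates = {
        t: [c for s in searchers for c in s.search(t, source_schema)]
        for t in target_attrs
    }
    # 2. Similarity Estimator: a (target attribute, candidate) -> score matrix,
    #    here simply the average of the evaluator scores.
    scores = {
        (t, c): sum(e.score(t, c) for e in evaluators) / len(evaluators)
        for t, cands in candidates.items() for c in cands
    }
    # 3. Match Selector: best constraint-respecting candidate per target attribute.
    selected = {}
    for t, cands in candidates.items():
        legal = [c for c in cands if all(ok(t, c) for ok in constraints)]
        if legal:
            selected[t] = max(legal, key=lambda c, t=t: scores[(t, c)])
    return selected
```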


iMAP System Architecture – cont.

For each attribute t of the target schema T, iMAP generates match candidates from the source schema S.

The Similarity Estimator receives the match candidates and outputs a similarity matrix.

The Match Selector receives the similarity matrix and outputs the final match candidates.


Part 1: Match Generation – Searchers

  • The key to match generation is to SEARCH through the space of possible match candidates.

    • Search space – all attributes and data in the source schemas.

  • Searchers work based on knowledge of operators and attribute types (numeric, textual, etc.) and on heuristic methods.


The Internals of Searchers

  • Search strategy

    • The large space is explored using standard beam search.

  • Match evaluation

    • Each candidate receives a score approximating its distance from the target.

  • Termination condition

    • The search must be stopped early because the search space is large.


The Internals of Searchers – Example

  • The search runs in iterations, each keeping only the k best candidates, e.g.:

    • Product Price = Price*(1-Discount)

    • Product Price = Product ID

    • k. …

  • Let MAXi denote the best candidate score at iteration i.

  • Stop when MAXi - MAXi+1 < delta, i.e. when the improvement between consecutive iterations is negligible.

  • Return the first k candidates. A beam-search sketch follows.
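
A minimal beam-search sketch of this strategy, assuming candidates are hashable and that `expand` and `score` are hooks supplied by the searcher (both are assumptions of this sketch, not of the published system):

```python
def beam_search(seeds, expand, score, k=10, delta=1e-3, max_iters=50):
    """Keep the k best candidates per iteration; stop when the best score
    improves by less than delta (the MAXi - MAXi+1 < delta rule)."""
    beam = sorted(seeds, key=score, reverse=True)[:k]
    best_prev = score(beam[0])
    for _ in range(max_iters):
        # Expand every candidate in the beam, then re-rank and prune to k.
        pool = set(beam) | {c2 for c in beam for c2 in expand(c)}
        beam = sorted(pool, key=score, reverse=True)[:k]
        best_now = score(beam[0])
        if best_now - best_prev < delta:  # best score has stopped improving
            break
        best_prev = best_now
    return beam  # the first k candidates
```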


The Internals of Searchers – Join Paths

  • Matches over join paths are found in two steps (sketched below):

    • First step – generate join paths between tables: Join(T1, T2).

    • Second step – the search process uses the join paths.

  • Example: Product Price = Price*(1-Discount).

[Figure: Company A and Company B tables]
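
A sketch of the two steps, assuming the source tables are pandas DataFrames and that a shared column name signals a joinable key (both are assumptions of this sketch):

```python
from itertools import combinations

def join_paths(tables):
    """Step 1: propose Join(T1, T2) for every pair of tables sharing a column."""
    paths = []
    for t1, t2 in combinations(tables, 2):
        for key in set(t1.columns) & set(t2.columns):
            paths.append((t1, t2, key))
    return paths

def search_over_joins(target_attr, tables, searchers):
    """Step 2: run the searchers over the attributes of each joined table."""
    results = []
    for t1, t2, key in join_paths(tables):
        joined = t1.merge(t2, on=key)  # the join path, materialized
        for s in searchers:
            results.extend(s.search(target_attr, joined))
    return results
```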


Implemented searchers in iMAP

  • iMAP contains the following searchers:

    • Text

    • Numeric

    • Category

    • Schema Mismatch

    • Unit Conversion

    • Date

    • Overlap versions of Text, Numeric, Category, Schema Mismatch, Unit Conversion


Implemented searchers – Text Searcher example

  • Text Searcher:

    Purpose: finds match candidates that are concatenations of text attributes.

    Method:

  • Target attribute: Name.

  • Search space: attributes in the source schemas that have textual properties.

  • The searcher searches the space for single attributes or concatenations of attributes (a sketch follows).
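
An illustrative version of this search over concatenations (the attribute names are hypothetical):

```python
from itertools import permutations

def text_candidates(textual_attrs, max_arity=2):
    """Yield single attributes and ordered concatenations up to max_arity."""
    for a in textual_attrs:
        yield (a,)
    for r in range(2, max_arity + 1):
        yield from permutations(textual_attrs, r)

# Candidates for target attribute Name over three hypothetical source attributes:
for cand in text_candidates(['first-name', 'last-name', 'city']):
    print('concat(' + ', '.join(cand) + ')' if len(cand) > 1 else cand[0])
```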


Implemented searchers – Numeric Searcher example

  • Numeric Searcher:

    Purpose: finds the best matches for numeric attributes.

  • Issues:

    • Computing the similarity score of a complex match

      • by comparing value distributions (a sketch follows)

    • Types of matches

      • +, -, *, /

      • two columns, e.g. dim1*dim2 = size
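
One way to score such a candidate is to compare value distributions. This sketch uses a two-sample Kolmogorov-Smirnov test as a stand-in similarity measure, which is our choice for illustration, not necessarily the measure used in the paper:

```python
import numpy as np
from scipy.stats import ks_2samp

def numeric_match_score(target_values, candidate_values):
    """Higher is better: 1 - KS distance between the two value distributions."""
    result = ks_2samp(target_values, candidate_values)
    return 1.0 - result.statistic

# Toy check of dim1*dim2 = size with synthetic data.
rng = np.random.default_rng(0)
dim1 = rng.uniform(1, 10, 500)
dim2 = rng.uniform(1, 10, 500)
size = rng.uniform(1, 10, 500) * rng.uniform(1, 10, 500)  # independent sample
print(numeric_match_score(size, dim1 * dim2))  # high: same distribution
print(numeric_match_score(size, dim1))         # lower: wrong distribution
```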


Implemented searchers in iMAP – cont.

  • Category Searcher:

    Purpose: finds matches between categorical attributes in the source and target schemas.

  • Schema Mismatch Searcher:

    Purpose: relates the data of one schema to the schema elements of the other. Such mismatches occur very often.

  • Unit Conversion Searcher:

    Purpose: finds matches between attributes expressed in different units.

  • Date Searcher:

    Purpose: finds complex matches for date attributes.


Part 2: Similarity Estimator

  • Receives from the Match Generator the candidate matches, each scored by the searcher that produced it.

  • Problem: each searcher may score on a different scale.

  • Solution: compute a more accurate final score for each match by using additional types of information.

    • The iMAP system uses evaluator modules:

      • Name-based evaluator – computes a score based on the similarity of attribute names.

      • Naive Bayes evaluator.

Why not perform this evaluation during the search phase?

It would be very expensive!


Module example – Naive Bayes evaluator

  • Consider the match agent-address = location.

  • Building the model: data instances of the target attribute are labeled positive; all other data is labeled negative.

  • A Naive Bayes classifier learns the model.

  • The trained classifier is applied to the source attribute's data.

  • Each data instance receives a score.

  • The average over all scores is returned as the result. A sketch follows.
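
A minimal sketch of such an evaluator using scikit-learn; the library choice and the character-n-gram features are our assumptions, iMAP has its own classifier implementation:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def naive_bayes_score(target_values, other_values, source_values):
    """Train on target instances (positive) vs. other instances (negative),
    then return the mean positive probability over the source values."""
    texts = list(target_values) + list(other_values)
    labels = [1] * len(target_values) + [0] * len(other_values)
    vec = CountVectorizer(analyzer='char_wb', ngram_range=(2, 3))
    clf = MultinomialNB().fit(vec.fit_transform(texts), labels)
    probs = clf.predict_proba(vec.transform(source_values))[:, 1]
    return probs.mean()  # the score for, e.g., agent-address = location

# Hypothetical data: addresses are positive, names are negative.
print(naive_bayes_score(
    target_values=['12 Main St', '3 Oak Ave', '77 Elm Rd'],
    other_values=['Alice Smith', 'Bob Jones', 'Carol Lee'],
    source_values=['9 Pine Ct', '41 Lake Dr'],
))
```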


Part 3: Match Selector

  • Receives the scored match candidates from the Similarity Estimator.

  • Problem: these matches may violate certain domain integrity constraints.

    • For example: mapping two source attributes to the same target attribute.

  • Solution: a set of domain constraints

    • defined by domain experts or users.


Constraint Example

  • Constraint: Price and club-members-price are unrelated.

  • The Match Selector deletes the violating match candidate (see the sketch below).

The Match Selector receives a list of candidates, among them:

k. Product Price = Price + club-members-price
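
A sketch of this check; the constraint store and attribute names are illustrative:

```python
# Pairs of source attributes declared unrelated by a domain expert.
UNRELATED = {frozenset({'price', 'club-members-price'})}

def violates(candidate_attrs):
    """True if the candidate combines attributes declared unrelated."""
    attrs = set(candidate_attrs)
    return any(pair <= attrs for pair in UNRELATED)

candidates = [
    ('product-price', ('price',)),
    ('product-price', ('price', 'club-members-price')),  # violates the constraint
]
print([(t, c) for t, c in candidates if not violates(c)])
# Only ('product-price', ('price',)) survives; the other is deleted.
```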


Exploiting Domain Knowledge

  • The iMAP system uses four different types of knowledge:

    • Domain constraints

    • Past matches

    • Overlap data

    • External data

  • iMAP uses this knowledge at all levels of the system, and as early as it can during match generation.


Types of knowledge

  • Domain constraints

    • Three cases:

      • "Name and ID are unrelated" – attributes of the source schema are unrelated

        • enforced by the searchers

      • "Account < 10000" – a constraint on a single attribute t

        • enforced by the Similarity Estimator and the searchers

      • "Account and ID are unrelated" – attributes of the target schema are unrelated

        • enforced by the Match Selector

[Figure: source and target schemas]


Types of knowledge – cont.

  • Past complex matches

    • The Numeric Searcher can reuse a past expression as a template (sketched below):

      • Price = Price*(1-Discount) generates VARIABLE*(1-VARIABLE)

  • External data – using external sources to learn about attributes and their data.

    • Given a target attribute and a useful feature of that attribute, iMAP learns about the value distribution.

      • Example: the number of cities in a state.
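
A sketch of how a past match could be generalized into such a template (the regex approach is our illustration):

```python
import re

def to_template(expression, attributes):
    """Replace known attribute names in a past match with VARIABLE."""
    pattern = r'\b(' + '|'.join(map(re.escape, attributes)) + r')\b'
    return re.sub(pattern, 'VARIABLE', expression)

print(to_template('Price*(1-Discount)', ['Price', 'Discount']))
# -> VARIABLE*(1-VARIABLE), reusable by the Numeric Searcher as a template
```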


Types of knowledge – cont.

  • Overlap data – provides extra information for the mapping process.

    • iMAP contains searchers that can exploit overlap data.

  • Overlap Text, Category & Schema Mismatch searchers

    • Example: S and T share a state listing.

    • Candidate matches: city = state, country = state.

    • Re-evaluating on the shared rows: city = state scores 0 and country = state scores 1 (sketched below).

  • Overlap Numeric Searcher – using the overlap data and an equation discovery system (LAGRAMGE), the best arithmetic expression for t is found.
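
When rows are shared, a candidate can be scored by direct value agreement instead of distribution similarity. A toy sketch mirroring the slide's example (the row IDs and values are made up):

```python
def overlap_score(shared_ids, target_col, candidate_col):
    """Fraction of shared rows where the candidate value equals the target value."""
    agree = sum(1 for i in shared_ids if target_col[i] == candidate_col[i])
    return agree / len(shared_ids)

# Rows keyed by shared listing IDs; country agrees with state, city does not.
target_state   = {1: 'WA', 2: 'OR'}
source_city    = {1: 'Seattle', 2: 'Portland'}
source_country = {1: 'WA', 2: 'OR'}
print(overlap_score([1, 2], target_state, source_city))     # 0.0
print(overlap_score([1, 2], target_state, source_country))  # 1.0
```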


Generating Explanations

  • One goal is to provide a design environment in which the user inspects the matches predicted by the system and modifies them manually, giving the system feedback.

  • The system uses complex algorithms, so it must explain its matches to the user.

  • Explanations are good for the user as well:

    • correct matches quickly

    • tell the system where it made a mistake.


Generating Explanations – so, what do you want to know about the matches?

  • The iMAP system defines three main questions:

    • Explain an existing match – why is a certain match X present in the output of iMAP? Why did the match survive the whole process?

    • Explain an absent match – why is a certain match Y not present in the output of iMAP?

    • Explain match ranking – why is match X ranked higher than match Y?

  • Each of these questions can be asked about each module of iMAP.

    • Questions can be reformulated recursively against the underlying components.


Generating Explanations – Example

  • Suppose we have two real-estate schemas.

  • iMAP produces the ranked matches:

    • (1) List-price = price*(1+monthly-fee-rate)

    • (2) List-price = price

iMAP explanation: both matches were generated by the Numeric Searcher, and the Similarity Estimator agreed with the ranking.


Generating Explanations – Example

  • Suppose we have two real-estate schemas.

  • The current order:

  • List-price = price*(1+monthly-fee-rate)

  • List-price = price

  • The Match Selector has two constraints: (1) month-posted = monthly-fee-rate, (2) month-posted and price do not share common attributes.

Given these constraints, the List-price = price match is selected by the Match Selector.


Generating Explanations – Example

  • Suppose we have two real-estate schemas.

  • The current order:

  • List-price = price

  • List-price = price*(1+monthly-fee-rate)

  • iMAP explains that the source of month-posted = monthly-fee-rate is the Date Searcher.

The user corrects iMAP: monthly-fee-rate is not a date attribute.


Generating Explanations – Example

  • Suppose we have two real-estate schemas.

  • List-price = price*(1+monthly-fee-rate) is again the chosen match.

  • The final order:

  • List-price = price*(1+monthly-fee-rate)

  • List-price = price


Example cont. – generated dependency graph

The dependency graph is small:

  • searchers produce only the k best matches

  • iMAP goes through only three stages



What do you want to know about the matches?

  • Why is a certain match X present in the output of iMAP?

    • Returns the part of the graph that describes the match.

  • Why is match X ranked higher than match Y?

    • Returns the part of the graph that compares the two matches.

  • Why is a certain match Y not present in the output of iMAP?

    • If the match was eliminated during the process, the component responsible for the elimination explains why.

    • Otherwise, iMAP asks the searchers to check whether they can generate the match and to explain why it was not generated. (A toy dependency-graph sketch follows.)
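
A toy sketch of how explanations can be read off a dependency graph; the nodes and edges here are invented for illustration:

```python
# Each decision points to the evidence it depends on.
graph = {
    'List-price = price*(1+monthly-fee-rate) ranked #1':
        ['Numeric Searcher score', 'Similarity Estimator ranking'],
    'Numeric Searcher score': ['beam search over numeric candidates'],
    'Similarity Estimator ranking': ['name-based evaluator', 'Naive Bayes evaluator'],
}

def explain(node, graph, depth=0):
    """Print the slice of the dependency graph that supports `node`."""
    print('  ' * depth + node)
    for evidence in graph.get(node, []):
        explain(evidence, graph, depth + 1)

explain('List-price = price*(1+monthly-fee-rate) ranked #1', graph)
```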




Evaluating iMAP on real-world domains

  • iMAP was evaluated on four real-world domains:

  • For the Cricket domain, two independently developed databases were used.

  • For the other three, one real-world source database was used, with a target schema created by volunteers.

  • The set includes databases with overlapping data and databases with disjoint data.


Evaluating iMAP on real-world domains – cont.

  • Data processing: removing values such as "unknown" and adding the most obvious constraints.

  • Experiments: there are actually eight experimental domains

    • two variants of each domain – an overlap variant and a disjoint variant.

  • Performance measures:

    • top-1 matching accuracy

    • top-3 matching accuracy

    • complex matches

    • partial complex matches


Results (1)

Overall and 1-1 matching accuracy:

  • Not shown in the figure, but according to the article the top-3 accuracy is even higher, and iMAP also achieves top-1 and top-3 accuracies of 77%-100% for 1-1 matching.

(a) Exploiting domain constraints and overlap data improves accuracy.

(b) Disjoint domains achieve lower accuracy than overlap-data domains.



Results (2)

Complex matching accuracy – Top 1 and Top 3:


Results (2) – Cont.

Complex matching accuracy – Top 1:

  • Low results for default iMAP (for example, inventory = 9%), both in overlap domains and in disjoint domains.

  • (a) Exploiting domain constraints and overlap data improves accuracy.

  • (b) On disjoint domains iMAP achieves lower accuracy than on overlap-data domains.

    • Without overlap data, the accuracy of the Numeric Searcher and the Text Searcher decreases.


Results (2) – complex matches: low results

  • Small components – example: apt-number

    • Suggested solution: adding format-learning techniques.

  • Small noisy components – example: agent-id

    • Suggested solution: more aggressive match cleaning and more constraints.

  • Disjoint databases – difficult for the Numeric Searcher

    • Suggested solution: using past numeric matches.

  • Top-k – many correct results are not in the top 1

    • Increasing k to 10 would increase accuracy.




Results (2) – Cont.

Complex matching accuracy – Top 3:

  • Low results for default iMAP (for example, inventory = 9%), both in overlap domains and in disjoint domains.

    • Same reasons as for Top 1.

  • (c) Improvement in accuracy compared to (a) when using overlap data and constraints.

  • This is an outcome of correct complex matches appearing among the top 3 matches.



Results (3)

Partial Complex matching accuracy – Top 1 and Top 3:


Results (3) – cont.

Partial complex matching accuracy – Top 1 and Top 3:

  • Accuracy here is measured on finding only the right attributes.

    • For example: a wrong numeric template but the right attributes.

  • Accuracy is much higher than for full complex matching.

  • Partial complex matches can be very useful when the user wants to fix wrong matches.


Performance & Efficiency

Performance:

  • iMAP's accuracy is stable after 100 data tuples.

  • If we first run it on fewer examples, we can reduce iMAP's running time.

[Figure: accuracy vs. number of data tuples]


Performance & Efficiency – Cont.

Efficiency:

  • The unoptimized iMAP version ran for 5 – 20 minutes on the experimental domains.

  • Several techniques are suggested in the article to improve this time:

    • for example, breaking the schemas into independent chunks.


Explaining match predictions

  • An example of explaining a match prediction:

    • Searcher level: Concat(first-name, last-name) was ranked higher than last-name.

    • Similarity Estimator:

      • the name-based evaluator was wrong

      • the Naive Bayes evaluator was right

    • Match Selector: had no influence.

  • Conclusion: the name-based evaluator had more influence on the final ranking.

  • The user can use this information to reduce the influence of the name-based evaluator.


Related work

  • L. Xu and D. Embley. Using domain ontologies to discover direct and indirect matches for schema elements:

    • maps the schema to a domain ontology and searches within that domain.

    • Could be added to iMAP as an additional searcher.

  • The Clio system:

    • a sophisticated set of user-interface techniques for improving matches.


Conclusions

  • Most of the work in this field until now has been about 1-1 matching.

  • This article focused on complex matching.

  • The keys to iMAP are the use of:

    • searchers

    • domain knowledge

  • and giving the user the ability to influence the matches.



Any Questions?



Thank you!



Bibliography

  • Robin Dhamankar, Yoonkyong Lee, AnHai Doan, Alon Halevy, Pedro Domingos. iMAP: Discovering Complex Semantic Matches between Database Schemas. SIGMOD 2004.

  • http://en.wikipedia.org/wiki/Beam_search

