Imap discovering complex semantic matches between database schemas
Download
1 / 63

iMAP: Discovering Complex Semantic Matches Between Database Schemas - PowerPoint PPT Presentation


  • 62 Views
  • Uploaded on

iMAP: Discovering Complex Semantic Matches Between Database Schemas. Ohad Edry January 2009 Seminar in Databases. Motivation. Consider a union of databases of two banks. We need to generate a mapping between the schemas. Bank A tables. Bank B tables. Introduction.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' iMAP: Discovering Complex Semantic Matches Between Database Schemas' - george


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Imap discovering complex semantic matches between database schemas

iMAP: Discovering Complex Semantic Matches Between Database Schemas

Ohad Edry

January 2009

Seminar in Databases


Motivation
Motivation Schemas

  • Consider a union of databases of two banks.

  • We need to generate a mapping between the schemas

Bank A tables

Bank B tables


Introduction
Introduction Schemas

  • Semantic mappings specify the relationships between data stored in disparate sources.

    • A mapping between attribute of target schema to attributes of source schema According to the semantics


Motivation example continue
Motivation – Example continue Schemas

Bank A tables

Bank B tables


Motivation example continue1
Motivation – Example continue Schemas

Semantic Mapping!

Bank A tables

Bank B tables


Introduction1
Introduction Schemas

  • Most of the work in this field focused on Matching Process.

  • The types of matches can be split to 2:

    • 1 – 1 matching.

    • Complex matching – Combination of attributes in one schema corresponds to a combination in other schema

  • Match Candidate– each matching of attributes from source and target schemas.


Motivation example continue2
Motivation – Example continue Schemas

Semantic Mapping!

Bank A tables

Bank B tables

1-1 matching candidate

Complex matching candidate


Introduction examples
Introduction - examples: Schemas

  • Example 1:

  • Example 2:

Company A

Company B


Introduction examples1
Introduction - examples: Schemas

  • Example 1:

  • Example 2:

Company A

Company B


Introduction examples2
Introduction - examples: Schemas

  • Example 1:

  • Example 2:

Company A

Company B


Introduction examples3
Introduction - examples: Schemas

  • Example 1:

    1 – 1 matching: Name = Student, Address = Location, Phone = Cellular.

  • Example 2:

Company A

Company B


Introduction examples4
Introduction - examples: Schemas

  • Example 1:

    1 – 1 matching: Name = Student, Address = Location, Phone = Cellular.

  • Example 2:

Company A

Company B

Product Price = Price*(1-Discount)


Difficulties in generating matchings
Difficulties in Generating Matchings Schemas

  • Difficult to find the matches because

    • Finding complex matches is not trivial at all

      • How the system should know:

        Product Price = Price*(1-Discount)

    • The number of candidates for Complex Matches is large.

    • Sometimes tables should be joined:

Product Price = Price*(1-Discount)


Main parts of the imap system
Main Parts of the iMAP System Schemas

  • Generating Matching Candidates

  • Pruning matching candidates

    • By exploiting Domain Knowledge

  • Explaining Match Predictions

    • Provides an explanation to selected predicted matching

    • Causes the system to be semi automatically.


Imap system architecture
iMAP System Architecture Schemas

  • Consists three main modules:

    • Match Generator – generates the matching candidates using special searchers for target schema and source schema.

    • Similarity Estimator – generates matrix that stores the similarity score of pairs (target attribute, match candidate)

    • Match Selector – examines the score matrix and outputs the best matches under certain conditions.


Imap system architecture cont
iMAP System Architecture – cont. Schemas

Match Selector: receives similarity matrix and output final match candidates

Similarity Estimator: receives match candidates and outputs similarity matrix

To each attribute t of T iMAP generates match candidates from S


Part 1 match generation searchers
Part 1: Match Generation - searchers Schemas

  • The key in match generation is to SEARCH through the space of possible match candidates.

    • Search space – all attributes and data in source schemas

  • Searchers work based on knowledge of operators and attributes types such as numeric, textual and some heuristic methods.


The internal of searchers
The Internal of Searchers Schemas

  • Search Strategy

    • Facing the large space using the standard beamsearch.

  • Match Evaluation

    • Giving score which approximates the distance between the candidate and the target.

  • Termination Condition

    • Search should be stopped because of a large search space.


The internal of searchers example
The Internal of Searchers – Example Schemas

  • i Iterations which limited by k results:

MAXi

  • Product Price = Price*(1-Discount)

  • Product Price = Product ID

  • k. …

MAXi+1

Stop: MAXi-MAXi+1<delta

Return first k candidates


The internal of searchers join paths
The Internal of Searchers – Join Paths Schemas

  • Find matches in Join Paths in two steps:

Company A

Company B

Product Price = Price*(1-Discount)

Second Step –search process use the join paths

First Step -Join paths between tables: Join(T1,T2)


Implemented searchers in imap
Implemented searchers in iMAP Schemas

  • Contains the following searchers:

    • Text

    • Numeric

    • Category

    • Schema Mismatch

    • Unit Conversion

    • Date

    • Overlap versions of Text, Numeric, Category, Schema Mismatch, Unit Conversion


Implemented searchers text searcher example
Implemented searchers – Text Searcher example Schemas

  • Text searcher:

    Purpose: finds matching candidates that are concatenations of text attributes.

    Method:

  • Target attribute: Name

  • Search Space: attributes in source

    Schemas which have textual properties

  • Searcher search in the Search Space

    attributes or concatenations of attributes


Implemented searchers numeric searcher example
Implemented searchers – Numeric Searcher example Schemas

  • Numeric Searcher :

    Purpose: best matches for numeric attributes.

  • Issues:

    • Compute the similarity score of complex match

      • Value distribution

    • Type of matches

      • +,-,*,/

      • 2 Columns

dim1*dim2=size


Implemented searchers in imap cont
Implemented searchers in iMAP – cont. Schemas

  • Category Searcher:

    Purpose: find matches between categorical attributes in the source and in the schema.

  • Schema Mismatch Searcher:

    Purpose: relating the data of a schema with the schema of the other. Occurs very often.

  • Unit Conversion Searcher:

    Purpose: find matches between different types of units.

  • Date Searcher:

    Purpose: finds complex matches for date attributes.


Part 2 similarity estimator
Part 2: Similarity Estimator Schemas

  • Receives from the Match Generator candidate matches which based on the score that each searcher assigns.

  • Problem: each searcher can give different score

  • Solution: Final score, more accurate, to each match by using additional types of information.

    • iMAP system uses evaluator modules:

      • Name-based evaluator – computes score basing on similarity of names

      • Naive Bayes evaluator

Why not to perform this phase

during the search phase?

Very Expensive!


Module example naive bayes evaluator
Module example - Naive Bayes evaluator Schemas

  • Consider the mach

    agent-address = location

  • Building model: Data instance in target attribute will be positive otherwise the data will be negative

  • Naïve Bayes Classifier learn the model

  • Applied the trained classifier on the source attribute data

  • Each data instance receive score

  • Return an average on all score as result


Part 3 match selector
Part 3: Match Selector Schemas

  • Receives from the Similarity Estimator the scored suggested for matching candidates

  • Problem: These matches may violate certain domain integrity constraints.

    • For example: mapping 2 source attributes to the same target attributes.

  • Solution: set of domain constraints

    • Defined by domain experts or users


Constraint example
Constraint Example Schemas

  • Constraint: Price and Club members price are unrelated

  • Match Selector delete this match candidate

Match Selector receives list of candidates:

k. Product Price = Price+club members price


Exploiting domain knowledge
Exploiting Domain Knowledge Schemas

  • iMAP system uses 4 different types of knowledge:

    • Domain Constraints

    • Past matches

    • Overlap data

    • External data

  • iMAP uses its knowledge at all levels of the system and early as it can in match generation.


Types of knowledge
Types of knowledge Schemas

  • Domain constraints

    • Three cases:

      • Name and ID are unrelated - Attributes from the Source schema are unrelated

        • searchers

      • Account < 10000 - Constraint on single attribute t

        • Similarity Estimator and Searchers

      • Account and ID are unrelated - Attributes from the Target Schema are unrelated

        • Match Selector

Source:

Target:


Types of knowledge cont
Types of knowledge – cont. Schemas

  • Past Complex Matches

    • Numeric Searcher can use past expression template:

      • Price=Price*(1-Discount) generates

        VARIABLE*(1-VARIABLE)

  • External Data – using external sources for learning about attributes and their data.

    • Given a target attribute and useful feature of that attribute, iMAP learn about value distribution

      • Example: number of cities in state


Types of knowledge cont1
Types of knowledge – cont. Schemas

  • Overlap Data – Provide information for the mapping process.

    • contains searchers which can exploit overlap data.

  • Overlap Text, Category & Schema Mismatch searchers

    • S and T share a state listing

    • Matches: city=state , country=state

    • Re-evaluating results: city=state is 0 and country=state is 1

  • Overlap Numeric Searcher – using the overlap data and using equation discovery system (LAGRMGE) the best arithmetic expression for t is found.


Generating explanations
Generating Explanations Schemas

  • One goal is to provide design environment which the user will inspect the matches predicted by the system, modified them manually and the system will have a feedback.

  • The system uses complex algorithms so it needs to explain the user the matches.

  • Explanations are good for the user as well

    • Correct matches quickly

    • Tells the system where its mistake.


Generating explanations so what do you want to know about the matches
Generating Explanations – so, what do you want to know about the matches?

  • iMAP system defines 3 main questions:

    • Explain the existing match – why a certain match X is presented in the output of iMAP? Why the match survive the all process?

    • Explain absent match - why a certain match Y is not presented in the output of iMAP?

    • Explain match ranking – why match X is ranked higher than match Y?

  • Each of these questions can be asked for each module of iMAP.

    • Question can be reformulated recursively to underlying components.


Generating explanations example
Generating Explanations - Example about the matches?

  • Suppose we have 2 real-estate schemas:

  • iMAP produces the ranked matches:

    • (1) List-price=price*(1+monthly-fee-rate)

    • (2) List-price=price

iMAP explanation: both matches were generated by the numeric searcher and the similarity estimator also agreed to the ranking.


Generating explanations example1
Generating Explanations - Example about the matches?

  • Suppose we have 2 real-estate schemas:

  • The current order:

  • List-price=price*(1+monthly-fee-rate)

  • List-price=price

  • Match selector have 2 constraints: (1) month-posted=month-fee-rate, (2) month-posted and price don’t share common attributes

List-price=price match is selected by the match generator


Generating explanations example2
Generating Explanations - Example about the matches?

  • Suppose we have 2 real-estate schemas:

  • The current order:

  • List-price=price

  • List-price=price*(1+monthly-fee-rate)

  • iMAP explains that the source for month-posted=month-fee-rate is the date searcher

The user correct the iMAP that month-fee-rate is not type of date.


Generating explanations example3
Generating Explanations - Example about the matches?

  • Suppose we have 2 real-estate schemas:

  • List-price=price*(1+monthly-fee-rate) is again the chosen match

  • The Final order:

  • List-price=price*(1+monthly-fee-rate)

  • List-price=price


Example cont generated dependency graph
Example cont. – generated dependency graph about the matches?

Dependency Graph is small!!!

Searchers produce only k best matches

iMAP goes through three stages


What do you want to know about the matches
What do you want to know about the matches? about the matches?

  • Why a certain match X is presented in the output of iMAP?

    • Returns the part in the graph that describes the match.



What do you want to know about the matches1
What do you want to know about the matches? about the matches?

  • Why a certain match X is presented in the output of iMAP?

    • Returns the part in the graph that describes the match.

  • Why match X is ranked higher than match Y?

    • Return the comparing part in the graph between the 2 matches.



What do you want to know about the matches2
What do you want to know about the matches? about the matches?

  • Why a certain match X is presented in the output of iMAP?

    • Returns the part in the graph that describes the match.

  • Why match X is ranked higher than match Y?

    • Return the comparing part in the graph between the 2 matches.

  • Why a certain match Y is not presented in the output of iMAP?

    • If the has been eliminated during the process the part that responsible for the eliminating explains why

    • Otherwise the iMAP ask the searcher to check if they can generate the match and to explain why it was not generated



Evaluating imap on real world domains
Evaluating iMAP on real world domains about the matches?

  • iMAP was evaluated on 4 real-word domains:

  • For the Cricket domain they used 2 independently developed databases

  • For the other 3 they used one real-world source database and target schema which created by volunteers.

  • Databases with overlap domains and databases with disjoint domains


Evaluating imap on real world domains cont
Evaluating iMAP on real world domains – cont. about the matches?

  • Data Processing: removing data such as “unknown” and adding the most obvious constraints.

  • Experiments: there are actually 8 experimental domains

    • 2 domains for each one – overlap domain and disjoint domain.

  • Performance measure:

    • 1 matching accuracy

    • 3 matching accuracy

    • Complex match

    • Partial complex match


Results 1
Results (1) about the matches?

Overall and 1-1 matching accuracy:

  • Not in the figure, but according to the article the top-3 accuracy is even higher and iMAP also achieves top-1 and top-3 accuracy of 77%-100% for 1-1 matching

(a) Exploiting domain constraints and overlap data improve accuracy

(b) Disjoint domains achieves lower accuracy than overlap data domains


Results 2
Results (2) about the matches?

Complex matching accuracy – Top 1 and Top 3:


Results 2 cont
Results (2) – Cont. about the matches?

Complex matching accuracy – Top 1:

  • Low results for default iMAP (for example: inventory=9%) both in overlap domains and disjoint domains

  • (a) Exploiting domain constraints and overlap data improve accuracy

  • (b) iMAP achieves lower accuracy than in overlap data domains

    • No overlap data decreases the accuracy of Numeric Searcher and Text Searcher.


Results 2 complex matches low results
Results (2) – complex matches low results about the matches?

  • Smaller components – example: apt-number

    • Suggested solution: adding format learning techniques

  • Small noise components – example: agent-id

    • Suggested solution: more aggressive match cleaning and more constraints.

  • Disjoint databases – difficult for numeric searcher

    • Suggested solution: using past numeric matches

  • Top–k – many results are not in top 1

    • Increasing k to 10 will increase accuracy


Results 21
Results (2) about the matches?

Complex matching accuracy – Top 1 and Top 3:


Results 2 cont1
Results (2) – Cont. about the matches?

Complex matching accuracy – Top 3:

  • Low results for default iMAP (for example: inventory=9%) both in overlap domains and disjoint domains

    • Same reasons as in Top 1

  • (c) Improvement in accuracy compared to (a) when using overlap and constraints

  • This is a outcome of correct complex matches in the top 3 matches


Results 3
Results (3) about the matches?

Partial Complex matching accuracy – Top 1 and Top 3:


Results 3 cont
Results (3) – cont. about the matches?

Partial Complex matching accuracy – Top 1 and Top 3:

  • The accuracy is measured in finding only the right attributes

    • For example: wrong numeric template but right attributes

  • Much more accuracy than full complex matching accuracy.

  • Partial Complex Matches can be very useful when the user want to fix wrong matches


Performance efficiency
Performance & Efficiency about the matches?

Performance:

  • iMAP is stable after 100 data tuples

  • If we run it on fewer examples first we can reduce iMAP running time

Accuracy

Data tupels


Performance efficiency cont
Performance & Efficiency – Cont. about the matches?

Efficiency:

  • Unoptimized iMAP version ran for 5 – 20 minutes on the experimental domains

  • Several techniques are suggested in the article to improve this time:

    • For example breaking the schemas into independent chunks


Explaining match predictions
Explaining match predictions about the matches?

  • Example for explaining match prediction:

  • Conclusion: the Name Based evaluator has more influence – last line

  • The user can use this information to reduce the influence of the Name Based evaluator

Searcher Level: Concat(first-name,last-name) was ranked higher than last-name

  • Similarity Estimator:

    • Name based was wrong

    • Naïve Bayes was right

Match Selector: didn’t influence


Related work
Related work about the matches?

  • L. Xu and D. Embley. Using domain ontologies to discover direct and in direct matches for schema elements:

    • Mapping the schema to domain ontology and searching in this domain.

    • Can be added to as additional searcher

  • Clio System:

    • Sophisticated set of user-interface techniques to improve matches


Conclusions
Conclusions about the matches?

  • Most of the work in that field until now was about 1-1 matching

  • This article focused on complex matching.

  • iMAP key is the use of:

    • Searchers

    • Domain knowledge

  • Providing the user the possibility to affect the matches


Any questions

Any Questions? about the matches?


Thank you

Thank you! about the matches?


Bibliography
Bibliography about the matches?

  • Robin Dhamankar, Yoonkyong Lee, AnHai Doan,Alon Halevy, Pedro Domingos. iMAP: Discovering Complex Semantic Matches between Database Schemas.

  • http://en.wikipedia.org/wiki/Beam_search


ad