
iMAP : Discovering Complex Semantic Matches between Database Schemas



  1. iMAP: Discovering Complex Semantic Matches between Database Schemas

    Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign Alon Halevy, Pedro Domingos University of Washington
  2. Semantic Heterogeneity Semantics refers to the meaning of data, in contrast to syntax, which defines only the structure of schema items (e.g., classes and attributes). Semantic heterogeneity is a general term for disagreement about the meaning, interpretation, or intended use of the same or related data. It is one of the toughest problems in data integration.
  3. Schema Matching A match between two schemas specifies semantic correspondences between elements of both schemas. A complex match specifies that a combination of attributes in one schema corresponds to a combination in the other.
  4. Aim Examine an unbounded number of match candidates for complex matching between data sources. Describe the iMAP system, which semi-automatically discovers both 1-1 and complex matches for relational data (database schemas in this paper). The system uses the following steps: generating matches, exploiting domain knowledge, and explaining match predictions.
  5. Content An architecture for semi-automatically discovering complex matches that combines search through a set of candidate matches with methods for evaluating each match in isolation and a set of matches as a whole. The use of new kinds of domain knowledge (overlap data and mined external data), applied as early as possible in the matching process. A mechanism for explaining the decisions made by the matching system. The iMAP system, which embodies all these innovations, and a set of experiments on real-world data.
  6. Problem Definition Consider the two relational schemas S and T in Figure 1. Both databases store house listings and are managed by two different real-estate companies. The schema of database T, for example, has one table, LISTINGS, whereas database S stores its data in two tables, HOUSES and AGENT. Suppose the two real-estate companies decide to merge. To cut costs, they eliminate database S by transferring all house listings from S to database T.
  7. Semantic Mapping Specifies relationships between schema elements of disparate data sources. Example:
     Table T (target): location | listed-price | agent-name
       Denver, CO | 550,000 | Laura Smith
       Atlanta, GA | 370,800 | Mike Brown
     Table S1: agent-id | city | state | price ($)
       32 | Athens | GA | 360,000
       15 | Raleigh | NC | 360,000
     Table S2: id | agent | commission
       32 | Mike Brown | 0.03
       15 | Jean Laup | 0.04
  8. Creating Semantic Mappings From source tables S1 and S2 to target table T (shown above).
     Step 1: Generating matches
       location = concat(city, state)
       listed-price = price*(1+commission)
       agent-name = agent
     Step 2: Elaborating matches into mappings
       location = SELECT concat(city, state) FROM S1
       listed-price = SELECT price*(1+commission) FROM S1, S2 WHERE agent-id = id
       agent-name = SELECT agent FROM S2
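The two elaboration steps can be sketched in a few lines of Python. The row dictionaries and the `elaborate` helper are our own illustration of what the resulting mapping computes, not code from iMAP.

```python
def elaborate(s1_row, s2_row):
    """Apply the discovered matches to one joined (S1, S2) row pair."""
    assert s1_row["agent-id"] == s2_row["id"]  # the join condition from Step 2
    return {
        "location": f'{s1_row["city"]}, {s1_row["state"]}',            # concat(city, state)
        "listed-price": s1_row["price"] * (1 + s2_row["commission"]),  # price*(1+commission)
        "agent-name": s2_row["agent"],
    }

s1 = {"agent-id": 32, "city": "Athens", "state": "GA", "price": 360000}
s2 = {"id": 32, "agent": "Mike Brown", "commission": 0.03}
print(elaborate(s1, s2))
```

Running this on the sample S1/S2 rows produces a T-style tuple (Athens, GA; 370,800; Mike Brown), mirroring how the listings would be migrated into T.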
  9. Generating Matches 1-1 matches: agent-name = agent. Complex matches: location = concat(city, state); listed-price = price*(1+commission). In the data sets used in this paper, 30-50% of the matches are complex.
  10. Illustrative Example Step 1, generating candidate matches (pruning, e.g., "city" and "price" are unrelated): location = city; listed-price = price; agent-name = agent; ...; location = concat(city, state); listed-price = price*(1+commission); ... Step 2, estimating similarity (pruning, e.g., "location" and "listed-price" are unrelated): sim(agent-name, agent) = 0.95; sim(location, city) = 0.76; sim(listed-price, price) = 0.86; ...; sim(location, concat(city, state)) = 0.96; sim(listed-price, price*(1+commission)) = 0.97; ... Step 3, finding the best global assignment: location = concat(city, state); listed-price = price*(1+commission); agent-name = agent.
  11. iMAP Architecture Input: target schema T and source schema S, plus domain knowledge and data supplied by the user. The match generator produces match candidates; the similarity estimator builds a similarity matrix over them; the match selector outputs the 1-1 and complex matches; and an explanation module lets the user inspect the system's decisions.
  12. Match Generator The space of candidate matches is huge: location = city, listed-price = price; location = state, listed-price = commission; ...; location = concat(city, state), listed-price = price*(1+commission); location = concat(city, agent-name), listed-price = price*(1-commission); ...; listed-price = concat(city, state), location = price*(1+commission); ... Solution: a multi-searcher strategy employing a set of special-purpose searchers; each searcher examines only a specific portion of the search space (e.g., text searcher, numeric searcher, ...). iMAP is highly extensible: new searchers can be added as they are developed.
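A minimal sketch of the multi-searcher idea, assuming hypothetical `TextSearcher`/`NumericSearcher` classes and a toy type-annotated source schema (none of these names come from the iMAP implementation): each searcher enumerates only its own slice of the candidate space, and the generator unions the proposals.

```python
import itertools

class TextSearcher:
    """Proposes concatenations over textual source columns only."""
    def candidates(self, source_cols):
        text = [c for c, t in source_cols.items() if t == "text"]
        return [(c,) for c in text] + list(itertools.permutations(text, 2))

class NumericSearcher:
    """Proposes arithmetic combinations over numeric source columns only."""
    def candidates(self, source_cols):
        nums = [c for c, t in source_cols.items() if t == "numeric"]
        return [(c,) for c in nums] + list(itertools.combinations(nums, 2))

source = {"city": "text", "state": "text", "price": "numeric", "commission": "numeric"}
searchers = [TextSearcher(), NumericSearcher()]
cands = [c for s in searchers for c in s.candidates(source)]
# Text candidates such as ("city", "state") feed concat matches; numeric ones
# such as ("price", "commission") feed arithmetic matches. Mixed-type
# candidates like ("city", "price") are never generated by either searcher.
```

Because each searcher restricts itself to one data type, the union of their outputs is far smaller than the full cross-product of attribute combinations.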
  13. Default Search Strategy Use beam search: at each search level, keep the k best match candidates. Example: find all concatenation candidate matches for location (k = 3) over source attributes city, state, price ($), agent-name, commission, ... The first level keeps concat(city, state), concat(city, price), and concat(city, agent-name).
  14. Default Search Strategy (cont.) At the next level, the kept candidates are extended by one more attribute, e.g., concat(city, state) expands to concat(city, state, price) and concat(city, state, agent-name); again only the k best candidates are kept.
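The level-by-level beam search of slides 13-14 might be sketched as follows. The `beam_search` function, the toy scorer, and the attribute list are our own illustrative assumptions, not iMAP's actual code or scoring function.

```python
def beam_search(attrs, score, k=3, max_len=3):
    """Keep the k best candidates at each level; return the best seen overall."""
    beam = sorted(((score((a,)), (a,)) for a in attrs), reverse=True)[:k]
    best = max(beam)
    for _ in range(max_len - 1):
        # Extend each kept candidate by one attribute it does not already use.
        expanded = [(score(c + (a,)), c + (a,)) for _, c in beam
                    for a in attrs if a not in c]
        if not expanded:
            break
        beam = sorted(expanded, reverse=True)[:k]
        best = max(best, beam[0])
    return best[1]

# Toy scorer: rewards attributes that plausibly belong to "location",
# with a small penalty per extra attribute in the concatenation.
score = lambda cand: sum(a in {"city", "state"} for a in cand) - 0.1 * len(cand)
attrs = ["city", "state", "price ($)", "agent-name", "commission"]
print(beam_search(attrs, score))
```

Under this toy scorer the search settles on the {city, state} concatenation: adding a third attribute only lowers the score, so deeper levels stop improving.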
  15. Scoring Functions Each searcher has its own scoring function, which can use machine learning, statistics, heuristics, etc. For example, the current text searcher uses Naive Bayes classification. To compute the score between location and concat(city, state): build a Naive Bayes classifier for location; use the classifier to compute how similar each instance of concat(city, state) is to location; the average similarity value is the desired score.
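The text searcher's scoring idea can be approximated with a hand-rolled Naive Bayes token model. This is a rough, self-contained sketch under our own simplifications (word-level Laplace smoothing, comma stripping), not iMAP's actual classifier: train on instances of the target column versus the other target columns, then average the posterior over instances of the candidate expression.

```python
import math
from collections import Counter

def tokens(s):
    return s.lower().replace(",", " ").split()

def train_nb(positive, negative):
    """Return a log-odds function log [P(target|s) / P(not-target|s)]."""
    pos, neg = Counter(), Counter()
    for s in positive: pos.update(tokens(s))
    for s in negative: neg.update(tokens(s))
    vocab = len(set(pos) | set(neg))
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    def log_odds(s):
        lo = math.log(len(positive) / len(negative))   # class prior
        for w in tokens(s):
            lo += math.log((pos[w] + 1) / (n_pos + vocab))  # Laplace smoothing
            lo -= math.log((neg[w] + 1) / (n_neg + vocab))
        return lo
    return log_odds

def match_score(instances, log_odds):
    """Average posterior probability over the candidate's instances."""
    return sum(1 / (1 + math.exp(-log_odds(s))) for s in instances) / len(instances)

# Positives: instances of target column `location`; negatives: other columns.
location = ["Denver, CO", "Atlanta, GA"]
others = ["550,000", "370,800", "Laura Smith", "Mike Brown"]
lo = train_nb(location, others)
print(match_score(["Athens GA", "Raleigh NC"], lo))   # instances of concat(city, state)
```

Even with this tiny training set, concat(city, state) instances score higher against the location model than price instances do, which is exactly the signal the searcher needs.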
  16. Termination Condition Even with beam search, the search space can still be huge or infinite, so an effective termination condition is needed. Currently iMAP terminates when it sees diminishing returns: with scoring function f and threshold 0.03, stop when |f(best candidate at the 3rd iteration) - f(best candidate at the 2nd iteration)| <= 0.03.
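A small sketch of the diminishing-returns rule, with a hypothetical `search_until_diminishing` driver (our name, not iMAP's) that consumes the best score found at each search level and stops once the improvement falls under the threshold:

```python
def search_until_diminishing(best_per_level, threshold=0.03):
    """Walk per-level best scores; stop once the gain drops to <= threshold.

    Returns (level at which we stopped, best score seen so far)."""
    prev = None
    stopped_at = 0
    for depth, best in enumerate(best_per_level, start=1):
        stopped_at = depth
        if prev is not None and abs(best - prev) <= threshold:
            return stopped_at, max(prev, best)
        prev = best
    return stopped_at, prev

# e.g. f(best candidate) over successive iterations: 0.95 -> 0.96 is only a
# 0.01 gain, so the search stops at the 3rd iteration.
print(search_until_diminishing([0.80, 0.95, 0.96, 0.99]))
```

Note that such a rule can stop early of a genuinely better deeper candidate (0.99 above is never reached); it trades completeness for bounded search time.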
  17. Implemented Searchers
  18. Exploiting Domain Knowledge Exploit new kinds of domain knowledge: overlap data and external data. Overlap data: data from different sources that represent the same entity, e.g., tuples from real-estate listings that describe the same house. External data: data not in the two input schemas, e.g., "# of real estate agents <= 50" mined from real-estate listings. Use domain knowledge as early as possible in the matching process.
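One way overlap data can be used is to check an arithmetic match directly on tuples known to describe the same house in both sources. The `consistent_on_overlap` helper and the relative tolerance are our own assumptions for this sketch:

```python
def consistent_on_overlap(pairs, rel_tol=0.01):
    """Check listed-price = price*(1+commission) on overlap tuples.

    pairs: (source_row, target_row) pairs known to describe the same house."""
    return all(
        abs(s["price"] * (1 + s["commission"]) - t["listed-price"])
        <= rel_tol * t["listed-price"]
        for s, t in pairs
    )

overlap = [({"price": 360000, "commission": 0.03}, {"listed-price": 370800})]
print(consistent_on_overlap(overlap))
```

A match that holds on every overlapping tuple gains strong evidence, which is why applying such knowledge early can prune large parts of the candidate space.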
  19. Generating Explanations Motivating example 1: num-rooms = bath-rooms + bed-rooms + dining-rooms: 0.434; num-rooms = bath-rooms + bed-rooms + living-rooms: 0.432; ... Correct match: num-rooms = bath-rooms + bed-rooms + dining-rooms + living-rooms. Motivating example 2: name = last-name: 0.434; name = concat(first-name, last-name): 0.420; ... Correct match: name = concat(first-name, last-name).
  20. Types of Questions Users Can Ask Existence of a match: why was month-posted = monthly-fee-rate generated? Non-existence of a match: why was num-rooms = bath-rooms + bed-rooms + dining-rooms + living-rooms not generated? Ranking of a match: why is name = last-name ranked higher than name = concat(first-name, last-name)?
  21. Generating Explanations Using the Dependency Graph The dependency graph records how each decision was derived. Example: the preprocessor applies the constraint "if the data in a column is in 1..12, it is a month" to source column month-posted (values 3, 9, ...) and target column monthly-fee-rate (values 2, 10, ...), producing the assumptions "month-posted is a month" and "monthly-fee-rate is a month". The date searcher then generates the candidate month-posted = monthly-fee-rate; the name-based evaluator scores it 0.55 and the Naive Bayes evaluator 0.79, which the combining module merges into 0.67. Meanwhile the candidates list-price = price (score 0.63) and list-price = price*(1 + monthly-fee-rate) (score 0.76), together with the constraint "month-posted is unrelated to list-price", feed the match selector, which outputs the match list: list-price = price; list-price = price*(1 + monthly-fee-rate).
  22. Sample Explanation USER: Why does num-rooms = bath-rooms + bed-rooms + dining-rooms + living-rooms not appear? iMAP: (1) The overlap numeric searcher CAN generate bath-rooms + bed-rooms + dining-rooms + living-rooms for num-rooms. (2) The overlap numeric searcher DID NOT generate it. (3) A reason: the match has a length of 4 terms; the overlap numeric searcher considered only candidates of length up to 3 terms. (4) Characteristics of the search space for num-rooms: a. number of considered numeric attributes: 7; b. considered numeric attributes: building-area, lot-dimension1, lot-dimension2, bath-rooms, bed-rooms, dining-rooms, living-rooms; c. used successor functions: addition, multiplication; d. max. number of terms: 3; e. max. number of elements in a term: 3.
  23. Empirical Evaluation Domains and data sources: four domains (Real Estate, Inventory, Cricket, and Financial Wizard), with 30 matches on average; data obtained from several sources (the Internet, Microsoft Access sample databases, and students in a large database class). Methodology and performance measure: experiments run in several configurations on eight experiment sets; top-1 matching accuracy = % of target attributes whose top-1 matches are correct.
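The top-1 accuracy measure can be sketched directly; `top1_accuracy` and the toy ranked candidate lists below are our illustration, not the paper's evaluation data.

```python
def top1_accuracy(ranked, gold):
    """ranked: target attr -> candidate matches, best first.
    gold: target attr -> the correct match."""
    correct = sum(1 for a, cands in ranked.items()
                  if cands and cands[0] == gold.get(a))
    return correct / len(ranked)

ranked = {
    "location": ["concat(city,state)", "city"],          # top-1 is correct
    "listed-price": ["price", "price*(1+commission)"],   # correct match ranked 2nd
}
gold = {"location": "concat(city,state)",
        "listed-price": "price*(1+commission)"}
print(top1_accuracy(ranked, gold))  # → 0.5
```

This makes concrete why slide 26's observation matters: a correct match ranked second or third counts as a miss under top-1 accuracy.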
  24. Complex Matching Accuracy Matching accuracy (%): default system: 33-55%; + domain constraints: 50-55%; + overlap data: 33-71%; + domain constraints + overlap data: 50-86%. Overall: 1-1 matching accuracy 77-100%; complex matching accuracy 50-86%.
  25. Complex Matching Accuracy for Disjoint Domains Matching accuracy (%): default system: 18-58%; + domain constraints: 27-58%. Complex matching accuracy: 27-58%.
  26. Why not 100%? It is hard to find small components of a complex match: e.g., agent-address = concat(street-name, city, state), where apt-number is missed. It is hard to remove small noise components from a complex match: e.g., phone-number = concat(agent-id, area-code, number), where agent-id is noise (a single-digit number). Many correct complex matches are in the top three, but not in the top one.
  27. Summary & Future Work Complex matches are pervasive; finding them is crucial for practical matching solutions. Described the iMAP solution, which discovers both complex and 1-1 matches, is highly extensible, exploits several kinds of domain knowledge, and can explain the matches it produces. Experiments show the promise of iMAP. Future work: apply iMAP to other data representations (e.g., XML); combine iMAP-style techniques with Clio-style user interaction.
  28. QUESTIONS?