eTuner: Tuning Schema Matching Software using Synthetic Scenarios

Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan (University of Illinois, USA)
Arnon Rosenthal (MITRE Corp., USA)

Main Points
• Tuning matching systems: long-standing problem
  • becomes increasingly worse
• We propose a principled solution
  • exploits synthetic input/output pairs
  • promising, though much work remains
• Idea applicable to other contexts

Schema Matching

Schema 1:
  price     agent-name        address
  120,000   George Bush       Crawford, TX
  239,900   Hillary Clinton   New York City, NY

Schema 2:
  listed-price   contact-name   city      state
  320K           Jane Brown     Seattle   WA
  240K           Mike Smith     Miami     FL

The figure labels two kinds of correspondence between the schemas: a 1-1 match and a complex match.

Schema Matching is Ubiquitous
• Databases
  • data integration, model management
  • data translation, collaborative data sharing
  • keyword querying, schema/view integration
  • data warehousing, peer data management, …
• AI
  • knowledge bases, ontology merging, information gathering agents, ...
• Web
  • e-commerce, Deep Web, Semantic Web
• eGovernment, bio-informatics, scientific data management

Current State of Affairs
• Finding semantic mappings is now a key bottleneck!
  • largely done by hand, labor intensive & error prone
• Numerous matching techniques have been developed
  • Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U Leipzig, U Wisconsin, NCSU, U Illinois, Washington, Humboldt-Universität zu Berlin, ...
  • AI: Stanford, Karlsruhe University, NEC Japan, ...
• Techniques are often synergistic, leading to multi-component matching architectures
  • each component employs a particular technique
  • final predictions combine those of the components

An Example: LSD [SIGMOD-01]

Figure (LSD pipeline): Name Matcher and Naive Bayes Matcher -> Combiner -> Constraint Enforcer -> Match Selector

• Schema 1 attributes (with sample data): agent-name (James Smith, Mike Doan), address (Urbana, IL; Seattle, WA)
• Schema 2 attributes (with sample data): area (Peoria, IL; Kent, WA), contact-agent ((206) 634 9435; (617) 335 4243)
• Scores shown for contact-agent against agent-name: Name Matcher 0.5, Naive Bayes Matcher 0.1, Combiner 0.3
• Combined predictions
  • area => (address, 0.7), (description, 0.3)
  • contact-agent => (agent-phone, 0.7), (agent-name, 0.3)
  • comments => (address, 0.6), (desc, 0.4)
• Constraint: only one attribute of Schema 2 matches address
• Selected matches
  • area = address
  • contact-agent = agent-phone
  • ...
  • comments = desc
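The constraint-enforcement step in the LSD example can be illustrated with a small sketch. This is a brute-force search over candidate assignments, not LSD's actual enforcer; the function name `enforce_and_select` and the sum-of-scores objective are assumptions made for illustration. The candidate scores are the ones shown on the slide.

```python
from itertools import product

def enforce_and_select(predictions, unique_targets):
    """Pick one candidate per attribute, maximizing the total score,
    while letting each target in unique_targets be used at most once.
    Brute force; only meant for tiny examples like the one on the slide."""
    attrs = list(predictions)
    best, best_score = None, float("-inf")
    for choice in product(*(predictions[a] for a in attrs)):
        targets = [target for target, _ in choice]
        # Skip assignments that violate a uniqueness constraint.
        if any(targets.count(u) > 1 for u in unique_targets):
            continue
        score = sum(s for _, s in choice)
        if score > best_score:
            best, best_score = dict(zip(attrs, targets)), score
    return best

# Combined predictions from the slide.
predictions = {
    "area": [("address", 0.7), ("description", 0.3)],
    "contact-agent": [("agent-phone", 0.7), ("agent-name", 0.3)],
    "comments": [("address", 0.6), ("desc", 0.4)],
}

constrained = enforce_and_select(predictions, unique_targets={"address"})
unconstrained = enforce_and_select(predictions, unique_targets=set())
```

Without the constraint, both area and comments would be assigned to address; with it, comments falls back to desc, as on the slide.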

Multi-Component Matching Solutions
• Such systems are very powerful ...
  • maximize accuracy; highly customizable to individual domains
• ... but place a serious tuning burden on domain users
• Developed in many recent works
  • e.g., Doan et al., WebDB-00, SIGMOD-01; Do & Rahm, VLDB-02; Embley et al.-02; Bernstein et al., SIGMOD Record-04; Madhavan et al. 05
• Now commonly adopted, with industrial-strength systems
  • e.g., Protoplasm [MSR], COMA++ [Univ of Leipzig]

Figure: execution graphs of LSD, COMA, SF, and LSD-SF, each assembled from matchers (Matcher 1 … Matcher n), combiners, constraint enforcers, and match selectors.

Tuning Schema Matching Systems
• Given a particular matching situation
  • how to select the right components?
  • how to adjust the multitude of knobs?
• Untuned versions produce inferior accuracy, however ...

Figure: an execution graph (Matcher 1 … Matcher n, Combiner, Constraint enforcer, Match selector) next to a library of matching components.
• Matchers: q-gram name matcher, TF/IDF name matcher, Decision tree matcher, Naïve Bayes matcher, SVM matcher
• Combiners: Average combiner, Min combiner, Max combiner, Weighted sum combiner
• Constraint enforcers: A* search enforcer, Relax. labeler, ILP
• Match selectors: Bipartite graph selector, Threshold selector
• Knobs of decision tree matcher: Characteristics of attr., Split measure, Post-prune?, Size of validation set

... Tuning is Extremely Difficult
• Large number of knobs
  • e.g., 8-29 in our experiments
• Wide variety of techniques
  • database, machine learning, IR, information theory, etc.
• Complex interaction among components
• Not clear how to compare the quality of knob configs
• Matching systems are still tuned manually, by trial and error
• Multiple component systems make tuning even worse

Developing efficient tuning techniques is crucial to making matching systems attractive in practice

The eTuner Solution
• Given schema S & matching system M
  • tunes M to maximize average accuracy of matching S with future schemas
  • incurs virtually no cost to user
• Key challenge 1: Evaluation
  • must search for the “best” knob config
  • how to compute the quality of any knob config C?
  • if we knew the “ground-truth” matches for a representative workload W = {(S,T1), ..., (S,Tn)}, we could use W to evaluate C
  • but often we have no such W
• Key challenge 2: Search
  • how to efficiently evaluate the huge space of knob configs?
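A minimal sketch of the evaluation idea: given a workload with known matches, the quality of a knob config can be taken as the average accuracy over the workload. F1 is used here as the accuracy measure. The toy `prefix_matcher`, its `prefix_len` knob, and the sample schemas are hypothetical, introduced only to make the sketch runnable.

```python
def f1(predicted, truth):
    """F1 score of a set of predicted matches against ground-truth matches."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)

def config_quality(matcher, config, workload):
    """Average F1 of `matcher` under `config` over (S, T, true_matches) triples."""
    scores = [f1(matcher(s, t, config), truth) for s, t, truth in workload]
    return sum(scores) / len(scores)

def prefix_matcher(source, target, config):
    """Hypothetical one-knob matcher: match names that share a prefix."""
    n = config["prefix_len"]
    return {(s, t) for s in source for t in target
            if s[:n].lower() == t[:n].lower()}

S = ["price", "phone", "address"]
workload = [
    (S, ["prc", "phn", "addr"],
     {("price", "prc"), ("phone", "phn"), ("address", "addr")}),
    (S, ["Price", "Phone", "Address"],
     {("price", "Price"), ("phone", "Phone"), ("address", "Address")}),
]

# Score every value of the single knob and keep the best one.
qualities = {n: config_quality(prefix_matcher, {"prefix_len": n}, workload)
             for n in (1, 2, 3)}
best_len = max(qualities, key=qualities.get)
```

With several knobs, the same `config_quality` function would serve as the objective that the search over configs tries to maximize.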

Key Idea: Generate Synthetic Input/Output Pairs
• Need workload W = {(S,T1), (S,T2), …, (S,Tn)}
• To generate W
  • start with S
  • perturb S to generate T1
  • perturb S to generate T2
  • etc.
• Know the perturbation => know matches between S & Ti

Key Idea: Generate Synthetic Input/Output Pairs (figure)
• Split schema S into V and U with disjoint data tuples
• Perturb V to obtain V1, ..., Vn
  • perturb # of tables
  • perturb # of columns in each table
  • perturb column and table names
  • perturb data tuples in each table
• Example: table EMPLOYEES in U, perturbed table EMPS in V1
• Ω1: a set of semantic matches between V1 and U
  • EMPS.emp-last = EMPLOYEES.last
  • EMPS.id = EMPLOYEES.id
  • EMPS.wage = EMPLOYEES.salary($)

Examples of Perturbation Rules
• Number of tables
  • merge two tables based on a join path
  • split a table into two
• Structure of table
  • merge two columns
    • e.g., neighboring columns, or columns sharing a prefix/suffix (last-name, first-name)
  • drop a column
  • swap the location of two columns
• Names of tables/columns
  • rules capture common name transformations
  • abbreviation to the first 3-4 characters, dropping all vowels, synonyms, dropping prefixes, adding table name to column name, etc.
• Data values
  • rules capture common format transformations: 12/4 => Dec 4
  • values are changed based on some distributions (e.g., Gaussian)

See paper for details
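A minimal sketch of how a few of these rules could generate synthetic schemas together with their ground-truth matches. It covers only two name rules (abbreviation, dropping vowels) and the drop-a-column rule; the function names, the 20% drop probability, and the sample column names are assumptions for illustration, not eTuner's actual implementation.

```python
import random

def abbreviate(name, rng):
    """Name rule: keep only the first 3 or 4 characters."""
    return name[: rng.choice([3, 4])]

def drop_vowels(name, rng):
    """Name rule: drop all vowels (keep the name if nothing would remain)."""
    stripped = "".join(c for c in name if c.lower() not in "aeiou")
    return stripped or name

NAME_RULES = [abbreviate, drop_vowels]

def perturb_schema(columns, rng, drop_prob=0.2):
    """Return (perturbed column names, matches), where each match is a
    (perturbed_name, original_name) pair known by construction."""
    perturbed, matches = [], []
    for col in columns:
        if rng.random() < drop_prob:
            continue  # "drop a column" rule: the column has no match
        new_name = rng.choice(NAME_RULES)(col, rng)
        perturbed.append(new_name)
        matches.append((new_name, col))
    return perturbed, matches

def generate_workload(columns, n, seed=0):
    """Perturb the same schema n times to obtain n synthetic scenarios."""
    rng = random.Random(seed)
    return [perturb_schema(columns, rng) for _ in range(n)]

columns = ["last", "id", "salary"]
workload = generate_workload(columns, n=5)
```

Because every perturbed name is recorded next to the column it came from, the correct matches are available without any manual labeling.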

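The eTuner Architecture

Figure: Matching Tool M and Schema S are the inputs. The Workload Generator applies Perturbation Rules to produce the Synthetic Workload (V1, U, Ω1), (V2, U, Ω2), ..., (Vn, U, Ωn). The Staged Tuner applies Tuning Procedures over this workload and outputs the Tuned Matching Tool M. The figure also carries an "(Optional)" label.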

The Staged Tuner
• Tune sequentially, starting with lowest-level components
• Assume
  • execution graph has k levels, m nodes per level
  • each node can be assigned one of n components
  • each component has p knobs, each of which has q values
• Tuning examines npqkm out of (npq)^(km) knob configs

Figure: execution graph (Matcher 1 … Matcher n, Combiner, Constraint enforcer, Match selector) with Level 1 through Level 4 and an arrow marking the tuning direction.
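A rough sketch of staged tuning under the assumptions above: nodes are tuned one at a time from the lowest level up, and within a node each knob is varied while the others are held fixed. The data layout, the `staged_tune` name, and the toy quality function are all hypothetical; the sketch also assumes every component has at least one knob and that quality can be evaluated on a partially assigned graph.

```python
def staged_tune(levels, evaluate):
    """levels: list of levels (lowest first); each level is a list of nodes;
    each node maps a component name to {knob name: [candidate values]}.
    evaluate(config) scores a (possibly partial) assignment, where config
    maps (level, node) to (component, knob settings).
    Returns the tuned config and the number of configs evaluated."""
    config, evaluations = {}, 0
    for li, nodes in enumerate(levels):
        for ni, candidates in enumerate(nodes):
            best = None  # (score, component, settings)
            for comp, knobs in candidates.items():
                # Start from the first value of every knob.
                settings = {k: vals[0] for k, vals in knobs.items()}
                for knob, vals in knobs.items():
                    best_val, best_score = None, None
                    for v in vals:
                        config[(li, ni)] = (comp, dict(settings, **{knob: v}))
                        score = evaluate(config)
                        evaluations += 1
                        if best_score is None or score > best_score:
                            best_val, best_score = v, score
                    settings[knob] = best_val  # freeze this knob
                    if best is None or best_score > best[0]:
                        best = (best_score, comp, dict(settings))
            config[(li, ni)] = (best[1], best[2])
    return config, evaluations

def toy_quality(config):
    """Hypothetical objective: prefers component 'b' and knob values of 2."""
    total = 0.0
    for comp, settings in config.values():
        total += 1.0 if comp == "b" else 0.0
        total -= sum(abs(v - 2) for v in settings.values())
    return total

# k = 2 levels, m = 2 nodes per level, n = 2 components, p = 2 knobs, q = 3 values.
node = {"a": {"x": [1, 2, 3], "y": [1, 2, 3]},
        "b": {"x": [1, 2, 3], "y": [1, 2, 3]}}
levels = [[node, node], [node, node]]
tuned, evaluations = staged_tune(levels, toy_quality)

n, p, q, k, m = 2, 2, 3, 2, 2
staged_count = n * p * q * k * m          # npqkm from the slide
full_count = (n * p * q) ** (k * m)       # (npq)^(km) from the slide
```

In this toy setting the staged search evaluates 48 configs, against 20,736 by the slide's count for the full space.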

Empirical Evaluation
[Tables: Domains; Matching systems]

Matching Accuracy

Figure: four bar charts, one per matching system (COMA, LSD, SF, LSD-SF), each showing accuracy (axis from 0 to 0.9) on the Real Estate, Product, Inventory, and Course domains. Legend: Off-the-shelf, Domain-independent, Domain-dependent, Source-dependent, eTuner: Automatic, eTuner: Human-assisted.

eTuner achieves higher accuracy than current best methods, at virtually no cost to the user

Cost of Using eTuner
• You have a schema S and a matching system M
• Vendor supplies eTuner
  • will hook it up with matching system M
• Vendor supplies a matching system M
  • bundles eTuner inside

Sensitivity Analysis
• Adding perturbation rules
• Exploiting prior match results (enriching the workload)

Figure: accuracy (F1) of tuned LSD in two plots, one against the number of schemas in the synthetic workload (1, 10, 20, 25, 40, 50) and one against the percentage of previous matches in the collection (0, 22, 44, 66, 88). Legend: Inventory Domain, Real Estate Domain, Average.

Summary: The eTuner Project @ Illinois
• Tuning matching systems is crucial
  • long-standing problem, is getting worse
  • a next logical step in schema matching research
• Provides an automatic & principled solution
  • generates a synthetic workload, employs it to tune efficiently
  • incurs virtually no cost to human users
  • exploits user assistance whenever available
• Extensive experiments over 4 domains with 4 systems
• Future directions
  • find optimal synthetic workload
  • apply to other matching scenarios
  • adapt ideas to scenarios beyond schema matching (see 3rd speaker)

Backup: User Assistance
• S(phone1,phone2,…)
• Generate V by dropping phone2: V(phone1,…)
• Rename phone1 in V: V(x,…)
• Problem:
  • x matches phone1, x does not match phone2
• User:
  • groups phone1 and phone2
  • so if x matches phone1, it will also match phone2
• Intuition: tell the system not to bother trying to distinguish phone1 and phone2
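One way to read the grouping idea is as a relaxed correctness check: a predicted match counts as correct if its target lies in the same user-supplied group as the true target. The sketch below illustrates that reading only; the function names and data layout are assumptions, not eTuner's actual mechanism.

```python
def same_group(a, b, groups):
    """True if a and b are identical or belong to a common user-defined group."""
    return a == b or any(a in g and b in g for g in groups)

def is_correct(predicted, truth, groups=()):
    """predicted: (attribute, target) pair; truth: set of (attribute, target)
    pairs known from the perturbation. With groups, any member of the true
    target's group is accepted."""
    attr, target = predicted
    return any(attr == t_attr and same_group(target, t_target, groups)
               for t_attr, t_target in truth)

truth = {("x", "phone1")}           # x was produced by renaming phone1
groups = [{"phone1", "phone2"}]     # user says: do not distinguish these

strict = is_correct(("x", "phone2"), truth)            # no user assistance
relaxed = is_correct(("x", "phone2"), truth, groups)   # with the user's group
```

Under the strict check, matching x to phone2 is penalized even though the two phone columns are interchangeable; with the group it is accepted.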
