
Tuning using Synthetic Workload

eTUNER: Tuning Schema Matching Software using Synthetic Scenarios. Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan, and Arnon S. Rosenthal. University of Illinois @ Urbana & MITRE.


Presentation Transcript


  1. eTUNER: Tuning Schema Matching Software using Synthetic Scenarios

Schema Matching Systems
• Schema Matching
  • Finding semantic matches between the schemas of disparate data sources
  • Applications: data warehousing, scientific collaboration, e-commerce, bioinformatics, data integration on WWW, …
• Current Trends
  • Manually finding matches is labor intensive
  • Numerous automatic matching techniques have been developed
  • Each technique has its own strengths and weaknesses
  • Hence, most current matching systems adopt a multi-component strategy
  • Each component employs a particular matching technique
  • Highly extensible and customizable
  • Examples: LSD, COMA, GLUE, [Embley02], SimFlood, iMAP, ProtoPlasm, …

Modeling Schema Matching Systems
Matching tool M = (L, G, k)
• L: Library of matching components (e.g. matchers, combiners, filters, etc.)
• G: Execution graph
• k: Collection of control variables (i.e. "knobs")
Example: LSD (L, G, k)
[Figure: execution graph (G), with the labels Source schema & Target schema, Matcher 1 … Matcher n, Combiner, Constraint enforcer, Match selector; for LSD, the labels Base-Learner 1 … Base-Learner k, Meta-Learner, Constraint Handler, Domain constraints]
• Library of matching components (L)
  • Matchers: q-gram name matcher, TF/IDF name matcher, decision tree matcher, Naïve Bayes matcher, SVM matcher
  • Combiners: average combiner, min combiner, max combiner, weighted sum combiner
  • Constraint enforcers: A* search enforcer
  • Match selectors: bipartite graph selector, threshold selector
• Collection of knobs (k), shown for the decision tree matcher
  • Characteristics of attributes
  • Split measure
  • Post-prune?
  • Size of validation set

Tuning Schema Matching Systems
Given a particular matching situation, how do we select the right matching components to execute, and how do we adjust the multiple knobs of those components?
• Tuning is necessary to get high matching accuracy
  • Crucial in many applications: automatic data exchange, data integration, peer-to-peer systems, …
• Tuning is extremely difficult
  • Huge space of knobs
  • Wide variety of matching techniques
  • Complex interactions among the components
  • No reasonable guideline for tuning
Developing efficient techniques for tuning is now crucial!

Formalization of Tuning Problem
• General tuning problem
  • Given
    • M: a schema matching tool
    • Workload: a set of matching scenarios (S1,T1), (S2,T2), …, (Sk,Tk)
    • U: a utility function defined over the process of matching two schemas
  • Find the knob configuration k* maximizing the utility over the workload
• Our tuning problem
  • Given
    • M: a schema matching tool
    • S: a source schema
    • Workload: a set of matching scenarios (S,T1), (S,T2), …, (S,Tk) (the Ti are future schemas)
    • U: matching accuracy
  • Find the knob configuration k* maximizing the average accuracy

The eTUNER Architecture
• Generate synthetic workload
• Tune a matching system M using the synthetic workload and tuning procedures stored in the repository
• Exploit user assistance to generate an even higher quality synthetic workload, if possible
[Figure: architecture, with the labels Source Schema S, User, Augmented Schema, Transformation Rules, Workload Generator, Synthetic Workload, Tuning Procedures, Staged Tuner, Matching Tool M = (L, G, k), Tuned Matching Tool M* = (L, G, k*)]

Generating Synthetic Workload
• Split S into V and U with disjoint data tuples
• Perturb V to obtain V1, …, Vn
  • Perturb # of tables
  • Perturb # of columns in each table
  • Perturb column and table names
  • Perturb data tuples in each table
• Ω1: a set of semantic matches between V1 and U
  • Example with tables EMPS and EMPLOYEES:
    EMPS.emp-last = EMPLOYEES.last
    EMPS.id = EMPLOYEES.id
    EMPS.wage = EMPLOYEES.salary($)
• Exploiting user assistance
  - Grouping semantically equivalent attributes over S
  - Adding domain specific perturbation rules

Tuning using Synthetic Workload
• Staged Tuning
  • Tune sequentially starting from the lowest-level components
  • Find best knob configuration for a component based on matching accuracy over the synthetic workload
[Figure: components arranged in Level 1 through Level 4, with an arrow labeled "Tuning direction"]

Experimental Results
[Figure: four bar charts, one per matching system (iCOMA, LSD, SimFlood, LSD-SF); x-axis: the domains Real Estate, Product, Inventory, Course; y-axis: 0 to 0.9; series labels: Off-the-shelf, Domain-independent, Domain-dependent, Source-dependent, eTUNER: Automatic, eTUNER: Human-assisted]
eTUNER achieves higher accuracy than current best methods, at virtually no cost to the user

Summary & Future Work
• Efficient tuning is extremely important
• Our contributions
  • Establish that tuning matching systems automatically is feasible
  • Synthesize workload to estimate the quality of a matching system with given knob configurations
  • Establish that staged tuning is a reasonable optimization technique
  • Experiment extensively over 4 real-world domains with 4 matching systems
• Future Work
  • Explore better search methods and more extensive evaluation
  • Deploy the idea of using synthetic input/output pairs to other applications (e.g. wrapper maintenance)
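The source-specific tuning problem stated on the poster can be written as a single optimization. This is one possible notation, not taken from the poster: n is assumed to be the number of synthetic scenarios (the poster reuses k for both the scenario count and the knobs), M_k is assumed to mean the tool M run with knob configuration k, and Ω_i is the set of correct matches for scenario i, following the poster's Ω1.

```latex
k^{*} \;=\; \arg\max_{k} \; \frac{1}{n} \sum_{i=1}^{n}
        U\bigl( M_{k}(S, T_{i}),\; \Omega_{i} \bigr)
```

With U taken as matching accuracy, the sum is the average accuracy over the workload, which is the quantity the poster asks k* to maximize.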

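The two ideas on the poster, generating a synthetic workload by perturbing the source schema and then tuning knobs stage by stage against it, can be sketched in a few lines. This is a toy illustration under stated assumptions, not eTUNER's implementation: the schema is a flat list of column names, the perturbation rules, the q-gram matcher, the F1 accuracy measure, and the knob names `q` and `threshold` are all hypothetical choices made for the example.

```python
import random

def perturb_name(name, rng):
    """Apply one hypothetical name-perturbation rule to a column name."""
    rules = [
        lambda s: s.replace("-", "_"),                                   # change delimiter
        lambda s: s[0] + "".join(c for c in s[1:] if c not in "aeiou"),  # drop vowels
        lambda s: "emp_" + s,                                            # add a prefix
        lambda s: s[:4],                                                 # truncate
    ]
    return rng.choice(rules)(name)

def make_scenario(source_cols, rng):
    """Build one synthetic target schema plus its known correct matches."""
    # Perturb the number of columns: drop each column with probability 0.2.
    kept = [c for c in source_cols if rng.random() > 0.2] or source_cols[:1]
    truth = {}
    for col in kept:
        truth.setdefault(perturb_name(col, rng), col)  # first mapping wins on a clash
    targets = list(truth)
    rng.shuffle(targets)
    return targets, truth

def qgrams(s, q):
    s = s.lower()
    return {s[i:i + q] for i in range(max(len(s) - q + 1, 1))}

def similarity(a, b, q):
    """Jaccard similarity of q-gram sets (the lowest-level component here)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)

def match(source_cols, target_cols, knobs):
    """Match each target column to its best source column if it clears the threshold."""
    predicted = {}
    for t in target_cols:
        best = max(source_cols, key=lambda s: similarity(s, t, knobs["q"]))
        if similarity(best, t, knobs["q"]) >= knobs["threshold"]:
            predicted[t] = best
    return predicted

def f1(predicted, truth):
    correct = sum(1 for t, s in predicted.items() if truth.get(t) == s)
    if correct == 0:
        return 0.0
    p, r = correct / len(predicted), correct / len(truth)
    return 2 * p * r / (p + r)

def avg_accuracy(source_cols, workload, knobs):
    scores = [f1(match(source_cols, tgt, knobs), truth) for tgt, truth in workload]
    return sum(scores) / len(scores)

def staged_tune(source_cols, workload, knobs, stages):
    """Tune one knob at a time, lowest level first, keeping earlier choices fixed."""
    knobs = dict(knobs)
    for name, candidates in stages:
        knobs[name] = max(
            candidates,
            key=lambda v: avg_accuracy(source_cols, workload, {**knobs, name: v}),
        )
    return knobs

rng = random.Random(0)
S = ["id", "emp-last", "emp-first", "wage", "hire-date", "dept"]
workload = [make_scenario(S, rng) for _ in range(30)]
default = {"q": 1, "threshold": 0.9}
stages = [("q", [1, 2, 3]), ("threshold", [i / 10 for i in range(1, 10)])]
tuned = staged_tune(S, workload, default, stages)
print("default:", default, avg_accuracy(S, workload, default))
print("tuned:  ", tuned, avg_accuracy(S, workload, tuned))
```

Because every synthetic target is derived from S, the correct matches are known for free, which is what lets accuracy be measured without hand-labeled data. Staged tuning searches each knob's candidates separately rather than the full cross product, so it trades some optimality for a much smaller search.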