Implementing Mapping Composition

Implementing Mapping Composition Todd J. Green* University of Pennsylania with Philip A. Bernstein (Microsoft Research), Sergey Melnik (Microsoft Research), Alan Nash (UC San Diego) VLDB 2006 Seoul, Korea *Work partially supported by NSF grants IIS0513778 and IIS0415810

Schema mappings • Mapping: a correspondence between instances of different schemas Names SID, Name Students Name, Address m Addresses SID, Address S1 S2 Students Name,Address(Names ⋈ Addresses)

Names SID, Name Names SID, Name Local SID, Address Students Name, Address, Country Addresses SID, Address, Country Foreign SID, Address, Country Applications of mappings Names  Names σCountry = KR(Addresses)  SID,Address(Local)£{KR} σCountry  KR(Addresses)  Foreign • Schema evolution Students Name,Address,Country(Names ⋈ Addresses) ... m12 m23 S1 S2 S3

Applications of mappings • Data integration, data exchange Sn Addresses SID, Address, Country Names SID, Name ... m1 mn Students Name,Address (Names ⋈ Addresses) NamesNames Local SID,Address(Country = KR(Addresses)) Foreign Country KR(Addresses) S1 Sn−1 Foreign SID, Address, Country Students Name, Address, Country Names SID, Name Local SID, Address ...

Requirements for constraints • “First attribute in R is a key for R” 2,4(R ⋈1=3 R) µ2,2(R) • “View V equals R joined with S” VµR ⋈ S, V¶R ⋈ S • “Second attribute of R is a foreign key in S” 2(R) µ1(S) 2,4(S ⋈1=3 S) µ2,2(S) • Data integration, data exchange – GLAV R ⋈ S µT ⋈ U

Names Names σCountry = KR(Addresses)  SID,Address(Local)£{KR} σCountry  KR(Addresses)  Foreign Students Name,Address, Country (Names⋈ (SID,Address(Local)£{KR} [ Foreign)) Names SID, Name Students Name,Address,Country(Names⋈Addresses) Names SID, Name Local SID, Address Students Name, Address, Country m12 m12  m23 m23 Addresses SID, Address, Country Foreign SID, Address, Country S2 Mapping composition S1 S3

Composition is hard • Hard part: write composition in the same language as the input mappings. Depending on language: • Not always possible • Not even decidable whether possible • Strategy 1: use powerful (second-order) mapping language closed under composition [FKPT04] • Not supported by DBMS today • Expensive to check • Source-target restriction • Strategy 2: settle for partial solutions [NBM05] • Containment mappings  easier integration with DBMS • The strategy we adopt in this work

Our contributions • New algorithm for composition problem • Incorporates view unfolding and left-composition (new technique) • Makes best effort in failure cases • Algebraic rather than logic-based mappings • Use of monotonicity to handle more operators • Modular and extensible factoring of algorithm • First implementation of composition • Experimental evaluation

U(∙,∙,∙) S(∙,∙) V(∙,∙) m12 m23 R(∙,∙,∙) T(∙,∙) W(∙,∙) S ⊆(U), T= V – W S1 S2 S3 R⊆ S⋈T Formal definition of composition • Mapping: set of pairs of instances of db schemas • The composition m12 ± m23 is the mapping {hA,Ci : (9B)(hA,Bi2m12 and hB,Ci2m23)} where A,B,C are instances of S1,S2,S3 • Composition problem: find constraints in same language as input mappings giving the composition of the input mappings • Example: S1 = {R}, S2 = {S,T}, S3 = {U,V,W} R⊆S⋈T, S⊆(U), T=V – W ) R⊆(U)⋈(V - W)

Best-effort composition problem • Composition not always possible • “Best-effort” composition problem: compute set of constraints equivalent to input constraints, but with as many symbols from S2 eliminated as possible R⊆U, R⊆V, 1,4(2=3(UU))⊆U, 1,4(2=3(VV))⊆V, U⊆T, V⊆T Can eliminate U (cross out left column) or V(right column), but not both [NBM05]

Composition algorithm overview For each relation Rin S2 • Try to eliminate R via (1)view unfolding Replace = by pairs of ⊆, ⊇ For each relation R in S2 not yet eliminated • Try to eliminate Rvia (2)left compose • Else, try to eliminate R via (3)right compose Output: New constraints and list of relations successfully eliminated

(1) View unfolding • Idea: exploit equality constraints (if we have any) • Standard technique: substitute view definition for occurrences of view relation in mappings T=V – W, R⊆S ⋈T, TX⊆(U) R⊆S ⋈(V – W), (V – W) X⊆(U) • Body must not mention view relation itself • Doesn’t matter what else is in body • Can substitute everywhere

(2) Left compose • “View unfolding” for containment constraints (V) ⊆R – U, R⊆S ⋈ T  (V) ⊆ (S ⋈T) – U • Needs monotonicity of expressions in R. E1⊆E2(R), R⊆E3´E1⊆E2(E3) if E2(R) is monotone in R(and R not in E3) • Partial check for monotonicity “Is S – (T – R) monotone in R?”

Normalization for left compose • Need one constraint of form R⊆E1 • Use identities to normalize, e.g.: • R⊆E1 and R⊆E2 iff R⊆E1E2 • E1E2⊆E3 iff E1⊆E3 and E2⊆E3 • (E1) ⊆E2 iff E1⊆E2Dr • More identities in paper • After left compose, try to eliminate D

(3) Right compose • Dual to left compose, from [NBM05] • Example: S ⋈TR, R – U(V)  (S ⋈T) – U (V) • Monotonicity check needed here too • Normalization may introduce Skolem functions • E1(E2) iff f(E1) E2 • Must eliminate Skolem functions after composition • Lots of effort coding this step!

User-defined operators • User specifies: • Monotonicity of operator in its arguments “If E1 monotone in R and E2 antimonotone in R or independent of R, then E1 * E2 monotone in R” “if E1 monotone in R or independent of R and E2 antimonotone in R, then E1 * E2 monotone in R” • Identities for normalization “E1 * E2E3 iff E1E2E3 ” • User-defined operators and standard relational operators treated uniformly

Implementation • 12K lines of C# code, command-line tool # Test case 13: PODS05 example 2 SCHEMA R(2), S(2), T(2) CONSTRAINTS R <= S, P_{0,2} J_{0,1:1,2} (S S) <= R, S <= T ELIMINATE S; Output: P_{0,2} J_{0,1:1,2}(R R) <= R, R <= T

Experimental evaluation • First attempt at a composition benchmark • Schema editing and schema reconciliation scenarios • “Add a column to R to produce S”: (R) = S • Measure • % of symbols eliminated • Running time • As a function of • Editing primitives allowed, length of edit sequence, presence/absence of keys, starting schema size, … • Synthetic data

Summary of results • Algorithm often effective in eliminating most or even all relation symbols from S2 • Running time in subsecond range even for large problems containing hundreds of constraints • Certain schema editing primitives problematic • Key constraints did not reduce effectiveness, although did increase running time (and output size)

Schema editing • Random starting schema (30 relations of 2-10 attributes) • 100 random edits • 100 different runs, sorted by execution time

Schema reconciliation (1) • Random schema (30 relations of 2-10 attributes), random edits • Point represents median time of reconciliation step of 500 runs

Schema reconciliation (2) • Random schema (variable # relations of 2-10 attributes) • 100 random edits • 100 different runs, sorted by execution time

Related work • [MH03] J. Madhavan, A. Y. Halevy. Composing mappings among data sources. VLDB, 2003. • [FKPT04] R. Fagin, Ph. G. Kolaitis, L. Popa, W.C. Tan. Composing schema mappings: second-order dependencies to the rescue. PODS, 2004. • [NBM05] A. Nash, P. A. Bernstein, S. Melnik. Composition of mappings given by embedded dependencies. PODS, 2005.

Conclusion and future work • We motivated and described the mappingcomposition problem • We presented an implementation of a practical new algorithm for the composition problem • We also presented an experimental evaluation • To do: theoretical analysis of impact of user-defined operators • To do: output constraints from algorithm can be a mess! How to clean up?

Implementing Mapping Composition