Data Exchange: Semantics and Query Answering

Data Exchange: Semantics and Query Answering Ronald Fagin -- IBM Almaden Research Center Phokion G. Kolaitis -- UC Santa Cruz Renee J. Miller -- University of Toronto Lucian Popa -- IBM Almaden Research Center IBM Almaden - November 12, 2002 (To appear in ICDT 2003)

Motivation and Overview • Data exchange problem: • How to restructure data from a source schema to a target schema, according to a given specification • Main motivation for this work: • Understanding of fundamental issues that lie underneath data exchange systems such as EXPRESS and Clio • Main challenge: • Inherent under-specification: • Specification (as we shall see) must be simple and intuitive, but • There are many ways in which the restructuring can be performed ! • Question: did we make the right choice in the design of Clio ? • Our approach (only relational case, so far): • Define and study universal solutions • Show this is the “best” way of performing data exchange • Study computational aspects • Study what happens after data exchange: query answering

The Data Exchange Problem Target schema T Source schema S t st Assume a data exchange setting: • source schemaS, • target schemaTwith a set tdependencies(see next) • set stsource-to-target dependencies(see next) • The data exchange problem is the following: Input: • source instanceI Output: • target instanceJsuch that:<I, J> standJ t (call suchJa solution forI) I J

Source-to-target Dependencies • For most practical purposes, st contains: • source-to-targettuple-generating dependencies (tgds) : S(x)  y T(x, y) e.g. DeptEmp(did, mgr_name, eid) M. Dept (did, M, mgr_name)  Emp (eid, did) (Move data from source table DeptEmp into two target tables, Dept and Emp. The existential variable M is an “unspecified” manager id) Dept did mgr_id mgr_name DeptEmp did mgr_name eid Emp eid did

Target Dependencies • The second, equally important, part of the specification, are the target dependencies t: • tgds : T(x)  y T(x, y) e.g. Dept (did, mgr_id, mgr_name) D. Emp (mgr_id, D) (A foreign key constraint in the target) • equality generating dependencies (egds): T(x)  (x1=x2) e.g. Emp (e, d1)  Emp (e, d2)  (d1 = d2) (A target key constraint) Dept did mgr_id mgr_name Emp eid did

Questions (To be Answered Next) • When more than one solution exists, how do we choose a “best” one ? • How do we compute a “best” solution ? • Is there always a solution ? Is there always a “best” solution ? • How does query answering on the chosen solution behave ?

Universal Solutions = “Best” Solutions

J1 T a0 b0 c0 <a0 b’0 c’0> P(a,b,c)  YZ.T(a,Y,Z) J T h1= {Y0 -> b0, Z0 -> c0, … } a0Y0 Z0 X0 b0U0 V0 W0 c0 Q(a,b,c)  XU.T(X,b,U) <a’’0 b0 c’’0> J2 h2 T a0 b0Z1 V1 W1c0 R(a,b,c)  VW.T(V,W,c) <a’’’0 b’’’0 c0> Existence of Multiple Solutions source target X0 , Y0 , Z0 … represent “unknown” values (or “nulls”) P A B C • There may be manysolutions for the target instance (J, J1, J2, etc.) • However,J seems to be more general: • there exist homomorphismsh1: J  J1 andh2: J  J2 (see definition next) • but none fromJ1 or J2toJ • intuitively,J1andJ2have extra information T Q A B C A B C R A B C

J1 T a0 b0 c0 J T a0Y0 Z0 X0b0U0 V0 W0 c0 h1= {Y0 -> b0, Z0 -> c0, … } Homomorphisms • As we have seen, the values of a target instance can be either: • constants (i.e. values coming from the source instance), or • nulls (unknown values) • Definition.Assume J1 and J2 are such target instances. A homomorphismh: J1 -> J2 is a mapping from values of J1 to values of J2 such that: • h(c) = c, for constants c (nulls of J1 can be mapped to any values of J2) • for every tuple <a1, …, an> in relation T of instance J1: < h(a1), … h(an) > must be a tuple in relation T of instance J2 • Example:

J T a0Y0 Z0 X0b0U0 V0 W0 c0 Universal Solution Definition. Assume a data exchange setting(S, T, st, t). Given source instance I, a universal solution forIis a target instance Jsuch that: (1)J is a solution for I (2) for every solutionJ’ for I, there exists homomorphismh: J  J’ • For the previous example, J is a universal solution. J1 and J2 are not. • Among all solutions, universal solutions are special: • They contain no more and no less than the amount of information given by the specification

Fact: • Uniqueness up to homomorphic equivalence: • If J1 and J2 are universal for I then there are homomorphisms between J1 and J2 in both directions • Representation of the space of solutions: • Sol(I1) = Sol(I2) iff J1 and J2 are homomorphically equivalent • We adopt the universal solution as the notion of “best” solution. • Later we will see another justification for universal solutions in terms of query answering.

When do universal solutions exist ? • How do we compute a universal solution ?

Computing Universal Solutions

Added in a first chase step (M0 is a null) J < CS M0 Mary > I CS Mary E003 < E003 CS > < M0 D0 > Added in a second chase step st : DeptEmp(did, mgr_name, eid) -> M. Dept (did, M, mgr_name)  Emp (eid, did) t : Dept (did, mgr_id, mgr_name) -> D.Emp (mgr_id, D) Chase • We canonically generate a universal solution by using the chase: • Given source instance I, start with an empty target instance J • Generate tuples in J by applying the dependencies in standt. • Example: Dept did mgr_id mgr_name DeptEmp did mgr_name eid Emp eid did

This process is repeatedly applied: • for all the source tuples and for the generated tuples, • as long as there are dependencies that are not yet satisfied • The chase may be infinite (cyclic t ) … • … or it may fail (e.g. target key constraints that are not possible to satisfy for the given source data) • (details in the paper) • However, if the chase successfully terminates, the resulting target instance is a solution.

Canonical Generation of Universal Solutions • Theorem.Assume a data exchange setting(S, T, st, t). Given source instance I: • If the chase is finite and successful then its result is a universal solution. • If the chase fails then there is no solution. • Thus, the chase is a procedure for computing universal solutions, provided that: • Solutions exist, and • The chase is finite • We call universal solutions computed by the chase canonical universal solutions When can we guarantee that the chase is finite ?

Weakly Acyclic Sets of Dependencies • Some cyclic sets of dependencies may cause infinite chases • In such case no universal solution may exist, and the semantics of the data exchange is undefined • Still there are cyclic sets of dependencies that behave well and are quite useful • Weakly acyclic sets of dependencies (defined in the paper): • Cover many practical cases of target constraints • Allow for restricted cyclicity • The chase is guaranteed to be finite

Polynomial-Time Chase Theorem. Let be a weakly acyclic set of dependencies. For every instance K, the chase of K with  can be computed in polynomial time. • Corollary.Assume a data exchange setting(S, T, st, t)such that tis a weakly acyclic set of dependencies. • For every source instance I, the existence of a solution can be checked in polynomial time • For every source instance I, if a solution exists then a universal solution can be produced in polynomial time.

Next: what happens after data exchange ? • In particular, how is subsequent query answering affected by our choice of a solution (universal solution) ?

Query Answering

Query Evaluation on a Solution q Target schema T Source schema S t st • Assume a fixed data exchange setting with a source instance I. Suppose that a System 1 chooses a solution J for data exchange. • A query q can now be asked against the target. • The evaluation of q, in System 1, is q(J). • However, a System 2 materializing a different solution J’ may give a different evaluation q(J’). • Different choices of J (for the same I) imply possibly different query evaluations. • Is there a notion of the “right” set of answers to qwith respect to I ? I J

Certain Answers • We will use a notion that has been around in the context of data integration and incomplete databases, where queries are asked against a set of possible databases. Definition. GivenI andq, a tupletis a certain answer if: • t  q(J), for every solution J • Notation: certain(q, I)= the set of all certain answers • Thus,tis certain if it is in the answer ofqon every solution. • The certain answers provide well-defined semantics to query answering because they are independent of the choice of a solution.

Can we compute the certain answers based just on our chosen (universal) solution ?

Positive Queries • Proposition.Assume a data exchange setting(S, T, st, t) and a source instance I. • Let q be a positive query. If J is a universal solution, then certain(q, I) = q(J) . • Let J be a solution such that for every positive query q we have that certain(q, I) = q(J) .Then J is a universal solution. Note: In the above: • Positive query means union of SPJ queries • q(J) means evaluate q on J and then throw away tuples that contain nulls) • Thus, the certain answers of positive queries can be computed by evaluating them on any universal solution. • Moreover, this property characterizes universal solutions.

q(u, v) :- xz. T(u, x) T(z, v)  xz a0 X0 Z0 a0 J(universal) R <a0, b0> d1 A B T A B S d2 J2 (not universal) A B a0 a0 <a1, a0> d1: R(a,b)  X.T(a,X) d2: S(a,b)  Z.T(Z,b) Conjunctive Queries with Inequalities • The situation changes when negation is involved (even in the very simple form of ). • Example: • It can be verified that: • <a0,a0> q(J), but<a0,a0>  q(J2)(thus, not a certain answer). • Hencecertain(q, I)  q(J) • The universal solution gives extra answers

For conjunctive queries with inequalities, we have seen that simple query evaluation on a universal solution is not enough for computing the certain answers. • Question: Can we find a different SQL query q* such that when evaluated on a universal solution gives the set of certain answers of q ? • There are examples for which such query q* exists. • However, we show next that the answer is “no”, in general.

Complexity: Two or More Inequalities Theorem.Computing the certain answers of unions of conjunctive queries with at most two inequalities per term is coNP-hard, even in a restricted data exchange setting (LAV). • [AD98] proved a similar result for the case of conjunctive queries with six or more inequalities. • The coNP-hardness implies: • the certain answers cannot be computed by evaluating the query q (or any other SQL query q*) on a polynomial-time generated universal solution (unless P = NP).

Complexity: One Inequality Theorem.Assume a data exchange setting(S, T, st, t)such that tis a weakly acyclic set of dependencies. Let q be a union of conjunctive queries with at most one inequality per term. Let I be a source instance and let J be an arbitrary universal solution for I. Then there exists a polynomial-time algorithm with input Jthat computes certain(q, I). • Thus, computing the certain answers for such queries is a tractable problem. • Moreover, this computation can take place on any universal solution. • The universal solution has all the information needed to compute the certain answers. • We show next that the problem of computing the certain answers, even for this tractable case, cannot be solved by means of SQL query evaluation.

First-Order Inexpressibility Theorem.There exists a data exchange setting and a boolean conjunctive query q with one inequality, for which there is no first-order query q* over the canonical universal solution such that certain(q, I) = q*(J) . • This is a strong inexpressibility result that shows that in data exchange we cannot use the notion of certain answers for answering queries with inequalities. • (In practice, instead of going for certain answers, we should just use query evaluation on the universal solution) • The proof uses an original combination of finite model theory techniques and the chase.

Conclusions • Universal solutions are a good candidate for using in data exchange • Clio produces such universal solution (in the relational case) • All universal solutions are equally good for answering positive queries. • Simple query evaluation has the same semantics as that of the certain answers. • For queries with inequalities, different universal solutions may give different query evaluations which may yet be different from the certain answers. • There is no hope to find the certain answers by means of SQL query evaluation on a universal solution

Future Work • Among all universal solutions, is there a universal solution that approximates, in a best way, the certain answers ? If yes, can this be computed efficiently ? • Extension to semantics and query answering for data exchange in the nested (XML) case.

The End

Relationship to Clio • Source-to-target tgds are the same formalism that Clio uses internally (for the relational case): • st is generated by Clio in the semantic translation phase [VLDB02] from correspondences. • User input • Then Clio generates, based on st , a set of queries in the data translation phase[VLDB02] • These queries compute a solution • Is Clio’s solution a good one ? (Since other solutions are also possible) • Here we try to understand, formally, the concept of “good” solutions • t is more general than the target constraints that Clio can currently handle.

Data Exchange: Semantics and Query Answering