Virtual Data Integration

Virtual Data Integration Helena Galhardas DEI IST (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Agenda • Terminology • Conjunctive queries and Datalog • Virtual Data Integration Architecture

(Known) Terminology (1) • Relational database composed of a set of relations (or tables) • Database schema includes a relational schema for each table + set of integrity constraints • Key constraints, functional dependencies, foreign key constraints • The rows of a relational table are called tuples (or records) • The NULL value means the values is not known or doesn’t exist • Equality test envolving NULL always returns FALSE • A database instance is a particular snapshot of the contents of the DB

(Known) Terminology (2) • Queries used to formulate users’ needs • Structured (SQL or XQuery) or unstructured (for the Web, e.g., list of keywords) • Q(D): result of applying query Q to the database D • Views are created when we want to reuse the same query expression in other queries • Materialized view: the answer is computed and maintained as the DB changes • Queries can also be used to specify relationships between schemas of data sources in a data integration context • Two different notations for queries: • SQL • Conjunctive queries, for formal purposes

Conjunctive queries • Based on Mathematical Logic • Conjunctive query has the form: Q(X):- R1(X1), ....., Rn(Xn), c1, ..., cm • R1(X1)....Rn(Xn) are the sub-goals (or conjuncts) and form the body of the query • Ris are database relations, Xi’s are tuples of variables and constants • The variables in X are called head variables or distinguished variables; the others are called existencial variables • The predicate Q denotes the answer relation of the query • The ci’s are interpreted atoms and have the form X θ Y, with X, Y either variables or constants (but at least one is a variable), and θ is an interpreted predicate such as =, <=, >, !=, >, >=

Semantics of conjunctive queries • Semantics of a conjunctive query Q over DB instance D: • Ψ: any mapping that maps each of the variables in Q to constants in D • Ψ(Ri): result of applying Ψ to Ri(Xi), which consists of ground atoms (constants) • Ψ(Q): the result of applying Ψ to Q(X), which consists of ground atoms • If • Each of Ψ(R1),...., Ψ(Rn) is in D, and • For each 1<=j<=m, Ψ(cj) is satisfied, • Then, Ψ(Q) is in the answer to Q over D

Correspondence between SQL queries and conjunctive queries • Base relations: Interview(candidate, name, recruiter, hireDecision, grade) EmployeePerformance(empID, name, reviewQuarter, grade, reviewer) • SQL: Select recruiter, candidate From Interview, EmployeePerformance Where recruiter = name And grade < 2.5 • Conjunctive query: Q1(Y,X):- Interview(X,D,Y,H,F), EmployeePerformance(E,Y,T,W,Z), W <= 2.5

Safety, disjunction • Conjunctive queries must be safe: • Every variable appearing in the head also appears in the body • Otherwise, the set of possible answers to the query may be infinite • Disjunction can be expressed writing two or more conjunctive queries with the same head predicate: Q1(Y,X) :- Interview(X,D,Y,H,F), EmployeePerformance(E,Y,T,W,Z), W<=2.5 Q1(Y,X) :- Interview(X,D,Y,H,F), EmployeePerformance(E,Y,T,W,Z), W >=3.9

Negation • Conjunctive queries with negated goals can also be considered: Q(X):- R1(X1), ..., Rn(Xn), ¬S1(Y1),…., ¬Sm(Ym) • The notion of safety is extended to: • Any variable appearing in the head of the query must also appear in a positive subgoal • To produce an answer for the query, the mappings from the variables of Q to the constants in the database must satisfy: Ψ(S1(Y1)),.... Ψ(Sm(Ym))  D

Datalog program • Set of rules, each of which is a conjunctive query • Instead of computing a single answer query, computes a set of intensional relations (IDB relations), one of them is designated the query predicate • Extensional Database (EDB) relations: are the database relations, given by a set of tuples; can occur only in the body of rules • Intensional Database (IDB) relations: are defined by a set of rules; can occur both in the head and in the body of rules

Datalog Terminology body • An example datalog rule: idb(x,y)  r1(x,z), r2(z,y), z < 10 • Irrelevant variables can be replaced by _ (anonymous var) • Extensional relations or database schemas (edbs) are relations only occurring in rules’ bodies, or as base relations: • Ground facts only have constants, e.g., r1(‘abc’, 123) • Intensional relations(idbs) appear in the heads – these are basically views • Distinguished variables are the ones output in the head subgoals head

Example • A database that includes a relation representing the edges in a graph: edge(X,Y) • The Datalog query: r1: path(X,Y) :- edge(X,Y) r2: path(X,Y) :- edge(X,Z), path(Z,Y) computes the paths in the graph • If r2 was replaced by: r3: path(X,Y) :- path(X,Z), path(Z,Y) It would produce the same result

Semantics of a Datalog program • Start with empty extensions for the IDB predicates • Choose a rule in the program and apply it to the current extension of the EDB and IDB relations • Add the tuples computed for the head to its extension • Continue applying the rules of the program until no new tuples are computed for the IDB relations • The answer to the query is the extension of the query predicate

Example (cont) r1: path(X,Y) :- edge(X,Y) r2: path(X,Y) :- edge(X,Z), path(Z,Y) Start with: edge(1,2), edge(2,3), edge(3,4) Apply r1 returns: path(1,2), path(2,3), path(3,4) Apply r2 once: path(1,3), path(2,4) Apply r2 twice: path(1,4)

Datalog is Relationally Complete We can map RA  Datalog: • Selection p: p becomes a datalog subgoal • Projection A: we drop projected-out variables from head • Cross-product r  s: q(A,B,C,D) :- r(A,B),s(C,D) • Join r ⋈ s: q(A,B,C,D) :- r(A,B),s(C,D), condition • Union r U s: q(A,B) :- r(A,B) ; q(C, D) :- s(C,D) • Difference r – s: q(A,B) :- r(A,B), ¬ s(A,B) • Great… But then why do we care about Datalog?

A Query We Can’t Answer in RA Recall our example of a binary relation for graphs or trees (similar to an XML Edge relation): edge(from, to) If we want to know what nodes are reachable: reachable(F, T, 1) :- edge(F, T) dist. 1 reachable(F, T, 2) :- edge(F, X), edge(X, T) dist. 2 reachable(F, T, 3) :- reachable(F, X, 2), edge(X, T) dist. 3 But how about all reachable paths? (Note this was easy in XPath over an XML representation -- //edge)

Recursive Datalog Queries Define a recursive query in datalog: reachable(F, T, 1) :- edge(F, T) distance 1 reachable(F, T, D + 1) :- reachable(F, X, D), edge(X, T) distance >1 What does this mean, exactly, in terms of logic? • There are actually three different (equivalent) definitions of semantics; we will use that of fixpoint: • We start with an instance of data, then derive all immediate consequences • We repeat as long as we derive new facts • In the RA, this requires a (restricted) while loop! • “Inflationary semantics” (which terminates in time polynomial in the size of the database!)

Our Query in RA + while(inflationary semantics, no negation) Datalog: reachable(F, T, 1) :- edge(F, T) reachable(F, T, D+1) :- reachable(F, X, D), edge(X, T) RA procedure with while: reachable += edge ⋈ literal1 while change { reachable += F, T, D(F ! X(edge) ⋈T ! X,D ! D0(reachable) ⋈ add1) } Note literal1(F,1) and add1(D0,D) are actually arithmetic and literal functions modeled here as relations.

A Special Type of Query: Conjunctive Queries A single Datalog rule with no “Ç,” “:,” “8” can express select, project, and join – a conjunctive query • Conjunctive queries are possible to reason about statically (Note that we can write CQ’s in other languages, e.g., SQL!) We know how to “minimize” conjunctive queries An important simplification that can’t be done for general SQL We can test whether one conjunctive query’s answers always contain another conjunctive query’s answers (for ANY instance) • Why might this be useful?

Example of Containment Suppose we have two queries:q1(S,C) :- Student(S,N),Takes(S,C),Course(C, X), inCIS(C),Course(C,’DB & Info Systems’)q2(S,C) :- Student(S,N),Takes(S,C),Course(C,X) Intuitively, q1 must contain the same or fewer answers vs. q2: • It has all of the same conditions, except one extra conjunction (i.e., it’s more restricted) • There’s no union or any other way it can add more data We can say that q2 contains q1 because this holds for any instance of our DB {Student, Takes, Course}

Agenda • Terminology • Conjunctive queries and Datalog • Virtual Data Integration Architecture

Building a Data Integration System Create a middleware mediator or data integration system over the sources • Can be warehoused (a data warehouse) or virtual • Presents a uniform query interface and schema • Abstracts away multitude of sources; consults them for relevant data • Unifies different source data formats (and possibly schemas) • Sources are generally autonomous, not designed to be integrated • Sources may be local DBs or remote web sources/services • Sources may require certain input to return output (e.g., web forms) • binding patterns describe these

Logical components of a virtual integration system Is built for the data integration application and contains only the aspects of the domain relevant to the application. Most probably will contain a subset of the attributes seen in sources Specify the properties of the sources the system needs to know to use their data. Main component is semantic mappings that specify how attributes in the sources correspond to attributes in the mediated schema, how to resolve differences in how data values are specified in different sources Other info is whether sources are complete Programs whose role is to send queries to a data source, receive answers and possibly apply some basic transformation to the answer

Example: a data integration scenario

Semantic mappings Semantic mappings in the source descriptions describe: • the relationship between the sources and the mediated schema, for example: • mapping of source S1 states it contains movies; the attribute name in Movies maps to attribute title in the mediator Movie relation; the Actors relation in the mediated schema is a projection of the Movies source on the attributes name and actors • Whether the sources are complete, for example: • Source S2 may not contain all the movie showing times in the entire country • Can specify limited access patterns to the sources, e.g.,: • All the playing times sources require a movie title as input

Components of a data integration system

Query reformulation (1) • Rewrite the user query that was posed in terms of the relations in the mediated schema, into queries referring to the schemas of data sources • Result is called a logical query plan: • Set of queries that refer to the schemata of the data sources and whose combination will yield the answer to the original query

Query reformulation (2) • Ex: SELECT title, startTime FROM Movie, Plays WHERE Movie.title = Plays.movie AND location= ‘New York’ AND director= ‘Woody Allen’ • Tuples for Movie can be obtained from source S1 but attribute title needs to be reformulated to name • Tuples for Plays can be obtained from S2 or S3. Since S3 is complete for showings in NY, we choose it • Since source S3 requires the title of a movie as input and the title is not specified in the query, the query plan must first access S1 and then feed the movie titles returned from S1 as inputs to S3.

Query optimization • Acepts a logical query plan as input and produces a physical query plan • Specifies the exact order in which sources are accessed, when results are combined, which algorithms are used for performing operations on the data • Ex: • The optimizer will decide the join algorithm to combine results from S1 and S3. It may stream tuples from S1 and input them into S3, or it may batch them up before sending them to S3

Query execution • Responsible for the execution of the physical query plan • Dispatches the queries to the individual sources through the wrappers and combines the results as specified by the query plan. • Also may ask the optimizer to reconsider its plan based on its monitoring of the plan’s progress (e.g., if source S3 is slow)

Referências • Chapter 2, Draft of the book on “Principles of Data Integration” by AnHai Doan, Alon Halevy, Zachary Ives (in preparation). • Slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives) • Raghu Ramakrishnan and Johannes Gehrke, “Database Management Systems”, 3rd edition, McGraw-Hill • A. Silberchatz et al, “Database System Concepts”, 5th edition, McGraw-Hill • S. Abiteboul et al, “Foundations of Databases”, Addison Weley, 1995

Virtual Data Integration