CHAPTER 2: MANIPULATING QUERY EXPRESSIONS. PRINCIPLES OF DATA INTEGRATION. ANHAI DOAN ALON HALEVY ZACHARY IVES. Introduction. How does a data integration system decide which sources are relevant to a query? Which are redundant? How to combine multiple sources to answer a query?

Download Presentation

CHAPTER 2: MANIPULATING QUERY EXPRESSIONS

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

Introduction • How does a data integration system decide which sources are relevant to a query? Which are redundant? How to combine multiple sources to answer a query? • Answer: by reasoning about the contents of data sources. • Data sources are often described by queries / views. • This chapter describes the fundamental tools for manipulating query expressions and reasoning about them.

Outline • Review of basic database concepts • Query unfolding • Query containment • Answering queries using views

Basic Database Concepts • Relational data model • Integrity constraints • Queries and answers • Conjunctive queries • Datalog

Query Answers • Q(D): the set (or multi-set) of rows resulting from applying the query Q on the database D. • Unless otherwise stated, we will consider sets rather than multi-sets.

Integrity Constraints (Keys) • A key is a set of columns that uniquely determine a row in the database: • There do not exist two tuples, t1 and t2 such that t1≠ t2 and t1 and t2have the same values for the key columns. • (EmpID,reviewQuarter) is a key for EmployeePerf

Integrity Constraints (Functional Dependencies) • A set of attribute A functionally determines a set of attributes B if: whenever , t1 and t2 agree on the values of A, they must also agree on the values of B. • For example, (EmpID, reviewQuarter) functionally determine (grade). • Note: a key dependency is a functional dependency where the key determines all the other columns.

Integrity Constraints (Foreign Keys) • Given table T with key B and table S with key A: A is a foreign key of B in T if whenever a S has a row where the value of A is v, then T must have a row where the value of B is v. • Example: the empID attribute of EmployeePerf is a foreign key for attribute emp of Employee.

General Integrity Constraints Tuple generating dependencies (TGD’s) Equality generating dependencies (EGD’s): right hand side contains only equalities. Exercise: express the previous constraints using general integrity constraints.

Conjunctive Queries Q(X,T) :- Interview(X,D,Y,H,F), EmployeePerf(E,Y,T,W,Z), W < 2.5. Joins are expressed with multiple occurrences of the same variable select recruiter, candidate fromInterview, EmployeePerf where recruiter=name AND grade < 2.5

Conjunctive Queries (interpreted predicates) Q(X,T) :- Interview(X,D,Y,H,F), EmployeePerf(E,Y,T,W,Z), W < 2.5. Interpreted (or comparison) predicates. Variables must also appear in regular atoms. select recruiter, candidate fromInterview, EmployeePerf where recruiter=name AND grade < 2.5

Conjunctive Queries (negated subgoals) Q(X,T) :- Interview(X,D,Y,H,F), EmployeePerf(E,Y,T,W,Z), OfferMade(X, date). Safety: every head variable must appear in a positive subgoal.

Unions of Conjunctive Queries Multiple rules with the same head predicate express a union Q(R,C) :- Interview(X,D,Y,H,F), EmployeePerf(E,Y,T,W,Z), W < 2.5. Q(R,C) :- Interview(X,D,Y,H,F), EmployeePerf(E,Y,T,W,Z), Manager(y), W > 3.9.

Datalog (recursion) Database: edge(X,Y) – describing edges in a graph. Recursive query finds all paths in the graph. Path(X,Y) :- edge(X,Y) Path(X,Y) :- edge(X,Z), path(Z,Y)

Outline • Review of basic database concepts • Query unfolding • Query containment • Answering queries using views

Query Unfolding • Query composition is an important mechanism for writing complex queries. • Build query from views in a bottom up fashion. • Query unfolding “unwinds” query composition. • Important for: • Comparing between queries expressed with views • Query optimization (to examine all possible join orders) • Unfolding may even discover that the composition of two satisfiable queries is unsatisfiable! (exercise: find such an example).

Query Unfolding Algorithm • Find a subgoal p(X1 ,…,Xn) such that p is defined by a rule r. • Unify p(X1 ,…,Xn) with the head of r. • Replace p(X1 ,…,Xn) with the result of applying the unifier to the subgoals of r (use fresh variables for the existential variables of r). • Iterate until no unifications can be found. • If p is defined by a union of r1, …, rn, create n rules, for each of the r’s.

Query Unfolding Summary • Unfolding does not necessarily create a more efficient query! • Just lets the optimizer explore more evaluation strategies. • Unfolding is the opposite of rewriting queries using views (see later). • The size of the resulting query can grow exponentially (exercise: show how).

Outline • Review of basic database concepts • Query unfolding • Query containment • Answering queries using views

Query Containment: Motivation (2) Furthermore, the query Q5 that requires going through two hubs is contained in Q’3 We need algorithms to detect these relationships.

Query Containment and Equivalence: Definitions Query Q1contained in query Q2 if for every database D Q1(D) Q2(D) Query Q1 is equivalent to query Q2 if Q1(D) Q2(D) and Q2(D) Q1(D) Note: containment and equivalence are properties of the queries, not of the database!

Notation Apology • Powerpoint does not have square • The book uses the square notation for containment, but the slides use the rounded version.

Why Do We Need It? • When sources are described as views, we use containment to compare among them. • If we can remove sub-goals from a query, we can evaluate it more efficiently. • Actually, containment arises everywhere…

Containment: Conjunctive Queries • No interpreted predicates (,) • or negation for now. • Recall semantics: • if maps the body subgoals to tuples in D • then, is an answer.

Example Containment Mapping Q’3(X,Z) :- Flight(X,U), Hub(U), Flight(U,Y), Hub(W), Flight(W,Y), Flight(Y,Z) Q’’3(X,Z) :- Flight(X,U), Hub(U), Flight(U,Y), Flight(Y,Z) Identity mapping on all variables, except:

Theorem[Chandra and Merlin, 1977] Q1 contains Q2 if and only if there is a containment mapping from Q1 to Q2. Deciding whether Q1 contains Q2 is NP-complete.

Proof (sketch) (if): assume exists: Let t be an answer to Q2 over database D (i.e., there is a mapping from Vars(Q2) to D) Consider .

Proof (only-if direction) Assume containment holds. Consider the frozen database D’ of Q2. (variables of Q2 are constants in D’). The mapping from Q1 to D’: containment map.

Frozen Database Q(X,Z) :- Flight(X,U), Hub(U), Flight(U,Y), Flight(Y,Z) Frozen database for Q(X,Z) is: Flight: {(X,U), (U,Y), (Y,Z)} Hub: {(U)}

Two Views of this Result • Variable mapping: • a condition on variable mappings that guarantees containment • Representative (canonical) databases: • We found a (single) database that would offer a counter example if there was one • Containment results typically fall into one of these two classes.

Union of Conjunctive Queries Theorem: a CQ is contained in a union of CQ’s if and only if it is contained in one of the conjunctive queries. Corollary: containment is still NP-complete.

CQ’s with Comparison Predicates A tweak on containment mappings provides a sufficient condition: • : Vars(Q1) Vars(Q2): and

Query Refinements We consider the refinements of Q2 The red containment mapping applies for the first refinement and the blue to the second.

Constructing Query Refinements • Consider all complete orderings of the variables and constants in the query. • For each complete ordering, create a conjunctive query. • The result is a union of conjunctive queries.

Complete Orderings Given a conjunction C of interpreted atoms over a set of variables X1,…,Xn a set of constants a1,…,am CT is a complete ordering if: CT |= C, and

Query Refinements Let CT be a complete ordering of C1 Then: is a refinement of Q1

Theorem: [Queries with Interpreted Predicates][Klug, 88, van der Meyden, 92] Q1 contains Q2 if and only if there is a containment mapping from Q1 to every refinement of Q2. Deciding whether Q1 contains Q2 is In practice, you can do much better (see CQIPContainment Algorithm in Sec. 2.3.4