Leveraging Materialized Views for Query Optimization and Data Integration

CS848 Presentation Heng YU (Henry) h3yu@hopper.uwaterloo.ca

Paper to present Answering queries using views: A survey by A. Y. Halevy VLDB Journal 10: pp. 270-294 2001

Outline • Introduction with examples • Formal problem definitions • Conditions of view usability • Using materialized views in query optimization • Answering queries using views in data integration • Theoretical results • Extensions • Conclusion and challenges

Introduction

Problems (informal) Given a query Q and a set of views V1, .., Vnover a database schema, • Is it possible to answer Qusing only the answers to V1, .., Vn? • What is the maximal set of tuples in the answer of Q that we can get from V1, .., Vn? • If we can access both the views and the database relations, what is the cheapest query execution plan for answering Q?

Fields of applications • Query optimization • Physical data independence • Data integration • More: e.g. semantic cache

Example: a university schema Prof(name, area) Course(c-number, title) Teaches(prof, c-number, quarter) Registered(student, c-number, quarter) Major(student, dept) Works(prof, dept) Advises(prof, student) Keys: Prof(name) Courses(c-cumber) graduate course c-cumber ≥ 400 Ph.D. course c-cumber ≥ 500

Query Optimization Suppose we have a view for graduate course registration info: create view Graduate as select Registered.student, Course.title, Course.c-cnumber, Registered.quarter from Registered.course where Registered.c-number = Course.c-number and Course.c-number ≥ 400

Query optimization (cont.) Want to query students registering in Ph.D. level courses taught by a professor who in interested in DB area: select Registered.student, Course.title from Teaches, Prof, Regestered, Course where Prof.name = Teaches.prof and Teaches.c-number = Register.c-number and Teachers.quarter = Registered.quarter and Registered.c-number = Course.c-number and Course.c-number ≥ 500 and Prof.area = ‘DB’

Queryselect Registered.student, Course.titlefrom Teaches, Prof, Registered, Coursewhere Prof.name = Teaches.prof and Teaches.c-number = Register.c-number and Teachers.quarter = Registered.quarter and Registered.c-number = Course.c-number and Course.c-number ≥ 500 and Prof.area = ‘DB’ View create view Graduate as select Registered.student, Course.title, Course.c-cnumber, Registered.quarter from Registered. Course where Registered.c-number = Course.c-number and Course.c-number ≥ 400

Query optimization (cont.) Result of query rewriting select Graduate.student, Graduate.title from Teachers, Prof, Graduate where Prof.name = Teachers.prof and Teaches.c-number = Graduate.c-cumber and Teaches.quarter = Graduate.quarter and Graduate.c-number ≥ 500 and Prof.area = ‘DB’

Maintaining physical data independence • Relational database systems rely on 1-1 mapping between relations and files. • In object-oriented and semistructured databases, logical model is more redundant and does not reflect optimal physical design. • Physical storage can be described as views over the logical model. e.g. GMAP (Tsatalos et al. 96)

Maintaining physical data independence (cont.) GMAP (generalized multi-level access paths) def.gmap G1 as b+-tree by given Student.name select Department where Student.major Department. def.gmap G2 as b+-tree by given Student.name select Course.c-number where Student registered Course def.gmap G3 as b+-tree by given Course.c-number select Department where Student.registered Course and Student major Department

Maintaining physical data independence (cont.) Query: select Student.name, Department where Student registered Course and Student major Department and Course.c-number ≥ 500 Plans: • PStudent.name, Department (SCourse.c-number≥500 (JStudent.name(G1, G2))) • JCourse.c-number (SCourse.c-number≥50(G3), G2)

Data integration • Providing a uniform query interface to a multitude of autonomous heterogeneous data sources. • Giving users a mediated schema. • Local as View: specifying data source descriptions as a view over the mediated schema.

Data integration (cont.) Example: Prof(name, area) Course(c-number, title, univ) Teaches(prof, c-number, quarter, univ) Register(student, c-number, quarter) Major(student, dept) Works(prof, dept) Advises(prof, student)

Data integration (cont.) Suppose we have only 2 views available: create view DB-courses as select Course.title, Teaches.prof, Course.c-number, Course.univ from Teaches, Course where Teaches.c-number = Course.c-number and Teaches.univ = Course.univ and Course.title = “Database Systems” create view UW-phd-courses as select Course.title, Teaches.prof, Course.c-number,Course.univ from Teaches, Course where Teaches.c-number = Course.c-number and Course.univ = ‘UW’ and Teaches.univ = ‘UW’ and Course.c-number ≥ 500

Data integration (cont.) • Query who teaches database courses in UW: select prof from DB-courses where univ = ‘UW’ • Query all graduate courses in UW: select title, c-number from DB-courses where univ = ‘UW’ and c-number ≥ 400 UNION select title, c-number from UW-phd-courses

Comparison for two applications

Formal Problem Definition

Í Í Í Í Query containment and equivalence Definition A query Q1 is said to be contained in a query Q2, denoted by Q1 Q2, if for all database instances D, the set of tuples computed for Q1 is a subset of those computed for Q2, i.e., Q1(D) Q2(D) ; The two queries are equivalent if Q1 Q2 and Q2 Q1 .

Equivalent rewritings Definition Let Q be a query, V = {V1, V2, …, Vm } be a set of view definitions. The query Q’ is an equivalent rewriting of Q using V if: • Q’ refers only to the views in V; • Q’ is equivalent to Q.A query Q1 is said to be contained in a query Q2,

Í Í Maximally-contained rewritings Definition Let Q be a query, V = {V1, V2, …, Vm } be a set of view definitions, and L be a query language. The query Q’ is maximally-contained rewriting of Q’ w.r.t. L if: • Q’ is a query in L that refers only to the views in V; • Q’ is contained in Q; • there is no rewriting Q1 L, such that Q’ Q1 Q, and Q1 is not equivalent to Q’.

Certain Answers • Problem: finding all the answers to a query given a set of views. • Not equivalent to maximally-contained rewriting because Maximal containment relies on languages. • Formalized by certain answers (Abiteboul et.al. 98) • A tuple α is a certain answer of Q w.r.t. a set of view definitions {Vi} and their extensions {vi}, if α is inQ(D) for any possible database instance D such that Vi (D) = vi (CWA) or Vi (D) vi (OWA) .

Conditions of view usability

View usability conditions For SPJ views to be usable in an equivalent rewriting of a SPJ query Q under bag semantics: • There is a mapping ψfrom occurrences of tables mentioned in the from clause of V to those mentioned in the from clause of Q, mapping every table name to itself. For bag semantics, ψmust be 1-1. • V must either apply the join and selection predicates in Q on the attributes on the attributes of the tables in the domain of ψ, or must apply to them a logical weaker selection, and select the attributes on which predicate still need to applied. • V must not project out any attributes of the tables in the domain of ψthat are needed in the selection of Q.

Using materialized view in query optimization

System-R style optimization

System-R style (cont.)

Queries with grouping and aggregation Example: View: create view V as select c-number, year, Max(evaluation) as maxeval, Count(*) as offerings from Teaches where c-number ≥ 400 group by c-number, year Query: select year, Count(*), Max(evaluation) from Teaches where c-number ≥ 500 group by year

Queries with grouping and aggregation (cont.) The query can be rewritten to: select year, sum(offering), Max(evaluation) From V where c-number ≥ 500 group by year Comment: • More limitations if grouping and aggregation are concerned. • Grouping in view must be finer than that in query. • Aggregations in query must be recoverable from the output fields and aggregations in the view.

Answering queries using views for data integration

Main approches • Using datalog query representation for both Q and V. • Algorithms: • Bucket algorithm (Levy et al. 96) • Inverse rules algorithm (Qian et al. 96 ) • MiniCon algorithm (Pottinger et al. 00)

Bucket algorithm • Create a bucket for each non-comparison subgoal g in Q: For each subgoal g’ in V, if there is a unifier θ for g and g’and the view, and after unification, • the comparison predicates in Q and V are simultaneously satisiable; • if a variable appears in head(Q) and subgoal g in the query, the corresponding variable in g’ alsoappears in head(V) in V, add θ(head(V)) into the bucket of g. • Find a set of conjunctive query rewritings, and each produces a conjunctive query including one conjunct from each bucket. It is a conjunctive rewriting if either • The conjunctive is contained in Q, or • It is possible to add atoms of comparison predicates such that the resulting conjunction is contained in Q.

Bucket algorithm example V1(student, c-number, quarter, title):- Registered(student, c-number, quarter), Course(c-number, title), c-number ≥ 500, quarter ≥ Aut98. V2(student, prof, c-number, quarter):- Registered(student, c-number, quarter), Teaches(prof, c-number, quarter) V3(student, c-number):- Registered(student, c-number, quarter), quarter ≤ Aut94. V4(prof, c-number, title, quarter):- Registered(student, c-number, quarter), Course(c-number, title), Teaches(prof, c-number, quarter), quarter ≤ Aut97.

Bucket algorithm example (cont.) Query: Q(S,C,P) :- Teaches(P,C,Q), Registered(S,C,Q), Course(C,T), C ≥ 300, Q ≥ Aut95. Bucket:

Bucket algorithm example (cont.) Result of rewriting: q’(S,C,P) :- V2(S’,P,C,Q), V1(S,C,Q,T’) q’(S,C,P) :- V4(P,C,T’,Q), V1(S,C,Q,T’), V4(P’,C,T,Q’) q’(S,C,P) :- V2(S,P,C,Q), V4(P,C,T’,Q) The second query is empty, so the result is the union of the first and the third conjunctive queries.

Bucket algorithm comments • Advantage • Prune significant number of query rewritings. • Return maximally-contained rewriting when the query does not have comparison. • Disadvantage • Cartesian product of buckets is still large • Testing query containment is costly and -complete.

Inverse-rules algorithm • Construct a set of rules that invert the view definitions. • Idea: each tuple in the head of view definition query is a witness of tuples in relations corresponding to subgoals in the body. • Assign one skelom function symbol for each existential variable in the view definition.

Inverse-rules algorithm example Example: View definition: V3(dept, c-number) :- Major(student, dept), Registered(student, c-number) Inverse rules: Major(f1(dept, X), dept) :- V3(dept, X) Registered(f1(Y, c-number), c-number) :- V3(Y, c-number)

Inverse-rule algorithm example (cont.) Query: q(dept) :- Major(student, dept), Registered(student, 444) V3 has tuples: {(CS, 444), (EE, 444), (CS 333)} Applying inverse rules: Registered: {(f1(CS, 444), CS), (f1(EE, 444), EE), (f1(CS, 333), CS)} Major: {(f1(CS, 444), 444), (f1(EE, 444), 444), (f1(CS, 333), 333)} Answer: {EE, CS}

Inverse-rule algorithm comments • Advantage • Simplicity and modularity • Return maximally-contained rewriting • Disadvantage • Keep more non-contributive views than bucket algorithm • Require recomputing the relations from the views. The reason to use precomputed materialized views is lost.

MiniCon algorithm • Improvement on bucket algorithm. • Aim to eliminate more views that are useless to the query. • When we find a unification between a subgoal g’ in V and a subgoal g in Q, all other subgoals that join with g in Q are examined. V must either have the join attribute in its head, or contain the corresponding joined subgoals in the body. • For each view, compute a MiniCon consisting all subgoals in the query the view contributes.

MiniCon example Example: q(D) :- Major(S, D), Registered(S, 444, Q), Advises(P, S) V1(dept) :- Major(student, dept), Registered(student, 444, quarter). V2(prof,dept,area) :- Advises(prof, student), Prof(name, area). V3(dep,c-number) :- Major(student, dept), Registered(student, c-number, quarter) Advises(prof, student) MiniCon(V1) = Φ, MiniCon(V2) = Φ, MiniCon(V3) = {Major, Registered, Advises}

Theoretical results(very selective)

Completeness • Question: given a query Q and a set of views V, will the algorithm find an equivalent rewriting of Q using V, when there one exists? • When a CQ has no comparison predicates and has n subgoals, there exists an equivalent conjunctive rewriting of Q using V only if there is a rewriting with at most n subgoals. The complexity is NP-hard. (Levy et al. 1995)

Recursive rewriting • Goal: when we apply maximally-contained rewriting, we can also get the set of all certain answers. • Recursive query rewriting is necessary when: • The query is recursive. • Database relations have functional dependencies. • There exist access pattern limitations on the views. • Views have unions. • Additional semantic information about class hierarchies on objects is expressed in DL.

Recursive rewriting (example with fd) Relation: schedule(Airline, Flight_no, Date, Pilot, Aircraft) FDs: Pilot -> Airline, Aircraft->Airline View: V(D,P,C) :- schedule(A, N, D, P, C) Query: Q(P) :- schedule(A, N, D, ‘mike’, C), schedule(A, N’, D’, P, C’) Rewriting: relevantPilot(‘mike’) relevantAircraft(C) :- v(D, ‘mike’, C) relevantAircraft(C) :- v(D, P, C), relevantPilot(P) relevantPilot(P) :- relevantPilot(P1), relevantAircraft(C), v(D1, P1, C), v(D2, P, C)

Finding certain answers • Open-world assumption: polynomial in most practical cases. NP-hard (in the size of view extensions) if unions are allowed in view definition or inequality predicates are allowed in query languages. • Close-world assumption: co-NP-hard even if both views and queries are CQs and have no comparison predicates. c.f. GAV: polynomial • In cases views can contain incorrect tuples : • assume no comparison predicates in views or query • If all views are complete or all views may have incorrect tuples: ploynomial in view ext. size • o.w.: co-NP-hard

Extensions

Leveraging Materialized Views for Query Optimization and Data Integration

Leveraging Materialized Views for Query Optimization and Data Integration

Presentation Transcript

Presentation Title Presentation Subtitle

Presentation Presentation

Presentation on Presentation

PRESENTATION TITLE Presentation Subtitle

Presentation Code Presentation Title

Virtualization for Adaptability Project Presentation CS848 Fall 2006

Presentation Title Presentation Subtitle

Presentation Title Presentation Subtitle

PRESENTATION TITLE PRESENTATION TITLE PRESENTATION TITLE

A Presentation on Presentation

Presentation title Presentation title

Presentation Name Presentation Date

CS848 Class Project: A Survey on QoS for Multi-tier Web Systems

Presentation Title Presentation by

Presentation

A Presentation on Presentation

Presentation Title Presentation subtitle