1 / 70

Query Processing in the Presence of Limited Source Capabilities

Query Processing in the Presence of Limited Source Capabilities. Chen Li Information and Computer Science UC Irvine. Information integration. Biblio sever. Legacy database. Plain text files. Support seamless access to autonomous and heterogeneous information sources.

lester
Download Presentation

Query Processing in the Presence of Limited Source Capabilities

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Query Processing in the Presence of Limited Source Capabilities Chen Li Information and Computer Science UC Irvine

  2. Information integration Biblio sever Legacy database Plain text files Support seamless access to autonomous and heterogeneous information sources.

  3. Mediation architecture Mediator Wrapper Wrapper Wrapper Source n Source 1 Source 2

  4. title author Ullman DBSI TeX Knuth … … Get all books. Limited Source Capabilities • support “select * from R” queries Traditional DBs

  5. title author Given an author, return the books. Ullman DBSI TeX Knuth … … Limited source capabilities (cont) • However, in many environments, complete scans of relations may not be possible • Reasons: • Legacy databases or structured files: limited interfaces • Security/Privacy • Performance concerns

  6. Example: Web search forms www.imdb.com

  7. Research summary Some of my work on query processing in the presence of limited source capabilities: • Generating a feasible plan for a query: demo SIGMOD'1998. • Optimizing large-join queries: ICDT’1999. • Describing source capabilities and computing capabilities of mediators: SIGMOD'1999. • Computing maximal answers to a query by borrowing information from sources not in the query : ICDE'2000, TODS 2001. • Deciding whether all the answers to a query can be computed and testing relative query containment: ICDT'2001, Journal of VLDB. • Other work: e.g., answering queries using views with binding patterns [RSU95], query rewriting for semi-structured data [PV99], etc.

  8. Describing source capabilities Attribute adornments: f: free b: bound u: unspecified o[S]: optional, if chosen, must be from a list S of constants c[S]: chosen from a list S of constants A search form is represented as multiple templates: (Title, Author, ISBN, Format, Subject) u u b u u b f u u u f b u u u o[] u u o[] o[] [SIGMOD’99]

  9. Binding patterns • Common source limitations • Attributes with adornments: • b: bound • f: free • Example: R(Author, Title) • Given an author, return the books. • R(Authorb, Titlef) • A relation can have multiple binding patterns.

  10. Rest of the talk • Part I: Optimizing large-join queries • Part II: Deciding whether all the answers to a query can be computed

  11. Part I: Optimizing large-join queries Given a query Q on relations with restrictions: • Can we answer Q? • I.e., find an executable plan to answer the query while observing the source access patterns • How to answer Q efficiently?

  12. T(Movie,Year) 1996 The Rock The Matrix 1999 … … Reeves The Matrix Connery The Rock … … Example: three movie sources S(Movie,Studio) R(Star, Movie) Warner Bros The Matrix DreamWorks American Beauty … …

  13. warner reeves 1999 Query Q “Find the movies made by Warner Bros. in 1999 and in which Keanu Reeves starred?” R(Star, Movie) S(Movie,Studio) T(Movie,Year) SQL: SELECT Movie FROM R JOIN S JOIN T WHERE Star = ’reeves’ AND Studio = ’warner’ AND Year = 1999;

  14. Movie Year=1999 Studio=warner Star=reeves Answer Q Plan P0 R(Star, Movie) S(Movie,Studio) T(Movie,Year)

  15. What if limited source capabilities? • R(Starb, Movief): requires a Star name • S(Movieb,Studiof): requires a Movie title • T(Movieb,Yearf): requires a Movie title

  16. Star=reeves Year=1999 Studio=warner Does P0 work? No! Since S and T do not support the queries. R(Starb, Movief) S(Movieb,Studiof) T(Movieb,Yearf)

  17. Reeves movies by warner Reeves movies Reeves movies by warner of 1999 Star=reeves A feasible plan P1 R(Starb, Movief) T(Movieb,Yearf) S(Movieb,Studiof)

  18. Reeves movies Reeves movies of 1999 by warner Reeves movies of 1999 Star=reeves Another feasible plan P2 R(Starb, Movief) T(Movieb,Yearf) S(Movieb,Studiof)

  19. Question 1: query answerability • Given Q on relations with restrictions, can we process its conditions by accessing relations with legal patterns?

  20. Consider SPJ Queries • Select-project-join queries (conjunctive queries): q(X) :- g1(X1),…, gn(Xn) • subgoal gi(Xi): gi is a relation, Xi is a tuple of variables/constants • Example: SELECT Movie FROM R JOIN S JOIN T WHERE Star = ’reeves’ AND Studio = ’warner’ AND Year = 1999; q(M) :- R(reeves,M),S(M,warner),T(M,1999)

  21. M M M reeves warner 1999 Bound positions: B = {reeves, warner, 1999} Algorithm “Inflationary”:Testing answerability of Q R(Starb, Movief), S(Movieb,Studiof), T(Movieb,Yearf) q(M) :- R(reeves,M),S(M,warner),T(M,1999) Check what subgoals can be processed given B More positions become bound: add {M} to B More subgoals can be processed All subgoals are answerable, so Q is answerable

  22. reeves movies by warner reeves movies reeves movies by warner of 1999 Star=reeves 4 12 1 Question 2: generate efficient plans? Consider number of source accesses. Plan P1 R(Starb, Movief) T(Movieb,Yearf) S(Movieb,Studiof) Number of source accesses: 1 + 12 + 4 = 17

  23. reeves movies reeves movies by warner of 1999 reeves movies of 1999 star=reeves 12 1 1 Cost of plan P2 Plan P2 R(Starb, Movief) T(Movieb,Yearf) S(Movieb,Studiof) Number of source accesses: 1 + 12 + 1 = 14

  24. How to generate efficient plans? • Challenges: • Often source statistics hard to get • Search space different from a traditional optimizer (e.g., System-R) • Ordering subgoals • Considering left-deep trees versus bushy trees • Deciding join methods (e.g., hash join, nested-loop join) • Need to consider feasible plans! • Cost model: total number of source accesses • Reason: each source access is expensive! • network traffic/delay, dynamic source availability, source charges • Results extendable to more general cost models, e.g.:

  25. U T S R S T U R Left-deep trees versus bushy trees Result: Left-deep trees guarantee to include an optimal plan. Left-deep tree Bushy tree

  26. Complexity • The problem of finding the optimal feasible plans is NP-hard • Proof: by reduction from the Vertex Cover Problem • Since the number of subgoals could be large, we want approximation algorithms for finding near-optimalplans quickly

  27. Case 1: no source statistics • Algorithm CHAIN: • Greedy approach • At each step, find a subgoal with the lowest cost

  28. CHAIN Collect source information as we process subgoals. R(Studiob, Movief ,Starf) S(Movieb,Yearf) Movie Star T(Starb,Addrf) Shrek Murphy Diaz Shrek Choose S next! Reeves Matrix 2 movies 3 stars

  29. CHAIN: Properties • Does not need source statistics • Polynomial time: O(n2), n is the number of subgoals • Only needs results returned from the sources • It is n-competitive: Cost(PLANchain) <= n * Cost(PLANopt)

  30. Case 2: source statistics available • Algorithm: PARTITION • Grouping approach: group subgoals into clusters • Find an optimal subplan within each cluster • Combine subplans to construct a plan

  31. answerable subgoals given initial constants in Q optimal subplan p1 group 1 new bound args complete plan optimal subplan p2 group 2 new answerable subgoals … new bound args optimal subplan pk group k new answerable subgoals PARTITION

  32. PARTITION: Properties • Need source statistics • Guarantees an optimal plan if 1 or 2 clusters • No bound when missing optimal plans

  33. Performance analysis • Test bed: • 15 source relations; number of subgoals: 1 – 10 • For each number of subgoals, ran 1000 random queries • Other factors considered: • number of binding patterns for a relation • number of constants in a query • cardinalities of attributes • Results: • Both algorithms generate good plans • PARTITION takes more time than CHAIN • PARTITION generates better plans than CHAIN

  34. Probability of missing optimal plans • Average probability of missing optimal plans: • CHAIN <= 25% • PARTITION <= 5%

  35. Difference from the optimal plans • Average difference: • CHAIN < 5% • PARTITION < 2%

  36. Extending to other cost models • Different sources have different costs: • CHAIN is still n-competitive • PARTITION still finds an optimal plan if the number of groups is <= 2 • Consider size of data transferred: • CHAIN is still n-competitive • PARTITION may miss the optimal plans even if the number of groups is <= 2

  37. Part II Given a query Q on relations with restrictions: Deciding whether all the answers to a query can be computed

  38. Motivation • Take Conjunctive queries (CQ’s) as an example • If the Inflationary algorithm terminates while some subgoals are not answerable, can we say there is no way to compute Q’s answers? • No!

  39. Harrison Ford Air Force One On Golden Pond Oscar, Best Actor Henry Fonda On Golden Pond On Golden Pond Oscar, Best Actress Kevin Spacey American Beauty American Beauty Oscar, Best Picture … … … … Example: A movie database Q(Award) :- r(henry fonda,Movie), s(Movie,Award) r(Star, Movie) s(Movie, Award)

  40. Harrison Ford Air Force One On Golden Pond Oscar, Best Actor Henry Fonda On Golden Pond On Golden Pond Oscar, Best Actress Kevin Spacey American Beauty American Beauty Oscar, Best Picture … … … … Limited access patterns r(Starb, Movief) Should provide a star. Should provide a movie. s(Movieb, Awardf)

  41. Harrison Ford Air Force One On Golden Pond Oscar, Best Actor Henry Fonda On Golden Pond On Golden Pond Oscar, Best Actress Kevin Spacey American Beauty American Beauty Oscar, Best Picture … … … … Answering Q given the restrictions Q(Award) :- r(henry fonda,Movie), s(Movie,Award) s(Movieb, Awardf) r(Starb, Movief)

  42. Harrison Ford Air Force One On Golden Pond Oscar, Best Actor Henry Fonda On Golden Pond On Golden Pond Oscar, Best Actress Kevin Spacey American Beauty American Beauty Oscar, Best Picture … … … … The answer is complete • We did not retrieve all the tuples from the relations. • Still we computed all tuples in the answer to the query. Q(Award) :- r(henry fonda,Movie), s(Movie,Award) s(Movieb, Awardf) r(Starb, Movief)

  43. Run Inflationary algorithm • Every subgoal would be answerable Q(Award) :- r(henry fonda,Movie), s(Movie,Award) r(Starb, Movief) s(Movieb, Awardf)

  44. How about Q’? Q’(Award) :- r(henry fonda,Movie), s(Movie,Award),r(Star,Movie) • Inflationary will find that subgoal r(start,Movie) is not answerable • variable Star cannot be bound • Can we say Q’ cannot be answered? • No! • It is essentially equivalent to the old query Q • Thus we can answer Q’ by answering Q!

  45. Observations of binding patterns If a relation does not have an “all-free” binding pattern, then after certain queries are sent to this relation, there can always be some tuples that have not been retrieved.

  46. General questions • Given a query on relations with limited access patterns, can we compute its complete answer by accessing the relations with legal patterns? • If so, called “Stable” queries • The example shows that the solution is more than running the Inflationary algorithm

  47. General questions (cont) • Given a query Q, if Inflationary claims that some subgoals are not answerable, how do we know whether there is another equivalent query Q’, such that Inflationary “succeeds” on Q’? • Notice that there are infinite number of equivalent queries of Q • Furthermore, even if Inflationary were able to say that all these equivalent queries have some unanswerable subgoal, how do we know if there isn’t any “magical” plan that can compute all the answers to Q?

  48. Query stability • A query Q on relations with binding patterns is stable if for any database, we can compute its complete answer by accessing the relations with legal patterns. • The complete answer is the computable answer if we could retrieve all the tuples from the relations. • Use partial tuples to derive the complete answer: we need reasoning!

  49. Feasible CQ’s • A CQ is feasible if it has a feasible (i.e., executable or answerable) order of all its subgoals. • Lemma: A feasible CQ is stable. • Testing feasibility of a CQ: Inflationary algorithm

  50. What if Q is not feasible? Our example shows that an infeasible query could still be stable

More Related