Download
efficient detection of empty result queries n.
Skip this Video
Loading SlideShow in 5 Seconds..
Efficient Detection of Empty Result Queries PowerPoint Presentation
Download Presentation
Efficient Detection of Empty Result Queries

Efficient Detection of Empty Result Queries

130 Views Download Presentation
Download Presentation

Efficient Detection of Empty Result Queries

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Efficient Detection of Empty Result Queries Gang Luo IBM T.J. Watson Research Center luog@us.ibm.com

  2. Empty Result Problem • Query returns an empty result set • User gets lost about where to look at next • Frequently encountered in interactive exploration of massive data sets • Our contribution: method for quickly detecting empty result sets

  3. Example Percentages of Empty Result Queries • In a Customer Relationship Management (CRM) application developed by IBM • 18.07% (3,396 empty result queries in 18,793 queries) • In a real estate application developed by IBM • 5.75% • In a digital library application [JCM+00] • 10.53% • In a bioinformatics application [RCP+98] • 38%

  4. Empty Result Queries May Not Finish Execution Quickly • Consider a query joining two relations • Query execution time is longer than join time, no matter whether or not query result set is empty • Even if a query finishes in a few seconds in a lightly loaded RDBMS, it may last longer than one minute in a heavily loaded RDBMS

  5. Outline • Limitations of previous approaches • Fast detection method for empty result queries • Some experiments

  6. Existing Solutions to the Empty Result Problem • Explain what leads to the empty result set • Automatically generalize the query so that the generalized query will return some answers

  7. Limitations of Existing Solutions • Require domain specific knowledge • Only apply to a restricted form of queries • Require an excessive amount of time • Give too many reasons why the result set is empty • Users cannot reuse each other’s query results

  8. Outline • Limitations of previous approaches • Fast detection method for empty result queries • Some experiments

  9. Our Solution • Only consider read-only environment • From previous queries’ execution, remember the query parts that lead to empty result sets • When a new query Q comes, match it with the remembered query parts. If such a match exists, report that Q will return an empty result set without executing Q • Utilize special properties of empty result sets and thus often more powerful than traditional materialized view method

  10. Definitions • Empty result propagating operator: An operator whose output is empty if any input is empty • Empty result propagating query: A query whose query plan only contains empty result propagating operators (our focus) • Query part: A sub-tree of a query plan • Atomic query part: An ordered pair (relation names RN, selection condition SC) • Corresponds to a relational algebra formula: first product join all relations in RN, then apply SC • SC is a conjunction of primitive terms, where each primitive term is a comparison

  11. Definitions – Cont. • Cover: Atomic query part P1=(RN1, SC1)covers atomic query part P2=(RN2, SC2) if • RN1RN2 • Whenever SC2 is true, SC1 is true • Property: Suppose atomic query part P1 covers atomic query part P2. For a given database, if the output of P1 is empty, the output of P2 is also empty.

  12.  [0] sort-merge join B.g=C.h [0] sort [1000] sort [0] C.f<300[1000]  [0] C (table-scan) [20000] hash join A.c=B.d [0] hash [200] hash [5000] 50<A.a<100  A.b=200[200]  [5000] A (table-scan) [40000] B (index-scan) B.e<40  B.e=50 [5000] Given an Empty Result Query • Find the lowest-level query part P whose output is empty

  13. Transforming P into a Simplified Query Part Ps • Drop all operators (e.g., projection, hash, sort) that have no influence on the emptiness of the output • Replace each physical join operator with a logical join operator • Replace each index-scan operator with a table-scan operator followed by a selection operator, where the selection condition is the index-scan condition

  14. ⋈A.c=B.d B.e<40  B.e=50 50<A.a<100  A.b=200 B (table-scan) A (table-scan) Transforming P into a Simplified Query Part Ps – Cont. • Corresponding relational algebra formula • (50<A.a<100  A.b=200 (A)) ⋈A.c=B.d(B.e<40  B.e=50(B))

  15. Breaking Ps into Atomic Query Parts • Get all selection conditions in the selection/join operators • Rewrite the conjunction of these selection conditions into a disjunctive normal form (DNF) • Negations on numeric or string attributes are removed using complementary operators • Interval-based comparison is treated as a single primitive term • Generate a set of atomic query parts (RN, SC) • RN: input relations of all table-scan operators in Ps • SC: a term in the DNF

  16. Breaking Ps into Atomic Query Parts – Cont. • Property: The following three assertions are equivalent to each other: • The output of the query part P is empty • The output of the simplified query part Ps is empty • The output of each generated atomic query part is empty (50<A.a<100 (A)) ⋈A.c=B.d(B.e<40(B)) (A.b=200 (A)) ⋈A.c=B.d(B.e<40(B)) (50<A.a<100 (A)) ⋈A.c=B.d(B.e=50(B)) (A.b=200 (A)) ⋈A.c=B.d(B.e=50(B))

  17. Storing the Generated Atomic Query Parts • For each generated atomic query part Pa • Insert Pa into a collection Caqp of atomic query parts • Remove from Caqp all previously stored atomic query parts that are covered by Pa • See paper for details of the coverage checking algorithm

  18. When Getting a New Query Q • Break Q into a set of atomic query parts • For each such atomic query part Pa, check whether some atomic query part Ai in Caqp covers Pa • If such an Ai exists for each Pa, report that Q will return an empty result set without executing Q

  19. Outline • Limitations of previous approaches • Fast detection method for empty result queries • Some experiments

  20. Setup • Testing environment • PostgreSQL 7.3.4 • Windows XP OS • Dell Inspiron 8500 PC with one 2.2GHz CPU, 512MB memory, one 40GB disk • TPC-R benchmark • See paper for detection probability analysis

  21. Overhead Experiment • Query Q1: Find the information about certain parts that were sold on certain days select * from orders o, lineitem lwhere o.orderkey=l.orderkey and (o.orderdate=d1 or … or o.orderdate=de) and (l.partkey=p1 or … or l.partkey=pf);

  22. Overhead Experiment – Cont. • Query Q2: Find the information about certain parts that were sold to certain customers on certain days select * from orders o, lineitem l, customer cwhere o.orderkey=l.orderkey and o.custkey=c.custkey and (o.orderdate=d1 or … or o.orderdate=de) and (l.partkey=p1 or … or l.partkey=pf) and (c.nationkey=n1 or … or c.nationkey=ng);

  23. Overhead Experiment – Cont. • The overhead of our method increases with both query complexity and the number of atomic query parts stored in Caqp • When check fails, the overhead of our method is higher than that when check succeeds

  24. Overhead Experiment – Cont. • The overhead of our method is trivial compared to query execution overhead

  25. Summary • Provide a fast detection method for empty result queries • Low overhead • High detection probability once enough information has been accumulated

  26. Open Issues • In the presence of update, correctly preserve as much stored information as possible • A hybrid method that can combine the advantages of both our method and the existing solutions • More aggressive storage saving technique