Exploiting Similarity for Query Optimization

COSC6421 Advanced Database Systems Instructor Jarek Gryz Exploiting Similarity of Subqueries for Complex Query Optimization[1] Presented by Qiong Wang York University

Outline • Introduction-Why exploiting similarity • Description of Technique • Experimental Result • Advantages and Limitation Discussion • Related work • Summary

Background Introduction • User queries become more and more complex, such as increasing complexity of the operations and the query structure. • Query optimizers are seeking an efficient query execution plan( QEP) in DBMS. But, there are some limitations • Limitations of two types of current algorithms determining a optimal join orders • Dynamic programming -> Optimal Solution but long optimization time • Heuristics Based Algorithm-> Good time complexity but suboptimal solution • Limitations of some other studies which can find a good plan • May not make use of some special characteristics of the underlying queries • Only exploiting the common sub-queries in a query

Technique Introduction • A new similarity-based optimization technique. • Takes the structural characteristics of a complex query into consideration • The key idea • Identify groups of similar sub-queries that often appear in a complex query • Share the optimization result within each group in the query

Technique Description • Preliminary • Exploiting Similarity of Sub-Query • Optimizing Query

Preliminary A query Q • T={R1,R2,R3,……,Rm} The set of Base Tables referenced in Q • P ={p1,p2,p3,……pn} The set of predicates referenced in Q • Table Instance - each table reference in Q

Preliminary (1) A Query Graph G • Vthe set of all vertices inG <=Table Instances For x ∈ V, R ∈ T,mappingδ(x) = R • E the set of all edges in G <= Predicates For e ∈ E, c ∈ 2P , e is labeled with ϕ(e) = c • If there is at least one predicate inP involving vertices x and y, then query graph G has an edge e between x and y • The set of all predicates involving x and y labeled on the edge e

Constructing Query graph e (w, w) e (x,y) x e (y,w) e (y,z) y w z An example query graph G (V,E,T,P,δφ) • Sizeof(x) Size of the table represented by x • Sel (e) Selectivity of ϕ(e), i.e., the selectivity of the conjunction of all the simple predicates in ϕ(e)

Preliminary • A data structure which represents a query graph • Every vertex x in query graph G has a node x • Node x has a set of adjacent nodes y1, y2, ..., yn • A ring : x → y1 → y2 → ... → yn → x • to represent such an adjacency relationship. • Node x is called the owner (node) of the ring, • Nodes y1, y2, ..., yn are called the members of the ring. (2) Ring NetWork

Preliminary Suppose we have two sub-query graphs : G’(V’ ,E|V’ , T|V’ , P|V’ , δ|V’ , ϕ|V’ ) G”(V”, E|V”, T|V”, P|V”, δ|V”, ϕ|V”) Rt Error bound for table size Rs Error bound for condition selectivities (3) Similar Sub-query Graphes

Constructing Query graph If G’ and G” satisfy the following conditions, they are regarded as a pair of similar sub-query graphs with respect to error bounds rt and rs, G’ ≈ |(rt,rs)G” Similar Sub-query Graphes (cont’d)

Example of a query graph with similar sub-queries

Exploiting Similarity of Sub-Queries Algorithm Design Idea • Identify pairs of similar sub-queries with respect to the given error bounds rt and rs • Optimize one sub-query in a similar pair, and map and apply the resulting execution plan to the other sub-query • Replace similar sub-queries with their (estimated) result tables in the query graph and optimize the resulting modified query.

R2(10K) R6(40K) R3(25K) R2(10K) 10 8 1 5 9 11 6 4 3 2 7 12 R3(25K) R1(80K) R2(10K) R7(70K) R5(60K) R4(12K) R8(20K) Exploiting Similarity of Sub-Queries Algorithm Explanation Assume: all self-loops are removed 1. Choosing Starting Nodes: (1) Construct ring network and similarity starting lists

Exploiting Similarity of Sub-Queries Similarity Starting List OL2 Each list is for one base table and has a header containing : -Base Table Name Ri, -Two sublists OLi –all its instances (nodes) SLi --other table instances whose sizes are within error bound rt with respect to the size of Ri SL2 Base Table R2

Choosing Starting Nodes (cont’d) (2) In order to find a pair of as large as possible similar subquery graphs, we should choose a pair of nodes with the maximum number of adjacent node pairs. • Because the larger the similar subquery graphs in a pair, the more the optimization work can be shared.

How to Choose Starting Nodes For any pair of potential starting nodes x and y selected from a similarity starting list, we have a formula: *m be the size of set T for query Q choosing a pair of starting nodes is to choose the pair that maximizes the value of formula (1).

How to Choose Starting Nodes Two indicator arrays O (occurrence) and S (similarity) For each table instance The lengths of arrays O and S are the size of T • Ox[i] indicates current adjacent nodes representing base table Ri • Sx[i] indicates current adjacent nodes whose table sizes are within the given error tolerance with respect to Ri excluding Ox[i].

How to Choose Starting Nodes (cont’d)

Searching for Similar Sub-query Graphs 2. For all the unselected nodes (x1 and y1) in the rings of x, y • Check if sizeof(x1) and sizeof(y1) are within error bound rt • Check if adding x1 and y1 into the current sub-query pair graph will violate the similarity of sub-queries or not • Add x1, y1 into sub-query pair graphs

Searching for Similar Sub-query Graphs (Cont’d) 3. Use two threshold values c1 and c2 to determine whether to accept, reject, or hold a pair of similar sub-query graphs. • Let n be the number of nodes in a similar subquery graph • (1) if n ≥ c2, then the new pair of similar subquery graphs is accepted; • (2) if c1 ≤ n < c2, then we put this pair on hold; • (3) if n < c1, then the new pair is rejected.

4. Remove all the nodes in the accepted graphs and choose another starting nodes and repeated the procedure until no pair of nodes can be expanded.

5. Optimizing Query • Apply an optimization algorithm to optimize one of the sub-query graphs in each pair. • Map the execution plan for one sub-query to the one for its partner • Example, G1 and G2 Nodes: x1 ↔ y1, x2 ↔ y2, x3 ↔ y3, By optimizing G1, we get such a plan: ((x1×x2) × x3). The mapped plan for G2 is ((y1×y2) ×y3).

Experimental Result Comparison of I/O costs for execution plans generated by two techniques For a set of randomly-generated test queries.

Effect of changing error bounds on I/O costs

Effect of changing error bounds on I/O costs • Very small error bounds cannot yield good performance, since the smaller the error bounds, the smaller the similar sub-queries; • Moderate error bounds (0.15 - 0.40) yield the best performance, since similar sub-queries with reasonable sizes can be found; • Large error bounds lead to poor performance, since sub-queries are less similar in such cases, which makes that sharing execution plans between them may not be appropriate; • Very large error bound (close to 1), the performance of the similarity-based technique stays at the same level. • The sizes of similar sub queries may reach their maximums • when the error bounds are beyond a certain limit.

Advantages and Limitation Discussion • Advantages: • Take the features of original structures of complex query into consideration and design a query graph • Exploiting the similarity of Sub-queries beyond common sub-query, reduce the optimization time by sharing the same mapped plan • Reduce the query complexity before apply any optimization algorithm. • Limitation: • According to the experimental result, the selectivity of similarity sub-queries depends on the error bound. • 2. The paper didn’t mention the application of this technique in multiple queries.

Related Work (1) The author also introduced a technique to optimize a complex query by exploiting common sub-queries recently. This technique is shown to be more effective than a pure randomization based method since it takes the structural characteristics of a complex query into consideration. (2) Some other Algorithms used to optimize large query: Iterative improvement (II) Simulated Annealing (SA) Tabu Search (TS) AB algorithm (AB) Generic Algorithm (GA)

Related Work (2) Related topic-Multiple-Query • A single query given to a system may result in multiple queries • Queries are given to the system from various users. Then batching all users’ requests is possibly required. • Some parts of the query are automatically generated by programs

Multiple Query Optimization • Major issue in multiple-query processing is the redundancy due to accessing the same data multiple times in different queries.[2] • Recognize the possibilities of sharing • Exploiting common sub-expressions and reduce evaluation cost • Reusing Materialized or intermediate results from other computation • Example: Volcano-SH and RU [3] • Modify the optimizer search strategy and find a globally optimal plan. Example: IE (Interleave Execution) and HA (Heuristic Algorithm) [2]

Summary • Described a new query optimization technique for complex query, “exploiting similarity of Sub-queries” • Algorithm Design Idea • Query Graph, Ring Network, Similar Sub-query Graph • Discussed experimental result, advantages and limitation • Related work Discussion

Reference [1] Yingying Tao, Qiang Zhu and Calisto Zuzarte. Exploiting Similarity of Subqueries for Complex Query Optimization, LNCS 2736, Jan 2003, Pages 747 - 759 [2] TIMOS K. SELLIS and C. Lin. Multiple Query Optimization, ACM TODS13(1) p23-52 (1988) [3] Prasan Roy, S. Seshadri, S. Sudarshan, Siddhesh Bhobe. Efficient and Extensible Algorithms for Multi Query Optimization, ACM 2000 1-58113-218-2/00/05 [4] Jamal R.Alsabbagh, Vijay V. Raghavan. Analysis of Common Subexpression Exploitation Models in Multiple-Query Processing

Thank You!

Exploiting Similarity for Query Optimization

Exploiting Similarity for Query Optimization

Presentation Transcript

Chapter 7 : SUBQUERIES

Query Processing

Correlated Subqueries in PROC SQL

Multi-Query Optimization

Subqueries

Semantic Web - Query Languages –

Materialized Views in Probabilistic Databases for Information Exchange and Query Optimization

Lecture 9: Query Optimization

CMPT 454

Query Optimization

Lecture 9 Query Optimization

Lecture 25: Query Optimization

Query Processing and Optimization

Chapter 7 Subqueries

AM18 ASA INTERNALS: QUERY EXECUTION AND OPTIMIZATION

Query Processing and Query Optimization

Subqueries

4 Exploiting Click Streams and Query Logs

Query Optimization

Software Specification KXA233 Lecture 5A Group Functions and Subqueries

SQL

Chapter 14: Query Optimization