Create Presentation
Download Presentation

Download Presentation

Distributed Query Optimization Algorithms

Download Presentation
## Distributed Query Optimization Algorithms

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Distributed Query Optimization Algorithms**• System R and R* • Hill Climbing and SDD-1**System R (Centralized) Algorithm**• Simple (one relation) queries are executed according to the best access path. • Execute joins • Determine the possible ordering of joins • Determine the cost of each ordering • Choose the join ordering with the minimal cost • For joins, two join methods are considered: • Nested loops • Merge join**System R Algorithm -- Example**Names of employees working on the CAD/CAM project • Assume • EMP has an index on ENO, • ASG has an index on PNO, • PROJ has an index on PNO and an index on PNAME**System R Algorithm -- Example**• Choose the best access paths to each relation • EMP: sequential scan (no selection on EMP) • ASG: sequential scan (no selection on ASG) • PROJ: index on PNAME (there is a selection on PROJ based on PNAME) • Determine the best join ordering • EMP ASG PROJ • ASG PROJ EMP • PROJ ASG EMP • ASG EMP PROJ • EMP PROJ ASG • PROJ EMP ASG • Select the best ordering based on the join costs evaluated according to the two methods**EMP ASG**ASG PROJ PROJ ASG (PROJ ASG) EMP (PROJ ASG) EMP (ASG EMP) PROJ (ASG EMP) PROJ System R Example (cont'd) • Best total join order is one of PROJ EMP ASG EMP × PROJ ASG EMP PROJ × EMP**System R Algorithm**• (PROJ ASG) EMP has a useful index on the select attribute and direct access to the join attributes of ASG and EMP. • Final plan: • select PROJ using index on PNAME • then join with ASG using index on PNO • then join with EMP using index on ENO**System R* Distributed Query Optimization**• Total-cost minimization. Cost function includes local processing as well as transmission. • Algorithm • For each relation in query tree find the best access path • For the join of n relations find the optimal join order strategy • each local site optimizes the local query processing**Data Transfer Strategies**• Ship-whole. entire relation is shipped and stored as temporary relation. If merge join algorithm is used, no need for temporary storage, and can be done in pipeline mode • Fetch-as-needed. this method is equivalent to semijoin of the inner relation with the outer relation tuple**Join Strategy 1**• External relation R with internal relation S, let LC be local processing cost, CC be data transfer cost, let average number of tuples of S that match one tuple of R be s • Strategy 1. Ship the entire outer relation to the site of internal relation TC = LC(get R) + CC(size(R)) + LC(get s tuples from S)*card(R)**Join Strategy 2**• Ship the entire inner relation to the site of the outer relation TC = LC(get S) + CC(size(S)) + LC(store S) + LC(get R) + LC(get s tuples from S)*card(R)**Join Strategy 3**• Fetch tuples of the inner relation for each tuple of the outer relation TC = LC(get R) + CC(len(A)) * card(R) + LC(get s tuples from S) * card(R) + CC(s*len(S))*card(R)**Join Strategy 4**• Move both relations to 3rd site and join there TC = LC(get R) + LC(get S) + CC(size(S)) + LC(store S) + CC(size(R)) + LC(get s tuples from S)*card(R) • Conceptually, the algorithm does an exhaustive search among all alternatives and selects one that minimizes total cost**Hill Climbing Algorithm - Algorithm**Inputs query graph, locations of relations, and relation statistics Initial solution the least costly among all when the relations are sent to a candidate result site denoted by ES0, and the site as chosen site Splits ES0 into ES1: ship one relation of join to the site of other relation ES2: these two relations are joined locally and the result is transmitted to the chosen site If cost(ES1) + cost(ES2) + LC > cost (ES0) select ES0, else select ES1 and ES2. The process can be recursively applied to ES1 and ES2 till no more benefit occurs**ES0**Cost = 13 Site1 EMP(8) Site2 PAY(4) 8 4 ENO Site3 PROJ(1) Site4 ASG(10) 1 PNO Hill Climbing Algorithm - Example SAL TITLE PAY EMP PNAME=“CAD/CAM” ASG Ignore the local processing cost Length of tuples is 1 for all relation PROJ**ES0**Cost = 13 Site1 EMP(8) Site2 PAY(4) 8 4 Site3 PROJ(1) Site4 ASG(10) 1 Site1 EMP(8) ES1 HCA - Example ? Site2 PAY(4) ES2 Solution 1 Cost = TITLE ? Site4 ASG(10) Site3 PROJ(1) ? ES3 Site1 EMP(8) ES1 Site2 PAY(4) ESo is the “BEST” ES2 Solution 2 Cost = Site4 ASG(10) ES3 Site3 PROJ(1)**Hill Climbing Algorithm - Comments**• Greedy algorithm: determines an initial feasible solution and iteratively tries to improve it. • If there are local minimas, it may not find the global minima • If the optimal solution has a high initial cost, it won’t be found since it won’t be chosen as the initial feasible solution. Site1 EMP(8) Site2 PAY(4) Site4 ASG(10) Site3 PROJ(1) COST =**SDD-1 Algorithm**• SDD-1 algorithm generalized the hill-climbing algorithm to determine ordering of beneficial semijoins; and uses statistics on the database, called database profiles. • Cost of semijoin: Cost (R SJA S) = CMSG + CTR*size(A(S)) • Benefit is the cost of transferring irrelevant tuple Benefit(R SJA S) = (1-SFSJ(S.A)) * size(R) * CTR • A semijoin is beneficial if cost < benefit.**SDD-1: The Algorithm**• initialization phase generates all beneficial semijoins, and an execution strategy that includes only local processing • most beneficial semijoin is selected; statistics are modified and new beneficial semijoins are selected • the above step is done until no more beneficial joins are left • assembly site selection to perform local operations • postoptimization removes unnecessary semijoins**SDD1 - Example**SELECT * FROM EMP, ASG, PROJ WHERE EMP.ENO = ASG.ENO AND ASG.PNO = PROJ.PNO Site 2 ASG ENO PNO Site 1 EMP Site 3 PROJ**SDD1 - First Iteration**• SJ1: ASG SJ EMP benefit = (1-0.3)*3000 = 2100; cost = 120 • SJ2: ASG SJ PROJ benefit = (1-0.4)*3000 = 1800 cost = 200 • SJ3: EMP SJ ASG benefit = (1-0.8)*1500 = 300; cost = 400 • SJ4: PROJ SJ ASG benefit = 0; cost = 400 • SJ1 is selected • ASG size is reduced to 3000*0.3=900 ASG’ = ASG SJ EMP • Semijoin selectivity factor is reduced; it is approximated by SFSJ(G’.ENO)= 0.8*0.3 = 0.24, SFSJ(G’PNO)=1.0*0.3 =0.3, size(G’.ENO)= 400*0.3=120, size(G’.PNO) = 120**SDD-1 - Second & Third Iterations**Second iteration • SJ2: ASG’ SJ PROJ benefit=(1-0.4)*900=540 cost=200; • SJ3: EMP SJ ASG’; benefit=(1-0.24)*1500=1140 cost=120 • SJ4: PROJ SJ ASG’, benefit=(1-0.3)*2000=1400cost=120 • SJ4 is selected PROJ’ = PROJ SJ ASG’ size(PROJ’) = 2000*0.3 = 600SFSJ(J’)=0.4*0.3=0.12size(J’.PNO)=200*0.3=60 Third Iteration • SJ2: ASG’ SJ PROJ benefit=(1-0.12)*900=792 cost=60; • SJ3: EMP SJ ASG’; benefit=(1-0.24)*1500=1140 cost=120 • SJ3 is selected reduces size of E to 1500*0.24=360 • Finally SJ2 is selected, with size of G as 108**Local Optimization**• Each site optimizes the plan to be executed at the site • A centralized query optimization problem**SDD-1 - Assembly Site Selection**• After reduction EMP is at site 1 with size 360 ASG is at site 2 with size 108 PROJ is at site 3 with size 600 • Site 3 is chosen as assembly site • SJ4 is removed in post optimization. Site1 EMP Site2 ASG Site3 PROJ (ASG SJ EMP) SJ PROJ site 3 (EMP SJ ASG) site 3 join at site 3