Distributed Database Management Systems

Distributed Database Management Systems Lecture 33

In the previous lecture • Final phase of QD • Data Localization: for HF, VF and DF.

In today’s Lecture • Data Localization for Hybrid Fragmentation • Query Optimization.

Reduction for HyF • HyF contains both types of Fragmentations • EMP1=eNo ≤ E4 (eNo, eName (EMP)) • EMP2=eNo > E4 (eNo, eName (EMP)) • EMP3=eNo, title (EMP).

eName eName eNo = E5 eNo = E5 ⋈eNo U EMP2 Reduced Query- EMP3 EMP1 EMP2 • Select eName from EMP where eNo = E5

Summary of what we have done so far

Query Decomposition: generates an efficient query in relational algebra • Normalization, Analysis, Simplification, Rewriting • Data Localization: applies global query to fragments; increases optimization level-

So, next is the cost-based optimization • Mainly concentrates on the order of performing joins • Characteristics of relations like cardinalities are considered

First QO in general • QO refers to producing a Query Execution plan (QEP) that represents execution strategy.

Components of Optimizer • Search Space: set of eq. alternative exec plans • Cost Model: predicts cost of a execution plan • Search Strategy: produces best plan

Search Space • Search space consists of eq. Query Trees produced using Tr Rules • Optimizer concentrates on join trees, since join cost is the most effective

Example: • Select eName, resp From EMP, ASG, PROJ where EMP.eNo = ASG. eNo and ASG.pNo = PROJ.pNo.

⋈pNo ⋈eNo ⋈eNo ⋈pNo PROJ EMP EMP ASG PROJ ASG ⋈pNo, eNo x ASG EMP PROJ

Alternatives with N relations are O(N!) based on properties of relations • So, restrictions are applied

1- Heuristics - Selection and projection on base relations - Avoid Cartesian product

2- Shape of Tree • Linear Tree: At least one node for each operand is a base relation • Bushy tree: May have operators with interm tables only; allows parallel execution

Search Strategy • Most popular is Dynamic Programming • That starts with base relations and keeps on adding relations calculating cost

DP is almost exhaustive so produces best plan • Too expensive with more than 5 relations • Other option is Randomized strategy • Do not guarantee best

Cost Model • Cost of operators, statistics of base data to predict size of intermediate tables • Cost considered as Total Time and Response Time.

Total time = CPU time + I/O time + tr time • In WAN, major cost is tr time • Initially ratios were 20:1 for tr and I/O, for LAN it is 1:1.6

Response time = CPU time + I/O time + tr time • Difference.?

TCPU = time for a CPU inst • TI/O = a disk I/O • TMSG = fixed time for initiating and recv a msg • TTR = transmit a data unit from one site to another.

X units Site 1 Site 3 Site 2 Y units • TT = 2TMSG + TTR*(x+y) • RT = max{TMSG + TTR*X, TMSG + TTR*Y}

Database Statistics • Major factor is interm tabs • If the interm results are to be transmitted, then estimation about size is a must • More precise statistics cost more

For each relation R[A1, A2, …, An] fragmented as R1, …, Rr • length of each attribute: length(Ai) • the number of distinct values for each attribute in each fragment: card(Ai(Rj)) • maximum and minimum values in the domain of each attribute: min(Ai), max(Ai).

4.The cardinalities of each domain: card(dom[Ai]) and the cardinalities of each fragment: card(Rj) 5.Join selectivity factor for some of the relationsSFJ (R,S) = card(R ⋈ S)/ (card(R) ∗ card(S))-

Cardinalities of Intermediate Results

Selection Operation • Card(F(R))=SFS(F) * card(R) • SFS(A = value) = 1/card(A(R)) • SFS(A > value) = max(A) – value /(max(A) – min(A)) • SFS(A < value) = value - min(A) /(max(A) – min(A))

SFS(A < value) = max(A) – value /(max(A) – min(A)) • SFS(p(Ai) ^ p(Aj)) = SFS(p(Ai)) * (SFSp(Aj)) • SFS(p(Ai) v p(Aj)) = SFS(p(Ai)) + SFS(p(Aj))–(SFS(p(Ai))* SFS(p(Ai))).

Cardinality of Projection • Hard to determine precisely • Two cases when it is trivial 1- When a single attribute A, card(A(R)) = card (A) 2- When PK is included card(A(R)) = card (R)

Cartesian Product • card(RxS) = card (R) * card(S). • Cardinality of Join • No general way to test without additional information • In case of PK/FK combination Card(R ⋈ S) = card (S)

Semi Join: SFSJ(R ⋉AS)= card(A(S))/ card(dom[A]) card(R ⋉AS) = SFSJ(S.A) * card(R).

Union: Hard to estimate • Limits possible which are card(R) + card(S) and max{card (R) + card (S)) • Difference: Like Union, card (R) for (R-S), and 0

Centralized Query Optimization

Why to Study • Distributed Query is transformed into local ones • Issues are related and more complex in DD • Easier to understand • Two famous ones.

INGRES • Dynamic • Recursively breaks into smaller ones • System R • static • exhaustive search

Distributed Database Management Systems