
Adaptively Processing Remote Data

Zachary G. Ives

University of Pennsylvania

CIS 650 – Database & Information Systems

February 28, 2005

Administrivia

Next reading assignment:

  • Doan et al. – LSD
  • Recall that the midterm will be due 3/16
    • You can go ahead and choose a topic – let me know which one
Sources of Query Answering Cost
  • Regular computation (this has a minimum cost)
  • But we can get held up by:
    • Inflexible query plans
      • Delays – we “stall” waiting for I/O
        • Query scrambling
        • Pipelined hash joins
    • Bad query plans
      • Bad estimation of intermediate result sizes
        • The focus of the Kabra and DeWitt paper
      • Insufficient source information
        • Eddies and ADP
        • Exploration vs. exploitation; extrapolating performance
Kabra and DeWitt Recap
  • Provides significant potential for improvement without adding much overhead
    • Biased towards exploitation, with very limited information-gathering
  • A great way to retrofit an existing system
    • In SIGMOD 2004, IBM published a paper that did this in DB2
  • But not appropriate for remote data
    • Relies on us knowing the (rough) cardinalities of the sources
    • Query plans aren’t pipelined
A Second Issue: Delays
  • May have a very computationally inexpensive plan, but slow sources
    • The query plan might get held up waiting for data
  • Solution 1: query scrambling
    • Rescheduling – while delayed, find a non-executing part of the query plan and start it
    • Operator synthesis – when nothing can be rescheduled, might tinker with the original plan
    • Want to do this in a cost-based way
Cost-Based Query Scrambling
  • Divide plan into runnable subtrees based on schedule; schedule those with 75% efficiency
    • Each may be run out of order if it materializes
    • Cost = mat. write + processing + mat. read
    • Efficiency = (M – MR) / (P + MW): savings / cost
  • When no runnable subtrees remain, we need to do something else
    • Operator synthesis: try to find a computation that will mask the delay – but how much work should we do?
    • PAIR: only do a single join over a pair of relations
    • IN: include delay, chooses to defer delayed access to end
    • ED: estimated delay, progressively increases the delay
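The slide's efficiency metric can be made concrete with a small Python sketch; the subtree names and cost numbers below are invented for illustration:

```python
def efficiency(m, mr, p, mw):
    """Efficiency = (M - MR) / (P + MW): savings over cost, where P is the
    subtree's processing cost, MW/MR are the materialization write/read
    costs, and M is the materialization cost the early run saves."""
    return (m - mr) / (p + mw)

# Rank runnable subtrees; while stalled, schedule the most efficient first.
subtrees = {
    "scan_R_join_S": dict(m=100, mr=20, p=40, mw=25),   # eff = 80/65
    "scan_T":        dict(m=30,  mr=10, p=15, mw=10),   # eff = 20/25
}
ranked = sorted(subtrees, key=lambda s: efficiency(**subtrees[s]), reverse=True)
```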
Overall Assessment
  • Hard to estimate how much work to do!
  • A trade-off between aggressive and conservative strategies
    • If enough resources, it doesn’t matter…
Solution 2: Pipelined Hash Joins
  • They can dynamically adapt to delays, alleviating the need for techniques like query scrambling (except between different pipeline stages)
  • Disadvantages:
    • Memory – what happens if we run out?
      • Focus of [Ives+99], [Urhan+00]
    • Cost – a query plan with sub-optimal order will still be very expensive!
  • The central piece of the next two techniques we’ll look at…
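The core of a pipelined (symmetric) hash join fits in a few lines; this is a simplified Python sketch, not the implementation from the cited papers:

```python
from collections import defaultdict

def symmetric_hash_join(stream):
    """Pipelined (symmetric) hash join over a stream of tagged tuples.

    `stream` yields ('L', key, tup) or ('R', key, tup) in arrival order.
    Each arriving tuple is inserted into its side's hash table and
    immediately probed against the other side's table, so results are
    produced as soon as matching data has arrived from both (possibly
    slow) sources -- no blocking on either input.
    """
    tables = {'L': defaultdict(list), 'R': defaultdict(list)}
    other = {'L': 'R', 'R': 'L'}
    for side, key, tup in stream:
        tables[side][key].append(tup)             # build on own side
        for match in tables[other[side]][key]:    # probe other side
            yield (tup, match) if side == 'L' else (match, tup)

# Tuples arrive interleaved, as they would from two remote sources.
arrivals = [('L', 1, 'r1'), ('R', 1, 's1'), ('R', 2, 's2'), ('L', 2, 'r2')]
results = list(symmetric_hash_join(arrivals))
```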
Extreme Adaptivity: Eddies
  • The basic idea:
    • Query processing consists of sending tuples through a series of operators
    • Why not treat it like a routing problem?
    • Rely on “back-pressure” (i.e., queue overflow) to tell us where to send tuples
  • Part of the ongoing Telegraph project at Berkeley
    • Large-scale federated, shared-nothing data stream engine
    • Variations in data transfer rates
    • Little knowledge of data sources
Telegraph Architecture
  • Simple “pre-optimizer” to generate initial plan
    • Creates operators, e.g.:
      • Select: predicate(sourceA)
      • Join: predicate(sourceA, sourceB)
      • (No support for aggregation, union, etc.)
    • Chooses implementations
      • Generally relies on pipelined hash joins
  • Goal: dataflow-driven scheduling of operations
    • Tuple comes into the system
    • Adaptively routed through operators in “eddies”
    • May be combined, discarded, etc.

The Eddy
  • Represents set of possible orderings of a subplan
  • Each tuple may “flow” through a different ordering (which may be constrained)
  • N-ary module consisting of query operators
    • Basic unit of adaptivity
    • Subplan with select, project, join operators


Example Join Subplan Alternatives

[Figure: alternative join trees over R1, R2, and R3, combining the predicates R1.x = R3.x, R2.x = R3.x, and R2.x = R1.x in different orders]
Naïve Eddy
  • Given a tuple, route it to an operator that’s ready
    • Analogous to fluid dynamics
    • Adjusts to operator costs
    • Ignores operator selectivity
Lottery-Based Eddy
  • Need to favor more selective operators
    • Ticket given per tuple input, returned per output
    • Lottery scheduling based on number of tickets
    • Now handles both selectivity and cost
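The ticket mechanism can be illustrated with a toy Python sketch; the operator names and exact bookkeeping are assumptions for illustration, not Telegraph's actual code:

```python
import random
from collections import defaultdict

class LotteryEddy:
    """Toy lottery-based eddy.  An operator gains a ticket when it receives
    a tuple and returns one when it emits the tuple, so operators that drop
    many tuples (the selective ones) accumulate tickets and win the routing
    lottery more often, filtering tuples early and cheaply."""

    def __init__(self, operators):
        self.operators = operators                 # name -> predicate
        self.tickets = defaultdict(lambda: 1)      # start with one ticket

    def run(self, tup):
        done = set()
        while len(done) < len(self.operators):
            pending = [op for op in self.operators if op not in done]
            op = random.choices(pending,                        # lottery:
                weights=[self.tickets[o] for o in pending])[0]  # weighted draw
            self.tickets[op] += 1                  # ticket on input
            if not self.operators[op](tup):
                return None                        # tuple dropped; ticket kept
            self.tickets[op] -= 1                  # ticket returned on output
            done.add(op)
        return tup                                 # tuple survived all operators

random.seed(0)
eddy = LotteryEddy({'selective': lambda t: t > 10, 'pass_all': lambda t: True})
for _ in range(50):
    eddy.run(5)      # always dropped by 'selective', which keeps its ticket
```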
Enhancing Adaptivity: Sliding Window
  • Previously, tickets accumulated over the entire query
  • Weighted sliding window approach
    • “Escrow” tickets during a window
    • “Banked” tickets from a previous window
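A minimal sketch of the escrow/banked bookkeeping, assuming a simple tuple-count window; the class and its interface are invented for illustration:

```python
class WindowedTickets:
    """Toy weighted sliding window: tickets earned in the current window
    are held in escrow; at each window boundary they replace the banked
    tickets from the previous window, so stale history is forgotten."""

    def __init__(self, window=100):
        self.window, self.count = window, 0
        self.escrow, self.banked = {}, {}

    def earn(self, op, n=1):
        self.escrow[op] = self.escrow.get(op, 0) + n
        self.count += 1
        if self.count >= self.window:              # window boundary
            self.banked, self.escrow, self.count = self.escrow, {}, 0

    def weight(self, op):
        """Routing weight: stable (banked) plus fresh (escrow) tickets."""
        return self.banked.get(op, 0) + self.escrow.get(op, 0)

w = WindowedTickets(window=2)
w.earn('op_a')
w.earn('op_b')       # boundary: current escrow becomes the banked window
w.earn('op_a')       # fresh escrow ticket on top of the banked one
```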
What about Delays?

Problems here:

  • Don’t know when the join buffers vs. “discards” tuples

[Figure: eddy joining R, S, and T, where source R is slow]

Eddy Pros and Cons
  • Mechanism for adaptively re-routing queries
    • Makes optimizer’s task simpler
    • Can do nearly as well as a well-optimized plan in some cases
    • Handles variable costs, variable selectivities
  • But doesn’t really handle joins very well – attempts to address in follow-up work:
    • STeMs – break a join into separate data structures; requires re-computation at each step
    • STAIRs – create intermediate state and shuffle it back and forth
  • Open questions: pre-optimization? Distribution (DeWitt et al.)? Beyond joins?
Generalizing Adaptive Query Processing
  • We’ve seen a range of different adaptive techniques
  • How do they fit together?
  • Can we choose points between eddies and mid-query re-optimization?
Types of Adaptive Query Processing

  • Adaptive scheduling (query scrambling [UF98], dynamic pipeline scheduling [UF01], XJoin [UF00], pipelined hash join [RS86][I+99], ripple join [HH99])
    • Changes CPU scheduling to improve feedback or reduce delays
  • Redundant computation (competitive execution [AZ96])
    • Compares two or more ways of executing the query
  • Plan partitioning ([S+76][KD98][I+99][M+04])
    • Breaks the plan into stages; re-optimizes future stages as necessary
  • Adaptive info passing ([IT05 sub.])
    • Passes data between parts of an executing plan
  • Adaptive data partitioning
    • Breaks the data into subsets; uses a different plan for each subset
    • The only way to reduce overall computation with fine granularity
    • The first (and only) implementation has been eddies [AH00][R+03][DH04]
Eddies Combine Adaptive Scheduling and Data Partitioning Decisions

Intuitively, each tuple gets its own query plan

  • Route to next operator based on speed and selectivity of each operator
  • Elegant and simple to implement

But performing a join creates subresults at the next level!

Local & greedy choices may result in state that needs to join with all future data!

Consider long-term effects of decisions before making them – separate CPU scheduling from plan selection

Focusing Purely on Adaptive Data Partitioning

Use adaptively scheduled operators to “fill CPU cycles”

Now a query optimizer problem:

Choose a plan that minimizes long-term cost (in CPU cycles)

To allow multiple plans, distribute union through join (and select, project, etc.):

If R1 = R11 ∪ R12 and R2 = R21 ∪ R22, then:

R1 ⋈ R2 = (R11 ∪ R12) ⋈ (R21 ∪ R22)
        = (R11 ⋈ R21) ∪ (R12 ⋈ R22) ∪ (R11 ⋈ R22) ∪ (R12 ⋈ R21)

[Figure: the four partition-wise joins of R1 = R11 ∪ R12 with R2 = R21 ∪ R22]

This generalizes to n-way joins and other SPJ + grouping/union operators…
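The identity can be checked mechanically; this Python sketch joins two partitioned relations both ways and compares the results (the relations and key layout are made up):

```python
from itertools import product

def join(r, s):
    """Toy equi-join on the first field of each tuple."""
    return [(a, b) for a, b in product(r, s) if a[0] == b[0]]

# Partition R1 and R2, join every pairing of partitions, and union:
R1 = [(1, 'a'), (2, 'b')]
R2 = [(1, 'x'), (2, 'y')]
R11, R12 = R1[:1], R1[1:]
R21, R22 = R2[:1], R2[1:]

whole = join(R1, R2)
pieces = (join(R11, R21) + join(R12, R22) +
          join(R11, R22) + join(R12, R21))
assert sorted(whole) == sorted(pieces)   # union over partitions = full join
```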


Adaptive Data Partitioning: Routing Data across Different Plans

[Figure: phased execution of R ⋈ S ⋈ T – the first phase joins partitions R0, S0, T0; later phases use a new plan, with stitch-up joins excluding results already computed (e.g. R0 ⋈ S0, R0 ⋈ S0 ⋈ T0, R1 ⋈ S1 ⋈ T1)]

Options for combining across phases:

  • New results always injected into old plan
  • Old results into new plan
  • Wait until the end – “stitch-up” plan based on best stats

Special Architectural Features for ADP

Monitoring and re-optimization thread runs alongside execution:

  • System-R-like optimizer with aggregation support; uses most current selectivity estimates
  • Periodic monitoring and re-optimization revises selectivity estimates, recomputes expected costs

Query execution with “smart router” operators

Special support for efficient stitch-up plans:

  • Uses intermediate results from previous plans (a special case of answering queries using views [H01])
  • Join-over-union (“stitch-up-join”) operator that excludes certain results
ADP Application 1: Correcting Cost Mis-estimates

Goal: react to plans that are obviously bad

  • Don’t spend cycles searching for a slightly better plan
  • Try to avoid paths that are unlikely to be promising

Monitor/reoptimizer thread watches cardinalities of subresults

  • Re-estimate plan cost, compare to projected costs of alternatives, using several techniques & heuristics (see paper)
  • Our experiments: re-estimate every 1 sec.

“Smart router” operator does the following:

  • Waits for monitor/reoptimizer to suggest replacement plan
  • Re-routes source data into the new plan
  • New plan’s output is unioned with output of previous plan; this is fed into any final aggregation operations
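Those three steps can be modeled with a toy router, where plans are simple sinks and the monitor/re-optimizer is a callback; all the names here are hypothetical:

```python
def smart_router(source, initial_plan, reoptimizer):
    """Toy 'smart router': streams source tuples into the current plan,
    and re-routes all future tuples whenever the monitor/re-optimizer
    (modeled as a callback) suggests a replacement plan.  In the real
    system the plans' outputs are unioned downstream and fed into any
    final aggregation."""
    current = initial_plan
    for tup in source:
        replacement = reoptimizer()        # None = keep the current plan
        if replacement is not None:
            current = replacement          # switch plans mid-stream
        current.append(tup)                # plans modeled as simple sinks

# Usage: the "monitor" suggests switching to plan_b at the third tuple.
plan_a, plan_b = [], []
calls = []
def monitor():
    calls.append(1)
    return plan_b if len(calls) == 3 else None

smart_router(range(5), plan_a, monitor)
```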
Correcting for Unexpected Selectivities

[Experimental results: Pentium IV 3.06 GHz, Windows XP]

ADP Application 2: Optimizing for Order

Most general ADP approach:

  • Pre-generate plans for general case and each “interesting order”
  • “Smart router” sends tuple to the plan whose ordering constraint is followed by this tuple
  • But with multiple joins, MANY plans

Instead: do ADP at the operator level

  • “Complementary join pair”
  • Does its own stitch-up internally
  • Easier to optimize for!

Can also do “partial sorting” at the router (priority queue)
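The “partial sorting” idea can be sketched with Python's heapq, assuming a bounded buffer of tuples at the router:

```python
import heapq

def partial_sort_router(stream, buffer_size=1024):
    """Hold a bounded min-heap of buffered tuples and always emit the
    smallest one; mostly-ordered input comes out (nearly) sorted without
    blocking on a full sort of the input."""
    heap = []
    for tup in stream:
        heapq.heappush(heap, tup)
        if len(heap) > buffer_size:
            yield heapq.heappop(heap)       # emit smallest buffered tuple
    while heap:                             # drain the buffer at end of input
        yield heapq.heappop(heap)

# Slightly out-of-order input is fixed up with only a 2-tuple buffer.
out = list(partial_sort_router([3, 1, 2, 5, 4], buffer_size=2))
```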

[Figure: a complementary join pair – routers direct tuples from R and S into a merge join and a hash join (hash tables h(R), h(S)), with priority queues (Q) buffering each input]
Exploiting Partial Order in the Data

[Experimental results: Pentium IV 3.06 GHz, Windows XP (1024-tuple queue)]

[Figure: aggregation above the join – SUM(T.y) GROUP BY T.x over R ⋈ T – vs. pre-aggregating T with SUM(T.y) GROUP BY T.x, T.joinAttrib below the join and computing SUM of the partial sums above it]
ADP Over “Windows”: Optimizing for Aggregation
  • Group-by optimization [CS94]:
    • May be able to “pre-aggregate” some tuples before joining
    • Why: aggregates can be applied over union
  • But once we insert pre-aggregation, we’re stuck (and it’s not pipelined)
  • Our solution:
    • “Adjustable window pre-aggregation”
    • Change window size depending on how effectively we can aggregate
    • Also allows data to propagate through the plan – better info for adaptivity, early answers
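The underlying identity, that SUM distributes over union so partial sums computed below the join can be summed above it, can be sketched as follows (the relation contents are made up):

```python
from collections import defaultdict

# T tuples: (group_key, join_attr, value).  Pre-aggregate by
# (group_key, join_attr) below the join; sum the partial sums above it.
T = [('g1', 'k', 3), ('g1', 'k', 4), ('g2', 'k', 5)]

def pre_aggregate(tuples):
    partial = defaultdict(int)
    for x, j, y in tuples:
        partial[(x, j)] += y                    # SUM(T.y) GROUP BY x, j
    return [(x, j, s) for (x, j), s in partial.items()]

direct = defaultdict(int)
for x, _, y in T:                               # SUM(T.y) GROUP BY T.x
    direct[x] += y

final = defaultdict(int)
for x, _, s in pre_aggregate(T):                # SUM of the partial sums
    final[x] += s

assert dict(direct) == dict(final) == {'g1': 7, 'g2': 5}
```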


Adaptive QP in Summary

A variety of different techniques, focusing on:

  • Scheduling
  • Comparison & competition
  • Information passing
  • Data + plan partitioning

A field that is still fairly open – missing:

  • Effective exploration methods
  • A true theory!
    • What’s possible? What kinds of queries make sense to adapt?
  • Guarantees of optimality and convergence (perhaps under certain assumptions)