
Online Aggregation

Joseph M. Hellerstein

Peter J. Haas

Helen J. Wang



Motivation for Online Aggregation

  • Traditional aggregation takes a long time to return a very small final result from a large amount of data

  • The result does not have to be very precise!

  • Online aggregation allows users to observe the progress of their queries and control execution on the fly



New interface for aggregation

  • Observe the progress of their queries

  • Control execution on the fly



Statistical estimation

  • Users do not need to specify a stopping condition a priori

  • The interface is easier for users with no statistical background

  • It requires more powerful statistical estimation techniques (Hoeffding’s inequality versus Chebyshev’s inequality)
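
A quick way to see why this matters (standard forms of the two inequalities for a running average Y_n of n samples bounded in [a, b] with mean μ and variance σ²; this comparison is added here for illustration, not taken from the slides):

    \[
    \Pr\{|Y_n - \mu| \ge \epsilon\} \;\le\; \frac{\sigma^2}{n\epsilon^2} \quad \text{(Chebyshev)},
    \qquad
    \Pr\{|Y_n - \mu| \ge \epsilon\} \;\le\; 2\exp\!\Bigl(-\frac{2n\epsilon^2}{(b-a)^2}\Bigr) \quad \text{(Hoeffding)}.
    \]

The exponential decay in n is what lets Hoeffding-style bounds yield much tighter conservative intervals than Chebyshev for the same sample size.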



Related work

  • Online Analytical Processing (OLAP): batch mode and precise computation

  • “Fast-first” query processing is similar to online aggregation but has simpler performance goals

  • APPROXIMATE: defines an approximate relational algebra used to process standard relational queries in an iteratively refined manner



Usability goals

  • Continuous observation

  • Control of time/precision

  • Control of fairness/partiality



Performance goals

  • Minimum time to accuracy: produce a useful estimate of the final answer ASAP

  • Minimum time to completion: a secondary goal, since the user is assumed to terminate processing long before the final answer is produced

  • Pacing: guarantee a smoothly and continuously improving display



A naïve approach

  • Use user-defined output functions supported by POSTGRES to provide simple running aggregates

  • Cannot support complex aggregates such as aggregation with GROUP BY

  • Current systems optimize only for accurate solutions

  • Skip factor k – the DBMS ships an update to the user interface only after every k input tuples have been processed (a minimal sketch follows)
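
As an illustration only (the function and names below are hypothetical, not POSTGRES code), a running AVG with skip factor k might look like this:

    # Hypothetical sketch: the running estimate is updated on every tuple,
    # but the display is refreshed only once every k tuples to limit
    # user-interface overhead.
    def running_avg(tuples, k, display):
        total, n = 0.0, 0
        for value in tuples:
            total += value
            n += 1
            if n % k == 0:          # skip factor: refresh every k tuples
                display(total / n)  # current running estimate
        if n:
            display(total / n)      # final answer once the scan completes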



Random access to data

We need to retrieve data in random order to produce meaningful statistical estimation. Three ways to get records in random order:

  • Heap scans

  • Index scans

  • Sampling from indices (less efficient)



Non-blocking GROUP BY and DISTINCT

  • Sorting is a blocking algorithm and only one group is computed at a time after sorting

  • Hashing is non-blocking, but the hash table needs to fit in memory for good performance

  • Hybrid Cache (an extension of hybrid hashing) might be a good compromise


Index striding

[Figure: an index partitioned into groups k1, k2, k3]

  • Hash-based grouping can be unfair to small groups

  • Solution: probe the index to find all the groups, then process tuples from each group in round-robin fashion (see the sketch after this list)

  • The per-group rate can be controlled by weighting the schedule

  • Fair even for groups with very different cardinalities
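
A minimal sketch of the weighted round-robin idea, assuming per-group index cursors and user-supplied weights (all names here are hypothetical, not the paper's implementation):

    from itertools import cycle

    # group_cursors: dict mapping group -> iterator over that group's tuples
    # weights: dict mapping group -> number of tuples to fetch per round
    def index_stride(group_cursors, weights):
        schedule = cycle(list(group_cursors))
        active = set(group_cursors)
        while active:
            g = next(schedule)
            if g not in active:
                continue
            for _ in range(weights.get(g, 1)):
                try:
                    yield g, next(group_cursors[g])  # feeds the running aggregate
                except StopIteration:
                    active.discard(g)                # group exhausted, drop it
                    break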



Index striding - Continued

  • Efficiency: will be as good as scanning a relation via a clustered secondary index if

    • Index is the primary access method or

    • Relation is clustered by the grouping columns or

    • Index keys contain both the grouping and aggregation columns, with the grouping columns as a prefix



Non-blocking join algorithms (1)

  • Sort-merge join is not acceptable for online aggregation because sorting is blocking

  • Hybrid hash join blocks for the time required to hash the inner relation

  • Pipeline hash join techniques may be appropriate for online aggregations when both relations are large

  • Merge join (without sort) and hash join produce output in a particular order – not good for statistical estimation



Non-blocking join algorithms (2)

  • The “safest” traditional join algorithm is nested-loops join, particularly if there is an index on the inner relation

  • More on ripple join coming next…



Issues in optimization (1)

  • Avoid sorting completely

  • It is undesirable to produce results ordered on aggregation or grouping columns

  • Divide cost model into two parts:

    • Time t_d spent in blocking operations (dead time)

    • Time t_o spent producing output

      Use the cost function f(t_o) + g(t_d), where f is linear and g is super-linear, to “tax” operations with too much dead time
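
For example (an illustrative choice of f and g, not one given in the slides), taking f linear and g quadratic gives

    \[
    \mathrm{cost} \;=\; f(t_o) + g(t_d) \;=\; c_1\,t_o + c_2\,t_d^2 ,
    \]

so doubling a plan's dead time quadruples its dead-time penalty, while output time is charged only linearly.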



Issues in optimization (2)

  • Prefer plans that maximize user control (such as index striding)

  • There is a trade-off between the output rate of a query and its time to completion

    • Create natural controls in this regard for naïve users

    • Running multiple versions of a query is one solution but wastes computing resources



Running confidence intervals (1)

  • Confidence parameter p(0,1) is prespecified

  • Display a precision parameter єn such that running aggregate Yn is within  єn of the final answer μ with probability approximately equal to p. [Yn- єn,Yn+ єn] contains μ with probability approximately equal to p



Running confidence intervals (2)

  • Three types of confidence interval can be constructed from the n retrieved records:

    • Conservative confidence intervals, based on Hoeffding’s inequality or a recent extension of it, valid for all n ≥ 1

    • Large-sample confidence intervals, based on central limit theorems (CLTs), valid when n is both large enough for the CLT to apply and small relative to the table size

    • Deterministic confidence intervals, which contain μ with probability 1, but only for very large n



Running confidence intervals (3)

  • SELECT AVG(exp) FROM R;

    v(i) (1 ≤ i ≤ m): the value of exp applied to tuple i

    L_i: the random index of the i-th tuple retrieved from R

    a and b are a priori bounds with a ≤ v(i) ≤ b for 1 ≤ i ≤ m

  • Conservative confidence interval equations:
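
The equations themselves were images in the original deck; as a sketch of how such an interval follows from Hoeffding's inequality under the bounds a ≤ v(i) ≤ b (a standard derivation, not copied from the slides):

    \[
    \Pr\{|Y_n - \mu| \ge \epsilon\} \;\le\; 2\exp\!\Bigl(-\frac{2n\epsilon^2}{(b-a)^2}\Bigr)
    \;\Longrightarrow\;
    \epsilon_n \;=\; (b-a)\sqrt{\frac{\ln\bigl(2/(1-p)\bigr)}{2n}} ,
    \]

so that [Y_n − ε_n, Y_n + ε_n] contains μ with probability at least p, for every n ≥ 1.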



Running confidence intervals (4)

  • Large-sample confidence interval equations

  • By central limit theorems (CLTs), Y_n approaches a normal distribution with mean μ and variance σ²/n as the sample size n increases; σ² can be replaced by the estimator T_{n,2}(v)
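
Under that normal approximation, a sketch of the resulting half-width (z_p denotes the standard-normal quantile; the notation is assumed here, not taken from the slides):

    \[
    \epsilon_n \;=\; z_p\,\sqrt{\frac{T_{n,2}(v)}{n}},
    \qquad \text{where } \Pr\{-z_p \le N(0,1) \le z_p\} = p .
    \]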



Performance issues – skip factor



Performance issues – index striding (a large group)



Performance issues – index striding (a small group)



Conclusion

  • Aggregates have running output and confidence intervals

  • Hash-based grouping and duplicate elimination are non-blocking

  • Index striding gives more control over GROUP BY



Future work

  • Graphical user interface

  • Nested queries

  • Checkpointing and continuation



Ripple Joins for Online Aggregation

Peter J. Haas

Joseph M. Hellerstein



Join algorithms for online aggregation

  • Sort-merge and hash join algorithms are blocking algorithms – not acceptable in online aggregation

  • Merge join (without sorting) produces ordered output – bad for the statistical estimators used in online aggregation

  • Nested loop join is the best, but...



An artificial bad example for nested loop join

  • SELECT AVG(S.a + R.b/10000000) FROM R, S

    If R is the inner relation, then for each tuple from S we must scan the whole relation R, yet that scan contributes almost nothing to the running estimate, since R.b/10000000 is negligible compared with S.a.



Overview of ripple join (1)



Overview of ripple join (2)

  • online nested-loops join is a special case of ripple join


Ripple join algorithms (1)

[Figure: a ripple join over R and S, expanding the (n-1)×(n-1) sampled rectangle at each step]

  • It can be viewed as a generalization of nested-loops join in which the traditional roles of “inner” and “outer” relation are continually interchanged during processing
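
A much-simplified sketch of this idea for two in-memory lists r and s and a join predicate (a square ripple join with aspect ratio 1; this is not the paper's iterator):

    # At step n, the new r-tuple is joined with all previously seen s-tuples,
    # and the new s-tuple with all r-tuples seen so far (including r[n]), so
    # the "inner" and "outer" roles alternate within every step and each pair
    # of the cross product is examined exactly once.
    def ripple_join(r, s, predicate):
        for n in range(max(len(r), len(s))):
            if n < len(r):
                for y in s[:min(n, len(s))]:
                    if predicate(r[n], y):
                        yield r[n], y
            if n < len(s):
                for x in r[:min(n + 1, len(r))]:
                    if predicate(x, s[n]):
                        yield x, s[n]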



Ripple join algorithms (2)

  • The iterator interface must be modified for this algorithm – it has to keep track of more state

  • The aspect ratio does not have to be 1

  • It can be extended to multiple ripple joins



Ripple join variants

  • Block ripple join improves I/O performance

  • Indexed ripple join is essentially index-enhanced nested-loops join; the roles of the inner and outer relations no longer alternate

  • Hash ripple join can be used for equijoin queries – it keeps two in-memory hash tables, one for R and one for S (sketched below)
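
A minimal sketch of that symmetric hashing idea, assuming both hash tables fit in memory and hypothetical key-extractor functions (not the paper's code):

    from collections import defaultdict

    def hash_ripple_join(r_iter, s_iter, r_key, s_key):
        # Tuples are consumed alternately from R and S; each new tuple first
        # probes the other relation's hash table and is then inserted into its
        # own, so every matching pair is produced exactly once.
        r_hash, s_hash = defaultdict(list), defaultdict(list)
        r_iter, s_iter = iter(r_iter), iter(s_iter)
        done_r = done_s = False
        while not (done_r and done_s):
            if not done_r:
                try:
                    r = next(r_iter)
                    for s in s_hash[r_key(r)]:   # probe S's hash table
                        yield r, s
                    r_hash[r_key(r)].append(r)   # insert into R's hash table
                except StopIteration:
                    done_r = True
            if not done_s:
                try:
                    s = next(s_iter)
                    for r in r_hash[s_key(s)]:   # probe R's hash table
                        yield r, s
                    s_hash[s_key(s)].append(s)   # insert into S's hash table
                except StopIteration:
                    done_s = True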



Estimators for SUM, COUNT and AVG
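
The formulas on this slide were figures; as a hedged sketch based on the standard scale-up estimators for sampling from a two-relation join (the notation n_R, n_S, expr, and pred is assumed here, not taken from the paper), after n_R tuples of R and n_S tuples of S have been retrieved:

    \[
    \widehat{\mathrm{SUM}} \;=\; \frac{|R|\,|S|}{n_R\,n_S}\sum_{i \le n_R}\sum_{j \le n_S}
      \mathrm{expr}(r_i, s_j)\,\mathbf{1}[\mathrm{pred}(r_i, s_j)],
    \qquad
    \widehat{\mathrm{COUNT}} \;=\; \frac{|R|\,|S|}{n_R\,n_S}\sum_{i \le n_R}\sum_{j \le n_S}
      \mathbf{1}[\mathrm{pred}(r_i, s_j)],
    \]

and AVG is estimated by the ratio SUM/COUNT, in which the |R||S|/(n_R n_S) scale factors cancel.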



Confidence intervals

  • Use central limit theorems (CLT’s) to compute “large-sample” confidence intervals

  • Fix the problems with classical CLTs by using a newly defined variance σ² for each type of aggregate query



Ripple optimization:choosing aspect ratios (1)

  • The blocking factor β is prespecified; we want to optimize the β_k's – the aspect-ratio parameters

  • Minimize the length of the running confidence interval

such that

    β_1 β_2 β_3 ⋯ β_K β^(K−1) ≤ c (c is determined by the animation speed)

    1 ≤ β_k ≤ m_k/β for 1 ≤ k ≤ K

    β_1, β_2, β_3, …, β_K integer



Choosing aspect ratios (2)

  • Solve a relaxed version of the optimization problem by dropping all constraints other than the first one

  • Adjust the ratios during execution to satisfy the other constraints

  • Start the ripple join with all β_k = 1 and begin updating the aspect ratios after a certain amount of time



Performance (1)

  • SELECT ONLINE_AVG(enroll.grade)

    FROM enroll, student

    WHERE enroll.sid = student.sid

    AND student.honors_code IS NULL;

  • Hash ripple join gives a tight confidence interval within seconds

  • Nested-loops join takes over 10 seconds to begin producing output

  • The best batch join algorithm (hybrid hash join) takes 208 seconds to complete



Performance (2)



Performance (3) – choosing aspect ratios

  • A lower animation speed is selected to allow a wider range of aspect ratios

  • Query:

    SELECT ONLINE_AVG(d.grade/a.grade)

    FROM enroll d, enroll a

    WHERE d.college = 'Education'

    AND a.college = 'Agriculture'

    AND a.year = d.year;



Performance (4) – choosing aspect ratios



Conclusion

  • The bigger the database, the more attractive online join algorithms appear

  • User can control animation speed to trade off between precision and updating speed

  • The system will adjust aspect ratios accordingly to achieve the best updating performance



Future work

  • Optimization for ripple joins – choice between many variants of ripple joins

  • Parallelization of ripple joins



Online Dynamic Reordering for Interactive Data Processing

Vijayshankar Raman

Bhaskaran Raman

Joseph M. Hellerstein



Motivating applications

  • Unclustered index striding in online aggregation might be very inefficient because of random I/Os

  • Scalable spreadsheets want to provide services like sorting, scrolling and jumping instantaneously

  • The sort algorithm used in some query plans is not the best solution



The Prefetch and Spool (P&S)

  • Uses the time while the application consumes tuples to prefetch interesting tuples into a buffer

  • The ratio of the different groups in the buffer depends on the user's interest

  • Spools tuples (in chunks) that are not currently interesting onto a side disk

  • Phase 1 scans from the input, and Phase 2 scans from the side disk in a chosen order (a simplified sketch follows)
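
A much-simplified sketch of those two phases, assuming a hypothetical is_interesting predicate and grouping function (the in-memory buffer and the adaptive reordering policy of the real system are omitted here):

    import pickle, tempfile

    def prefetch_and_spool(source, group_of, is_interesting):
        # Phase 1: scan the input once, delivering tuples from currently
        # interesting groups right away and spooling the rest to a side file.
        side = tempfile.TemporaryFile()
        for t in source:
            if is_interesting(group_of(t)):
                yield t
            else:
                pickle.dump(t, side)      # spooled (in the real system, in chunks)
        # Phase 2: re-read the spooled tuples from the side file.
        side.seek(0)
        while True:
            try:
                yield pickle.load(side)
            except EOFError:
                break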



Policies for online reordering

  • User preferences and the chosen metric (confidence, rate, or strict) determine the delivery priority of the different groups

  • We want to maximize the rate at which the feedback function rises as the number of tuples we retrieve increases



Performance (1)

  • Compared to Index Striding

    • Much better than unclustered index striding because of fewer random I/Os

    • For extremely small groups, index striding might still be better

  • Works better at slower processing rates, because P&S has more time to build up the buffer between “get” calls



Performance (2)

  • Works well for scalable spreadsheets

  • After P&S finishes Phase 1, tuples at a point of interest can be retrieved in milliseconds, whereas a blocking sort takes 15 minutes

  • A random jump is not as good as a short jump, because nearby data has higher priority (under the strict metric)



Performance (3)

  • Can be used to replace sort in some traditional query plans

  • Exact ordering is not needed in some query plans; reordering provides a non-blocking alternative, which benefits pipelined parallelism and interactive estimation techniques such as online aggregation



Future work

  • Other feedback functions for different scenarios

  • Providing data for multiple users

  • Uses in graphical data visualization

