Data Aggregation Operators: A Comprehensive Overview

Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals Jim Gray Microsoft Adam Bosworth Microsoft Andrew Layman Microsoft Hamid Pirahesh IBM Presented by: Changwu Li Based on notes by Jim Gray

Spread Sheet Extract analyze Table Price vs Speed Size vs Speed 15 4 10 10 Cache Nearline Offline 1 Tape 12 2 Main Tape 10 10 Secondary Disc $/MB Size(B) Online Online Secondary 9 0 Tape 10 10 Tape Disc visualize Main Offline Nearline Tape Tape 6 -2 10 10 Cache 3 -4 10 10 -9 -6 -3 0 3 -9 -6 -3 0 3 10 10 10 10 10 10 10 10 10 10 Access Time (seconds) Access Time (seconds) The Data Analysis Cycle • User extracts data from database with query • Then visualizes, analyzes data with desktop tools

Relational Aggregate Operators • SQL has several aggregate operators: • sum(), min(), max(), count(), avg() • Other systems extend this with many others: • stat functions, financial functions, ... • The basic idea is: • Combine all values in a column into a single scalar value. • Syntax select sum(units) from inventory;

Relational Group By Operator • Group By allows aggregates over table sub-groups • Result is a new table • Syntax: select deptno, sum(salary) from emp group by deptno

sum M T W T F S S AIR HOTEL FOOD MISC Problems With This Design • Users Want Histograms • Users want sub-totals and totals • drill-down & roll-up reports • Users want CrossTabs • Conventional wisdom • These are not relational operators • They are in many report writers and query engines F() G() H()

Table 5a: Ford Sales Cross Tab Ford 1994 1995 total (ALL) black 50 85 135 white 10 75 85 total (ALL) 60 160 220 A cross tab example

How to solve this problem? Answer: cube

The Idea: Think of the N-dimensional CubeEach Attribute is a Dimension • N-dimensional Aggregate (sum(), max(),...) • fits relational model exactly: • a1, a2, ...., aN, f() • Super-aggregate over N-1 Dimensional sub-cubes • ALL, a2, ...., aN , f() • a3 , ALL, a3, ...., aN , f() • ... • a1, a2, ...., ALL, f() • this is the N-1 Dimensional cross-tab. • Super-aggregate over N-2 Dimensional sub-cubes • ALL, ALL, a3, ...., aN , f() • ... • a1, a2 ,...., ALL, ALL, f()

An Example CUBE

Why the ALL Value? • Need a new “Null” value (overloads the null indicator) • Value must not already be in the aggregated domain • Can’t use NULL since may aggregate on it. • Think of ALL as a token representing the set • All(color)={red, white, blue}, • All(year)={1990, 1991, 1992}, • All(model)={Chevy, Ford} • Follow “set of values” semantics.

CUBE operator: Syntax select model, make, year, sum(sales) from car_sales where model in {“chevy”, “ford”} and year between 1990 and 1994 group by model, make, year with cube having sum(sales) > 0; • Proposed syntax: • Note: Group By operator repeats aggregate list • in select list • in group by list

How To Compute the Cube? • If each attribute has Ni values CUBE has P (Ni+1) values • Compute N-D cube with hash if fits in RAM • Compute N-D cube with sort if overflows RAM • Same comments apply to subcubes: • compute N-D-1 subcube from N-D cube. • Aggregate on “biggest” domain first when >1 deep • Aggregate functions need hidden variables: • e.g. average needs sum and count.

Example: • Compute 2D core of 2 x 3 cube • Then compute 1D edges • Then compute 0D point • Works for algebraic and distributive functionsSaves “lots” of calls

Real world implementation • Both Oracle 9i and SQL server 2000 • An example in Oracle 9i: select deptno, job, sum(sal) as salary from emp group by cube(deptno, job)

DEPTNO JOB SALARY -------- ------------------ -------------------- 10 CLERK 1300 10 MANAGER 2450 10 PRESIDENT 5000 10 8750 20 ANALYST 6000 20 CLERK 1900 20 MANAGER 2975 20 10875 30 CLERK 950 30 MANAGER 2850 30 SALESMAN 5600 30 9400 ANALYST 6000 CLERK 4150 MANAGER 8275 PRESIDENT 5000 SALESMAN 5600 2902

1999 ACM Turing Award Jim Gray

Data Aggregation Operators: A Comprehensive Overview