CS511 Design of Database Management Systems

CS511Design of Database Management Systems Lecture 19: Data Cube Kevin C. Chang

Central Notion: Aggregate • Aggregate functions: • COUNT, SUM, AVG, MAX, MIN, MaxN, MinN • select SUM(sales) from Sales group by timeID • ?? General definition for “aggregate” functions? • some sort of summarization • ?? Why is it always associated by Group-by?

Aggregate • Aggregate definition: A function that • maps a set of values into a single value • Algorithm Skeleton: • Defining new aggregate: Init, Iter, Final • ?? How to implement the methods for Average? Max? Aggregate(set VS): Init(&handle); for v in VS: Iter(&handle, v); result = Final(&handle); return result;

Aggregate: Implementing Average Average(set VS): Init(&handle); for v in VS: Iter(&handle, v); result = Final(&handle); return result; Average(set VS): sum = 0; count = 0; for v in VS: count = count + 1; sum = sum + v; result = sum/count; return result;

Before Cube: Grouping and Aggregates • Aggregate by groups with Group-by • select sum(sales) from Sales group by timeID • ? Problems?

MOLAP, ROLAP, HOLAP • MOLAP: • actually stores things in multi-d array format • special index structures • much of the cube is empty! • essentially everything is pre-computed • ROLAP: • uses relational tables and engines • with lots of pre-computation (derived data) • HOLAP: • hybrid • multi-d arrays as ADT (abstract data type) extension

Location Sales y x milk soda eggs soap Product 1 2 3 4 Time Relational OLAP: Requirements • Cube modeling: • how to model a multidimensional cube as a relation • Cube operations: • what language primitives for cubes? • how to efficiently compute cubes? measure dimensions Sales fact table prodID timeID locID sales a 1 x 10 a 2 x 20 b 2 x 30

CUBE Operator: Starting from Core • Input: A “core” relation (the “basic” cube) • a fact table, e.g.: Sales(prodID, timeID, locID, sales) • core size = C1´ ¼´ Cn where |Dimensioni| = Ci • typically generated by a Group-by on dimensions (“group-by core”) • thus all dimensions together determine a measure prodID timeID locID sales a 1 x 10 a 2 x 20 b 2 x 30 c 2 x 40 b 3 x 50 a 1 y 72 40 prodID a b c 30 50 x y locID 10 20 1 2 3 timeID

CUBE Operator: Forming a Cube • Add “ALL” (*) to every specified dimension • cube size = (C1 +1)´ ¼´ (Cn + 1) • measure value of cell with * is an aggregate • e.g.: SUM(*, 1, x) = 10, SUM(*, *, x) = 150, SUM(*, *, *) = 222 222 prodID timeID locID sales a 1 x 10 a 2 x 20 b 2 x 30 c 2 x 40 b 3 x 50 a 1 y 72 * 1 x 10 a * x 30 … … … … … … … … … *** 222 10 90 50 150 40 40 prodID a b c * 30 50 80 x y * locID 10 20 30 1 2 3 * timeID

Cubes: Unifying Aggregates • 0-D: overall aggregate • 1-D: single group-by • 2-D: cross-tabulation • n-D: cube of all sub-aggregates Cube: core plus all aggregates as points, lines, planes, cubes, …

Cube: Relation with Symmetric Aggregates • Successfully modeled as a relation • What are in a cube: • Histograms: • a cell in cube is “statistics” (aggregate) of the category • Subtotal: aggregate at all levels • cube contains all subtotals (on any set of dimensions) • e.g.: SUM(*, *, x) = 150: subtotal on prodID and timeID • Cross tabulation: • cube is a n-dimensional cross-tabulation

Cube Representations: Alternatives?# As in [Date, 1996]: Model Year Color (M,Y,C) (M,Y,*) (M,*,*) Chevy 1994 Black 50 90 290 Chevy 1994 White 40 90 290 Chevy 1995 Black 85 200 290 Chevy 1995 White 115 200 290 Q: I had several questions on the logic of the authors, when they tried to explain why their new implementation was better than pivot tables or Date's suggestion page 558. It seemed to me that Date's suggestion was more compact than the pivot table and that the pivot table more compact than their (table 6) suggestion using the ALL. That is while they were shortening the length of the columns, the length of the rows grow a great deal! What do you think?

Cube Representations: [Date, 1996] • Problems: • lots of columns ( n dimensions  2n more columns) • redundant values (same aggregates repeated) • Growth in columns vs. in rows • relational DB designed to handle growing num. of rows • ? compare “table growth” by delta-column vs. delta-row? As in [Date, 1996]: Model Year Color (M,Y,C) (M,Y) (M) Chevy 1994 Black 50 90 290 Chevy 1994 White 40 90 290 Chevy 1995 Black 85 200 290 Chevy 1995 White 115 200 290

?? Cube: Any Problem? • ?? Problems? • what do we pay for modeling cube as relation? • does a cube contains all sorts of aggregates?

Cube: Problems • “All” can be a headache! (special handling!) • represents a set value • change all scalar values to a set of single element? • context-sensitive interpretation: All = All? • just like Null; maybe worse (at least Null=Null?) • All-or-nothing aggregate # • no granularity of aggregate scopes • e.g.: aggregate of some (but not all) products • use materialized views for such “some” aggregates

ROLLUP Operator: Asymmetric Aggregates • Cube can be overkill sometimes • cube: symmetric aggregates of all combinations • ? example that you don’t want cube? • Rollup v1, …, vn: • progressively aggregate over vn, …, v1 • ? how many new aggregate tuples? prodID timeID locID sales a 1 x 10 a 2 x 20 b 2 x 30 c 2 x 40 b 3 x 50 a 1 y 72 a 1 * 82 a 2 * 20 … … … … … … … … … a ** 102 … … … … … … … … … *** 222

Computing Cubes: 2N-Algorithm • Compute each aggregate cell independently • aggregate cells: cells with an “*”: e.g., SUM(*, 1, x) • ?? Sample algorithm: Any improvement? Average(set VS): Init(&handle); for v in VS: Iter(&handle, v); result = Final(&handle); return result; Cube(coreTable F): for all aggCells C: Init(&handle-C); for tuple t in F: for all aggCells C: Iter(&handle-C, t); for all aggCells C: value-C = Final(&handle-C); cube = F + all aggCells; return cube;

Computing Cubes: 2N-Algorithm • Update each aggCell independently by intersecting tuples • Tuple (x1, …, xn, v) ==> intersects (2n-1) aggCells ==> |F|´(2n-1) Iters • minus 1 because we assume (x1, …, xn, v) in core already computed • e.g.: Sales(a, 1, x, value) ==> 7 cells: (a, 1, x), (a, 1, *), (a, *, x), etc • ?? Any problem for further improvement? Say, for computing SUM? Cube(coreTable F): for all aggCells C: Init(&handle-C); for tuple v in F: for all intersecting aggCells C: Iter(&handle-C, v); for all aggCells C: value-C = Final(&handle-C); cube = F union all aggCells; return cube; (2n-1) aggCells

?? 2N-Algorithm: Problem • Computing a cube of SUM: • core: • cube: • Total Iters: (2 ´ 3) ´ (22-1) =18 • (1, a) invokes Iter on (1,*), (*, 1), (*, *) • Each aggCell determined by a subset of core • (1, *): by (1, a), (1, b), (1, c) • (*, *): by (1, a), (1, b), (1, c), (2, a), (2, b), (2, c) • ?? What can be saved? a b c 1 1020 30 2 40 50 60 a b c * 1 1020 3060 2 40 50 60150 *50 70 90210

2N-Algorithm: Problem • Computing a cube of SUM: • core: • cube: • Total Iters: (2 ´ 3) ´ 2 + 2 =14 • (1, a) invokes Iter on (1,*), (*, 1) • (*, *): by (1, *), (2, *) • Saving of Iters = 18 - 14 = 4 a b c 1 1020 30 2 40 50 60 a b c * 1 1020 3060 2 40 50 60150 *50 70 90210

Distributive Aggregates • An aggregate function F is distributive if • Examples: • SUM: F = SUM, G = SUM • COUNT: F = COUNT(), G = SUM • How many “Iters” do you save? • for total aggregate (*, …, *): m vs. (m´n) x11x12 … … … x1n x21x22 … … … x2n … … … xm1xm2 … … … xmn F(x11x12 … … … x1n) F(x21x22 … … … x1n) … … … F(xm1xm2 … … … xmn) F( ) = G( )

Algebraic Aggregates • An aggregate function F is algebraic if where G returns a M-tuple, for a constant M • Examples: • Average: G returns (sum, count), H = total-sum/total-count • Otherwise: holistic • typically need to return the whole subset x11x12 … … … x1n x11x12 … … … x1n x21x22 … … … x2n … … … xm1xm2 … … … xmn G(x11x12 … … … x1n) G(x21x22 … … … x1n) … … … G(xm1xm2 … … … xmn) F( ) = H( )

Computing Cube Efficiently# • Much research done and going on • how to build super-agg on sub-agg? • how to compute large cube in memory? • Ref: Zhao et al. SIGMOD 97 (3rd paper in chap) • MOLAP approach: Cube as multi-d arrays • chunk arrays so that each chunk fits in memory • aggregate computation on “subsumption lattice” • algorithm = “minimal-memory” spanning tree A B C A B A C B C A B C ALL

What’s Next • Association-rule mining: problem and solution

Cube: All Sub-Aggregates The Data Cube and The Sub-Space Aggregates FORD 1990 CHEVY 1991 1992 1993 By Year By Make By Make & Year RED WHITE BLUE By Color & Year By Make & Color Sum By Color

CS511 Design of Database Management Systems