Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse.

Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse. Authors : H. V. Jagadish Laks V. S. Lakshmanan Divesh Srivastava Presented by: Ittay Freiman

The Problem • Disk I/O is a major parameter in the cost of a (‘select-where’) query. • Physical layout of records impacts how much I/O needs to be done. • Fragmentation of results costs more. • No linear ordering of records can be optimal for any query.

Glossary • Grid query - a query according to hierarchies.

Linear Clustering & Workload • Linear clustering - the ordering of records on disk. • Workload - frequencies on the different types (classes) of queries. P1 P2 Hilbert

The Problem - Examples Expected cost (<total cost of all queries>/<queries #> Relative costs (<best total cost> / <worst>)

Agenda • Introduction & definitions. • Lattice, lattice paths. • Cost of a lattice path. • Algorithm to find an optimal lattice path. • Snaking. • Experiments.

Definitions • Query class - a vector (v1,…,vk) of level numbers in a grid. • (0,..,0) - All queries that return a single cell. • (1,..,1) - All queries returning 2x2..x2 “block” • Etc. • For example: • Q1 is from the (1,1) class • Q2 is from the (2,1) class

Definitions • (v1,..,vk) <=def (u1,..,uk)  v1<= u1,.., vk<= uk. • (1,0) <= (1,0) , (1,1) , (2,1) • (1,0) ? (0,1) - not known • (0,..,0) - lower bound, (u1max,..,ukmax) - upper bound. • D-successor - (u1,..,uk) is a d-successor of (v1,..,vk) if (v1,., vi+1,..,vk) = (u1,..,uk) • (1,1) is a d-successor of (1,0)

Lattice path • Sequence of vectors, starting from the lower bound and ending in the upper bound when each vector is a d-successor of its previous.

(2,2) - all the grid (1,2) (2,1) (2,0) (0,2) (1,1) (1,0) (0,1) (0,0) - one cell Example • A 2D grid with binary hierarchies of 3 levels each: {(0,0),(0,1),(0,2),(1,2),(2,2)} example example

{(0,0),(0,1),(0,2),(1,2),(2,2)} {(0,0),(0,1),(0,2),(1,2),(2,2)} {(0,0),(0,1),(0,2),(1,2),(2,2)} • • • • • • • • {(0,0),(0,1),(0,2),(1,2),(2,2)} {(0,0),(0,1),(0,2),(1,2),(2,2)} {(0,0),(0,1),(0,2),(1,2),(2,2)} • • • • • • • • {(0,0),(0,1),(0,2),(1,2),(2,2)} • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • Clustering via lattice paths • Go through the path. Each entry is a query class. For each query in the current entry, cluster the cells so they’ll be continues. 2 1 0

Cost of a lattice path • The cost of a query (class) is the number of continues segments of the records answering it on disk. • The cost of a query that is in on the lattice path is 1. • More formally: • is a d-successor of . the average fanout in dimension d, at level i (the dimension and level in which differ). • P a lattice path, any point on P.

• • • • • • • • • • • • Example (Fanout = 2 in each level) (2,2) dist((2,1)) = 4 (1,2) (2,1) 2 (2,0) (0,2) 2 (1,1) (1,0) (0,1) 2 dist((1,0)) = 2 (0,0) • • • •

Cost of a lattice path • The weighted average of the costs of each query class: • Optimal lattice path = lowest cost.

Algorithm • For a 2D grid. • If is optimal, so is • General description: • Initialization • For (i,j) = (m,n) to (0,0) do • Compute the cost for (i,j) as the min: • when going through (i+1,j) • when going through (i,j+1) • Take the best path through (i,j) accordingly • In the end, the path through (0,0) is the optimal.

Algorithm • Time complexity O((m+1)(n+1))=O(mn) • Space complexity O(mn) • The algorithm works without regard to the fanout (only the cost is influenced) • For unbalanced hierarchies we add dummy nodes.

• • • • • • • • • • • • Snaking • Reversing the clustering order of alternate queries in the clustering. • Note - every adjacent cells on disk differ only in one dimension Revs. Double Reversing =As is • • • • • • • • Revs. • • • • As is Revs.

Improvement by Snaking • Snaked paths have no diagonal edges. • Edge - Two adjacent points in the clustering path. • Edge Type - The dimensions in which the end-points differ. • Non-Diagonal Edge - The points differ in one dimension only. • There is a snaked lattice path that has optimal cost over all possible clusterings.

Limits & Guarantees • For any workload. The result of snaking the optimal lattice path lowers the cost by a factor of less than 2. • A optimal lattice path that has been ‘snaked’ is at most twice worse than the global optimal clustering (for a specific workload).

Experiments • LineItem table from the TPC-D benchmark. • 3 dimensions: parts, supplier and time. • Record size 125 bytes. Page size 8K. • The results counted the number of pages and seeks for a set of queries with a given workload. • The results were normalized by the minimum amount of pages needed. • 27 workloads were tested. • Comparison with the 6 row major strategies

Results

Conclusions (from the article) • Physical clustering is crucial to data warehouse performance • Lattice paths can improve performance. • With respect to the Hilbert clustering, lattice path clustering performs sometimes better. • In a follow up paper: Hilbert is sandwiched between 2 snaked lattice paths, for a workload. • Snaking always reduces the cost. • There’s a snaked path which is globally optimal. • The snaked optimal lattice path is at most twice worst of the optimal clustering strategy.

Remarks • How can we apply this to grids which contain dynamic lists in their records? • Lack of clarity and formalism in the proofs (some not published yet). • More work needs to be done to get meaningful improvements.

Proofs Sketch • Define a clustering by a vector of edge types & number (<8,4;2,1>). • Show constraints on such a vector. - consistent vector. • Define an ordering on these vectors (or any), with a minimal definition. • Show that every minimal vector (with no diagonal edges & powers of two in the numbers is a snaked lattice path • Extend the definition of cost to arbitrary constrainded vectors. • Show that for every ‘diagonal strategy’ there exists a non-diagonal constrained vector with lower cost.

Proofs Sketch • Show that for every workload there exists a snaked lattice path such that its cost is minimal. • Show that for any workload. The result of snaking the optimal lattice path lowers the cost by a factor of less than 2. • Show that for any workload the optimal snaked lattice path improves the snaked optimal lattice path by a factor of less than 2.

Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse.