1 / 27

Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse.

Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse. Authors : H. V. Jagadish Laks V. S. Lakshmanan Divesh Srivastava Presented by: Ittay Freiman. The Problem. Disk I/O is a major parameter in the cost of a (‘select-where’) query.

norris
Download Presentation

Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse.

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse. Authors : H. V. Jagadish Laks V. S. Lakshmanan Divesh Srivastava Presented by: Ittay Freiman

  2. The Problem • Disk I/O is a major parameter in the cost of a (‘select-where’) query. • Physical layout of records impacts how much I/O needs to be done. • Fragmentation of results costs more. • No linear ordering of records can be optimal for any query.

  3. Glossary • Grid query - a query according to hierarchies.

  4. Linear Clustering & Workload • Linear clustering - the ordering of records on disk. • Workload - frequencies on the different types (classes) of queries. P1 P2 Hilbert

  5. The Problem - Examples Expected cost (<total cost of all queries>/<queries #> Relative costs (<best total cost> / <worst>)

  6. Agenda • Introduction & definitions. • Lattice, lattice paths. • Cost of a lattice path. • Algorithm to find an optimal lattice path. • Snaking. • Experiments.

  7. Definitions • Query class - a vector (v1,…,vk) of level numbers in a grid. • (0,..,0) - All queries that return a single cell. • (1,..,1) - All queries returning 2x2..x2 “block” • Etc. • For example: • Q1 is from the (1,1) class • Q2 is from the (2,1) class

  8. Definitions • (v1,..,vk) <=def (u1,..,uk)  v1<= u1,.., vk<= uk. • (1,0) <= (1,0) , (1,1) , (2,1) • (1,0) ? (0,1) - not known • (0,..,0) - lower bound, (u1max,..,ukmax) - upper bound. • D-successor - (u1,..,uk) is a d-successor of (v1,..,vk) if (v1,., vi+1,..,vk) = (u1,..,uk) • (1,1) is a d-successor of (1,0)

  9. Lattice path • Sequence of vectors, starting from the lower bound and ending in the upper bound when each vector is a d-successor of its previous.

  10. (2,2) - all the grid (1,2) (2,1) (2,0) (0,2) (1,1) (1,0) (0,1) (0,0) - one cell Example • A 2D grid with binary hierarchies of 3 levels each: {(0,0),(0,1),(0,2),(1,2),(2,2)} example example

  11. {(0,0),(0,1),(0,2),(1,2),(2,2)} {(0,0),(0,1),(0,2),(1,2),(2,2)} {(0,0),(0,1),(0,2),(1,2),(2,2)} • • • • • • • • {(0,0),(0,1),(0,2),(1,2),(2,2)} {(0,0),(0,1),(0,2),(1,2),(2,2)} {(0,0),(0,1),(0,2),(1,2),(2,2)} • • • • • • • • {(0,0),(0,1),(0,2),(1,2),(2,2)} • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • Clustering via lattice paths • Go through the path. Each entry is a query class. For each query in the current entry, cluster the cells so they’ll be continues. 2 1 0

  12. Cost of a lattice path • The cost of a query (class) is the number of continues segments of the records answering it on disk. • The cost of a query that is in on the lattice path is 1. • More formally: • is a d-successor of . the average fanout in dimension d, at level i (the dimension and level in which differ). • P a lattice path, any point on P.

  13. • • • • • • • • • • • Example (Fanout = 2 in each level) (2,2) dist((2,1)) = 4 (1,2) (2,1) 2 (2,0) (0,2) 2 (1,1) (1,0) (0,1) 2 dist((1,0)) = 2 (0,0) • • • •

  14. Cost of a lattice path • The weighted average of the costs of each query class: • Optimal lattice path = lowest cost.

  15. Algorithm • For a 2D grid. • If is optimal, so is • General description: • Initialization • For (i,j) = (m,n) to (0,0) do • Compute the cost for (i,j) as the min: • when going through (i+1,j) • when going through (i,j+1) • Take the best path through (i,j) accordingly • In the end, the path through (0,0) is the optimal.

  16. Algorithm • Time complexity O((m+1)(n+1))=O(mn) • Space complexity O(mn) • The algorithm works without regard to the fanout (only the cost is influenced) • For unbalanced hierarchies we add dummy nodes.

  17. • • • • • • • • • • • Snaking • Reversing the clustering order of alternate queries in the clustering. • Note - every adjacent cells on disk differ only in one dimension Revs. Double Reversing =As is • • • • • • • • Revs. • • • • As is Revs.

  18. Improvement by Snaking • Snaked paths have no diagonal edges. • Edge - Two adjacent points in the clustering path. • Edge Type - The dimensions in which the end-points differ. • Non-Diagonal Edge - The points differ in one dimension only. • There is a snaked lattice path that has optimal cost over all possible clusterings.

  19. Limits & Guarantees • For any workload. The result of snaking the optimal lattice path lowers the cost by a factor of less than 2. • A optimal lattice path that has been ‘snaked’ is at most twice worse than the global optimal clustering (for a specific workload).

  20. Experiments • LineItem table from the TPC-D benchmark. • 3 dimensions: parts, supplier and time. • Record size 125 bytes. Page size 8K. • The results counted the number of pages and seeks for a set of queries with a given workload. • The results were normalized by the minimum amount of pages needed. • 27 workloads were tested. • Comparison with the 6 row major strategies

  21. Results

  22. Results

  23. Results

  24. Conclusions (from the article) • Physical clustering is crucial to data warehouse performance • Lattice paths can improve performance. • With respect to the Hilbert clustering, lattice path clustering performs sometimes better. • In a follow up paper: Hilbert is sandwiched between 2 snaked lattice paths, for a workload. • Snaking always reduces the cost. • There’s a snaked path which is globally optimal. • The snaked optimal lattice path is at most twice worst of the optimal clustering strategy.

  25. Remarks • How can we apply this to grids which contain dynamic lists in their records? • Lack of clarity and formalism in the proofs (some not published yet). • More work needs to be done to get meaningful improvements.

  26. Proofs Sketch • Define a clustering by a vector of edge types & number (<8,4;2,1>). • Show constraints on such a vector. - consistent vector. • Define an ordering on these vectors (or any), with a minimal definition. • Show that every minimal vector (with no diagonal edges & powers of two in the numbers is a snaked lattice path • Extend the definition of cost to arbitrary constrainded vectors. • Show that for every ‘diagonal strategy’ there exists a non-diagonal constrained vector with lower cost.

  27. Proofs Sketch • Show that for every workload there exists a snaked lattice path such that its cost is minimal. • Show that for any workload. The result of snaking the optimal lattice path lowers the cost by a factor of less than 2. • Show that for any workload the optimal snaked lattice path improves the snaked optimal lattice path by a factor of less than 2.

More Related