Parallel Multi-Dimensional ROLAP Indexing

Presentation Transcript

  1. Parallel Multi-Dimensional ROLAP Indexing • Andrew Rau-Chaplin, Faculty of Computer Science, Dalhousie University • Joint work with Frank Dehne (Carleton Univ.) and Todd Eavis (Dalhousie Univ.)

  2. Data Warehousing for Decision Support • Operational data collected into DW • DW used to support multi-dimensional views • Views form the basis of OLAP processing • Our focus: the OLAP server

  3. Multi-dimensional views • Collection of feature attributes • Aggregate along one or more measure attributes • Reduce the granularity by “collapsing” dimensions • Points generated by: • distributive functions (e.g., sum) • algebraic functions (e.g., average) • holistic functions (e.g., median)
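
The three classes of aggregate function differ in how partial results can be combined. The sketch below is illustrative only: the two partitions and the helper names `summarize` and `merge_average` are assumptions, not part of the presentation.

```python
from statistics import median

# Two hypothetical partitions of the same measure column.
part_a = [4, 8, 15]
part_b = [16, 23, 42]

# Distributive (sum): combine the per-partition results directly.
total = sum([sum(part_a), sum(part_b)])

# Algebraic (average): carry a fixed-size summary (sum, count) per partition.
def summarize(values):
    return (sum(values), len(values))

def merge_average(summaries):
    combined_sum = sum(s for s, _ in summaries)
    combined_count = sum(c for _, c in summaries)
    return combined_sum / combined_count

average = merge_average([summarize(part_a), summarize(part_b)])

# Holistic (median): in general no fixed-size summary is enough,
# so this sketch simply looks at all of the values together.
med = median(part_a + part_b)
```

This difference is why sums and averages are cheap to derive from an already aggregated view, while a median generally has to go back to the underlying records.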

  4. Data Cube Generation [Figure: lattice of cuboids ABC, AB, AC, BC, A, B, C, ALL] • Proposed by Gray et al. in 1995 • Can be generated “manually” from a relational DB, but this is very inefficient • Exploit the relationships between cuboids to compute all 2^d cuboids • In OLAP environments, these views are typically pre-computed to improve query response time
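
To make the 2^d count concrete, the following sketch enumerates every cuboid of a d-dimensional cube and lists the parents from which a cuboid can be derived. The function names are hypothetical, and single-letter dimension names are assumed.

```python
from itertools import combinations

def cuboids(dimensions):
    """Return all 2^d cuboids, from the full view down to ALL."""
    dims = list(dimensions)
    views = []
    for k in range(len(dims), -1, -1):
        for combo in combinations(dims, k):
            views.append("".join(combo) or "ALL")
    return views

def parents(view, dimensions):
    """Return the views with exactly one more dimension than `view`."""
    current = set() if view == "ALL" else set(view)
    result = []
    for extra in dimensions:
        if extra not in current:
            result.append("".join(d for d in dimensions if d in current | {extra}))
    return result

views = cuboids("ABC")
```

For d = 3 this yields the eight views of the lattice on this slide; `parents("A", "ABC")` gives `AB` and `AC`, either of which can be aggregated further to produce `A`.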

  5. Existing Parallel Results • Goil & Choudhary (J. of Data Mining & Knowledge Discovery 1(4), 1997) • MOLAP solution • in-memory structures • global partition + d communication rounds • distributed views • Limitations • memory for multi-dimensional arrays • expensive communication for larger d

  6. Our Approach (CCGrid’01 + J. Dist. & Parallel Databases 11(2), 2001) [Figure: lattice of cuboids ABCD; ABC, ABD, ACD, BCD; AB, AC, AD, BC, BD, CD; A, B, C, D; All] • ROLAP solution • Construct and cost the data cube lattice • Find a “least cost” spanning tree • Partition the spanning tree over the processors equally, construct views and distribute • Can handle partial cubes • Limitations • What about indexing?
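
One simple way to picture the spanning-tree step is to let every view pick the cheapest parent it can be computed from. The sketch below assumes that the cost of an edge is just the estimated size of the parent view; the sizes are invented, and the actual cost model used in the cited papers is not given in this transcript. The empty string stands for the ALL view.

```python
def cheapest_parent_tree(sizes):
    """Map each view to the smallest view having exactly one more dimension."""
    tree = {}
    for view in sizes:
        candidates = [
            p for p in sizes
            if len(p) == len(view) + 1 and set(view) <= set(p)
        ]
        if candidates:
            tree[view] = min(candidates, key=lambda p: sizes[p])
    return tree

# Hypothetical row-count estimates for a 3-dimensional cube.
sizes = {
    "ABC": 1000,
    "AB": 400, "AC": 300, "BC": 500,
    "A": 20, "B": 50, "C": 10,
    "": 1,
}
tree = cheapest_parent_tree(sizes)
```

Because the lattice has no cycles, choosing the cheapest incoming edge for every view already yields a minimum-cost tree under this simplified cost model. Here `A` and `C` are both derived from `AC`, and `B` from `AB`.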

  7. Parallel Multi-dimensional Indexing • Query specifies a range on multiple dimensions • Forms a hypercube in the point space

  8. General Approach • No multidimensional index is universally successful • Exploit domain specific information and the features of a particular index • OLAP • Data is provided up front • Updates are batch oriented

  9. Design Goals • A framework for distributed high-performance indexing of ROLAP cubes • Practical to implement • Low communication volume • Fully adapted to external memory (disks) • No shared disk required • Incrementally maintainable • Efficient for high D spatial searches • Scalable in terms of data size, dimensions, processors

  10. Challenge [Figure: view ABC and processors P1, P2, P3, P4] • How to order and partition data such that • the number of records retrieved per node is as balanced as possible • the number of disk seeks required to answer a query is minimized

  11. Indexing the Data Cube • Combine the strengths of a space-filling curve and an r-tree index • Use a Hilbert curve to load buckets • Index buckets with an r-tree • Update indexes with merge/sort
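
The combination can be sketched in two dimensions as follows. This is a minimal illustration under stated assumptions: `hilbert_index` is the standard iterative point-to-distance conversion for an n-by-n grid (n a power of two), and a flat list of bucket bounding boxes stands in for the r-tree, which in a real index would have several levels. All function names are hypothetical.

```python
def hilbert_index(n, x, y):
    """Position of grid cell (x, y) along the Hilbert curve of an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

def build_buckets(points, n, capacity):
    """Sort points in Hilbert order, pack them into buckets, record each box."""
    ordered = sorted(points, key=lambda p: hilbert_index(n, p[0], p[1]))
    buckets = []
    for i in range(0, len(ordered), capacity):
        chunk = ordered[i:i + capacity]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        buckets.append(((min(xs), min(ys), max(xs), max(ys)), chunk))
    return buckets

def range_query(buckets, lo, hi):
    """Return points inside the box [lo, hi], skipping non-overlapping buckets."""
    hits = []
    for (x0, y0, x1, y1), chunk in buckets:
        if x1 < lo[0] or x0 > hi[0] or y1 < lo[1] or y0 > hi[1]:
            continue
        hits.extend(
            p for p in chunk
            if lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1]
        )
    return hits

grid = [(x, y) for x in range(4) for y in range(4)]
buckets = build_buckets(grid, n=4, capacity=4)
found = range_query(buckets, lo=(0, 0), hi=(1, 1))
```

Because neighbouring positions on the Hilbert curve are also neighbours in space, each bucket covers a compact region, so a range query tends to touch few buckets.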

  12. Space Filling Curves & Striping

  13. Query Retrieval [Figure: view ABC on each of processors P1, P2, P3, P4]

  14. Example [Figure: original space, Processor 1, Processor 2] • 8 points to be reported • Each of the two processors reports 2 consecutive blocks & 4 points

  15. The Parallel Framework • A single view is partitioned across p processors • Partial Hilbert/r-tree indexes are computed locally • Queries are answered concurrently • Queries answered individually or “piggy-backed”
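
A minimal sketch of how a Hilbert-ordered view might be striped over p processors is given below. It assumes round-robin assignment at block granularity, which this transcript does not confirm; the function names are hypothetical.

```python
def stripe(blocks, p):
    """Assign block i to processor i mod p (round-robin striping)."""
    return [blocks[i::p] for i in range(p)]

def blocks_touched(stripes, first, last):
    """Count, per processor, the local blocks inside the range [first, last]."""
    return [sum(1 for b in local if first <= b <= last) for local in stripes]

# Eight blocks, numbered in Hilbert order, striped over two processors.
stripes = stripe(list(range(8)), p=2)
load = blocks_touched(stripes, first=2, last=5)
```

A query that covers a contiguous run of the curve (blocks 2 to 5 here) is split evenly, and the blocks each processor reads sit next to each other in its local order, which keeps the number of disk seeks low.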

  16. The Virtual Data Cube • Problem: Full cube often too large to materialize • Solution: Use surrogate views
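
The transcript does not say how surrogates are chosen or processed. One plausible reading, sketched below as an assumption, is that a query on a view that was not materialized is answered by aggregating a materialized view containing the required dimensions. The rows and the `roll_up` helper are hypothetical, and a distributive measure (sum) is assumed.

```python
from collections import defaultdict

def roll_up(rows, keep):
    """Aggregate (dims, measure) rows down to the dimension positions in `keep`."""
    totals = defaultdict(int)
    for dims, measure in rows:
        key = tuple(dims[i] for i in keep)
        totals[key] += measure
    return dict(totals)

# Hypothetical rows of a materialized view ABC with a summed measure.
abc_rows = [
    (("a1", "b1", "c1"), 5),
    (("a1", "b1", "c2"), 7),
    (("a2", "b1", "c1"), 3),
]

# View AB is not stored; derive it on demand from its surrogate ABC.
ab_view = roll_up(abc_rows, keep=(0, 1))
```

The trade-off is extra work at query time in exchange for not storing every one of the 2^d views.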

  17. Surrogate Processing

  18. Other issues… • Dimension ordering • Query piggybacking • Batch updating • Managing Hierarchies of views

  19. Experimental Results • Machine • 17-node cluster • Node = 1.8 GHz Xeon, 1 GB RAM, 2 × 40 GB IDE drives, running Linux • Interconnect = Intel Fast Ethernet switch • Test Data • 10 dimensions and 1,000,000 records

  20. RCUBE Index Construction • Output: ~640 million rows, 16 gigabytes

  21. Distributed Query Resolution • Test: random queries returning ~15% of points (10 experiments per point)

  22. Disk Blocks Retrieved vs. Disk Seeks • Test: random queries returning 5-15% of points (15 experiments per point)

  23. Distributed Query Resolution in Surrogate Group-bys

  24. Thank You • Questions?