
Parallel Multi-Dimensional ROLAP Indexing



Presentation Transcript


  1. Parallel Multi-Dimensional ROLAP Indexing • Andrew Rau-Chaplin, Faculty of Computer Science, Dalhousie University • Joint work with Frank Dehne (Carleton Univ.) and Todd Eavis (Dalhousie Univ.)

  2. Data Warehousing for Decision Support • Operational data collected into DW • DW used to support multi-dimensional views • Views form the basis of OLAP processing • Our focus: the OLAP server

  3. Multi-dimensional views • Collection of feature attributes • Aggregate along one or more measure attributes • Reduce the granularity by “collapsing” dimensions • Points generated by: • distributive functions (e.g., sum) • algebraic functions (e.g., average) • holistic functions (e.g., median)
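
The three function classes on this slide differ in how partial results combine, which matters once a view is split across processors. The following is a minimal sketch, assuming two hypothetical partitions (`part_a`, `part_b`) of a single measure column; the names and data are illustrative and not from the talk.

```python
from statistics import median

# Two hypothetical partitions of one measure column (illustrative data).
part_a = [1, 2, 3]
part_b = [4, 5, 100, 200, 300]

# Distributive: the sum of the partial sums equals the sum of the whole.
total = sum(part_a) + sum(part_b)

# Algebraic: the average is derived from a fixed-size summary (sum, count)
# kept per partition.
def summary(values):
    return (sum(values), len(values))

def merge_avg(*summaries):
    s = sum(x for x, _ in summaries)
    n = sum(c for _, c in summaries)
    return s / n

average = merge_avg(summary(part_a), summary(part_b))

# Holistic: the median needs the full data; combining the partial medians
# gives a different (wrong) answer here.
true_median = median(part_a + part_b)
median_of_medians = median([median(part_a), median(part_b)])

print(total, average, true_median, median_of_medians)
```

For this data the median of the partial medians differs from the true median, which is why holistic measures cannot in general be rolled up from smaller summaries.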

  4. Data Cube Generation • Proposed by Gray et al. in 1995 • Can be generated “manually” from a relational DB, but this is very inefficient • Exploit the relationship between cuboids to compute all 2^d cuboids • In OLAP environments, we typically pre-compute these views to improve query response time • [Figure: cuboid lattice for dimensions A, B, C: ABC; AB, AC, BC; A, B, C; ALL]
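
As an illustration of exploiting the relationship between cuboids, the sketch below builds all 2^d group-bys top-down, deriving each one from its smallest already-computed parent instead of rescanning the fact table. It assumes a sum measure and in-memory dictionaries; `rollup`, `build_cube`, and the sample facts are hypothetical names and data, not the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations

def rollup(parent_dims, parent, child_dims):
    """Aggregate a parent cuboid down to the dimensions in child_dims (sum)."""
    idx = [parent_dims.index(d) for d in child_dims]
    out = defaultdict(int)
    for key, measure in parent.items():
        out[tuple(key[i] for i in idx)] += measure
    return dict(out)

def build_cube(dims, facts):
    """Return {view: {key: sum}} for every subset of dims."""
    dims = tuple(dims)
    base = defaultdict(int)
    for key, measure in facts:
        base[tuple(key)] += measure
    cube = {dims: dict(base)}
    # Walk the lattice level by level, from d-1 dimensions down to 0.
    for k in range(len(dims) - 1, -1, -1):
        for child in combinations(dims, k):
            parents = [v for v in cube
                       if len(v) == k + 1 and set(child) <= set(v)]
            # Use the parent with the fewest rows as the cheapest source.
            parent = min(parents, key=lambda v: len(cube[v]))
            cube[child] = rollup(parent, cube[parent], child)
    return cube

facts = [
    (("a1", "b1", "c1"), 10),
    (("a1", "b2", "c1"), 5),
    (("a2", "b1", "c2"), 7),
]
cube = build_cube(("A", "B", "C"), facts)
print(len(cube))        # number of cuboids, 2^3
print(cube[("A",)])     # totals per value of A
print(cube[()])         # the ALL cuboid
```

The empty tuple `()` plays the role of the ALL cuboid in this sketch.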

  5. Existing Parallel Results • Goil & Choudhary (J. of Data Mining & Knowledge Discovery 1(4), 1997) • MOLAP solution • in-memory structures • global partition + d communication rounds • distributed views • Limitations • Memory for multi-dimensional arrays • expensive communication for larger d

  6. Our Approach (CCGrid’01 + J. Dist. & Parallel Databases 11(2), 2001) • ROLAP solution • Construct and cost the data cube lattice • Find a “least cost” spanning tree • Partition the spanning tree over the processors equally, construct views and distribute • Can handle partial cubes • Limitations • What about indexing? • [Figure: cuboid lattice for dimensions A, B, C, D: ABCD; ABC, ABD, ACD, BCD; AB, AC, AD, BC, BD, CD; A, B, C, D; All]
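
The costing and partitioning steps can be pictured with a small sketch. It assumes that the cost of building a view is the estimated row count of the parent it is computed from, picks the cheapest parent for every view, and then spreads views over processors with a simple greedy rule. The greedy rule is a stand-in chosen for brevity, not the partitioning method from the cited papers, and the `size` estimates are made up.

```python
import heapq
from itertools import combinations

def cheapest_parent_tree(dims, size):
    """Map each non-root view to the smallest view one level up containing it."""
    dims = tuple(dims)
    tree = {}
    for k in range(len(dims)):
        for child in combinations(dims, k):
            parents = [p for p in combinations(dims, k + 1)
                       if set(child) <= set(p)]
            tree[child] = min(parents, key=lambda p: size[p])
    return tree  # the root view has no parent; it comes from the raw data

def greedy_partition(tree, size, p):
    """Assign views to p processors, costliest first, to the lightest processor."""
    loads = [(0, i) for i in range(p)]
    heapq.heapify(loads)
    assignment = {i: [] for i in range(p)}
    for view in sorted(tree, key=lambda v: size[tree[v]], reverse=True):
        load, i = heapq.heappop(loads)
        assignment[i].append(view)
        # Build cost of a view = rows scanned in its chosen parent (assumption).
        heapq.heappush(loads, (load + size[tree[view]], i))
    return assignment

# Hypothetical estimated row counts for a 3-dimensional cube.
size = {
    ("A", "B", "C"): 1000,
    ("A", "B"): 400, ("A", "C"): 300, ("B", "C"): 600,
    ("A",): 20, ("B",): 50, ("C",): 30,
    (): 1,
}
tree = cheapest_parent_tree(("A", "B", "C"), size)
assignment = greedy_partition(tree, size, 2)
print(tree[("A",)], tree[("B",)], tree[()])
print(assignment)
```

A real partitioner would likely keep subtrees together so that a processor can reuse the parents it has already built; this sketch ignores that locality.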

  7. Parallel Multi-dimensional Indexing • Query specifies a range on multiple dimensions • Forms a hypercube in the point space
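
A range query of this kind can be written as one closed interval per dimension, and a record matches when it falls inside every interval. The sketch below assumes integer-encoded dimension values; `in_range` and the sample points are illustrative only.

```python
def in_range(point, query):
    """True if the point lies inside the axis-aligned box given by query.

    query holds one (low, high) pair per dimension, both ends inclusive.
    """
    return all(lo <= x <= hi for x, (lo, hi) in zip(point, query))

points = [(1, 5, 2), (3, 3, 3), (7, 1, 9), (4, 4, 0)]
query = [(2, 5), (3, 5), (0, 4)]   # a box in a 3-dimensional point space

hits = [p for p in points if in_range(p, query)]
print(hits)
```

A linear scan like this is the baseline that a multi-dimensional index tries to avoid.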

  8. General Approach • No multidimensional index is universally successful • Exploit domain specific information and the features of a particular index • OLAP • Data is provided up front • Updates are batch oriented

  9. Design Goals • A framework for distributed high-performance indexing of ROLAP cubes • Practical to implement • Low communication volume • Fully adapted to external memory (disks) • No shared disk required • Incrementally maintainable • Efficient for high-dimensional spatial searches • Scalable in terms of data size, dimensions, and processors

  10. Challenge • How to order and partition the data such that • the number of records retrieved per node is as balanced as possible • the number of disk seeks required to answer a query is minimized • [Figure: view ABC partitioned across processors P1, P2, P3, P4]

  11. Indexing the Data Cube • Combine the strengths of a space-filling curve and an R-tree index • Use a Hilbert curve to load buckets • Index buckets with an R-tree • Update indexes with merge/sort
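
To make the bucket-loading idea concrete, here is a two-dimensional sketch: points are sorted by their position on a Hilbert curve, cut into fixed-size buckets, and each bucket gets a bounding box. The single level of bounding boxes stands in for a full R-tree, and the grid size, bucket capacity, and function names are assumptions made for the example. The `xy2d` routine follows the widely published iterative Hilbert mapping for an n × n grid with n a power of two.

```python
def xy2d(n, x, y):
    """Position of cell (x, y) along a Hilbert curve over an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

def pack_buckets(points, n, capacity):
    """Sort points in Hilbert order and cut them into buckets with bounding boxes."""
    ordered = sorted(points, key=lambda p: xy2d(n, p[0], p[1]))
    buckets = []
    for i in range(0, len(ordered), capacity):
        chunk = ordered[i:i + capacity]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        buckets.append(((min(xs), min(ys), max(xs), max(ys)), chunk))
    return buckets

def range_query(buckets, x_lo, y_lo, x_hi, y_hi):
    """Scan only buckets whose bounding box overlaps the query box."""
    hits = []
    for (bx_lo, by_lo, bx_hi, by_hi), chunk in buckets:
        if bx_hi < x_lo or bx_lo > x_hi or by_hi < y_lo or by_lo > y_hi:
            continue
        hits.extend(p for p in chunk
                    if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi)
    return hits

grid = [(x, y) for x in range(4) for y in range(4)]
buckets = pack_buckets(grid, 4, 4)
print([mbr for mbr, _ in buckets])
print(sorted(range_query(buckets, 0, 0, 1, 1)))
```

Because consecutive positions on the curve are neighbouring cells, each bucket covers a compact region, which keeps the bounding boxes small.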

  12. Space Filling Curves & Striping

  13. Query Retrieval • [Figure: processors P1, P2, P3, P4, each labeled with view ABC]

  14. Example • 8 points to be reported • Processor 1 reports: 2 consecutive blocks & 4 points • Processor 2 reports: 2 consecutive blocks & 4 points • [Figure panels: Original Space, Processor 1, Processor 2]
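
The numbers in this example can be reproduced with a small striping sketch. It assumes records are already sorted along the space-filling curve, that blocks are dealt to processors round-robin, and that the query covers one contiguous stretch of the curve (a real multi-dimensional range usually maps to several stretches). Block size, record count, and function names are hypothetical.

```python
def stripe(sorted_records, block_size, p):
    """Cut curve-ordered records into blocks and deal them round-robin to p disks."""
    blocks = [sorted_records[i:i + block_size]
              for i in range(0, len(sorted_records), block_size)]
    disks = [[] for _ in range(p)]
    for b, block in enumerate(blocks):
        disks[b % p].append(block)
    return disks

def local_hits(disk, lo, hi):
    """Return (local block indices touched, matching records) for one processor."""
    touched, found = [], []
    for i, block in enumerate(disk):
        match = [r for r in block if lo <= r <= hi]
        if match:
            touched.append(i)
            found.extend(match)
    return touched, found

records = list(range(32))              # positions along the curve
disks = stripe(records, block_size=2, p=2)

# A query covering curve positions 8..15, i.e. 8 points in total.
results = [local_hits(disk, 8, 15) for disk in disks]
for touched, found in results:
    print(touched, found)
```

Each processor ends up reading two adjacent local blocks and returning four of the eight points, so both the work and the seeks are balanced.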

  15. The Parallel Framework • A single view is partitioned across p processors • Partial Hilbert/R-tree indexes are computed locally • Queries are answered concurrently • Queries answered individually or “piggy-backed”

  16. The Virtual Data Cube • Problem: Full cube often too large to materialize • Solution: Use surrogate views

  17. Surrogate Processing
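
One way to read the surrogate idea: when the requested view is not materialized, answer from a materialized view that contains all of its dimensions and aggregate on the fly. The sketch below assumes a sum measure and picks the smallest qualifying view as the surrogate; the selection rule, names, and data are assumptions, since the slides do not spell out the details.

```python
from collections import defaultdict

def pick_surrogate(materialized, wanted):
    """Smallest materialized view whose dimensions include all wanted ones."""
    candidates = [v for v in materialized if set(wanted) <= set(v)]
    return min(candidates, key=lambda v: len(materialized[v]))

def answer(materialized, wanted):
    """Aggregate the surrogate view down to the wanted dimensions (sum)."""
    view = pick_surrogate(materialized, wanted)
    idx = [view.index(d) for d in wanted]
    out = defaultdict(int)
    for key, measure in materialized[view].items():
        out[tuple(key[i] for i in idx)] += measure
    return view, dict(out)

# Only two views are stored; every other group-by is virtual.
materialized = {
    ("A", "B", "C"): {("a1", "b1", "c1"): 10,
                      ("a1", "b1", "c2"): 5,
                      ("a2", "b1", "c2"): 7},
    ("A", "B"): {("a1", "b1"): 15, ("a2", "b1"): 7},
}

used_b, by_b = answer(materialized, ("B",))
used_c, by_c = answer(materialized, ("C",))
print(used_b, by_b)
print(used_c, by_c)
```

The query on B is served from the smaller AB view, while the query on C has to fall back to ABC because no smaller stored view contains C.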

  18. Other issues… • Dimension ordering • Query piggybacking • Batch updating • Managing Hierarchies of views
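
Batch updating fits naturally with a curve-ordered layout: the new batch is sorted by curve position and merged with the existing sorted run in a single sequential pass, after which buckets and their index can be rebuilt. A minimal sketch, assuming records are represented only by their (hypothetical) curve positions:

```python
import heapq

def merge_batch(existing_sorted, batch):
    """Merge a new batch into an already sorted run in one sequential pass."""
    return list(heapq.merge(existing_sorted, sorted(batch)))

existing = [1, 4, 9, 12]      # curve positions already on disk, in order
batch = [7, 2, 10]            # newly arrived records, unordered

merged = merge_batch(existing, batch)
print(merged)
```

`heapq.merge` consumes its inputs lazily, so the same pattern extends to runs that are streamed from disk rather than held in memory.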

  19. Experimental Results • Machine • 17 node cluster • Node = 1.8 GHz Xeon, 1 GB RAM, 2 * 40 GB IDE drives, running Linux • Interconnect = Intel Fast Ethernet switch • Test Data • 10 dimensions and 1,000,000 records

  20. RCUBE Index Construction • Output: ~640 million rows, 16 gigabytes

  21. Distributed Query Resolution • Test: Random queries returning ~15% of points (10 experiments per point)

  22. Disk Blocks Retrieved vs. Disk Seeks • Test: Random queries returning 5-15% of points (15 experiments per point)

  23. Distributed Query Resolution in Surrogate Group-bys

  24. Thank You • Questions?
