Parallel Multi-Dimensional ROLAP Indexing • Andrew Rau-Chaplin, Faculty of Computer Science, Dalhousie University • Joint work with Frank Dehne, Carleton Univ., and Todd Eavis, Dalhousie Univ.
Data Warehousing for Decision Support • Operational data collected into DW • DW used to support multi-dimensional views • Views form the basis of OLAP processing • Our focus: the OLAP server
Multi-dimensional views • Collection of feature attributes • Aggregate along one or more measure attributes • Reduce the granularity by “collapsing” dimensions • Points generated by: • distributive functions (e.g., sum) • algebraic functions (e.g., average) • holistic functions (e.g., median)
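The three function classes differ in what must be kept per partition to combine results. The sketch below illustrates this on two hypothetical partitions of a measure column; the data and variable names are made up for illustration.

```python
from statistics import median

# Hypothetical measure values held in two separate partitions.
part1 = [1, 2, 9]
part2 = [3, 4, 5]

# Distributive: the sum of the partial sums equals the sum over all values.
total = sum([sum(part1), sum(part2)])

# Algebraic: the average is derived from a fixed-size summary (sum, count).
summaries = [(sum(p), len(p)) for p in (part1, part2)]
average = sum(s for s, _ in summaries) / sum(n for _, n in summaries)

# Holistic: the median needs the full set of values. Combining the
# per-partition medians gives a different (wrong) answer here.
true_median = median(part1 + part2)
median_of_medians = median([median(part1), median(part2)])
```

Here the sum and the average can be assembled from small per-partition summaries, while the median cannot.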
Data Cube Generation [Figure: lattice of cuboids ABC, AB, AC, BC, A, B, C, ALL] • Proposed by Gray et al. in 1995 • Can be generated “manually” from a relational DB, but this is very inefficient • Exploit the relationship between cuboids to compute all 2^d cuboids • In OLAP environments, we typically pre-compute these views to improve query response time
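As a minimal sketch of the cuboid relationship, the code below enumerates the 2^d cuboids for d = 3 and computes a child cuboid from its parent rather than from the raw data. It assumes a sum measure and an in-memory dict per cuboid; the function names `cuboids` and `aggregate` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def cuboids(dims):
    """Return all 2^d subsets of the dimension list, largest first."""
    return [c for k in range(len(dims), -1, -1) for c in combinations(dims, k)]

def aggregate(rows, parent_dims, child_dims):
    """Compute a child cuboid (sum measure) from a parent cuboid.

    rows maps a tuple of parent-dimension values to a measure value.
    """
    keep = [parent_dims.index(d) for d in child_dims]
    out = defaultdict(int)
    for key, measure in rows.items():
        out[tuple(key[i] for i in keep)] += measure
    return dict(out)

dims = ("A", "B", "C")
abc = {("a1", "b1", "c1"): 5, ("a1", "b2", "c1"): 3, ("a2", "b1", "c2"): 7}

ab = aggregate(abc, dims, ("A", "B"))
a = aggregate(ab, ("A", "B"), ("A",))   # computed from AB, not from ABC
all_ = aggregate(a, ("A",), ())         # the single-row ALL cuboid
```

Computing A from AB instead of from ABC reads fewer rows, which is the saving the lattice makes available.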
Existing Parallel Results • Goil & Choudhary (J. of Data Mining & Knowledge Discovery 1(4), 1997) • MOLAP solution • in-memory structures • global partition + d communication rounds • distributed views • Limitations • memory required for multi-dimensional arrays • expensive communication for larger d
Our Approach (CCGrid’01 + J. Dist. & Parallel Databases 11(2), 2001) [Figure: data cube lattice over dimensions A, B, C, D, from ABCD down to All] • ROLAP solution • Construct and cost the data cube lattice • Find a “least cost” spanning tree • Partition the spanning tree over the processors equally, construct views and distribute • Can handle partial cubes • Limitations • What about indexing?
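To make the "cost the lattice, then pick a spanning tree" step concrete, here is a simplified sketch. It is a greedy stand-in, not the algorithm from the cited papers: it assumes the cost of computing a view is the estimated row count of its parent, and that a view's size is bounded by the product of its dimension cardinalities and by the number of input rows. The function name and the cardinalities are hypothetical.

```python
from itertools import combinations
from math import prod

def least_cost_parents(card, n_rows):
    """For each view, choose the smallest estimated parent with one extra dimension.

    card maps each dimension name to its cardinality (assumed known).
    Returns a dict view -> parent; the root view has no entry.
    """
    dims = tuple(card)

    def size(view):
        # Estimated rows: product of cardinalities, capped by the input size.
        return min(n_rows, prod(card[d] for d in view))

    parent = {}
    for k in range(len(dims) - 1, -1, -1):
        for view in combinations(dims, k):
            candidates = [
                tuple(d for d in dims if d in view or d == extra)
                for extra in dims if extra not in view
            ]
            parent[view] = min(candidates, key=size)
    return parent

tree = least_cost_parents({"A": 100, "B": 10, "C": 2}, n_rows=1000)
```

With these assumed cardinalities, view A is computed from AC (about 200 rows) rather than AB (about 1000 rows). Partitioning the resulting tree across processors is a separate step not shown here.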
Parallel Multi-dimensional Indexing • Query specifies a range on multiple dimensions • Forms a hypercube in the point space
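A range query of this kind can be sketched as a per-dimension interval test. The example below assumes closed intervals and integer-encoded dimension values; `in_range` and the sample points are hypothetical.

```python
def in_range(point, lo, hi):
    """True if the point lies inside the hyper-rectangle [lo, hi] on every dimension."""
    return all(l <= x <= h for x, l, h in zip(point, lo, hi))

# Three hypothetical records in a 3-dimensional point space.
points = [(1, 5, 3), (4, 4, 4), (7, 2, 9)]

# Query: 0 <= A <= 5, 3 <= B <= 6, 0 <= C <= 5.
hits = [p for p in points if in_range(p, (0, 3, 0), (5, 6, 5))]
```

A linear scan like this touches every record; the index described in the following slides exists to avoid that.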
General Approach • No multidimensional index is universally successful • Exploit domain-specific information and the features of a particular index • OLAP • Data is provided up front • Updates are batch-oriented
Design Goals • A framework for distributed high-performance indexing of ROLAP cubes • Practical to implement • Low communication volume • Fully adapted to external memory (disks) • No shared disk required • Incrementally maintainable • Efficient for high-dimensional spatial searches • Scalable in terms of data size, dimensions, processors
Challenge [Figure: view ABC partitioned across processors P1 to P4] • How to order and partition data such that • the number of records retrieved per node is as balanced as possible • the number of disk seeks required to answer a query is minimized
Indexing the Data Cube • Combine the strengths of a space-filling curve and an r-tree index • Use the Hilbert curve to load buckets • Index buckets with an r-tree • Update indexes with merge/sort
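The bucket-loading idea can be sketched in two dimensions as follows. This is an illustration under simplifying assumptions, not the RCUBE implementation: it uses the standard 2-D Hilbert index on an n × n grid (n a power of two), packs points into fixed-capacity buckets in curve order, and records one bounding box per bucket, which is what the leaf level of an r-tree would store. The names `hilbert_index` and `pack_buckets` are hypothetical.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve on an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def pack_buckets(points, n, capacity):
    """Sort points in Hilbert order, cut into buckets, and compute bounding boxes."""
    ordered = sorted(points, key=lambda p: hilbert_index(n, *p))
    buckets = [ordered[i:i + capacity] for i in range(0, len(ordered), capacity)]
    boxes = [
        ((min(x for x, _ in b), min(y for _, y in b)),
         (max(x for x, _ in b), max(y for _, y in b)))
        for b in buckets
    ]
    return buckets, boxes

points = [(3, 0), (0, 0), (1, 1), (2, 2), (0, 3), (3, 3)]
buckets, boxes = pack_buckets(points, n=4, capacity=2)
```

Because consecutive positions on the curve are neighbouring cells, each bucket tends to cover a compact region, which keeps the bounding boxes small. A d-dimensional version would need a d-dimensional Hilbert mapping, which is not shown.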
Query Retrieval [Figure: processors P1 to P4, each holding a partition of view ABC]
Example [Figure: original space, 8 points to be reported] • Processor 1 reports: 2 consecutive blocks & 4 points • Processor 2 reports: 2 consecutive blocks & 4 points
The Parallel Framework • A single view is partitioned across p processors • Partial Hilbert/r-tree indexes are computed locally • Queries are answered concurrently • Queries answered individually or “piggy-backed”
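The sketch below illustrates one way a single view could be spread over p processors and queried concurrently. It assumes round-robin striping of records that are already in curve order (the slides do not spell out the partitioning rule), stands in integers for curve positions, and simulates processors with threads. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def stripe(ordered_records, p):
    """Deal records, already sorted in curve order, round-robin to p partitions."""
    return [ordered_records[i::p] for i in range(p)]

def local_query(partition, lo, hi):
    """Each processor scans only its own partition."""
    return [r for r in partition if lo <= r <= hi]

def parallel_query(partitions, lo, hi):
    """Run the same query on every partition at once; one result list per processor."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(lambda part: local_query(part, lo, hi), partitions))

records = list(range(16))            # stand-in for positions along the curve
partitions = stripe(records, p=2)
results = parallel_query(partitions, lo=4, hi=11)
```

With striping, a query covering 8 consecutive records returns 4 from each of the two partitions, so the retrieval work is balanced across nodes.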
The Virtual Data Cube • Problem: Full cube often too large to materialize • Solution: Use surrogate views
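One plausible reading of "surrogate view" is a materialized view whose dimensions contain those of the requested view, so the query can be answered by aggregating it further. The sketch below picks the smallest such view; this selection rule, the function name, and the row counts are assumptions for illustration, not details from the slides.

```python
def pick_surrogate(query_dims, materialized):
    """Choose the smallest materialized view that covers the query's dimensions.

    materialized maps a view (frozenset of dimension names) to its row count.
    """
    candidates = [v for v in materialized if set(query_dims) <= v]
    if not candidates:
        raise LookupError(f"no materialized view covers {sorted(query_dims)}")
    return min(candidates, key=materialized.get)

# Hypothetical partial cube: only three of the eight views are stored.
materialized = {
    frozenset("ABC"): 1000,
    frozenset("AB"): 400,
    frozenset("C"): 10,
}

surrogate_for_a = pick_surrogate("A", materialized)
surrogate_for_ac = pick_surrogate("AC", materialized)
```

A query on A is served from AB, while a query on AC falls back to the full view ABC.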
Other issues… • Dimension ordering • Query piggybacking • Batch updating • Managing Hierarchies of views
Experimental Results • Machine • 17-node cluster • Node = 1.8 GHz Xeon, 1 GB RAM, 2 × 40 GB IDE drives, running Linux • Interconnect = Intel Fast Ethernet switch • Test Data • 10 dimensions and 1,000,000 records
RCUBE index Construction Output: ~640 million rows, 16 Gigabytes
Distributed Query Resolution Test: Random queries returning ~15% of points (10 experiments per point)
Disk blocks retrieved vs. Disk Seeks Test: Random queries returning 5-15% of points (15 experiments per point)
Thank You Questions?