## Parallel Multi-Dimensional ROLAP Indexing

**Parallel Multi-Dimensional ROLAP Indexing**
Andrew Rau-Chaplin
Faculty of Computer Science
Dalhousie University
Joint work with Frank Dehne, Carleton Univ., and Todd Eavis, Dalhousie Univ.

**Data Warehousing for Decision Support**
• Operational data collected into DW
• DW used to support multi-dimensional views
• Views form the basis of OLAP processing
• Our focus: the OLAP server

**Multi-dimensional views**
• Collection of feature attributes
• Aggregate along one or more measure attributes
• Reduce the granularity by “collapsing” dimensions
• Points generated by:
  • distributive functions (e.g., sum)
  • algebraic functions (e.g., average)
  • holistic functions (e.g., median)

**Data Cube Generation**
[Figure: cuboid lattice for dimensions A, B, C (ABC; AB, AC, BC; A, B, C; ALL)]
• Proposed by Gray et al. in 1995
• Can be generated “manually” from a relational DB, but this is very inefficient
• Exploit the relationship between cuboids to compute all 2^d cuboids
• In OLAP environments, we typically pre-compute these views to improve query response time

**Existing Parallel Results**
• Goil & Choudhary (J. of Data Mining & Knowledge Discovery 1(4), 1997)
  • MOLAP solution
  • in-memory structures
  • global partition + d communication rounds
  • distributed views
• Limitations
  • Memory for multi-dimensional arrays
  • Expensive communication for larger d

**Our Approach**
CCGrid’01 + J. Dist. & Parallel Databases 11(2), 2001
[Figure: cuboid lattice for dimensions A, B, C, D, from ABCD down to All]
• ROLAP solution
• Construct and cost the data cube lattice
• Find a “least cost” spanning tree
• Partition the spanning tree over the processors equally, construct views and distribute
• Can handle partial cubes
• Limitations
  • What about indexing?

**Parallel Multi-dimensional Indexing**
• Query specifies a range on multiple dimensions
• Forms a hypercube in the point space

**General Approach**
• No multidimensional index is universally successful
• Exploit domain-specific information and the features of a particular index
• OLAP
  • Data is provided up front
  • Updates are batch oriented

**Design Goals**
• A framework for distributed high-performance indexing of ROLAP cubes
• Practical to implement
• Low communication volume
• Fully adapted to external memory (disks)
• No shared disk required
• Incrementally maintainable
• Efficient for high-dimensional spatial searches
• Scalable in terms of data size, dimensions, processors

**Challenge**
[Figure: view ABC partitioned across processors P1, P2, P3, P4]
• How to order and partition data such that
  • the number of records retrieved per node is as balanced as possible
  • the number of disk seeks required to answer a query is minimized

**Indexing the Data Cube**
• Combine the strengths of a space-filling curve and an R-tree index
• Use Hilbert curve to load buckets
• Index buckets with R-tree
• Update indexes with merge/sort

**Query Retrieval**
[Figure: a query sent to P1, P2, P3, P4, each holding its part of view ABC]

**Example**
[Figure: three panels. Original Space: 8 points to be reported. Processor 1: reports 2 consecutive blocks & 4 points. Processor 2: reports 2 consecutive blocks & 4 points.]

**The Parallel Framework**
• A single view is partitioned across p processors
• Partial Hilbert/R-tree indexes are computed locally
• Queries are answered concurrently
• Queries answered individually or “piggy-backed”

**The Virtual Data Cube**
• Problem: Full cube often too large to materialize
• Solution: Use surrogate views

**Other issues…**
• Dimension ordering
• Query piggybacking
• Batch updating
• Managing hierarchies of views

**Experimental Results**
• Machine
  • 17-node cluster
  • Node = 1.8 GHz Xeon, 1 GB RAM, 2 × 40 GB IDE drives, running Linux
  • Interconnect = Intel Fast Ethernet switch
• Test Data
  • 10 dimensions and 1,000,000 records

**RCUBE Index Construction**
Output: ~640 million rows, 16 gigabytes

**Distributed Query Resolution**
Test: Random queries returning ~15% of points (10 experiments per point)

**Disk Blocks Retrieved vs. Disk Seeks**
Test: Random queries returning 5-15% of points (15 experiments per point)

**Thank You**
Questions?
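
The "Indexing the Data Cube" and "The Parallel Framework" slides can be illustrated with a small sketch. It is not the authors' RCUBE implementation: it works in two dimensions only, keeps everything in memory, uses a flat list of bucket bounding boxes in place of a full R-tree, and assumes that buckets are dealt to processors round-robin, which the slides do not state. The names `hilbert_index`, `build_partitions`, and `range_query` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (xmin, ymin, xmax, ymax), inclusive bounds


def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along a Hilbert curve over an n x n grid.

    n must be a power of two. This is the standard iterative 2-D mapping.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def build_partitions(points: List[Point], n: int, bucket_size: int,
                     p: int) -> List[List[Tuple[Box, List[Point]]]]:
    """Sort points in Hilbert order, pack them into buckets, deal buckets to p processors."""
    ordered = sorted(points, key=lambda pt: hilbert_index(n, pt[0], pt[1]))
    partitions: List[List[Tuple[Box, List[Point]]]] = [[] for _ in range(p)]
    for i in range(0, len(ordered), bucket_size):
        bucket = ordered[i:i + bucket_size]
        xs = [x for x, _ in bucket]
        ys = [y for _, y in bucket]
        mbr = (min(xs), min(ys), max(xs), max(ys))  # leaf-level bounding box
        # Assumption: round-robin striping of buckets over processors.
        partitions[(i // bucket_size) % p].append((mbr, bucket))
    return partitions


def intersects(a: Box, b: Box) -> bool:
    """True if two inclusive boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def range_query(partition: List[Tuple[Box, List[Point]]], q: Box) -> List[Point]:
    """Answer a range query on one processor's local buckets."""
    hits: List[Point] = []
    for mbr, bucket in partition:
        if intersects(mbr, q):  # skip buckets whose box misses the query
            hits.extend(pt for pt in bucket
                        if q[0] <= pt[0] <= q[2] and q[1] <= pt[1] <= q[3])
    return hits


if __name__ == "__main__":
    grid = [(x, y) for x in range(8) for y in range(8)]
    parts = build_partitions(grid, n=8, bucket_size=4, p=4)
    for rank, part in enumerate(parts):
        print(rank, len(range_query(part, (0, 0, 3, 3))))
```

In this toy setup (a full 8 × 8 grid, buckets of 4 points, 4 processors), the query over the lower-left 4 × 4 quadrant matches 16 points, and each processor reports 4 of them. This is the kind of per-node balance the "Challenge" slide asks for, though real data and higher dimensions will not split this evenly.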