Efficient Parallel Multi-Dimensional ROLAP Indexing for Enhanced Data Cube Management

Parallel Multi-Dimensional ROLAP Indexing • Andrew Rau-Chaplin, Faculty of Computer Science, Dalhousie University • Joint work with Frank Dehne (Carleton Univ.) and Todd Eavis (Dalhousie Univ.)

Data Warehousing for Decision Support • Operational data collected into DW • DW used to support multi-dimensional views • Views form the basis of OLAP processing • Our focus: the OLAP server

Multi-dimensional views • Collection of feature attributes • Aggregate one or more measure attributes • Reduce the granularity by “collapsing” dimensions • Points generated by: • distributive functions (e.g., sum) • algebraic functions (e.g., average) • holistic functions (e.g., median)
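The three function classes differ in how partial results combine. The sketch below is illustrative only: the chunked data and the helper names are assumptions, not part of the talk. A distributive function merges partial results directly, an algebraic one merges a small fixed-size state, and a holistic one needs the full set of values.

```python
from statistics import median

# Hypothetical measure values, split into two chunks (e.g., two disk blocks).
chunks = [[4, 8], [1, 3, 9]]

# Distributive (sum): partial sums can simply be added together.
total = sum(sum(chunk) for chunk in chunks)

# Algebraic (average): keep a fixed-size state (sum, count) per chunk and merge.
states = [(sum(chunk), len(chunk)) for chunk in chunks]
avg = sum(s for s, _ in states) / sum(n for _, n in states)

# Holistic (median): no fixed-size partial state suffices in general,
# so all values are gathered before the function is applied.
med = median([x for chunk in chunks for x in chunk])
```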

Data Cube Generation • [Figure: cuboid lattice over dimensions A, B, C: ABC; AB, AC, BC; A, B, C; ALL] • Proposed by Gray et al. in 1995 • Can be generated “manually” from a relational DB, but this is very inefficient • Exploit the relationship between cuboids to compute all 2^d cuboids • In OLAP environments, we typically pre-compute these views to improve query response time
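As a rough illustration of exploiting the relationship between cuboids, the sketch below computes every one of the 2^d cuboids from an already computed parent with one more dimension, rather than from the base data each time. The function names, the dictionary representation, and the use of sum as the measure are assumptions made for this example.

```python
from itertools import combinations
from collections import defaultdict

def all_cuboids(dims):
    """All 2^d dimension subsets, largest first."""
    return [c for k in range(len(dims), -1, -1) for c in combinations(dims, k)]

def aggregate(rows, parent_dims, child_dims):
    """Collapse a parent cuboid onto child_dims, summing the measure."""
    idx = [parent_dims.index(d) for d in child_dims]
    out = defaultdict(int)
    for key, measure in rows.items():
        out[tuple(key[i] for i in idx)] += measure
    return dict(out)

def build_cube(base, dims):
    """Compute each cuboid from its smallest already-computed parent."""
    cube = {tuple(dims): base}
    for child in all_cuboids(dims)[1:]:
        parents = [p for p in cube
                   if len(p) == len(child) + 1 and set(child) <= set(p)]
        parent = min(parents, key=lambda p: len(cube[p]))
        cube[child] = aggregate(cube[parent], parent, child)
    return cube

# Hypothetical base data: (A, B, C) feature values mapped to a summed measure.
dims = ("A", "B", "C")
base = {("a1", "b1", "c1"): 5, ("a1", "b2", "c1"): 3, ("a2", "b1", "c2"): 2}
cube = build_cube(base, dims)
```

Here `cube[("A",)]` holds the totals per value of A, and `cube[()]` is the single ALL cell.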

Existing Parallel Results • Goil & Choudhary (J. of Data Mining & Knowledge Discovery 1(4), 1997) • MOLAP solution • in-memory structures • global partition + d communication rounds • distributed views • Limitations • Memory for multi-dimensional arrays • Expensive communication for larger d

Our Approach (CCGrid’01 + J. Dist. & Parallel Databases 11(2), 2001) • [Figure: data cube lattice over dimensions A, B, C, D, from ABCD down to All] • ROLAP solution • Construct and cost the data cube lattice • Find a “least cost” spanning tree • Partition the spanning tree over the processors equally, construct views and distribute • Can handle partial cubes • Limitations • What about indexing?
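To make the "cost the lattice, then pick a spanning tree" step concrete, here is a deliberately simplified sketch. It assumes the cost of computing a view is just the estimated size of the parent it is computed from, which is a stand-in for the cost model used in the cited papers; the size estimates and function names are hypothetical.

```python
from itertools import combinations

def lattice_parents(view, dims):
    """Views in the lattice with exactly one more dimension than `view`."""
    return [tuple(d for d in dims if d in view or d == extra)
            for extra in dims if extra not in view]

def least_cost_tree(dims, est_size):
    """Map each non-root view to its cheapest parent (smallest estimated size)."""
    tree = {}
    for k in range(len(dims) - 1, -1, -1):
        for view in combinations(dims, k):
            tree[view] = min(lattice_parents(view, dims),
                             key=lambda p: est_size[p])
    return tree

# Hypothetical size estimates (rows) for each view of a 3-dimensional cube.
dims = ("A", "B", "C")
est_size = {
    ("A", "B", "C"): 1000,
    ("A", "B"): 400, ("A", "C"): 150, ("B", "C"): 600,
    ("A",): 10, ("B",): 40, ("C",): 15,
    (): 1,
}
tree = least_cost_tree(dims, est_size)
```

With these estimates, view A is computed from AC rather than AB because AC is smaller. Partitioning the resulting tree over processors is not shown.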

Parallel Multi-dimensional Indexing • Query specifies a range on multiple dimensions • Forms a hypercube in the point space
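A multi-dimensional range query can be represented by a low corner and a high corner, one coordinate per dimension. The sketch below, with hypothetical helper names, shows the two tests an index typically needs: whether a point lies inside the query box, and whether a bounding box overlaps it. Bounds are treated as inclusive, which is an assumption.

```python
def in_range(point, low, high):
    """True if every coordinate of `point` lies within [low, high]."""
    return all(lo <= x <= hi for x, lo, hi in zip(point, low, high))

def boxes_intersect(low_a, high_a, low_b, high_b):
    """True if two axis-aligned boxes overlap in every dimension."""
    return all(la <= hb and lb <= ha
               for la, ha, lb, hb in zip(low_a, high_a, low_b, high_b))
```

For example, `in_range((3, 5), (1, 4), (3, 9))` is true, while a point whose first coordinate is 4 falls outside the same query.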

General Approach • No multidimensional index is universally successful • Exploit domain specific information and the features of a particular index • OLAP • Data is provided up front • Updates are batch oriented

Design Goals • A framework for distributed high-performance indexing of ROLAP cubes • Practical to implement • Low communication volume • Fully adapted to external memory (disks) • No shared disk required • Incrementally maintainable • Efficient for high D spatial searches • Scalable in terms of data size, dimensions, processors

Challenge • [Figure: view ABC partitioned across processors P1, P2, P3, P4] • How to order and partition data such that • the number of records retrieved per node is as balanced as possible • the number of disk seeks required to answer a query is minimized

Indexing the Data Cube • Combine the strengths of a space-filling curve and an R-tree index • Use Hilbert curve to load buckets • Index buckets with R-tree • Update indexes with merge/sort
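The sketch below illustrates the general idea in two dimensions only: sort points by Hilbert-curve position, cut the sorted run into fixed-size buckets, and record a bounding box per bucket as the leaf level of a packed R-tree. It uses the standard 2-D coordinate-to-Hilbert-index conversion; the bucket capacity, function names, and single-level index are assumptions for illustration, not the RCUBE implementation.

```python
def hilbert_index(order, x, y):
    """Position of cell (x, y) along the Hilbert curve on a 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:              # rotate/flip the quadrant as needed
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d

def pack_buckets(points, order, capacity):
    """Load buckets in Hilbert order; return (bounding box, bucket) leaves."""
    ordered = sorted(points, key=lambda p: hilbert_index(order, *p))
    leaves = []
    for i in range(0, len(ordered), capacity):
        bucket = ordered[i:i + capacity]
        lo = (min(p[0] for p in bucket), min(p[1] for p in bucket))
        hi = (max(p[0] for p in bucket), max(p[1] for p in bucket))
        leaves.append(((lo, hi), bucket))
    return leaves

def range_query(leaves, low, high):
    """Scan only buckets whose box overlaps the query; count buckets scanned."""
    hits, scanned = [], 0
    for (lo, hi), bucket in leaves:
        if all(lo[i] <= high[i] and low[i] <= hi[i] for i in (0, 1)):
            scanned += 1
            hits += [p for p in bucket
                     if all(low[i] <= p[i] <= high[i] for i in (0, 1))]
    return hits, scanned

cells = [(x, y) for x in range(4) for y in range(4)]
leaves = pack_buckets(cells, order=2, capacity=4)
hits, scanned = range_query(leaves, (0, 0), (1, 1))
```

Because neighbouring positions on the curve are neighbours in space, each bucket here covers a compact 2 x 2 region, so the query touches a single bucket.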

Space Filling Curves & Striping

Query Retrieval • [Figure: processors P1 to P4, each holding a partition of view ABC]

Example • [Figure: original space, split across Processor 1 and Processor 2] • 8 points to be reported • Processor 1 reports: 2 consecutive blocks & 4 points • Processor 2 reports: 2 consecutive blocks & 4 points

The Parallel Framework • A single view is partitioned across p processors • Partial Hilbert/r-tree indexes are computed locally • Queries are answered concurrently • Queries answered individually or “piggy-backed”
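One way to picture the partitioning is round-robin striping of curve-ordered blocks, sketched below. This is an assumption about the striping scheme, and the records are stand-ins: each integer represents a point's position along the space-filling curve. A query that covers a contiguous run of the curve then lands evenly on the processors.

```python
def stripe(blocks, p):
    """Deal curve-ordered blocks to p processors round-robin."""
    return [blocks[i::p] for i in range(p)]

def answer(partitions, predicate):
    """Each processor filters its own blocks; results are then merged."""
    local = [[r for block in part for r in block if predicate(r)]
             for part in partitions]
    counts = [len(found) for found in local]
    merged = sorted(r for found in local for r in found)
    return counts, merged

# Hypothetical view: 8 blocks of 2 records each, already in curve order.
blocks = [[2 * i, 2 * i + 1] for i in range(8)]
partitions = stripe(blocks, p=2)

# Query covering curve positions 4..11, i.e. 8 points in 4 adjacent blocks.
counts, merged = answer(partitions, lambda r: 4 <= r <= 11)
```

In this toy run each of the two processors reports 4 of the 8 points. In a real deployment the per-processor filtering would run concurrently rather than in a loop.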

The Virtual Data Cube • Problem: Full cube often too large to materialize • Solution: Use surrogate views
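A minimal sketch of the surrogate idea follows, under assumptions the slides do not spell out: a query on a view that was not materialized is answered from the smallest materialized view containing all requested dimensions, with the extra dimensions aggregated away at query time. The selection rule, names, and sum measure are all hypothetical.

```python
from collections import defaultdict

def pick_surrogate(requested, materialized):
    """Smallest materialized view whose dimensions cover the request."""
    candidates = [v for v in materialized if set(requested) <= set(v)]
    if not candidates:
        raise LookupError(f"no materialized view covers {requested}")
    return min(candidates, key=lambda v: len(materialized[v]))

def answer_from_surrogate(requested, materialized):
    """Aggregate the surrogate down to the requested dimensions (sum)."""
    view = pick_surrogate(requested, materialized)
    idx = [view.index(d) for d in requested]
    out = defaultdict(int)
    for key, measure in materialized[view].items():
        out[tuple(key[i] for i in idx)] += measure
    return view, dict(out)

# Hypothetical partial cube: only ABC and AB are stored.
materialized = {
    ("A", "B", "C"): {("a1", "b1", "c1"): 5, ("a1", "b2", "c1"): 3,
                      ("a2", "b1", "c2"): 2},
    ("A", "B"): {("a1", "b1"): 5, ("a1", "b2"): 3, ("a2", "b1"): 2},
}
used, result = answer_from_surrogate(("B",), materialized)
```

Both stored views have three rows here, so the tie goes to the first candidate, ABC.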

Surrogate Processing

Other issues… • Dimension ordering • Query piggybacking • Batch updating • Managing Hierarchies of views

Experimental Results • Machine • 17 node cluster • Node = 1.8 GHz Xeon, 1 GB RAM, 2 * 40 GB IDE drives, running Linux • Interconnect = Intel Fast Ethernet switch • Test Data • 10 dimensions and 1,000,000 records

RCUBE Index Construction • Output: ~640 million rows, 16 gigabytes

Distributed Query Resolution • Test: random queries returning ~15% of points (10 experiments per point)

Disk Blocks Retrieved vs. Disk Seeks • Test: random queries returning 5-15% of points (15 experiments per point)

Distributed Query Resolution in Surrogate Group-bys

Thank You Questions?
