Efficient Spatial OLAP Operations for Aggregate Queries in Data Warehouses

Efficient OLAP Operations in Spatial Data Warehouses
Dimitris Papadias, Panos Kalnis, Jun Zhang and Yufei Tao
Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong

Motivating Scenario The spatial dimension at the finest granularity consists of a set of regions (e.g., road segments in traffic supervision systems, areas covered by cells in mobile communication systems). The raw data provide the set of objects that fall in each region at every timestamp (e.g., cars in a road segment, users serviced by a cell). Queries ask for aggregate data over regions that satisfy some spatio-temporal condition (e.g., find the current traffic in all areas within a 1 km range of each hospital). Unlike traditional OLAP, there are no pre-defined hierarchies.

The aggregate R-tree An R-tree with aggregate data for every entry. The same idea can be applied to other access methods (e.g., quadtrees). Other aggregate functions may be used (e.g., avg, max).
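A minimal sketch of how an intermediate entry could carry aggregates for its subtree. The names `Entry` and `make_parent` are hypothetical, and storing both a sum and a max per entry is an assumption made for illustration; the slide only states that every entry holds aggregate data.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass
class Entry:
    mbr: Rect            # minimum bounding rectangle of the subtree
    count: int           # sum of object counts in the subtree
    max_count: int       # largest count of any single region in the subtree
    children: List["Entry"] = field(default_factory=list)  # empty for leaf entries


def make_parent(children: List[Entry]) -> Entry:
    """Build an intermediate entry whose MBR and aggregates summarize its children."""
    mbr = (
        min(c.mbr[0] for c in children),
        min(c.mbr[1] for c in children),
        max(c.mbr[2] for c in children),
        max(c.mbr[3] for c in children),
    )
    return Entry(
        mbr=mbr,
        count=sum(c.count for c in children),
        max_count=max(c.max_count for c in children),
        children=list(children),
    )


# Two leaf regions (e.g., road segments) with 3 and 7 cars.
a = Entry((0, 0, 1, 1), count=3, max_count=3)
b = Entry((3, 3, 4, 4), count=7, max_count=7)
parent = make_parent([a, b])
```

An average is not stored directly in this sketch; it could be derived if each entry also kept the number of regions beneath it alongside the sum.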

Why keep spatiotemporal aggregate information For efficient query processing (e.g., the number of objects inside an area can be found by a window query instead of a spatial join). Aggregate information is all that we need or know for some applications (e.g., traffic systems record the number of cars in an area, not their ids). Storing historical information about individual objects may raise privacy issues (keeping all past locations of mobile phone users may be illegal). Although the actual data may be highly volatile and involve extreme space requirements, the summarized data are less voluminous and may remain fairly constant for long intervals.

aR-trees and OLAP operations The aR-tree corresponds to a lattice. There may be multiple dimensions.

Query Processing - Single Window "Find the total number of cars on all road segments inside a query window."
Start from the root of the aR-tree. For each entry, one of the following three conditions holds:
• The entry is disjoint from the query window; the corresponding node cannot contain any cars contributing to the answer and is not retrieved.
• The entry is inside the query window; all the aggregate information needed is stored with the entry, so the corresponding node does not need to be accessed.
• The entry partially overlaps the query window; the corresponding node must be followed recursively.
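The three cases above can be sketched as a short recursive function. This is an illustration under stated assumptions, not the authors' implementation: `Entry`, `disjoint`, `contains` and `window_count` are hypothetical names, the aggregate is a simple count, and a leaf region that only partially overlaps the window is assumed not to contribute (the query asks for segments inside the window).

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def disjoint(a: Rect, b: Rect) -> bool:
    """True if the rectangles share no point."""
    return a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]


def contains(outer: Rect, inner: Rect) -> bool:
    """True if inner lies completely inside outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


@dataclass
class Entry:
    mbr: Rect
    count: int  # aggregate for the whole subtree below this entry
    children: List["Entry"] = field(default_factory=list)  # empty for leaf entries


def window_count(entries: List[Entry], window: Rect) -> int:
    total = 0
    for e in entries:
        if disjoint(e.mbr, window):
            continue                      # case 1: skip the subtree
        if contains(window, e.mbr):
            total += e.count              # case 2: use the stored aggregate
        elif e.children:
            total += window_count(e.children, window)  # case 3: descend
        # Assumption: a partially overlapping leaf region is not counted.
    return total


root = [
    Entry((0, 0, 4, 4), 10, [Entry((0, 0, 1, 1), 3), Entry((3, 3, 4, 4), 7)]),
    Entry((5, 5, 9, 9), 5, [Entry((5, 5, 6, 6), 2), Entry((8, 8, 9, 9), 3)]),
    Entry((20, 20, 22, 22), 4, [Entry((20, 20, 22, 22), 4)]),
]
result = window_count(root, (0, 0, 6, 6))  # 10 from the first entry, 2 from the second
```

In the example, the first root entry is answered from its stored aggregate without visiting its children, the second is descended into, and the third is pruned.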

Query Processing - Multiple Windows "Find the total number of cars on road segments inside each city suburb" Without aR-trees, the query can be processed as a multiway spatial join (suburbs, cars, road segments). With aR-trees, it is processed as a pairwise join (suburbs, aR-tree). If the query windows (i.e., suburbs) fit in memory, we propose an extension of the single-window technique that considers all windows in parallel.
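One possible reading of "considers all windows in parallel" is a single traversal that carries the set of windows still undecided for each subtree. The sketch below follows that reading; it is an assumption, not the algorithm from the talk, and `Entry` and `multi_window_count` are hypothetical names. Leaf regions that only partially overlap a window are assumed not to contribute.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def disjoint(a: Rect, b: Rect) -> bool:
    return a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]


def contains(outer: Rect, inner: Rect) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


@dataclass
class Entry:
    mbr: Rect
    count: int  # aggregate for the whole subtree
    children: List["Entry"] = field(default_factory=list)


def multi_window_count(entries: List[Entry],
                       windows: Dict[str, Rect]) -> Dict[str, int]:
    totals = {name: 0 for name in windows}

    def visit(level: List[Entry], active: Dict[str, Rect]) -> None:
        for e in level:
            partial = {}  # windows that still need the subtree of e
            for name, w in active.items():
                if disjoint(e.mbr, w):
                    continue                  # this window ignores the subtree
                if contains(w, e.mbr):
                    totals[name] += e.count   # answered from the aggregate
                else:
                    partial[name] = w
            if partial and e.children:
                visit(e.children, partial)    # each node is read at most once

    visit(entries, windows)
    return totals


root = [
    Entry((0, 0, 4, 4), 10, [Entry((0, 0, 1, 1), 3), Entry((3, 3, 4, 4), 7)]),
    Entry((5, 5, 9, 9), 5, [Entry((5, 5, 6, 6), 2), Entry((8, 8, 9, 9), 3)]),
]
suburbs = {"s1": (0, 0, 6, 6), "s2": (4.5, 4.5, 10, 10)}
totals = multi_window_count(root, suburbs)
```

The design point is that a node is fetched once for all windows that partially overlap it, instead of once per window as in a sequence of single-window queries.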

Experimental Settings Tiger Dataset (130,000 road segments). We randomly selected 5,000 seed points located on roads. For each seed point, we generated a cluster of 250 points (i.e., car positions) with a Gaussian distribution; the total number of cars was therefore 1.25M. The distribution of the queries follows the distribution of the roads.

Evaluation for Single-Window Queries Raw data approach: join the cars and streets datasets. Fact table approach: an R-tree indexes the fact table (i.e., similar to aR-trees, but no aggregate information in the intermediate nodes).

Evaluation for Multiple-Window Queries aR-tree (single queries): a set of single-window queries processed using the single_aggregation algorithm of aR-trees. Fact table (join): a join between the R-tree index of the fact table and the query windows, which fit in memory. Fact table (single): indexed nested loops using the R-tree index of the fact table.

Applications to spatio-temporal data Query: "find the total number of objects in the regions intersecting some window qs during a time interval qt"

The aggregate 3DR-tree (a3DR-tree) Each entry has the form <r.MBR, r.pointer, r.lifespan, r.aggr[]>; that is, for each region it keeps the aggregate value and the time interval during which this value is valid. Whenever the aggregate information about a region changes, a new entry is created. Advantage: the a3DR-tree integrates spatial and temporal dimensions in the same structure (and is, therefore, expected to be more efficient than column scanning for queries that involve both conditions). Disadvantage: it wastes space by storing the MBR each time there is an aggregate change.
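The versioning rule (a new entry whenever a region's aggregate changes) can be illustrated with a flat list standing in for the leaf level of the tree. Everything here is a hypothetical sketch: `LeafEntry`, `RegionHistory`, `record` and `query` are invented names, timestamps are assumed to be integers, and the query is assumed to add each entry's value once per timestamp in which its lifespan overlaps the query interval.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def intersects(a: Rect, b: Rect) -> bool:
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])


@dataclass
class LeafEntry:
    mbr: Rect            # stored again for every version (the space overhead)
    start: int           # first timestamp of the lifespan
    end: Optional[int]   # last timestamp, or None while the value is still valid
    aggr: int            # aggregate value valid during the lifespan


class RegionHistory:
    def __init__(self) -> None:
        self.entries: List[LeafEntry] = []
        self.live: Dict[str, LeafEntry] = {}  # region id -> current version

    def record(self, region_id: str, mbr: Rect, t: int, value: int) -> None:
        current = self.live.get(region_id)
        if current is not None:
            if current.aggr == value:
                return                 # unchanged: the lifespan simply extends
            current.end = t - 1        # close the old version
        entry = LeafEntry(mbr, t, None, value)
        self.entries.append(entry)
        self.live[region_id] = entry

    def query(self, qs: Rect, t_from: int, t_to: int) -> int:
        total = 0
        for e in self.entries:
            if not intersects(e.mbr, qs):
                continue
            lo = max(e.start, t_from)
            hi = t_to if e.end is None else min(e.end, t_to)
            if lo <= hi:
                # Assumption: one contribution per overlapping timestamp.
                total += e.aggr * (hi - lo + 1)
        return total


h = RegionHistory()
h.record("r1", (0, 0, 1, 1), 1, 5)
h.record("r1", (0, 0, 1, 1), 2, 5)   # same value, no new entry
h.record("r1", (0, 0, 1, 1), 3, 8)   # value changed, new entry with the same MBR
h.record("r2", (5, 5, 6, 6), 1, 2)
```

Region `r1` ends up with two entries that repeat the same MBR, which is the space overhead named as the disadvantage.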

The aggregate RB tree

Query Example Find all objects in some region overlapping the query window qs during the time interval [1-3]

The aggregate 3DRB-tree

Conclusions and directions for future work
• Spatio-temporal OLAP is a very promising direction of work.
• Incorporation of multi-version structures for dynamic dimensions.
• Formalization and analysis of when aggregation multi-trees are preferable.
