Data Management and Data Processing Support on Array-Based Scientific Data
Presentation Transcript



Data Management and Data Processing Support on Array-Based Scientific Data

Yi Wang

Advisor: Gagan Agrawal

Candidacy Examination



Big Data Is Often Big Arrays

  • Array data is everywhere

    • Molecular Simulation: molecular data

    • Life Science: DNA sequencing data (microarrays)

    • Earth Science: ocean and climate data

    • Space Science: astronomy data



Inherent Limitations of Current Tools and Paradigms

  • Most scientific data management and data processing tools are too heavy-weight

    • Hard to cope with different data formats and physical structures (variety)

    • Data transformation and data transfer are often prohibitively expensive (volume)

  • Prominent Examples

    • RDBMSs: not suited for array data

    • Array DBMSs: require costly data ingestion

    • MapReduce: tied to a specialized file system



Mismatch Between Scientific Data and DBMS

  • Scientific (Array) Datasets:

    • Very large but processed infrequently

    • Read/append only

    • No resources for reloading data

    • Popular formats: NetCDF and HDF5

  • Database Technologies

    • For read-write data: ACID guarantees

    • Assume data reloading/reformatting is feasible



Example Array Data Format - HDF5

  • HDF5 (Hierarchical Data Format)
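
A small h5py sketch (not from the slides) of HDF5's hierarchical layout: groups act like directories, datasets hold the arrays, and attributes carry metadata. All names and shapes are invented for illustration.

```python
import numpy as np
import h5py

with h5py.File("example.h5", "w") as f:
    ocean = f.create_group("ocean")                              # group = "directory"
    sal = ocean.create_dataset("salinity", shape=(10, 8, 4, 3), dtype="f4")
    sal.attrs["units"] = "PSU"                                   # metadata attribute
    f["ocean/temperature"] = np.zeros((10, 8, 4, 3), dtype="f4")  # dataset created via path

with h5py.File("example.h5", "r") as f:
    print(list(f["ocean"].keys()))   # ['salinity', 'temperature']
```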



The Upfront Cost of Using SciDB

  • High-Level Data Flow

    • Requires data ingestion

  • Data Ingestion Steps

    • Raw files (e.g., HDF5) -> CSV

    • Load CSV files into SciDB

“EarthDB: scalable analysis of MODIS data using SciDB”

- G. Planthaber et al.



Thesis Statement

  • Native Data Can Be Queried and/or Processed Efficiently Using Popular Abstractions

    • Process data stored in the native format, e.g., NetCDF and HDF5

    • Support SQL-like operators, e.g., selection and aggregation

    • Support array operations, e.g., structural aggregations

    • Support MapReduce-like processing API



Outline

  • Data Management Support

    • Supporting a Light-Weight Data Management Layer Over HDF5

    • SAGA: Array Storage as a DB with Support for Structural Aggregations

    • Approximate Aggregations Using Novel Bitmap Indices

  • Data Processing Support

    • SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats

  • Future Work



Overall Idea

  • An SQL Implementation Over HDF5

    • Ease-of-use: declarative language instead of low-level programming language + HDF5 API

    • Abstraction: provides a virtual relational view

  • High Efficiency

    • Load data on demand (lazy loading)

    • Parallel query processing

    • Server-side aggregation



Functionality

  • Query Based on Dimension Index Values (Type 1)

    • Also supported by HDF5 API

  • Query Based on Dimension Scales (Type 2)

    • Coordinate system instead of the physical layout (array subscripts)

  • Query Based on Data Values (Type 3)

    • Simple datatype + compound datatype

  • Aggregate Query

    • SUM, COUNT, AVG, MIN, and MAX

    • Server-side aggregation to minimize the data transfer

(Type 1: index-based condition; Type 2: coordinate-based condition; Type 3: content-based condition)
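
A hedged sketch of how the three query classes could look against an HDF5 array using h5py; the file, dataset, and dimension-scale names are hypothetical, and this is not the system's actual implementation.

```python
import numpy as np
import h5py

with h5py.File("ocean.h5", "r") as f:             # hypothetical file
    temp = f["temperature"]                       # assumed 2-D dataset

    # Type 1 -- dimension index values: plain array subscripts.
    t1 = temp[0:10, 0:5]

    # Type 2 -- dimension scales: map coordinate values to subscripts first.
    lat = f["lat"][...]                           # assumed dimension-scale array
    rows = np.where((lat >= 30.0) & (lat <= 40.0))[0]
    t2 = temp[rows, :]                            # read only the matching rows

    # Type 3 -- data values: filter on the content itself after a partial read.
    block = temp[0:10, 0:5]
    t3 = block[block > 15.0]

print(t1.shape, t2.shape, t3.shape)
```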



Execution Overview

(Diagram: the parsed query is organized into condition lists: a 1-D AND-logic condition list, a 2-D OR-logic condition list, and a 1-D OR-logic condition list, all sharing the same content-based condition.)



Experimental Setup

  • Experimental Datasets

    • 4 GB (sequential experiments) and 16 GB (parallel experiments)

    • 4D: time, cols, rows, and layers

  • Compared with Baseline Performance and OPeNDAP

    • Baseline performance: no query parsing

    • OPeNDAP: translates HDF5 into a specialized data format



Sequential Comparison with OPeNDAP (Type 2 and Type 3 Queries)



Parallel Query Processing for Type 2 and Type 3 Queries



Outline

  • Data Management Support

    • Supporting a Light-Weight Data Management Layer Over HDF5

    • SAGA: Array Storage as a DB with Support for Structural Aggregations

    • Approximate Aggregations Using Novel Bitmap Indices

  • Data Processing Support

    • SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats

  • Future Work



Array Storage as a DB

  • A Paradigm Similar to NoDB

    • Still maintains DB functionality

    • But no data ingestion

  • DB and Array Storage as a DB: Friends or Foes?

    • When to use DB?

      • Load once, and query frequently

    • When to directly use array storage?

      • Query infrequently, so avoid loading

  • Our System

    • Focuses on a set of special array operations - Structural Aggregations



Structural Aggregation Types

Non-Overlapping Aggregation

Overlapping Aggregation



Grid Aggregation

  • Parallelization: Easy after Partitioning

  • Considerations

    • Data contiguity, which affects I/O performance

    • Communication cost

    • Load balancing for skewed data

  • Partitioning Strategies

    • Coarse-grained

    • Fine-grained

    • Hybrid

    • Auto-grained



Partitioning Strategy Decider

  • Cost Model: analyze the loading cost and the computation cost separately

    • Loading cost = loading factor × data amount

    • Computation cost

  • Exception: Auto-Grained takes the loading cost and the computation cost as a whole (a toy sketch of the decision follows below)
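
A toy sketch of such a decider, with entirely made-up constants and per-strategy estimates; the real cost model is not spelled out on the slides.

```python
# Hypothetical per-byte loading factor and per-element computation factor.
LOAD_FACTOR = 0.5e-9
COMPUTE_FACTOR = 2.0e-9

def estimated_cost(bytes_loaded, elements_processed):
    # Loading cost = loading factor x data amount; computation cost added separately.
    return LOAD_FACTOR * bytes_loaded + COMPUTE_FACTOR * elements_processed

# Assumed per-process (bytes loaded, elements processed) for each strategy.
strategies = {
    "coarse-grained": (1 * 2**30, 8 * 2**27),
    "fine-grained":   (3 * 2**30, 6 * 2**27),
    "hybrid":         (2 * 2**30, 6 * 2**27),
}
best = min(strategies, key=lambda s: estimated_cost(*strategies[s]))
print(best)   # the strategy with the lowest estimated per-process cost
```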



Overlapping Aggregation

  • I/O Cost

    • Reuse the data already in the memory

    • Reduce the disk I/O to enhance the I/O performance

  • Memory Accesses

    • Reuse the data already in the cache

    • Reduce cache misses to accelerate the computation

  • Aggregation Approaches

    • Naïve approach

    • Data-reuse approach

    • All-reuse approach



Example: Hierarchical Aggregation

  • Aggregate 3 grids in a 6 × 6 array

    • The innermost 2 × 2 grid

    • The middle 4 × 4 grid

    • The outermost 6 × 6 grid

  • (Parallel) sliding aggregation is much more complicated



Naïve Approach

Load the innermost grid

Aggregate the innermost grid

Load the middle grid

Aggregate the middle grid

Load the outermost grid

Aggregate the outermost grid

For N grids:

N loads + N aggregations



Data-Reuse Approach

Load the outermost grid

Aggregate the outermost grid

Aggregate the middle grid

Aggregate the innermost grid

For N grids:

1 load + N aggregations



All-Reuse Approach

Load the outermost grid

Once an element is accessed, accumulatively update the aggregation results it contributes to

For N grids:

1 load + 1 aggregation

Only update the outermost aggregation result

Update both the outermost and the middle aggregation results

Update all 3 aggregation results
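
A minimal numpy sketch of the all-reuse idea for the hierarchical example above: one pass over a 6 × 6 array updates the SUM of every nested grid an element belongs to (the grid placement and the use of SUM are assumptions for illustration).

```python
import numpy as np

arr = np.arange(36, dtype=float).reshape(6, 6)   # toy 6 x 6 array
grids = [(2, 4), (1, 5), (0, 6)]                 # (lo, hi) of the nested 2x2, 4x4, 6x6 grids
sums = np.zeros(len(grids))                      # one running aggregate per grid

# Single load + single sweep: each element updates every result it contributes to.
for i in range(6):
    for j in range(6):
        for g, (lo, hi) in enumerate(grids):
            if lo <= i < hi and lo <= j < hi:
                sums[g] += arr[i, j]

print(sums)   # innermost, middle, outermost SUMs
# Sanity check against aggregating each grid directly (the naive approach).
assert all(np.isclose(sums[g], arr[lo:hi, lo:hi].sum())
           for g, (lo, hi) in enumerate(grids))
```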



Sequential Performance Comparison

  • Array slab/data size (8 GB) ratio: from 12.5% to 100%

  • Coarse-grained partitioning for the grid aggregation

  • All-reuse approach for the sliding aggregation

  • SciDB stores 'chunked' arrays and can even support overlapping chunking to accelerate the sliding aggregation



Parallel Sliding Aggregation Performance

  • # of nodes: from 1 to 16

  • 8 GB data

  • Sliding grid size: from 3 × 3 to 6 × 6



Outline

  • Data Management Support

    • Supporting a Light-Weight Data Management Layer Over HDF5

    • SAGA: Array Storage as a DB with Support for Structural Aggregations

    • Approximate Aggregations Using Novel Bitmap Indices

  • Data Processing Support

    • SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats

  • Future Work



Approximate Aggregations Over Array Data

  • Challenges

    • Flexible Aggregation Over Any Subset

      • Dimension-based/value-based/combined predicates

    • Aggregation Accuracy

      • Spatial distribution/value distribution

    • Aggregation Without Data Reorganization

      • Reorganization is prohibitively expensive

  • Existing Techniques - All Problematic for Array Data

    • Sampling: unable to capture both distributions

    • Histograms: no spatial distribution

    • Wavelets: no value distribution

  • New Data Synopses – Bitmap Indices



Bitmap Indexing and Pre-Aggregation

  • Bitmap Indices

  • Pre-Aggregation Statistics



Approximate Aggregation Workflow



Running Example

SELECT SUM(Array) WHERE Value > 3 AND ID < 4;

  • Bitmap Indices

  • Pre-Aggregation Statistics

Predicate Bitvector: 11110000

i1’: 01000000

i2’: 10010000

Count1: 1

Count2: 2

Estimated Sum: 7 × 1/2 + 16 × 2/3 = 14.167

Precise Sum: 14
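
A small numpy sketch that reproduces the arithmetic above. The full bin bitvectors and the per-bin pre-aggregation totals (SUM 7 over 2 elements in bin 1, SUM 16 over 3 elements in bin 2) are inferred from the slide's 7 × 1/2 + 16 × 2/3 computation, so treat them as assumptions.

```python
import numpy as np

predicate = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)   # ID < 4
i1 = np.array([0, 1, 0, 0, 0, 1, 0, 0], dtype=bool)          # assumed bitvector of value bin 1
i2 = np.array([1, 0, 0, 1, 0, 0, 1, 0], dtype=bool)          # assumed bitvector of value bin 2
pre_agg = {"sum": [7.0, 16.0], "count": [2, 3]}               # assumed pre-aggregation stats

estimate = 0.0
for b, bv in enumerate([i1, i2]):
    hits = np.count_nonzero(bv & predicate)                     # i1': 1 hit, i2': 2 hits
    estimate += pre_agg["sum"][b] * hits / pre_agg["count"][b]  # scale bin SUM by hit fraction

print(round(estimate, 3))   # 14.167, versus the precise SUM of 14
```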



A Novel Binning Strategy

  • Conventional Binning Strategies

    • Equi-width/Equi-depth

    • Not designed for aggregation

  • V-Optimized Binning Strategy

    • Inspired by V-Optimal Histogram

    • Goal: approximately minimize the sum of squared errors (SSE)

    • Unbiased V-Optimized Binning: assumes data is queried uniformly at random

    • Weighted V-Optimized Binning: assumes the frequently queried subareas are known a priori



Unbiased V-Optimized Binning

  • 3 Steps:

    • Initial Binning: use equi-depth binning

    • Iterative Refinement: adjust bin boundaries

    • Bitvector Generation: mark spatial positions
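
A rough sketch of the three steps, using a simple local search as a stand-in for the actual refinement rule (which the slide does not spell out); everything here is illustrative rather than the paper's algorithm.

```python
import numpy as np

def sse(values, edges):
    # Sum of squared errors when each bin is represented by its mean.
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = values[(values >= lo) & (values < hi)]
        if in_bin.size:
            total += ((in_bin - in_bin.mean()) ** 2).sum()
    return total

def unbiased_v_optimized(values, n_bins, iters=20):
    # Step 1: initial binning -- equi-depth boundaries.
    edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))
    edges[-1] += 1e-9                       # make the last bin right-inclusive
    # Step 2: iterative refinement -- nudge interior boundaries while SSE improves.
    for _ in range(iters):
        for k in range(1, n_bins):
            for cand in (0.9 * edges[k] + 0.1 * edges[k - 1],
                         0.9 * edges[k] + 0.1 * edges[k + 1]):
                trial = edges.copy()
                trial[k] = cand
                if sse(values, trial) < sse(values, edges):
                    edges = trial
    # Step 3: bitvector generation -- mark the spatial positions falling in each bin.
    bitvectors = [(values >= lo) & (values < hi) for lo, hi in zip(edges[:-1], edges[1:])]
    return edges, bitvectors

edges, bvs = unbiased_v_optimized(np.random.default_rng(0).normal(size=1000), n_bins=4)
print(edges, [int(bv.sum()) for bv in bvs])
```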



Weighted V-Optimized Binning

  • Difference: minimize WSSE instead of SSE

  • Similar binning algorithm

  • Major Modification

    • The representative value for each bin is not the (unweighted) mean value (see the note below)
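
A hedged reading of this modification (inferred, not quoted from the paper): if each value $v_i$ carries a query weight $w_i$, minimizing the weighted SSE per bin makes the weighted mean, rather than the plain mean, the best representative:

$$\mathrm{WSSE} = \sum_{b}\sum_{i \in b} w_i\,(v_i - r_b)^2, \qquad r_b = \frac{\sum_{i \in b} w_i v_i}{\sum_{i \in b} w_i}.$$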



Experimental Setup

  • Data Skew

    • Dense range: less than 5% of the space but over 90% of the data

    • Sparse range: over 95% of the space but less than 10% of the data

  • 5 Types of Queries

    • DB: with dimension-based predicates

    • VBD: with value-based predicates over dense range

    • VBS: with value-based predicates over sparse range

    • CD: with combined predicates over dense range

    • CS: with combined predicates over sparse range

  • Ratio of Querying Probabilities: 10 : 1

    • 50% of the synthetic data is frequently queried

    • 25% of the real-world data is frequently queried



SUM Aggregation Accuracy of Different Binning Strategies on the Synthetic Dataset

(Chart comparing equi-width, equi-depth, unbiased V-optimized, and weighted V-optimized binning.)



SUM Aggregation Accuracy of Different Methods on the Real-World Dataset

(Chart comparing 2% sampling, 20% sampling, an equi-depth multi-dimensional histogram, equi-depth binning, unbiased V-optimized binning, and weighted V-optimized binning.)



Outline

  • Data Management Support

    • Supporting a Light-Weight Data Management Layer Over HDF5

    • SAGA: Array Storage as a DB with Support for Structural Aggregations

    • Approximate Aggregations Using Novel Bitmap Indices

  • Data Processing Support

    • SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats

  • Future Work



Scientific Data Analysis Today

  • “Store-First-Analyze-After”

    • Reload data into another file system

      • E.g., load data from PVFS to HDFS

    • Reload data into another data format

      • E.g., load NetCDF/HDF5 data to a specialized format

  • Problems

    • Long data migration/transformation time

    • Stresses network and disks



System Overview

  • Key Feature

    • Scientific data processing module



Scientific Data Processing Module



Parallel Data Processing Times on 16 GB Datasets

  • K-Means

  • KNN



Future Work Outline

  • Data Management Support

    • SciSD: Novel Subgroup Discovery over Scientific Datasets Using Bitmap Indices

    • SciCSM: Novel Contrast Set Mining over Scientific Datasets Using Bitmap Indices

  • Data Processing Support

    • StreamingMATE: A Novel MapReduce-Like Framework Over Scientific Data Stream



SciSD

  • Subgroup Discovery

    • Goal: identify all the subsets that are significantly different from the entire dataset/general population, w.r.t. a target variable

    • Can be widely used in scientific knowledge discovery

  • Novelty

    • Subsets can involve dimensional and/or value ranges

    • All numeric attributes

    • High efficiency by frequent bitmap-based approximate aggregations



Running Example



SciCSM

  • “Sometimes it’s good to contrast what you like with something else. It makes you appreciate it even more.” - Darby Conley, Get Fuzzy, 2001

  • Contrast Set Mining

    • Goal: identify all the filters that can generate significantly different subsets

    • Common filters: time periods, spatial areas, etc.

    • Usage: classifier design, change detection, disaster prediction, etc.



Running Example



StreamingMATE

  • Extend the precursor system SciMATE to process scientific data streams

  • Generalized Reduction

    • Reduce data stream to a reduction object

    • No shuffling or sorting

  • Focus on the load balancing issues

    • Input data volume can be highly variable

    • Topology update: add/remove/update streaming operators
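
A minimal sketch of the generalized-reduction idea (an assumed interface, not the actual StreamingMATE API): each chunk of the stream is folded into a reduction object, and objects from parallel workers are merged, with no shuffling or sorting.

```python
class MeanReduction:
    """Reduction object that maintains a running mean over a numeric stream."""

    def __init__(self):
        self.total, self.count = 0.0, 0

    def accumulate(self, chunk):
        # Local reduction of one stream chunk.
        self.total += sum(chunk)
        self.count += len(chunk)

    def merge(self, other):
        # Combine reduction objects produced by different workers.
        self.total += other.total
        self.count += other.count
        return self

workers = [MeanReduction(), MeanReduction()]
workers[0].accumulate([1.0, 2.0, 3.0])
workers[1].accumulate([4.0, 5.0])
combined = workers[0].merge(workers[1])
print(combined.total / combined.count)   # 3.0
```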



StreamingMATE Overview



Hyperslab Selector

  • True: nullify the elementary condition; False: nullify the condition list

  • Example: 4-D salinity dataset with dimensions time [0, 1023], cols [0, 166], rows [0, 62], and layers [0, 33]

  • Fill up all the index boundary values
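
A hedged sketch of how a hyperslab selector could fill unconstrained dimensions with their full index boundary values and issue one partial read; the file and variable names, and the example conditions, are hypothetical.

```python
import h5py

bounds = {"time": (0, 1023), "cols": (0, 166), "rows": (0, 62), "layers": (0, 33)}
conds  = {"time": (0, 99), "layers": (11, 33)}      # e.g. time < 100 AND layers > 10

# Unconstrained dimensions fall back to their full index boundary values.
slices = tuple(slice(conds.get(d, bounds[d])[0],
                     conds.get(d, bounds[d])[1] + 1)   # inclusive bounds -> Python slices
               for d in ("time", "cols", "rows", "layers"))

with h5py.File("ocean.h5", "r") as f:                  # hypothetical file
    subset = f["salinity"][slices]                     # one hyperslab read
print(subset.shape)                                    # (100, 167, 63, 23)
```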



Type 2 and Type 3 Query Examples



Aggregation Query Examples

  • AG1: Simple global aggregation

  • AG2: GROUP BY clause + HAVING clause

  • AG3: GROUP BY clause
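
Purely hypothetical queries of these three shapes (the actual benchmark queries are not shown in the transcript; dataset and variable names are invented):

```python
# AG1: simple global aggregation
AG1 = "SELECT AVG(salinity) FROM dataset;"

# AG2: GROUP BY clause + HAVING clause
AG2 = ("SELECT layers, MAX(salinity) FROM dataset "
       "GROUP BY layers HAVING MAX(salinity) > 35;")

# AG3: GROUP BY clause
AG3 = "SELECT time, COUNT(*) FROM dataset GROUP BY time;"
```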



Sequential and Parallel Performance of Aggregation Queries



Array Databases

  • Examples: SciDB, RasDaMan and MonetDB

  • Arrays as First-Class Citizens

    • Everything is defined in the array dialect

  • Lightweight or No ACID Maintenance

    • No write conflicts: ACID is inherently guaranteed

  • Other Desired Functionality

    • Structural aggregations, array joins, provenance, …



    Structural Aggregations

    • Aggregate the elements based on positional relationships

      • E.g., a moving average calculates the average of each 2 × 2 square from left to right

    (Figure: input array and aggregation result; the elements in the same square are aggregated at a time.)
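
A small numpy sketch of this moving-average example on a toy 4 × 4 input (the input values are made up):

```python
import numpy as np

arr = np.arange(16, dtype=float).reshape(4, 4)   # toy input array
out = np.empty((3, 3))                           # one result per 2 x 2 square
for i in range(3):
    for j in range(3):
        out[i, j] = arr[i:i + 2, j:j + 2].mean() # aggregate one square at a time
print(out)
```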



    Coarse-Grained Partitioning

    • Pros

      • Low I/O cost

      • Low communication cost

    • Cons

      • Workload imbalance for skewed data



    Fine-Grained Partitioning

    • Pros

      • Excellent workload balance for skewed data

    • Cons

      • Relatively high I/O cost

      • High communication cost



    Hybrid Partitioning

    • Pros

      • Low communication cost

      • Good workload balance for skewed data

    • Cons

      • High I/O cost



    Auto-Grained Partitioning

    • 2 Steps

      • Estimate the grid density (after filtering) by sampling, and thus, estimate the computation cost (based on the time complexity)

        • For each grid, total processing cost = constant loading cost + varying computation cost

      • Partition the cost array: Balanced Contiguous Multi-Way Partitioning

        • Dynamic programming (small # of grids)

        • Greedy (large # of grids)
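
A sketch of the greedy variant (the precise greedy rule is an assumption): split the per-grid cost array into k contiguous parts, closing a part once it has roughly its fair share of the remaining cost.

```python
def balanced_contiguous_partition(costs, k):
    # Greedily split `costs` into k contiguous parts with roughly equal total cost.
    parts, current, acc = [], [], 0.0
    remaining, parts_left = float(sum(costs)), k
    for i, c in enumerate(costs):
        current.append(i)
        acc += c
        enough_grids_left = len(costs) - i - 1 >= parts_left - 1
        if parts_left > 1 and acc >= remaining / parts_left and enough_grids_left:
            parts.append(current)              # close this part
            remaining -= acc
            parts_left -= 1
            current, acc = [], 0.0
    parts.append(current)                      # last part takes the rest
    return parts

print(balanced_contiguous_partition([5, 1, 1, 1, 4, 4, 1, 3], 3))
# e.g. [[0, 1, 2], [3, 4, 5], [6, 7]] with per-part costs 7, 9, 4
```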



    Auto-Grained Partitioning (Cont’d)

    • Pros

      • Low I/O cost

      • Low communication cost

      • Great workload balance for skewed data

    • Cons

      • Overhead of sampling and runtime partitioning



    Partitioning Strategy Summary

    Our partitioning strategy decider can help choose the best strategy



    All-Reuse Approach (Cont’d)

    • Key Insight

      • # of aggregates ≤ # of queried elements

      • More computationally efficient to iterate over elements and update the associated aggregates

    • More Benefits

      • Load balance (for hierarchical/circular aggregations)

      • More speedup for compound array elements

        • The data type of an aggregate is usually primitive, but this is not always true for an array element



    Parallel Grid Aggregation Performance

    • Used 4 processors on a Real-Life Dataset of 8 GB

    • User-Defined Aggregation: K-Means

      • Vary the number of iterations to vary the computation amount



    Data Access Strategies and Patterns

    • Full Read: probably too expensive for reading a small data subset

    • Partial Read

      • Strided pattern

      • Column pattern

      • Discrete point pattern
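
A hedged h5py sketch of the partial-read patterns (the file and dataset names are hypothetical); each pattern touches only a subset of the array instead of performing a full read.

```python
import numpy as np
import h5py

with h5py.File("ocean.h5", "r") as f:              # hypothetical file
    dset = f["salinity"]                           # assumed 2-D dataset

    strided = dset[0:64:8, :]                      # strided pattern: every 8th row
    column  = dset[:, 5]                           # column pattern: a single column
    points  = np.array([dset[i, j]                 # discrete point pattern:
                        for i, j in [(2, 3), (10, 40), (31, 7)]])  # read point by point

print(strided.shape, column.shape, points.shape)
```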



    Indexing Cost of Different Binning Strategies with Varying # of Bins on the Synthetic Dataset



    SUM Aggregation of Equi-Width Binning with Varying # of Bins on the Synthetic Dataset



    SUM Aggregation of Equi-Depth Binning with Varying # of Bins on the Synthetic Dataset



    SUM Aggregation of V-Optimized Binning with Varying # of Bins on the Synthetic Dataset



    Average Relative Error (%) of MAX Aggregation of Different Methods on the Real-World Dataset



    SUM Aggregation Times of Different Methods on the Real-World Dataset (DB)



    SUM Aggregation Times of Different Methods on the Real-World Dataset (VBD)



    SUM Aggregation Times of Different Methods on the Real-World Dataset (VBS)



    SUM Aggregation Times of Different Methods on the Real-World Dataset (CD)



    SUM Aggregation Times of Different Methods on the Real-World Dataset (CS)



    SD vs. Classification

