Tools and Techniques for the Data Grid
Gagan Agrawal

Presentation Transcript
Grids and Data Grids
  • Grid Computing
    • Large-scale problem solving using resources over the Internet
    • Distributed computing, but across multiple administrative domains
  • Data Grid
    • Grid with a focus on sharing and processing large-scale datasets
Scientific Data Repositories
  • Large volume
    • Gigabyte, Terabyte, Petabyte
  • Distributed datasets
  • Generated/collected by scientific simulations or instruments
  • Data could be streaming in nature

Scientific Data Analysis on Grid-based Data Repositories
  • Data Specification
  • Data Organization
  • Data Extraction
  • Data Movement
  • Data Analysis
  • Data Visualization
Opportunities
  • Scientific simulations and data collection instruments are generating large-scale data
  • Grid standards enabling sharing of data
  • Rapidly increasing wide-area bandwidths
Existing Efforts
  • Data grids recognized as an important component of grid/distributed computing
  • Major topics
    • Efficient/Secure Data Movement
    • Replica Selection
    • Metadata catalogs / Metadata services
    • Setting up workflows
Open Issues
  • Accessing / Retrieving / Processing data from scientific repositories
    • Need to deal with low-level formats
  • Integrating tools and services having/requiring data with different formats
  • Support for processing streaming data in a distributed environment
  • Efficient distributed data-intensive applications
  • Developing scalable data analysis applications
Ongoing Projects
  • Automatic Data Virtualization
  • On the fly information integration in a distributed environment
  • Middleware for Processing Streaming Data
  • Supporting Coarse-grained pipelined parallelism
  • Compiling XQuery on Scientific and Streaming Data
  • Middleware and Algorithms for Scalable Data Mining
Outline
  • Automatic Data Virtualization
    • Relational/SQL
    • XML/XQuery based
  • Information Integration
  • Middleware for Streaming Data
  • Cluster and Grid-based data mining middleware
Automatic Data Virtualization: Motivation
  • Emergence of grid-based data repositories
    • Can enable sharing of data in an unprecedented way
  • Access mechanisms for remote repositories
    • Complex low-level formats make accessing and processing of data difficult
  • Main desired functionality
    • Ability to select, download, and process a subset of data
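A data service of this kind essentially lets users say "give me this subset" against an abstract schema instead of against byte offsets in a low-level format. A minimal sketch of the idea (the record layout, field names, and functions here are invented for illustration, not the system's actual interface):

```python
import struct

# Hypothetical low-level layout: records of (TIME:int32, X:float32, SOIL:float32)
# packed back-to-back in a binary file -- the kind of format a data service hides.
RECORD = struct.Struct("<iff")

def write_raw(path, records):
    with open(path, "wb") as f:
        for rec in records:
            f.write(RECORD.pack(*rec))

def select(path, predicate):
    """Virtualized access: yield records as dicts, hiding the byte layout."""
    with open(path, "rb") as f:
        data = f.read()
    for offset in range(0, len(data), RECORD.size):
        time, x, soil = RECORD.unpack_from(data, offset)
        row = {"TIME": time, "X": x, "SOIL": soil}
        if predicate(row):
            yield row

write_raw("demo.bin", [(1, 0.0, 0.3), (2, 1.0, 0.7), (3, 2.0, 0.9)])
subset = list(select("demo.bin", lambda r: r["SOIL"] > 0.5))
```

The user's view is the predicate over named fields; everything below `select` is what automatic data virtualization aims to generate from a metadata descriptor rather than have users write by hand.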
Data Virtualization

[Diagram: a Data Service provides an abstract view of a dataset through Data Virtualization]

  • By the Global Grid Forum’s DAIS working group:
    • A Data Virtualization describes an abstract view of data.
    • A Data Service implements the mechanism to access and process data through the Data Virtualization.
Our Approach: Automatic Data Virtualization
  • Automatically create data services
    • A new application of compiler technology
  • A meta-data descriptor describes the layout of data on a repository
  • An abstract view is exposed to the users
  • Two implementations:
    • Relational /SQL-based
    • XML/XQuery based
Relational/SQL Implementation

[Diagram: a Select Query and a User-Defined Aggregate enter the Query frontend; Analysis and Code Generation, guided by the Meta-data Descriptor, produce the Extract Service and the Aggregation Service]
Design a Meta-data Description Language
  • Requirements
    • Specify the relationship of a dataset to the virtual dataset schema
    • Describe the dataset physical layout within a file
    • Describe the dataset distribution on nodes of one or more clusters
    • Specify the subsetting index attributes
    • Easy to use for data repository administrators and also convenient for our code generation
An Example
  • Oil Reservoir Management
    • The dataset comprises several realizations of a simulation on the same grid
    • For each realization and each grid point, a number of attributes are stored
    • The dataset is stored on a 4-node cluster

Component I: Dataset Schema Description

[IPARS]            //{* Dataset schema name *}
REL = short int    //{* Data type definition *}
TIME = int
X = float
Y = float
Z = float
SOIL = float
SGAS = float

Component II: Dataset Storage Description

[IparsData]        //{* Dataset name *}
DatasetDescription = IPARS    //{* Dataset schema for IparsData *}
DIR[0] = osu0/ipars
DIR[1] = osu1/ipars
DIR[2] = osu2/ipars
DIR[3] = osu3/ipars
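To make the descriptor's two-part structure concrete, here is a minimal sketch of a parser for descriptors of this general shape. The parsing rules are simplified guesses from the example above, not the tool's actual grammar:

```python
import re

DESCRIPTOR = """\
[IPARS]                    //{* Dataset schema name *}
REL = short int            //{* Data type definition *}
TIME = int
X = float
[IparsData]                //{* Dataset name *}
DatasetDescription = IPARS
DIR[0] = osu0/ipars
DIR[1] = osu1/ipars
"""

def parse_descriptor(text):
    """Split a descriptor into sections keyed by [Name], each a dict of 'key = value' lines."""
    sections, current = {}, None
    for line in text.splitlines():
        line = line.split("//")[0].strip()      # drop //{* ... *} comments
        if not line:
            continue
        header = re.fullmatch(r"\[(\w+)\]", line)
        if header:
            current = sections.setdefault(header.group(1), {})
        else:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return sections

meta = parse_descriptor(DESCRIPTOR)
# meta["IPARS"]["TIME"] == "int";  meta["IparsData"]["DIR[0]"] == "osu0/ipars"
```

A code generator can then walk such a structure to emit extraction code for the schema and storage layout it describes.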

Evaluate the Scalability of Our Tool
  • Scale the number of nodes hosting the oil reservoir management dataset
  • Extract a subset of interest of size 1.3 GB
  • The execution times scale almost linearly
  • The performance difference varies between 5% and 34%, with an average of 16%
Comparison with an Existing Database (PostgreSQL)
  • 6 GB of data for satellite data processing
  • The total storage required after loading the data in PostgreSQL is 18 GB
  • Indexes created for both spatial coordinates and S1 in PostgreSQL
  • No special performance tuning applied for the experiment
Outline
  • Automatic Data Virtualization
    • Relational/SQL
    • XML/XQuery based
  • Information Integration
  • Middleware for Streaming Data
  • Coarse-grained pipelined parallelism
XML/XQuery Implementation

[Diagram: XQuery queries posed against an XML view, mapped onto underlying HDF5, NetCDF, TEXT, and RMDB sources]
Programming/Query Language
  • High-level declarative languages ease application development
    • Popularity of Matlab for scientific computations
  • New challenges in compiling them for efficient execution
  • XQuery is a high-level language for processing XML datasets
    • Derived from database, declarative, and functional languages!
    • XPath (a subset of XQuery) embedded in an imperative language is another option
Approach / Contributions
  • Use of XML Schemas to provide high-level abstractions on complex datasets
  • Using XQuery with these Schemas to specify processing
  • Issues in Translation
    • High-level to low-level code
    • Data-centric transformations for locality in low-level codes
    • Issues specific to XQuery
      • Recognizing recursive reductions
      • Type inferencing and translation
System Architecture

[Diagram: XQuery Sources and the External Schema feed the Compiler; an XML Mapping Service relates the logical XML schema to the physical XML schema; the Compiler emits C/C++ code]
Outline
  • Automatic Data Virtualization
    • Relational/SQL
    • XML/XQuery based
  • Information Integration
  • Middleware for Streaming Data
  • Cluster and Grid-based data mining middleware
Overall Goal
  • Tools for data integration driven by:
    • Data explosion
      • Data size & number of data sources
    • New analysis tools
    • Autonomous resources
      • Heterogeneous data representation & various interfaces
    • Frequent Updates
  • Common Situations:
    • Flat-file datasets
    • Ad-hoc sharing of data
Current Approaches
  • Manually written wrappers
    • Problems
      • O(N²) wrappers needed, O(N) for a single update
  • Mediator-based integration systems
    • Problems
      • Need a common intermediate format
      • Unnecessary data transformation
  • Integration using web/grid services
      • Needs all tools to be web-services (all data in XML?)
Our Approach
  • Automatically generate wrappers
    • Stand-alone programs
    • For integrated DBs, (grid) workflow systems
  • Transform data in files of arbitrary formats
    • No domain- or format-specific heuristics
    • Layout information provided by users
  • Help biologists write layout descriptors using data mining techniques
  • Particularly attractive for
    • flat-file datasets
    • ad hoc data sharing
    • data grid environments
Our Approach: Advantages
  • Advantages:
    • No DB or query support required
    • One descriptor per resource needed
    • No unnecessary transformation
    • New resources can be integrated on-the-fly
Our Approach: Challenges
  • Description language
    • Format and logical view of data in flat files
    • Easy to interpret and write
  • Wrapper generation and Execution
    • Correspondence between data items
    • Separating wrapper analysis and execution
  • Interactive tools for writing layout descriptors
    • What data mining techniques to use ?
Wrapper Generation System Overview

[Diagram: the Layout Descriptor and Schema Descriptors feed a Parser and Mapping Generator, yielding a Data Entry Representation and Schema Mapping; the Application Analyzer produces WRAPINFO, which the DataReader, Synchronizer, and DataWriter use to move data from the Source Dataset to the Target Dataset]
Layout Description Language
  • Goal
    • To describe data in arbitrary flat file format
    • Easy to interpret and write
  • Components:
    • Schema description
    • Layout description
  • Example: FASTA
Layout Description Language

Example FASTA data (\n marks newlines):

>seq1 comment1\nASTPGHTIIYEAVCLHNDRTTIP\n>seq2 comment2\nASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK\nNMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR\nKSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES\n
>seq3 …

  • Component I: Schema Description

[FASTA]                //Schema Name
ID = string            //Data type definitions
DESCRIPTION = string
SEQ = string
Layout Description Language

Key observations on data layout:
  • Strings of variable length
  • Delimiters widely used
  • Data fields divided into variables
  • Repetitive structures

Key tokens:
  • “constant string”
  • LINESIZE
  • [optional]
  • <repeating>

Layout Description Language

  • Component II: Layout Description

LOOP ENTRY 1:EOF:1 {
    “>” ID “ ” DESCRIPTION
    < “\n” SEQ >
    “\n” | EOF
}
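A hand-written equivalent of the wrapper this descriptor would generate might look as follows. This is only a sketch: the function and field names are ours, and the real generated wrappers are driven by WRAPINFO rather than hard-coding FASTA:

```python
def parse_fasta(text):
    """Read FASTA entries as the LOOP descriptor specifies:
    '>' ID ' ' DESCRIPTION, then a repeating group of newline-separated SEQ lines,
    terminated by the next '>' or end-of-file."""
    entries = []
    for chunk in text.split(">"):
        if not chunk.strip():
            continue                              # skip anything before the first '>'
        header, _, body = chunk.partition("\n")
        seq_id, _, description = header.partition(" ")
        entries.append({
            "ID": seq_id,
            "DESCRIPTION": description,
            "SEQ": "".join(body.split()),         # concatenate the repeating SEQ lines
        })
    return entries

sample = ">seq1 comment1\nASTPGH\nTIIYEA\n>seq2 comment2\nASQKRP\n"
records = parse_fasta(sample)
# records[0] == {"ID": "seq1", "DESCRIPTION": "comment1", "SEQ": "ASTPGHTIIYEA"}
```

The point of the description language is that this logic need not be written per format: the delimiters, repeating group, and field boundaries are all recoverable from the descriptor.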

Outline
  • Automatic Data Virtualization
    • Relational/SQL
    • XML/XQuery based
  • Information Integration
  • Middleware for Streaming Data
  • Coarse-grained pipelined parallelism
Streaming Data Model
  • Continuous data arrival and processing
  • Emerging model for data processing
    • Sources that produce data continuously: sensors, long running simulations
    • WAN bandwidths growing faster than disk bandwidths
  • Active topic in many computer science communities
    • Databases
    • Data Mining
    • Networking ….
Summary/Limitations of Current Work
  • Focus on
    • centralized processing of stream from a single source (databases, data mining)
    • communication only (networking)
  • Many applications involve
    • distributed processing of streams
    • streams from multiple sources
Motivating Application

Network Fault Management System

[Diagram: a switch network with a fault (X), monitored by the Network Fault Management System]
Motivating Application (2)

Computer Vision Based Surveillance

Features of Distributed Streaming Processing Applications
  • Data sources could be distributed
    • Over a WAN
  • Continuous data arrival
  • Enormous volume
    • Probably can’t communicate it all to one site
  • Results from analysis may be desired at multiple sites
  • Real-time constraints
    • A real-time, high-throughput, distributed processing problem
Need for a Grid-Based Stream Processing Middleware
  • Application developers interested in data stream processing
    • Would like to have abstracted:
      • Grid standards and interfaces
      • Adaptation functionality
    • Would like to focus on algorithms only
  • GATES is a middleware for
    • Grid-based
    • Self-adapting
    • Data Stream Processing
Adaptation for Real-time Processing
  • Analysis on streaming data is approximate
  • Accuracy and execution rate trade-off can be captured by certain parameters (Adaptation parameters)
    • Sampling Rate
    • Size of summary structure
  • Application developers can expose these parameters and a range of values
API for Adaptation

public class Sampling-Stage implements StreamProcessing {

    void init() { … }

    void work(buffer in, buffer out) {
        while (true) {
            Image img = get-from-buffer-in-GATES(in);
            Image img-sample = Sampling(img, sampling-ratio);
            put-to-buffer-in-GATES(img-sample, out);
        }
    }
}

// Exposing the adaptation parameter and obtaining a suggested value:
GATES.Information-About-Adjustment-Parameter(min, max, 1);
sampling-ratio = GATES.getSuggestedParameter();
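The middleware's side of this contract, picking a value inside the exposed range, can be approximated by a simple feedback rule. The sketch below is our illustration of the idea, not GATES's actual adaptation algorithm; the target backlog, step size, and bounds are invented:

```python
def adjust_sampling_ratio(ratio, backlog, target=100, lo=0.05, hi=1.0, step=0.05):
    """If the input buffer is backing up, sample less (trade accuracy for rate);
    if it drains well below target, sample more. Clamp to the exposed [lo, hi] range."""
    if backlog > target:
        ratio -= step
    elif backlog < target // 2:
        ratio += step
    return max(lo, min(hi, ratio))

ratio = 0.5
for backlog in [150, 160, 40, 30, 30]:    # observed input-queue lengths over time
    ratio = adjust_sampling_ratio(ratio, backlog)
# ratio drifts down under load, then back up as the buffer drains
```

The application only declares the parameter and its range; a policy like this, running inside the middleware, keeps the pipeline meeting its real-time constraints.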

Outline
  • Automatic Data Virtualization
    • Relational/SQL
    • XML/XQuery based
  • Information Integration
  • Middleware for Streaming Data
  • Cluster and Grid-based data mining middleware
Scalable Mining Problem
  • Our understanding of what algorithms and parameters will give desired insights is often limited
  • The time required for creating scalable implementations of different algorithms and running them with different parameters on large datasets slows down the data mining process
Mining in a Grid Environment
  • A data mining application in a grid environment:
    • Needs to exploit different forms of available parallelism
    • Needs to deal with different data layouts and formats
    • Needs to adapt to resource availability
FREERIDE Overview
  • FRamework for Rapid Implementation of Datamining Engines
  • Demonstrated for a variety of standard mining algorithms
  • Targeted distributed-memory parallelism, shared-memory parallelism, and their combination
  • Can be used as basis for scalable grid-based data mining implementations
  • Published in SDM 01, SDM 02, SDM 03, Sigmetrics 02, Europar 02, IPDPS 03, IEEE TKDE (to appear)
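FREERIDE structures mining algorithms around a generalized reduction: each node reduces its local data chunk into a small reduction object, and the objects are then combined globally. A toy sketch of that pattern, computing a distributed mean (the function names are illustrative, not FREERIDE's actual API):

```python
def local_reduction(chunk):
    """Per-node pass over local data: accumulate a small reduction object."""
    return (len(chunk), sum(chunk))        # (count, sum)

def global_combination(partials):
    """Combine the per-node reduction objects into the final result."""
    total_n = sum(n for n, _ in partials)
    total_s = sum(s for _, s in partials)
    return total_s / total_n

chunks = [[1.0, 2.0], [3.0, 4.0], [5.0]]   # dataset partitioned across 3 nodes
mean = global_combination([local_reduction(c) for c in chunks])
# mean == 3.0
```

Because the reduction object is small relative to the data, only these objects cross the network, which is what makes the pattern work for both distributed-memory and shared-memory parallelism.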
FREERIDE-G
  • Data processing may not be feasible where the data resides
  • Need to identify resources for data processing
  • Need to abstract data retrieval, movement and parallel processing
Group Members
  • Ph.D students
    • Liang Chen
    • Leo Glimcher
    • Kaushik Sinha
    • Li Weng
    • Xuan Zhang
    • Qian Zhu
  • Recently Graduated
    • Ruoming Jin (Kent State)
    • Wei Du (Yahoo)
    • Xiaogang Li (Wi 06, AskJeeves)
Getting Involved
  • Talk to me
  • Most recent papers are available online
  • Sign up for my 888