
Tools and Techniques for the Data Grid

Explore the tools and techniques for efficient management, analysis, and visualization of large-scale scientific data generated by simulations or instruments. Learn about data movement, organization, extraction, and analysis in grid-based data repositories.


Presentation Transcript


  1. Tools and Techniques for the Data Grid Gagan Agrawal

  2. Scientific Data Analysis on Grid-based Data Repositories • Scientific data repositories • Large volume: gigabytes, terabytes, petabytes • Distributed datasets • Generated/collected by scientific simulations or instruments • Data could be streaming in nature • Stages of scientific data analysis: Data Specification, Data Organization, Data Extraction, Data Movement, Data Analysis, Data Visualization

  3. Opportunities • Scientific simulations and data collection instruments generating large scale data • Grid standards enabling sharing of data • Rapidly increasing wide-area bandwidths

  4. Motivating Scientific Applications • Data-driven applications from science, engineering, and biomedicine: • Oil Reservoir Management • Water Contamination Studies • Cancer Studies using MRI (Magnetic Resonance Imaging) • Telepathology with Digitized Slides • Satellite Data Processing • Virtual Microscope • …

  5. Existing Efforts • Data grids recognized as important component of grid/distributed computing • Major topics • Efficient/Secure Data Movement • Replica Selection • Metadata catalogs / Metadata services • Setting up workflows

  6. Open Issues • Accessing / Retrieving / Processing data from scientific repositories • Need to deal with low-level formats • Integrating tools and services having/requiring data with different formats • Support for processing streaming data in a distributed environment • Efficient distributed data-intensive applications • Developing scalable data analysis applications

  7. Ongoing Projects • Automatic Data Virtualization • On-the-fly information integration in a distributed environment • Middleware for Processing Streaming Data • Supporting coarse-grained pipelined parallelism • Compiling XQuery on Scientific and Streaming Data • Middleware and Algorithms for Scalable Data Mining

  8. Outline • Automatic Data Virtualization • Relational/SQL • XML/XQuery based • Information Integration • Middleware for Streaming Data • Coarse-grained pipelined parallelism

  9. Automatic Data Virtualization: Motivation • Emergence of grid-based data repositories • Can enable sharing of data in an unprecedented way • Access mechanisms for remote repositories • Complex low-level formats make accessing and processing of data difficult • Main desired functionality • Ability to select, download, and process a subset of data

  10. Current Approaches • Databases • Relational model using SQL • ACID transaction properties: Atomicity, Consistency, Isolation, Durability • Good! But is it too heavyweight for read-mostly scientific data? • Manual implementation based on low-level datasets • Need detailed understanding of low-level formats • HDF5, NetCDF, etc. • No single established standard • BinX, BFD, DFDL • Machine-readable descriptions, but the application is dependent on a specific layout

  11. Data Virtualization [diagram: a Data Service exposes an abstract view of a dataset through a Data Virtualization layer] • By the Global Grid Forum's DAIS working group: • A Data Virtualization describes an abstract view of data. • A Data Service implements the mechanism to access and process data through the Data Virtualization.

  12. Our Approach: Automatic Data Virtualization • Automatically create data services • A new application of compiler technology • A meta-data descriptor describes the layout of data on a repository • An abstract view is exposed to the users • Two implementations: • Relational/SQL-based • XML/XQuery-based

  13. System Overview [diagram: a Select Query and the Meta-data Descriptor feed the Query frontend; Compiler Analysis and Code Generation produce the STORM Extraction Service, whose output feeds an Aggregation Service applying a User-Defined Aggregate]

  14. Design a Meta-data Description Language • Requirements • Specify the relationship of a dataset to the virtual dataset schema • Describe the dataset physical layout within a file • Describe the dataset distribution on nodes of one or more clusters • Specify the subsetting index attributes • Easy to use for data repository administrators and also convenient for our code generation

  15. Design Overview • Dataset Schema Description Component • Dataset Storage Description Component • Dataset Layout Description Component

  16. An Example • Oil Reservoir Management • The dataset comprises several simulations (realizations) on the same grid • For each realization and each grid point, a number of attributes are stored • The dataset is stored on a 4-node cluster

Component I: Dataset Schema Description

    [IPARS]              // {* Dataset schema name *}
    REL = short int      // {* Data type definition *}
    TIME = int
    X = float
    Y = float
    Z = float
    SOIL = float
    SGAS = float

Component II: Dataset Storage Description

    [IparsData]                  //{* Dataset name *}
    DatasetDescription = IPARS   //{* Dataset schema for IparsData *}
    DIR[0] = osu0/ipars
    DIR[1] = osu1/ipars
    DIR[2] = osu2/ipars
    DIR[3] = osu3/ipars
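To make the descriptor concrete, here is a minimal sketch of how such a descriptor could be parsed into an in-memory form for code generation. It assumes only the section/key=value layout and the {* ... *} comment syntax shown above; the function and variable names are illustrative, not part of the actual system.

    import re

    def parse_descriptor(text):
        """Parse a descriptor into {section_name: {key: value}} dictionaries."""
        sections, current = {}, None
        for line in text.splitlines():
            line = re.sub(r'//\s*\{\*.*?\*\}', '', line).strip()  # drop {* ... *} comments
            if not line:
                continue
            m = re.match(r'\[(\w+)\]', line)
            if m:                                     # section header, e.g. [IPARS]
                current = sections[m.group(1)] = {}
            elif '=' in line and current is not None:
                key, value = (s.strip() for s in line.split('=', 1))
                current[key] = value                  # e.g. 'X' -> 'float', 'DIR[0]' -> 'osu0/ipars'
        return sections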

  17. Compiler Analysis

    Data_Extract {
        Find_File_Groups()
        Process_File_Groups()
    }

    Find_File_Groups {
        Let S be the set of files that match against the query
        Classify files in S by the set of attributes they have
        Let S1, ..., Sm be the m sets
        T = Ø
        foreach {s1, ..., sm}, si ∈ Si {    {* cartesian product between S1, ..., Sm *}
            if the values of implicit attributes are not inconsistent {
                T = T ∪ {s1, ..., sm}
            }
        }
        Output T
    }

    Process_File_Groups {
        foreach {s1, ..., sm} ∈ T {
            Find_Aligned_File_Chunks()
            Supply implicit attributes for each file chunk
            foreach aligned file chunk {
                Check against index
                Compute offset and length
                Output the aligned file chunk
            }
        }
    }

Driven by the meta-data descriptor, the compiler creates and processes Aligned File Chunks (AFCs) and generates the index and extraction function code.
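A rough Python rendering of the file-grouping step above, for concreteness. It assumes the candidate files have already been classified into the attribute sets S1, ..., Sm, and that an implicit_attrs(f) helper returns the implicit attribute values of a file; all names here are illustrative, not taken from the actual compiler.

    from itertools import product

    def find_file_groups(attribute_sets, implicit_attrs):
        """attribute_sets: [S1, ..., Sm], one list of files per attribute set."""
        groups = []
        for combo in product(*attribute_sets):     # cartesian product of S1, ..., Sm
            # Keep the group only if implicit attribute values are consistent:
            # no attribute may take two different values across the group.
            merged, consistent = {}, True
            for f in combo:
                for k, v in implicit_attrs(f).items():
                    if merged.setdefault(k, v) != v:
                        consistent = False
                        break
                if not consistent:
                    break
            if consistent:
                groups.append(combo)
        return groups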

  18. Testing the Code Generation Tool • Application: Oil Reservoir Management • The generated code correctly and efficiently handles a variety of different layouts for the same data • The performance difference relative to Layout 0 is within 4%–10%

  19. Evaluating the Scalability of Our Tool • Scale the number of nodes hosting the Oil Reservoir Management dataset • Extract a subset of interest, 1.3 GB in size • The execution times scale almost linearly • The performance difference varies between 5% and 34%, with an average difference of 16%

  20. Comparison with an Existing Database (PostgreSQL) • 6 GB of data for satellite data processing; the total storage required after loading the data into PostgreSQL is 18 GB • Indexes were created on both the spatial coordinates and S1 in PostgreSQL • No special performance tuning was applied for the experiment

  21. Outline • Automatic Data Virtualization • Relational/SQL • XML/XQuery based • Information Integration • Middleware for Streaming Data • Coarse-grained pipelined parallelism

  22. XML/XQuery Implementation [diagram: XQuery queries posed against an XML view of data stored in low-level formats such as HDF5, NetCDF, flat text, and relational databases]

  23. Programming/Query Language • High-level declarative languages ease application development • Popularity of Matlab for scientific computations • New challenges in compiling them for efficient execution • XQuery is a high-level language for processing XML datasets • Derived from database, declarative, and functional languages! • XPath (a subset of XQuery) embedded in an imperative language is another option

  24. Approach / Contributions • Use of XML Schemas to provide high-level abstractions on complex datasets • Using XQuery with these Schemas to specify processing • Issues in Translation • High-level to low-level code • Data-centric transformations for locality in low-level codes • Issues specific to XQuery • Recognizing recursive reductions • Type inferencing and translation

  25. System Architecture [diagram: an External Schema and an XML Mapping Service relate the logical XML schema to the physical XML schema; the Compiler translates XQuery sources into C++/C]

  26. Using a High-level Schema • High-level view of the dataset – a simple collection of pixels • Latitude, longitude, and time explicitly stored with each pixel • Easy to specify processing • Don’t care about locality / unnecessary scans • At least one order of magnitude overhead in storage • Suitable as a logical format only

  27. XQuery Overview • XQuery: a language for querying and processing XML documents • Functional language • Single assignment • Strongly typed • XQuery expressions: • for/let/where/return (FLWR) • unordered • path expressions

    unordered(
      for $d in document("depts.xml")//deptno
      let $e := document("emps.xml")//emp[deptno = $d]
      where count($e) >= 10
      return <big-dept>
               { $d,
                 <count> { count($e) } </count>,
                 <avg> { avg($e/salary) } </avg> }
             </big-dept>
    )

  28. Satellite: XQuery Code

    unordered(
      for $i in ($minx to $maxx)
      for $j in ($miny to $maxy)
      let $p := document("sate.xml")/data/pixel[lat = $i and long = $j]
      return <pixel>
               <latitude> { $i } </latitude>
               <longitude> { $j } </longitude>
               <sum> { accumulate($p) } </sum>
             </pixel>
    )

    define function accumulate($p) as double {
      if (empty($p)) then 0
      else
        let $inp := item-at($p, 1)
        let $NVDI := (($inp/band1 - $inp/band0) div ($inp/band1 + $inp/band0) + 1) * 512
        return max(($NVDI, accumulate(subsequence($p, 2))))
    }

  29. Challenges • Need to translate to the low-level schema • Focus on correctness and avoiding unnecessary reads • Enhancing locality • Data-centric execution of XQuery constructs • Use information on the low-level data layout • Issues specific to XQuery • Reductions expressed as recursive functions • Generating code in an imperative language • For either direct compilation or for use as part of a runtime system • Requires type conversion
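The satellite code on slide 28 illustrates these issues: accumulate is really a reduction (a max over the matching pixels), and the query as written revisits the document once per (i, j) cell. A minimal Python sketch of the kind of data-centric loop such a translation aims for, scanning the pixel data once while updating per-cell maxima; the record layout and names are assumptions for illustration, and the compiler's actual output is C++/C rather than Python.

    def satellite_reduction(pixels, minx, maxx, miny, maxy):
        """One pass over the data, instead of one recursive traversal per cell."""
        out = {}                          # (lat, long) -> running max of NVDI
        for p in pixels:                  # p: dict with 'lat', 'long', 'band0', 'band1'
            i, j = p['lat'], p['long']
            if minx <= i <= maxx and miny <= j <= maxy:
                nvdi = ((p['band1'] - p['band0']) / (p['band1'] + p['band0']) + 1) * 512
                out[(i, j)] = max(out.get((i, j), 0), nvdi)
        return out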

  30. Outline • Automatic Data Virtualization • Relational/SQL • XML/XQuery based • Information Integration • Middleware for Streaming Data • Coarse-grained pipelined parallelism

  31. Introduction :Information Integration • Goal: to provide a uniform access/query interface to multiple heterogeneous sources. • Challenges: • Global schema • Query optimization • Resource discovery • Ontology discrepancy • etc.

  32. Introduction: Wrapper • Goal: to provide the integration system transparent access to data sources • Challenges • Development cost • Performance • Transportability

  33. Roadmap • Introduction • System Overview • Meta Data Description Language • Wrapper Generation • Conclusion

  34. Overview: Main Components • User’s view of the data – Meta data description language • Mapping between input and output schema • Schema mapping • Parse inputs and generate outputs • DataReader and DataWriter

  35. System Overview [diagram: the Parser turns the Meta Data Descriptor into an Internal Data Entry Representation; the Mapping Generator derives the Schema Mapping; the Code Generator then emits the DataReader, Integrator, and DataWriter that move data from the Source Dataset to the Target Dataset]

  36. Meta Data Descriptor (1) • Design Goals: • Easy to interpret and process • Easy to write • Sufficiently expressive • Added features (for bioinformatics datasets): • Strings with no fixed size • Delimiters are used for separating fields • Fields may be divided into lines/variables • Total number of items unknown

  37. Meta Data Descriptor (2) • Component I: Schema Description

    [FASTA]                   // schema name
    ID = string               // data field name = data type
    DESCRIPTION = string
    SEQ = string

  38. Meta Data Descriptor (3) • Component II: Layout Description

    DATASET "FASTAData" {             // dataset name
      DATATYPE { FASTA }              // schema name
      DATASPACE LINESIZE=80 {
        LOOP ENTRY 1:EOF:1 {
          ">" ID " " DESCRIPTION
          < "\n" SEQ > "\n" | EOF
        }
      }
      DATA { osu/fasta }              // file location
    }

Example file layout:

    >Example1 envelope protein
    ELRLRYCAPAGFALLKCNDA
    DYDGFKTNCSNVSVVHCTNL
    MNTTVTTGLLLNGSYSENRT
    QIWQKHRTSNDSALILLNKH
    >Example2 synthetic peptide
    HITREPLKHIPKERYRGTNDT…

  39. Meta Data Descriptor (4) • Input: SWISSPROT data

    LOOP ENTRY 1:EOF:1 {
      "ID" ID
      <"\nAC" AC_NO>
      LOOP I 1:3:1 { "\nDT" DATE }
      <"\nDE" DESCRIPTION>
      <"\nGN" GENE_NAME>
      <"\nOS" ORG_SPEC>
      ["\nOG" ORGANELLE]
      <"\nOC" ORG_CLASS>
      …
      <"\nSQ SEQUENCE" LENGTH "AA;" MOL_WT "MW;" CRC "CRC64;">
      <"\n" SEQ> "\n//\n" | EOF
    }

Example entry:

    ID   CRAM_CRAAB
    AC   P01542;
    DT   21-JUL-1986 (Rel. 01, Created)
    DT   21-JUL-1986 (Rel. 01, Last sequence update)
    DT   01-NOV-1997 (Rel. 35, Last annotation update)
    DE   CRAMBIN.
    GN   THI2.
    OS   Crambe abyssinica (Abyssinian crambe).
    OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
    OC   euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons;
    OC   core eudicots; Rosidae; eurosids II; Brassicales; Brassicaceae;
    OC   Crambe.
    …
    SQ   SEQUENCE 46 AA; 4736 MW; F6ADE458 CRC64;
    >P1;CRAM_CRAAB
    TTCCPSIVAR SNFNVCRLPG TPEAICATYT GCIIIPGATC PGDYAN \\
    …

  40. Wrapper Generation: Mapping Generator • Goal: generate the schema mapping from the schema descriptors • Criteria: strict name matching

    [SWISSPROT]:[FASTA]          // [input schema]:[output schema]
    ID:ID                        // source field : target field
    DESCRIPTION:DESCRIPTION      // "from SWISSPROT"
    SEQ:SEQ
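A minimal sketch of strict name matching over two parsed schemas (in the dictionary form produced by the earlier descriptor-parsing sketch); the function name and error handling are illustrative assumptions.

    def generate_mapping(input_schema, output_schema):
        """Map each target field to the source field with the identical name.
        Both arguments are {field_name: data_type} dictionaries."""
        mapping = {}
        for field in output_schema:
            if field not in input_schema:        # strict name matching only
                raise KeyError(f"no source field for target field {field!r}")
            mapping[field] = field               # e.g. 'SEQ' -> 'SEQ'
        return mapping

    # generate_mapping({'ID': 'string', 'AC_NO': 'string',
    #                   'DESCRIPTION': 'string', 'SEQ': 'string'},
    #                  {'ID': 'string', 'DESCRIPTION': 'string', 'SEQ': 'string'})
    # -> {'ID': 'ID', 'DESCRIPTION': 'DESCRIPTION', 'SEQ': 'SEQ'}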

  41. Wrapper Generation: Parser • Key observation: data is stored in an entry-wise manner

    LOOP ENTRY 1:EOF:1 {
      … single data entry …
    }

• Each entry is made of delimiter-variable pairs, with "environment" symbols in between

  42. Wrapper Generation: Parse Tree

    LOOP ENTRY 1:EOF:1 {
      ">" ID " " DESCRIPTION
      < "\n" SEQ > "\n" | EOF
    }

[diagram: the parse tree for one data entry, with children ">"-ID, " "-DESCRIPTION, an optional group containing "\n"-SEQ, and "\n"-DUMMY|EOF]
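For concreteness, a tiny interpreter for a flat sequence of delimiter-variable (DLM-VAR) pairs, which is the core of what the parse tree above encodes (optional groups and the EOF alternative are omitted); all names are illustrative.

    def parse_entry(text, dlm_var_pairs):
        """Split one entry using ordered (delimiter, variable) pairs; each
        variable's value runs up to the next pair's delimiter."""
        values, pos = {}, 0
        for i, (dlm, var) in enumerate(dlm_var_pairs):
            start = text.index(dlm, pos) + len(dlm)           # skip this delimiter
            if i + 1 < len(dlm_var_pairs):
                end = text.index(dlm_var_pairs[i + 1][0], start)
            else:
                end = len(text)                               # last variable: rest of entry
            values[var] = text[start:end]
            pos = end
        return values

    # FASTA entry: '>' ID ' ' DESCRIPTION '\n' SEQ
    entry = ">Example1 envelope protein\nELRLRYCAPAGFALLKCNDA"
    print(parse_entry(entry, [(">", "ID"), (" ", "DESCRIPTION"), ("\n", "SEQ")]))
    # {'ID': 'Example1', 'DESCRIPTION': 'envelope protein', 'SEQ': 'ELRLRYCAPAGFALLKCNDA'}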

  43. Wrapper Generation: Code Generator • Creates two application-specific modules • DataReader: • scans the source data file • locates each DLM-VAR pair • submits each variable required by the target, together with its order • DataWriter: • takes in a variable and its order • looks up the corresponding DLM-VAR pair • checks the line size • writes the target file
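A toy illustration of the two generated modules for the FASTA layout of slide 38: a reader that scans entries into field/value records and a writer that emits them in FASTA form, honoring LINESIZE=80. This is a hand-written sketch of the idea under those assumptions, not the generator's actual output.

    def data_reader(lines):
        """Scan a FASTA source: '>' ID ' ' DESCRIPTION, then SEQ lines."""
        entry = None
        for line in lines:
            line = line.rstrip('\n')
            if line.startswith('>'):               # '>' delimiter opens a new entry
                if entry:
                    yield entry
                ident, _, desc = line[1:].partition(' ')
                entry = {'ID': ident, 'DESCRIPTION': desc, 'SEQ': ''}
            elif entry is not None:
                entry['SEQ'] += line               # sequence continuation lines
        if entry:
            yield entry

    def data_writer(entries, out, linesize=80):
        """Write entries back out in the FASTA layout."""
        for e in entries:
            out.write(f">{e['ID']} {e['DESCRIPTION']}\n")
            for i in range(0, len(e['SEQ']), linesize):   # enforce LINESIZE
                out.write(e['SEQ'][i:i + linesize] + '\n')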

  44. Outline • Automatic Data Virtualization • Relational/SQL • XML/XQuery based • Information Integration • Middleware for Streaming Data • Coarse-grained pipelined parallelism

  45. Streaming Data Model • Continuous data arrival and processing • Emerging model for data processing • Sources that produce data continuously: sensors, long running simulations • WAN bandwidths growing faster than disk bandwidths • Active topic in many computer science communities • Databases • Data Mining • Networking ….

  46. Summary/Limitations of Current Work • Focus on • centralized processing of stream from a single source (databases, data mining) • communication only (networking) • Many applications involve • distributed processing of streams • streams from multiple sources

  47. Motivating Application: Network Fault Management System [diagram: a switch network streaming monitoring data to a fault management system, with an X marking a failure]

  48. Motivating Application (2) Computer Vision Based Surveillance

  49. Features of Distributed Stream Processing Applications • Data sources may be distributed • Over a WAN • Continuous data arrival • Enormous volume • Probably can't communicate it all to one site • Results from analysis may be desired at multiple sites • Real-time constraints • A real-time, high-throughput, distributed processing problem (see the pipeline sketch below)
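The outline's "coarse-grained pipelined parallelism" refers to one way of attacking such a problem: running the successive processing stages of the application concurrently, each consuming the previous stage's output stream. A minimal Python sketch of that execution model using threads and queues; the stage functions are placeholders, and this is only meant to convey the idea, not the actual middleware.

    import threading, queue

    DONE = object()                       # sentinel marking end of stream

    def run_stage(work, inq, outq):
        """One pipeline stage: apply `work` to each arriving item."""
        while (item := inq.get()) is not DONE:
            outq.put(work(item))
        outq.put(DONE)

    def run_pipeline(source, stages):
        """Chain stages with queues; each stage runs in its own thread, so all
        stages work on different items of the stream concurrently."""
        first = q = queue.Queue()
        for work in stages:
            nq = queue.Queue()
            threading.Thread(target=run_stage, args=(work, q, nq), daemon=True).start()
            q = nq
        for item in source:               # feed the stream into the first stage
            first.put(item)
        first.put(DONE)
        while (item := q.get()) is not DONE:
            yield item

    # Example: two placeholder stages (e.g. filter, then feature extraction).
    print(list(run_pipeline(range(5), [lambda x: x * 2, lambda x: x + 1])))
    # [1, 3, 5, 7, 9]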

  50. Motivation: Challenges & Possible Solutions • Challenge 1: Data-, Communication-, and/or Compute-Intensive [diagram: the switch network with a marked fault]
