
Supporting SQL-3 Aggregations on Grid-based Data Repositories






Presentation Transcript


  1. Supporting SQL-3 Aggregations on Grid-based Data Repositories Li Weng, Gagan Agrawal, Umit Catalyurek, Joel Saltz

  2. Scientific Data Analysis on Grid-based Data Repositories • Scientific data repositories • Large volume: gigabytes, terabytes, petabytes • Distributed datasets • Generated/collected by scientific simulations or instruments • Multi-dimensional datasets: dimension attributes, measure attributes • Scientific data analysis stages: Data Specification, Data Organization, Data Extraction, Data Movement, Data Analysis, Data Visualization

  3. Motivating Scientific Applications Data-driven applications from science, engineering, and biomedicine: • Oil Reservoir Management • Water Contamination Studies • Cancer Studies using MRI • Telepathology with Digitized Slides • Satellite Data Processing • Virtual Microscope • … [Figures: Oil Reservoir Management, Magnetic Resonance Imaging]

  4. Current Approaches • Databases • Relational model using SQL • Transaction properties: Atomicity, Consistency, Isolation, Durability • Good! But is it too heavyweight for read-mostly scientific data? • Manual implementation based on low-level datasets • Needs a detailed understanding of low-level formats • HDF5, NetCDF, etc. • Applications depend on each API's access behavior • No single established standard • BinX, BFD, DFDL • Machine-readable descriptions, but the application remains dependent on a specific layout

  5. Our Approach • Express the query & the computation declaratively against a virtual relational table view • A dataset in a complex, low-level layout can be abstracted as an SQL-3 table for scientists • Support basic SELECT queries for specifying the subset of interest • Data analysis on the subset of interest can be defined as an SQL-3 aggregate function on an SQL-3 relation

  6. Our Approach • Compilation-based system • Meta-data descriptor • Generates a data extraction service & a data aggregation service • A lightweight layer on top of the datasets • The STORM runtime middleware works in coordination with the generated services

  7. System Overview [Diagram: a SELECT query and a user-defined aggregate enter the query frontend; compiler analysis and code generation, driven by the meta-data descriptor, produce the extraction service and the aggregation service, which execute on STORM]

  8. Outline • Introduction • Motivation • System overview • System design and algorithms • Canonical query structure and an example • Compiler analysis and code generation • Design of a meta-data descriptor • Data extraction service & data aggregation service • Experimental results • Related work • Conclusions

  9. Canonical Query Structure
SELECT <attribute list>, <AGG_name(Dataset Name)>
FROM <Dataset Name>
WHERE <Expression>
GROUP BY <group-by attribute list>;

CREATE AGGREGATE <AGG_name> (
  BASETYPE = <Dataset Name>,
  SFUNC = <sfunc>,
  STYPE = <state_type>
  [, FINALFUNC = <ffunc>]
  [, INITCOND = <initial_condition>] );

CREATE FUNCTION <func> (<AGG_status>, <Dataset Name>)
RETURNS <rettype>
AS ' <SQL statement list> '
LANGUAGE SQL;
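The evaluation semantics behind this canonical aggregate can be sketched as a fold: the state-transition function (SFUNC) is applied to each tuple of a group starting from INITCOND, and an optional final function (FINALFUNC) post-processes the resulting state. A minimal Python stand-in (function names here are illustrative, not part of the system):

```python
# Sketch of SQL-3 user-defined aggregate evaluation:
# fold SFUNC over a group's tuples from INITCOND, then apply FINALFUNC.
def run_aggregate(tuples, sfunc, initcond, finalfunc=None):
    state = initcond
    for t in tuples:
        state = sfunc(state, t)          # per-tuple state transition
    return finalfunc(state) if finalfunc else state

# Example: a plain SUM expressed in this form.
total = run_aggregate([1, 2, 3], lambda s, t: s + t, 0)   # 6
```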

  10. Oil Reservoir Management (IPARS)
SELECT X, Y, Z, ipars_bypass_sum(IPARS)
FROM IPARS
WHERE REL in (0,5,10) AND TIME >= 1000 AND TIME <= 1200
GROUP BY X, Y, Z
HAVING ipars_bypass_sum(OIL)>0;

CREATE AGGREGATE ipars_bypass_sum (
  BASETYPE = IPARS,
  SFUNC = ipars_func,
  STYPE = int,
  INITCOND = '1' );

CREATE FUNCTION ipars_func(int, IPARS) RETURNS int AS '
  SELECT CASE
    WHEN $2.soil > 0.7 AND
         |/($2.oilx * $2.oilx + $2.oily * $2.oily + $2.oilz * $2.oilz) < 30.0
    THEN $1 & 1
    ELSE 0
  END;
' LANGUAGE SQL;

  11. Compiler Analysis and Code Generation • Transform the canonical query into two pipelined sub-queries. • Data Extraction Service:
TempDataset = SELECT <all attributes> FROM <Dataset Name> WHERE <Expression>;
• Data Aggregation Service:
SELECT <attribute list>, <AGG_name(Dataset Name)> FROM TempDataset GROUP BY <group-by attribute list>;
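The two pipelined services can be sketched in-memory (hypothetical stand-ins; the real generated services operate on file chunks over STORM): the extraction service applies the WHERE predicate to produce TempDataset, and the aggregation service folds the user-defined aggregate over each group.

```python
# Sketch of the two pipelined sub-queries as plain functions.
def extraction_service(dataset, predicate):
    """Apply the WHERE clause; the result plays the role of TempDataset."""
    return [t for t in dataset if predicate(t)]

def aggregation_service(temp, key, sfunc, init):
    """GROUP BY key, folding sfunc from init within each group."""
    groups = {}
    for t in temp:
        k = key(t)
        groups[k] = sfunc(groups.get(k, init), t)
    return groups

rows = [{"x": 1, "v": 2}, {"x": 1, "v": 3}, {"x": 2, "v": 4}]
temp = extraction_service(rows, lambda t: t["v"] > 2)
out = aggregation_service(temp, lambda t: t["x"],
                          lambda s, t: s + t["v"], 0)
# out == {1: 3, 2: 4}
```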

  12. Design of a Meta-data Descriptor -- An Example
• Oil Reservoir Management
• The dataset comprises several simulations (realizations) on the same grid.
• For each realization and each grid point, a number of attributes are stored.
• The dataset is stored on a 4-node cluster.
Component I: Dataset Schema Description
[IPARS] // {* Dataset schema name *}
REL = short int // {* Data type definition *}
TIME = int
X = float
Y = float
Z = float
SOIL = float
SGAS = float
Component II: Dataset Storage Description
[IparsData] // {* Dataset name *}
// {* Dataset schema for IparsData *}
DatasetDescription = IPARS
DIR[0] = osu0/ipars
DIR[1] = osu1/ipars
DIR[2] = osu2/ipars
DIR[3] = osu3/ipars

  13. An Example (cont.)
Component III: Dataset Layout Description
DATASET "IparsData" { // {* Name for Dataset *}
  DATATYPE { IPARS } // {* Schema for Dataset *}
  DATAINDEX { REL TIME }
  DATA { DATASET ipars1, DATASET ipars2 }
  DATASET "ipars1" {
    DATASPACE {
      LOOP GRID ($DIRID*100+1):(($DIRID+1)*100):1 { X Y Z }
    }
    DATA { $DIR[$DIRID]/COORDS $DIRID = 0:3:1 }
  } // {* end of DATASET "ipars1" *}
  DATASET "ipars2" {
    DATASPACE {
      LOOP TIME 1:500:1 {
        LOOP GRID ($DIRID*100+1):(($DIRID+1)*100):1 { SOIL SGAS }
      }
    }
    DATA { $DIR[$DIRID]/DATA$REL $REL = 0:3:1 $DIRID = 0:3:1 }
  } // {* end of DATASET "ipars2" *}
}
• Oil Reservoir Management
• The LOOP keyword captures the repetitive structure within a file.
• The grid has 4 partitions (0~3).
• "IparsData" comprises "ipars1" and "ipars2": "ipars1" describes the data files storing the spatial coordinates; "ipars2" specifies the data files storing the other attributes.
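Expanding the range variables in a DATA clause enumerates the concrete data files. A sketch (assumed semantics for the `start:stop:step` ranges, taken as inclusive) for "ipars2", where each ($REL, $DIRID) pair names one file, so the 0:3:1 x 0:3:1 ranges expand to 16 files:

```python
# Expand the DATA clause of DATASET "ipars2":
#   $DIR[$DIRID]/DATA$REL   with $REL = 0:3:1 and $DIRID = 0:3:1
dirs = ["osu0/ipars", "osu1/ipars", "osu2/ipars", "osu3/ipars"]  # DIR[0..3]
files = [f"{dirs[dirid]}/DATA{rel}"
         for rel in range(0, 4)       # $REL = 0:3:1, inclusive
         for dirid in range(0, 4)]    # $DIRID = 0:3:1, inclusive
# 4 realizations x 4 grid partitions -> 16 data files
```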

  14. Generate Data Extraction Service Using Meta-data • Aligned file chunks: {num_rows, {File_1, Offset_1, Num_Bytes_1}, {File_2, Offset_2, Num_Bytes_2}, ..., {File_m, Offset_m, Num_Bytes_m}} • Our tool parses the meta-data descriptor and generates function code. • At run time, the query provides parameters to the generated functions, which create the aligned file chunks.
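The aligned-file-chunk record above can be rendered as a small data structure (a hypothetical Python rendering; the generated code is C/C++): num_rows rows whose attributes are spread over m (file, offset, length) extents that the extraction service reads together.

```python
from dataclasses import dataclass, field

@dataclass
class Extent:
    """One (file, offset, num_bytes) triple of an aligned file chunk."""
    path: str
    offset: int
    num_bytes: int

@dataclass
class AlignedFileChunk:
    """num_rows rows, stored across m extents in different files."""
    num_rows: int
    extents: list = field(default_factory=list)

# Example: 100 rows whose coordinates and measures live in two files.
chunk = AlignedFileChunk(100, [Extent("osu0/ipars/COORDS", 0, 1200),
                               Extent("osu0/ipars/DATA0", 0, 800)])
```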

  15. Generate Data Aggregation Service 1. Aggregate function analysis • Projection push-down extracts only the data needed for a particular query and its aggregation.
TempDataset = SELECT <useful attributes> FROM <Dataset Name> WHERE <Expression>;
• For the IPARS application, only 7 of the 22 attributes are needed for the considered query; the volume of data to be retrieved and communicated is reduced by 66%. • For the TITAN application, 5 of the 8 attributes are needed, and the reduction is 38%.
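A back-of-the-envelope check of the push-down savings, under the simplifying assumption that all attributes occupy equal space (the TITAN figure matches this assumption; the quoted 66% for IPARS suggests its attributes are not all the same size):

```python
# Fraction of tuple width eliminated by keeping only the needed attributes,
# assuming equal-size attributes.
def reduction(total_attrs, needed_attrs):
    return 1 - needed_attrs / total_attrs

titan = reduction(8, 5)    # 0.375, i.e. the quoted ~38% for TITAN
```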

  16. Generate Data Aggregation Service 2. Aggregate function decomposition • The first step performs the computations applied to each tuple; the second step updates the aggregate status variable. • Replace the largest per-tuple expression with TempAttr. For IPARS, this further reduces the number of attributes from 7 to 4.
CREATE FUNCTION ipars_func(int, IPARS) RETURNS int AS '
  SELECT CASE WHEN $2.TempAttr THEN $1 & 1 ELSE 0 END;
' LANGUAGE SQL;
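The decomposition of ipars_func can be sketched as two Python functions (illustrative names; the slide's SQL is the authoritative form): step 1 evaluates the expensive per-tuple condition once, yielding the boolean TempAttr; step 2 is the cheap state transition on that boolean.

```python
import math

def temp_attr(t):
    """Step 1, per-tuple: soil saturation high and oil speed below 30."""
    speed = math.sqrt(t["oilx"]**2 + t["oily"]**2 + t["oilz"]**2)
    return t["soil"] > 0.7 and speed < 30.0

def state_update(state, tempattr):
    """Step 2, state transition: $1 & 1 when TempAttr holds, else 0."""
    return state & 1 if tempattr else 0
```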

  17. Generate Data Aggregation Service • Partition the subset of interest on the values of the group-by attributes when multiple client nodes serve as computing units. • Construct a hash table using the values of the group-by attributes as the hash key, and translate the SQL-3 aggregate function into imperative C/C++ code.
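The generated aggregation loop can be sketched as follows (a Python stand-in for the generated C/C++; the hash-based assignment of groups to client nodes is an assumed partitioning scheme): a hash table keyed by the group-by attribute values, with each node processing only the groups that hash to it.

```python
def aggregate(tuples, key_attrs, sfunc, init, node_id=0, num_nodes=1):
    """Hash-table group-by: fold sfunc per group, keeping only the
    groups assigned to this node by hashing the group-by key."""
    table = {}
    for t in tuples:
        k = tuple(t[a] for a in key_attrs)
        if hash(k) % num_nodes != node_id:   # group handled elsewhere
            continue
        table[k] = sfunc(table.get(k, init), t)
    return table

rows = [{"x": 1, "y": 1, "z": 1, "oil": 2},
        {"x": 1, "y": 1, "z": 1, "oil": 3}]
out = aggregate(rows, ("x", "y", "z"), lambda s, t: s + t["oil"], 0)
# out == {(1, 1, 1): 5}
```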

  18. Experimental Setup & Design A Linux cluster connected via switched Fast Ethernet; each node has a PIII 933 MHz CPU, 512 MB of main memory, and three 100 GB IDE disks. • Scalability test varying the number of nodes hosting data and performing the computations • Performance test as the amount of data to be processed increases • Comparison with hand-written code • The impact of the aggregation decomposition

  19. Experimental Results for IPARS • Scale the number of nodes hosting the data and the number of nodes performing the processing. • A 640 MB subset of interest is extracted by scanning 1.9 GB of data. • The execution times scale almost linearly. • The performance difference from hand-written code varies between 6% and 20%, with an average of 14%. • Aggregate decomposition reduces the difference to between 1% and 10%.

  20. Experimental Results for IPARS • Evaluate the system’s ability to scale to larger datasets. • Use 8 data source nodes and 8 client nodes. • The execution time stays proportional to the amount of data to be retrieved and processed.

  21. Experimental Results for TITAN • Scale the number of nodes hosting the data and the number of nodes performing the processing. • A 228 MB subset of interest is extracted by scanning 456 MB of data. • The execution times scale almost linearly. • The performance difference from hand-written code is 17%. • Aggregate decomposition reduces the difference to 6%.

  22. Experimental Results for TITAN • Evaluate the system’s ability to scale to larger datasets. • Use 8 data source nodes and 8 client nodes. • The execution time stays proportional to the amount of data to be retrieved and processed.

  23. Related Work • Parallel / distributed databases • Parallelization of SQL-based aggregation and reductions • Data cubes • External tables in Oracle • Reduction research in parallelizing compilers • Runtime strategies for supporting reductions in a distributed environment

  24. Conclusions • A compiler-based system for supporting SQL-3 aggregate functions and select queries with group-by on flat-file scientific datasets. • Both the extraction of the subset of interest and the aggregate computation can be expressed declaratively. • Using a meta-data descriptor to represent the dataset layout, our compiler generates an efficient data extraction service. • The compiler analyzes the user-defined aggregate function and generates code for a parallel environment.
