1 / 21

Supporting User Defined Subsetting and Aggregation over Parallel NetCDF Datasets

Supporting User Defined Subsetting and Aggregation over Parallel NetCDF Datasets. Yu Su and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University. CCGrid 2012, Ottawa, Canada. Outline. Motivation and Introduction Background System Overview Experiment

magar
Download Presentation

Supporting User Defined Subsetting and Aggregation over Parallel NetCDF Datasets

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Supporting User Defined Subsetting and Aggregation over Parallel NetCDF Datasets Yu Su and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University CCGrid 2012, Ottawa, Canada

  2. Outline Motivation and Introduction Background System Overview Experiment Conclusion

  3. Motivation • Science become increasingly data driven • Strong desire for efficient data analysis • Challenges • Data sizes grow rapidly • Slow IO and Network Bandwidth • An example • Different kinds of subsetting requests • Different scientific data formats

  4. An Example 7.4 days! • GCRM (Global Cloud Resolving Model) • A global atmospheric circulation model

  5. Client-side vs. Sever-side subsetting and aggregation Simple Request Advanced Request

  6. Data Virtualization • Support SQL queries over scientific dataset • Standard • Flexible • Keep data in native format(etc. NetCDF, HDF5) • Compare with other scientific data management tools • SciDB: support for data arrays in parallel • OPeNDAP: no flexible subsetting and aggregation

  7. Our Approach • User-defined subsetting and aggregations • Subsetting: Dimensions, Coordinates, Variables • Aggregation: SUM, AVG, COUNT, MAX, MIN • Support NetCDF data format • Developed by UCAR • Widely used in climate simulation • Parallel Data Access • Data Partition Strategy • Different Parallel Level

  8. Background - NetCDF Metadata Actual value stored in m-d array Time = 1 to 3 Y = 1 to 4 X = 1 to 4

  9. System Architecture Physical Metadata Logical Metadata Parse the SQL expression Parse the metadata file Generate Query Request Partition Criteria: Subsetting: Disk Access Aggregation: Data Transfer Read Data Post-filter data Local Data Aggregation

  10. Data Aggregation • SQL: SELECT SUM(pressure) FROM GCRM Slave Processes Master Process

  11. Data Parallelism Level 1: data file (2 < 12?) Level 2: variable (5 < 12?) Level 3: data block (12)

  12. Experiment Goals • To compare the functionality and performance of our system with OPeNDAP • OPeNDAP makes local data accessible to remote locations regardless of local storage format. • Data Translation Mechanism • No flexible subsetting and aggregation support • To evaluate the parallel scalability of our system • To show how aggregation queries reduce the data transfer cost.

  13. Compare with OPeNDAP for Type 1 Queries • Data size: 4GB • Input: 50 SQL queries • Query Type: queries only include dimensions • Object: • Baseline: NetCDF query time • Our system without parallelism • OPeNDAP • Relative Speedup: 2.34 – 3.10

  14. Compare with OPeNDAP for Type 2, Type 3 Queries • Data size: 4GB • Input: 50 SQL queries • Query Type: queries include coordinates and variables • Object: • Baseline • Our system without parallelism • OPeNDAP + Filter • Relative Speedup: 1.58 – 3.47

  15. Parallel Optimization – Different Data Size • Data size: 4GB – 32GB • Process number: 1 to 16 • Input: select the whole variable • Relative Speedup: • 4 procs: 2.17 – 2.87 • 8 procs: 4.06 – 5.54 • 16 procs: 7.23 – 9.33

  16. Parallel Optimization – Different Queries • Data size: 32GB • Processes number: 1 to16 • Input: 100 SQL queries • Query Type: queries include dimensions, coordinates and variables • Relative Speedup: • 4 procs: 2.20 – 2.92 • 8 procs: 3.95 – 4.21 • 16 procs: 7.25 – 7.74

  17. Data Aggregation - Time • Data size: 16GB • Process number: 1 - 16 • Input: 60 aggregation queries • Query Type: • Only Agg • Agg + Group by + Having • Agg + Group by • Relative Speedup: • 4 procs: 2.61 – 3.08 • 8 procs: 4.31 – 5.52 • 16 procs: 6.65 – 9.54

  18. Data Aggregation – Data Transfer Amount • Data size: 16GB • Process number: 1 - 16 • Input: 60 aggregation queries • Query Type: • Only Agg • Agg + Group by + Having • Agg + Group by

  19. Conclusion Data sizes increase in a fast speed Goal: Find exact data subset as user specifies Data virtualization on top of NetCDF dataset Query request partition and parallel processing A good speedup compared with OPeNDAP

  20. Thanks

  21. Pre-filter Module Phase 1 Phase 2 Phase 3 Dataset Storage Metadata Dataset Logical Metadata Request Partition Strategy

More Related