Downloading the World: Middleware for Computational Science Bill Howe, David Maier Datalab @ Portland State University Work sponsored by NSF ITR Program 2001-2006 with thanks to Antonio Baptista, Paul Turner, and the entire CORIE Environmental Science Team at OGI
Biology • Old way: • Wet lab chemistry • New way: • Microarray • Search GenBank, Ensembl, GDB, SwissProt, Entrez using BLAST, FASTA, GCG, EMBOSS
Astronomy • Old way: • Sign up for telescope time • New way: • Sloan Digital Sky Survey • Systematically mapping ¼ of the entire sky • 12 TB to date, 15 TB final in 2007
Particle Physics • Old way: • individual experiments for individual hypotheses • New way: • Simulation and massive data acquisition from particle colliders • CMS (LHC): 15PB / year (2007) • BaBar (SLAC): approaching 1PB • 2/3 is simulation data
Oceanography • Old way: • Field work • Simplified Analytics • New way: • Finite Element Analysis • In situ sensors • CODAR
Science is Changing • Old Science: “Query the world” • Hypothesis-driven observations and experiments • Calculations on toy problems • Data acquisition is the dominant cost • New Science: “Download the world” • Telescopes, satellites, clusters of fast computers lower the cost of data acquisition • Store everything now, formulate hypotheses later • Data analysis is the dominant cost
Our Focus • Claim: Simulation plays an increasingly important role in the physical sciences • Claim: Data Products are the currency of scientific communication • Goal: middleware, dubbed GridFields, for deriving data products from simulation results • [Diagram: Science, Computational Science]
Outline • Introduction • Data Product Examples • Existing Tools • Logical Model • Operators • Optimization
1994 Northridge aftershock, San Fernando valley http://www.cs.cmu.edu/~quake/quakeviz.html Animations created by Greg Foss, Pittsburgh Supercomputing Center and CMU Quake project, from synthetic data generated by the CMU Quake project
Data Sources Data Products • Results of FEA and other methods • Datasets defined on a grid (synonymously, a mesh) • Potentially distributed across files, servers, the Internet • Heterogeneous representations • netCDF, custom formats, RDBMS
Data Sources Data Products • Visualizations or derived datasets • Involving: • Slices, • Aggregations, • Compositions, • etc.
Visualization Tools? • [Diagram: local-file datasets D0, D1, …, Dn, each feeding a visualization]
Example: VTK Library • We want: [figure] • VTK: vtkExtractGeometry, vtkThreshold, vtkExtractGrid, vtkExtractVOI, vtkThresholdPoints • Different C++ classes, each dependent on data characteristics • Changes to data characteristics mean changes to plan • Logical equivalences are obscured…
Exposing Equivalences for Algebraic Optimization • [Figure: two plans, A and B, each composed of a cross and a restrict] • B is equivalent to A, but less costly to compute • Implies ‘Cross’ and ‘Restrict’ should be commutative • Difficult to determine which vtk operations, if any, have this property (“Documentational Semantics”)
Example: VTK Library • Apply a function to a dataset? • vtkProgrammableAttributeFilter, vtkFunctionParser, others • Aggregate over depth? • no generalized aggregation filter • Construct multidimensional datasets? • grid construction is “manual”
Relational Databases? • Simulation data is primarily read-only, whereas RDBMSs are optimized for transaction processing • RDBMS requires more space for the same data • Impedance mismatch between RDBMS and visualization tools • Requires scientists to relinquish control of their data’s representation • Difficult to model general gridded datasets with relations
Goals • Allow convenient and efficient manipulation of gridded datasets • Model generalized gridded datasets • Leverage database techniques in this domain • Algebraic reasoning • Cost-based optimization • Cooperate with the scientific computing landscape • custom file formats • visualization tools
Grid • A grid is a set of cells, where each cell is assigned an integer dimension • Gi are the cells in G of dimension i • Cells are organized by an incidence relation I that respects dimension • x < y is read “x is incident to y” • Example grid E [figure: triangle A with edges x, y, z and nodes 2, 3, 4]: E0 = {2,3,4}, E1 = {x,y,z}, E2 = {A}, I = [pairs shown in figure] …plus the transitive closure
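The grid definition above can be sketched in a few lines of Python. This is only an illustration, not the GridFields implementation: the `Grid` class, its method names, and the choice of which nodes bound which edge of the triangle are assumptions, since the slide's figure is not recoverable.

```python
def transitive_closure(pairs):
    """Return the transitive closure of a set of (a, b) pairs."""
    closure = set(pairs)
    while True:
        extra = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if extra <= closure:
            return closure
        closure |= extra


class Grid:
    """Hypothetical grid: cells with integer dimensions plus an incidence relation."""

    def __init__(self, dims, incidence):
        self.dims = dict(dims)                      # cell name -> dimension
        self.incidence = transitive_closure(incidence)
        for a, b in self.incidence:                 # incidence must respect dimension
            if self.dims[a] > self.dims[b]:
                raise ValueError(f"{a} < {b} violates dimension order")

    def cells(self, i):
        """The cells of dimension i (G_i on the slide)."""
        return {c for c, d in self.dims.items() if d == i}

    def is_incident(self, a, b):
        """True when a < b, read 'a is incident to b'."""
        return (a, b) in self.incidence


# Triangle E from the slide; the node-to-edge assignment is assumed.
E = Grid(
    dims={"2": 0, "3": 0, "4": 0, "x": 1, "y": 1, "z": 1, "A": 2},
    incidence={
        ("2", "x"), ("3", "x"), ("3", "y"), ("4", "y"), ("4", "z"), ("2", "z"),
        ("x", "A"), ("y", "A"), ("z", "A"),
    },
)
```

Only node-edge and edge-face pairs are listed explicitly; the node-face pairs such as 2 < A come from the transitive closure, as the slide notes.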
Grid Operations • Union • Intersection • Cross Product
Cross Product (1) • [Figure: the cross product of two grids]
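Cross Product (2) • [Figure: the triangle grid (nodes 2, 3, 4; edges x, y, z; face A) crossed with a one-edge grid (nodes 0, 1; edge w)] • Product cells are labeled by pairs: 20, 30, 40, 21, 31, 41; x0, y0, z0, x1, y1, z1, 2w, 3w, 4w; A0, A1, xw, yw, zw; Aw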
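The labels in the figure suggest how the product is built, and the sketch below follows that reading. It rests on assumptions not stated on the slide: each product cell is a pair of input cells, its dimension is the sum of the two input dimensions, and (a, b) is incident to (c, d) when a ≤ c and b ≤ d. The function name `cross` and the triangle's node-to-edge assignment are likewise hypothetical.

```python
def cross(dims_a, inc_a, dims_b, inc_b):
    """Cross product of two grids given as (name -> dimension, incidence pairs)."""

    def leq(inc, p, q):
        # Reflexive version of the incidence relation.
        return p == q or (p, q) in inc

    pairs = [(a, b) for a in dims_a for b in dims_b]
    # A product cell is named by concatenating its parts, e.g. "A" + "w" -> "Aw".
    dims = {a + b: dims_a[a] + dims_b[b] for a, b in pairs}
    inc = {
        (a + b, c + d)
        for (a, b) in pairs
        for (c, d) in pairs
        if (a, b) != (c, d) and leq(inc_a, a, c) and leq(inc_b, b, d)
    }
    return dims, inc


# Triangle grid (incidence already transitively closed; edge endpoints assumed).
tri_dims = {"2": 0, "3": 0, "4": 0, "x": 1, "y": 1, "z": 1, "A": 2}
tri_inc = {
    ("2", "x"), ("3", "x"), ("3", "y"), ("4", "y"), ("4", "z"), ("2", "z"),
    ("x", "A"), ("y", "A"), ("z", "A"), ("2", "A"), ("3", "A"), ("4", "A"),
}
# One-edge grid: nodes 0 and 1 joined by edge w.
seg_dims = {"0": 0, "1": 0, "w": 1}
seg_inc = {("0", "w"), ("1", "w")}

prism_dims, prism_inc = cross(tri_dims, tri_inc, seg_dims, seg_inc)
by_dim = {i: sorted(c for c, d in prism_dims.items() if d == i) for i in range(4)}
```

Under these assumptions the product has 21 cells (6 nodes, 9 edges, 5 faces, and the single 3-cell Aw), which matches the 21 labels visible in the figure.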
GridFields • GridField: A grid with attributes bound to its cells • A grid may have data bound to more than one dimension • A grid G is lifted into a GridField using the Bind operator: Bind(G, i, f), where f is a function mapping each cell in Gi to a value.
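As a rough illustration of Bind, the sketch below models a GridField as a pair of dictionaries: cell dimensions and per-cell attribute records. The representation, the attribute-name parameter, and the sample functions are all assumptions made for the example; only the shape Bind(G, i, f) comes from the slide.

```python
def bind(gridfield, i, name, f):
    """Return a new GridField with attribute `name` = f(cell) on every cell of dimension i."""
    dims, data = gridfield
    new_data = {cell: dict(attrs) for cell, attrs in data.items()}  # do not mutate the input
    for cell, d in dims.items():
        if d == i:
            new_data.setdefault(cell, {})[name] = f(cell)
    return dims, new_data


# A one-edge grid with no data bound yet.
dims = {"0": 0, "1": 0, "w": 1}
bare = (dims, {})

# Hypothetical attribute functions: a coordinate on nodes, a flux on the edge.
with_x = bind(bare, 0, "x", lambda cell: float(cell) * 10.0)
with_x_and_flux = bind(with_x, 1, "flux", lambda cell: 0.25)
```

The second call binds data to the 1-cells of a GridField that already carries data on its 0-cells, which is the "more than one dimension" case mentioned above.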
Constructing GridFields • [Figure, panels (a)–(e): grids H and V with 0-cell schemas H0:(), V0:(), H0:(x), V0:(y), combined using bind(0,x), bind(0,y), bind(0,x,y), and bind(1,f)]
GridField Operations • Scan : () → G(x,y) • Bind : G(x,y) → G(x,y,z) • Restrict : G(x,y,z) → G(x,y,z) • Regrid : G(x,y,z), F(a,b) → F(,,…) • Lifted grid operations: Union, Intersection, Cross Product
Example Query Plan • Inputs: H : (x,y,b) and V : (z) • Plan, bottom up: H ⊗ V • r(z>b) gives r(H ⊗ V) • b(s) gives b(r(H ⊗ V)) • r(region) gives r(b(r(H ⊗ V))) • render
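Algebraic Optimization • [Figure: two equivalent plans. In the first, H(x,y,b) and V(z) are crossed and then r(z>b), b(s), and r(region) are applied. In the second, the region restriction is split into r(x,y) applied to H(x,y,b) and r(z) applied to V(z) before the cross, followed by r(z>b) and b(s)] • *Howe, Maier, Algebraic Manipulation of Scientific Datasets. VLDB Journal, 14:4, 2005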
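The rewrite can be checked on a toy model. The sketch below is a deliberately flattened stand-in, not the GridField algebra: cells are plain dictionaries, grid topology and the b(s) step are ignored, and the region is assumed to be a box that factors into an (x,y) part and a z part. All names and sample values are hypothetical.

```python
from itertools import product


def cross(h, v):
    """Pair every cell of h with every cell of v, merging their attributes."""
    return [{**a, **b} for a, b in product(h, v)]


def restrict(pred, cells):
    """Keep only the cells that satisfy the predicate."""
    return [c for c in cells if pred(c)]


def in_xy(c):
    return 2 <= c["x"] <= 4 and 3 <= c["y"] <= 5


def in_z(c):
    return c["z"] <= 5


def above_bottom(c):
    return c["z"] > c["b"]


# Horizontal cells carry (x, y, b); vertical cells carry z. Values are made up.
H = [{"x": x, "y": y, "b": 2 + (x + y) % 3} for x in range(10) for y in range(10)]
V = [{"z": z} for z in range(8)]

# Plan A: cross first, then restrict.
full = cross(H, V)
plan_a = restrict(lambda c: in_xy(c) and in_z(c), restrict(above_bottom, full))

# Plan B: push the region restriction below the cross.
small = cross(restrict(in_xy, H), restrict(in_z, V))
plan_b = restrict(above_bottom, small)
```

Both plans return the same cells, but the intermediate product shrinks from 800 pairs to 54 in this example, which is the kind of saving the commutativity of cross and restrict makes available to an optimizer.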
Regrid Example 1 • Source: E0:(t) with values 12.6C, 13.1C, 13.2C, 12.8C, 12.5C, 12.1C • Assignment function: {12.6C, 13.1C, 13.2C} and {12.8C, 12.5C, 12.1C} • Aggregation function: 12.95C and 12.45C • Target: F0:(avgt)
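A minimal sketch of this example follows, assuming the assignment function maps each target cell to a list of source cells and the aggregation function is a plain arithmetic mean. The cell names (`e0`…`e5`, `f0`, `f1`) and the `regrid` signature are invented for illustration.

```python
from statistics import mean


def regrid(source, target_cells, assign, aggregate):
    """For each target cell, aggregate the source values assigned to it."""
    return {
        t: aggregate([source[s] for s in assign(t)])
        for t in target_cells
    }


# Source temperatures from the slide, bound to hypothetical 0-cells e0..e5.
E0 = {"e0": 12.6, "e1": 13.1, "e2": 13.2, "e3": 12.8, "e4": 12.5, "e5": 12.1}

# Assumed assignment: the first three source cells go to f0, the rest to f1.
assignment = {"f0": ["e0", "e1", "e2"], "f1": ["e3", "e4", "e5"]}

F0 = regrid(E0, ["f0", "f1"], assignment.__getitem__, mean)
avgt = {cell: round(value, 2) for cell, value in F0.items()}
```

With a plain mean the two groups come to roughly 12.97 and 12.47, so the 12.95C and 12.45C shown on the slide appear to be rounded or illustrative values, or the slide may use a different aggregation.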
Regrid Example 2: Estimate Gradient • Source GridField: G, with G0:(x,y,s) • Target GridField: G • assignment function: [c | c < d > e, d ∈ G2, c,e ∈ G0] • aggregation function: add up the vectors
Regrid Example 3: map values to a different grid • Source GridField: G, with G2:(tri,s) • Target GridField: F, with F0:(x,y) • assignment function: [d | contains(d,c), d ∈ G2, c ∈ F0] • aggregation function: interpolate • this is like a spatial join
Topological Path • [Figure: source grid S and target grid T, showing the true path and the path computed by Iterate at t=0 through t=5] • We can’t express this operation as a regrid • The position at t = n depends on positions at t < n, and velocity flow data from the source grid.
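Iterate • [Diagram: Iterate takes target grid T, source grid S, and a start position (x0,y0), and produces R] • Rule 1: if (x,y) in R, then (x′,y′) in R, where (x′,y′) is a function of (x,y) and the nearby velocity data in S • Rule 2: if (x,y) not in R, then (x,y) is left unchanged • Iterate finds a fixpoint of these rules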
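One way to read the two rules is as a stepping loop that stops when nothing changes. The sketch below assumes that reading: a position inside the region advances by the local velocity (a simple Euler step, chosen here for brevity), and a position outside the region is left alone, which ends the iteration. The function names, the constant velocity field, and the square region are all hypothetical.

```python
def iterate(start, velocity, in_region, dt=1.0, max_steps=1000):
    """Trace a path from `start` until the position stops changing (a fixpoint)."""
    path = [start]
    pos = start
    for _ in range(max_steps):
        if not in_region(pos):
            break                      # outside the region: position is unchanged
        u, v = velocity(pos)           # velocity data looked up near the position
        pos = (pos[0] + u * dt, pos[1] + v * dt)
        path.append(pos)
    return path


def constant_flow(pos):
    """Hypothetical velocity field standing in for data on the source grid S."""
    return (1.0, 0.5)


def inside_box(pos):
    """Hypothetical region: the square [0, 5) x [0, 5)."""
    return 0.0 <= pos[0] < 5.0 and 0.0 <= pos[1] < 5.0


path = iterate((0.0, 0.0), constant_flow, inside_box)
```

Each new position depends on the previous one, which is why the slide notes that this cannot be written as a single regrid: a regrid computes every target value independently from the source.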
Transect Data Product • [Figure: query plan over H(x,y,b), V(z), and a transect path P, using r(z>b), b(s), and search]
Transect: Results • 800 MB dataset • Not expressible in PostGIS, a geographic extension to PostgreSQL • [Chart: running times in secs]
Architecture • [Diagram components: User Queries, Named Views, Query Processor, Optimizer, Catalog, three GridField Processors, Rendering]
Other Aspects • Accessing Native Data Formats • Describe arbitrary file formats • Generate access methods for them • Physical Algebra based on Arrays • Similar to Monet’s BAT Algebra • Evaluation in other domains • Seismology • Engineering • Physics
Related Work • Data Models • P. Moran. Field model: An object-oriented data model for fields. Technical report, NASA Ames Research Center, 2001. • A. P. Marathe and K. Salem. A language for manipulating arrays. VLDB, 1997. • D. M. Butler and M. H. Pendley. A visualization model based on the mathematics of fiber bundles. Computers in Physics, 3(5):45–51, 1989. • Workflow and Visualization • X. Ma, M. Winslett, J. Norris, X. Jiao, and R. Fiedler. Godiva: Lightweight data management for scientific visualization applications. ICDE, 2004. • I. Altintas, S. Bhagwanani, D. Buttler, S. Chandra, Z. Cheng, M. Coleman, T. Critchlow, A. Gupta, W. Han, L. Liu, B. Ludäscher, C. Pu, R. Moore, A. Shoshani, and M. Vouk. A Modeling and Execution Environment for Distributed Scientific Workflows. SSDBM, 2003. • I. Foster, J. Voeckler, M. Wilde, and Y. Zhao. Chimera: A virtual data system for representing, querying, and automating data derivation. SSDBM, 2002.
Related Work • No known solutions with support for: • Fully generalized grids • Algebraic optimization • Native representations
Conclusions • Algebraic reasoning is helpful for simplifying complex data products • Relational Databases are difficult to deploy in these domains • Jim Gray reports some success, however • Performance is competitive with mature tools due to algebraic optimization
Low-Impact Data Management • Goal: • Derive a cost function for our query plans • But we must accommodate different representations • Claim: Operators and assignment/aggregation functions can be defined using only 4 logical access methods • So: Implement these for a variety of representations
AIFL Interface • Adjacency: Ai (optional) • for each cell c of dimension i, return cells of dimension i adjacent to c • Incidence: Iik • for each cell c of dimension i, return cells of dimension k incident to c • Field data: Fi • for each cell c of dimension i, return the data bound to c • Lookup: Li • return all the cells of dimension i that have a particular data value bound • [Figure: cells c and d with coordinates (xc,yc) and (xd,yd)]
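The four access methods could be expressed as an abstract interface that each data representation implements. The sketch below is one possible shape, not the actual GridFields API: the class names, method names, and the in-memory representation are assumptions, and adjacency is left unimplemented because the slide marks it optional.

```python
from abc import ABC, abstractmethod


class AIFL(ABC):
    """Hypothetical interface for the four logical access methods."""

    def adjacent(self, i, cell):
        """A_i (optional): cells of dimension i adjacent to `cell`."""
        raise NotImplementedError("adjacency is optional")

    @abstractmethod
    def incident(self, i, k, cell):
        """I_ik: cells of dimension k incident to `cell` of dimension i."""

    @abstractmethod
    def field(self, i, cell):
        """F_i: the data bound to `cell` of dimension i."""

    @abstractmethod
    def lookup(self, i, value):
        """L_i: all cells of dimension i with `value` bound."""


class InMemoryGridField(AIFL):
    """One representation among many: plain dictionaries and a set of pairs."""

    def __init__(self, dims, incidence, data):
        self.dims = dims            # cell name -> dimension
        self.incidence = incidence  # set of (lower cell, higher cell) pairs
        self.data = data            # cell name -> bound value

    def incident(self, i, k, cell):
        assert self.dims[cell] == i
        found = set()
        for a, b in self.incidence:
            other = b if a == cell else a if b == cell else None
            if other is not None and self.dims[other] == k:
                found.add(other)
        return found

    def field(self, i, cell):
        assert self.dims[cell] == i
        return self.data[cell]

    def lookup(self, i, value):
        return {c for c, d in self.dims.items()
                if d == i and self.data.get(c) == value}


gf = InMemoryGridField(
    dims={"0": 0, "1": 0, "w": 1},
    incidence={("0", "w"), ("1", "w")},
    data={"0": 12.6, "1": 13.1},
)
```

A file-backed or RDBMS-backed representation would implement the same three required methods, so operators written against the interface would not need to change.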