Downloading the World: Middleware for Computational Science

Bill Howe, David Maier • Datalab @ Portland State University • Work sponsored by NSF ITR Program, 2001-2006, with thanks to Antonio Baptista, Paul Turner, and the entire CORIE Environmental Science Team at OGI

Biology • Old way: • Wet lab chemistry • New way: • Microarray • Search GenBank, Ensembl, GDB, SwissProt, Entrez using BLAST, FASTA, GCG, EMBOSS

Astronomy • Old way: • Sign up for telescope time • New way: • Sloan Digital Sky Survey • Systematically mapping ¼ of the entire sky • 12 TB to date • 15 TB final in 2007

Particle Physics • Old way: • Individual experiments for individual hypotheses • New way: • Simulation and massive data acquisition from particle colliders • CMS (LHC): 15 PB / year (2007) • BaBar (SLAC): approaching 1 PB • 2/3 is simulation data

Oceanography • Old way: • Field work • Simplified Analytics • New way: • Finite Element Analysis • In situ sensors • CODAR

Science is Changing • Old Science: “Query the world” • Hypothesis-driven observations and experiments • Calculations on toy problems • Data acquisition is the dominant cost • New Science: “Download the world” • Telescopes, satellites, clusters of fast computers lower the cost of data acquisition • Store everything now, formulate hypotheses later • Data analysis is the dominant cost

Our Focus • Claim: Simulation plays an increasingly important role in the physical sciences • Claim: Data products are the currency of scientific communication • Goal: middleware, dubbed GridFields, for deriving data products from simulation results

Outline • Introduction • Data Product Examples • Existing Tools • Logical Model • Operators • Optimization

1994 Northridge aftershock, San Fernando Valley • http://www.cs.cmu.edu/~quake/quakeviz.html • Animations created by Greg Foss, Pittsburgh Supercomputing Center and CMU Quake project, from synthetic data generated by the CMU Quake project

Data SourcesData Products • Results of FEA and other methods • Datasets defined on a grid (synonymously, a mesh) • Potentially distributed across files, servers, the Internet • Heterogeneous representations • netCDF, custom formats, RDBMS

Data Sources Data Products • Visualizations or derived datasets • Involving: • Slices, • Aggregations, • Compositions, • etc.

Tools? • [Figure: local files D0, D1, ..., Dn, a dataset, and several visualizations]

Example: VTK Library • We want: [figure not recovered] • VTK: vtkExtractGeometry, vtkThreshold, vtkExtractGrid, vtkExtractVOI, vtkThresholdPoints • Different C++ classes, each dependent on data characteristics • Changes to data characteristics mean changes to the plan • Logical equivalences are obscured

Exposing Equivalences for Algebraic Optimization • Plan A: cross, then restrict • Plan B: restrict, then cross • Plan B is equivalent to plan A, but less costly to compute • Implies ‘Cross’ and ‘Restrict’ should commute • Difficult to determine which VTK operations, if any, have this property (“Documentational Semantics”)

Example: VTK Library • Apply a function to a dataset? • vtkProgrammableAttributeFilter, vtkFunctionParser, others • Aggregate over depth? • no generalized aggregation filter • Construct multidimensional datasets? • grid construction is “manual”

Relational Databases? • Simulation data is primarily read-only, whereas an RDBMS is optimized for transaction processing • An RDBMS requires more space for the same data • Impedance mismatch between RDBMS and visualization tools • Requires scientists to relinquish control of their data’s representation • Difficult to model general gridded datasets with relations

Goals • Allow convenient and efficient manipulation of gridded datasets • Model generalized gridded datasets • Leverage database techniques in this domain • Algebraic reasoning • Cost-based optimization • Cooperate with the scientific computing landscape • custom file formats • visualization tools

Grid • A grid is a set of cells, where each cell is assigned an integer dimension • Gi are the cells in G of dimension i • Cells are organized by an incidence relation I that respects dimension • x < y is read “x is incident to y” • Example, a single triangle E: E0 = {2,3,4}, E1 = {x,y,z}, E2 = {A}; I = the incidences drawn in the figure (not recovered), plus the transitive closure
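
The grid definition can be made concrete with a small sketch. The `Grid` class, its field names, and the choice of which nodes bound which edge are assumptions made for illustration; the slide only fixes the cell sets E0, E1, E2 and says that the incidence relation is closed transitively. "Respects dimension" is read here as dim(x) <= dim(y), which is also an assumption.

```python
def transitive_closure(pairs):
    """Return the transitive closure of a set of (lower, upper) pairs."""
    closure = set(pairs)
    while True:
        extra = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if extra <= closure:
            return closure
        closure |= extra


class Grid:
    """Cells grouped by dimension plus an incidence relation (hypothetical layout)."""

    def __init__(self, cells_by_dim, incidence):
        self.cells = {d: frozenset(cs) for d, cs in cells_by_dim.items()}
        self.dim = {c: d for d, cs in self.cells.items() for c in cs}
        for x, y in incidence:
            # Assumed reading of "respects dimension": dim(x) <= dim(y).
            if self.dim[x] > self.dim[y]:
                raise ValueError(f"{x!r} cannot be incident to {y!r}")
        self.incidence = transitive_closure(incidence)

    def incident(self, x, y):
        """True if x is incident to y, directly or through the closure."""
        return (x, y) in self.incidence


# The triangle from the slide. The node-to-edge pairing is assumed.
E = Grid(
    {0: {2, 3, 4}, 1: {"x", "y", "z"}, 2: {"A"}},
    {(2, "x"), (3, "x"), (3, "y"), (4, "y"), (4, "z"), (2, "z"),
     ("x", "A"), ("y", "A"), ("z", "A")},
)
print(E.incident(2, "A"))  # node-to-face incidence exists only via the closure
```

Only node-to-edge and edge-to-face pairs are listed explicitly; node-to-face incidence is derived, which is what "plus the transitive closure" suggests.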

Grid Operations • Union • Intersection • Cross Product

Cross Product (1) • [Figure: the cross product of two grids]

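Cross Product (2) • Triangle grid (nodes 2, 3, 4; edges x, y, z; face A) crossed with a segment grid (nodes 0, 1; edge w) • Result cells: nodes 20, 21, 30, 31, 40, 41 • edges x0, x1, y0, y1, z0, z1, 2w, 3w, 4w • faces A0, A1, xw, yw, zw • volume Aw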
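
A minimal sketch of how such a product could be computed, assuming the usual product construction for cell complexes: a product cell (a, b) has dimension dim(a) + dim(b), and (a, b) is incident to (c, d) when each component is equal or incident and the two cells differ. The slides do not spell out this rule, and the `(dim, incidence)` tuple layout is hypothetical.

```python
from collections import Counter


def cross(g, h):
    """Cross product of two grids given as (dim, inc) pairs.

    dim maps cell -> dimension; inc is a transitively closed set of
    (lower, upper) incidence pairs.
    """
    dim_g, inc_g = g
    dim_h, inc_h = h
    dim = {(a, b): dim_g[a] + dim_h[b] for a in dim_g for b in dim_h}

    def below_or_equal(inc, p, q):
        return p == q or (p, q) in inc

    inc = {
        (p, q)
        for p in dim for q in dim
        if p != q
        and below_or_equal(inc_g, p[0], q[0])
        and below_or_equal(inc_h, p[1], q[1])
    }
    return dim, inc


# Triangle and segment from the slide; the node-to-edge pairing is assumed.
triangle = (
    {2: 0, 3: 0, 4: 0, "x": 1, "y": 1, "z": 1, "A": 2},
    {(2, "x"), (3, "x"), (3, "y"), (4, "y"), (4, "z"), (2, "z"),
     ("x", "A"), ("y", "A"), ("z", "A"), (2, "A"), (3, "A"), (4, "A")},
)
segment = ({0: 0, 1: 0, "w": 1}, {(0, "w"), (1, "w")})

prism_dim, prism_inc = cross(triangle, segment)
cells_per_dimension = Counter(prism_dim.values())
print(cells_per_dimension)
```

The result is a prism: the cell labels on the slide (20, 2w, x0, xw, A0, Aw, and so on) correspond to the pairs (2, 0), (2, "w"), ("x", 0), ("x", "w"), ("A", 0), ("A", "w").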

GridFields • GridField: a grid with attributes bound to its cells • A grid may have data bound to more than one dimension • A grid G is lifted into a GridField, written G' here, using the Bind operator • G' = Bind(G, i, f), where f is a function mapping each cell in Gi to a value
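
Bind can be sketched as a pure function that attaches one named attribute to the cells of one dimension. The `(cells, attrs)` tuple layout, the extra `name` parameter, and the coordinate and area values are all hypothetical; the slide only gives the signature Bind(G, i, f).

```python
def bind(gridfield, i, name, f):
    """Return a new gridfield with attribute `name` bound to cells of dimension i.

    gridfield is (cells, attrs): cells maps dimension -> set of cells,
    attrs maps dimension -> {attribute name -> {cell -> value}}.
    """
    cells, attrs = gridfield
    new_attrs = {d: dict(cols) for d, cols in attrs.items()}  # leave the input untouched
    new_attrs.setdefault(i, {})[name] = {c: f(c) for c in cells[i]}
    return cells, new_attrs


cells = {0: {2, 3, 4}, 1: {"x", "y", "z"}, 2: {"A"}}
G = (cells, {})  # a bare grid: no attributes bound yet

# Made-up data for illustration only.
x_coord = {2: 0.0, 3: 1.0, 4: 0.5}
area = {"A": 0.5}

G1 = bind(G, 0, "x", x_coord.get)    # data on the nodes
G2 = bind(G1, 2, "area", area.get)   # data on the face as well
print(G2[1])
```

Binding to dimensions 0 and 2 in turn shows the point made on the slide that one grid may carry data at more than one dimension.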

Constructing GridFields • [Figure, panels (a)-(e): GridFields built from H0:(x), H0:(), V0:(y) and V0:() with the cross product ⊗ and with bind(0,x), bind(0,y), bind(0,x,y) and bind(1,f)]

GridField Operations • Scan : () → G(x,y) • Bind : G(x,y) → G(x,y,z) • Restrict : G(x,y,z) → G(x,y,z) • Regrid : G(x,y,z) × F(a,b) → F(…) • Lifted grid operations: Union, Intersection, Cross Product

Example Query Plan • Inputs: H : (x,y,b) and V : (z) • Plan, bottom-up: H ⊗ V, then r(z>b), then b(s), then r(region), then render • Intermediate results: (H ⊗ V), r(H ⊗ V), b(r(H ⊗ V)), r(b(r(H ⊗ V)))

Algebraic Optimization • Original plan: H(x,y,b) ⊗ V(z), then r(z>b), then b(s), then r(region) • Optimized plan: r(x,y) on H(x,y,b) and r(z) on V(z), then ⊗, then r(z>b), then b(s) • Howe, Maier, Algebraic Manipulation of Scientific Datasets. VLDB Journal, 14:4, 2005
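
The rewrite on this slide can be checked on a toy model. The sketch below is a deliberate simplification: gridfields are reduced to lists of node records, so higher-dimensional cells and the bind step b(s) are ignored, and the region bounds and bathymetry values are invented. It only illustrates why restricting H and V before the cross product gives the same answer with a much smaller intermediate result.

```python
from itertools import product


def cross(f, g):
    """Pair every record of f with every record of g."""
    return [{**a, **b} for a, b in product(f, g)]


def restrict(pred, f):
    """Keep only the records that satisfy pred."""
    return [rec for rec in f if pred(rec)]


# Hypothetical data: a 10x10 horizontal grid with a bathymetry value b,
# and 20 vertical levels.
H = [{"x": x, "y": y, "b": 2.0 + (x + y) % 3} for x in range(10) for y in range(10)]
V = [{"z": float(z)} for z in range(20)]


def in_xy(rec):
    return 2 <= rec["x"] <= 4 and 3 <= rec["y"] <= 5


def in_z(rec):
    return rec["z"] <= 10


def region(rec):
    # The region predicate splits into a horizontal and a vertical part.
    return in_xy(rec) and in_z(rec)


def z_gt_b(rec):
    # r(z>b) mixes attributes of H and V, so it cannot be pushed down.
    return rec["z"] > rec["b"]


# Original plan: cross first, restrict afterwards.
full = cross(H, V)
plan_a = restrict(region, restrict(z_gt_b, full))

# Optimized plan: push r(x,y) onto H and r(z) onto V, then cross.
small = cross(restrict(in_xy, H), restrict(in_z, V))
plan_b = restrict(z_gt_b, small)

print(len(full), len(small), plan_a == plan_b)
```

The pushdown is valid here only because the region predicate decomposes into a part over (x, y) and a part over z.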

Optimization Results

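Regrid Example 1 • Source GridField E0:(t) with values 12.6C, 13.1C, 13.2C, 12.8C, 12.5C, 12.1C • Assignment function: maps each target cell to a set of source cells, here {12.6C, 13.1C, 13.2C} and {12.8C, 12.5C, 12.1C} • Aggregation function: reduces each set to one value, here 12.95C and 12.45C • Target GridField F0:(avgt)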
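
A minimal sketch of regrid as assignment followed by aggregation, using the six temperatures from the slide. The function name `regrid`, the cell ids `e0`..`e5` and `f0`, `f1`, and the use of a plain arithmetic mean as the aggregation function are assumptions.

```python
from statistics import mean


def regrid(source_values, target_cells, assign, aggregate):
    """For each target cell, aggregate the values of its assigned source cells."""
    return {
        t: aggregate([source_values[c] for c in assign(t)])
        for t in target_cells
    }


# Source gridfield E0:(t): six nodes carrying a temperature (hypothetical ids).
E0 = {"e0": 12.6, "e1": 13.1, "e2": 13.2, "e3": 12.8, "e4": 12.5, "e5": 12.1}

# Assignment function: which source nodes feed each target node.
groups = {"f0": ["e0", "e1", "e2"], "f1": ["e3", "e4", "e5"]}

F0 = regrid(E0, groups.keys(), groups.__getitem__, mean)
print(F0)
```

With a plain mean the results come out near 12.97 and 12.47, slightly different from the 12.95C and 12.45C printed on the slide, which may have been rounded or computed with a different aggregation.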

Regrid Example 2: Estimate Gradient

Regrid Example 2: Estimate Gradient • Source GridField: G, with G0:(x,y,s) • Target GridField: G • Assignment function, for a target node e: [c | c < d > e, d ∈ G2, c,e ∈ G0], that is, the nodes c that share a 2-cell d with e • Aggregation function: add up the vectors

Regrid Example 3: map values to a different grid • Source GridField: G, with G2:(tri,s) • Target GridField: F, with F0:(x,y) • Assignment function, for a target node c: [d | contains(d,c), d ∈ G2, c ∈ F0] • Aggregation function: interpolate • This is like a spatial join

Operator: Iterate

Iterate • We can’t express this operation as a regrid • The position at t = n depends on positions at t < n and on velocity flow data from the source grid • [Figure: a true path and the corresponding topological path through the source grid S, with positions at t=0 through t=5 on the target grid T]

Iterate • Inputs: target T, source S, region R, start position (x0,y0) • Rule 1, if (x,y) in R: (x,y) → (x',y'), where (x',y') is a function of (x,y) and the nearby velocity data in S • Rule 2, if (x,y) not in R: (x,y) → (x,y) • Iterate finds a fixpoint of these rules
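
The two rules can be read as a small fixpoint loop. In this sketch the forward-Euler step, the nearest-node velocity lookup, the unit lattice, the constant velocity field, and the `max_steps` guard are all assumptions chosen to keep the example short; the slides only say that the next position depends on the current one and on nearby velocity data in S.

```python
def iterate(start, in_region, step, max_steps=1000):
    """Apply the rules until the position stops changing (a fixpoint)."""
    path = [start]
    for _ in range(max_steps):
        current = path[-1]
        # Rule 1 moves the point while it is inside R; Rule 2 leaves it in place.
        nxt = step(current) if in_region(current) else current
        if nxt == current:
            break
        path.append(nxt)
    return path


# Hypothetical source grid S: a 5x5 unit lattice with one velocity per node.
S = {(i, j): (1.0, 0.5) for i in range(5) for j in range(5)}


def in_R(pos):
    return 0 <= pos[0] <= 4 and 0 <= pos[1] <= 4


def advect(pos, dt=1.0):
    """One forward-Euler step using the velocity at the nearest source node."""
    u, v = S[(round(pos[0]), round(pos[1]))]
    return (pos[0] + u * dt, pos[1] + v * dt)


path = iterate((0.0, 0.0), in_R, advect)
print(path)
```

Each position depends on the one before it, which is why this cannot be phrased as a single regrid: there is no fixed assignment from target cells to source cells that is known up front.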

Transect Data Product • [Figure: query plan over H(x,y,b), V(z) and the transect points P, using ⊗, r(z>b), b(s), a search step, and P ⊗ V]

Transect: Results • 800 MB dataset • Not expressible in PostGIS, a geographic extension to PostgreSQL • [Figure: running times in seconds]

Architecture • Components shown: User Queries, Named Views, Query Processor, Catalog, Optimizer, GridField Processor (three instances), Rendering

Other Aspects • Accessing Native Data Formats • Describe arbitrary file formats • Generate access methods for them • Physical Algebra based on Arrays • Similar to Monet’s BAT Algebra • Evaluation in other domains • Seismology • Engineering • Physics

Related Work • Data Models: • P. Moran. Field model: An object-oriented data model for fields. Technical report, NASA Ames Research Center, 2001. • A. P. Marathe and K. Salem. A language for manipulating arrays. VLDB, 1997. • D. M. Butler and M. H. Pendley. A visualization model based on the mathematics of fiber bundles. Computers in Physics, 3(5):45–51, 1989. • Workflow and Visualization: • X. Ma, M. Winslett, J. Norris, X. Jiao, and R. Fiedler. Godiva: Lightweight data management for scientific visualization applications. ICDE, 2004. • I. Altintas, S. Bhagwanani, D. Buttler, S. Chandra, Z. Cheng, M. Coleman, T. Critchlow, A. Gupta, W. Han, L. Liu, B. Ludäscher, C. Pu, R. Moore, A. Shoshani, and M. Vouk. A Modeling and Execution Environment for Distributed Scientific Workflows. SSDBM, 2003. • I. Foster, J. Voeckler, M. Wilde, and Y. Zhao. Chimera: A virtual data system for representing, querying, and automating data derivation. SSDBM, 2002.

Related Work • No known solutions with support for: • Fully generalized grids • Algebraic optimization • Native representations

Conclusions • Algebraic reasoning is helpful for simplifying complex data products • Relational databases are difficult to deploy in these domains • Jim Gray reports some success, however • Performance is competitive with mature tools due to algebraic optimization

---Backup Slides---

Low-Impact Data Management • Goal: • Derive a cost function for our query plans • But we must accommodate different representations • Claim: Operators and assignment/aggregation functions can be defined using only 4 logical access methods • So: Implement these for a variety of representations

AIFL Interface • Adjacency: Ai (optional) • for each cell c of dimension i, return cells of dimension i adjacent to c • Incidence: Iik • for each cell c of dimension i, return cells of dimension k incident to c • Field data: Fi • for each cell c of dimension i, return the data bound to c • Lookup: Li • return all the cells of dimension i that have a particular data value bound
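
One way to picture the four access methods is as an abstract interface that each physical representation implements. The class and method names below are hypothetical, and the in-memory implementation exists only to show the interface being satisfied; the slides do not describe how the actual system exposes these methods.

```python
from abc import ABC, abstractmethod


class AIFL(ABC):
    """Hypothetical rendering of the four logical access methods."""

    def adjacent(self, i, c):
        """A_i (optional): cells of dimension i adjacent to c."""
        raise NotImplementedError("adjacency is optional")

    @abstractmethod
    def incident(self, i, k, c):
        """I_ik: cells of dimension k incident to cell c of dimension i."""

    @abstractmethod
    def field(self, i, c):
        """F_i: the data bound to cell c of dimension i."""

    @abstractmethod
    def lookup(self, i, value):
        """L_i: all cells of dimension i that have `value` bound."""


class InMemoryGrid(AIFL):
    """Toy representation backed by dictionaries and a set of pairs."""

    def __init__(self, dim, incidence, data):
        self.dim = dim              # cell -> dimension
        self.incidence = incidence  # set of (lower, upper) pairs
        self.data = data            # cell -> bound value

    def incident(self, i, k, c):
        assert self.dim[c] == i
        up = {y for (x, y) in self.incidence if x == c and self.dim[y] == k}
        down = {x for (x, y) in self.incidence if y == c and self.dim[x] == k}
        return up | down

    def field(self, i, c):
        assert self.dim[c] == i
        return self.data[c]

    def lookup(self, i, value):
        return {c for c, d in self.dim.items()
                if d == i and self.data.get(c) == value}


# A single triangle with made-up temperatures on its nodes.
grid = InMemoryGrid(
    dim={2: 0, 3: 0, 4: 0, "x": 1, "y": 1, "z": 1, "A": 2},
    incidence={(2, "x"), (3, "x"), (3, "y"), (4, "y"), (4, "z"), (2, "z"),
               ("x", "A"), ("y", "A"), ("z", "A")},
    data={2: 12.6, 3: 13.1, 4: 12.6},
)
print(grid.incident(1, 0, "x"), grid.field(0, 3), grid.lookup(0, 12.6))
```

A file-backed or database-backed representation would implement the same methods, which is what would let a single cost model and operator implementation work across storage formats.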
