
Downloading the World: Middleware for Computational Science

Bill Howe, David Maier. Datalab @ Portland State University. Work sponsored by NSF ITR Program 2001-2006, with thanks to Antonio Baptista, Paul Turner, and the entire CORIE Environmental Science Team at OGI.


Presentation Transcript


  1. Downloading the World: Middleware for Computational Science Bill Howe, David Maier Datalab @ Portland State University Work sponsored by NSF ITR Program 2001-2006 with thanks to Antonio Baptista, Paul Turner, and the entire CORIE Environmental Science Team at OGI

  2. Biology • Old way: • Wet lab chemistry • New way: • Microarray • Search GenBank, Ensembl, GDB, SwissProt, Entrez using BLAST, FASTA, GCG, EMBOSS

  3. Astronomy • Old way: • Sign up for telescope time • New way: • Sloan Digital Sky Survey • Systematically mapping ¼ of the entire sky • 12 TB to date, 15 TB final in 2007

  4. Particle Physics • Old way: • Individual experiments for individual hypotheses • New way: • Simulation and massive data acquisition from particle colliders • CMS (LHC): 15 PB / year (2007) • BaBar (SLAC): approaching 1 PB • 2/3 is simulation data

  5. Oceanography • Old way: • Field work • Simplified Analytics • New way: • Finite Element Analysis • In situ sensors • CODAR

  6. Science is Changing • Old Science: “Query the world” • Hypothesis-driven observations and experiments • Calculations on toy problems • Data acquisition is the dominant cost • New Science: “Download the world” • Telescopes, satellites, clusters of fast computers lower the cost of data acquisition • Store everything now, formulate hypotheses later • Data analysis is the dominant cost

  7. Our Focus • Claim: Simulation plays an increasingly important role in the physical sciences • Claim: Data Products are the currency of scientific communication • Goal: middleware, dubbed GridFields, for deriving data products from simulation results • [Figure labels: Science, Computational Science]

  8. Outline • Introduction • Data Product Examples • Existing Tools • Logical Model • Operators • Optimization

  9. 1994 Northridge aftershock, San Fernando valley http://www.cs.cmu.edu/~quake/quakeviz.html Animations created by Greg Foss, Pittsburgh Supercomputing Center and CMU Quake project, from synthetic data generated by the CMU Quake project

  10. Data SourcesData Products • Results of FEA and other methods • Datasets defined on a grid (synonymously, a mesh) • Potentially distributed across files, servers, the Internet • Heterogeneous representations • netCDF, custom formats, RDBMS

  11. Data Sources Data Products • Visualizations or derived datasets • Involving: • Slices, • Aggregations, • Compositions, • etc.

  12. Outline • Introduction • Data Product Examples • Existing Tools • Logical Model • Operators • Optimization

  13. Visualization Tools? • [Figure: local files D0, D1, ..., Dn; labels: Dataset, Visualization]

  14. Example: VTK Library • We want: [figure not preserved] • VTK: vtkExtractGeometry, vtkThreshold, vtkExtractGrid, vtkExtractVOI, vtkThresholdPoints • Different C++ classes, each dependent on data characteristics • Changes to data characteristics mean changes to plan • Logical equivalences are obscured…

  15. Exposing Equivalences for Algebraic Optimization • Plan B is equivalent to plan A, but less costly to compute • Implies ‘Cross’ and ‘Restrict’ should be commutative • Difficult to determine which VTK operations, if any, have this property (“Documentational Semantics”) • [Figure: plans A and B, each built from a cross and a restrict operator]
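
The equivalence on this slide can be illustrated with a minimal Python sketch that uses plain lists of records instead of grids. The functions `cross` and `restrict` and the sample data are illustrative assumptions, not VTK or GridFields APIs. When the predicate only inspects one operand, restricting before the cross product yields the same result as restricting afterwards, while building fewer intermediate pairs.

```python
from itertools import product


def cross(a, b):
    """Pair every record of a with every record of b."""
    return [(r, s) for r, s in product(a, b)]


def restrict(records, pred):
    """Keep only the records that satisfy pred."""
    return [r for r in records if pred(r)]


# Hypothetical inputs: a horizontal set H and a vertical set V.
H = [{"x": x} for x in range(4)]
V = [{"z": z} for z in range(3)]


def in_region(h):
    """Predicate that looks only at the H side of a pair."""
    return h["x"] >= 2


# Restrict after the cross product: 12 pairs are built, 6 survive.
restrict_after = restrict(cross(H, V), lambda pair: in_region(pair[0]))

# Restrict first, then cross: only 6 pairs are ever built.
restrict_first = cross(restrict(H, in_region), V)

print(restrict_after == restrict_first)
```

An optimizer can only apply this rewrite if it knows the two operators commute, which is the point the slide makes about operator libraries with informal semantics.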

  16. Example: VTK Library • Apply a function to a dataset? • vtkProgrammableAttributeFilter, vtkFunctionParser, others • Aggregate over depth? • no generalized aggregation filter • Construct multidimensional datasets? • grid construction is “manual”

  17. Relational Databases? • Simulation data is primarily read-only, whereas an RDBMS is optimized for transaction processing • RDBMS requires more space for the same data • Impedance mismatch between RDBMS and visualization tools • Requires scientists to relinquish control of their data’s representation • Difficult to model general gridded datasets with relations

  18. Goals • Allow convenient and efficient manipulation of gridded datasets • Model generalized gridded datasets • Leverage database techniques in this domain • Algebraic reasoning • Cost-based optimization • Cooperate with the scientific computing landscape • custom file formats • visualization tools

  19. Outline • Introduction • Data Product Examples • Existing Tools • Logical Model • Operators • Optimization

  20. Grid • A grid is a set of cells, where each is assigned an integer dimension • Gi are the cells in G of dimension i • Cells are organized by an incidence relation I that respects dimension • x < y is read “x is incident to y” • Example grid E (figure: a triangle): E0 = {2,3,4}, E1 = {x,y,z}, E2 = {A} • I = [pairs shown in figure, not preserved] …plus the transitive closure
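
As a rough illustration of this definition, the following Python sketch stores the example grid's cells by dimension and builds an incidence relation with its transitive closure. The cell sets come from the slide; the specific node-to-edge pairs in `direct` are an assumption, since the figure that listed them did not survive. The final check is one reading of "respects dimension".

```python
def transitive_closure(pairs):
    """Return the smallest transitive relation containing pairs."""
    closure = set(pairs)
    while True:
        extra = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if extra <= closure:
            return closure
        closure |= extra


# Cells of the example grid E, keyed by dimension (from the slide).
cells = {0: {"2", "3", "4"}, 1: {"x", "y", "z"}, 2: {"A"}}

# ASSUMED direct incidences: which nodes bound which edge is a guess.
direct = {
    ("2", "x"), ("3", "x"),
    ("3", "y"), ("4", "y"),
    ("2", "z"), ("4", "z"),
    ("x", "A"), ("y", "A"), ("z", "A"),
}

incidence = transitive_closure(direct)

# Look up the dimension of any cell.
dim_of = {c: d for d, cs in cells.items() for c in cs}

# One reading of "respects dimension": a < b implies dim(a) <= dim(b).
respects_dimension = all(dim_of[a] <= dim_of[b] for a, b in incidence)
print(respects_dimension, ("2", "A") in incidence)
```

The closure adds the node-to-face pairs, so a node is incident to the triangle even though only node-to-edge and edge-to-face pairs were listed.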

  21. Grid Operations • Union • Intersection • Cross Product

  22. Cross Product (1) • [Figure: the cross product of two grids]

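  23. Cross Product (2) • [Figure: the triangle grid (nodes 2, 3, 4; edges x, y, z; face A) crossed with a line segment (nodes 0, 1; edge w)] • Product cells: nodes 20, 30, 40, 21, 31, 41 • edges x0, y0, z0, x1, y1, z1, 2w, 3w, 4w • faces A0, A1, xw, yw, zw • 3-cell Aw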
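
The cell labels in this figure suggest that each product cell is a pair of cells, one from each operand, whose dimension is the sum of the operands' dimensions. The sketch below computes only the cell sets under that assumption; the function name `cross_cells` is hypothetical, and the incidence relation of the product is left out.

```python
from itertools import product


def cross_cells(g, h):
    """Cells of the cross product, keyed by dimension.

    g and h map a dimension to a set of cell names. A product cell is
    named by concatenating the two operand names (e.g. "x" and "w"
    give "xw"), and its dimension is the sum of the operand dimensions.
    """
    out = {}
    for (i, g_cells), (j, h_cells) in product(g.items(), h.items()):
        out.setdefault(i + j, set()).update(
            a + b for a, b in product(g_cells, h_cells)
        )
    return out


triangle = {0: {"2", "3", "4"}, 1: {"x", "y", "z"}, 2: {"A"}}
segment = {0: {"0", "1"}, 1: {"w"}}

prism = cross_cells(triangle, segment)
for dim in sorted(prism):
    print(dim, sorted(prism[dim]))
```

With these inputs the result has 6 nodes, 9 edges, 5 faces, and one 3-cell, matching the labels that survive from the figure.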

  24. GridFields • GridField: A grid with attributes bound to its cells • A grid may have data bound to more than one dimension • A grid G is lifted into a GridField G using the Bind operator. • G = Bind(G, i, f), where f is a function mapping each cell in Gi to a value.
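
A minimal sketch of Bind in Python, assuming a grid is stored as cells keyed by dimension. The `GridField` class and the extra `name` argument (used to label the bound attribute) are assumptions for illustration; the slide's signature is Bind(G, i, f). Binding returns a new GridField and leaves the input untouched, and data can be bound at more than one dimension.

```python
from dataclasses import dataclass, field


@dataclass
class GridField:
    cells: dict                                # dimension -> tuple of cell ids
    attrs: dict = field(default_factory=dict)  # (dimension, name) -> {cell: value}


def bind(gf, i, name, f):
    """Bind f's value to every cell of dimension i under the given name."""
    new_attrs = dict(gf.attrs)
    new_attrs[(i, name)] = {c: f(c) for c in gf.cells[i]}
    return GridField(gf.cells, new_attrs)


# A bare grid: three nodes and one face (hypothetical example data).
grid = GridField({0: ("2", "3", "4"), 2: ("A",)})

temperature = {"2": 12.6, "3": 13.1, "4": 13.2}
with_t = bind(grid, 0, "t", temperature.__getitem__)   # data on the nodes
with_area = bind(with_t, 2, "area", lambda c: 0.5)     # data on the face

print(sorted(with_area.attrs))
```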

  25. Constructing GridFields • [Figure: panels (a) to (e) build GridFields from the grids H and V by combining the cross product with bind(0,x), bind(0,y), bind(0,x,y), and bind(1,f); panel labels include H0:(x), V0:(y), H0:(), V0:()]

  26. GridField Operations • Scan : () → G(x,y) • Bind : G(x,y) → G(x,y,z) • Restrict : G(x,y,z) → G(x,y,z) • Regrid : G(x,y,z), F(a,b) → F(…) • Lifted grid operations: Union, Intersection, Cross Product
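
The first three signatures can be sketched as a small pipeline in Python. This is a simplified stand-in, not the GridFields implementation: a GridField is reduced to a dict from cell id to an attribute record at a single dimension, `scan` reads from an in-memory list rather than a file, and Restrict ignores what should happen to incident cells of other dimensions.

```python
def scan(rows):
    """Scan : () -> G(x,y). The 'file' here is an in-memory list of (x, y)."""
    return {i: {"x": x, "y": y} for i, (x, y) in enumerate(rows)}


def bind(g, name, f):
    """Bind : G(x,y) -> G(x,y,z). Adds one attribute to every cell."""
    return {c: {**rec, name: f(rec)} for c, rec in g.items()}


def restrict(g, pred):
    """Restrict : G(x,y,z) -> G(x,y,z). Same attributes, fewer cells."""
    return {c: rec for c, rec in g.items() if pred(rec)}


g_xy = scan([(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)])
g_xyz = bind(g_xy, "z", lambda r: r["x"] + r["y"])   # hypothetical z values
deep = restrict(g_xyz, lambda r: r["z"] > 0.5)

print(sorted(deep))
```

Each operator takes and returns the same kind of value, which is what allows plans to be composed and rewritten algebraically.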

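  27. Example Query Plan (r = Restrict, b = Bind) • Inputs: H : (x,y,b) and V : (z) • Plan, bottom up: H ⊗ V, then r(z>b), then b(s), then r(region), then render • Intermediate results: (H ⊗ V), r(H ⊗ V), b(r(H ⊗ V)), r(b(r(H ⊗ V)))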

  28. Algebraic Optimization • Original plan: H(x,y,b) ⊗ V(z), then r(z>b), b(s), r(region) • Rewritten plan: r(x,y) applied to H(x,y,b) and r(z) applied to V(z) before the cross product, then r(z>b) and b(s) • *Howe, Maier. Algebraic Manipulation of Scientific Datasets. VLDB Journal, 14:4, 2005

  29. Optimization Results

  30. Regrid Example 1 • Source E0:(t): 12.6C, 13.1C, 13.2C, 12.8C, 12.5C, 12.1C • Assignment function: groups the source values as {12.6C, 13.1C, 13.2C} and {12.8C, 12.5C, 12.1C} • Aggregation function: reduces each group to one value, shown as 12.95C and 12.45C • Target F0:(avgt)
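
The two-function structure of Regrid can be sketched in Python as follows. The function name `regrid`, the cell names `e0` to `e5`, `f0`, and `f1`, and the use of an arithmetic mean as the aggregation are assumptions for illustration; the temperature readings and their grouping are taken from the slide.

```python
from statistics import mean


def regrid(source, target_cells, assign, aggregate):
    """For each target cell, gather its assigned source values and aggregate."""
    return {
        t: aggregate([source[s] for s in assign(t)])
        for t in target_cells
    }


# Source values bound to six 0-cells (readings from the slide).
source = {"e0": 12.6, "e1": 13.1, "e2": 13.2,
          "e3": 12.8, "e4": 12.5, "e5": 12.1}

# Assignment function: which source cells feed each target cell.
groups = {"f0": ["e0", "e1", "e2"], "f1": ["e3", "e4", "e5"]}

avgt = regrid(source, groups.keys(), groups.__getitem__, mean)
print(avgt)
```

Swapping in a different `assign` or `aggregate` changes the data product without touching the operator itself.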

  31. Regrid Example 2: Estimate Gradient

  32. Regrid Example 2: Estimate Gradient • Source GridField: G, with G0:(x,y,s) • Target GridField: G • assignment function: [c | c < d > e, d ∈ G2, c,e ∈ G0] • aggregation function: add up the vectors

  33. Regrid Example 3: map values to a different grid • Source GridField: G, with G2:(tri,s) • Target GridField: F, with F0:(x,y) • assignment function: [d | contains(d,c), d ∈ G2, c ∈ F0] • aggregation function: interpolate • this is like a spatial join

  34. Operator: Iterate

  35. Iterate • We can’t express this operation as a regrid • The position at t = n depends on positions at t < n, and velocity flow data from the source grid • [Figure: true path and topological path over time steps t=0 to t=5; target grid T, source grid S]

  36. Iterate • Inputs: target grid T, source grid S, region R, start position (x0,y0) • Rule 1: (x,y) → (x′,y′) if (x,y) in R, where (x′,y′) is a function of (x,y) and the nearby velocity data in S • Rule 2: (x,y) → (x,y) if (x,y) not in R • Iterate finds a fixpoint of these rules
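
One way to read these rules is as a particle trace that stops moving once it leaves the region R. The Python sketch below follows that reading and is not the authors' implementation: the constant `velocity` function stands in for a lookup in the source grid S, the rectangular region and unit time step are made up, and `max_steps` is an added safety cap.

```python
def iterate(start, in_region, step, max_steps=1000):
    """Apply the two rules until the position stops changing (a fixpoint)."""
    path = [start]
    pos = start
    for _ in range(max_steps):
        # Rule 1 inside R: advance. Rule 2 outside R: stay put.
        nxt = step(pos) if in_region(pos) else pos
        if nxt == pos:
            break
        path.append(nxt)
        pos = nxt
    return path


def in_region(p):
    """Hypothetical region R: the square [0, 3] x [0, 3]."""
    return 0.0 <= p[0] <= 3.0 and 0.0 <= p[1] <= 3.0


def velocity(p):
    """Stand-in for velocity data interpolated from the source grid S."""
    return (1.0, 0.5)


def step(p, dt=1.0):
    vx, vy = velocity(p)
    return (p[0] + vx * dt, p[1] + vy * dt)


path = iterate((0.0, 0.0), in_region, step)
print(path)
```

Each position depends on the previous one, which is why the slides treat this as a separate operator rather than a regrid.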

  37. Transect Data Product • [Figure: query plan over H(x,y,b), V(z), and the transect points P, combining cross products with V, r(z>b), b(s), and a search step]

  38. Transect: Results • 800 MB dataset • Not expressible in PostGIS, a geographic extension to PostgreSQL • [Chart: running times in seconds]

  39. Architecture • [Diagram components: User Queries, Query Processor, Catalog, Optimizer, Named Views, GridField Processor (three instances), Rendering]

  40. Other Aspects • Accessing Native Data Formats • Describe arbitrary file formats • Generate access methods for them • Physical Algebra based on Arrays • Similar to Monet’s BAT Algebra • Evaluation in other domains • Seismology • Engineering • Physics

  41. Related Work • Data Models • P. Moran. Field model: An object-oriented data model for fields. Technical report, NASA Ames Research Center, 2001. • A. P. Marathe and K. Salem. A language for manipulating arrays. VLDB, 1997. • D. M. Butler and M. H. Pendley. A visualization model based on the mathematics of fiber bundles. Computers in Physics, 3(5):45–51, 1989. • Workflow and Visualization • X. Ma, M. Winslett, J. Norris, X. Jiao, and R. Fiedler. Godiva: Lightweight data management for scientific visualization applications. ICDE, 2004. • I. Altintas, S. Bhagwanani, D. Buttler, S. Chandra, Z. Cheng, M. Coleman, T. Critchlow, A. Gupta, W. Han, L. Liu, B. Ludäscher, C. Pu, R. Moore, A. Shoshani, and M. Vouk. A Modeling and Execution Environment for Distributed Scientific Workflows. SSDBM, 2003. • I. Foster, J. Voeckler, M. Wilde, and Y. Zhao. Chimera: A virtual data system for representing, querying, and automating data derivation. SSDBM, 2002.

  42. Related Work • No known solutions with support for: • Fully generalized grids • Algebraic optimization • Native representations

  43. Conclusions • Algebraic reasoning is helpful for simplifying complex data products • Relational Databases are difficult to deploy in these domains • Jim Gray reports some success, however • Performance is competitive with mature tools due to algebraic optimization

  44. ---Backup Slides---

  45. Low-Impact Data Management • Goal: • Derive a cost function for our query plans • But we must accommodate different representations • Claim: Operators and assignment/aggregation functions can be defined using only 4 logical access methods • So: Implement these for a variety of representations

  46. AIFL Interface • Adjacency: Ai (optional) • for each cell c of dimension i, return cells of dimension i adjacent to c • Incidence: Iik • for each cell c of dimension i, return cells of dimension k incident to c • Field data: Fi • for each cell c of dimension i, return the data bound to c • Lookup: Li • return all the cells of dimension i that have a particular data value bound • [Figure: cells c and d with coordinates (xc,yc) and (xd,yd)]
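
The four access methods could be expressed as an interface along these lines. This is a sketch under assumptions: the method names, argument order, and the `InMemoryGrid` class are hypothetical, and adjacency is left unimplemented because the slide marks it optional. The helper `mean_over_incident` shows an aggregation written only against the interface, so it would work unchanged over any representation that implements it.

```python
from typing import Any, Iterable, Protocol


class AIFL(Protocol):
    """The four logical access methods from the slide (signatures assumed)."""

    def adjacent(self, i: int, c: str) -> Iterable[str]: ...
    def incident(self, i: int, k: int, c: str) -> Iterable[str]: ...
    def field(self, i: int, c: str) -> Any: ...
    def lookup(self, i: int, value: Any) -> Iterable[str]: ...


class InMemoryGrid:
    """One possible representation: plain dictionaries."""

    def __init__(self, incidence, data):
        self._incidence = incidence  # {(i, k): {cell: [incident cells]}}
        self._data = data            # {i: {cell: value}}

    def adjacent(self, i, c):
        raise NotImplementedError("adjacency is optional in this sketch")

    def incident(self, i, k, c):
        return self._incidence.get((i, k), {}).get(c, [])

    def field(self, i, c):
        return self._data[i][c]

    def lookup(self, i, value):
        return sorted(c for c, v in self._data[i].items() if v == value)


def mean_over_incident(grid: AIFL, i, k, c):
    """Average the data bound to the k-cells incident to cell c of dimension i."""
    values = [grid.field(k, n) for n in grid.incident(i, k, c)]
    return sum(values) / len(values)


tri = InMemoryGrid(
    incidence={(1, 0): {"x": ["2", "3"], "y": ["3", "4"], "z": ["2", "4"]}},
    data={0: {"2": 10.0, "3": 14.0, "4": 10.0}},
)

print(mean_over_incident(tri, 1, 0, "x"), tri.lookup(0, 10.0))
```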
