
What to do with Scientific Data?



Presentation Transcript


  1. What to do with Scientific Data? Michael Stonebraker

  2. Outline
     • Science data -- what it looks like
     • Software options: file system, RDBMS, SciDB

  3. O(100) petabytes

  4. LSST Data
     • Raw imagery: 2D arrays of telescope readings
     • “Cooked” into observations: image intensity algorithm (data clustering); spatial data
     • Further cooked into “trajectories”: similarity query, constrained by maximum distance

  5. Example LSST Queries
     • Re-cook raw imagery with my algorithm
     • Find all observations in a spatial region
     • Find all trajectories that intersect a cylinder in time

  6. Chemical Plant Data
     • Plant is a directed graph of plumbing
     • Sensors at various places, 1/sec observations
     • So: a directed graph of time series
     • Plant runs “near the edge” to optimize output, and fails every once in a while -- down for a week

  7. Chemical Plant Data
     • Record all data: {(time, sensor-1, …, sensor-5000)}
     • Look for interesting events, i.e. sensor values out of whack
     • Cluster events near each other in 5000-dimensional space (see the sketch below)
     • Goal is to identify “near-failure modes”
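     A minimal Python sketch of this step, assuming numpy; the z-score rule for “out of whack” and the greedy distance-threshold clustering are illustrative stand-ins, since the slide names neither:

     import numpy as np

     def find_events(readings, z_thresh=4.0):
         # readings: (n_times, n_sensors) array of 1/sec observations
         mean, std = readings.mean(axis=0), readings.std(axis=0) + 1e-9
         z = np.abs((readings - mean) / std)
         # an "event" is any time step where some sensor is out of whack
         return readings[(z > z_thresh).any(axis=1)]

     def cluster_events(events, radius):
         # greedy single-link clustering of event vectors in sensor space
         clusters = []
         for e in events:
             for c in clusters:
                 if min(np.linalg.norm(e - m) for m in c) < radius:
                     c.append(e)
                     break
             else:
                 clusters.append([e])   # event starts a new "near-failure mode"
         return clusters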

  8. Snow Cover in the Sierras

  9. Earth Science Data
     • Raw imagery: arrays
     • “Cooked” into composite images based on multiple passes (cloud-free cells)
     • “Baked” into snow cover, …

  10. General Model: sensors → cooking algorithm(s) (pipeline) → derived data

  11. Traditional Wisdom (1)
     • Cooking pipeline in hard code or custom hardware
     • All data in some file system

  12. Problems
     • Can’t find anything: metadata often not recorded (sensor parameters, cooking parameters)
     • Can’t easily share anything
     • Can’t easily recook anything
     • Everything is custom code

  13. Traditional Wisdom (2)
     • Cooking pipeline outside the DBMS
     • Derived data loaded into the DBMS for subsequent querying

  14. Problems with This Approach
     • Most of the issues remain
     • Better, but stay tuned

  15. My Preference
     • Load the raw data into a DBMS
     • Cooking pipeline is a collection of user-defined functions (DBMS extensions), activated by triggers or a workflow management system (see the sketch below)
     • ALL data captured in a common system!!!!
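     A toy Python rendering of this flow; the ArrayStore class and cook_observations function are hypothetical stand-ins for the DBMS and a cooking UDF, not a real SciDB API:

     import numpy as np

     class ArrayStore:
         # stand-in for the DBMS: holds arrays and fires triggers on insert
         def __init__(self):
             self.arrays, self.triggers = {}, []

         def on_insert(self, name, udf):
             self.triggers.append((name, udf))      # register a trigger

         def insert(self, name, data):
             self.arrays[name] = data               # raw data is captured too
             for target, udf in self.triggers:
                 if target == name:
                     udf(self, data)                # trigger fires the cooking UDF

     def cook_observations(store, raw):
         # stand-in for the image-intensity / clustering step
         store.arrays["observations"] = raw - raw.mean()

     store = ArrayStore()
     store.on_insert("raw_imagery", cook_observations)
     store.insert("raw_imagery", np.random.rand(100, 100))
     # both raw_imagery and observations now live in one common system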

  16. What DBMS to Use?
     • RDBMS (e.g. Oracle)? Pretty hopeless on raw data
     • Simulating arrays on top of tables is likely to cost a factor of 10-100 (see the sketch below)
     • Not pretty on time-series data, e.g. “find me a sensor reading whose average value over the last 3 days is within 1% of the average value of the adjoining 5 sensors”
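     A small illustration of why, using the stdlib sqlite3 module as the relational stand-in: one row per cell means tuple overhead and a scan where a native array needs only offset arithmetic:

     import sqlite3
     import numpy as np

     A = np.random.rand(1000, 1000)

     # relational simulation of the array: one row per cell (i, j, value)
     db = sqlite3.connect(":memory:")
     db.execute("CREATE TABLE cells (i INTEGER, j INTEGER, v REAL)")
     db.executemany("INSERT INTO cells VALUES (?, ?, ?)",
                    ((i, j, float(A[i, j]))
                     for i in range(1000) for j in range(1000)))

     # the same 100 x 100 subarray, two ways
     sub_sql = db.execute("SELECT v FROM cells "
                          "WHERE i < 100 AND j < 100").fetchall()  # scans rows
     sub_np = A[:100, :100]              # native slice: pure offset arithmetic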

  17. RDBMS Summary
     • Wrong data model: arrays, not tables
     • Wrong operations: regrid, not join
     • Missing features: versions, no-overwrite, provenance, support for uncertain data, …

  18. But your mileage may vary
     • SQL Server working well for the Sloan SkyServer database
     • See the paper in CIDR 2009 by Jose Blakeley

  19. How to Do Analytics (e.g. clustering)
     • Suck out the data
     • Convert to array format
     • Pass to Matlab, SAS, or R
     • Compute
     • Return the answer to the DBMS (the round trip is sketched below)
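     The round trip, sketched with sqlite3 and numpy as stand-ins for the DBMS and the stat package (the table names are illustrative):

     import sqlite3
     import numpy as np

     db = sqlite3.connect(":memory:")
     db.execute("CREATE TABLE cells (i INTEGER, j INTEGER, v REAL)")
     db.execute("CREATE TABLE results (k INTEGER, v REAL)")
     db.executemany("INSERT INTO cells VALUES (?, ?, ?)",
                    ((i, j, np.random.rand())
                     for i in range(100) for j in range(100)))

     rows = db.execute("SELECT i, j, v FROM cells").fetchall()  # 1. suck out the data
     M = np.zeros((100, 100))
     for i, j, v in rows:                                       # 2. convert to array format
         M[i, j] = v
     svs = np.linalg.svd(M, compute_uv=False)                   # 3. compute externally
     db.executemany("INSERT INTO results VALUES (?, ?)",        # 4. return answer to DBMS
                    [(k, float(v)) for k, v in enumerate(svs)])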

  20. Bad News
     • Painful
     • Slow
     • Many analysis platforms are not parallel
     • Many are main-memory only

  21. My Proposal: SciDB
     • Build a commercial-quality array DBMS from the ground up, with integrated analytics
     • Open source!!!!

  22. Data Model
     • Nested multi-dimensional arrays; cells can be tuples or other arrays
     • Time is an automatically supported extra dimension
     • Ragged arrays allow each row/column to have a different dimensionality
     • Support for multiple flavors of ‘null’, with user-definable, context-sensitive treatment
     • Array cells can be ‘EMPTY’ (see the sketch below)
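     A toy Python rendering of those cell semantics; the dict-as-sparse-array representation and the null-flavor codes are illustrative only:

     EMPTY = object()    # a cell that exists in the box but holds nothing,
                         # distinct from a cell whose attribute is null

     NULL_MISSING, NULL_BELOW_DETECTION = 1, 2   # two user-defined null flavors

     # sparse nested array: (i, j, t) -> cell, with t the extra time dimension;
     # a cell is a tuple of attributes, and an attribute may be a nested array
     array = {
         (0, 0, 0): {"A": 1.5, "B": [0.1, 0.2, 0.3]},               # B is nested
         (0, 1, 0): {"A": ("null", NULL_BELOW_DETECTION), "B": []},
         (1, 0, 0): EMPTY,
     }
     # row i=2 has no cells at all: rows of a ragged array may differ in length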

  23. Array-oriented storage manager
     • Optimized for both dense and sparse array data: different data storage, compression, and access
     • Arrays are “chunked” (in multiple dimensions); chunks are partitioned across a collection of nodes
     • Chunks have ‘overlap’ to support neighborhood operations (see the sketch below)
     • Replication provides efficiency and back-up
     • No overwrite
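     A Python sketch of chunking with overlap, assuming numpy; the hash-based chunk-to-node placement at the end is an assumption for illustration, not SciDB’s actual strategy:

     import numpy as np

     def chunk_with_overlap(A, chunk, overlap):
         # split a 2-D array into chunk x chunk tiles, each padded with
         # `overlap` cells from its neighbors, so neighborhood operations
         # (stencils, smoothing) never need to cross a chunk boundary
         n, m = A.shape
         chunks = {}
         for i0 in range(0, n, chunk):
             for j0 in range(0, m, chunk):
                 lo_i, hi_i = max(0, i0 - overlap), min(n, i0 + chunk + overlap)
                 lo_j, hi_j = max(0, j0 - overlap), min(m, j0 + chunk + overlap)
                 chunks[(i0 // chunk, j0 // chunk)] = A[lo_i:hi_i, lo_j:hi_j].copy()
         return chunks

     chunks = chunk_with_overlap(np.arange(10000.).reshape(100, 100),
                                 chunk=25, overlap=2)
     n_nodes = 8
     placement = {cid: hash(cid) % n_nodes for cid in chunks}  # plus a replica elsewhere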

  24. Architecture
     • Shared-nothing cluster of 10’s–1000’s of nodes: commodity hardware, TCP/IP between nodes, linear scale-up
     • Each node has a processor and storage
     • Queries refer to arrays as if not distributed
     • Query planner optimizes queries for efficient data access & processing
     • Query plan runs on each node’s local executor & storage manager; a runtime supervisor coordinates execution
     • Java, C++, whatever… doesn’t require JDBC, ODBC
     • AQL, an extension of SQL; also supports UDFs

  25. SciDB DDL
     CREATE ARRAY Test_Array
       < A: integer NULLS, B: double, C: USER_DEFINED_TYPE >
       [ I=0:99999,1000,10, J=0:99999,1000,10 ]
     PARTITION OVER ( Node1, Node2, Node3 )
     USING block_cyclic();
     -- attribute names: A, B, C; index names: I, J; chunk size: 1000; overlap: 10

  26. Array Query Language (AQL)
     • Array data management, e.g. filter, aggregate, join, etc.
     • Statistical & linear algebra operations: multiply, QR factorization, etc.; parallel, disk-oriented
     • User-defined operators (Postgres-style)
     • Interface to external stat packages (e.g. R)

  27. Array Query Language (AQL)
     SELECT Geo-Mean ( T.B )
     FROM Test_Array T
     WHERE T.I BETWEEN :C1 AND :C2
       AND T.J BETWEEN :C3 AND :C4
       AND T.A = 10
     GROUP BY T.I;
     -- Geo-Mean is a user-defined aggregate on attribute B; the BETWEEN
     -- predicates subsample, T.A = 10 filters, and GROUP BY groups
     So far as SELECT / FROM / WHERE / GROUP BY queries are concerned, there is little logical difference between AQL and SQL.

  28. Matrix Multiply
     • Algorithm is sensitive to distribution (range, block-cyclic)
     • For range partitioning, the smaller of the two arrays is replicated at all nodes (scatter-gather)
     • Each node multiplies its “core” of the bigger array with the replicated smaller one
     • Produces a distributed answer (see the sketch below)
     CREATE ARRAY TS_Data < A1:int32, B1:double >
       [ I=0:99999,1000,0, J=0:3999,100,0 ]          -- a 10K x 4K array
     multiply ( project (TS_Data, B1),               -- project on one attribute
                transpose ( project (TS_Data, B1) ) )
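     A numpy sketch of the replicated (scatter-gather) scheme, shrunk to a toy size; np.array_split stands in for range partitioning across nodes:

     import numpy as np

     def replicated_multiply(big_row_blocks, small):
         # big_row_blocks: the big operand, range-partitioned by rows, one block per node
         # small: the smaller operand, replicated (scattered) to every node
         # each node multiplies its own block; the answer stays distributed
         return [block @ small for block in big_row_blocks]

     B = np.random.rand(2000, 40)               # toy stand-in for project(TS_Data, B1)
     blocks = np.array_split(B, 4, axis=0)      # 4 "nodes", each holding a row range
     result_blocks = replicated_multiply(blocks, B.T)   # computes B @ B.T per node
     assert np.allclose(np.vstack(result_blocks), B @ B.T)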

  29. Other Features Science Guys Want (these could be in an RDBMS, but aren’t)
     • Uncertainty: data has error bars, which must be carried along in the computation -- interval arithmetic (see the sketch below)
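     A minimal Python sketch of interval arithmetic (add and multiply only), showing how error bars are carried through, and widen along, a computation:

     class Interval:
         # a value with error bars, represented as [lo, hi]
         def __init__(self, lo, hi):
             self.lo, self.hi = lo, hi

         def __add__(self, other):
             return Interval(self.lo + other.lo, self.hi + other.hi)

         def __mul__(self, other):
             products = [self.lo * other.lo, self.lo * other.hi,
                         self.hi * other.lo, self.hi * other.hi]
             return Interval(min(products), max(products))

         def __repr__(self):
             mid, err = (self.lo + self.hi) / 2, (self.hi - self.lo) / 2
             return f"{mid:g} ± {err:g}"

     # a sensor reading of 10 ± 0.1 times a gain of 2 ± 0.05
     print(Interval(9.9, 10.1) * Interval(1.95, 2.05))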

  30. Other Features
     • Time travel: don’t fix errors by overwrite, i.e. keep all the data, in the extra time dimension (see the sketch below)
     • Named versions: recalibration usually handled this way
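     A toy Python model of no-overwrite storage for one cell, assuming writes arrive in time order; a named version would amount to remembering a label for a particular time:

     import bisect

     class TimeTravelCell:
         # no-overwrite storage for one array cell: every update appends a
         # (time, value) pair, so old states remain queryable ("time travel")
         def __init__(self):
             self.history = []                  # time-ordered (time, value) pairs

         def write(self, t, value):
             self.history.append((t, value))    # append; never overwrite

         def read(self, t):
             # value as of time t: the latest write with time <= t
             i = bisect.bisect_right(self.history, (t, float("inf"))) - 1
             return self.history[i][1] if i >= 0 else None

     cell = TimeTravelCell()
     cell.write(1, 42.0)      # original (mis-calibrated) reading
     cell.write(5, 41.7)      # recalibrated value, stored as a new version
     assert cell.read(3) == 42.0 and cell.read(9) == 41.7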

  31. Other Features
     • Provenance (lineage): what calibration generated the data, what the “cooking” algorithm was
     • In general, repeatability of derived data

  32. SciDB status -- www.scidb.org
     • Global development team (17 people), including many volunteers
     • Support/distribution by a VC-backed company: Paradigm4 in Waltham
     • 0.75 on the website now -- have at it!!!
     • 1.0 release will be on the website within a month: very functional, but a little pokey
     • “Red Queen” will be out in October: blazing performance, probably 5X the current system

  33. Performance of SciDB 1.0
     • Performance on stat operations: comparable to R, and degrades in a similar way when data does not fit in main memory; but multi-node parallelism is provided -- much better scalability
     • Performance on SQL: comparable to or better than Postgres, especially on multi-attribute restrictions; multi-node parallelism again provided -- and we can do stuff they can’t…
