Escience data management
Sponsored Links
This presentation is the property of its rightful owner.
1 / 33

eScience Data Management PowerPoint PPT Presentation


  • 63 Views
  • Uploaded on
  • Presentation posted in: General

eScience Data Management. It’s not just size that matters, it’s what you can do with it. Bill Howe, Phd eScience Institute. me. from eScience Rollout, 11/5/08. My Background. BS Industrial and Systems Engineering, GA Tech 1999 Big 3 Consulting with Deloitte 99-00

Download Presentation

eScience Data Management

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


eScience Data Management

It’s not just size that matters, it’s what you can do with it

Bill Howe, Phd

eScience Institute


me

from eScience Rollout, 11/5/08

Bill Howe, eScience Institute


My Background

  • BS Industrial and Systems Engineering, GA Tech 1999

  • Big 3 Consulting with Deloitte 99-00

    • Residual guilt from call centers of consultants burning $50k/day

  • Independent Consulting00-01

    • Microsoft, Siebel, Schlumberger, Verizon

  • Phd, Computer Science, Portland State University, 2006 (via OGI)

    • Dissertation: “GridFields: Model-Driven Data Manipulation in the Physical Sciences”, Advisor: David Maier

  • Postdoc and Data Architect 06-08

    • NSF Science and Technology Center for

      Coastal Margin Observation and Prediction (CMOP)

Bill Howe, eScience Institute


All Science is becoming eScience

Empirical X  Analytical X  Computational X  X-informatics

Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)

New model: “Download the world” (Data acquired en masse, independent of hypotheses)

But: Acquisition now outpaces analysis

  • Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)

  • Medicine: ubiquitous digital records, MRI, ultrasound

  • Oceanography: high-resolution models, cheap sensors, satellites

  • Biology: automated PCR, high-throughput sequencing

“Increase Data Collection Exponentially in Less Time, with FlowCAM”

Bill Howe, eScience Institute


“The future is already here. It’s just not very evenly distributed.”-- William Gibson

The Long Tail

  • Researchers with growing data management challenges but limited resources for cyberinfrastructure

  • No dedicated IT staff

  • Overreliance on simple tools (e.g., spreadsheets)

LSST (~100PB)

CERN (~15PB/year)

PanSTARRS (~40PB)

The long tail is getting fatter:

notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB) clusters become clouds (PB)

data inventory

SDSS (~100TB)

Ocean Modelers

CARMEN (~50TB)

Seis-mologists

<Spreadsheet users>

Microbiologists

ordinal position

Bill Howe, eScience Institute


Heterogeneity also drives costs

LSST

(~100PB; images, objects)

CERN

(~15PB/year, particle interactions)

PanSTARRS

(~40PB; images, objects, trajectories)

# of bytes

SDSS

(~100TB; images, objects)

OOI

(~50TB/year; sim. results, satellite, gliders, AUVs, vessels, more)

Biologists

(~10TB, sequences, alignments, annotations, BLAST hits, metadata, phylogenetic trees)

# of data types

Bill Howe, eScience Institute


complexity-hiding interfaces

Access Methods

Query Languages

Web Services

Visualization; Workflow

Storage Management

Data Integration

Knowledge Extraction,

Crawlers

Data Mining,

Distributed Programming Models,

Provenance

Facets of Data Management

The DB maxim: push computation to the data

Bill Howe, eScience Institute


Example: Relational Databases

At IBM Almaden in 60s and 70s, Codd worked out a formal basis for tabular data representation, organization, and access [Codd 70].

The early systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code the previously did!

Now: $10B market, de facto standard for data management. SQL is “intergalactic dataspeak”

physical data independence

logical data independence

Bill Howe, eScience Institute


Medium-Scale Data Management Toolbox

Relational Databases

The “hammer” of data management

Scientific Workflow Systems

[Howe, Freire, Silva, et al. 2008]

Science “Mashups”

[Howe, Green-Fishback, Maier, 2009]

“Dataspace” systems

[Howe, Maier, Rayner, Rucker 2008]

Bill Howe, eScience Institute


Large-Scale Data Management Toolbox

Amazon S3

RDBMS-like features in the cloud

Note: cost effectiveness unclear for large datasets

MapReduce

Parallel programming using functional programming abstractions

(Google)

Howe, Freire, Silva: 2009 NSF CluE Award

Connolly, Gardner: 2009 NSF CluE Award

Dryad

Parallel programming via relational algebra plus type safety, monitoring, debugging

(Michael Isard, Microsoft Research)

Bill Howe, eScience Institute


Current Activities

  • Consulting: Armbrust Lab

    (next slide)

  • Research: MapReduce for Oceanographic SImulations (+ Visualization and Workflow)

Bill Howe, eScience Institute


Consulting: Armbrust Lab

  • Initial Goal: Corral and inventory all relevant data

    • SOLiD sequencer: potentially 0.5 TB / day, flat files

    • Metadata: small relational DB + Rails/Django web app

    • Data Products: visualizations, intermediate results

    • Ad hoc scripts and programs

  • Initial Goal: Amplify programmer effort

    • Change is constant: No “one size fits all” solution; ad hoc development is the norm

    • Strategy: Teach biologists to “fish” (David Schruth’s R course)

    • Strategy: Develop an infrastructure that enables and encourages reuse -- scientific workflow systems

key idea: these are data too

Bill Howe, eScience Institute


Scientific Workflow Systems

  • Value proposition: More time on science, less time on code

  • How: By providing language features emphasizing sharing, reuse, reproducibility, rapid prototyping, efficiency

    • Provenance

    • Automatic task-parallelism

    • Visual programming

    • Caching

    • Domain-specific toolkits

  • Many examples from eScience and DB communities:

    • Trident (MSR), Taverna (Manchester), Kepler (UCSD), VisTrails (Utah), more

Bill Howe, eScience Institute


Photo: The Trident Scientific Workflow Workbench for Oceanography, developed by Microsoft Research, demonstrated at Microsoft’s TechFest 2008.

http://www.microsoft.com/mscorp/tc/trident.mspx

Bill Howe, eScience Institute


Bill Howe, eScience Institute

screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah


Bill Howe, eScience Institute

screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah


Bill Howe @ CMOP computes salt flux using GridFields

Erik Anderson @ Utah adds vector streamlines and adjusts opacity

Peter Lawson adds discussion of the scientific interpretation

Bill Howe @ CMOP adds an isosurface of salinity

source: VisTrails (Silva, Freire, Anderson) and GridFields (Howe)

Bill Howe, eScience Institute


Strategy at Armbrust Lab

  • Develop a benchmark suite of workflow exemplars and use them to evaluate workflow offerings

  • “Let a hundred flowers blossom” -- deploy multiple solutions in practice to assess user uptake

  • “Pay as you go” -- evolve a toolkit rather than attempt a comprehensive, monolithic data management juggernaut.

Informed by two of Jim Gray’s Laws of Data Engineering:

  • Start with “20 queries”

  • Go from “working to working”

Bill Howe, eScience Institute


NSF Award: Cluster Exploratory (CluE)

  • Partnership between NSF, IBM, Google

  • Data-intensive computing: “I/O farm”

    • massive queries, not massive simulations

    • “in ferro” experiments

  • To “Cloud-Enable” GridFields and VisTrails

    • Goal: 10+-year climatologies at interactive speeds

    • Requires turning over up to 25TB < 5s

    • Provenance, reproducibility, visualization: VisTrails

      • Connect rich desktop experience to cloud query engine

  • Co-PIs from University of Utah

    • Claudio Silva and Juliana Freire

Bill Howe, eScience Institute


Ahmdahl’s Laws

Gene Amdahl (1965): Laws for a balanced system

  • Parallelism: max speedup is S/(S+P)

  • One bit of IO/sec per instruction/sec (BW)

  • One byte of memory per one instruction/sec (MEM)

  • One IO per 50,000 instructions (IO)

    Modern multi-core systems move farther away from Amdahl’s Laws (Bell, Gray and Szalay 2006)

    For a Blue Gene the BW=0.001, MEM=0.12.

    For the JHU cluster BW=0.5, MEM=1.04

source: Alex Szalay, keynote, eScience 2008

Bill Howe, eScience Institute


Climatology

May

Feb

Washington

Columbia River

Oregon

Average Surface Salinity by Month

Columbia River Plume 1999-2006

psu

animation

Bill Howe, eScience Institute


1

3

4

5

6

7

2

8

9

15

10

11

12

13

14

16

17

23

18

(b)

19

20

21

22

psu

31

24

25

26

27

28

29

30

Bill Howe, eScience Institute


Epilogue

We’re here to help!

SIG Wiki:

https://sig.washington.edu/itsigs/SIG_eScience

eScience Blog:

http://escience.washington.edu/blog/

eScience wesbite:

http://www.washington.edu/uwtech/escience.html

Bill Howe, eScience Institute


Bill Howe, eScience Institute


eScience requirements are Fractal

William Gibson -- “The future is already here. It’s just not very evenly distributed.”

Bill Howe, eScience Institute


eScience

High-Performance Computing

Data Management

Online Collaboration Tools

CS Research

Consulting

Bill Howe, eScience Institute


It’s what you can do with it

  • Relational database

    • SQL, plus UDTs and UDFs as needed

  • FASTA databases

    • Alignments, rarefaction curves, phylogenetic trees, filtering

  • MapReduce:

    • Roll your own

  • Dryad

    • Relational algebra available; you can still roll our own if needed

Bill Howe, eScience Institute


A data deluge in all fields

 X-informatics

Empirical X  Analytical X  Computational X

Acquisition eventually outpaces analysis

  • Astronomy: SDSS, now LSST; PanSTARRS

  • Biology: PCR, SOLiD sequencing

  • Oceanography: high-resolution models, cheap sensors

  • Marine Microbiology: FlowCytometer

Bill Howe, eScience Institute

“Increase Data Collection Exponentially in Less Time, with FlowCAM”


High-Performance Computing

Data Management

Online Collaboration

Consulting

Community Building

Technology Transfer

eScience Research


Query Languages

  • Organize and encapsulate access methods

  • Raise the level of abstraction beyond GPLs

  • Identify and exploit opportunities for algebraic optimization

    • What is algebraic optimization? Consider the expression x/z + y/z

      x/z + y/z = (x + y)/z, but the latter is less expensive since it involves only one division operation

  • Tables -- SQL

  • XML -- XQuery, XPath

  • RDF -- SPARQL

  • Streams -- StreamSQL, CQL

  • Meshes (e.g., Finite Element Sims) -- GridFields

Bill Howe, eScience Institute


Example: Relational Databases (In Codd we Trust…)

At IBM Almaden in 60s and 70s, Codd worked out a formal basis for working with tabular data1.

The early relational systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code the previously did!

The Database Game: do the same thing as Codd, but with new data types: XML (trees), RDF (graphs), streams, DNA sequences, images, arrays, simulation results, etc.

1 E. F. Codd, “A Relational Model of Data for Large Shared Data Banks”, Communications of the ACM 13(6), pp 377-387, 1970

Bill Howe, eScience Institute


Gray’s Laws of Data Engineering

Jim Gray:

Scientific computing is revolving around data

Need scale-out solution for analysis

Take the analysis to the data!

Start with “20 queries”

Go from “working to working”

DISSC: Data Intensive Scalable Scientific Computing

slide source: Alex Szalay, keynote, eScience 2008

Bill Howe, eScience Institute


Data Management

Bill Howe, eScience Institute


  • Login