Escience data management
This presentation is the property of its rightful owner.
Sponsored Links
1 / 33

eScience Data Management PowerPoint PPT Presentation


  • 57 Views
  • Uploaded on
  • Presentation posted in: General

eScience Data Management. It’s not just size that matters, it’s what you can do with it. Bill Howe, Phd eScience Institute. me. from eScience Rollout, 11/5/08. My Background. BS Industrial and Systems Engineering, GA Tech 1999 Big 3 Consulting with Deloitte 99-00

Download Presentation

eScience Data Management

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Escience data management

eScience Data Management

It’s not just size that matters, it’s what you can do with it

Bill Howe, Phd

eScience Institute


Escience data management

me

from eScience Rollout, 11/5/08

Bill Howe, eScience Institute


My background

My Background

  • BS Industrial and Systems Engineering, GA Tech 1999

  • Big 3 Consulting with Deloitte 99-00

    • Residual guilt from call centers of consultants burning $50k/day

  • Independent Consulting00-01

    • Microsoft, Siebel, Schlumberger, Verizon

  • Phd, Computer Science, Portland State University, 2006 (via OGI)

    • Dissertation: “GridFields: Model-Driven Data Manipulation in the Physical Sciences”, Advisor: David Maier

  • Postdoc and Data Architect 06-08

    • NSF Science and Technology Center for

      Coastal Margin Observation and Prediction (CMOP)

Bill Howe, eScience Institute


All science is becoming escience

All Science is becoming eScience

Empirical X  Analytical X  Computational X  X-informatics

Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)

New model: “Download the world” (Data acquired en masse, independent of hypotheses)

But: Acquisition now outpaces analysis

  • Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)

  • Medicine: ubiquitous digital records, MRI, ultrasound

  • Oceanography: high-resolution models, cheap sensors, satellites

  • Biology: automated PCR, high-throughput sequencing

“Increase Data Collection Exponentially in Less Time, with FlowCAM”

Bill Howe, eScience Institute


The long tail

“The future is already here. It’s just not very evenly distributed.”-- William Gibson

The Long Tail

  • Researchers with growing data management challenges but limited resources for cyberinfrastructure

  • No dedicated IT staff

  • Overreliance on simple tools (e.g., spreadsheets)

LSST (~100PB)

CERN (~15PB/year)

PanSTARRS (~40PB)

The long tail is getting fatter:

notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB) clusters become clouds (PB)

data inventory

SDSS (~100TB)

Ocean Modelers

CARMEN (~50TB)

Seis-mologists

<Spreadsheet users>

Microbiologists

ordinal position

Bill Howe, eScience Institute


Heterogeneity also drives costs

Heterogeneity also drives costs

LSST

(~100PB; images, objects)

CERN

(~15PB/year, particle interactions)

PanSTARRS

(~40PB; images, objects, trajectories)

# of bytes

SDSS

(~100TB; images, objects)

OOI

(~50TB/year; sim. results, satellite, gliders, AUVs, vessels, more)

Biologists

(~10TB, sequences, alignments, annotations, BLAST hits, metadata, phylogenetic trees)

# of data types

Bill Howe, eScience Institute


Facets of data management

complexity-hiding interfaces

Access Methods

Query Languages

Web Services

Visualization; Workflow

Storage Management

Data Integration

Knowledge Extraction,

Crawlers

Data Mining,

Distributed Programming Models,

Provenance

Facets of Data Management

The DB maxim: push computation to the data

Bill Howe, eScience Institute


Example relational databases

Example: Relational Databases

At IBM Almaden in 60s and 70s, Codd worked out a formal basis for tabular data representation, organization, and access [Codd 70].

The early systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code the previously did!

Now: $10B market, de facto standard for data management. SQL is “intergalactic dataspeak”

physical data independence

logical data independence

Bill Howe, eScience Institute


Medium scale data management toolbox

Medium-Scale Data Management Toolbox

Relational Databases

The “hammer” of data management

Scientific Workflow Systems

[Howe, Freire, Silva, et al. 2008]

Science “Mashups”

[Howe, Green-Fishback, Maier, 2009]

“Dataspace” systems

[Howe, Maier, Rayner, Rucker 2008]

Bill Howe, eScience Institute


Large scale data management toolbox

Large-Scale Data Management Toolbox

Amazon S3

RDBMS-like features in the cloud

Note: cost effectiveness unclear for large datasets

MapReduce

Parallel programming using functional programming abstractions

(Google)

Howe, Freire, Silva: 2009 NSF CluE Award

Connolly, Gardner: 2009 NSF CluE Award

Dryad

Parallel programming via relational algebra plus type safety, monitoring, debugging

(Michael Isard, Microsoft Research)

Bill Howe, eScience Institute


Current activities

Current Activities

  • Consulting: Armbrust Lab

    (next slide)

  • Research: MapReduce for Oceanographic SImulations (+ Visualization and Workflow)

Bill Howe, eScience Institute


Consulting armbrust lab

Consulting: Armbrust Lab

  • Initial Goal: Corral and inventory all relevant data

    • SOLiD sequencer: potentially 0.5 TB / day, flat files

    • Metadata: small relational DB + Rails/Django web app

    • Data Products: visualizations, intermediate results

    • Ad hoc scripts and programs

  • Initial Goal: Amplify programmer effort

    • Change is constant: No “one size fits all” solution; ad hoc development is the norm

    • Strategy: Teach biologists to “fish” (David Schruth’s R course)

    • Strategy: Develop an infrastructure that enables and encourages reuse -- scientific workflow systems

key idea: these are data too

Bill Howe, eScience Institute


Scientific workflow systems

Scientific Workflow Systems

  • Value proposition: More time on science, less time on code

  • How: By providing language features emphasizing sharing, reuse, reproducibility, rapid prototyping, efficiency

    • Provenance

    • Automatic task-parallelism

    • Visual programming

    • Caching

    • Domain-specific toolkits

  • Many examples from eScience and DB communities:

    • Trident (MSR), Taverna (Manchester), Kepler (UCSD), VisTrails (Utah), more

Bill Howe, eScience Institute


Escience data management

Photo: The Trident Scientific Workflow Workbench for Oceanography, developed by Microsoft Research, demonstrated at Microsoft’s TechFest 2008.

http://www.microsoft.com/mscorp/tc/trident.mspx

Bill Howe, eScience Institute


Escience data management

Bill Howe, eScience Institute

screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah


Escience data management

Bill Howe, eScience Institute

screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah


Escience data management

Bill Howe @ CMOP computes salt flux using GridFields

Erik Anderson @ Utah adds vector streamlines and adjusts opacity

Peter Lawson adds discussion of the scientific interpretation

Bill Howe @ CMOP adds an isosurface of salinity

source: VisTrails (Silva, Freire, Anderson) and GridFields (Howe)

Bill Howe, eScience Institute


Strategy at armbrust lab

Strategy at Armbrust Lab

  • Develop a benchmark suite of workflow exemplars and use them to evaluate workflow offerings

  • “Let a hundred flowers blossom” -- deploy multiple solutions in practice to assess user uptake

  • “Pay as you go” -- evolve a toolkit rather than attempt a comprehensive, monolithic data management juggernaut.

Informed by two of Jim Gray’s Laws of Data Engineering:

  • Start with “20 queries”

  • Go from “working to working”

Bill Howe, eScience Institute


Nsf award cluster exploratory clue

NSF Award: Cluster Exploratory (CluE)

  • Partnership between NSF, IBM, Google

  • Data-intensive computing: “I/O farm”

    • massive queries, not massive simulations

    • “in ferro” experiments

  • To “Cloud-Enable” GridFields and VisTrails

    • Goal: 10+-year climatologies at interactive speeds

    • Requires turning over up to 25TB < 5s

    • Provenance, reproducibility, visualization: VisTrails

      • Connect rich desktop experience to cloud query engine

  • Co-PIs from University of Utah

    • Claudio Silva and Juliana Freire

Bill Howe, eScience Institute


Ahmdahl s laws

Ahmdahl’s Laws

Gene Amdahl (1965): Laws for a balanced system

  • Parallelism: max speedup is S/(S+P)

  • One bit of IO/sec per instruction/sec (BW)

  • One byte of memory per one instruction/sec (MEM)

  • One IO per 50,000 instructions (IO)

    Modern multi-core systems move farther away from Amdahl’s Laws (Bell, Gray and Szalay 2006)

    For a Blue Gene the BW=0.001, MEM=0.12.

    For the JHU cluster BW=0.5, MEM=1.04

source: Alex Szalay, keynote, eScience 2008

Bill Howe, eScience Institute


Climatology

Climatology

May

Feb

Washington

Columbia River

Oregon

Average Surface Salinity by Month

Columbia River Plume 1999-2006

psu

animation

Bill Howe, eScience Institute


Escience data management

1

3

4

5

6

7

2

8

9

15

10

11

12

13

14

16

17

23

18

(b)

19

20

21

22

psu

31

24

25

26

27

28

29

30

Bill Howe, eScience Institute


Epilogue

Epilogue

We’re here to help!

SIG Wiki:

https://sig.washington.edu/itsigs/SIG_eScience

eScience Blog:

http://escience.washington.edu/blog/

eScience wesbite:

http://www.washington.edu/uwtech/escience.html

Bill Howe, eScience Institute


Escience data management

Bill Howe, eScience Institute


Escience requirements are fractal

eScience requirements are Fractal

William Gibson -- “The future is already here. It’s just not very evenly distributed.”

Bill Howe, eScience Institute


Escience

eScience

High-Performance Computing

Data Management

Online Collaboration Tools

CS Research

Consulting

Bill Howe, eScience Institute


It s what you can do with it

It’s what you can do with it

  • Relational database

    • SQL, plus UDTs and UDFs as needed

  • FASTA databases

    • Alignments, rarefaction curves, phylogenetic trees, filtering

  • MapReduce:

    • Roll your own

  • Dryad

    • Relational algebra available; you can still roll our own if needed

Bill Howe, eScience Institute


A data deluge in all fields

A data deluge in all fields

 X-informatics

Empirical X  Analytical X  Computational X

Acquisition eventually outpaces analysis

  • Astronomy: SDSS, now LSST; PanSTARRS

  • Biology: PCR, SOLiD sequencing

  • Oceanography: high-resolution models, cheap sensors

  • Marine Microbiology: FlowCytometer

Bill Howe, eScience Institute

“Increase Data Collection Exponentially in Less Time, with FlowCAM”


Escience data management

High-Performance Computing

Data Management

Online Collaboration

Consulting

Community Building

Technology Transfer

eScience Research


Query languages

Query Languages

  • Organize and encapsulate access methods

  • Raise the level of abstraction beyond GPLs

  • Identify and exploit opportunities for algebraic optimization

    • What is algebraic optimization? Consider the expression x/z + y/z

      x/z + y/z = (x + y)/z, but the latter is less expensive since it involves only one division operation

  • Tables -- SQL

  • XML -- XQuery, XPath

  • RDF -- SPARQL

  • Streams -- StreamSQL, CQL

  • Meshes (e.g., Finite Element Sims) -- GridFields

Bill Howe, eScience Institute


Example relational databases in codd we trust

Example: Relational Databases (In Codd we Trust…)

At IBM Almaden in 60s and 70s, Codd worked out a formal basis for working with tabular data1.

The early relational systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code the previously did!

The Database Game: do the same thing as Codd, but with new data types: XML (trees), RDF (graphs), streams, DNA sequences, images, arrays, simulation results, etc.

1 E. F. Codd, “A Relational Model of Data for Large Shared Data Banks”, Communications of the ACM 13(6), pp 377-387, 1970

Bill Howe, eScience Institute


Gray s laws of data engineering

Gray’s Laws of Data Engineering

Jim Gray:

Scientific computing is revolving around data

Need scale-out solution for analysis

Take the analysis to the data!

Start with “20 queries”

Go from “working to working”

DISSC: Data Intensive Scalable Scientific Computing

slide source: Alex Szalay, keynote, eScience 2008

Bill Howe, eScience Institute


Data management

Data Management

Bill Howe, eScience Institute


  • Login