eScience Data Management

It’s not just size that matters, it’s what you can do with it

Bill Howe, PhD

eScience Institute

me

from eScience Rollout, 11/5/08

Bill Howe, eScience Institute

My Background
  • BS Industrial and Systems Engineering, GA Tech 1999
  • Big 3 Consulting with Deloitte 99-00
    • Residual guilt from call centers of consultants burning $50k/day
  • Independent Consulting 00-01
    • Microsoft, Siebel, Schlumberger, Verizon
  • PhD, Computer Science, Portland State University, 2006 (via OGI)
    • Dissertation: “GridFields: Model-Driven Data Manipulation in the Physical Sciences”, Advisor: David Maier
  • Postdoc and Data Architect 06-08
    • NSF Science and Technology Center for Coastal Margin Observation and Prediction (CMOP)

All Science is becoming eScience

Empirical X → Analytical X → Computational X → X-informatics

Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)

New model: “Download the world” (Data acquired en masse, independent of hypotheses)

But: Acquisition now outpaces analysis

  • Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
  • Medicine: ubiquitous digital records, MRI, ultrasound
  • Oceanography: high-resolution models, cheap sensors, satellites
  • Biology: automated PCR, high-throughput sequencing

“Increase Data Collection Exponentially in Less Time, with FlowCAM”

The Long Tail

“The future is already here. It’s just not very evenly distributed.” -- William Gibson

  • Researchers with growing data management challenges but limited resources for cyberinfrastructure
  • No dedicated IT staff
  • Overreliance on simple tools (e.g., spreadsheets)

The long tail is getting fatter:

notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB), clusters become clouds (PB)

[Figure: data inventory vs. ordinal position]
  • LSST (~100PB)
  • CERN (~15PB/year)
  • PanSTARRS (~40PB)
  • SDSS (~100TB)
  • Ocean Modelers
  • CARMEN (~50TB)
  • Seismologists
  • <Spreadsheet users>
  • Microbiologists

Heterogeneity also drives costs

[Figure: # of bytes vs. # of data types]
  • LSST (~100PB; images, objects)
  • CERN (~15PB/year; particle interactions)
  • PanSTARRS (~40PB; images, objects, trajectories)
  • SDSS (~100TB; images, objects)
  • OOI (~50TB/year; sim. results, satellite, gliders, AUVs, vessels, more)
  • Biologists (~10TB; sequences, alignments, annotations, BLAST hits, metadata, phylogenetic trees)

Facets of Data Management

  • Complexity-hiding interfaces
  • Access Methods
  • Query Languages
  • Web Services
  • Visualization; Workflow
  • Storage Management
  • Data Integration
  • Knowledge Extraction
  • Crawlers
  • Data Mining
  • Distributed Programming Models
  • Provenance

The DB maxim: push computation to the data

Example: Relational Databases

At IBM Almaden in the 1960s and 70s, Codd worked out a formal basis for tabular data representation, organization, and access [Codd 70].

The early systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code they previously did!

Now: $10B market, de facto standard for data management. SQL is “intergalactic dataspeak”

physical data independence

logical data independence
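
As a rough illustration of the two kinds of independence named above, here is a small Python sketch using the standard-library sqlite3 module. The table name obs, its columns, the index and view names, and the salinity values are all made up for the example. Adding an index changes the physical layout but not the query or its answer (physical data independence); a view lets a client keep its query while the base table is free to change behind it (logical data independence).

```python
import sqlite3

# Hypothetical observations: (station, month, salinity in psu).
ROWS = [("s1", 2, 30.0), ("s2", 2, 32.0), ("s1", 5, 20.0), ("s2", 5, 24.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (station TEXT, month INTEGER, salinity REAL)")
conn.executemany("INSERT INTO obs VALUES (?, ?, ?)", ROWS)

QUERY = "SELECT month, AVG(salinity) FROM obs GROUP BY month ORDER BY month"

# Physical data independence: the same query text works before and after
# an index is created; only the access path may change.
before_index = conn.execute(QUERY).fetchall()
conn.execute("CREATE INDEX obs_month ON obs (month)")
after_index = conn.execute(QUERY).fetchall()

# Logical data independence: clients query a view instead of the base table.
conn.execute(
    "CREATE VIEW monthly AS "
    "SELECT month, AVG(salinity) AS mean_salinity FROM obs GROUP BY month"
)
via_view = conn.execute(
    "SELECT month, mean_salinity FROM monthly ORDER BY month"
).fetchall()

print(before_index == after_index == via_view)
```

The point of the sketch is only that the query text never changes; whether the index actually speeds anything up depends on the engine and the data.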

Medium-Scale Data Management Toolbox

Relational Databases

The “hammer” of data management

Scientific Workflow Systems

[Howe, Freire, Silva, et al. 2008]

Science “Mashups”

[Howe, Green-Fishback, Maier, 2009]

“Dataspace” systems

[Howe, Maier, Rayner, Rucker 2008]

Large-Scale Data Management Toolbox

Amazon S3

RDBMS-like features in the cloud

Note: cost effectiveness unclear for large datasets

MapReduce

Parallel programming using functional programming abstractions

(Google)

Howe, Freire, Silva: 2009 NSF CluE Award

Connolly, Gardner: 2009 NSF CluE Award

Dryad

Parallel programming via relational algebra plus type safety, monitoring, debugging

(Michael Isard, Microsoft Research)
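
To make "functional programming abstractions" concrete, here is a single-process Python sketch of the map, shuffle, and reduce phases. It is not Google's implementation or any real framework's API; the function names, the (month, salinity) record layout, and the sample values are assumptions chosen for illustration. The job computes a mean salinity per month.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the (key, value) pairs."""
    return [pair for record in records for pair in mapper(record)]


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {key: [value for _, value in group]
            for key, group in groupby(ordered, key=itemgetter(0))}


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def salinity_mapper(record):
    """Emit (month, (salinity, 1)) so sums and counts can be combined."""
    month, salinity = record
    yield month, (salinity, 1)


def mean_reducer(key, values):
    """Combine (sum, count) pairs and return the mean for this key."""
    total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), values)
    return total / count


def run_job(records):
    pairs = map_phase(records, salinity_mapper)
    return reduce_phase(shuffle(pairs), mean_reducer)


# Hypothetical records: (month, salinity in psu).
RECORDS = [("Feb", 30.0), ("May", 20.0), ("Feb", 32.0), ("May", 24.0)]
monthly_mean = run_job(RECORDS)
print(monthly_mean)
```

In a real deployment the mapper and reducer would run on many machines and the shuffle would move data over the network; only the two user-supplied functions carry over from this sketch.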

Current Activities
  • Consulting: Armbrust Lab (next slide)
  • Research: MapReduce for Oceanographic Simulations (+ Visualization and Workflow)

Consulting: Armbrust Lab
  • Initial Goal: Corral and inventory all relevant data
    • SOLiD sequencer: potentially 0.5 TB / day, flat files
    • Metadata: small relational DB + Rails/Django web app
    • Data Products: visualizations, intermediate results
    • Ad hoc scripts and programs
  • Initial Goal: Amplify programmer effort
    • Change is constant: No “one size fits all” solution; ad hoc development is the norm
    • Strategy: Teach biologists to “fish” (David Schruth’s R course)
    • Strategy: Develop an infrastructure that enables and encourages reuse -- scientific workflow systems

key idea: these are data too

Scientific Workflow Systems
  • Value proposition: More time on science, less time on code
  • How: By providing language features emphasizing sharing, reuse, reproducibility, rapid prototyping, efficiency
    • Provenance
    • Automatic task-parallelism
    • Visual programming
    • Caching
    • Domain-specific toolkits
  • Many examples from eScience and DB communities:
    • Trident (MSR), Taverna (Manchester), Kepler (UCSD), VisTrails (Utah), more
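
Two of the features listed above, provenance and caching, can be sketched in a few lines of Python. This is a toy, not how Trident, Taverna, Kepler, or VisTrails work internally; the Workflow class, its run method, and the mean_salinity step are hypothetical names invented for the example.

```python
import hashlib
import json


class Workflow:
    """Toy workflow runner that caches step results and records provenance."""

    def __init__(self):
        self.cache = {}        # result cache keyed by step name + inputs
        self.provenance = []   # ordered log of every step that was requested

    def run(self, func, *args):
        # The key identifies a step by its name and its JSON-serializable inputs.
        payload = json.dumps([func.__name__, args], sort_keys=True)
        key = hashlib.sha256(payload.encode()).hexdigest()
        cached = key in self.cache
        if not cached:
            self.cache[key] = func(*args)
        self.provenance.append(
            {"step": func.__name__, "inputs": list(args), "cached": cached}
        )
        return self.cache[key]


call_count = 0


def mean_salinity(values):
    """Example step; counts its invocations so caching is observable."""
    global call_count
    call_count += 1
    return sum(values) / len(values)


wf = Workflow()
first = wf.run(mean_salinity, [30.0, 32.0])
second = wf.run(mean_salinity, [30.0, 32.0])  # served from the cache
print(first, second, call_count, wf.provenance)
```

Because the provenance log records inputs alongside step names, a later user can see how a result was produced and rerun it, which is the reuse and reproducibility argument made in the list above.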


Photo: The Trident Scientific Workflow Workbench for Oceanography, developed by Microsoft Research, demonstrated at Microsoft’s TechFest 2008.

http://www.microsoft.com/mscorp/tc/trident.mspx


screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah


screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah

Bill Howe @ CMOP computes salt flux using GridFields

Erik Anderson @ Utah adds vector streamlines and adjusts opacity

Peter Lawson adds discussion of the scientific interpretation

Bill Howe @ CMOP adds an isosurface of salinity

source: VisTrails (Silva, Freire, Anderson) and GridFields (Howe)

Strategy at Armbrust Lab
  • Develop a benchmark suite of workflow exemplars and use them to evaluate workflow offerings
  • “Let a hundred flowers blossom” -- deploy multiple solutions in practice to assess user uptake
  • “Pay as you go” -- evolve a toolkit rather than attempt a comprehensive, monolithic data management juggernaut.

Informed by two of Jim Gray’s Laws of Data Engineering:

  • Start with “20 queries”
  • Go from “working to working”

NSF Award: Cluster Exploratory (CluE)
  • Partnership between NSF, IBM, Google
  • Data-intensive computing: “I/O farm”
    • massive queries, not massive simulations
    • “in ferro” experiments
  • To “Cloud-Enable” GridFields and VisTrails
    • Goal: 10+-year climatologies at interactive speeds
    • Requires turning over up to 25TB in under 5s
    • Provenance, reproducibility, visualization: VisTrails
      • Connect rich desktop experience to cloud query engine
  • Co-PIs from University of Utah
    • Claudio Silva and Juliana Freire
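
The 25TB in under 5s target implies a large aggregate read rate. The short Python calculation below works it out, assuming decimal terabytes and a per-disk sequential read rate of 100 MB/s; that per-disk figure is an illustrative assumption, not a number from the talk.

```python
import math

TB = 10**12  # decimal terabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes


def required_throughput(total_bytes, seconds):
    """Aggregate bytes per second needed to scan total_bytes in the given time."""
    return total_bytes / seconds


def devices_needed(throughput, per_device_rate):
    """Minimum number of devices reading in parallel to reach the throughput."""
    return math.ceil(throughput / per_device_rate)


aggregate = required_throughput(25 * TB, 5)   # bytes per second
disks = devices_needed(aggregate, 100 * MB)   # assumed 100 MB/s per disk

print(f"{aggregate / TB:.0f} TB/s aggregate, at least {disks} disks in parallel")
```

Under these assumptions the scan needs about 5 TB/s spread over tens of thousands of spindles, which is why the work is framed as an "I/O farm" problem rather than a simulation problem.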

Amdahl’s Laws

Gene Amdahl (1965): Laws for a balanced system

  • Parallelism: max speedup is (S+P)/S, where S is the serial time and P is the parallelizable time
  • One bit of IO/sec per instruction/sec (BW)
  • One byte of memory per one instruction/sec (MEM)
  • One IO per 50,000 instructions (IO)

Modern multi-core systems move farther away from Amdahl’s Laws (Bell, Gray and Szalay 2006)

For a Blue Gene the BW=0.001, MEM=0.12.

For the JHU cluster BW=0.5, MEM=1.04
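
A short Python sketch of these quantities follows. The speedup bound uses the standard form of Amdahl's law, (S+P)/S for serial time S and parallelizable time P. The two balance ratios follow the BW and MEM definitions in the list above. The function names and the hardware figures in the example are hypothetical and do not describe Blue Gene or the JHU cluster.

```python
def max_speedup(serial, parallel):
    """Upper bound on speedup with unlimited processors: (S + P) / S."""
    if serial <= 0:
        raise ValueError("serial time must be positive")
    return (serial + parallel) / serial


def speedup(serial, parallel, processors):
    """Speedup when only the parallel part is divided across n processors."""
    return (serial + parallel) / (serial + parallel / processors)


def amdahl_bw(io_bits_per_sec, instructions_per_sec):
    """Bits of I/O per second per instruction per second (balanced at 1)."""
    return io_bits_per_sec / instructions_per_sec


def amdahl_mem(memory_bytes, instructions_per_sec):
    """Bytes of memory per instruction per second (balanced at 1)."""
    return memory_bytes / instructions_per_sec


# A job with 1 hour of serial work and 9 hours of parallelizable work.
bound = max_speedup(1, 9)
on_nine = speedup(1, 9, 9)

# Hypothetical machine: 8e9 instructions/s, 2e9 bits/s of I/O, 16e9 bytes of RAM.
bw = amdahl_bw(2e9, 8e9)
mem = amdahl_mem(16e9, 8e9)

print(bound, on_nine, bw, mem)
```

A BW value well below 1, as in the hypothetical machine here, means the processor can issue instructions much faster than the I/O system can feed it, which is the imbalance the slide points to.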

source: Alex Szalay, keynote, eScience 2008

Climatology

[Figure (animation): Average Surface Salinity by Month, Columbia River Plume 1999-2006. Panels labeled Feb and May; map labels: Washington, Columbia River, Oregon; color scale in psu.]

Epilogue

We’re here to help!

SIG Wiki:

https://sig.washington.edu/itsigs/SIG_eScience

eScience Blog:

http://escience.washington.edu/blog/

eScience website:

http://www.washington.edu/uwtech/escience.html

eScience requirements are Fractal

William Gibson -- “The future is already here. It’s just not very evenly distributed.”

eScience

High-Performance Computing

Data Management

Online Collaboration Tools

CS Research

Consulting

It’s what you can do with it
  • Relational database
    • SQL, plus UDTs and UDFs as needed
  • FASTA databases
    • Alignments, rarefaction curves, phylogenetic trees, filtering
  • MapReduce:
    • Roll your own
  • Dryad
    • Relational algebra available; you can still roll your own if needed
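
As one reading of "SQL, plus UDTs and UDFs as needed", the Python sketch below registers a user-defined function with SQLite through the standard-library sqlite3 module and uses it in a WHERE clause. The reads table, the gc_content function, and the three short sequences are invented for the example.

```python
import sqlite3


def gc_content(sequence):
    """Fraction of G and C bases in a DNA sequence; None for empty input."""
    if not sequence:
        return None
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


conn = sqlite3.connect(":memory:")
# Register the Python function as a one-argument SQL function.
conn.create_function("gc_content", 1, gc_content)

conn.execute("CREATE TABLE reads (id TEXT, seq TEXT)")
conn.executemany(
    "INSERT INTO reads VALUES (?, ?)",
    [("r1", "GGCC"), ("r2", "ATAT"), ("r3", "ATGC")],  # hypothetical reads
)

# The domain-specific filter runs inside the query, next to the data.
gc_rich = conn.execute(
    "SELECT id FROM reads WHERE gc_content(seq) >= 0.5 ORDER BY id"
).fetchall()

print(gc_rich)
```

The design choice is that the domain logic lives in a small function while selection, ordering, and storage stay with the database, in line with the maxim of pushing computation to the data.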

A data deluge in all fields

Empirical X → Analytical X → Computational X → X-informatics

Acquisition eventually outpaces analysis

  • Astronomy: SDSS, now LSST; PanSTARRS
  • Biology: PCR, SOLiD sequencing
  • Oceanography: high-resolution models, cheap sensors
  • Marine Microbiology: FlowCytometer

“Increase Data Collection Exponentially in Less Time, with FlowCAM”

High-Performance Computing

Data Management

Online Collaboration

Consulting

Community Building

Technology Transfer

eScience Research

Query Languages
  • Organize and encapsulate access methods
  • Raise the level of abstraction beyond GPLs
  • Identify and exploit opportunities for algebraic optimization
    • What is algebraic optimization? Consider the expression x/z + y/z

x/z + y/z = (x + y)/z, but the latter is less expensive since it involves only one division operation

  • Tables -- SQL
  • XML -- XQuery, XPath
  • RDF -- SPARQL
  • Streams -- StreamSQL, CQL
  • Meshes (e.g., Finite Element Sims) -- GridFields
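
The x/z + y/z rewrite mentioned above can be sketched as a rule applied to a small expression tree. This Python example is a toy optimizer with a single rule, not the optimizer of any real query engine; the tuple encoding and the function names are assumptions made for the illustration.

```python
import operator

OPS = {"+": operator.add, "/": operator.truediv}


def rewrite(expr):
    """Apply the rule a/c + b/c -> (a + b)/c bottom-up over the tree."""
    if isinstance(expr, str):          # a variable name is a leaf
        return expr
    op, left, right = expr
    left, right = rewrite(left), rewrite(right)
    if (op == "+"
            and isinstance(left, tuple) and isinstance(right, tuple)
            and left[0] == "/" and right[0] == "/"
            and left[2] == right[2]):  # same divisor on both sides
        return ("/", ("+", left[1], right[1]), left[2])
    return (op, left, right)


def count_divisions(expr):
    """Number of division nodes, used here as a crude cost measure."""
    if isinstance(expr, str):
        return 0
    op, left, right = expr
    return (op == "/") + count_divisions(left) + count_divisions(right)


def evaluate(expr, env):
    """Evaluate the tree with variable values taken from env."""
    if isinstance(expr, str):
        return env[expr]
    op, left, right = expr
    return OPS[op](evaluate(left, env), evaluate(right, env))


original = ("+", ("/", "x", "z"), ("/", "y", "z"))
optimized = rewrite(original)
env = {"x": 6, "y": 4, "z": 2}

print(optimized, count_divisions(original), count_divisions(optimized))
print(evaluate(original, env), evaluate(optimized, env))
```

The two forms are equal algebraically; with floating-point inputs they can differ in the last bits, which is one reason real optimizers are careful about which rewrites they apply.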

Example: Relational Databases (In Codd we Trust…)

At IBM Almaden in the 1960s and 70s, Codd worked out a formal basis for working with tabular data [1].

The early relational systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code they previously did!

The Database Game: do the same thing as Codd, but with new data types: XML (trees), RDF (graphs), streams, DNA sequences, images, arrays, simulation results, etc.

[1] E. F. Codd, “A Relational Model of Data for Large Shared Data Banks”, Communications of the ACM 13(6), pp. 377-387, 1970

Gray’s Laws of Data Engineering

Jim Gray:

Scientific computing is revolving around data

Need scale-out solution for analysis

Take the analysis to the data!

Start with “20 queries”

Go from “working to working”

DISSC: Data Intensive Scalable Scientific Computing

slide source: Alex Szalay, keynote, eScience 2008

Data Management
