The Data Author’s Perspective: Lessons Learned From Data Creation to Data Curation

Collect. Store. Present. Analyze. Search. Retrieve. The Data Author’s Perspective: Lessons Learned From Data Creation to Data Curation. Jeff Dozier James E. Frew. Snow spectral reflectance and absorption coefficient of ice. Landsat Thematic Mapper (TM) band combinations.

Presentation Transcript

Collect

Store

Present

Analyze

Search

Retrieve

The Data Author’s Perspective: Lessons Learned From Data Creation to Data Curation

Jeff Dozier, James E. Frew

Landsat Thematic Mapper (TM) band combinations

Bands 4,3,2 (R,G,B)

Bands 5,4,2 (R,G,B)

Examples of fractional snow cover, January through April 2004

[Figure panels: Jan 01 2004; Jan 17 2004; Mar 26 2004; Apr 08 2004]

Examples of grain size, January through April 2004

[Figure panels: Jan 01 2004; Jan 17 2004; Mar 26 2004; Apr 08 2004]

Effect of vegetation

2004, March 3 vs March 4

2004, March 4 vs March 5

2004, March 5 vs March 7

2004, March 7 vs March 8

Applications: snowmelt modeling, Marble Fork of the Kaweah River (Molotch et al., GRL, 2004)

[Equation image not captured in the transcript. Surviving labels: Snow Covered Area; net radiation > 0; degree days > 0]

where:

m_q = energy-to-water-depth conversion, 0.026 cm W⁻¹ m² day⁻¹
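
The conversion factor quoted above (0.026 cm W⁻¹ m² day⁻¹) can be checked with simple arithmetic: 1 W m⁻² sustained for a day delivers 86,400 J m⁻², and dividing by the latent heat of fusion of ice gives the mass of ice melted. The sketch below is only an illustration. The constants are standard textbook values rather than numbers from the slides, the function names are hypothetical, and only the net-radiation term is shown because the full melt equation did not survive in the transcript.

```python
# Standard physical constants (assumed here, not taken from the slides).
LATENT_HEAT_OF_FUSION = 3.34e5   # J needed to melt 1 kg of ice
WATER_DENSITY = 1000.0           # kg per cubic metre
SECONDS_PER_DAY = 86400.0


def energy_to_water_depth_cm():
    """Melt depth (cm of water) from 1 W/m^2 sustained for one day."""
    joules_per_m2 = 1.0 * SECONDS_PER_DAY              # energy delivered per m^2
    kg_per_m2 = joules_per_m2 / LATENT_HEAT_OF_FUSION  # mass of ice melted per m^2
    metres = kg_per_m2 / WATER_DENSITY                 # equivalent water depth
    return metres * 100.0


def radiation_melt_cm(net_radiation_w_m2, mq=None):
    """Daily melt from net radiation; the term is active only when it is > 0."""
    if mq is None:
        mq = energy_to_water_depth_cm()
    return mq * max(net_radiation_w_m2, 0.0)


M_Q = energy_to_water_depth_cm()
print(round(M_Q, 3))  # 0.026
```

With these constants the factor comes out at about 0.0259, which rounds to the 0.026 quoted on the slide.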

Magnitude of snowmelt: Modeled – Observed snow water equivalence

[Figure: SWE difference (cm), Tokopah basin, Sierra Nevada. Panels: AVIRIS albedo; assumed w/ update; assumed albedo]

The data author’s perspective on drivers and constraints
  • The science information user:
    • I want reliable, timely, usable science information products
      • Accessibility
      • Accountability
  • The funding agencies and the science community:
    • We want this to be done by a distributed federation of providers, not just by data centers
      • Scalability
  • The science information provider:
    • I’m doing just fine, thanks.
      • Transparency
Research vs. production computing
  • Research computing is …
    • Heterogeneous
      • multiple platforms, applications, languages
    • Idiosyncratic
      • researchers typically have highly customized computing environments
    • Problem-driven
      • focus on results, not processes
  • Production computing is …
    • Robust
      • reliable, not just correct
    • Standardized
      • can easily substitute components for repair, upgrade, etc.
    • Scalable
      • accommodates steady or increasing demand for product
Principles
  • Goal
    • Help scientists become information providers in a federated data system
  • Prime Directive
    • Minimal disruption of a working scientist’s computational environment
  • Ultimate product
    • Software, system architecture, and procedures for turning science projects into a federation of providers
ESSW: Our Earth System Science Workbench

Producer and consumer issues can both be addressed by a laboratory metaphor

  • Experiment
    • Network of models
    • … ingesting / synthesizing data
    • … generating products
  • Laboratory
    • Experiment execution environment
      • Computing + storage = accessibility + scalability
  • Lab Notebook
    • Persistent storage that can be queried
    • Keeps track of all experiments
      • Documentation + lineage = accountability
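
The lab notebook idea (persistent, queryable storage that tracks every experiment and its lineage) can be sketched with a small relational store. This is a minimal illustration only: the table layout, function names, and file names are hypothetical and are not the ESSW schema, and SQLite stands in for whatever database a real deployment would use.

```python
import sqlite3


def open_notebook(path=":memory:"):
    """Open (or create) a notebook database with two hypothetical tables."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS experiment ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, note TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lineage ("
        "experiment_id INTEGER REFERENCES experiment(id), "
        "role TEXT CHECK (role IN ('input', 'output')), "
        "item TEXT NOT NULL)"
    )
    return conn


def record_experiment(conn, name, inputs, outputs, note=""):
    """Store one experiment run together with what it read and wrote."""
    cur = conn.execute(
        "INSERT INTO experiment (name, note) VALUES (?, ?)", (name, note)
    )
    exp_id = cur.lastrowid
    rows = [(exp_id, "input", item) for item in inputs]
    rows += [(exp_id, "output", item) for item in outputs]
    conn.executemany("INSERT INTO lineage VALUES (?, ?, ?)", rows)
    conn.commit()
    return exp_id


def producers_of(conn, item):
    """Accountability query: which experiments produced this item?"""
    rows = conn.execute(
        "SELECT e.name FROM experiment e "
        "JOIN lineage l ON l.experiment_id = e.id "
        "WHERE l.role = 'output' AND l.item = ? ORDER BY e.id",
        (item,),
    )
    return [row[0] for row in rows]


notebook = open_notebook()
record_experiment(notebook, "ingest", ["raw.dat"], ["l1b.dat"], note="first pass")
print(producers_of(notebook, "l1b.dat"))  # ['ingest']
```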
Use existing science applications
  • No “standard” Earth science computing environment
    • commercial packages (ArcGIS, ENVI, MATLAB, …)
    • public packages/models (MM5, MODTRAN, …)
    • locally-developed codes
  • Example: snow cover from AVHRR, using commercial + standalone programs
    • parameters highly customized for UCSB
  • How do we get these programs to
    • communicate
    • cooperate

with the Earth System Science Workbench (ESSW), without rewriting?

[Diagram: AVHRR snow-cover processing chain: Receive → Ingest and Calibrate → Navigate (Manual/Automatic) → Snow-Covered Area → Rectify → Snow Maps]

Wrap Your App: Scripts talk to ESSW

[Diagram labels: XML + SQL, Perl API, ESSW daemon]

  • No changes, just additions
    • Wrapper scripts
      • Make program (groups) look like ESSW experiments
    • ESSW daemon
      • Converts wrapper output to database input
    • ESSW database
      • Stores converted wrapper output
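
The wrapping idea above can be sketched as a script that runs an unmodified program and emits an XML record of what it did, which a daemon could later convert into database input. The slides describe Perl wrappers; Python is used here purely for illustration, and the function name, XML element names, and file names are all hypothetical rather than the actual ESSW format.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def run_wrapped(experiment, command, inputs, outputs):
    """Run an unmodified program and return (exit_code, xml_report)."""
    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(command)  # the science program itself is untouched
    report = ET.Element(
        "experiment",
        name=experiment,
        exit=str(result.returncode),
        started=started,
    )
    ET.SubElement(report, "command").text = " ".join(command)
    for path in inputs:
        ET.SubElement(report, "input", path=path)
    for path in outputs:
        ET.SubElement(report, "output", path=path)
    return result.returncode, ET.tostring(report, encoding="unicode")


# Stand-in for a real science program: a Python process that does nothing.
exit_code, xml_report = run_wrapped(
    "avhrr_ingest",
    [sys.executable, "-c", "pass"],
    inputs=["pass.raw"],
    outputs=["pass.l1b"],
)
print(xml_report)
```

The wrapper declares its inputs and outputs explicitly, so the report is only as accurate as the script that writes it.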

[Diagram: the same processing chain (Receive, Ingest and Calibrate, Navigate (Manual/Automatic), Snow-Covered Area, Rectify, Snow Maps) shown alongside the ESSW Database. Technology labels: MySQL, Java, JDBC, Perl]

Detailed example

[Diagram nodes, identifier: description]
  • avhrr_L0: AVHRR Level 0 product
  • avhrr_ingest: AVHRR telemetry ingest
  • avhrr_l1b: AVHRR Level 1B product
  • Hand navigation details
  • avhrr_handNav: Hand navigation procedure
  • avhrr_navd_l1b: AVHRR Level 1B: navigated
  • avhrr_snowModel: Multi-channel snow-covered area algorithm
  • avhrr_sca: Snow-covered area
  • avhrr_copyNav: Copy navigated image
  • avhrr_navd_sca: SCA: navigated
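
One way to read the detailed example is as a lineage graph in which each process consumes and produces named products. The sketch below encodes that reading and answers the question "what did this product depend on?". The identifiers come from the slide, but the wiring between them is an assumption inferred from the names, because the arrows of the original diagram did not survive in the transcript.

```python
from collections import deque

# process -> (inputs, outputs). ASSUMED wiring, inferred from the names only.
PROCESSES = {
    "avhrr_ingest": (["avhrr_L0"], ["avhrr_l1b"]),
    "avhrr_handNav": (["avhrr_l1b"], ["avhrr_navd_l1b"]),
    "avhrr_snowModel": (["avhrr_l1b"], ["avhrr_sca"]),
    "avhrr_copyNav": (["avhrr_navd_l1b", "avhrr_sca"], ["avhrr_navd_sca"]),
}


def ancestors(product):
    """Return every product that the given product was derived from."""
    produced_by = {
        out: inputs for inputs, outputs in PROCESSES.values() for out in outputs
    }
    seen = set()
    queue = deque([product])
    while queue:
        # Walk backwards from each product to the inputs of its producer.
        for parent in produced_by.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(ancestors("avhrr_navd_sca")))
```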

ESSW Lessons
  • Providers are customers
    • Federations aren’t much good unless scientists are happy to put information in them
  • A light touch is the right touch
    • Wrapping is easier for scientists and their programmers to deal with than complete re-engineering
  • Scientists do write scripts, but not necessarily Perl
    • Scripting (gluing stuff together) comes naturally to scientists
  • Scientists don’t write DTDs
  • Nobody calls metadata APIs

ESSW was automatic, but not automatic enough…

ES3: Earth System Science Server

[Diagram labels: data lineage tracking; MODster; OpenDAP; Watershed-scale snow product; MODIS; Microsoft TerraServer; AVHRR; Global-scale snow product; Alexandria Digital Library; Corona; BUB data storage; ROCKS processing clusters]

From ESSW to ES3: Summary
  • Perl wrappers → Probulators
  • Perl API → web services + RDF messages
  • SQL → XML database(s)
From wrappers to probulators

Wrappers: active lineage

  • Good
    • Complete control over what gets recorded
    • Single language/API for all wrapped events
    • Not tied to execution
      • You can even lie about what happened
  • Bad
    • Must explicitly script everything
    • Scripts can drift from reality
      • You can even lie about what happened
From wrappers to probulators

Probulators: passive lineage

  • Good
    • Record what actually happened
      • Not just what you think happened
      • Not what didn’t happen
    • Automatic: don’t have to write new scripts for everything
  • Bad
    • Different flavors for different environments
      • Can’t just do everything in Perl…
Probulator flavors
  • Instrumentation
    • Insert lineage capture instructions directly into science codes
      • e.g. “I just created file ‘foo’”
    • Typical implementation: preprocessor/precompiler
  • Overriding
    • Replace standard routines/libraries with lineage-capturing versions
      • e.g. open(…) → snoopy_open(…)
    • Typical implementation: modify execution environment
      • environment variables
      • configuration files
  • Passive monitoring
    • Trace program execution
      • e.g. “called open() with args foo, bar, …”
    • Typical implementation: strace’d shell
ES3 lineage architecture

[Diagram components: probulator 1 … probulator n, logger, logfiles, transmitter, ES3 core]

Now What?
  • Probulator reports not universally unique
    • Q: How to hook separate reports together?
    • A: Logger assigns UUIDs to
      • Data streams
      • Processes
      • Jobs (workflows)
  • Lineage not explicit
    • Q: How to publish lineage?
    • A: ES3 Core builds serialized graph
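
The two answers above can be sketched together: a logger gives each locally named data stream and process a UUID, so reports from separate probulators can be joined, and a core component then serializes the joined reports as a graph. Everything here is a hypothetical illustration. The class and function names are invented, and JSON is used only for convenience; the slides do not say which serialization ES3 uses.

```python
import json
import uuid


class Logger:
    """Assigns one stable UUID per (kind, local name) pair."""

    def __init__(self):
        self._ids = {}

    def uuid_for(self, kind, local_name):
        key = (kind, local_name)
        if key not in self._ids:
            self._ids[key] = str(uuid.uuid4())
        return self._ids[key]


def build_graph(logger, reports):
    """Serialize (process, inputs, outputs) reports as a lineage graph."""
    nodes, edges = {}, []
    for process, inputs, outputs in reports:
        pid = logger.uuid_for("process", process)
        nodes[pid] = {"kind": "process", "name": process}
        for name in inputs:
            sid = logger.uuid_for("stream", name)
            nodes[sid] = {"kind": "stream", "name": name}
            edges.append({"from": sid, "to": pid})   # stream feeds process
        for name in outputs:
            sid = logger.uuid_for("stream", name)
            nodes[sid] = {"kind": "stream", "name": name}
            edges.append({"from": pid, "to": sid})   # process writes stream
    return json.dumps({"nodes": nodes, "edges": edges}, sort_keys=True)


logger = Logger()
reports = [("p1", ["a"], ["b"]), ("p2", ["b"], ["c"])]
graph = json.loads(build_graph(logger, reports))
print(len(graph["nodes"]), len(graph["edges"]))  # 5 4
```

Because stream "b" receives the same UUID in both reports, the output of p1 and the input of p2 become a single node, which is what links the two reports into one lineage.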
Products available from http://www.snow.ucsb.edu (forthcoming)
  • Fractional snow-covered area, grain size (and contaminants) from daily MODIS images
    • Quality flags for cloud cover, highly oblique viewing
    • Fractional coverage of other endmembers
  • Best estimate of snow-covered area and broadband albedo on that date
    • Extrapolating from previous values to that date and smoothing
  • End-of-season reanalysis of daily snow-covered area and broadband albedo
    • Interpolation, smoothing, comparison with in situ snow pillow data