Introduction to Scientific Workflows and the KEPLER System

Instructors:

Bertram Ludaescher

Ilkay Altintas


Overview

  • 10:30-11:15 Introduction to Scientific Workflows

  • 11:15-12:00 Scientific Workflows in KEPLER live demo, brains-on session

  • … but first, one more time … (déjà déjà vu)


Information Integration Challenges: S4 Heterogeneities

  • Systems Integration

    • platforms, devices, data & service distribution, APIs, protocols, …

       Grid middleware technologies

      + e.g. single sign-on, platform independence, transparent use of remote resources, …

  • Syntax & Structure

    • heterogeneous data formats (one for each tool ...)

    • heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files, …)

    • heterogeneous schemas (one for each DB ...)

       Database mediation technologies

      + XML-based data exchange, integrated views, transparent query rewriting, …

  • Semantics

    • fuzzy metadata, terminology, “hidden” semantics, implicit assumptions, …

       Knowledge representation & semantic mediation technologies

      + “smart” data discovery & integration

      + e.g. ask about X (‘mafic’); find data about Y (‘diorite’); be happy anyways!


Information Integration Challenges: S5 Heterogeneities

  • Synthesis of applications, analysis tools, data & query components, … into “scientific workflows”

    • How to make use of these wonderful things & put them together to solve a scientist’s problem?

  • Scientific Problem Solving Environments (PSEs)

    • GEON Portal and Workbench (“scientist’s view”)

      + ontology-enhanced data registration, discovery, manipulation

      + creation and registration of new data products from existing ones, …

    • GEON Scientific Workflow System (“engineer’s view”)

      + for designing, re-engineering, deploying analysis pipelines and scientific workflows; a tool to make new tools …

      + e.g., creation of new datasets from existing ones, dataset registration,…


What is a Scientific Workflow (SWF)?

  • Goals:

    • automate a scientist’s repetitive data management and analysis tasks

    • typical phases:

      • data access, scheduling, generation, transformation, aggregation, analysis, visualization

         design, test, share, deploy, execute, reuse, … SWFs

  • Typical requirements/characteristics:

    • data-intensive and/or compute-intensive

    • plumbing-intensive

    • dataflow-oriented

    • distributed (data, processing)

    • user-interaction “in the middle”, …

    • … vs. (C-z; bg; fg)-ing (“detach” and reconnect)

    • advanced programming constructs (map(f), zip, takewhile, …)

    • logging, provenance, “registering back” (intermediate) products…

  • … easy to recognize a SWF when you see one!


Promoter Identification Workflow

Source: Matt Coleman (LLNL)


Source: NIH BIRN (Jeffrey Grethe, UCSD)


Ecology: GARP Analysis Pipeline for Invasive Species Prediction

[Figure: pipeline diagram. Data products, each for the native range and the invasion area: (a) species presence & absence points, (b) environmental layers, (c) integrated layers, (f) prediction maps. Further products: (d) training sample and test sample, (e) GARP rule set, (g) model quality parameter, (h) selected prediction maps. Processing steps: EcoGrid Query, Layer Integration, Data Calculation, Map Generation, Validation, Generate Metadata, Archive to EcoGrid. Other labels: Registered EcoGrid Database, User, Sample Data, +A1, +A2, +A3.]

Source: NSF SEEK (Deana Pennington et al., UNM)


Digression: (Business) Workflows and Systems

or: what you need to know when someone wants to sell you one ;-)

or: the remote relatives (2nd-3rd cousins?) of scientific workflows


What is a (Business) Workflow?

  • Workflow management (also called Business Process Management) is the coordination of work processes through software.

  • A workflow management system routes pending activities to process participants according to a model of the process.

  • WF management systems have been around since the late 1970s (e.g. Officetalk, Xerox PARC)

    • marketing waves: Office Automation (70’s-80’s), Business Process Reengineering (90’s), Web Services Choreography (00’s)

    • roots/related: document management apps, email system apps, database apps (active DBMS’s, federated DBMS’s)

    • Meanwhile (’69-’71) elsewhere: Flow-based programming (J. Paul Morrison)

    • … not quite workflow but rather dataflow … (we’ll come to that…)

Src/cf: http://www.workflow-research.de/index.htm, M.z. Muehlen, 2003


Some History

Commercial Workflow Systems

Source: http://www.workflow-research.de/index.htm, M.z. Muehlen, 2003



Play Time @ Petri Nets World

  • Petri Nets are the underlying abstract model of many B-WfMS’s (who said I can’t do bad acronyms, too? ;-)

  • http://www.daimi.au.dk/PetriNets/

  • http://www.daimi.au.dk/PetriNets/introductions/aalst/

  • Let’s see the basic ideas first …


Formal Basis: Petri Nets

  • Mathematical model of discrete distributed systems (named after Carl Adam Petri, 1960’s)

  • Provides a modeling language w/ rich theory, analysis tools, …

  • A Petri net consists of places (P), transitions (T) and directed arcs (P→T or T→P). Places can hold tokens.

  • A transition is enabled if each of its input places contains at least one token.

  • An enabled transition can fire, removing input tokens and producing output tokens
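
The enabling and firing rules above can be sketched in a few lines of Python. This is a minimal illustration and not part of the original slides: the `PetriNet` class, its method names, and the example net are assumptions, and every arc is taken to have weight 1.

```python
from collections import Counter


class PetriNet:
    """Minimal place/transition net; all arcs have weight 1 (an assumption)."""

    def __init__(self, marking):
        # marking: place name -> number of tokens currently held
        self.marking = Counter(marking)
        self.transitions = {}

    def add_transition(self, name, inputs, outputs):
        # inputs/outputs: places connected by P->T and T->P arcs
        self.transitions[name] = (tuple(inputs), tuple(outputs))

    def enabled(self, name):
        # Enabled if each input place contains at least one token
        inputs, _ = self.transitions[name]
        return all(self.marking[p] >= 1 for p in inputs)

    def fire(self, name):
        # Firing removes one token per input place, adds one per output place
        if not self.enabled(name):
            raise ValueError(f"transition {name} is not enabled")
        inputs, outputs = self.transitions[name]
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] += 1


# Hypothetical example net: T1 needs tokens in P1 and P2, T2 needs one in P3
net = PetriNet({"P1": 1, "P2": 1, "P3": 0, "P4": 0})
net.add_transition("T1", inputs=["P1", "P2"], outputs=["P3"])
net.add_transition("T2", inputs=["P3"], outputs=["P4"])
```

In this example net, T1 starts out enabled and T2 does not; firing T1 moves the token count so that T2 becomes enabled in turn.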

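[Figure: example Petri net with places P1, P2, P3, P4 and transitions T1, T2; one transition is marked “enabled”, the other “not enabled”.]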



Why Petri Nets

  • Modeling and designing concurrent systems w/ competing resources (dining philosophers), …

  • Lots of analysis techniques, tools, theory

    • boundedness (state space),

    • liveness (good things do happen),

    • safety (bad things do not happen),

    • reversibility,

    • deadlock(-freeness),

    • reachability (of certain states),


In a Flux: WS-XX-“Standards”

Source: W.M.P. van der Aalst et al. http://tmitwww.tm.tue.nl/research/patterns/

http://tmitwww.tm.tue.nl/staff/wvdaalst/Publications/publications.html


Everything Flows? But what exactly?

  • Dataflow

    • Data flows through operations (zoom into your CPU…)

    • Activity diagrams: data flows through actions

    • Process networks: data flows between processes

  • Control-flow

    • Nodes are control-flow operations that start other operations on a state

  • Mixed approaches

    • Statecharts: events trigger state transitions

    • Petri nets: tokens mark control and dataflow

    • Workflow languages: mix control and dataflow

    • … many others …


Scientific “Workflows” vs Business Workflows

  • Business Workflows (BPEL4WS* …)

    • Task-orientation: travel reservations; credit approval; BPM; …

    • Tasks, documents, etc. undergo modifications (e.g., flight reservation from reserved to ticketed), but modified WF objects still identifiable throughout

    • Complex control flow, complex process composition (danger of control flow/dataflow “spaghetti”)

       Dataflow and control-flow are often divorced!

  • Scientific “Workflows”

    • Dataflow and data transformations

    • Data problems: volume, complexity, heterogeneity

    • Grid-aspects

      • Distributed computation

      • Distributed data

    • User-interactions/WF steering

    • Data, tool, and analysis integration

       Dataflow and control-flow are often married! (can be a happy marriage… at times…)

*Business Process Execution Language for Web Services (in case you wondered)


Scientific “Workflows”: Some Findings

  • More dataflow than (business control-/) workflow

    • DiscoveryNet, Kepler, SCIRun, Scitegic, Triana, Taverna, …,

  • Need for “programming extensions”

    • Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …)

  • Need for abstraction and nested workflows

  • Need for data transformations (WS1 → DT → WS2)

  • Need for rich user interaction & workflow steering:

    • pause / revise / resume

    • select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF

  • Need for high-throughput data transfers and CPU cycles: “(Data-)Grid-enabling”, “streaming”

  • Need for persistence of intermediate products and provenance
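
The “programming extensions” named in the findings above (iteration over lists, filtering, zip, map(f)) correspond to standard list operations. The following Python sketch only illustrates what each construct does to a stream of tokens; the sample values and names are hypothetical and not taken from any of the systems listed.

```python
from itertools import takewhile

readings = [3.1, 2.7, 9.9, 1.2]   # hypothetical sensor values
labels = ["a", "b", "c", "d"]     # hypothetical sample labels

# zip: pair up two token streams element by element
pairs = list(zip(labels, readings))

# filter: keep only the tokens that satisfy a predicate
small = [r for r in readings if r < 5.0]

# takewhile: consume a stream until the predicate first fails
prefix = list(takewhile(lambda r: r < 5.0, readings))

# map(f): apply the same step to every token (foreach-style iteration)
rounded = list(map(round, readings))
```

Note the difference between filtering and takewhile: the filter keeps the last reading, while takewhile stops at the first value that fails the test.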


Perspectives on Systems

/ Dataflow View

Source: Workflow-based Process Controlling, Michael zur Muehlen, 2003


A Dataflow Component (“Actor”)

[Figure: an “actor”/component with parameters ($1, $2, …) and ports connecting it to input channels and output channels.]


Actor-Oriented Design

  • Object orientation:

    • What flows through an object is sequential control (cf. CCA, MPI)

    • [Figure: object box with class name, data, methods; arrows labeled call and return]

  • Actor/Dataflow orientation:

    • What flows through an object is a stream of data tokens (in SWFs/KEPLER also references!!)

    • [Figure: actor box with actor name, data (state), parameters, ports; arrows labeled input data and output data]

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/


Object-Oriented vs. Actor-Oriented Interfaces

  • Object Oriented:

    • TextToSpeech

      • initialize(): void

      • notify(): void

      • isReady(): boolean

      • getSpeech(): double[]

    • OO interface gives procedures that have to be invoked in an order not specified as part of the interface definition.

  • Actor/Dataflow Oriented:

    • AO interface definition says “Give me text and I’ll give you speech”

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
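
The contrast between the two interface styles can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not Ptolemy II code: `synthesize` is a stand-in that returns one fake sample per character, and `notify` takes the text as an argument here, whereas the slide lists `notify(): void`.

```python
from typing import Iterable, Iterator, List


def synthesize(text: str) -> List[float]:
    # Stand-in for real synthesis: one fake "sample" per character
    return [float(ord(ch)) for ch in text]


class TextToSpeechObject:
    """Object-oriented style: the required call order is not in the interface."""

    def __init__(self) -> None:
        self._ready = False
        self._speech: List[float] = []

    def initialize(self) -> None:
        self._ready = False
        self._speech = []

    def notify(self, text: str) -> None:
        # Assumption of this sketch: notify() delivers the text to speak
        self._speech = synthesize(text)
        self._ready = True

    def is_ready(self) -> bool:
        return self._ready

    def get_speech(self) -> List[float]:
        # The caller has to know that notify() must come first
        if not self._ready:
            raise RuntimeError("get_speech() called before notify()")
        return self._speech


def text_to_speech_actor(text_in: Iterable[str]) -> Iterator[List[float]]:
    """Actor-oriented style: one input port (text), one output port (speech)."""
    for text in text_in:
        yield synthesize(text)
```

With the object, a caller who skips `notify` gets a runtime error that the interface itself never warned about. With the actor, the only thing to get right is the connection: text tokens go in, speech tokens come out.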


Ptolemy II

see!

read!

try!

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/


History

  • Ptolemy II: a laboratory for investigating design

  • KEPLER: a problem-solving environment for Scientific Workflows

  • KEPLER = “Ptolemy II + X” for Scientific Workflows

  • Gabriel (1986-1991)

    • Written in Lisp

    • Aimed at signal processing

    • Synchronous dataflow (SDF) block diagrams

    • Parallel schedulers

    • Code generators for DSPs

    • Hardware/software co-simulators

  • Ptolemy Classic (1990-1997)

    • Written in C++

    • Multiple models of computation

    • Hierarchical heterogeneity

    • Dataflow variants: BDF, DDF, PN

    • C/VHDL/DSP code generators

    • Optimizing SDF schedulers

    • Higher-order components

  • Ptolemy II (1996-2022)

    • Written in Java

    • Domain polymorphism

    • Multithreaded

    • Network integrated

    • Modal models

    • Sophisticated type system

    • CT, HDF, CI, GR, etc.

  • PtPlot (1997-??)

    • Java plotting package

  • Tycho (1996-1998)

    • Itcl/Tk GUI framework

  • Diva (1998-2000)

    • Java GUI framework

  • Copernicus (code generator)

  • KEPLER (2003-2028)

    • scientific workflow extensions

Source (Ptolemy): Edward Lee et al. http://ptolemy.eecs.berkeley.edu/


An “early” example: Promoter Identification SSDBM, AD 2003

  • Scientist models application as a “workflow” of connected components (“actors”)

  • If all components exist, the workflow can be automated/ executed

  • Different directors can be used to pick appropriate execution model (often “pipelined” execution: PN director)
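
The “pipelined” execution mentioned in the last bullet can be sketched as a process network in which every actor runs in its own thread and reads from a blocking FIFO channel. This is a minimal Python illustration, not the Kepler or Ptolemy II PN director; the function names and the `DONE` end-of-stream marker are assumptions of the sketch.

```python
import queue
import threading

DONE = object()  # end-of-stream marker (a convention of this sketch)


def source(items, out_q):
    for item in items:
        out_q.put(item)
    out_q.put(DONE)


def transform(func, in_q, out_q):
    # in_q.get() blocks while the input channel is empty
    while True:
        token = in_q.get()
        if token is DONE:
            out_q.put(DONE)
            return
        out_q.put(func(token))


def sink(in_q, results):
    while True:
        token = in_q.get()
        if token is DONE:
            return
        results.append(token)


def run_pipeline(items, funcs):
    """Run source -> funcs[0] -> funcs[1] -> ... -> sink, one thread per actor."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    results = []
    threads = [threading.Thread(target=source, args=(items, queues[0]))]
    for i, func in enumerate(funcs):
        threads.append(
            threading.Thread(target=transform, args=(func, queues[i], queues[i + 1]))
        )
    threads.append(threading.Thread(target=sink, args=(queues[-1], results)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each stage starts working as soon as its first token arrives, later stages overlap with earlier ones instead of waiting for a complete intermediate result.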


Why Ptolemy II (and thus KEPLER)?

  • Ptolemy II Objective:

    • “The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.”

  • Dataflow Process Networks w/ natural support for abstraction, pipelining (streaming), actor-orientation, actor reuse

  • User-Orientation

    • Workflow design & exec console (Vergil GUI)

    • “Application/Glue-Ware”

      • excellent modeling and design support

      • run-time support, monitoring, …

      • not a middle-/underware (we use someone else’s, e.g. Globus, SRB, …)

      • but middle-/underware is conveniently accessible through actors!

  • PRAGMATICS

    • Ptolemy II is mature, continuously extended & improved, well-documented (500+pp)

    • open source system

    • Ptolemy II folks actively participate in KEPLER


The KEPLER/Ptolemy II GUI (Vergil)

“Directors” define the component interaction & execution semantics

Large, polymorphic component (“Actor”) and Director libraries (drag & drop)


Ptolemy II: Actor-Oriented Modeling

  • Component (“actor”) interaction semantics not hard-wired inside components, but “factored out” in a “director”

  • Different directors for different modeling and execution needs (… can even be combined!)

  • Better abstraction, modeling, component reuse, …
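
The idea of factoring the interaction semantics out of the components can be sketched as follows. The actors below only say what to do with one token; two interchangeable directors decide when each actor runs. All class names are hypothetical and this is not the Ptolemy II Director API, only a small Python illustration.

```python
class Actor:
    """An actor says what to do with one input token, not when it runs."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def fire(self, token):
        return self.func(token)


class StageByStageDirector:
    """Runs each actor over all tokens before moving on to the next actor."""

    def run(self, actors, tokens):
        trace = []
        for actor in actors:
            next_tokens = []
            for token in tokens:
                trace.append(actor.name)
                next_tokens.append(actor.fire(token))
            tokens = next_tokens
        return list(tokens), trace


class TokenByTokenDirector:
    """Pushes each token through the whole chain before reading the next one."""

    def run(self, actors, tokens):
        trace, results = [], []
        for token in tokens:
            for actor in actors:
                trace.append(actor.name)
                token = actor.fire(token)
            results.append(token)
        return results, trace


# The same chain of actors can be handed to either director unchanged
chain = [Actor("scale", lambda x: x * 2), Actor("shift", lambda x: x + 1)]
```

Both directors produce the same results for this chain; only the firing order recorded in the trace differs, which is the part a director is responsible for.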


Behavioral Polymorphism in Ptolemy

  • Behavioral polymorphism is the idea that components can be defined to operate with multiple models of computation and multiple middleware frameworks.

  • These polymorphic methods implement the communication semantics of a domain in Ptolemy II. The receiver instance used in communication is supplied by the director, not by the component. (cf. CCA, WS-??, [G]BPL4??, … !)

[Figure: a producer actor and a consumer actor, with labels IOPort, Receiver, and Director.]

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/


Domains and Directors: Semantics for Component Interaction

  • CI – Push/pull component interaction

  • CSP – concurrent threads with rendezvous

  • CT – continuous-time modeling

  • DE – discrete-event systems

  • DDE – distributed discrete events

  • FSM – finite state machines

  • DT – discrete time (cycle driven)

  • Giotto – synchronous periodic

  • GR – 2-D and 3-D graphics

  • PN – process networks

  • SDF – synchronous dataflow

  • SR – synchronous/reactive

  • TM – timed multitasking

For (finer-grained) concurrent jobs!?

For (coarse grained) Scientific Workflows!

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/


Polymorphic Actor Components Working Across Data Types and Domains

  • Actor Data Polymorphism:

    • Add numbers (int, float, double, Complex)

    • Add strings (concatenation)

    • Add complex types (arrays, records, matrices)

    • Add user-defined types

  • Actor Behavioral Polymorphism:

    • In dataflow, add when all connected inputs have data

    • In a time-triggered model, add when the clock ticks

    • In discrete-event, add when any connected input has data, and add in zero time

    • In process networks, execute an infinite loop in a thread that blocks when reading empty inputs

    • In CSP, execute an infinite loop that performs rendezvous on input or output

    • In push/pull, ports are push or pull (declared or inferred) and behave accordingly

    • In real-time CORBA, priorities are associated with ports and a dispatcher determines when to add

By not choosing among these when defining the component, we get a huge increment in component re-usability. But how do we ensure that the component will work in all these circumstances?

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
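
Both kinds of polymorphism listed above can be illustrated with a small “Add” actor in Python. This is a sketch under stated assumptions rather than Ptolemy II code: input ports are modeled as plain lists of queued tokens, and the firing rule is passed in from outside, the way a director would supply it.

```python
from functools import reduce
import operator


def add_tokens(tokens):
    """Data polymorphism: one definition for numbers, strings, lists, ..."""
    return reduce(operator.add, tokens)


def dataflow_ready(ports):
    # Dataflow rule: fire only when all connected inputs have data
    return all(len(p) > 0 for p in ports)


def discrete_event_ready(ports):
    # Discrete-event rule: fire when any connected input has data
    return any(len(p) > 0 for p in ports)


def fire_add(ports, ready):
    """Fire the Add actor once if the supplied rule allows it, else return None."""
    if not ready(ports):
        return None
    # Consume one token from every port that currently holds data
    tokens = [p.pop(0) for p in ports if p]
    return add_tokens(tokens)
```

The actor body (`add_tokens`) never changes; swapping `dataflow_ready` for `discrete_event_ready` changes when it fires. With one empty input port, the dataflow rule refuses to fire while the discrete-event rule adds whatever is present.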


Directors and Combining Different Component Interaction Semantics

  • Possible app. in SWF:

  • time-series aware …

  • parameter-sweep aware …

  • MPI aware

  • XYZ aware …

  • … execution models

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/


Component Composition & Interaction

  • Components linked via ports

  • Dataflow (and msg/ctl-flow)

  • Where is the component interaction semantics defined??

    • each component is its own director!

  • But still useful for special applications, e.g. parallel programs (MPI, …)

[Figure: linked components labeled DIR1, DIR2, DIR3, DIR4, ???]

Source: GRIST/SC4DEVO workshop, July 2004, Caltech


CCA!?

CCA via special (“look the other way”) Director(s)?

  • Dataflow in CCA

    • a CCA “convention” can be used to accommodate actor-oriented/dataflow modeling

  • CCA/Message Passing in KEPLER

    • Kepler/Ptolemy can be extended to accommodate message passing semantics (CSP is already in Ptolemy II)


Data/Control-Flow Spectrum

  • Data (tokens) flow

    • (almost) no other side effects

    • WYSIWYG (usually)

  • References flow

    • token reference type may be “http-get”, “ftp-get”, “hsi put”…

    • generic handling still possible

  • Application specific tokens flow

    • e.g. current Nimrod job management in Resurgence

    • “invisible contract” between components

    • Director is unaware of what’s going on … (sounds familiar? ;-)

  • Specific messages passing protocols (e.g., CSP, MPI)

    • for systems of tightly coupled components

[Figure: spectrum around an “actor”, labeled “clean” data(=ctl)-flow, special tokens flow, and message passing, control flow.]


KEPLER/CSP: Contributors, Sponsors, Projects (or loosely coupled Communicating Sequential Persons ;-)

  • Ilkay Altintas (SDM, Resurgence)

  • Kim Baldridge (Resurgence, NMI)

  • Chad Berkley (SEEK)

  • Shawn Bowers (SEEK)

  • Terence Critchlow (SDM)

  • Tobin Fricke (ROADNet)

  • Jeffrey Grethe (BIRN)

  • Christopher H. Brooks (Ptolemy II)

  • Zhengang Cheng (SDM)

  • Dan Higgins (SEEK)

  • Efrat Jaeger (GEON)

  • Matt Jones (SEEK)

  • Werner Krebs (EOL)

  • Edward A. Lee (Ptolemy II)

  • Kai Lin (GEON)

  • Bertram Ludaescher (SEEK, GEON, SDM, BIRN, ROADNet)

  • Mark Miller (EOL)

  • Steve Mock (NMI)

  • Steve Neuendorffer (Ptolemy II)

  • Jing Tao (SEEK)

  • Mladen Vouk (SDM)

  • Xiaowen Xin (SDM)

  • Yang Zhao (Ptolemy II)

  • Bing Zhu (SEEK)

  • …


KEPLER: An Open Collaboration

  • Initiated by members from NSF SEEK and DOE SDM/SPA; now several other projects

  • Open Source (BSD-style license)

  • Intensive Communications:

    • Web-archived mailing lists

    • IRC (!)

  • Co-development:

    • via shared CVS repository

    • joining as a new co-developer (currently):

      • get a CVS account (read-only)

      • local development + contribution via existing KEPLER member

      • be voted “in” as a member/co-developer

  • Software & social engineering

    • How to better accommodate new groups/communities?

    • How to better accommodate different usage/contribution models (core dev … special purpose extender … user)?


GEON Dataset Generation & Registration (a co-development in KEPLER)

[Figure: workflow screenshot annotated with its contributors: SQL database access (JDBC); Matt, Chad, Dan et al. (SEEK); Efrat (GEON); Ilkay (SDM); Yang (Ptolemy); Xiaowen (SDM); Edward et al. (Ptolemy). Further labels: “% Makefile”, “$> ant run”.]


KEPLER then …


… and KEPLER today…

[Figure: speech bubbles. “… so, you see, scientific workflows need domain and data-polymorphic actors & must scale to HPC!” / “What’s a polymorphic actor?” / “What’s a scientific workflow?” / “What is HPC?”]

BTW: Kepler is NOT a GUI (Vergil is)


KEPLER Pedigree (to be determined…)

  • Graphical dataflow environments

  • Problem solving environments

  • Grid workflows

[Figure: pedigree diagram with Khoros, openDX, SCIRun, DiscoveryNet, AVS, Taverna, Gabriel, Ptolemy, Ptolemy II, KEPLER, Triana, Pegasus, Matrix.]


A Few Specific Kepler Features


Web Services Actors (WS Harvester)

[Figure: screenshot with callouts numbered 1 to 4.]

  • “Minute-made” (MM) WS-based application integration

  • Similarly: MM workflow design & sharing w/o implemented components


Recent Actor Additions


Digression: Who are the clients?

  • Domain scientists

    • C/Perl/Python/Java/WS/DB-enabled ones

    • others (e.g. visually-inclined rest of us?)

  • Goal: make the life better for both!

    • Workflow automation

    • Plumbing support

    • Execution monitoring, steering, runtime revision (pause-inspect-modify-resume cycle)


For the Geoscientist: GEON Mineral Classification Workflow


… inside the Classifier

BrowserUI actor w/ SVG client display


in KEPLER (interactive session)

Source: Dan Higgins, Kepler/SEEK


in KEPLER (w/ editable script)

Source: Dan Higgins, Kepler/SEEK


A Closer Look at Dataflow … (or: Do you know what’s going on under your carpet?)

  • Dataflow: what you see is what you get (almost…)

  • Need for a general way to handle references!

control tokens flow, e.g., from “$”-actor to FileReader and ImageReader actors

actual dataflow is “under the carpet” and through handles (file system, GridFTP, scp, SRB, …)
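
One possible shape for a “general way to handle references” is a registry that maps the scheme of a reference token to the code that fetches the data behind it. The Python sketch below is only an illustration of that idea: the `RESOLVERS` registry and the function names are hypothetical, and only local `file` references are implemented. Handlers for GridFTP, scp, SRB and the like would be registered the same way, but are not shown.

```python
import os
import tempfile
from urllib.parse import urlparse

# Registry: URI scheme -> function that fetches the bytes behind a reference
RESOLVERS = {}


def resolver(scheme):
    """Decorator that registers a fetch function for one URI scheme."""
    def register(func):
        RESOLVERS[scheme] = func
        return func
    return register


@resolver("file")
def read_local_file(parsed):
    with open(parsed.path, "rb") as handle:
        return handle.read()


def dereference(reference):
    """Turn a reference token (a URI string) into the data it points to."""
    parsed = urlparse(reference)
    if parsed.scheme not in RESOLVERS:
        raise ValueError(f"no resolver for scheme {parsed.scheme!r}")
    return RESOLVERS[parsed.scheme](parsed)


# Usage: write a small file, then pass only its reference between actors
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"ACGT")
token = "file://" + tmp.name
data = dereference(token)
os.remove(tmp.name)
```

Only the short reference string would flow through the workflow graph; the actual bytes move “under the carpet” when an actor dereferences the token.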


GEON Data Registration UI


GEON Data Registration in KEPLER


Registered Resources show up in Vergil (joint SEEK, SPA, GEON, … Registry!?)


Data Analysis: Biodiversity Indices


Traffic info for a list of highways: Uses iterate (higher-order “map”) actor to access highway info web service repeatedly, sending out one email per highway.



Re-engineered PIW w/ Iteration Constructs AD 2004

map(GenbankWS)

Input: {“NM_001924”, “NM020375”}

Output: {“CAGT…AATATGAC”, “GGGGA…CAAAGA”}
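
The map(GenbankWS) construct lifts an actor that handles a single accession number into one that handles a whole list. A minimal Python sketch of that lifting follows. `map_actor` and `genbank_ws_stub` are hypothetical names, and the stub returns a placeholder string instead of calling the real GenBank web service.

```python
from typing import Callable, Iterable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def map_actor(actor: Callable[[A], B]) -> Callable[[Iterable[A]], List[B]]:
    """Lift an actor that handles one token into one that handles a list."""
    def mapped(tokens: Iterable[A]) -> List[B]:
        return [actor(token) for token in tokens]
    return mapped


def genbank_ws_stub(accession: str) -> str:
    # Placeholder only: a real actor would call the GenBank web service here
    return f"<sequence for {accession}>"


# Accession numbers as written on the slide
lookup_all = map_actor(genbank_ws_stub)
sequences = lookup_all(["NM_001924", "NM020375"])
```

The per-item actor stays unaware of lists, so the same actor can be reused for a single gene or for a batch.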


Streaming Real-time Data

Straightforward Example:

Laser Strainmeter Channels in;

Scientific Workflow;

Earth-tide signal out

Seismic Waveforms


ORB


Job Management (here: NIMROD)

  • Job management infrastructure in place

  • Results database: under development

  • Goal: 1000’s of GAMESS jobs (quantum mechanics) – Fall/Winter’04


KEPLER Today

  • Support for SWF life cycle

    • Design, share, prototype, run, monitor, deploy, …

  • Coarse-grained scientific workflows, e.g.,

    • web service actors, grid actors, command-line actors, …

  • Fine-grained workflows and simulations, e.g.,

    • Database access, XSLT transformations, …

  • Kepler Extensions

    • SDM Center/SPA: support for data- and compute-intensive workflows!

    • real-time data streaming (ROADNet)

    • other special and generic extensions (e.g. GEON, SEEK)

  • Status

    • first release (alpha) was in May 2004

    • nightly builds w/ version tests

    • “Link-Up Sister Project” w/ other SWF systems (UK Taverna, Triana, …)

    • Participation in various workshops and conferences (GGF10, SSDBMs, eScience WF workshop, …)


KEPLER Tomorrow

  • Application-driven extensions:

    • access to/integration with other IDMAF components

      • SciRUN?, PnetCDF?, PVFS(2)?, MPI-IO?, parallel-R?, ASPECT?, FastBit, …

    • support for execution of new SWF domains

      • Astrophysics: TSI/Blondin (SPA/NCSU)

      • Nuclear Physics: Swesty (SPA/LLNL)

  • Generic extensions:

    • additional support for data-intensive and compute-intensive workflows (all SRB Scommands, CCA support, …)

    • (C-z; bg; fg)-ing (“detach” and reconnect)

    • workflow deployment models

  • Additional “domain awareness” (e.g. via new directors)

    • time series, parameter sweeps, job scheduling, …

    • hybrid type system with semantic types

  • Consolidation

    • More installers, regular releases, improved documentation, …


Desiderata for and Features of Scientific Workflow Automation

  • SWF design support

    • step-wise refinement, component/actor-oriented design, flow-oriented design, sharing (visual) design with others, …

    • better component reuse through actor-oriented modeling w/ (largely) independent directors

  • Rapid prototyping support

    • Web service actors and harvester

    • Shell/command line actor

    • Data transformations (e.g., via Perl, Python, XSLT, … actors)

  • Workflow “plumbing” support

    • data transformation actors e.g., in Perl, Python, XSLT, …

  • Runtime support

    • Execution monitoring

      • animation for SDF, planned “heartbeat” for PN, …

      • listening to and logging of token flow through ports and control messages of directors

    • Pause-inspect-modify-resume cycle


F I N

Additional material ahead


Research (and Development) Issues

…some challenges and ideas…


“Service Composition, Orchestration” and all that stuff

  • Instead of asking which WS-XXX solves this for you, ask: What is my WF composition problem?

  • Also: there is a good amount of previous work, most notably from the Ptolemy group itself:

    • How do you model systems as interacting components

    • How do you model component interaction

    • How can you make components and interaction patterns as reusable as possible

    •  Check out actor-oriented modeling and design!


“Programming Patterns” (Higher-Order FP Constructs)


Traffic info for a list of highways: Uses iterate (higher-order “map”) actor to access highway info web service repeatedly, sending out one email per highway.



[Figure: PIW workflow from [Altintas-et-al-PIW-SSDBM’03], with annotations:]

  • designed to fit

  • hand-crafted control solution; also: forces sequential execution!

  • hand-crafted

  • Web-service actor

  • No data transformations available

  • Complex backward control-flow


A Scientific Workflow Problem: More Solved (Computer Scientist’s view)

  • Solution based on declarative, functional dataflow process network (= also a data streaming model!)

  • Higher-order constructs: map(f)

    • no control-flow spaghetti

    • data-intensive apps

    • free concurrent execution

    • free type checking

    • automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]

  • Figure annotations:

    • map(f)-style iterators

    • Powerful type checking

    • Generic, declarative “programming” constructs

    • Generic data transformation actors

    • Forward-only, abstractable sub-workflow piw(GeneId)


A Scientific Workflow Problem: Even More Solved (domain&CS coming together!)

map(GenbankWS)

Input: {“NM_001924”, “NM020375”}

Output: {“CAGT…AATATGAC”, “GGGGA…CAAAGA”}


A Research Problem: Optimization by Rewriting

  • Example: PIW as a declarative, referentially transparent functional process

    • optimization via functional rewriting possible

    • e.g. map(f ∘ g) = map(f) ∘ map(g)

  • Technical report & PIW specification in Haskell

  • Figure annotations:

    • map(f ∘ g) instead of map(f) ∘ map(g)

    • Combination of map and zip

http://kbis.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
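
The rewriting rule for map can be checked on a small example. The Python sketch below is an illustration only (the slides point to a Haskell specification, which is not reproduced here); the helper names are hypothetical, and the equality holds for side-effect-free functions.

```python
def compose(f, g):
    """Function composition: compose(f, g)(x) == f(g(x))."""
    return lambda x: f(g(x))


def map_(f):
    """map(f) as a list-to-list function."""
    return lambda xs: [f(x) for x in xs]


def two_pass(f, g):
    # map(f) composed with map(g): builds an intermediate list between passes
    return compose(map_(f), map_(g))


def fused(f, g):
    # map of the composed function: one pass, no intermediate list
    return map_(compose(f, g))
```

For pure functions both forms return the same list, so a workflow optimizer is free to replace two chained map stages with a single fused one.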


Towards a more concise Presentation Style …

Due to lack of time, some slides will be “by reference” only ;-)

…Each speaker was given four minutes to present his paper, as there were so many scheduled -- 198 from 64 different countries. To help expedite the proceedings, all reports had to be distributed and studied beforehand, while the lecturer would speak only in numerals, calling attention in this fashion to the salient paragraphs of his work. ... Stan Hazelton of the U.S. delegation immediately threw the hall into a flurry by emphatically repeating: 4, 6, 11, and therefore 22; 5, 9, hence 22; 3, 7, 2, 11, from which it followed that 22 and only 22!! Someone jumped up, saying yes but 5, and what about 6, 18, or 4 for that matter; Hazelton countered this objection with the crushing retort that, either way, 22. I turned to the number key in his paper and discovered that 22 meant the end of the world… [The Futurological Congress, Stanislaw Lem, translated from the Polish by Michael Kandel, Futura 1977]


References

  • Kepler: http://kepler-project.org

  • Ptolemy: http://ptolemy.eecs.berkeley.edu/

  • Flow-based Programming: http://www.jpaulmorrison.com/fbp/index.shtml

  • Wiki with links to others: http://www.jpaulmorrison.com/cgi-bin/wiki.pl

    • http://c2.com/cgi/wiki?FlowBasedProgramming

    • http://c2.com/cgi/wiki?DataflowProgramming

    • http://c2.com/cgi/wiki?ActorsModel

