
Scientific Workflows Based on Dataflow Process Networks (or from Ptolemy to Kepler) (or Workflow Considered Harmful …)

Bertram Ludäscher

San Diego Supercomputer Center

[email protected]


Overview

  • Scientific Workflow (SWF) Examples

  • SWF Requirements & Characteristics

  • Workflow standards considered harmful for SWF!?

  • Dataflow Process Networks (Ptolemy II)

  • Scientific Workflows (Kepler = Ptolemy II + X)

Acknowledgements I

GEOsciences Network (NSF)

Biomedical Informatics Research Network (NIH)

Science Environment for Ecological Knowledge (NSF)

Scientific Data Management Center (DOE)

Acknowledgements II

Ilkay Altintas SDM

Chad Berkley SEEK

Shawn Bowers SEEK

Jeffrey Grethe BIRN

Christopher H. Brooks Ptolemy II

Zhengang Cheng SDM

Efrat Jaeger GEON

Matt Jones SEEK

Edward A. Lee Ptolemy II

Kai Lin GEON

Bertram Ludaescher BIRN, GEON, SDM, SEEK

Stephen Neuendorffer Ptolemy II

Mladen Vouk SDM

Yang Zhao Ptolemy II

Coming soon!?: ROADNet, myGrid, GriPhyN, ...

Promoter Identification Workflow (PIW)

Source: Matt Coleman (LLNL)

[Figure: the PIW in Ptolemy-II]


GARP Invasive Species Pipeline

[Figure: GARP pipeline diagram. Surviving labels: To EcoGrid; species presence & absence points (a), environmental layers (b), and integrated layers (c), each for the native range and for the invasion area; training sample; test sample (d); rule set; native range prediction map (f); area prediction map (f); model quality parameter (g); maps (h).]

Source: NSF SEEK (Deana Pennington et al., UNM)

A Look Inside Classification

Finer granularity

Extracted from the mineral composition and this level’s diagram coordinates.

Classifier: Locates the point’s region.

Diagrams information and transitions between them.

SVG to polygons.

Displays the point in the diagram for this level.

SWF Requirements & Characteristics

  • Scientist friendly "problem solving environment"

    • WF design

    • WF execution

    • WF steering and UI

      • pause; revise; resume; rollback (cf. SCIRun)

    • repositories of reusable components

    • data and WF provenance (virtual data concept)

      • logging, cache reuse/partial re-derive, reports, …

    • Conceptual modeling support

      • complex data (semantics) support

      • “wiring” support (cf. web service composition)

      • planning support

SWF Requirements & Characteristics

  • "Modeling" support

    • Abstraction, hierarchical modeling

    • Models of Computation (MoC)

    • component interaction; combination of MoCs (cf. CCA)

    • WF multi-grain/granola: powder to boulders (and back)

      • Boolean (N)AND, (N)OR,… vs. chaining together Grid-apps

    • Rich data structures and type systems

  • End user "programming" support

    • high-level programming constructs

      • e.g. map/3 for iteration, filter, select, branch, merge, ...

    • data transformations

    • legacy tool integration (plug-ins)

    • data streaming

      • How to tame (e.g., starve a dataflow; then resume)?

      • The Sorcerer’s Apprentice (“Zauberlehrling”) problem

SWF Requirements & Characteristics

  • Grid-enabling SWFs

    • transparent use of (remote) resources

    • big data

    • big computation requirements

    • early/late binding of logical to physical resources, …

    • planning, scheduling, …

       cf. Chimera, Pegasus, DAGMan, Condor(-G)

Scientific Workflows: Some Findings

  • More dataflow than (business) workflow

    • but some branching, looping, merging, …

    • not: documents/objects undergoing modifications

    • instead often: dataset-out = analysis(dataset-in)

  • Need for “programming extension”

    • Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …)

  • Need for abstraction and nested workflows

  • Need for data transformations (compute/transform alternations)

  • Need for rich user interaction & workflow steering:

    • pause / revise / resume

    • select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF

  • Need for high-throughput transfers (“grid-enabling”, “streaming”)

  • Need for persistence of intermediate products

    • data provenance (“virtual data” concept)

A ZOO of Workflow Standards and Systems

Source: W.M.P. van der Aalst et al.

Business Workflows

  • Business Workflows

    • show their office automation ancestry

    • documents and “work-tasks” are passed

    • no data streaming, no data-intensive pipelines

    • lots of standards to choose from: WfMC, WSFL, BPML, BPEL4WS, …, XPDL, …

    • but often no clear execution semantics for constructs as simple as this:

Source: Expressiveness and Suitability of Languages for Control Flow Modelling in Workflows, PhD thesis, Bartosz Kiepuszewski, 2002

On Workflow Standards…

Workflow “Standards” Debunked

Source: W.M.P. van der Aalst, “Don’t go with the flow: Web services composition standards exposed”, in Trends & Controversies: “Web Services: Been there, done that?”, IEEE Intelligent Systems, Jan/Feb 2003 issue


But never mind the standards discussion: Many Scientific Workflows are Dataflows!

(Check YOUR examples …)

SCIRun: Component-Based Problem Solving Environments for Large-Scale Scientific Computing

  • SCIRun: problem solving environment for interactive construction, debugging, and steering of large-scale scientific computations

  • Component model, based on generalized dataflow programming

  • Contact: Steve Parker; SciDAC/SDM collaboration

Workflow and distributed computation grid created with Kensington Discovery Edition from InforSense.

Dataflow Process Networks: Putting Computation Models first!

typed i/o ports

  • Synchronous Dataflow Network (SDF)

    • Statically schedulable single-threaded dataflow

      • Can execute multi-threaded, but the firing-sequence is known in advance

    • Maximally well-behaved, but also limited expressiveness

  • Process Network (PN)

    • Multi-threaded dynamically scheduled dataflow

    • More expressive than SDF (dynamic token rate prevents static scheduling)

    • Natural streaming model

  • Other Execution Models (“Domains”)

    • Implemented through different “Directors”
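What "statically schedulable" means for SDF can be sketched for a single edge A → B with fixed token rates. The following is a minimal illustration, not Ptolemy II code: the names `repetitions` and `schedule` are hypothetical, and both rates are assumed to be positive. It solves the balance equation for the edge and derives one firing sequence before anything executes, which is exactly what a PN with dynamic token rates does not allow.

```haskell
-- Sketch only: `repetitions` and `schedule` are hypothetical helper names,
-- not part of Ptolemy II.

-- | For one SDF edge A -> B, where A produces `produced` tokens per firing
-- and B consumes `consumed` tokens per firing (both assumed positive),
-- return the smallest firing counts (qA, qB) that satisfy the balance
-- equation  qA * produced == qB * consumed.
repetitions :: Int -> Int -> (Int, Int)
repetitions produced consumed = (consumed `div` g, produced `div` g)
  where
    g = gcd produced consumed

-- | One valid periodic schedule for that edge, computed before execution:
-- fire A until a full period of tokens is buffered, then fire B.
schedule :: Int -> Int -> String
schedule produced consumed = replicate qA 'A' ++ replicate qB 'B'
  where
    (qA, qB) = repetitions produced consumed
```

For example, with A producing 2 tokens and B consuming 3, `repetitions 2 3` gives `(3, 2)` and `schedule 2 3` gives `"AAABB"`: three firings of A buffer six tokens, which two firings of B consume.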

advanced push/pull

Dataflow Process Networks and Ptolemy-II



Source: Edward Lee et al.

Why Ptolemy-II?

  • PTII Objective:

    • “The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.”

  • Data & Process oriented:

    • Dataflow process networks

  • Natural Data Streaming Support

  • End user “WF console” (Vergil GUI)


    • mature, actively maintained, well-documented

    • open source system

    • leverage “sister projects” activities (e.g. SEEK, SDM, BIRN,…)

Source: Edward Lee et al.


Marrying & Divorcing Control- & Dataflow

Source: Edward Lee et al.

Another Goodie: Ptolemy-II Type System

Support for Multiple Workflow Granularities





Sand to Rocks


Scientific Workflows = Dataflow Process Networks + X

  • X = …

    • Database plug-ins

    • Legacy application plug-ins (via command line, as web services, …)

    • Grid extensions:

      • Actors as web/grid services

      • 3rd party data transfer, high-throughput data streaming

      • Dealing with thousands of files (cf. astrophysics, astronomy, HEP, … examples)

      • Data and service repositories, discovery

    • Extended type system (structural & semantic extensions)

    • Programming extensions (declarative/FP)

    • Rich user interactions/workflow steering

    • Rich data transformations (compute/transform alternations)

    • Data provenance

      • (semi-)automatic meta-data creation

Kepler = Ptolemy-II + X

Status update / specific tasks for Kepler ($ DONE, % ONGOING, * NEW)

  • User interaction, workflow steering

    • $ Pause/revise/resume

    • $ BrowserUI actor (browser as a 0-learning display and selection tool)

  • Distributed execution

    • $ Dynamically port-specializing WSDL actor

    • * Dynamically specializing Grid service actor

  • Port & actor type extensions (SEEK leverage)

    • * Structural types (XML Schema)

    • * Semantic types (OWL) incl. unit types w/ automatic conversion

  • Programming extensions

    • % Data transformation actors (XSLT, XQuery, Python, Perl,…)

    • * map, zip, zipWith, …, loop, switch “patterns”

  • Specialized Data Sources

    • $ EML (SEEK),

    • % MS Access (GEON), *JDBC,

    • *XML, *NetCDF, …

Some specific tasks for Kepler (all NEW)

  • Design & develop transparent, Grid-enabled PNs:

    • Communication protocol details

    • Grid-actor extensions and/or

    • Grid-Process Network director (G-PN)

    • Host/Source-location becomes actor parameter

      • add “active-inline” parameter display for grid-actors (@exec-loc), channels (@transport-protocol), source-actors (@{src-loc|catalog-loc})

  • Activity Monitoring

    • Add “activity status” display (green, yellow, red) to replace PtII animation (needed for concurrently executing PN!)

  • Registration & Deployment mechanisms

    • Actor/Data/Workflow repository (=composite actors)

    • Shows up as (config’able) actor library

    • OGSA Service Registry approach? (SEEK leverage; UDDI complex & limited says MattJ)


  • Extensions to deal with failures (fault tolerance)


Example: Database actors for Ptolemy II

(Kepler-GEON; Efrat Jaeger)

Database Actors

  • Database Connection actor:

  • Database Query actor:

Database Actors Example


Example: Web service-enabling Ptolemy II

(Kepler-SDM; Ilkay Altintas)

A Generic Web Service Actor

Configure – select service from repository

Configure – select WSDL URL

Set Parameters and Commit Specialized Actor

Set parameters and commit

Web Service Actor after Instantiation

Composing Third-Party Web Services

Output of previous web service

Input of next web service

User interaction &


Results of the Execution

User I/O via standard browser!

Run Window / WF Deployment

Composing Legacy Applications (here: Phylogeny): Shell / Command-Line Actors


Example: Grid-enabling Ptolemy II

(Kepler-SEEK, Chad Berkley; Kepler-SDM, Ilkay Altintas; … myGrid?, … GriPhyN?, … OGS{I|A}-[DAI] …)

Transparently Grid-Enabling PTII: Handles

Logical token transfer (3) requires get_handle(1,2); then exec_handle(4,5,6,7) for completion.

  • AGA: get_handle

  • GAA: return &X

  • AB: send &X

  • BGB: request &X

  • GBGA: request &X

  • GA GB: send *X

  • GBB: send done(&X)

  • Example:

  • &X = “GA.17”

  • *X =<some_huge_file>
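The handle exchange can be modelled in a few lines. This is a rough sketch under stated assumptions, not Kepler or Ptolemy II code: the types `Handle`, `Payload`, and `Store` and the function `transfer` are hypothetical, and each grid service is reduced to an association list of the data it holds. The function names `getHandle` and `execHandle` only mirror the slide's get_handle and exec_handle. The point it illustrates is that only the small handle &X crosses PTII space, while the payload *X moves directly between grid services.

```haskell
-- Sketch only: all types and helpers here are hypothetical illustrations.

type Handle  = String              -- the slide's &X, e.g. "GA.17"
type Payload = String              -- stands in for *X, e.g. a huge file
type Store   = [(Handle, Payload)] -- data held by one grid service

-- | Steps 1-2: actor A asks its grid service GA for a handle.
-- Succeeds only if GA actually holds data under that name.
getHandle :: Store -> Handle -> Maybe Handle
getHandle ga h = h <$ lookup h ga

-- | Steps 4-7: B hands the handle to GB, GB requests the payload from GA,
-- GA sends it, and GB reports completion. The result is GB's store
-- after the transfer, or Nothing if GA does not hold the data.
execHandle :: Store -> Store -> Handle -> Maybe Store
execHandle ga gb h = do
  payload <- lookup h ga
  Just ((h, payload) : gb)

-- | The whole exchange, given the stores of GA and GB.
transfer :: Store -> Store -> Handle -> Maybe Store
transfer ga gb name = do
  h <- getHandle ga name -- steps 1-2
  -- step 3: A sends only the handle h to B over the PTII channel
  execHandle ga gb h     -- steps 4-7
```

With the slide's example values, `transfer [("GA.17", "<some_huge_file>")] [] "GA.17"` yields `Just [("GA.17", "<some_huge_file>")]`, and an unknown handle yields `Nothing`.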

[Figure labels: PTII space; Grid space]




Transparently Grid-Enabling PTII

  • Different phases

    • Register designed WF (could include external validation service)

    • Find suitable grid service hosts for actors

    • Pre-stage execution

    • Execute (w/ provenance)

      • Interactively steer (pause; revise; resume)

      • Batch process; re-run parts later

    • Register/store data products and execution logs

  • Kepler implementation choices:

    • Grid-actors (no change of Director necessary!?) and/or

    • Grid-(PN)-director (also need to change actors!?)

    • Add grid service host id as actor parameter: [email protected]

    • Similar for data: [email protected]

“C-z ; bg &” – Detach your WF execution!

  • Currently in PTII

    • tight coupling of WF execution and PTII Java client (also Vergil GUI)

  • To-do for Kepler:

    • detaching WF console (Vergil) from a Grid-aware execution engine

Grid-PN Director!

Transport protocol


Data location parameter

Host location


Semantic Type-enabling Ptolemy II – OWL here we go… ;-)

(Kepler-SEEK; Shawn Bowers)

Semantic Type Extensions

  • Take concepts and relationships from an ontology to “semantically type” the data-in/out ports

  • Application: e.g., design support:

    • smart/semi-automatic wiring, generation of “massaging actors”
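As an illustration of how semantic port types could support wiring, the sketch below checks whether an output port's concept is subsumed by an input port's concept. Everything in it is an assumption made for the example: the toy `ontology` (including the subconcept links and the `Measurement` concept), and the names `isA` and `canWire`. A real implementation would query an OWL reasoner rather than a hand-written list.

```haskell
-- Sketch only: the ontology contents and all function names are
-- hypothetical; they are not taken from Kepler or SEEK.

type Concept = String

-- | Toy ontology as (subconcept, superconcept) pairs; assumed acyclic.
ontology :: [(Concept, Concept)]
ontology =
  [ ("DerivedObservation", "Observation")
  , ("AbundanceCount", "Measurement")
  , ("MortalityRate", "Measurement")
  ]

-- | Reflexive, transitive subsumption: is every c also a d?
isA :: Concept -> Concept -> Bool
isA c d = c == d || any (`isA` d) [sup | (sub, sup) <- ontology, sub == c]

-- | An output port can be wired to an input port when what it produces
-- is a special case of what the input accepts.
canWire :: Concept -> Concept -> Bool
canWire outType inType = outType `isA` inType
```

Here `canWire "DerivedObservation" "Observation"` is `True`, while the reverse direction is `False`; a design tool could use the failing cases as the places where a “massaging actor” needs to be generated.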





Takes Abundance Count Measurements for Life Stages

Returns Mortality Rate Derived Measurements for Life Stages

Semantic Types

  • The semantic type signature

    • Type expressions over the (OWL) ontology





SemType m1 ::

Observation & itemMeasured.AbundanceCount &



DerivedObservation & itemMeasured.MortalityRate &


Extended Type System (here: OWL Semantic Types)

SemType m1 ::

Observation & itemMeasured.AbundanceCount &


 DerivedObservation & itemMeasured.MortalityRate & hasContext.appliesTo.LifeStageProperty

Substructure association:

XML raw-data =(X)Query=> object model =link => OWL ontology


Programming Extensions

(some lessons from SciDAC/SSDBM demo)

designed to fit



Web-service actor




in Ptolemy-II


hand-crafted control solution; also: forces sequential execution!

No data transformations available

Complex backward control-flow

Promoter Identification Workflow in FP

genBankG       :: GeneId -> GeneSeq
genBankP       :: PromoterId -> PromoterSeq
blast          :: GeneSeq -> [PromoterId]
promoterRegion :: PromoterSeq -> PromoterRegion
transfac       :: PromoterRegion -> [TFBS]
gpr2str        :: (PromoterId, PromoterRegion) -> String

d0 = Gid "7"                 -- start with some gene-id
d1 = genBankG d0             -- get its gene sequence from GenBank
d2 = blast d1                -- BLAST to get a list of potential promoters
d3 = map genBankP d2         -- get list of promoter sequences
d4 = map promoterRegion d3   -- compute list of promoter regions and ...
d5 = map transfac d4         -- ... get transcription factor binding sites
d6 = zip d2 d4               -- create list of pairs promoter-id/region
d7 = map gpr2str d6          -- pretty print into a list of strings
d8 = concat d7               -- concat into a single "file"
d9 = putStr d8               -- output that file
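The specification above gives only type signatures for the services. To see the pipeline run end to end, one can plug in stand-in implementations. The sketch below does that under clearly artificial assumptions: every function body is a placeholder that just decorates strings, nothing here calls GenBank or BLAST, and the `transfac` step (d5), which does not feed the printed output, is left out. The names `piw` and `piwAll` are chosen here to match the deck's piw(GeneId) and map(piw) over [GeneId].

```haskell
-- Sketch only: the newtypes and all function bodies are placeholder stubs,
-- not the real services.

newtype GeneId         = Gid String
newtype GeneSeq        = GeneSeq String
newtype PromoterId     = Pid String
newtype PromoterSeq    = PromoterSeq String
newtype PromoterRegion = PromoterRegion String

genBankG :: GeneId -> GeneSeq
genBankG (Gid g) = GeneSeq ("G" ++ g)

blast :: GeneSeq -> [PromoterId]
blast (GeneSeq s) = [Pid (s ++ "-p1"), Pid (s ++ "-p2")]

genBankP :: PromoterId -> PromoterSeq
genBankP (Pid p) = PromoterSeq (p ++ ":seq")

promoterRegion :: PromoterSeq -> PromoterRegion
promoterRegion (PromoterSeq s) = PromoterRegion (s ++ ":region")

gpr2str :: (PromoterId, PromoterRegion) -> String
gpr2str (Pid p, PromoterRegion r) = p ++ "\t" ++ r ++ "\n"

-- | Forward-only sub-workflow for a single gene id (steps d1 to d8).
piw :: GeneId -> String
piw g = concatMap gpr2str (zip pids regions)
  where
    pids    = blast (genBankG g)
    regions = map (promoterRegion . genBankP) pids

-- | The same workflow lifted over a list of gene ids.
piwAll :: [GeneId] -> [String]
piwAll = map piw
```

In GHCi, `putStr (piw (Gid "7"))` prints two tab-separated lines, one per stub promoter: `G7-p1` with `G7-p1:seq:region`, and `G7-p2` with `G7-p2:seq:region`.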

Cleaned up Process Network PIW

Back to purely functional dataflow process network (= also a data streaming model!)

Re-introducing map(f) to Ptolemy-II (was there in PT Classic):

  • no control-flow spaghetti

  • data-intensive apps

  • free concurrent execution

  • free type checking

  • automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]



Powerful type checking

Generic, declarative “programming” constructs

Generic data transformation actors

Forward-only, abstractable sub-workflow piw(GeneId)

Optimization by Declarative Rewriting I

PIW as a declarative, referentially transparent functional process

optimization via functional rewriting possible, e.g. map(f o g) = map(f) o map(g)

map(f o g) instead of map(f) o map(g)

Combination of map and zip

Technical report & PIW specification in Haskell
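The rewriting rule can be checked on ordinary Haskell lists. The sketch below uses arbitrary example functions (`(+ 1)` and `(* 2)`), and the names `unfused`, `fused`, `pairSums`, and `pairSums'` are made up for the illustration. The first pair shows map fusion: one traversal instead of two, with no intermediate list. The second pair shows one way to combine map and zip, replacing a zip followed by a map with a single `zipWith`.

```haskell
-- Sketch only: the functions being mapped are arbitrary examples.

-- | Two passes over the data, with an intermediate list in between.
unfused :: [Int] -> [Int]
unfused = map (+ 1) . map (* 2)

-- | One pass: the rewritten form map (f . g).
fused :: [Int] -> [Int]
fused = map ((+ 1) . (* 2))

-- | zip followed by map builds an intermediate list of pairs ...
pairSums :: [Int] -> [Int] -> [Int]
pairSums xs ys = map (uncurry (+)) (zip xs ys)

-- | ... which zipWith avoids while computing the same result.
pairSums' :: [Int] -> [Int] -> [Int]
pairSums' = zipWith (+)
```

Both `unfused [1, 2, 3]` and `fused [1, 2, 3]` evaluate to `[3, 5, 7]`, and both `pairSums [1, 2] [10, 20]` and `pairSums' [1, 2] [10, 20]` evaluate to `[11, 22]`. The rewrite is only safe because the mapped functions are referentially transparent, which is the property the slide attributes to PIW.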

Combination of map and zip

Optimizing II: Streams & Pipelines

  • Clean functional semantics facilitates algebraic workflow (program) transformations (Bird-Meertens); e.g. mapS f • mapS g = mapS (f • g)

Source: Real-Time Signal Processing: Dataflow, Visual, and Functional Programming, Hideki John Reekie, University of Technology, Sydney

Summary

  • Many (most of ours anyways) scientific workflows are dataflows

    • lots of workflow “standards” (messy and not focused on SWF problems)

    • should we start a new wave of dataflow standards??

  • Importance of clear semantics for

    • different MoCs (models of computation: PN, SDF, DE, CT, …)

    • component composition across MoCs

    • component interaction

    • Ptolemy II directors

  • Kepler:

    • Based on extensible Ptolemy II system

    • Cross-project activity (SEEK, SDM, Ptolemy II, GEON, BIRN, and counting)

    • Plug-in / interface with your SWF planner, execution engine, grid-WF tool!

A Note on the Style of these Slides

Due to lack of time, most of the following slides are “by reference” only ;-)

…Each speaker was given four minutes to present his paper, as there were so many scheduled -- 198 from 64 different countries. To help expedite the proceedings, all reports had to be distributed and studied beforehand, while the lecturer would speak only in numerals, calling attention in this fashion to the salient paragraphs of his work. ... Stan Hazelton of the U.S. delegation immediately threw the hall into a flurry by emphatically repeating: 4, 6, 11, and therefore 22; 5, 9, hence 22; 3, 7, 2, 11, from which it followed that 22 and only 22!! Someone jumped up, saying yes but 5, and what about 6, 18, or 4 for that matter; Hazelton countered this objection with the crushing retort that, either way, 22. I turned to the number key in his paper and discovered that 22 meant the end of the world… [The Futurological Congress, Stanislaw Lem, translated from the Polish by Michael Kandel, Futura 1977]

F I N: Words to/from the Wise

FYI: Flow-based programming has been re-discovered/re-invented several times by different communities. Here is an “IBM practitioner’s view”:

In "Flow-Based Programming" (FBP), applications are defined as networks of "black box" processes, which exchange data across predefined connections. These black box processes can be reconnected endlessly to form different applications without having to be changed internally. It is thus naturally component-oriented. To describe this capability, the distinguished IBM engineer, Nate Edwards, coined the term "configurable modularity", which he calls the basis of all true engineered systems.

When using FBP, the application developer works with flows of data, being processed asynchronously, rather than the conventional single hierarchy of sequential, procedural code. It is thus a good fit with multiprocessor computers, and also with modern embedded software. In many ways, an FBP application resembles more closely a real-life factory, where items travel from station to station, undergoing various transformations. Think of a soft drink bottling factory, where bottles are filled at one station, capped at the next and labelled at yet another one. FBP is therefore highly visual: it is quite hard to work with an FBP application without having the picture laid out on one's desk, or up on a screen! For an example, see Sample DrawFlow Diagram.

Strangely though, in spite of being at the leading edge of application development, it is also simple enough that trainee programmers can pick it up, and it is a much better match with the primitives of data processing than the conventional primitives of procedural languages. The key, of course (and perhaps the reason why it hasn't caught on more widely), is that it involves a significant paradigm shift that changes the way you look at programming, and once you have made this transition, you find you can never go back!

FBP seems to dovetail neatly with a concept that I call "smart data". There is a section on this in stuff about the author. A new web page on this topic has just been uploaded - see "Smart Data" and Business Data Types - and we will be publishing more as it develops.

Source: Flow-based Programming,