Bertram Ludäscher San Diego Supercomputer Center ludaesch@SDSC

Scientific Workflows Based on Dataflow Process Networks (or from Ptolemy to Kepler) (or Workflow Considered Harmful …) Bertram Ludäscher San Diego Supercomputer Center ludaesch@SDSC.edu

Overview • Scientific Workflow (SWF) Examples • SWF Requirements & Characteristics • Workflow standards considered harmful for SWF!? • Dataflow Process Networks (Ptolemy II) • Scientific Workflows (Kepler = Ptolemy II + X)

Acknowledgements I • NSF, NIH, DOE • GEOsciences Network (NSF) www.geongrid.org • Biomedical Informatics Research Network (NIH) www.nbirn.net • Science Environment for Ecological Knowledge (NSF) seek.ecoinformatics.org • Scientific Data Management Center (DOE) sdm.lbl.gov/sdmcenter/

Acknowledgements II • Ilkay Altintas (SDM) • Chad Berkley (SEEK) • Shawn Bowers (SEEK) • Jeffrey Grethe (BIRN) • Christopher H. Brooks (Ptolemy II) • Zhengang Cheng (SDM) • Efrat Jaeger (GEON) • Matt Jones (SEEK) • Edward A. Lee (Ptolemy II) • Kai Lin (GEON) • Bertram Ludaescher (BIRN, GEON, SDM, SEEK) • Stephen Neuendorffer (Ptolemy II) • Mladen Vouk (SDM) • Yang Zhao (Ptolemy II) • … Coming soon!?: ROADNet, myGrid, GriPhyN, ...

Promoter Identification Workflow (PIW) Source: Matt Coleman (LLNL)

Execution Semantics Promoter Identification Workflow in Ptolemy-II (SSDBM’03)

GARP Invasive Species Pipeline [figure] • Steps: EcoGrid Query, Layer Integration, Data Calculation, Map Generation, Validation, Generate Metadata, Archive To Ecogrid • Inputs: registered Ecogrid databases, user sample data • Data products: (a) species presence & absence points (native range, invasion area), (b) environmental layers (native range, invasion area), (c) integrated layers (native range, invasion area), (d) training sample, test sample, (e) GARP rule set, (f) native range prediction map, invasion area prediction map, (g) model quality parameter, (h) selected prediction maps • Source: NSF SEEK (Deana Pennington et al., UNM)

Rock & Mineral Classification Workflow

A Look Inside Classification • Finer granularity • Extracted from the mineral composition and this level’s diagram coordinates • Classifier: locates the point’s region • Diagram information and transitions between them • SVG to polygons • Displays the point in the diagram for this level

Source: NIH BIRN (Jeffrey Grethe, UCSD)

SWF Requirements & Characteristics • Scientist friendly "problem solving environment" • WF design • WF execution • WF steering and UI • pause; revise; resume; rollback (cf. SCIRun) • repositories of reusable components • data and WF provenance (virtual data concept) • logging, cache reuse/partial re-derive, reports, … • Conceptual modeling support • complex data (semantics) support • “wiring” support (cf. web service composition) • planning support

SWF Requirements & Characteristics • "Modeling" support • Abstraction, hierarchical modeling • Models of Computation (MoC) • component interaction; combination of MoCs (cf. CCA) • WF multi-grain/granola: powder to boulders (and back) • Boolean (N)AND, (N)OR, … vs. chaining together Grid-apps • Rich data structures and type systems • End user "programming" support • high-level programming constructs • e.g. map/3 for iteration, filter, select, branch, merge, ... • data transformations • legacy tool integration (plug-ins) • data streaming • How to tame (e.g., starve a dataflow; then resume)? → the sorcerer’s apprentice (Zauberlehrling) problem

SWF Requirements & Characteristics • Grid-enabling SWFs • transparent use of (remote) resources • big data • big computation requirements • early/late binding of logical to physical resources, … • planning, scheduling, …  cf. Chimera, Pegasus, DAGman, Condor(-G)

Scientific Workflows: Some Findings • More dataflow than (business) workflow • but some branching, looping, merging, … • not: documents/objects undergoing modifications • instead often: dataset-out = analysis(dataset-in) • Need for “programming extension” • Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …) • Need for abstraction and nested workflows • Need for data transformations (compute/transform alternations) • Need for rich user interaction & workflow steering: • pause / revise / resume • select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF • Need for high-throughput transfers (“grid-enabling”, “streaming”) • Need for persistence of intermediate products → data provenance (“virtual data” concept)
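The "dataset-out = analysis(dataset-in)" pattern and the higher-order operations named above (foreach/map, filter, functional composition, zip) can be sketched in a few lines of Python. This is only an illustration: the helper names (compose, foreach, keep) and the three toy steps are assumptions made for this sketch, not Kepler or Ptolemy II constructs.

```python
from functools import reduce

def compose(*steps):
    """Chain steps left to right: compose(f, g)(x) == g(f(x))."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)

def foreach(f):
    """Lift an element-wise function to a list-processing step (map(f))."""
    return lambda items: [f(x) for x in items]

def keep(pred):
    """Build a filtering step that keeps only items satisfying pred."""
    return lambda items: [x for x in items if pred(x)]

# Hypothetical analysis steps, chosen only to make the sketch concrete.
normalize = foreach(lambda x: x / 10.0)
drop_low = keep(lambda x: x >= 0.5)
label = lambda items: list(zip(range(len(items)), items))

# The whole workflow is itself one function from dataset to dataset.
analysis = compose(normalize, drop_low, label)

dataset_in = [3, 5, 8, 10]
dataset_out = analysis(dataset_in)
print(dataset_out)
```

Because analysis is an ordinary function, it can itself be passed to foreach or compose, which is one way to read the call for abstraction and nested workflows.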

A ZOO of Workflow Standards and Systems Source: W.M.P. van der Aalst et al. http://tmitwww.tm.tue.nl/research/patterns/

Business Workflows • show their office automation ancestry • documents and “work-tasks” are passed • no data streaming, no data-intensive pipelines • lots of standards to choose from: WfMC, WSFL, BPML, BPEL4WS, …, XPDL, … • but often no clear execution semantics for constructs as simple as this: [figure] Source: Expressiveness and Suitability of Languages for Control Flow Modelling in Workflows, PhD thesis, Bartosz Kiepuszewski, 2002

On Workflow Standards… http://tmitwww.tm.tue.nl/staff/wvdaalst/Publications/publications.html

Workflow “Standards” Debunked Source: Don’t go with the flow: Web services composition standards exposed, W.M.P. van der Aalst, Trends & Controversies, Jan/Feb 2003 issue of IEEE Intelligent Systems Web Services - Been there done that?

But never mind the standards discussion: Many Scientific Workflows are Dataflows! (Check YOUR examples …)

Commercial Workflow/Dataflow Systems

SCIRun: Component-Based Problem Solving Environments for Large-Scale Scientific Computing • SCIRun: problem solving environment for interactive construction, debugging, and steering of large-scale scientific computations • Component model, based on generalized dataflow programming • Contact: Steve Parker (cs.utah.edu); SciDAC/SDM collaboration

Workflow and distributed computation grid created with Kensington Discovery Edition from InforSense.

Dataflow Process Networks: Putting Computation Models first! [figure labels: actor, actor, FIFO, typed i/o ports, advanced push/pull] • Synchronous Dataflow Network (SDF) • Statically schedulable single-threaded dataflow • Can execute multi-threaded, but the firing-sequence is known in advance • Maximally well-behaved, but also limited expressiveness • Process Network (PN) • Multi-threaded dynamically scheduled dataflow • More expressive than SDF (dynamic token rate prevents static scheduling) • Natural streaming model • Other Execution Models (“Domains”) • Implemented through different “Directors”
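A minimal sketch of the Process Network idea in Python: each actor runs in its own thread and communicates only through FIFO channels with blocking reads. The actor names (source, scale, sink), the DONE end-of-stream marker, and the bounded queue capacity are assumptions of this sketch, not the Ptolemy II API or its PN director.

```python
import queue
import threading

DONE = object()  # end-of-stream marker (an assumption of this sketch)

def source(out, values):
    """Actor that emits each value as a token, then signals end of stream."""
    for v in values:
        out.put(v)  # blocks while the bounded channel is full
    out.put(DONE)

def scale(inp, out, factor):
    """Actor that consumes one token per firing and emits one scaled token."""
    while True:
        token = inp.get()  # blocking read: wait until a token arrives
        if token is DONE:
            out.put(DONE)
            return
        out.put(token * factor)

def sink(inp, results):
    """Actor that collects tokens in arrival order."""
    while True:
        token = inp.get()
        if token is DONE:
            return
        results.append(token)

def run_network(values, factor=2, capacity=2):
    """Wire source -> scale -> sink with FIFO channels and run to completion."""
    a_to_b = queue.Queue(maxsize=capacity)
    b_to_c = queue.Queue(maxsize=capacity)
    results = []
    threads = [
        threading.Thread(target=source, args=(a_to_b, values)),
        threading.Thread(target=scale, args=(a_to_b, b_to_c, factor)),
        threading.Thread(target=sink, args=(b_to_c, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

pn_output = run_network([1, 2, 3, 4])
print(pn_output)
```

The thread interleaving varies from run to run, but the output order does not, because every channel is a FIFO with a single writer and a single reader. Since each of these actors consumes and produces exactly one token per firing, this particular network could also be scheduled statically, as in SDF.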

Dataflow Process Networks and Ptolemy-II: see! read! try! Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

Why Ptolemy-II? • PTII Objective: • “The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.” • Data & Process oriented: • Dataflow process networks • Natural Data Streaming Support • End user “WF console” (Vergil GUI) • PRAGMATICS • mature, actively maintained, well-documented • open source system • leverage “sister projects” activities (e.g. SEEK, SDM, BIRN, …)

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

Marrying & Divorcing Control- & Dataflow Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

Another Goodie: Ptolemy-II Type System

Support for Multiple Workflow Granularities • Abstraction: Sand to Rocks • Boulders • Sand • Powder • Plumbing

Scientific Workflows = Dataflow Process Networks + X • X = … • Database plug-ins • Legacy application plug-ins (via command line, as web services, …) • Grid extensions: • Actors as web/grid services • 3rd party data transfer, high-throughput data streaming • Dealing with thousands of files (cf. astrophysics, astronomy, HEP, … examples) • Data and service repositories, discovery • Extended type system (structural & semantic extensions) • Programming extensions (declarative/FP) • Rich user interactions/workflow steering • Rich data transformations (compute/transform alternations) • Data provenance • (semi-)automatic meta-data creation • Kepler = Ptolemy-II + X

Status update / specific tasks for Kepler ($ DONE, % ONGOING, * NEW) • User interaction, workflow steering • $ Pause/revise/resume • $ BrowserUI actor (browser as a 0-learning display and selection tool) • Distributed execution • $ Dynamically port-specializing WSDL actor • * Dynamically specializing Grid service actor • Port & actor type extensions (SEEK leverage) • * Structural types (XML Schema) • * Semantic types (OWL) incl. unit types w/ automatic conversion • Programming extensions • % Data transformation actors (XSLT, XQuery, Python, Perl, …) • * map, zip, zipWith, …, loop, switch “patterns” • Specialized Data Sources • $ EML (SEEK) • % MS Access (GEON) • * JDBC • * XML • * NetCDF, …

Some specific tasks for Kepler (all NEW) • Design & develop transparent, Grid-enabled PNs: • Communication protocol details • Grid-actor extensions and/or • Grid-Process Network director (G-PN) • Host/Source-location becomes actor parameter • add “active-inline” parameter display for grid-actors (@exec-loc), channels (@transport-protocol), source-actors (@{src-loc|catalog-loc}) • Activity Monitoring • Add “activity status” display (green, yellow, red) to replace PtII animation (needed for concurrently executing PN!) • Registration & Deployment mechanisms • Actor/Data/Workflow repository (=composite actors) • Shows up as (config’able) actor library • OGSA Service Registry approach? (SEEK leverage; UDDI complex & limited says MattJ) • http://www-unix.globus.org/toolkit/draft-ggf-ogsi-gridservice-33_2003-06-27.pdf • Extensions to deal with failures (fault tolerance)

Example: Database actors for Ptolemy II (Kepler-GEON; Efrat Jaeger)

Database Actors • Database Connection actor: • Database Query actor:

Database Actors Example
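A rough Python sketch of how the two database actors might divide the work: one actor opens a connection and emits it as a token, the other consumes that token together with a query and emits the result rows. The class names, the fire() method, the use of SQLite, and the sample table are all assumptions made for illustration; they are not the Kepler-GEON implementation.

```python
import sqlite3

class DatabaseConnectionActor:
    """Hypothetical actor: opens a database and outputs the connection token."""

    def __init__(self, database):
        self.database = database

    def fire(self):
        return sqlite3.connect(self.database)

class DatabaseQueryActor:
    """Hypothetical actor: runs a query on an incoming connection token."""

    def __init__(self, query):
        self.query = query

    def fire(self, connection, params=()):
        cursor = connection.execute(self.query, params)
        columns = [d[0] for d in cursor.description]
        # Emit each row as a record keyed by column name.
        return [dict(zip(columns, row)) for row in cursor.fetchall()]

# Illustrative in-memory data (made up for this sketch).
connect = DatabaseConnectionActor(":memory:")
conn = connect.fire()
conn.execute("CREATE TABLE samples (name TEXT, sio2 REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("granite", 72.0), ("basalt", 49.5)],
)

query = DatabaseQueryActor(
    "SELECT name, sio2 FROM samples WHERE sio2 > ? ORDER BY name"
)
rows = query.fire(conn, (60.0,))
conn.close()
print(rows)
```

Splitting connection and query into separate actors lets several query actors share one connection token in the same workflow.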

Example: Web service-enabling Ptolemy II (Kepler-SDM; Ilkay Altintas)

A Generic Web Service Actor • Configure – select WSDL URL from repository • Configure – select service operation

Set Parameters and Commit • Specialized actor

Web Service Actor after Instantiation

Composing Third-Party Web Services • Output of previous web service • User interaction & transformations • Input of next web service

Results of the Execution • User I/O via standard browser! • Run Window / WF Deployment

Composing Legacy Applications (here: Phylogeny): Shell / Command-Line Actors
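One way to picture a shell / command-line actor is as a thin wrapper that feeds its input token to an external program on stdin and emits the program's stdout as the output token. The sketch below is an assumption-laden illustration: the CommandLineActor class is hypothetical, and the two "legacy tools" are stand-in Python one-liners (run via the current interpreter so that the example is portable) rather than real phylogeny programs.

```python
import subprocess
import sys

class CommandLineActor:
    """Hypothetical actor wrapping an external command (stdin -> stdout)."""

    def __init__(self, argv):
        self.argv = list(argv)

    def fire(self, input_text):
        completed = subprocess.run(
            self.argv,
            input=input_text,
            capture_output=True,
            text=True,
            check=True,  # raise if the wrapped tool exits with an error
        )
        return completed.stdout

# Stand-ins for legacy tools: upper-case the input, then sort its lines.
to_upper = CommandLineActor(
    [sys.executable, "-c",
     "import sys; sys.stdout.write(sys.stdin.read().upper())"]
)
sort_lines = CommandLineActor(
    [sys.executable, "-c",
     "import sys; sys.stdout.writelines(sorted(sys.stdin))"]
)

# Compose the two tools by passing one actor's output to the next.
result = sort_lines.fire(to_upper.fire("taxon_b\ntaxon_a\n"))
print(result, end="")
```

With check=True a failing tool raises an exception instead of silently passing empty output downstream, which a workflow engine could catch for fault handling.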

Example: Grid-enabling Ptolemy II (Kepler-SEEK, Chad Berkley; Kepler-SDM, Ilkay Altintas; … myGrid?, … GriPhyN?, … OGS{I|A}-[DAI] ...)

Transparently Grid-Enabling PTII: Handles • Logical token transfer (3) requires get_handle (1, 2); then exec_handle (4, 5, 6, 7) for completion. • 1. A → GA: get_handle • 2. GA → A: return &X • 3. A → B: send &X • 4. B → GB: request &X • 5. GB → GA: request &X • 6. GA → GB: send *X • 7. GB → B: send done(&X) • Example: • &X = “GA.17” • *X = <some_huge_file> • [figure: actors A and B in PTII space, grid hosts GA and GB in Grid space, messages numbered 1 to 7]
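The seven-message handle exchange can be simulated in a few lines of Python to show the key point: only the small handle &X travels between the actors A and B, while the large payload *X moves directly between the grid hosts GA and GB. The GridHost class, its method names, and the trace format are assumptions of this sketch; the counter starts at 17 only so that the first handle matches the "GA.17" example.

```python
class GridHost:
    """Hypothetical grid-space host that stores payloads under handles."""

    def __init__(self, name, trace):
        self.name = name
        self.trace = trace      # shared message log
        self.store = {}         # handle -> payload held on this host
        self.next_id = 17       # chosen so the first handle is "<name>.17"

    def get_handle(self, caller, payload):
        """Messages 1 and 2: register the payload, return its handle &X."""
        self.trace.append(f"{caller}->{self.name}: get_handle")
        handle = f"{self.name}.{self.next_id}"
        self.next_id += 1
        self.store[handle] = payload
        self.trace.append(f"{self.name}->{caller}: return &X")
        return handle

    def send(self, handle, requester):
        """Message 6: ship the actual data *X to another host."""
        self.trace.append(f"{self.name}->{requester.name}: send *X")
        return self.store[handle]

    def request(self, caller, handle, hosts):
        """Messages 4, 5 and 7: fetch *X from its owner, then confirm."""
        self.trace.append(f"{caller}->{self.name}: request &X")
        owner = hosts[handle.split(".")[0]]
        self.trace.append(f"{self.name}->{owner.name}: request &X")
        self.store[handle] = owner.send(handle, self)
        self.trace.append(f"{self.name}->{caller}: send done(&X)")
        return "done"

trace = []
ga, gb = GridHost("GA", trace), GridHost("GB", trace)
hosts = {"GA": ga, "GB": gb}

handle = ga.get_handle("A", "<some_huge_file>")  # messages 1, 2
trace.append("A->B: send &X")                    # message 3: logical token
status = gb.request("B", handle, hosts)          # messages 4 to 7

for number, message in enumerate(trace, start=1):
    print(number, message)
```

In this simulation the payload never appears in a message to or from A or B, which is what keeps the PTII-space token transfer cheap.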

Transparently Grid-Enabling PTII • Different phases • Register designed WF (could include external validation service) • Find suitable grid service hosts for actors • Pre-stage execution • Execute (w/ provenance) • Interactively steer (pause; revise; resume) • Batch process; re-run parts later • Register/store data products and execution logs • Kepler implementation choices: • Grid-actors (no change of Director necessary!?) and/or • Grid-(PN)-director (also need to change actors!?) • Add grid service host id as actor parameter: A@GA • Similar for data: myDB@GA

“C-z ; bg &” – Detach your WF execution! • Currently in PTII • tight coupling of WF execution and PTII Java client (also Vergil GUI) • To-do for Kepler: • detaching WF console (Vergil) from a Grid-aware execution engine • [figure labels: Grid-PN Director!, transport protocol parameter, data location parameter, host location parameter]

Semantic Type-enabling Ptolemy II (OWL – here we go… ;-) (Kepler-SEEK; Shawn Bowers)

Semantic Type Extensions • Take concepts and relationships from an ontology to “semantically type” the data-in/out ports • Application: e.g., design support: • smart/semi-automatic wiring, generation of “massaging actors” • [figure: actor m1 (normalize) with ports p3 and p4; takes Abundance Count Measurements for Life Stages; returns Mortality Rate Derived Measurements for Life Stages]
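A small sketch of how semantic port types could support semi-automatic wiring: an output port may be connected to an input port when the output's concept is the same as, or a subconcept of, the concept the input expects. The toy is-a hierarchy, the concept names, and the ports p5 and p6 are invented for this illustration; a real system would draw the concepts from an OWL ontology and use a reasoner for subsumption.

```python
# Toy is-a hierarchy (child -> parent); an assumption of this sketch.
IS_A = {
    "AbundanceCount": "Measurement",
    "MortalityRate": "DerivedMeasurement",
    "DerivedMeasurement": "Measurement",
}

def ancestors(concept):
    """Return the concept followed by all of its superconcepts."""
    chain = [concept]
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain

def can_wire(output_type, input_type):
    """An output may feed an input if its concept is subsumed by the input's."""
    return input_type in ancestors(output_type)

def suggest_wiring(outputs, inputs):
    """List every (output port, input port) pair that is semantically valid."""
    return [
        (out_port, in_port)
        for out_port, out_type in outputs.items()
        for in_port, in_type in inputs.items()
        if can_wire(out_type, in_type)
    ]

# m1's output port p4 returns mortality rates; p5 and p6 are hypothetical
# downstream input ports.
m1_outputs = {"p4": "MortalityRate"}
downstream_inputs = {"p5": "Measurement", "p6": "AbundanceCount"}

suggestions = suggest_wiring(m1_outputs, downstream_inputs)
print(suggestions)
```

Pairs that fail the check (here p4 to p6) are the places where a design tool might propose inserting a "massaging actor" instead of a direct connection.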
