480 likes | 669 Views
2. Outline . Project Overviewfrom Ptolemy II to KeplerWorkflow Modeling Issuesfrom Dataflow to Control-flow (CCA et al)Current Kepler Featuresfrom plumbing to distributed executionExample Workflowsfrom bioinformatics to geoinformaticsFuture Plansfrom today to tomorrow ;-). 3. What is a Sci
E N D
1. The KEPLER Scientific Workflow System
2. 2 Outline Project Overview
from Ptolemy II to Kepler
Workflow Modeling Issues
from Dataflow to Control-flow (CCA et al)
Current Kepler Features
from plumbing to distributed execution
Example Workflows
from bioinformatics to geoinformatics
Future Plans
from today to tomorrow ;-)
3. 3 What is a Scientific Workflow (SWF)? Goals:
automate a scientist’s repetitive steps (data analysis, data transformation, computational steps, …)
can encompass data generation, aggregation, analysis, visualization (WF granularity)
design, test, share, deploy, execute, reuse, … SWFs
Typical requirements/characteristics:
data-intensive and/or compute-intensive
plumbing-intensive
dataflow-oriented
distribution (data, processing)
user-interaction “in the middle”, …
… vs. (C-z; bg; fg)-ing (“detach” and reconnect)
advanced programming constructs (map(f), zip, takewhile, …)
logging, provenance, “registering back” (intermediate) products…
… easy to recognize a SWF when you see one!
4. 4 Promoter Identification Workflow
5. 5
6. 6 Ecology: GARP Analysis Pipeline for Invasive Species Prediction
7. 7
8. Starting Point for SDM-Center/SPA + SEEK: Ptolemy II
9. 9
10. 10 Why Ptolemy II (and thus KEPLER)? Ptolemy II Objective:
“The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.”
Dataflow Process Networks w/ natural pipelining/streaming support
User-Orientation
Workflow design & exec console (Vergil GUI)
“Application/Glue-Ware”
excellent modeling and design support
run-time support, monitoring, …
not a middle-/underware (we use someone else’s, e.g. Globus, SRB, …)
but middle-/underware is conveniently accessible through actors!
PRAGMATICS
Ptolemy II is mature, continuously extended & improved, well-documented (500+pp)
open source system
Ptolemy II folks actively participate in KEPLER
11. 11 KEPLER: An Open Collaboration “Founding projects”:
DOE SDM/SPA and NSF SEEK
Open Source (BSD-style license)
Intensive Communications:
Web-archived mailing lists
IRC (!)
Co-development:
via shared CVS repository
joining as a new co-developer (currently):
get a CVS account (read-only)
local development + contribution via existing KEPLER member
be voted “in” as a member/co-developer
Software & social engineering
How to better accommodate new groups/communities?
How to better accommodate different usage/contribution models (core dev … special purpose extender … user)?
12. 12 KEPLER/CSP: Contributors, Sponsors, Projects(or loosely coupled Communicating Sequential Persons ;-) Ilkay Altintas SDM, Resurgence
Kim Baldridge Resurgence, NMI
Chad Berkley SEEK
Shawn Bowers SEEK
Terence Critchlow SDM
Tobin Fricke ROADNet
Jeffrey Grethe BIRN
Christopher H. Brooks Ptolemy II
Zhengang Cheng SDM
Dan Higgins SEEK
Efrat Jaeger GEON
Matt Jones SEEK
Werner Krebs, EOL
Edward A. Lee Ptolemy II
Kai Lin GEON
Bertram Ludaescher BIRN, SDM, SEEK, GEON
Mark Miller EOL
Steve Mock NMI
Steve Neuendorffer Ptolemy II
Jing Tao SEEK
Mladen Vouk SDM
Xiaowen Xin SDM
Yang Zhao Ptolemy II
Bing Zhu SEEK
•••
13. 13 History Gabriel (1986-1991)
Written in Lisp
Aimed at signal processing
Synchronous dataflow (SDF) block diagrams
Parallel schedulers
Code generators for DSPs
Hardware/software co-simulators
Ptolemy Classic (1990-1997)
Written in C++
Multiple models of computation
Hierarchical heterogeneity
Dataflow variants: BDF, DDF, PN
C/VHDL/DSP code generators
Optimizing SDF schedulers
Higher-order components
Ptolemy II (1996-2022)
Written in Java
Domain polymorphism
Multithreaded
Network integrated
Modal models
Sophisticated type system
CT, HDF, CI, GR, etc.
14. 14 KEPLER then …
15. 15 … and KEPLER today…
16. 16 The KEPLER/Ptolemy II GUI (Vergil)
17. 17 Actor-Oriented Design Object orientation:
18. 18 Object-Oriented vs.Actor-Oriented Interfaces Actor/Dataflow
Oriented
19. 19 Ptolemy II: Actor-Oriented Modeling Component (“actor”) interaction semantics not hard-wired inside components, but “factored out” in a “director”
Different directors for different modeling and execution needs (… can even be combined!)
Better abstraction, modeling, component reuse, …
20. 20 Behavioral Polymorphism in Ptolemy
21. 21 Component Composition & Interaction Components linked via ports
Dataflow (and msg/ctl-flow)
Where is the component interaction semantics defined??
each component is its own director!
But still useful for special applications, e.g. parallel programs (MPI, …)
22. 22 Data/Control-Flow Spectrum Data (tokens) flow
(almost) no other side effects
WYSIWYG (usually)
References flow
token reference type may be “http-get”, “ftp-get”, “hsi put”…
generic handling still possible
Application specific tokens flow
e.g. current Nimrod job management in Resurgence
“invisible contract” between components
Director is unaware of what’s going on … (sounds familiar? ;-)
Specific messages passing protocols (e.g., CSP, MPI)
for systems of tightly coupled components
23. 23 CCA via special (“look the other way”) Director(s)?
24. 24 Domains and Directors: Semantics for Component Interaction CI – Push/pull component interaction
CSP – concurrent threads with rendezvous
CT – continuous-time modeling
DE – discrete-event systems
DDE – distributed discrete events
FSM – finite state machines
DT – discrete time (cycle driven)
Giotto – synchronous periodic
GR – 2-D and 3-D graphics
PN – process networks
SDF – synchronous dataflow
SR – synchronous/reactive
TM – timed multitasking
25. 25 Polymorphic Actor Components Working Across Data Types and Domains Actor Data Polymorphism:
Add numbers (int, float, double, Complex)
Add strings (concatenation)
Add complex types (arrays, records, matrices)
Add user-defined types
Actor Behavioral Polymorphism:
In dataflow, add when all connected inputs have data
In a time-triggered model, add when the clock ticks
In discrete-event, add when any connected input has data, and add in zero time
In process networks, execute an infinite loop in a thread that blocks when reading empty inputs
In CSP, execute an infinite loop that performs rendezvous on input or output
In push/pull, ports are push or pull (declared or inferred) and behave accordingly
In real-time CORBA, priorities are associated with ports and a dispatcher determines when to add
26. 26 Directors and Combining Different Component Interaction Semantics
27. A Few Specific Kepler Features and Example Workflows
28. 28 Web Services ? Actors (WS Harvester)
29. 29 Recent Actor Additions
30. 30 Digression: Who are the clients? Domain scientists
C/Perl/Python/Java/WS/DB-enabled ones
others (the rest of us?)
Goal: make the life better for both categories!
Workflow automation
Plumbing support
Execution monitoring, steering, runtime revision (pause-inspect-modify-resume cycle)
31. 31 GEON Mineral Classification Workflow This triangular diagram for classification and nomenclature of gabbroic rock is chosen for a specific point according to the values it has in the ModalData DB. It is used when the point has values for Plagioclase, Pyroxene and Olivine.
The ModalData provides the mineral info.This triangular diagram for classification and nomenclature of gabbroic rock is chosen for a specific point according to the values it has in the ModalData DB. It is used when the point has values for Plagioclase, Pyroxene and Olivine.
The ModalData provides the mineral info.
32. 32 … inside the Classifier This triangular diagram for classification and nomenclature of gabbroic rock is chosen for a specific point according to the values it has in the ModalData DB. It is used when the point has values for Plagioclase, Pyroxene and Olivine.
The ModalData provides the mineral info.This triangular diagram for classification and nomenclature of gabbroic rock is chosen for a specific point according to the values it has in the ModalData DB. It is used when the point has values for Plagioclase, Pyroxene and Olivine.
The ModalData provides the mineral info.
33. 33 GEON Dataset Generation & Registration(and co-development in KEPLER)
34. 34 GEON Data Registration UI
35. 35 GEON Data Registration in KEPLER
36. 36 Registered Resources show up in Vergil (joint SEEK, SPA, GEON, … Registry!?)
37. 37 Data Analysis: Biodiversity Indices
38. 38
39. 39
40. 40
41. 41 Re-engineered PIW w/ Iteration Constructs AD 2004
42. 42 Streaming Real-time Data
43. 43
44. 44 Job Management (here: NIMROD)
45. 45 KEPLER Today Support for SWF life cycle
Design, share, prototype, run, monitor, deploy, …
Coarse-grained scientific workflows, e.g.,
web service actors, grid actors, command-line actors, …
Fine grained workflows and simulations, e.g.,
Database access, XSLT transformations, …
Kepler Extensions
SDM Center/SPA: support for data- and compute-intensive workflows!
real-time data streaming (ROADNet)
other special and generic extensions (e.g. GEON, SEEK)
Status
first release (alpha) was in May 2004
nightly builds w/ version tests
“Link-Up Sister Project” w/ other SWF systems (UK Taverna, Triana, …)
Participation in various workshops and conferences (GGF10, SSDBMs, eScience WF workshop, …)
46. 46 KEPLER Tomorrow Application-driven extensions:
access to/integration with other IDMAF components
SciRUN?, PnetCDF?, PVFS(2)?, MPI-IO?, parallel-R?, ASPECT?, FastBit, …
support for execution of new SWF domains
Astrophysics: TSI/Blondin (SPA/NCSU)
Nuclear Physics: Swesty (SPA/LLNL)
…
Generic extensions:
addtl. support for data-intensive and compute-intensive workflows (all SRB Scommands, CCA support, …)
(C-z; bg; fg)-ing (“detach” and reconnect)
workflow deployment models
Additional “domain awareness” (e.g. via new directors)
time series, parameter sweeps, job scheduling, …
hybrid type system with semantic types
Consolidation
More installers, regular releases, improved documentation, …
47. 47 KEPLER & SPA First alpha releases since May 2004
48. 48 Breaking into the Parallel (MPI) and Stream Computing Worlds!? Clean functional semantics facilitates algebraic workflow (program) transformations (Bird-Meertens); e.g. mapS f • mapS g ? mapS (f • g)
49. 49 Hybrid Types (Structure + Semantics) Services can be semantically compatible, but structurally incompatible