The KEPLER Scientific Workflow System


    1. The KEPLER Scientific Workflow System

    2. 2 Outline: Project Overview (from Ptolemy II to Kepler); Workflow Modeling Issues (from Dataflow to Control-flow (CCA et al)); Current Kepler Features (from plumbing to distributed execution); Example Workflows (from bioinformatics to geoinformatics); Future Plans (from today to tomorrow ;-)

    3. 3 What is a Scientific Workflow (SWF)? Goals: automate a scientist’s repetitive steps (data analysis, data transformation, computational steps, …); can encompass data generation, aggregation, analysis, visualization (WF granularity); design, test, share, deploy, execute, reuse, … SWFs. Typical requirements/characteristics: data-intensive and/or compute-intensive; plumbing-intensive; dataflow-oriented; distribution (data, processing); user-interaction “in the middle”, … vs. (C-z; bg; fg)-ing (“detach” and reconnect); advanced programming constructs (map(f), zip, takewhile, …); logging, provenance, “registering back” (intermediate) products, … Easy to recognize a SWF when you see one!

    4. 4 Promoter Identification Workflow

    5. 5

    6. 6 Ecology: GARP Analysis Pipeline for Invasive Species Prediction

    7. 7

    8. Starting Point for SDM-Center/SPA + SEEK: Ptolemy II

    9. 9

    10. 10 Why Ptolemy II (and thus KEPLER)? Ptolemy II Objective: “The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.” Dataflow Process Networks w/ natural pipelining/streaming support. User-Orientation: workflow design & exec console (Vergil GUI). “Application/Glue-Ware”: excellent modeling and design support; run-time support, monitoring, …; not a middle-/underware (we use someone else’s, e.g. Globus, SRB, …), but middle-/underware is conveniently accessible through actors! PRAGMATICS: Ptolemy II is a mature, continuously extended & improved, well-documented (500+pp) open source system; Ptolemy II folks actively participate in KEPLER.

    11. 11 KEPLER: An Open Collaboration. “Founding projects”: DOE SDM/SPA and NSF SEEK. Open Source (BSD-style license). Intensive Communications: web-archived mailing lists; IRC (!). Co-development via shared CVS repository. Joining as a new co-developer (currently): get a CVS account (read-only); local development + contribution via existing KEPLER member; be voted “in” as a member/co-developer. Software & social engineering: How to better accommodate new groups/communities? How to better accommodate different usage/contribution models (core dev … special purpose extender … user)?

    12. 12 KEPLER/CSP: Contributors, Sponsors, Projects (or loosely coupled Communicating Sequential Persons ;-): Ilkay Altintas (SDM, Resurgence); Kim Baldridge (Resurgence, NMI); Chad Berkley (SEEK); Shawn Bowers (SEEK); Terence Critchlow (SDM); Tobin Fricke (ROADNet); Jeffrey Grethe (BIRN); Christopher H. Brooks (Ptolemy II); Zhengang Cheng (SDM); Dan Higgins (SEEK); Efrat Jaeger (GEON); Matt Jones (SEEK); Werner Krebs (EOL); Edward A. Lee (Ptolemy II); Kai Lin (GEON); Bertram Ludaescher (BIRN, SDM, SEEK, GEON); Mark Miller (EOL); Steve Mock (NMI); Steve Neuendorffer (Ptolemy II); Jing Tao (SEEK); Mladen Vouk (SDM); Xiaowen Xin (SDM); Yang Zhao (Ptolemy II); Bing Zhu (SEEK); …

    13. 13 History. Gabriel (1986-1991): written in Lisp; aimed at signal processing; synchronous dataflow (SDF) block diagrams; parallel schedulers; code generators for DSPs; hardware/software co-simulators. Ptolemy Classic (1990-1997): written in C++; multiple models of computation; hierarchical heterogeneity; dataflow variants: BDF, DDF, PN; C/VHDL/DSP code generators; optimizing SDF schedulers; higher-order components. Ptolemy II (1996-2022): written in Java; domain polymorphism; multithreaded; network integrated; modal models; sophisticated type system; CT, HDF, CI, GR, etc.

    14. 14 KEPLER then …

    15. 15 … and KEPLER today…

    16. 16 The KEPLER/Ptolemy II GUI (Vergil)

    17. 17 Actor-Oriented Design Object orientation:

    18. 18 Object-Oriented vs. Actor-Oriented Interfaces Actor/Dataflow Oriented

    19. 19 Ptolemy II: Actor-Oriented Modeling. Component (“actor”) interaction semantics are not hard-wired inside components, but “factored out” in a “director”. Different directors for different modeling and execution needs (… can even be combined!). Better abstraction, modeling, component reuse, …
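The idea of factoring interaction semantics out of the actors and into a director can be illustrated with a small sketch. This is not Ptolemy II or Kepler code: the classes `Actor`, `DataDrivenDirector`, and `DrainEachDirector` and the helper `build_pipeline` are hypothetical names chosen for illustration, and the two scheduling policies are simplified stand-ins for real directors. The point is only that the same actors, unchanged, run under either policy.

```python
from collections import deque


class Actor:
    """A component that only knows how to fire; it contains no scheduling logic."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.inbox = deque()   # tokens waiting on the single input port
        self.targets = []      # downstream actors wired to the output port

    def fire(self):
        # Consume one token, compute, and forward the result downstream.
        token = self.inbox.popleft()
        result = self.func(token)
        for target in self.targets:
            target.inbox.append(result)
        return result


class DataDrivenDirector:
    """Hypothetical policy: sweep over the actors, firing each one that has a token."""

    def run(self, actors):
        trace = []
        progress = True
        while progress:
            progress = False
            for actor in actors:
                if actor.inbox:
                    actor.fire()
                    trace.append(actor.name)
                    progress = True
        return trace


class DrainEachDirector:
    """Hypothetical policy: each actor consumes all pending tokens before the next runs.

    Assumes the actors are listed in upstream-to-downstream order.
    """

    def run(self, actors):
        trace = []
        for actor in actors:
            while actor.inbox:
                actor.fire()
                trace.append(actor.name)
        return trace


def build_pipeline():
    """Build a two-actor pipeline (double -> sink) preloaded with tokens 1, 2, 3."""
    collected = []
    double = Actor("double", lambda x: 2 * x)
    sink = Actor("sink", lambda x: collected.append(x) or x)
    double.targets.append(sink)
    double.inbox.extend([1, 2, 3])
    return [double, sink], collected
```

Both directors deliver the same tokens to the sink; only the firing order differs, which is the part of the behavior the director owns.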

    20. 20 Behavioral Polymorphism in Ptolemy

    21. 21 Component Composition & Interaction. Components linked via ports. Dataflow (and msg/ctl-flow). Where is the component interaction semantics defined?? Each component is its own director! But still useful for special applications, e.g. parallel programs (MPI, …)

    22. 22 Data/Control-Flow Spectrum. Data (tokens) flow: (almost) no other side effects; WYSIWYG (usually). References flow: token reference type may be “http-get”, “ftp-get”, “hsi put”, …; generic handling still possible. Application-specific tokens flow: e.g. current Nimrod job management in Resurgence; “invisible contract” between components; Director is unaware of what’s going on … (sounds familiar? ;-). Specific message-passing protocols (e.g., CSP, MPI): for systems of tightly coupled components

    23. 23 CCA via special (“look the other way”) Director(s)?

    24. 24 Domains and Directors: Semantics for Component Interaction. CI – push/pull component interaction; CSP – concurrent threads with rendezvous; CT – continuous-time modeling; DE – discrete-event systems; DDE – distributed discrete events; FSM – finite state machines; DT – discrete time (cycle driven); Giotto – synchronous periodic; GR – 2-D and 3-D graphics; PN – process networks; SDF – synchronous dataflow; SR – synchronous/reactive; TM – timed multitasking

    25. 25 Polymorphic Actor Components Working Across Data Types and Domains. Actor Data Polymorphism: add numbers (int, float, double, Complex); add strings (concatenation); add complex types (arrays, records, matrices); add user-defined types. Actor Behavioral Polymorphism: in dataflow, add when all connected inputs have data; in a time-triggered model, add when the clock ticks; in discrete-event, add when any connected input has data, and add in zero time; in process networks, execute an infinite loop in a thread that blocks when reading empty inputs; in CSP, execute an infinite loop that performs rendezvous on input or output; in push/pull, ports are push or pull (declared or inferred) and behave accordingly; in real-time CORBA, priorities are associated with ports and a dispatcher determines when to add
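The two kinds of polymorphism on the "add" actor can be sketched as follows. This is an illustrative sketch only, not the Ptolemy II actor API: `add_tokens`, `dataflow_should_fire`, `discrete_event_should_fire`, and `fire_add` are hypothetical names, input ports are modeled as plain Python lists, and only the dataflow and discrete-event firing rules from the slide are shown.

```python
from functools import reduce
import operator


def add_tokens(tokens):
    """Data polymorphism: '+' sums numbers, concatenates strings, and joins lists."""
    return reduce(operator.add, tokens)


def dataflow_should_fire(inputs):
    """Dataflow rule: add when all connected inputs have data."""
    return all(len(queue) > 0 for queue in inputs)


def discrete_event_should_fire(inputs):
    """Discrete-event rule: add when any connected input has data."""
    return any(len(queue) > 0 for queue in inputs)


def fire_add(inputs, should_fire):
    """Behavioral polymorphism: the same add logic under a pluggable firing rule.

    Returns None when the rule says the actor must not fire yet.
    """
    if not should_fire(inputs):
        return None
    # Take the head token from every input that currently holds data.
    heads = [queue.pop(0) for queue in inputs if queue]
    return add_tokens(heads)
```

With one input holding a token and the other empty, `fire_add` does nothing under the dataflow rule but fires under the discrete-event rule, while the addition itself is identical in both cases.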

    26. 26 Directors and Combining Different Component Interaction Semantics

    27. A Few Specific Kepler Features and Example Workflows

    28. 28 Web Services → Actors (WS Harvester)

    29. 29 Recent Actor Additions

    30. 30 Digression: Who are the clients? Domain scientists: C/Perl/Python/Java/WS/DB-enabled ones; others (the rest of us?). Goal: make life better for both categories! Workflow automation; plumbing support; execution monitoring, steering, runtime revision (pause-inspect-modify-resume cycle)

    31. 31 GEON Mineral Classification Workflow. This triangular diagram for classification and nomenclature of gabbroic rock is chosen for a specific point according to the values it has in the ModalData DB. It is used when the point has values for Plagioclase, Pyroxene and Olivine. The ModalData provides the mineral info.

    32. 32 … inside the Classifier. This triangular diagram for classification and nomenclature of gabbroic rock is chosen for a specific point according to the values it has in the ModalData DB. It is used when the point has values for Plagioclase, Pyroxene and Olivine. The ModalData provides the mineral info.

    33. 33 GEON Dataset Generation & Registration (and co-development in KEPLER)

    34. 34 GEON Data Registration UI

    35. 35 GEON Data Registration in KEPLER

    36. 36 Registered Resources show up in Vergil (joint SEEK, SPA, GEON, … Registry!?)

    37. 37 Data Analysis: Biodiversity Indices

    38. 38

    39. 39

    40. 40

    41. 41 Re-engineered PIW w/ Iteration Constructs AD 2004

    42. 42 Streaming Real-time Data

    43. 43

    44. 44 Job Management (here: NIMROD)

    45. 45 KEPLER Today. Support for SWF life cycle: design, share, prototype, run, monitor, deploy, …; coarse-grained scientific workflows, e.g., web service actors, grid actors, command-line actors, …; fine-grained workflows and simulations, e.g., database access, XSLT transformations, … Kepler Extensions: SDM Center/SPA support for data- and compute-intensive workflows!; real-time data streaming (ROADNet); other special and generic extensions (e.g. GEON, SEEK). Status: first release (alpha) was in May 2004; nightly builds w/ version tests; “Link-Up Sister Project” w/ other SWF systems (UK Taverna, Triana, …); participation in various workshops and conferences (GGF10, SSDBMs, eScience WF workshop, …)

    46. 46 KEPLER Tomorrow. Application-driven extensions: access to/integration with other IDMAF components (SciRUN?, PnetCDF?, PVFS(2)?, MPI-IO?, parallel-R?, ASPECT?, FastBit, …); support for execution of new SWF domains (Astrophysics: TSI/Blondin (SPA/NCSU); Nuclear Physics: Swesty (SPA/LLNL); …). Generic extensions: additional support for data-intensive and compute-intensive workflows (all SRB Scommands, CCA support, …); (C-z; bg; fg)-ing (“detach” and reconnect); workflow deployment models. Additional “domain awareness” (e.g. via new directors): time series, parameter sweeps, job scheduling, …; hybrid type system with semantic types. Consolidation: more installers, regular releases, improved documentation, …

    47. 47 KEPLER & SPA First alpha releases since May 2004

    48. 48 Breaking into the Parallel (MPI) and Stream Computing Worlds!? Clean functional semantics facilitates algebraic workflow (program) transformations (Bird-Meertens); e.g. mapS f • mapS g ⇒ mapS (f • g)
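The map-fusion rewrite on this slide can be checked on a small example. The sketch below assumes that `mapS` denotes a map over streams and models a stream as a lazy Python iterable; `map_stream`, `compose`, `unfused`, and `fused` are hypothetical helper names, not Kepler functions. The equivalence holds when `f` and `g` are free of side effects.

```python
def map_stream(f):
    """Lift f to a function over streams (iterables), applied lazily."""
    def mapped(stream):
        return (f(x) for x in stream)
    return mapped


def compose(f, g):
    """Function composition: compose(f, g)(x) == f(g(x))."""
    return lambda x: f(g(x))


def unfused(f, g):
    """mapS f . mapS g: two separate stream stages chained together."""
    return compose(map_stream(f), map_stream(g))


def fused(f, g):
    """mapS (f . g): a single stream stage applying the composed function."""
    return map_stream(compose(f, g))


def increment(x):
    return x + 1


def double(x):
    return x * 2
```

For example, `list(unfused(increment, double)(range(5)))` and `list(fused(increment, double)(range(5)))` both yield `[1, 3, 5, 7, 9]`; the fused form needs one pipeline stage instead of two, which is the kind of saving such a workflow transformation aims for.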

    49. 49 Hybrid Types (Structure + Semantics) Services can be semantically compatible, but structurally incompatible
