
An Extensible System for Design and Execution of Scientific Workflows


Presentation Transcript


  1. An Extensible System for Design and Execution of Scientific Workflows • San Diego Supercomputer Center (SDSC), University of California, San Diego (UCSD)

  2. Kepler (UCSD and UC Davis) • Scientific workflow management system based on Ptolemy II • Allows scientists to visually design and execute scientific workflows • Actor-oriented model with directors acting as the main workflow engine • Enables different models of computation

  3. What is Kepler? • Models the flow of data from one step to another in a series of computations to achieve a scientific goal

  4. Ptolemy II • Software system for modeling, simulation, and design of concurrent, real-time, embedded systems, developed at UC Berkeley • Objective: “The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.”

  5. Kepler 101

  6. Structure of Kepler Workflows • Directors • Actors • Ports • Relations

  7. The Director/Actor Metaphor • Directors control the execution of a workflow (scheduling, dispatching threads, etc.) • Actors are the executable components of a workflow • Directors govern the execution of Actors

  8. Actor-/Dataflow Orientation vs. Object-/Control-flow Orientation

  9. Directors • Every Kepler workflow needs a director • Execute networks of components under multiple execution models • Synchronous vs. parallel vs. dataflow vs. time-based vs. event-based vs. combinations of these • The model of computation dictates the semantics of component interaction

  10. Directors cont. • Make use of separation of concerns • e.g., component execution, workflow execution, and provenance tracking • The Manager acts as a “common execution environment” • governing the different concerns related to execution of the network and its services

  11. Common Director Types • CT – continuous-time modeling • DE – discrete-event systems • FSM – finite state machines • PN – process networks • SDF – synchronous dataflow • DDF – dynamic dataflow • SR – synchronous/reactive systems
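To make the director idea concrete, here is a minimal sketch of building and running a two-actor workflow programmatically under an SDF director. It assumes the standard Ptolemy II Java API (TypedCompositeActor, SDFDirector, Const, Recorder, Manager); in practice Kepler users assemble the same structure graphically in Vergil, and the actor and workflow names here are hypothetical.

```java
import ptolemy.actor.Manager;
import ptolemy.actor.TypedCompositeActor;
import ptolemy.actor.lib.Const;
import ptolemy.actor.lib.Recorder;
import ptolemy.domains.sdf.kernel.SDFDirector;

public class SdfHelloWorkflow {
    public static void main(String[] args) throws Exception {
        // Top-level workflow container.
        TypedCompositeActor workflow = new TypedCompositeActor();
        workflow.setName("HelloWorkflow");

        // The SDF director statically schedules the actors and fires them.
        SDFDirector director = new SDFDirector(workflow, "SDF Director");
        director.iterations.setExpression("1");

        // Two actors: a constant source and a recording sink.
        Const source = new Const(workflow, "source");
        source.value.setExpression("\"Hello from Kepler\"");
        Recorder sink = new Recorder(workflow, "sink");

        // Connect the source's output port to the sink's input port.
        workflow.connect(source.output, sink.input);

        // The Manager runs the model under the director's control.
        Manager manager = new Manager(workflow.workspace(), "manager");
        workflow.setManager(manager);
        manager.execute();

        System.out.println(sink.getLatest(0));
    }
}
```

Swapping SDFDirector for, say, a PN or DE director changes the model of computation without touching the actors, which is exactly the separation the slides describe.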

  12. Actors • Reusable components that execute a variety of functions • Communicate with other actors in the workflow through ports • Composite actor – an aggregation of actors • A composite actor may have a local director

  13. Composite Actors: Using Hierarchy to Hide Complexity • Top-level workflows can be a conceptual representation of the science process • Drilling down reveals increasing levels of detail • Composing models using hierarchy promotes the development of reusable components (sketched below)
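A rough sketch of this hierarchy in the Ptolemy II Java API: a composite actor nested inside a top-level workflow, with its own local director and only external ports visible to the parent. The names ("Preprocessing", "rawIn", "cleanOut") and the choice of PN on top of SDF are assumptions for illustration.

```java
import ptolemy.actor.TypedCompositeActor;
import ptolemy.actor.TypedIOPort;
import ptolemy.domains.pn.kernel.PNDirector;
import ptolemy.domains.sdf.kernel.SDFDirector;

public class HierarchySketch {
    public static void main(String[] args) throws Exception {
        // Top-level workflow governed by a process-network director.
        TypedCompositeActor top = new TypedCompositeActor();
        top.setName("TopLevel");
        new PNDirector(top, "PN Director");

        // Nested composite actor: from the top level it looks like a single
        // component; its local SDF director governs only the sub-workflow.
        TypedCompositeActor preprocessing =
                new TypedCompositeActor(top, "Preprocessing");
        new SDFDirector(preprocessing, "Local SDF Director");

        // External ports are all the parent workflow sees of the sub-workflow.
        new TypedIOPort(preprocessing, "rawIn", true, false);
        new TypedIOPort(preprocessing, "cleanOut", false, true);
        // ...actors wired between rawIn and cleanOut would go here...
    }
}
```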

  14. Actor Implementation • Each actor implements several methods • initialize() – initializes state variables • prefire() – indicates whether the actor is ready to fire • fire() – main point of execution • Read inputs, produce outputs, read parameter values • postfire() – update persistent state, determine whether execution is complete • wrapup() – clean up at the end of execution • Each director calls these methods according to its model of computation (a sketch follows below)
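To make the lifecycle concrete, here is a minimal sketch of a custom actor written against the standard Ptolemy II base classes; the Doubler actor itself is hypothetical. prefire(), postfire(), initialize(), and wrapup() are inherited unchanged from TypedAtomicActor, so only the constructor and fire() are shown.

```java
import ptolemy.actor.TypedAtomicActor;
import ptolemy.actor.TypedIOPort;
import ptolemy.data.DoubleToken;
import ptolemy.data.type.BaseType;
import ptolemy.kernel.CompositeEntity;
import ptolemy.kernel.util.IllegalActionException;
import ptolemy.kernel.util.NameDuplicationException;

/** Hypothetical actor that doubles each value it receives. */
public class Doubler extends TypedAtomicActor {
    public TypedIOPort input;
    public TypedIOPort output;

    public Doubler(CompositeEntity container, String name)
            throws IllegalActionException, NameDuplicationException {
        super(container, name);
        // Declare one input and one output port, both typed double.
        input = new TypedIOPort(this, "input", true, false);
        input.setTypeEquals(BaseType.DOUBLE);
        output = new TypedIOPort(this, "output", false, true);
        output.setTypeEquals(BaseType.DOUBLE);
    }

    /** The director calls fire() each iteration: read a token from the
     *  input, compute, and send the result downstream. */
    @Override
    public void fire() throws IllegalActionException {
        super.fire();
        if (input.hasToken(0)) {
            double value = ((DoubleToken) input.get(0)).doubleValue();
            output.send(0, new DoubleToken(2.0 * value));
        }
    }
}
```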

  15. Example Actors • Copy actor – copy files from one resource to another during execution • Stage actor – local to remote host • Fetch actor – remote to local host • Job execution actor – submit and run a remote job • Monitoring actor – notify the user of failures • Service discovery actor – import web services from a service repository or web site • RExpression actors • MatlabExpression actors • Web service actors – given the WSDL and the name of an operation of a web service, the actor dynamically customizes itself to implement and execute that operation • Database connection and query actors

  16. Ports • Ports are used to produce and consume data and to communicate with other actors in the workflow • Input port – data consumed by the actor • Output port – data produced by the actor • Input/output port – data both produced and consumed

  17. Relations • Relations direct the same input or output to more than one port • Example: direct an output to • a display actor to show intermediate results, and • an operational actor for further processing (sketched below)
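A sketch of this fan-out example in the Ptolemy II Java API: one relation links a single output port to two input ports, so both downstream actors receive the same data. It only builds the topology (no Manager/execute step), and Recorder stands in for the hypothetical display and analysis actors.

```java
import ptolemy.actor.TypedCompositeActor;
import ptolemy.actor.TypedIORelation;
import ptolemy.actor.lib.Ramp;
import ptolemy.actor.lib.Recorder;
import ptolemy.domains.sdf.kernel.SDFDirector;

public class FanOutSketch {
    public static void main(String[] args) throws Exception {
        TypedCompositeActor workflow = new TypedCompositeActor();
        workflow.setName("FanOut");
        SDFDirector director = new SDFDirector(workflow, "SDF Director");
        director.iterations.setExpression("3");

        Ramp source = new Ramp(workflow, "source");              // emits 0, 1, 2, ...
        Recorder display = new Recorder(workflow, "display");    // shows intermediate results
        Recorder analysis = new Recorder(workflow, "analysis");  // further processing

        // One relation, three linked ports: the output is broadcast to both inputs.
        TypedIORelation fanOut = new TypedIORelation(workflow, "fanOut");
        source.output.link(fanOut);
        display.input.link(fanOut);
        analysis.input.link(fanOut);
    }
}
```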

  18. Design & Execution of Kepler Workflows • Execution Options: • inside GUI • at command-line • distributed computing

  19. The KEPLER GUI: Vergil

  20. Application Examples: Mineral Classification with Kepler (Efrat Jaeger, GEON)

  21. Sharing Workflows • Kepler components can be shared by exporting a workflow or component into a Kepler Archive (KAR) file, an extension of the JAR file format • The Component Repository is a centralized system for sharing Kepler workflows • Users can search the repository for components from within Vergil

  22. Access to Scientific Data • Kepler provides direct access to scientific data archived in many commonly used data archives • For example, access to data stored in the Knowledge Network for Biocomplexity (KNB) Metacat server and described using the Ecological Metadata Language (EML) • Additional supported data sources • DiGIR protocol, OPeNDAP protocol, GridFTP, JDBC, SRB, and others

  23. Grid Implementations • Kepler ships by default with: • Globus actors • GridFTP actors • No BES implementation* • Job submission to OpenPBS and gLite • Kepler actors capable of using UNICORE, developed by the EUFORIA project (Poznań Supercomputing and Networking Center) • TeraGrid gateways exist that use Kepler

  24. The End

  25. BONUS SLIDES

  26. Types • Actor Data Polymorphism: • Add numbers (int, float, double, complex) • Add strings (concatenation) • Add complex types (arrays, records, matrices) • Add user-defined types
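A small sketch of what this polymorphism looks like at the token level in the Ptolemy II data package: the same add() call handles integers, doubles, and strings, with the type system converting operands where the conversion is lossless. The exact printed formatting of the tokens is an assumption.

```java
import ptolemy.data.DoubleToken;
import ptolemy.data.IntToken;
import ptolemy.data.StringToken;
import ptolemy.data.Token;

public class PolymorphicAddSketch {
    public static void main(String[] args) throws Exception {
        // int + double: the int operand is converted up, result is a double token.
        Token numeric = new IntToken(2).add(new DoubleToken(3.5));

        // string + string: addition is concatenation, as on the slide above.
        Token text = new StringToken("Kep").add(new StringToken("ler"));

        System.out.println(numeric); // 5.5
        System.out.println(text);    // "Kepler"
    }
}
```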

  27. Workflow can contain Cycles

  28. R&D from ‘07 • Distributed execution of workflow parts (peer to peer) • Efficient data transfer • Provenance tracking of data and processes • Tracking workflow evolution • Streaming data analysis • Easy-to-deploy batch interfaces • Intuitive workflow design • Customizable semantic typing • Interoperability with other workflow and analytical environments (at exec level)

  29. Applications by Discipline • Ecology – SEEK: Ecological niche modeling and climate change; REAP: Modeling parasite invasions in grasslands using sensor networks; NEON: Ecological sensor networks; COMET: Environmental science • Geosciences – GEON: LiDAR data processing, geological data integration; NEESit: Earthquake engineering • Molecular biology – SDM: Gene promoter identification and ScalaBLAST; ChIP-chip: Genome-scale research; CAMERA: Metagenomics • Oceanography – REAP: SST data processing; LOOKING/OOI CI: Ocean observing CI; ROADNet: Real-time data modeling and analysis • Phylogenetics – ATOL: Processing phylodata; CiPRES: Phylogenetic tools • Chemistry – Resurgence: Computational chemistry; DART/ARCHER: X-ray crystallography • Library science – DIGARCH: Digital preservation; UK Text Mining Center: Cheshire feature and archival • Conservation biology – SanParks: Thresholds of Potential Concern • Physics – SDM: Astrophysics TSI-1 and TSI-2; CPES: Plasma fusion simulation; ITER-EU: ITM fusion workflows
