Experience with Fusion Workflows
Norbert Podhorszki, Bertram Ludäscher
Department of Computer Science, University of California, Davis
kepler-project.org
New Challenges
• The CPES project brought new challenges for Kepler and the workflow automation people:
  • Remote computations, services and tools
  • Long-running simulations, large amounts of data
  • One-time passwords
• Workflow = “Glue”
  • Scientists only need to connect individual components together
  • Automate tedious processes (logins, data copies, control, start/stop)
  • Do it reliably
  • Show what is going on
Workflows
• Real-time monitoring of the simulation:
  • Transfer the current data set to a secondary resource
  • Execute short analysis/visualization routines
  • Display the result
• Archival and post-processing:
  • Transfer, pack and archive data sets on the fly
Kepler actors for CPES
• Job submission to various resource managers
• Permanent SSH connection to perform tasks on a remote machine
• Generalized actors (workflows themselves) for specified tasks:
  • Watch a remote directory for simulation timesteps
  • Execute an external command on a remote machine
  • Tar and archive data in large chunks to HPSS
  • Transfer a remote image file and display it on screen
  • Control a running SCIRun server remotely
• The above actors do logging/checkpointing
  • The final workflow can be stopped and restarted
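The stop/restart behaviour that logging/checkpointing provides can be sketched in a few lines of Python. This is only an illustration of the idea, not the Kepler actors themselves: the function names, the JSON checkpoint file and the `handle` callback are all hypothetical, and the directory listing is passed in as a plain list rather than fetched over SSH.

```python
import json
from pathlib import Path


def load_checkpoint(path):
    """Return the set of file names already processed (empty on first run)."""
    p = Path(path)
    if not p.exists():
        return set()
    return set(json.loads(p.read_text()))


def save_checkpoint(path, done):
    """Persist the processed set: write a temp file first, then rename it."""
    p = Path(path)
    tmp = p.with_name(p.name + ".tmp")
    tmp.write_text(json.dumps(sorted(done)))
    tmp.replace(p)


def process_new_files(listing, checkpoint_path, handle):
    """Call handle(name) for each listed file not yet in the checkpoint.

    `listing` stands in for the result of watching a remote directory;
    how it is obtained is outside this sketch.
    """
    done = load_checkpoint(checkpoint_path)
    processed = []
    for name in sorted(listing):
        if name in done:
            continue  # handled in an earlier run, skip after restart
        handle(name)
        done.add(name)
        save_checkpoint(checkpoint_path, done)  # checkpoint after every file
        processed.append(name)
    return processed
```

Calling `process_new_files` again after a restart, with a longer listing, only hands the newly appeared timestep files to `handle`.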
Plasma physics simulation on 2048 processors on Seaborg@NERSC (LBL)
• Gyrokinetic Toroidal Code (GTC) to study energy transport in fusion devices (plasma microturbulence)
• Generating 800 GB of data (3000 files, 6000 timesteps, 267 MB/timestep) in a 30+ hour simulation run
Under workflow control:
• Monitor (watch) simulation progress (via remote scripts)
• Transfer from NERSC to ORNL concurrently with the simulation run
• Convert each file to an HDF5 file
• Archive files in 4 GB chunks to HPSS
[Figure: Archival Workflow, with stages Monitor, Transfer, Convert, Archive]
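The archive step groups converted files into chunks of bounded size before they go to HPSS. A minimal sketch of such grouping, assuming a simple greedy, in-order packing and reading "4 GB" as 4 GiB (both are assumptions, as is the function name; the actual workflow may pack differently):

```python
def group_into_chunks(files, max_bytes=4 * 1024**3):
    """Greedily pack (name, size_in_bytes) pairs, in order, into chunks.

    Each chunk holds at most max_bytes, except that a single file larger
    than max_bytes is placed in a chunk of its own.
    """
    chunks, current, current_size = [], [], 0
    for name, size in files:
        # Close the current chunk if adding this file would exceed the limit.
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)  # flush the last, possibly partial, chunk
    return chunks
```

Keeping the files in arrival order means a chunk can be archived as soon as it is closed, while the simulation is still producing later files.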
Future Plans
• Currently we have specialized actors that should be generalized for other disciplines and systems:
  • “Watching for” simulation output
  • Safe and robust transfer, recovery from failure
  • Archiving to different mass storage systems (MSS), with different security policies, robust to failures and maintenance periods
• The next workflow is cyclic, not just streaming:
  • Couple two simulations on two resources, transfer data and control between them
  • Use the local job manager for code execution
• What about provenance management?
  • The main reason to use a scientific workflow system, e.g. in bioinformatics workflows; needed for debugging runs, interpreting results, etc.
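One common way to make a transfer or archive step "robust to failures" is to retry it with increasing delays. The sketch below shows that generic pattern only; it is an assumption about how such robustness could be achieved, not a description of the planned Kepler actors, and `with_retries` and its parameters are hypothetical names.

```python
import time


def with_retries(operation, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run operation(); on an exception, wait and retry with doubling delays.

    `operation` is any zero-argument callable, e.g. one file transfer.
    `sleep` is injectable so the waiting can be replaced in tests.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last failure
            sleep(base_delay * 2 ** attempt)
```

For a maintenance period lasting hours, the same pattern would need a much larger `base_delay` or a cap on the delay instead of unbounded doubling.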
Author: Tim McPhillips, UC Davis
There is more, e.g., how to get from messy to neat & reusable designs?
The Answer (YMMV)
• Collection-Oriented Modeling & Design (COMAD)
  • Embrace an assembly-line metaphor
  • Data = tagged nested collections
  • E.g. represented as flattened, pipelined token streams
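To make "flattened token streams" concrete, here is a small Python sketch that turns a nested collection into a flat sequence of tokens, with delimiter tokens marking where each collection opens and closes. The token names (`open`, `data`, `close`) and the `(name, [items])` representation are illustrative assumptions, not COMAD's actual token format, and tags (metadata) are left out for brevity.

```python
def flatten(name, items):
    """Yield a flat token stream for a nested collection.

    A nested collection is a (name, [items]) pair; any other item is
    emitted as a data token.
    """
    yield ("open", name)
    for item in items:
        is_collection = (
            isinstance(item, tuple)
            and len(item) == 2
            and isinstance(item[1], list)
        )
        if is_collection:
            yield from flatten(item[0], item[1])  # recurse into sub-collection
        else:
            yield ("data", item)
    yield ("close", name)


# Usage: a project holding one run (with two data items) and one loose item.
for token in flatten("project", [("run1", ["a", "b"]), "c"]):
    print(token)
```

Because the stream is flat, a downstream stage can process tokens one at a time, assembly-line style, acting on the data it understands and passing everything else through unchanged.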