Experiences Running Seismic Hazard Workflows
Scott Callaghan, Southern California Earthquake Center, University of Southern California
SC13 Workflow BoF
Overview of SCEC Computational Problem • Domain: Probabilistic Seismic Hazard Analysis (PSHA) • PSHA computation defined as a workflow with two parts • “Strain Green Tensor Workflow” • ~10 jobs, largest are 4,000 core-hours • Output: two 20 GB files (40 GB total) • “Post-processing Workflow” • High throughput • ~410,000 serial jobs, runtimes under 1 min • Output: 14,000 files totaling 12 GB • Uses Pegasus-WMS, HTCondor, Globus • Calculates hazard for one location
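The two-part structure above can be sketched as a small dependency graph: a few parallel Strain Green Tensor (SGT) jobs feed a large fan-out of serial post-processing tasks. This is a minimal illustrative model, not SCEC's actual Pegasus DAX; the job names (`mesh_gen`, `sgt_x`, `seismogram_*`) are hypothetical stand-ins.

```python
# Minimal sketch of the two-stage PSHA workflow structure described above.
# Job names are illustrative, not SCEC's actual Pegasus DAX.
from collections import defaultdict

class Workflow:
    """A tiny DAG of named jobs with dependency edges."""
    def __init__(self):
        self.jobs = set()
        self.parents = defaultdict(set)

    def add_job(self, name, depends_on=()):
        self.jobs.add(name)
        for parent in depends_on:
            self.parents[name].add(parent)

    def topological_order(self):
        """Return jobs in an order that respects dependencies."""
        order, done = [], set()
        def visit(job):
            if job in done:
                return
            for parent in self.parents[job]:
                visit(parent)
            done.add(job)
            order.append(job)
        for job in sorted(self.jobs):
            visit(job)
        return order

wf = Workflow()
# "Strain Green Tensor" stage: a handful of large parallel jobs.
wf.add_job("mesh_gen")
wf.add_job("sgt_x", depends_on=["mesh_gen"])
wf.add_job("sgt_y", depends_on=["mesh_gen"])
# Post-processing stage: many serial tasks, all downstream of the SGTs.
for i in range(3):  # stand-in for ~410,000 serial jobs
    wf.add_job(f"seismogram_{i}", depends_on=["sgt_x", "sgt_y"])

print(wf.topological_order())
```

A real workflow system adds file tracking, retries, and site selection on top of exactly this kind of DAG.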
Recent PSHA Research Calculation Using Workflows – CyberShake Study 13.4 • PSHA calculation exploring the impact of alternative velocity models and alternative codes, defined as 1144 workflows • Parallel Strain Green Tensor (SGT) workflow on Blue Waters • No remote job submission support in Spring 2013 • Created a basic workflow system using qsub dependencies • Post-processing workflow on Stampede • Pegasus-MPI-Cluster manages the serial jobs • A single job is submitted to Stampede • Master-worker paradigm; work is distributed to workers • Relatively little output, so writes were aggregated in the master
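The "basic workflow system using qsub dependencies" mentioned above can be sketched with the Torque/PBS `-W depend=afterok:<jobid>` flag, which holds a job until its parent exits successfully. In this sketch `qsub` is only simulated so the dependency wiring can be shown without a live scheduler; the script names are hypothetical.

```python
# Sketch of chaining PBS jobs with qsub dependencies. qsub is simulated
# here; a real version would run the command and parse the job id from
# qsub's stdout.
import itertools

_ids = itertools.count(1)

def submit(script, after=None):
    """Build (and here just print) the qsub command, returning a job id."""
    cmd = ["qsub"]
    if after:
        # Torque/PBS syntax: run only after the parent job exits cleanly.
        cmd += ["-W", f"depend=afterok:{after}"]
    cmd.append(script)
    job_id = f"{next(_ids)}.bw"  # fake job id for illustration
    print(" ".join(cmd))
    return job_id

# Chain the SGT stages: each step waits for the previous one to succeed.
mesh = submit("mesh_gen.pbs")
sgt = submit("sgt_calc.pbs", after=mesh)
check = submit("sgt_check.pbs", after=sgt)
```

This gives linear dependency chains only; expressing a full DAG, retries, and failure handling is where a real workflow tool earns its keep.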
CyberShake Study 13.4 Workflow Performance Metrics • April 17, 2013 – June 17, 2013 (62 days) • Jobs running 60% of the time • Blue Waters usage – parallel SGT calculations: • Average of 19,300 cores, 8 jobs • Average queue time: 401 sec (34% of average job runtime) • Stampede usage – many-task post-processing: • Average of 1,860 cores, 4 jobs • 470 million tasks executed (177 tasks/sec) • 21,912 jobs total • Average queue time: 422 sec (127% of average job runtime)
Constructing Workflow Capabilities using HPC System Services • Our workflows rely on multiple services • Remote gatekeeper • Remote GridFTP server • Local GridFTP server • Currently, issues are detected only via workflow failure • Automated checks that everything is operational would be helpful • Additional factors to check • Valid X509 proxy • Available disk space
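The automated pre-flight checks wished for above could be as simple as probing each service endpoint and local resource before launching workflows. A hedged sketch, assuming the standard GRAM gatekeeper (2119) and GridFTP (2811) ports; the hostnames and the disk-space threshold are made up for illustration:

```python
# Sketch of automated service checks run before launching workflows.
# Hostnames and thresholds are hypothetical.
import shutil
import socket

def port_open(host, port, timeout=5.0):
    """True if a TCP connection to host:port succeeds (service likely up)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def disk_space_ok(path, min_free_gb):
    """True if the filesystem holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= min_free_gb

checks = {
    # Well-known ports: GRAM gatekeeper 2119, GridFTP 2811.
    "remote gatekeeper": lambda: port_open("gatekeeper.example.edu", 2119),
    "remote gridftp":    lambda: port_open("gridftp.example.edu", 2811),
    "local disk space":  lambda: disk_space_ok("/tmp", min_free_gb=1),
}

for name, check in checks.items():
    print(f"{name}: {'OK' if check() else 'FAILED'}")
```

A proxy check could be added the same way by shelling out to `grid-proxy-info -exists`, which exits nonzero when no valid X509 proxy is present.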
Why Define PSHA Calculations as Multiple Workflows? • CyberShake Study 13.4 used 1144 workflows • Too many jobs to merge into one workflow • Currently, we create 1144 workflows and monitor submission via a cron job
Improving Workflow Capabilities • Workflow tools could provide higher-level organization • Specify a set of workflows as a group • Workflows are submitted and monitored automatically • Errors are detected and the group paused • Easy to obtain status and progress • Statistics are aggregated over the group
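The proposed "workflow group" capability above can be sketched as a small state machine: submit up to a concurrency limit, track per-workflow state, and pause the whole group when any workflow fails. The `WorkflowGroup` class and its states are hypothetical; a real version would wrap an existing tool's status commands rather than track state in memory.

```python
# Sketch of higher-level group management over many workflows, as
# proposed above. All names here are illustrative, not an existing API.
import enum

class State(enum.Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"

class WorkflowGroup:
    def __init__(self, names, max_concurrent=4):
        self.states = {n: State.PENDING for n in names}
        self.max_concurrent = max_concurrent
        self.paused = False

    def submit_ready(self):
        """Start pending workflows up to the concurrency limit,
        unless the group is paused after an error."""
        if self.paused:
            return []
        running = [n for n, s in self.states.items() if s is State.RUNNING]
        started = []
        for name, state in self.states.items():
            if len(running) + len(started) >= self.max_concurrent:
                break
            if state is State.PENDING:
                self.states[name] = State.RUNNING
                started.append(name)
        return started

    def report(self, name, state):
        """Record a state change; pause the whole group on failure."""
        self.states[name] = state
        if state is State.FAILED:
            self.paused = True

    def progress(self):
        done = sum(1 for s in self.states.values() if s is State.DONE)
        return f"{done}/{len(self.states)} workflows complete"

group = WorkflowGroup([f"site_{i}" for i in range(6)], max_concurrent=2)
print(group.submit_ready())   # first two workflows start
group.report("site_0", State.DONE)
group.report("site_1", State.FAILED)
print(group.submit_ready())   # [] -- group paused after the failure
print(group.progress())
```

This replaces the per-study cron job with one object that owns submission, error handling, and aggregate progress for the whole group.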
Summary of SCEC Workflow Needs • Workflow tools for large calculations on leadership-class systems that do not support remote job submission • Need capabilities to define job dependencies, manage failure and retry, and track files using a DBMS • Lightweight workflow tools that can be included in scientific software releases • For unsophisticated users, the complexity of workflow tools should not outweigh their advantages • Methods to define workflows with interactive (people-in-the-loop) stages • Needed to automate computational pathways that have manual stages • Methods to gather both scientific and computational metadata into databases • Needed to ensure reproducibility and support user interfaces