
CyberShake Study 14.2

Technical Readiness Review



Study 14.2 Scientific Goals

  • Compare the impact of velocity models on Los Angeles-area hazard maps

    • CVM-S4.26, BBP 1D, CVM-H 11.9, no GTL

    • Compare to CVM-S, CVM-H 11.9 with GTL

      • Investigate impact of GTL

      • Compare 1D reference model

      • Compare tomographic inversion results

  • 286 sites (10 km mesh + points of interest)



Study 14.2 Technical Goals

  • Run both SGT and post-processing workflows on Blue Waters

  • Plan to measure CyberShake application makespan

    • Equivalent to the makespan of all of the workflows

      • Makespan = (time all jobs complete) – (time the first workflow was submitted); see the sketch after this list

      • Includes hazard curve calculation time

      • Includes system downtime, workflow stoppages

  • Will estimate time-to-solution by adding estimates of setup time and analysis time.

  • Compare performance, queue times, results of GPU and CPU AWP-ODC-SGT
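
A minimal sketch of how the makespan measurement could be scripted from workflow logs, assuming one log per workflow whose non-empty lines begin with ISO-format timestamps; the log layout and function names are illustrative, not the actual CyberShake tooling:

```python
from datetime import datetime
from pathlib import Path

def first_last_times(log_path):
    """Return the first and last timestamps in one workflow log.
    Assumes each non-empty line starts with an ISO timestamp, e.g. 2014-02-18T09:30:00."""
    lines = [ln for ln in Path(log_path).read_text().splitlines() if ln.strip()]
    stamp = lambda ln: datetime.fromisoformat(ln.split()[0])
    return stamp(lines[0]), stamp(lines[-1])

def makespan(log_paths):
    """(Time all jobs complete) - (time first workflow submitted), across every workflow.
    By construction this includes hazard curve calculation, downtime, and stoppages."""
    starts, ends = zip(*(first_last_times(p) for p in log_paths))
    return max(ends) - min(starts)

# Time-to-solution would add rough setup and analysis estimates on top of makespan(...).
```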



Performance Enhancements

  • New version of seismogram synthesis code to reduce read I/O

    • Reads in set of extracted SGTs

    • Synthesizes multiple rupture variations (RVs) per invocation (5 in production)

  • Reduce number of subworkflows to 6 (from 8)

    • Fewer jobs, less queuing time

  • For CPU SGTs, increase core count

    • Each processor handles a chunk of roughly 64 x 50 x 50 grid points

  • For GPU SGTs, decrease processor count

    • Volume dimensions must be multiples of 20 grid points in X and Y

    • Fixed 10 x 10 x 1 GPU decomposition, regardless of volume (see the sizing sketch after this list)
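
A sizing sketch for the processor counts described above. The ~64 x 50 x 50 CPU chunk, the multiples-of-20 constraint, and the fixed 10 x 10 x 1 GPU layout come from this slide; the function names and the example volume are illustrative:

```python
import math

def cpu_core_count(nx, ny, nz, chunk=(64, 50, 50)):
    """CPU SGTs: roughly one core per 64 x 50 x 50 chunk of grid points."""
    cx, cy, cz = chunk
    return math.ceil(nx / cx) * math.ceil(ny / cy) * math.ceil(nz / cz)

def gpu_count(nx, ny, nz):
    """GPU SGTs: fixed 10 x 10 x 1 decomposition; X and Y must be multiples of 20."""
    if nx % 20 or ny % 20:
        raise ValueError("volume X and Y dimensions must be multiples of 20")
    return 10 * 10 * 1

# Illustrative volume (not an actual Study 14.2 site):
# cpu_core_count(1400, 1400, 400)  -> 22 * 28 * 8 = 4928 cores
# gpu_count(1400, 1400, 400)       -> 100 GPUs
```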



Proposed Study Sites (286)



Study 14.2 Data Products

  • 2 CVM-S4.26 Los Angeles-area hazard models (one from CPU SGTs, one from GPU SGTs)

  • 1 BBP 1D Los Angeles-area hazard model

  • 1 CVM-H 11.9, no GTL Los Angeles-area hazard model

  • Hazard curves for 286 sites x 4 conditions, at 3s, 5s, 10s

  • 1144 sets of 2-component SGTs

  • Seismograms for all rupture variations (~470 million)

  • Peak amplitudes in DB for 3s, 5s, 10s



Study 14.2 Notables

  • First CVM-S4.26 hazard models

  • First CVM-H, no GTL hazard model

  • First 1D hazard model

  • First study using AWP-SGT-GPU

  • First CyberShake Study using a single workflow on one system (Blue Waters)



Study 14.2 Parameters

  • 0.5 Hz, deterministic

    • 200 m spacing

  • CVMs

    • Vs min = 500 m/s (see the note after this list)

  • UCERF 2

  • Graves & Pitarka (2010) rupture variations
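
These values are mutually consistent under the common finite-difference guideline of about 5 grid points per minimum wavelength; the guideline value is an assumption here, and only the 0.5 Hz, 200 m, and 500 m/s numbers come from the slide:

```python
# minimum wavelength = Vs_min / f_max; grid spacing = wavelength / points per wavelength
vs_min = 500.0              # m/s, from the slide
f_max = 0.5                 # Hz, from the slide
points_per_wavelength = 5   # assumed rule of thumb for the finite-difference scheme

dx = vs_min / (f_max * points_per_wavelength)
print(dx)                   # 200.0 m, matching the 200 m spacing above
```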



Verification

  • 4 sites (USC, PAS, WNGC, SBSM)

    • AWP-SGT-CPU, CVM-S4.26

    • AWP-SGT-GPU, CVM-S4.26

    • AWP-SGT-CPU, BBP 1D

    • AWP-SGT-GPU, CVM-H 11.9, no GTL

  • Plotted with previously calculated curves



CVM-S4.26 (CPU)



CVM-H, no GTL (CPU)



Changes to SGT Software Stack

  • Velocity Mesh generation

    • Switched from 2 jobs (create, then merge) to 1 job

  • SGTs

    • AWP-ODC-SGT CPU v14.2

      • Uses a wrapper because of an issue with getting the exit code back

    • AWP-ODC-SGT GPU v14.2

      • Uses a wrapper to read the parameter file and construct command-line arguments

  • NaN Check

    • A NaN check has always been run on RWG SGTs; it now runs on AWP SGTs as well (see the sketch after this list)
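
A minimal sketch of what the added NaN check over AWP SGT output could look like, assuming the SGTs are stored as raw 32-bit floats; the file layout and function name are assumptions, not the actual CyberShake check:

```python
import numpy as np

def sgt_has_nan(path, chunk_elems=10_000_000):
    """Stream a raw float32 SGT file in chunks and report whether any value is NaN."""
    with open(path, "rb") as f:
        while True:
            chunk = np.fromfile(f, dtype=np.float32, count=chunk_elems)
            if chunk.size == 0:
                return False
            if np.isnan(chunk).any():
                return True

# Wrapper-style usage: exit non-zero (fail the workflow job) if the SGTs contain NaNs.
# import sys; sys.exit(1 if sgt_has_nan("awp_sgt_x.bin") else 0)
```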



Changes to PP Software Stack

  • Seismogram Synthesis / PSA Calculation

    • Modified to synthesize multiple seismograms per invocation

    • Will use 5 rupture variations per invocation

    • Reduces read I/O by a factor of 5 (see the sketch after this list)

    • Needed to avoid triggering congestion protection events on Blue Waters

  • All codes tagged in SVN before study begins
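
A small sketch of the batching arithmetic: each invocation reads the extracted SGTs once and synthesizes up to 5 rupture variations, so SGT reads drop by roughly a factor of 5. The function and field names are illustrative:

```python
def plan_invocations(num_rvs, rvs_per_invocation=5):
    """One SGT read per invocation, up to 5 rupture variations synthesized per read."""
    invocations = -(-num_rvs // rvs_per_invocation)   # ceiling division
    return {
        "invocations": invocations,
        "sgt_reads_batched": invocations,       # one read per invocation
        "sgt_reads_unbatched": num_rvs,         # one read per rupture variation
        "read_io_reduction": num_rvs / invocations,
    }

# e.g. plan_invocations(410_000) -> 82,000 invocations, ~5x fewer SGT reads per site
```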



Changes to Workflows

  • Changed workflow hierarchy

    • 1 integrated workflow per site, per velocity model

    • Added ability to select SGT core count dynamically

    • Put volume creation job into top-level workflow to reduce hierarchy to 2 levels

  • Reduced number of post-processing sub-workflows to 6

    • Fewer jobs in queue

  • Will not keep job output if job succeeds

    • Reduce size of workflow logs



Workflow Hierarchy

Integrated Workflow (1 per model per site), containing the following jobs and sub-workflows (see the DAX sketch after this diagram):

  • PreCVM (creates volume)

  • SGT Workflow

  • Generate SGT Workflow (more details on next slide)

  • PP Pre Workflow

  • PP subwf 0, PP subwf 1, …, PP subwf 5

  • DB Workflow
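
A sketch of how this two-level hierarchy might be expressed with the Pegasus 4.x DAX3 Python API. The workflow and file names are illustrative, the dependency chain is simplified, and the real CyberShake workflow generator differs:

```python
from Pegasus.DAX3 import ADAG, Job, DAX

adag = ADAG("CyberShake_Integrated")        # one integrated workflow per model per site

precvm = Job(name="PreCVM")                 # volume creation job in the top-level workflow
adag.addJob(precvm)

sub_daxes = (["SGT_Workflow.dax", "Generate_SGT_Workflow.dax", "PP_Pre_Workflow.dax"]
             + ["PP_subwf_%d.dax" % i for i in range(6)]
             + ["DB_Workflow.dax"])

subwfs = []
for dax_file in sub_daxes:
    sub = DAX(dax_file)                     # sub-workflow node (second level of hierarchy)
    adag.addDAX(sub)
    subwfs.append(sub)

# Simple chain for illustration; the real dependencies are more involved.
for parent, child in zip([precvm] + subwfs, subwfs):
    adag.depends(parent=parent, child=child)

adag.writeXMLFile("CyberShake_Integrated.dax")
```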


[Diagram: Generate SGT Workflow details]



Distributed Processing

  • A cron job on shock.usc.edu creates, plans, and runs the full workflows (see the sketch after this list)

    • Pegasus 4.4, from Git repository

    • Condor 8.0.3

    • Globus 5.0.4

  • Jobs submitted to Blue Waters via GRAM

  • Results staged back to shock, DB populated, curves generated

  • Alternate CPU and GPU workflows for best queue performance
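
A sketch of the cron-driven planning step, invoking pegasus-plan from Python; the properties file, directories, and site handle are placeholders rather than the actual shock.usc.edu configuration:

```python
import subprocess

def plan_and_submit(dax_path, run_dir):
    """Plan one integrated workflow with Pegasus 4.4 and hand it to Condor DAGMan."""
    subprocess.check_call([
        "pegasus-plan",
        "--conf", "pegasus.properties",   # placeholder properties file
        "--dax", dax_path,                # integrated workflow DAX for one site/model
        "--dir", run_dir,                 # placeholder submit directory
        "--sites", "bluewaters",          # placeholder site handle; jobs go out via GRAM
        "--output-site", "local",         # stage results back to shock
        "--submit",
    ])

# A cron entry would walk the list of pending sites, alternating CPU and GPU
# workflows, and call plan_and_submit() for each newly generated DAX.
```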



Computational Requirements

  • Computational time: 275K node-hrs (arithmetic checked in the sketch after this list)

    • SGT Computational time: 180K node-hrs

      • CPU: 150 node-hrs/site x 286 sites x 2 models = 86K node-hrs (XE, 32 cores/node)

      • GPU: 90 node-hrs/site x 286 sites x 2 models = 52K node-hrs (XK)

      • Study 13.4 had 29% overrun on SGTs

    • PP Computational time: 95K node-hrs

      • 60 node-hrs/site x 286 sites x 4 models = 70K node-hrs (XE, 32 cores/node)

      • Study 13.4 had 35% overrun on PP

  • Current allocation has 3.0M node-hrs remaining
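
A quick check of the node-hour arithmetic above; treating the Study 13.4 overrun percentages as multipliers on the base estimates is an interpretation of the slide, not something it states explicitly:

```python
sites = 286

cpu_sgt = 150 * sites * 2                # 85,800  ~  86K node-hrs (XE)
gpu_sgt = 90 * sites * 2                 # 51,480  ~  52K node-hrs (XK)
sgt_total = (cpu_sgt + gpu_sgt) * 1.29   # +29% Study 13.4 overrun -> ~177K ~ 180K

pp = 60 * sites * 4                      # 68,640  ~  70K node-hrs (XE)
pp_total = pp * 1.35                     # +35% Study 13.4 overrun -> ~93K ~ 95K

print(round((sgt_total + pp_total) / 1000))   # ~270K, close to the 275K budgeted
```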



Blue Waters Storage Requirements

  • Planned unpurged disk usage: 45 TB

    • SGTs: 40 GB/site x 286 sites x 4 models = 45 TB, archived on Blue Waters

  • Planned purged disk usage: 783 TB

    • Seismograms: 11 GB/site x 286 sites x 4 models = 12.3 TB, staged back to SCEC

    • PSA files: 0.2 GB/site x 286 sites x 4 models = 0.2 TB, staged back to SCEC

    • Temporary: 690 GB/site x 286 sites x 4 models = 771 TB



SCEC Storage Requirements

  • Planned archival disk usage: 12.5 TB

    • Seismograms: 12.3 TB (scec-04 has 19 TB)

    • PSA files: 0.2 TB (scec-04)

    • Curves, disagg, reports: 93 GB (99% reports)

  • Planned database usage: 210 GB

    • 3 rows/rupture variation (one per period) x 410K rupture variations/site x 286 sites x 4 models = 1.4B rows (see the sketch after this list)

    • 1.4B rows x 151 bytes/row = 210 GB (880 GB free)

  • Planned temporary disk usage: 5.5 TB

    • Workflow logs: 5.5 TB – possibly smaller, now that job output is not saved on success (scec-02 has 12 TB free)
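
A check of the database-row arithmetic (all numbers from the slide; the 3 rows per rupture variation correspond to the 3s, 5s, and 10s peak amplitudes):

```python
rows = 3 * 410_000 * 286 * 4    # 3 periods x 410K rupture variations/site x 286 sites x 4 models
print(rows)                      # 1,407,120,000  ~ 1.4B rows

size_gb = rows * 151 / 1e9       # 151 bytes per row
print(round(size_gb))            # ~212 GB, in line with the ~210 GB estimate
```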



Metrics Gathering

  • Monitord for workflow metrics

    • Will run after workflows have completed

  • Python scripts

    • Used to obtain some of the standard CyberShake metrics for comparison

  • Cron job on Blue Waters

    • Core usage over time

    • Counts of running and idle jobs (see the polling sketch after this list)

  • Will use start and end of workflow logs to perform makespan measurement
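
A sketch of the kind of job-count poller such a cron job could run, assuming a Torque-style qstat interface on Blue Waters; the command, its output format, and the parsing are assumptions, not the actual script:

```python
import subprocess, time

def count_jobs(user):
    """Count running (R) and queued (Q) jobs for one user via `qstat -u <user>`."""
    out = subprocess.run(["qstat", "-u", user],
                         capture_output=True, text=True).stdout
    states = [line.split()[-2] for line in out.splitlines() if user in line]
    return states.count("R"), states.count("Q")

# Run from cron every few minutes, appending a timestamped sample to a metrics log:
# running, idle = count_jobs("username"); print(int(time.time()), running, idle)
```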



Estimated Duration

  • Limiting factors:

    • Queue time

      • Especially for XK nodes, queue time could be a substantial percentage of run time

    • Blue Waters -> SCEC transfer

      • If Blue Waters throughput is very high, transfer could be bottleneck

  • With queues, estimated completion is 4 weeks

    • 1 hazard map/week

    • Requires an average of 410 nodes (275K node-hrs / 672 hours in 4 weeks)

    • 603 nodes averaged during Study 13.4

  • With a reservation, completion depends on the reservation size



Personnel Support

  • Scientists

    • Tom Jordan, Kim Olsen, Rob Graves

  • Technical Lead

    • Scott Callaghan

  • SGT code support

    • Efecan Poyraz, Yifeng Cui

  • Job Submission / Run Monitoring

    • Scott Callaghan, David Gill, Heming Xu, Phil Maechling

  • NCSA Support

    • Omar Padron, Tim Bouvet

  • Workflow Support

    • Karan Vahi, Gideon Juve



Risks

  • Queue times on Blue Waters

    • In tests, at times GPU queue times have been > 1 day

  • Congestion protection events

    • If triggered consistently, will either need to throttle post-processing or suspend run until improvements are developed



Thanks for your time!

