
Parallel Code Choices



  1. Parallel Code Choices

  2. Where Do We Stand?
     • ShakeOut-D 1-Hz, Vs=250 m/s benchmark runs on Kraken-XT5 and Ranger at full machine scale; Hercules successful test run on 16k Kraken-XT4 cores with Vs=200 m/s
     • Multiple AWP-Olsen ShakeOut-D 1-Hz runs on NICS Kraken-XT5 using 64k processor cores, wall-clock time under 5 hours; SORD runs using 16k Ranger cores
     • Milestone: passing the 100k-core mark! Recent successful benchmark runs on DOE ANL BG/P using up to 131,072 cores

  3. SCEC capability runs update

  4. SCEC capability runs update

  5. HPC Initiative Roadmap (short-term 2009, medium-term 2011, long-term 2013+, with data integration throughout)
     • Platform tiers: Tier 0, PFlops-class systems; Tier 1, TeraGrid/DOE supercomputer centers (Ranger, Kraken, BG/P) and grid computing; Tier 2, regional/medium supercomputers; Tier 3, high-performance workstations
     • Programming models: current message-passing model (C, C++, Fortran plus MPI communication); transition to PGAS models (UPC, CAF, Titanium) with current compilation technology; high-productivity models (HPCS X10, Chapel) with future compilation technology
     • Science targets: ShakeOut 1-Hz, Vs=200 m/s today; EGM 2-Hz, Vs=200 m/s with adaptation on GPU/Cell, Blue Waters, hybrid/NUMA systems and CAF; EGM 3-10 Hz with new codes on future architectures (FPGA, Chapel, cloud computing), contributions to architecture design, Ph.D. programs?

  6. Parallel FD and FE Codes

  7. Proposed Plan of Work: Automatic End-to-End Approach
     • Automated rule-based workflow (a generic sketch follows this slide)
     • Highly configurable and customizable
     • Reliable and robust
     • Easy implementation
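
The bullets above state goals rather than a design, so the following is only a minimal, hypothetical illustration of what "rule-based workflow" can mean in practice: a table of condition/action pairs that a driver walks through, firing each step whose precondition holds. All names (build_mesh, run_solver, etc.) are invented and do not come from the SCEC tooling.

```c
/* Hypothetical sketch of a rule-based workflow driver.  Each rule pairs a
 * readiness check with an action; the driver fires, in order, every rule
 * whose condition currently holds. */
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    const char *name;
    bool (*ready)(void);  /* condition: should this step run now? */
    int  (*run)(void);    /* action: execute the step, return 0 on success */
} rule_t;

/* Stub steps standing in for real workflow stages (mesh generation, solver run). */
static bool mesh_missing(void)    { return true;  }
static int  build_mesh(void)      { puts("building mesh");  return 0; }
static bool outputs_missing(void) { return false; }
static int  run_solver(void)      { puts("running solver"); return 0; }

int main(void) {
    rule_t rules[] = {
        { "mesh",   mesh_missing,    build_mesh },
        { "solver", outputs_missing, run_solver },
    };
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        if (rules[i].ready() && rules[i].run() != 0) {
            fprintf(stderr, "rule '%s' failed\n", rules[i].name);
            return 1;  /* fail fast so the workflow is easy to rerun */
        }
    }
    return 0;
}
```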

  8. Proposed Plan of Work: Single-core Optimization
     • Target much higher TFlop/s! The basic but most important optimization step, because the gains accumulate even in multi-core environments
     • Application-specific optimization techniques
       - Program behavior analysis (source-level or run-time profiling)
       - Traditional optimization techniques such as loop unrolling, code reassignment, register reallocation, and so on (a hedged sketch follows this slide)
       - Optimize the behavior of the code hotspots
     • Architecture-aware optimization
       - Optimization based on the underlying architecture: computational units, interconnect, cache, and memory
     • Compiler-driven optimization techniques, some already done
       - Optimal compiler and optimization flags
       - Optimal libraries
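
As a concrete illustration of the loop-unrolling item above, here is a minimal C sketch on a generic AXPY-style kernel; it is not taken from AWP-Olsen or Hercules, and in practice much of this transformation is obtained from the compiler rather than by hand.

```c
/* Manual 4-way loop unrolling of a simple kernel (illustrative only).
 * Unrolling exposes independent updates for instruction-level parallelism
 * and amortizes loop-control overhead over more useful work. */
#include <stddef.h>

void axpy_unrolled(size_t n, float a, const float *restrict x, float *restrict y) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {        /* unrolled main body */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; i++)                  /* remainder when n % 4 != 0 */
        y[i] += a * x[i];
}

/* Compiler-driven alternative (GCC): let -O3 -funroll-loops apply the same
 * transformation, then confirm the effect by profiling the real hotspot. */
```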

  9. Proposed Plan of Work: Multi-core Optimization
     • Computational pipelining
       - Asynchronous process communication: isend and irecv (see the MPI sketch after this slide)
       - Well-defined pipelining of computational jobs to reduce the overhead imposed by MPI synchronization
       - Guaranteed correctness of the computation
     • Reduction of conflicts on shared resources
       - A compute node shares resources: caches (shared L2 or L3) and memory
       - Resolve highly biased conflicts on shared resources
       - Program-behavioral solutions through temporal or spatial conflict avoidance
     [Slide diagrams: a synchronous sender/receiver timeline stalling at the sync point vs. an asynchronous isend/irecv timeline overlapped with computation; two cores with private L1 caches and a shared L2 cache over shared memory, contrasting frequent, biased access with infrequent, even access]
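
A minimal C/MPI sketch of the asynchronous isend/irecv pattern described above, written as a generic halo-style exchange rather than the actual AWP-Olsen communication code; compute_interior and compute_boundary are placeholders.

```c
/* Overlapping computation with MPI_Isend/MPI_Irecv (generic halo exchange). */
#include <mpi.h>

void exchange_and_compute(double *send_buf, double *recv_buf, int n,
                          int left, int right, MPI_Comm comm) {
    MPI_Request reqs[2];

    /* post the nonblocking receive and send first ... */
    MPI_Irecv(recv_buf, n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Isend(send_buf, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    /* ... then do interior work that does not depend on the halo,
     * hiding communication latency behind computation */
    /* compute_interior(); */

    /* wait before touching recv_buf or reusing send_buf */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* boundary work that needs the received halo goes here */
    /* compute_boundary(recv_buf); */
}
```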

  10. Proposed Plan of Work: Fault Tolerance
     • Full systems are being designed with 500,000+ processors
     • Assuming each processor has a 99.99% chance of running for a year without failure, the chance of a one-million-core machine staying up for one week is only about 14% (arithmetic sketched below)
     • Checkpointing and restarting could take longer than the time to the next failure
     • System checkpoint/restart under way: last year our 80+ hour, 6k-core run on BG/L succeeded using IBM system checkpointing (an application-assisted infrastructure in which the application level is responsible for identifying points at which there are no outstanding messages)
     • A new model is needed; checkpoints to disk will be impractical at exascale
     • Collaboration with Dr. Zizheng Chen of CSM on scalable, algorithm-based, checkpoint-free fault-tolerance techniques that tolerate a small number of process failures
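
A quick check of the 14% figure, assuming failures are independent and occur uniformly in time:

```latex
% per-processor weekly survival raised to the number of processors
P(\text{machine up for 1 week})
  = \left(0.9999^{1/52}\right)^{10^{6}}
  = 0.9999^{\,10^{6}/52}
  \approx e^{-10^{-4}\cdot 10^{6}/52}
  = e^{-100/52}
  \approx 0.146 \approx 14.6\%
```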

  11. Proposed Plan of Work: Data Management
     • Centralized data collection becomes more and more difficult as data sizes grow exponentially
     • Automating administrative tasks such as replication, distribution, access control, and metadata extraction is a huge challenge; data virtualization and grid technology are to be integrated. With iRODS, for example, one can write rules to track administrative functions such as integrity monitoring (a generic sketch of the idea follows this slide):
       - provide a logical name space so data can be moved without its access name changing
       - provide metadata to support discovery of files and to track provenance
       - provide rules to automate administrative tasks (authenticity, integrity, distribution, replication checks)
       - provide micro-services for parsing data sets (HDF5 routines)
     • Potential to use new iRODS interfaces to serve the large SCEC community
       - WebDAV (accessible even from devices such as an iPhone)
       - Windows browser; efficient and fast browser interface

  12. Proposed Plan of Work: Data Visualization
     • Visualization integration is of critical interest; Amit has been working with a graduate student to develop new GPU-based techniques for earthquake visualization

  13. Candidates for Non-SCEC Applications
     • ADER-DG: an arbitrary high-order discontinuous Galerkin finite-element method
     • Shuo Ma's FE code (MaFE) using a simplified structured grid

  14. AWP-Olsen-Day vs ADER-DG

  15. ADER-DG Scaling on Ranger

  16. ADER-DG Validation LOH.3 (Source: Martin Kaeser 2009)

  17. ADER-DG Local Time Stepping Each tetrahedral element (m) has its own time step, determined by l_min, the insphere radius of the tetrahedron, and a_max, the fastest wave speed; the Taylor series in time therefore depends on the local time level t(m) (Source: Martin Kaeser 2009)
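
The time-step formula itself did not survive transcription; as a hedged reconstruction, the criterion in the standard ADER-DG form (Dumbser and Käser), with N the polynomial degree of the basis, reads:

```latex
% local CFL-type criterion per tetrahedral element m (reconstructed form)
\Delta t^{(m)} \;<\; \frac{1}{2N+1}\,\frac{2\, l_{\min}^{(m)}}{a_{\max}^{(m)}}
```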

  18. ADER-DG Dynamic Rupture Results (Source: Martin Kaeser 2009)

  19. ADER-DG Effect of mesh coarsening (Source: Martin Kaeser 2009)

  20. DG Application to Landers branching fault system (Source: Martin Kaeser 2009)

  21. DG Modeling of Wave Fields in Merapi Volcano (J. Wassermann)
     • problem-adapted mesh generation
     • p-adaptive calculations to resolve topography very accurately
     • load balancing by grouping subdomains
     (Source: Martin Kaeser 2009)

  22. DG Modeling of Scattered Waves in Merapi Volcano (J. Wassermann)
     • analysing the strong scattering effect of surface topography
     • analysing the limits of standard moment-tensor inversion procedures
     (Source: Martin Kaeser 2009)

  23. MaFE Scaling (Source: Shuo Ma 2009)
