NASA NCCS APPLICATION PERFORMANCE DISCUSSION
  • Koushik Ghosh, Ph.D.
  • IBM Federal HPC
  • HPC Technical Specialist

IBM System x iDataPlex - Parallel Scientific Applications Development

April 22-23, 2009

Topics
  • HW/SW System Architecture
    • Platform/Chipset
    • Processor
    • Memory
    • Interconnect
  • Building Apps on System x
    • Compilation
    • MPI
  • Executing Apps on System x
    • Runtime options
  • Tools: Oprofile / MIO I/O Perf / MPI Trace
  • Discussion of NCCS apps
Compute Node
  • iDataPlex – 2U Flex
    • Intel Harpertown (Xeon L5200)
      • dual-socket, quad-core, 2.5 GHz, 50 W
      • SCU 3 and SCU 4
    • Nehalem
      • dual-socket, quad-core 2.8? GHz
      • SCU 5
cpuinfo (Harpertown) (/opt/intel/impi/3.1/bin64/cpuinfo)
  • Architecture : x86_64
  • Hyperthreading: disabled
  • Packages : 2
  • Cores : 8
  • Processors : 8
  • ===== Processor identification =====
  • Processor Thread Core Package
  • 0 0 0 1
  • 1 0 0 0
  • 2 0 2 0
  • 3 0 2 1
  • 4 0 1 0
  • 5 0 3 0
  • 6 0 1 1
  • 7 0 3 1
  • ===== Processor placement =====
  • Package Cores Processors
  • 1 0,2,1,3 0,3,6,7
  • 0 0,2,1,3 1,2,4,5
  • ===== Cache sharing =====
  • Cache Size Processors
  • L1 32 KB no sharing
  • L2 6 MB (0,6)(1,4)(2,5)(3,7)
cpuinfo (Nehalem) (/opt/intel/impi/3.2.0.011/bin64/cpuinfo)
  • Architecture : x86_64
  • Hyperthreading: enabled
  • Packages : 2
  • Cores : 8
  • Processors : 16
  • ===== Processor identification =====
  • Processor Thread Core Package
  • 0 0 0 0
  • 1 1 0 0
  • 2 0 1 0
  • 3 1 1 0
  • 4 0 2 0
  • 5 1 2 0
  • 6 0 3 0
  • 7 1 3 0
  • 8 0 0 1
  • 9 1 0 1
  • 10 0 1 1
  • 11 1 1 1
  • 12 0 2 1
  • 13 1 2 1
  • 14 0 3 1
  • 15 1 3 1
  • ===== Processor placement =====
  • Package Cores Processors
  • 0 0,1,2,3 (0,1)(2,3)(4,5)(6,7)
  • 1 0,1,2,3 (8,9)(10,11)(12,13)(14,15)
  • ===== Cache sharing =====
  • Cache Size Processors
  • L1 32 KB (0,1)(2,3)(4,5)(6,7)(8,9)(10,11)(12,13)(14,15)
  • L2 256 KB (0,1)(2,3)(4,5)(6,7)(8,9)(10,11)(12,13)(14,15)
  • L3 8 MB (0,1,2,3,4,5,6,7)(8,9,10,11,12,13,14,15)
cat /proc/cpuinfo (Harpertown)
  • processor : 0
  • vendor_id : GenuineIntel
  • cpu family : 6
  • model : 23
  • model name : Intel(R) Xeon(R) CPU E5472 @ 3.00GHz
  • stepping : 6
  • cpu MHz : 2992.509
  • cache size : 6144 KB
  • physical id : 1
  • siblings : 4
  • core id : 0
  • cpu cores : 4
  • fpu : yes
  • fpu_exception : yes
  • cpuid level : 10
  • wp : yes
  • flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
  • bogomips : 5988.95
  • clflush size : 64
  • cache_alignment : 64
  • address sizes : 38 bits physical, 48 bits virtual
  • power management:
/proc/cpuinfo (Nehalem)
  • processor : 0
  • vendor_id : GenuineIntel
  • cpu family : 6
  • model : 26
  • model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
  • stepping : 4
  • cpu MHz : 2927.000
  • cache size : 8192 KB
  • physical id : 0
  • siblings : 8
  • core id : 0
  • cpu cores : 4
  • apicid : 0
  • initial apicid : 0
  • fpu : yes
  • fpu_exception : yes
  • cpuid level : 11
  • wp : yes
  • flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
  • bogomips : 5866.08
  • clflush size : 64
  • cache_alignment : 64
  • address sizes : 40 bits physical, 48 bits virtual
  • power management:
cat /proc/meminfo
MemTotal: 24737232 kB

MemFree: 21152912 kB

Buffers: 77376 kB

Cached: 2230344 kB

SwapCached: 0 kB

Active: 1650908 kB

Inactive: 1720616 kB

Active(anon): 955796 kB

Inactive(anon): 0 kB

Active(file): 695112 kB

Inactive(file): 1720616 kB

Unevictable: 0 kB

Mlocked: 0 kB

SwapTotal: 2104472 kB

SwapFree: 2104472 kB

Dirty: 536 kB

Writeback: 0 kB

AnonPages: 955608 kB

Mapped: 28632 kB

Slab: 123752 kB

SReclaimable: 101028 kB

SUnreclaim: 22724 kB

PageTables: 5364 kB

NFS_Unstable: 0 kB

Bounce: 0 kB

WritebackTmp: 0 kB

CommitLimit: 14473088 kB

Committed_AS: 1156568 kB

VmallocTotal: 34359738367 kB

VmallocUsed: 337244 kB

VmallocChunk: 34359395451 kB

HugePages_Total: 0

HugePages_Free: 0

HugePages_Rsvd: 0

HugePages_Surp: 0

Hugepagesize: 2048 kB

DirectMap4k: 4096 kB

DirectMap2M: 25153536 kB

Meminfo explanation
  • High-Level Statistics
  • MemTotal: Total usable RAM (i.e. physical RAM minus a few reserved bits and the kernel binary code)
  • MemFree: The sum of LowFree + HighFree (overall statistic)
  • MemShared: 0; present only for compatibility reasons, always zero
  • Buffers: Memory in the buffer cache; mostly useless as a metric nowadays
  • Cached: Memory in the pagecache (disk cache) minus SwapCache
  • SwapCached: Memory that was once swapped out and has been swapped back in, but is still also in the swap file (if memory is needed it does not have to be swapped out AGAIN because it is already in the swap file; this saves I/O)
Memory
  • Memory on Harpertown Compute Node SCU3 and SCU4
    • 4 x 4GB (9W) PC2-5300 CL5 ECC DDR2 667MHz FBDIMMs
    • 16 GB per node
  • Memory on Nehalem Compute Node
    • 3 DDR3 channels on each socket / total of 8 DIMM slots
    • e.g. one 4 GB DIMM on each DDR3 channel (24 GB/node) at 1333 MHz
    • e.g. 18 GB per node (1066 MHz)
      • 2GB/2GB/2GB on channel1
      • 2GB/2GB/2GB on channel2
      • 1GB on channel 3
Interconnect
  • One Mellanox ConnectX dual-port 4X DDR InfiniBand HCA, PCIe 2.0 x8
  • Cisco 9024D 288-port 4X DDR InfiniBand switches for each scalable unit, cabled in the following manner:
    • 256 ports to compute nodes
    • 2 ports to spare compute nodes
    • 6 ports to service nodes
    • 24 ports uplinked to Tier 1 InfiniBand switch
  • “ConnectX” InfiniBand 4X DDR HCAs
  • 16 Gb/second of uni-directional peak MPI bandwidth
  • less than 2 microseconds MPI latency.
Nehalem Features
  • The Nehalem microarchitecture has many new features, some of which are present in the Core i7. The ones that represent significant changes from the Core 2 include:
  • The new LGA 1366 socket is incompatible with earlier processors.
  • On-die memory controller: the memory is directly connected to the processor. The controller sits in the "uncore" part of the chip and runs at a different clock (the uncore clock) than the execution cores.
    • Three channel memory: each channel can support one or two DDR3 DIMMs. Motherboards for Core i7 generally have three, four (3+1) or six DIMM slots.
    • Support for DDR3 only.
    • No ECC support.
  • The front side bus has been replaced by the Intel QuickPath Interconnect interface. Motherboards must use a chipset that supports QuickPath.
  • The following caches:
    • 32 KB L1 instruction and 32 KB L1 data cache per core
    • 256 KB L2 cache (combined instruction and data) per core
    • 8 MB L3 (combined instruction and data) "inclusive", shared by all cores
  • Single-die device: all four cores, the memory controller, and all cache are on a single die.
Nehalem Features contd.
  • "Turbo Boost" technology
    • allows all active cores to intelligently clock themselves up
    • in steps of 133 MHz over the design clock rate
    • as long as the CPU's predetermined thermal/electrical requirements are still met.
  • Re-implemented Hyper-threading.
    • Each of the four cores can process up to two threads simultaneously,
    • processor appears to the OS as eight CPUs.
    • This feature was dropped in the Core microarchitecture (Harpertown).
  • Only one QuickPath interface: not intended for multi-processor motherboards.
  • 45nm process technology.
  • 731M transistors.
  • 263 mm² die size.
  • Sophisticated power management places unused cores in a zero-power mode.
  • Support for SSE4.2 & SSE4.1 instruction sets.
I/O and Filesystem
  • /discover/home
  • /discover/nobackup
  • IBM Global Parallel File System (GPFS) used on all nodes
  • Serviced by 4 I/O nodes
  • read/write access from all nodes
  • /discover/home 2TB, /discover/nobackup 4 TB
    • Individual quotas:
      • /discover/home: 500 MB
      • /discover/nobackup: 100 GB
  • Very fast (peak: 350 MB/sec, normal: 150 - 250 MB/sec)
Software
  • OS: Linux (RHEL5.2)
  • Compilers:
    • Intel Fortran, C/C++
    • Math libs: BLAS, LAPACK, ScaLAPACK, MKL
  • MPI: MPI-2
  • Scheduler: PBSPro
LINUX pagesize
  • getconf PAGESIZE

4096

Which modules
  • modules loaded (64-bit compilers)
    • intel-cce-10.1.017
    • intel-fce-10.1.017
    • intel-mkl-10.0.3.020
    • intel-mpi-3.1-64bit
    • /opt/intel/fce/10.1.017/bin/ifort
    • /opt/intel/impi/3.1/bin64/mpiifort
  • modules loaded (32-bit compilers)
    • intel-cc-10.1.017
    • intel-fc-10.1.017
    • /opt/intel/fc/10.1.017/bin/ifort
    • /opt/intel/impi/3.1/bin/mpiifort
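As a rough sketch (assuming these packages are selected with the standard environment-modules command, which the slides do not state, and using a placeholder source file), picking the 64-bit tool chain and verifying which wrappers are found might look like:

  module load intel-cce-10.1.017 intel-fce-10.1.017 intel-mkl-10.0.3.020 intel-mpi-3.1-64bit
  which ifort       # should resolve to /opt/intel/fce/10.1.017/bin/ifort
  which mpiifort    # should resolve to /opt/intel/impi/3.1/bin64/mpiifort
  mpiifort -O2 -o hello.x hello.f90    # hello.f90 is a placeholder source file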
IFC Compiler Options of some Physics/Chemistry/Climate Applications
  • CubedSphere
    • -safe_cray_ptr -i_dynamic -convert big_endian -assume byterecl -ftz -i4 -r8 -O3 -xS
  • NPB3.2: -O3 -xT -ip -no-prec-div -ansi-alias -fno-alias
  • HPCC: -O2 -xT
  • GAMESS: -O3 -xS -ipo -i-static -fno-pic
  • GTC: -O1
  • CAM: -O3 -xT
  • MILC: -O3 -xT
  • PARATEC: -O3 -xS -ipo -i-static -fno-fnalias -fno-alias
  • STREAM: -O3 -opt-streaming-stores=always -xS -ip
  • SpecCPU2006: ?????
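Purely as an illustration of how one of these flag sets might be wired into a build (the file names are placeholders, and the slides do not prescribe a particular build system), using csh syntax as elsewhere in these slides:

  setenv FFLAGS "-safe_cray_ptr -i_dynamic -convert big_endian -assume byterecl -ftz -i4 -r8 -O3 -xS"
  mpiifort $FFLAGS -c fv_dynamics.F90      # placeholder CubedSphere source file
  mpiifort $FFLAGS -o cubed_sphere.x *.o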
Optimization Level O2 (-O2)
  • Inlining of intrinsics
  • Intra-file interprocedural optimizations:
    • inlining
    • constant propagation
    • forward substitution
    • routine attribute propagation
    • variable address-taken analysis
    • dead static function elimination
    • removal of unreferenced variables
  • Constant propagation
  • Copy propagation
  • Dead-code elimination
  • Global register allocation
  • Global instruction scheduling and control speculation
  • Loop unrolling
  • Optimized code selection
  • Partial redundancy elimination
  • Strength reduction/induction
  • Variable renaming
  • Exception handling optimizations
  • Tail recursions
  • Peephole optimizations
  • Structure assignment lowering and optimizations
  • Dead store elimination
Optimization Level O3 (-O3)
  • Enables O2 optimizations plus more aggressive optimizations, such as
    • prefetching, scalar replacement
    • loop and memory access transformations.
  • Loop unrolling, including instruction scheduling
  • Code replication to eliminate branches
  • Padding the size of certain power-of-two arrays to allow more efficient cache use.
  • O3 optimizations may not cause higher performance unless loop and memory access transformations take place.
  • O3 optimizations may slow down code in some cases compared to O2 optimizations.
  • O3 option is recommended for
    • loops that heavily use floating-point calculations
    • Loops that process large data sets.
O2 vs. O3
  • O2 alone already captures a significant share of the achievable performance
  • The additional gain from O3 depends on code constructs and memory-access optimizations
  • Both should be experimented with
Interprocedural Optimizations (-ip)
  • Interprocedural optimizations for single file compilation.
  • Subset of the full interprocedural optimizations, applied within a single file
  • e.g. Perform inline function expansion for calls to functions defined within the current source file.
Interprocedural Optimization (-ipo)
  • Multi-file IP optimizations that include:
    • inline function expansion
    • interprocedural constant propagation
    • dead code elimination
    • propagation of function characteristics
    • passing arguments in registers
    • loop-invariant code motion
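A minimal sketch of a multi-file -ipo build (file names hypothetical); -ipo has to appear on the link line as well, since that is where the cross-file optimization actually happens:

  ifort -O3 -ipo -c solver.f90
  ifort -O3 -ipo -c main.f90
  ifort -O3 -ipo -o app.x main.o solver.o    # link step performs the multi-file optimization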
Inlining
  • -inline-level=
    • control inline expansion:
    • n=0 disable inlining
    • n=1 no inlining (unless -ip specified)
    • n=2 inline any function, at the compiler's discretion (same as -ip)
  • -f[no-]inline-functions
    • inline any function at the compiler's discretion
  • -finline-limit=
    • set maximum number of statements to be considered for inlining
  • -no-inline-min-size
    • no size limit for inlining small routines
  • -no-inline-max-size
    • no size limit for inlining large routines
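One illustrative (not prescriptive) combination of these controls on top of -O3; the source file name is a placeholder:

  ifort -O3 -ip -inline-level=2 -finline-limit=200 -c physics_driver.f90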
Did Inlining, IPO and PGO Help?
  • Use selectively on bottlenecks
  • Better for small chunks of code
The –fast Option
  • Includes options that can improve run-time performance:
  • -O3 (maximum speed and high-level optimizations)
  • -ipo (enables interprocedural optimizations across files)
  • -xT (generate code specialized for Intel(R) Xeon(R) processors with SSE3, SSE4, etc.)
  • -static (statically link libraries at link time)
  • -no-prec-div (disable -prec-div, where -prec-div improves the precision of FP divides at some speed cost)
SSE and Vectorization
  • -xT: Intel(R) Core(TM)2 processor family with SSSE3
    • equivalent to -xSSSE3 in newer compiler versions
    • use for Harpertown
  • -xS: future Intel processors supporting SSE4.1
    • equivalent to -xSSE4.1 in newer compiler versions
    • Vectorizing Compiler and Media Accelerator instructions
  • -xSSE4.2 for Nehalem processors (SSE4.2 instructions)
  • -xSSE4.1 for Nehalem processors (SSE4.1 instructions)
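Putting the target switches side by side, a hedged sketch of building the same source for each node type (kernel.f90 is a placeholder; the -xSSE4.2 spelling assumes a compiler version that accepts it):

  ifort -O3 -xS      -c kernel.f90    # Harpertown (SSE4.1-capable Penryn core)
  ifort -O3 -xSSE4.2 -c kernel.f90    # Nehalem (SSE4.2)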
What is SSE4
  • SSE: Streaming SIMD Extensions (SSE, SSE2, SSE3)
  • SSSE3: Supplemental SSE3
  • SSE4.2 is first available in Core i7 (aka Nehalem)
    • SSE4 consists of 54 instructions divided into two major categories:
    • Vectorizing Compiler and Media Accelerators
    • Efficient Accelerated String and Text Processing
    • Graphics / video encoding and processing / 3-D imaging / gaming
    • High-performance applications
    • Efficient Accelerated String and Text Processing will benefit database and data mining applications, and those that use parsing, search, and pattern matching algorithms such as virus scanners and compilers
  • A subset of 47 instructions, SSE4.1, is available in Penryn (Core 2) / Harpertown
Vectorization (Intra register) -vec
  • void vecadd(float a[], float b[], float c[], int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
    }
  • The Intel compiler will transform the loop so that four floating-point additions occur simultaneously using the addps instruction. Simply put, in pseudo-vector notation the result looks something like this:
    for (i = 0; i < n; i += 4) {
        c[i:i+3] = a[i:i+3] + b[i:i+3];
    }
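To check whether the loop really vectorized, the example can be compiled with a vectorization report (a sketch; vecadd.c is assumed to contain the function above):

  icc -O3 -xS -vec-report3 -c vecadd.c
  # the report should flag the loop as vectorized, or explain which dependence prevented it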

OpenMP
  • -openmp
    • generate multi-threaded code based on the OpenMP* directives
  • -openmp-profile
    • link with the OpenMP profiling runtime to enable analysis of the OpenMP application
    • requires the Intel(R) Thread Profiler to be installed
  • -openmp-stubs
    • enables the user to compile OpenMP programs in sequential mode
    • OpenMP directives are ignored and a stub OpenMP library is linked
  • -openmp-report{0|1|2}
    • control the OpenMP parallelizer diagnostic level
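A minimal build-and-run sketch with these switches (program name hypothetical; setenv follows the csh style used elsewhere in these slides):

  ifort -O2 -openmp -openmp-report2 -o omp_app.x omp_app.f90
  setenv OMP_NUM_THREADS 8
  ./omp_app.x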
Auto Parallel (-parallel)
  • -parallel
  • generate multithreaded code for loops that can be safely executed in parallel.
  • Must use O2 or O3.
  • The default number of threads spawned equals the number of processors detected on the system where the binary runs
  • can be changed by setting the environment variable OMP_NUM_THREADS
  • -par-report is very useful
Auto Parallel Experiment Outcome
  • 8 cores
  • 6 MPI
  • OMP_NUM_THREADS=2
  • -stack_temps -safe_cray_ptr -i_dynamic -convert big_endian -assume byterecl -i4 -r8 -w95 -O3 -inline-level=2
    • Total runtime 415 seconds
  • -stack_temps -safe_cray_ptr -i_dynamic -convert big_endian -assume byterecl -ftz -i4 -r8 -w95 -O3 -inline-level=2 -parallel
    • Total runtime 594 seconds
  • Have to include -parallel in LDFLAGS
Profile Guided Optimization (PGO)
  • Traditional static compilation model
  • Optimization decisions based on only an estimate of important execution characteristics.
  • Branch probabilities are estimated by assuming
    • that controlling conditions testing equality are less likely to succeed than conditions testing inequality.
  • Relative execution counts are based on static properties such as nesting depth.
  • These estimated execution characteristics are subsequently used to make optimization decisions
    • such as selecting an appropriate instruction layout,
    • procedure inlining
    • generating a sequential and vector version of a loop.
  • The quality of such decisions can substantially improve if more accurate execution characteristics are available, which becomes possible under profile-guided optimization.
PGO steps
  • Phase 1 (Compile)
    • mpiifort -O3 -prof-gen -prof-dir dirx
  • Phase 2 (Run code, collect profile)
    • run_CubedSphere_BMK2.sh > BMK2.out 2>&1
    • Produces .dyn files
      • comp_fv/49e5dea6_12099.dyn etc.
  • Phase 3 (Recompile)
    • mpiifort -O3 -prof-use -prof-dir dirx
    • ipo: remark #11000: performing multi-file optimizations
    • ipo-1: remark #11004: starting multi-object compilation
  • Phase 4 (Re-run code)
    • rerun_CubedSphere_BMK2.sh > BMK2.out 2>&1
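The same four phases condensed into one hedged sketch (application, source, and profile-directory names are placeholders standing in for the CubedSphere build and run scripts above):

  mpiifort -O3 -prof-gen -prof-dir ./pgo_data -o app.x *.f90    # Phase 1: instrumented build
  mpiexec -np 8 ./app.x                                         # Phase 2: training run writes .dyn files into ./pgo_data
  mpiifort -O3 -prof-use -prof-dir ./pgo_data -o app.x *.f90    # Phase 3: recompile using the collected profiles
  mpiexec -np 8 ./app.x                                         # Phase 4: production run with the profile-guided binary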
PGO Outcome mxm
  • -O3 -prof-gen: 96 seconds
  • -O3 -prof-use: 10 seconds
  • -O2: 27 seconds
Optimization Reports
  • -vec-report[n]
    • control amount of vectorizer diagnostic information
    • n=3 reports vectorized and non-vectorized loops plus any prohibiting data dependence information
  • -opt-report [n]
    • generate an optimization report to stderr
    • n=3 maximum report output
  • -opt-report-file=
    • specify the filename for the generated report
  • -opt-report-routine=
    • reports on routines containing the given name
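For example, the reporting options can be combined on one compile line (a sketch; file and routine names are hypothetical):

  ifort -O3 -vec-report3 -opt-report 3 -opt-report-file=solver.rpt -opt-report-routine=solve -c solver.f90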
Summary of MPI options
  • Stacks available
    • OFED 1.3.1 / OFED 1.4.1
  • MPI Implementations
    • Intel MPI 3.2
    • MVAPICH 1.0.1
    • MVAPICH2 1.2.6
    • Open MPI 1.3.1
  • Compilers
    • Intel compilers
      • intel-cce-10.1.017
      • intel-fce-10.1.017
    • PGI
    • Pathscale
    • gcc
Which MPI flavor
  • intel-mpi-3.1-64bit
  • intel-openmpi-1.2.6
  • intel-mvapich-1.0.1
  • intel-mvapich2-1.2rc2
  • gcc-openmpi-1.2.6
  • gcc-mvapich-1.0.1
  • gcc-mvapich2-1.2rc2
  • pathscale-openmpi-1.2.6
  • pathscale-mvapich-1.0.1
  • pathscale-mvapich2-1.2rc2
  • pgi-openmpi-1
  • pgi-mvapich-1.0.1
  • pgi-mvapich2-1.2rc2
  • ofed-1.4-pgi-openmpi-1.2.8
  • ofed-1.4-pgi-mvapich-1.1.0
  • ofed-1.4-pgi-mvapich2-1.2p1
  • ofed-1.4-gcc-openmpi-1.2.8
  • ofed-1.4-gcc-mvapich-1.1.0
  • ofed-1.4-gcc-mvapich2-1.2p1
  • ofed-1.4-pathscale-openmpi-1.2.8
  • ofed-1.4-pathscale-mvapich-1.1.0
  • ofed-1.4-pathscale-mvapich2-1.2p1
OpenFabrics Enterprise Distribution (OFED) 1.3.1/1.4
  • The OpenFabrics Alliance software stacks, OFED 1.3.1/1.4.x
  • Goal: develop, distribute and promote a
    • unified, transport-independent, open-source software stack
    • for RDMA-capable fabrics and networks
    • InfiniBand and Ethernet
    • developed for many hardware architectures and operating systems
    • Linux and Windows
    • for server and storage clustering and grid connectivity
    • optimized for performance (i.e., BW, low latency)
    • using transport-offload technologies available in adapter hardware
MVAPICH
  • MVAPICH
    • (MPI-1 over OpenFabrics/Gen2, OpenFabrics/Gen2-UD, uDAPL, InfiniPath, VAPI and TCP/IP)
  • MPI-1 implementation
  • Based on MPICH and MVICH
  • The latest release is MVAPICH 1.1 (includes MPICH 1.2.7).
  • It is available under BSD licensing.
MVAPICH2
  • MVAPICH2
  • MPI-2 over OpenFabrics-IB, OpenFabrics-iWARP, uDAPL and TCP/IP
  • MPI-2 implementation which includes all MPI-1 features.
  • Based on MPICH2 and MVICH.
  • The latest release is MVAPICH2 1.2 (includes MPICH2 1.0.7).
Open MPI: Version 1.3.1
  • http://www.open-mpi.org
  • High performance message passing library
  • Open source MPI-2 implementation
  • Developed and maintained by a consortium of academic, research, and industry partners
  • Many OS supported
MPIEXEC options
  • Two major areas:
  • DEVICE
  • PINNING
RUNTIME MPI Issues
  • shm
    • Shared-memory only (no sockets)
  • ssm
    • Combined sockets + shared memory (for clusters with SMP nodes)
  • rdma
    • RDMA-capable network fabrics including InfiniBand*, Myrinet* (via DAPL*)
  • rdssm
    • Combined sockets + shared memory + DAPL*
    • for clusters with SMP nodes and RDMA-capable network fabrics
Typical mpiexec command
  • mpiexec -genv I_MPI_DEVICE rdssm \
        -genv I_MPI_PIN 1 \
        -genv I_MPI_PIN_PROCESSOR_LIST 0,2-3,4 \
        -np 16 -perhost 4 a.out
  • -genv X Y : associate env var X with value Y for all MPI ranks.
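To confirm where the ranks actually land, Intel MPI's debug output can be raised (a sketch assuming I_MPI_DEBUG behaves as documented for this Intel MPI release; the binary name is a placeholder):

  mpiexec -genv I_MPI_DEVICE rdssm -genv I_MPI_PIN 1 -genv I_MPI_DEBUG 4 -np 8 -perhost 8 ./a.out
  # at this debug level the startup output should include the rank-to-core pinning map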
MPI_DEVICE for CubedSphere
  • ssm
    • 250 seconds
  • rdma
    • 250 seconds
  • rdssm
    • 250 seconds
Task Affinity
  • taskset
    • taskset -c 0,1,4,5 …….
  • numactl
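Illustrative invocations (the binary name is a placeholder): taskset pins a process to an explicit core list, while numactl can additionally bind its memory allocations to one socket's local NUMA node.

  taskset -c 0,1,4,5 ./a.out                    # restrict the process to cores 0, 1, 4, 5
  numactl --cpunodebind=0 --membind=0 ./a.out   # run on NUMA node 0 and allocate memory there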
Interactive Tools for Monitoring
  • top
  • mpstat
  • vmstat
  • iostat
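Typical invocations while a job is running (the 5-second intervals are arbitrary choices):

  top               # interactive per-process CPU and memory view
  mpstat -P ALL 5   # per-core utilization every 5 seconds
  vmstat 5          # memory, swap, and run-queue summary every 5 seconds
  iostat -x 5       # extended per-device I/O statistics every 5 seconds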
SMT
  • BIOS option, set at boot time
  • Run 2 threads at the same time per core
  • Share resources (execution units)
  • Take advantage of 4-wide execution engine
  • Keep it fed with multiple threads
  • Hide latency of a single thread
  • Most power efficient performance feature
  • Very low die area cost
  • Can provide significant performance benefit depending on application
  • Much more efficient than adding an entire core
  • Implications for Out of Order executions
  • Might be good for MPI + OpenMP
  • Might lead to extra BW pressure and pressure on the L1/L2/L3 caches
SMT + MPI
  • NOAA NCEP GFS code T190 (240 hour simulation)
  • SMT OFF
    • 9709 seconds
  • SMT ON TURBO ON
    • 7276 seconds
TURBO
  • Turbo mode boosts operating frequency based on thermal headroom
  • when the processor is operating below its peak power,
    • increase the clock speed of the active cores by one or more bins to increase performance.
  • Common reasons for operating below peak power are
    • one or more cores may be powered down
    • the active workload draws relatively little power (e.g. no floating point, or few memory accesses).
  • Active cores can increase their clock frequency in relatively coarse increments of 133MHz speed bins,
    • depending on the SKU
    • the available power
    • thermal headroom
    • other environmental factors.
SMT + Hybrid MPI + OpenMP
  • 8 MPI tasks
  • OMP_NUM_THREADS=1
  • OMP_NUM_THREADS=2
  • Potentially a good way to exploit SMT (see the sketch below)
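A sketch of such a run on one 8-core (16-thread SMT) Nehalem node, assuming the code was built with -openmp; the binary name is a placeholder:

  mpiexec -genv I_MPI_DEVICE shm -genv OMP_NUM_THREADS 2 -np 8 -perhost 8 ./hybrid.x
  # 8 MPI ranks x 2 OpenMP threads each = 16 hardware threads on an SMT-enabled node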
MPI optimization
  • Affinity
  • Mapping Tasks to Nodes
  • Mapping Tasks to Cores
  • Barriers
  • Collectives
  • Environment variables
  • Partially/Fully loaded nodes
SHM vs. SSM
  • shm: 411 seconds
  • ssm: 424 seconds
Events available for Oprofile
  • CPU_CLK_UNHALTED:
  • UNHALTED_REFERENCE_CYCLES:
  • INST_RETIRED_ANY_P:
  • LLC_MISSES:
  • LLC_REFS:
  • BR_MISS_PRED_RETIRED:
PAPI issues
  • Agree 100% that performance tools are desperately needed
  • Our LTC team has been actively driving the distros to add support
  • End 2009: a decision was made to drive perfmon2 as the preferred method
  • Have had some success driving it into the next major releases, RHEL6 and SLES11
  • Unfortunately, we (possibly) missed the first release of SLES11, so it will be in SP1
  • That would be the first time we could officially support it installed
  • If we run with the kernel patch, problems have to be reproduced on a non-patched system
  • This has worked for POWER Linux users at some pretty large sites
  • Use TDS systems as the vehicle to carry the patches and do some performance testing
  • SCU5 without the PAPI patch and SCU6 with??
  • If a kernel problem occurs that needs to be reproduced, it could just be rerun on SCU5??
Oprofile
  • set CUR_DIR = `pwd`
  • sudo rm -rf samples
  • echo "*** shutdown ***"
  • sudo opcontrol --shutdown
  • echo "*** start-deamon ***"
  • sudo opcontrol --verbose=all --start-daemon --no-vmlinux --session-dir=${CUR_DIR} --separate=thread --callgraph=10 --event=LLC_REFS:10000 --image=$EXE
  • sudo opcontrol --status
  • echo "*** start ***"
  • sudo opcontrol --start
  • setenv OMP_NUM_THREADS 1
  • mpiexec -genv I_MPI_DEVICE shm -perhost 8 -n ${NUMPRO} ${EXE}
  • sudo opcontrol --stop
  • echo "*** shutdown ***"
  • sudo opcontrol --shutdown
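Once the daemon has been shut down, the samples can be summarized with opreport (a sketch reusing the session directory and executable variables from the script above):

  sudo opreport --session-dir=${CUR_DIR} --symbols --threshold=1 ${EXE}
  # per-symbol sample counts for the profiled event (LLC_REFS above), filtered to symbols with >= 1% of samples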
I/O optimization
  • High Performance Filesystems
    • Striped disks
    • GPFS (parallel filesystem)
  • MIO
MIOSTAT Statistics Collection
  • set MIOSTAT = /home/kghosh/vmio/tools/bin/miostats
  • $MIOSTAT -v ./c2l.x
MIO optimized code execution
  • setenv MIO /home/kghosh/vmio/tools
  • setenv MIO_LIBRARY_PATH $MIO"/BLD/xLinux.64/lib"
  • setenv LD_PRELOAD $MIO"/BLD/xLinux.64/lib/libTKIO.so"
  • setenv TKIO_ALTLIBX "fv.x[$MIO/BLD/xLinux.64/lib/get_MIO_ptrs_64.so/abort]"
  • setenv MIO_STATS "./MIO.%{PID}.stats"
  • setenv MIO_FILES "*.nc [ trace/stats/mbytes | pf/cache=2g/page=2m/pref=2 | trace/stats/mbytes | async/nthread=2/naio=100/nchild=1 | trace/stats/mbytes ]"

MIO with C2L (CubeToLatLon)
  • BEFORE
  • Timestamp @ Start 14:00:45 Cumulative time 0.000 sec
  • Timestamp @ Stop 14:08:25 Cumulative time 460.451 sec
  • AFTER
  • MIO_FILES = "*.nc [ trace/stats/mbytes | pf/cache=2g/page=2m/pref=2 | trace/stats/mbytes | async/nthread=2/naio=100/nchild=1 | trace/stats/mbytes ]"
  • Timestamp @ Start 14:31:53 Cumulative time 0.004 sec
  • Timestamp @ Stop 14:34:04 Cumulative time 130.618 sec
GPFS I/O
  • Timestamp @ Start 10:14:44 Cumulative time 0.012 sec
  • Timestamp @ Stop 10:15:12 Cumulative time 27.835 sec
Non-invasive MPI Trace Tool from IBM
  • No recompile needed
  • Uses PMPI layer
  • mpiifort $(LDFLAGS) libmpi_trace.a *.o -o a.out
MPI TRACE output
  • Data for MPI rank 62 of 128:
  • -----------------------------------------------------------------
  • MPI Routine #calls avg. bytes time(sec)
  • -----------------------------------------------------------------
  • MPI_Comm_size 1 0.0 0.000
  • MPI_Comm_rank 1 0.0 0.000
  • MPI_Isend 114554 4106.8 0.953
  • MPI_Irecv 114554 4117.5 0.188
  • MPI_Wait 229108 0.0 5.190
  • MPI_Bcast 28 11.1 0.039
  • MPI_Barrier 2 0.0 0.003
  • MPI_Reduce 2 8.0 0.000
  • -----------------------------------------------------------------
  • MPI task 62 of 128 had the median communication time.
  • total communication time = 6.373 seconds.
  • total elapsed time = 34.825 seconds.
  • user cpu time = 34.799 seconds.
  • system time = 0.002 seconds.
  • maximum memory size = 30120 KBytes.
MPI TRACE OUTPUT
  • Message size distributions:
  • MPI_Isend #calls avg. bytes time(sec)
  • 114252 4096.0 0.911
  • 302 8192.0 0.042
  • MPI_Irecv #calls avg. bytes time(sec)
  • 113954 4096.0 0.186
  • 600 8192.0 0.002
  • MPI_Bcast #calls avg. bytes time(sec)
  • 24 4.0 0.034
  • 2 8.0 0.006
  • 2 100.0 0.000
  • MPI_Reduce #calls avg. bytes time(sec)
  • 2 8.0 0.000