NASA NCCS APPLICATION PERFORMANCE DISCUSSION
  • Koushik Ghosh, Ph.D.
  • IBM Federal HPC
  • HPC Technical Specialist

IBM System x iDataPlex - Parallel Scientific Applications Development

April 22-23, 2009

Topics
  • HW/SW System Architecture
    • Platform/Chipset
    • Processor
    • Memory
    • Interconnect
  • Building Apps on System x
    • Compilation
    • MPI
  • Executing Apps on System x
    • Runtime options
  • Tools: Oprofile / MIO I/O Perf / MPI Trace
  • Discussion of NCCS apps
Compute Node
  • iDataPlex – 2U Flex
    • Intel Harpertown (Xeon L5200)
      • dual-socket, quad-core, 2.5 GHz, 50 W
      • SCU 3 and SCU 4
    • Nehalem
      • dual-socket, quad-core 2.8? GHz
      • SCU 5
cpuinfo (Harpertown) (/opt/intel/impi/3.1/bin64/cpuinfo)
  • Architecture : x86_64
  • Hyperthreading: disabled
  • Packages : 2
  • Cores : 8
  • Processors : 8
  • ===== Processor identification =====
  • Processor Thread Core Package
  • 0 0 0 1
  • 1 0 0 0
  • 2 0 2 0
  • 3 0 2 1
  • 4 0 1 0
  • 5 0 3 0
  • 6 0 1 1
  • 7 0 3 1
  • ===== Processor placement =====
  • Package Cores Processors
  • 1 0,2,1,3 0,3,6,7
  • 0 0,2,1,3 1,2,4,5
  • ===== Cache sharing =====
  • Cache Size Processors
  • L1 32 KB no sharing
  • L2 6 MB (0,6)(1,4)(2,5)(3,7)
cpuinfo (Nehalem) (/opt/intel/impi/3.2.0.011/bin64/cpuinfo)
  • Architecture : x86_64
  • Hyperthreading: enabled
  • Packages : 2
  • Cores : 8
  • Processors : 16
  • ===== Processor identification =====
  • Processor Thread Core Package
  • 0 0 0 0
  • 1 1 0 0
  • 2 0 1 0
  • 3 1 1 0
  • 4 0 2 0
  • 5 1 2 0
  • 6 0 3 0
  • 7 1 3 0
  • 8 0 0 1
  • 9 1 0 1
  • 10 0 1 1
  • 11 1 1 1
  • 12 0 2 1
  • 13 1 2 1
  • 14 0 3 1
  • 15 1 3 1
  • ===== Processor placement =====
  • Package Cores Processors
  • 0 0,1,2,3 (0,1)(2,3)(4,5)(6,7)
  • 1 0,1,2,3 (8,9)(10,11)(12,13)(14,15)
  • ===== Cache sharing =====
  • Cache Size Processors
  • L1 32 KB (0,1)(2,3)(4,5)(6,7)(8,9)(10,11)(12,13)(14,15)
  • L2 256 KB (0,1)(2,3)(4,5)(6,7)(8,9)(10,11)(12,13)(14,15)
  • L3 8 MB (0,1,2,3,4,5,6,7)(8,9,10,11,12,13,14,15)
cat /proc/cpuinfo (Harpertown)
  • processor : 0
  • vendor_id : GenuineIntel
  • cpu family : 6
  • model : 23
  • model name : Intel(R) Xeon(R) CPU E5472 @ 3.00GHz
  • stepping : 6
  • cpu MHz : 2992.509
  • cache size : 6144 KB
  • physical id : 1
  • siblings : 4
  • core id : 0
  • cpu cores : 4
  • fpu : yes
  • fpu_exception : yes
  • cpuid level : 10
  • wp : yes
  • flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
  • bogomips : 5988.95
  • clflush size : 64
  • cache_alignment : 64
  • address sizes : 38 bits physical, 48 bits virtual
  • power management:
/proc/cpuinfo (Nehalem)
  • processor : 0
  • vendor_id : GenuineIntel
  • cpu family : 6
  • model : 26
  • model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
  • stepping : 4
  • cpu MHz : 2927.000
  • cache size : 8192 KB
  • physical id : 0
  • siblings : 8
  • core id : 0
  • cpu cores : 4
  • apicid : 0
  • initial apicid : 0
  • fpu : yes
  • fpu_exception : yes
  • cpuid level : 11
  • wp : yes
  • flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
  • bogomips : 5866.08
  • clflush size : 64
  • cache_alignment : 64
  • address sizes : 40 bits physical, 48 bits virtual
  • power management:
cat /proc/meminfo
MemTotal: 24737232 kB

MemFree: 21152912 kB

Buffers: 77376 kB

Cached: 2230344 kB

SwapCached: 0 kB

Active: 1650908 kB

Inactive: 1720616 kB

Active(anon): 955796 kB

Inactive(anon): 0 kB

Active(file): 695112 kB

Inactive(file): 1720616 kB

Unevictable: 0 kB

Mlocked: 0 kB

SwapTotal: 2104472 kB

SwapFree: 2104472 kB

Dirty: 536 kB

Writeback: 0 kB

AnonPages: 955608 kB

Mapped: 28632 kB

Slab: 123752 kB

SReclaimable: 101028 kB

SUnreclaim: 22724 kB

PageTables: 5364 kB

NFS_Unstable: 0 kB

Bounce: 0 kB

WritebackTmp: 0 kB

CommitLimit: 14473088 kB

Committed_AS: 1156568 kB

VmallocTotal: 34359738367 kB

VmallocUsed: 337244 kB

VmallocChunk: 34359395451 kB

HugePages_Total: 0

HugePages_Free: 0

HugePages_Rsvd: 0

HugePages_Surp: 0

Hugepagesize: 2048 kB

DirectMap4k: 4096 kB

DirectMap2M: 25153536 kB

Meminfo explanation
  • High-Level Statistics
  • MemTotal: Total usable RAM (i.e. physical RAM minus a few reserved bits and the kernel binary code)
  • MemFree: The sum of LowFree + HighFree (overall statistic)
  • MemShared: 0; present only for compatibility reasons, always zero
  • Buffers: Memory in the buffer cache; mostly useless as a metric nowadays
  • Cached: Memory in the pagecache (disk cache) minus SwapCache
  • SwapCached: Memory that was once swapped out and has been swapped back in, but is still also in the swap file (if memory is needed it does not have to be swapped out AGAIN because it is already in the swap file; this saves I/O)
Memory
  • Memory on Harpertown Compute Node SCU3 and SCU4
    • 4 x 4GB (9W) PC2-5300 CL5 ECC DDR2 667MHz FBDIMMs
    • 16 GB per node
  • Memory on Nehalem Compute Node
    • 3 DDR3 channels on each socket / total of 8 DIMM slots
    • e.g. one 4 GB DIMM on each DDR3 channel (24 GB/node) at 1333 MHz
    • e.g. 18 GB per node (1066 MHz)
      • 2GB/2GB/2GB on channel1
      • 2GB/2GB/2GB on channel2
      • 1GB on channel 3
Interconnect
  • One Mellanox ConnectX dual-port 4X DDR InfiniBand HCA, PCIe 2.0 x8
  • Cisco 9024D 288-port 4X DDR InfiniBand switches for each scalable unit, cabled in the following manner:
    • 256 ports to compute nodes
    • 2 ports to spare compute nodes
    • 6 ports to service nodes
    • 24 ports uplinked to Tier 1 InfiniBand switch
  • “ConnectX” InfiniBand 4X DDR HCAs
  • 16 Gb/second of uni-directional peak MPI bandwidth
  • less than 2 microseconds MPI latency.
Nehalem Features
  • The Nehalem microarchitecture has many new features, some of which are present in the Core i7. The ones that represent significant changes from the Core 2 include:
  • The new LGA 1366 socket is incompatible with earlier processors.
  • On-die memory controller: the memory is directly connected to the processor. The controller sits in the "uncore" part of the chip and runs at a different clock (the uncore clock) than the execution cores.
    • Three channel memory: each channel can support one or two DDR3 DIMMs. Motherboards for Core i7 generally have three, four (3+1) or six DIMM slots.
    • Support for DDR3 only.
    • No ECC support.
  • The front side bus has been replaced by the Intel QuickPath Interconnect interface. Motherboards must use a chipset that supports QuickPath.
  • The following caches:
    • 32 KB L1 instruction and 32 KB L1 data cache per core
    • 256 KB L2 cache (combined instruction and data) per core
    • 8 MB L3 (combined instruction and data) "inclusive", shared by all cores
  • Single-die device: all four cores, the memory controller, and all cache are on a single die.
Nehalem Features contd.
  • "Turbo Boost" technology
    • allows all active cores to intelligently clock themselves up
    • in steps of 133 MHz over the design clock rate
    • as long as the CPU's predetermined thermal/electrical requirements are still met.
  • Re-implemented Hyper-threading.
    • Each of the four cores can process up to two threads simultaneously,
    • processor appears to the OS as eight CPUs.
    • This feature was dropped in the Core microarchitecture (Harpertown).
  • Only one QuickPath interface: not intended for multi-processor motherboards.
  • 45nm process technology.
  • 731M transistors.
  • 263 mm² die size.
  • Sophisticated power management places unused cores in a zero-power mode.
  • Support for SSE4.2 & SSE4.1 instruction sets.
I/O and Filesystem
  • /discover/home
  • /discover/nobackup
  • IBM Global Parallel File System (GPFS) used on all nodes
  • Serviced by 4 I/O nodes
  • read/write access from all nodes
  • /discover/home 2TB, /discover/nobackup 4 TB
    • Individual quotas:
      • /discover/home: 500 MB
      • /discover/nobackup: 100 GB
  • Very fast (peak: 350 MB/sec, normal: 150 - 250 MB/sec)
Software
  • OS: Linux (RHEL5.2)
  • Compilers:
    • Intel Fortran, C/C++
    • Math libs: BLAS, LAPACK, ScaLAPACK, MKL
  • MPI: MPI-2
  • Scheduler: PBSPro
LINUX pagesize
  • getconf PAGESIZE

4096

Which modules
  • modules loaded (64-bit compilers)
    • intel-cce-10.1.017
    • intel-fce-10.1.017
    • intel-mkl-10.0.3.020
    • intel-mpi-3.1-64bit
    • /opt/intel/fce/10.1.017/bin/ifort
    • /opt/intel/impi/3.1/bin64/mpiifort
  • modules loaded (32-bit compilers)
    • intel-cc-10.1.017
    • intel-fc-10.1.017
    • /opt/intel/fc/10.1.017/bin/ifort
    • /opt/intel/impi/3.1/bin/mpiifort
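As a rough sketch (assuming these packages are selected with the standard environment-modules command, which the slides do not state, and using a placeholder source file), picking the 64-bit tool chain and verifying which wrappers are found might look like:

  module load intel-cce-10.1.017 intel-fce-10.1.017 intel-mkl-10.0.3.020 intel-mpi-3.1-64bit
  which ifort       # should resolve to /opt/intel/fce/10.1.017/bin/ifort
  which mpiifort    # should resolve to /opt/intel/impi/3.1/bin64/mpiifort
  mpiifort -O2 -o hello.x hello.f90    # hello.f90 is a placeholder source file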
IFC Compiler Options of some Physics/Chemistry/Climate Applications
  • CubedSphere
    • -safe_cray_ptr -i_dynamic -convert big_endian -assume byterecl -ftz -i4 -r8 -O3 -xS
  • NPB3.2: -O3 -xT -ip -no-prec-div -ansi-alias -fno-alias
  • HPCC: -O2 -xT
  • GAMESS: -O3 -xS -ipo -i-static -fno-pic
  • GTC: -O1
  • CAM: -O3 -xT
  • MILC: -O3 -xT
  • PARATEC: -O3 -xS -ipo -i-static -fno-fnalias -fno-alias
  • STREAM: -O3 -opt-streaming-stores=always -xS -ip
  • SpecCPU2006: ?????
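Purely as an illustration of how one of these flag sets might be wired into a build (the file names are placeholders, and the slides do not prescribe a particular build system), using csh syntax as elsewhere in these slides:

  setenv FFLAGS "-safe_cray_ptr -i_dynamic -convert big_endian -assume byterecl -ftz -i4 -r8 -O3 -xS"
  mpiifort $FFLAGS -c fv_dynamics.F90      # placeholder CubedSphere source file
  mpiifort $FFLAGS -o cubed_sphere.x *.o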
Optimization Level O2 (-O2)
  • Inlining of intrinsics
  • Intra-file interprocedural optimizations:
    • inlining
    • constant propagation
    • forward substitution
    • routine attribute propagation
    • variable address-taken analysis
    • dead static function elimination
    • removal of unreferenced variables
  • Constant propagation
  • Copy propagation
  • Dead-code elimination
  • Global register allocation
  • Global instruction scheduling and control speculation
  • Loop unrolling
  • Optimized code selection
  • Partial redundancy elimination
  • Strength reduction/induction
  • Variable renaming
  • Exception handling optimizations
  • Tail recursions
  • Peephole optimizations
  • Structure assignment lowering and optimizations
  • Dead store elimination
Optimization Level O3 (-O3)
  • Enables O2 optimizations plus more aggressive optimizations, such as
    • prefetching, scalar replacement
    • loop and memory access transformations.
  • Loop unrolling, including instruction scheduling
  • Code replication to eliminate branches
  • Padding the size of certain power-of-two arrays to allow more efficient cache use.
  • O3 optimizations may not cause higher performance unless loop and memory access transformations take place.
  • O3 optimizations may slow down code in some cases compared to O2 optimizations.
  • O3 option is recommended for
    • loops that heavily use floating-point calculations
    • Loops that process large data sets.
O2 vs. O3
  • O2 alone already captures a significant share of the achievable performance
  • The additional gain from O3 depends on code constructs and memory-access optimizations
  • Both should be experimented with
Interprocedural Optimizations (-ip)
  • Interprocedural optimizations for single file compilation.
  • Subset of the full interprocedural optimizations, applied within a single file
  • e.g. Perform inline function expansion for calls to functions defined within the current source file.
Interprocedural Optimization (-ipo)
  • Multi-file IP optimizations that include:
    • inline function expansion
    • interprocedural constant propagation
    • dead code elimination
    • propagation of function characteristics
    • passing arguments in registers
    • loop-invariant code motion
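A minimal sketch of a multi-file -ipo build (file names hypothetical); -ipo has to appear on the link line as well, since that is where the cross-file optimization actually happens:

  ifort -O3 -ipo -c solver.f90
  ifort -O3 -ipo -c main.f90
  ifort -O3 -ipo -o app.x main.o solver.o    # link step performs the multi-file optimization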
Inlining
  • -inline-level=
    • control inline expansion:
    • n=0 disable inlining
    • n=1 no inlining (unless -ip specified)
    • n=2 inline any function, at the compiler's discretion (same as -ip)
  • -f[no-]inline-functions
    • inline any function at the compiler's discretion
  • -finline-limit=
    • set maximum number of statements to be considered for inlining
  • -no-inline-min-size
    • no size limit for inlining small routines
  • -no-inline-max-size
    • no size limit for inlining large routines
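One illustrative (not prescriptive) combination of these controls on top of -O3; the source file name is a placeholder:

  ifort -O3 -ip -inline-level=2 -finline-limit=200 -c physics_driver.f90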
Did Inlining, IPO and PGO Help?
  • Use selectively on bottlenecks
  • Better for small chunks of code
The –fast Option
  • Includes options that can improve run-time performance:
  • -O3 (maximum speed and high-level optimizations)
  • -ipo (enables interprocedural optimizations across files)
  • -xT (generate code specialized for Intel(R) Xeon(R) processors with SSE3, SSE4, etc.)
  • -static (statically link libraries at link time)
  • -no-prec-div (disable -prec-div, where -prec-div improves the precision of FP divides at some speed cost)
SSE and Vectorization
  • -xT: Intel(R) Core(TM)2 processor family with SSSE3
    • equivalent to -xSSSE3 in newer compiler versions
    • use for Harpertown
  • -xS: future Intel processors supporting SSE4.1
    • equivalent to -xSSE4.1 in newer compiler versions
    • Vectorizing Compiler and Media Accelerator instructions
  • -xSSE4.2 for Nehalem processors (SSE4.2 instructions)
  • -xSSE4.1 for Nehalem processors (SSE4.1 instructions)
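Putting the target switches side by side, a hedged sketch of building the same source for each node type (kernel.f90 is a placeholder; the -xSSE4.2 spelling assumes a compiler version that accepts it):

  ifort -O3 -xS      -c kernel.f90    # Harpertown (SSE4.1-capable Penryn core)
  ifort -O3 -xSSE4.2 -c kernel.f90    # Nehalem (SSE4.2)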
What is SSE4
  • SSE: Streaming SIMD Extensions (SSE, SSE2, SSE3)
  • SSSE3: Supplemental SSE3
  • SSE4.2 is first available in Core i7 (aka Nehalem)
    • SSE4 consists of 54 instructions divided into two major categories:
    • Vectorizing Compiler and Media Accelerators
    • Efficient Accelerated String and Text Processing
    • Graphics / video encoding and processing / 3-D imaging / gaming
    • High-performance applications
    • Efficient Accelerated String and Text Processing will benefit database and data mining applications, and those that use parsing, search, and pattern matching algorithms such as virus scanners and compilers
  • A subset of 47 instructions, SSE4.1, is available in Penryn (Core 2) / Harpertown
Vectorization (Intra register) -vec
  • void vecadd(float a[], float b[], float c[], int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
    }
  • The Intel compiler will transform the loop so that four floating-point additions occur simultaneously using the addps instruction. Simply put, in pseudo-vector notation the result looks something like this:
    for (i = 0; i < n; i += 4) {
        c[i:i+3] = a[i:i+3] + b[i:i+3];
    }
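To check whether the loop really vectorized, the example can be compiled with a vectorization report (a sketch; vecadd.c is assumed to contain the function above):

  icc -O3 -xS -vec-report3 -c vecadd.c
  # the report should flag the loop as vectorized, or explain which dependence prevented it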

OpenMP
  • -openmp
    • generate multi-threaded code based on the OpenMP* directives
  • -openmp-profile
    • link with the OpenMP profiling runtime to enable analysis of the OpenMP application
    • requires the Intel(R) Thread Profiler to be installed
  • -openmp-stubs
    • enables the user to compile OpenMP programs in sequential mode
    • OpenMP directives are ignored and a stub OpenMP library is linked
  • -openmp-report{0|1|2}
    • control the OpenMP parallelizer diagnostic level
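A minimal build-and-run sketch with these switches (program name hypothetical; setenv follows the csh style used elsewhere in these slides):

  ifort -O2 -openmp -openmp-report2 -o omp_app.x omp_app.f90
  setenv OMP_NUM_THREADS 8
  ./omp_app.x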
Auto Parallel (-parallel)
  • -parallel
  • generate multithreaded code for loops that can be safely executed in parallel.
  • Must use O2 or O3.
  • The default number of threads spawned equals the number of processors detected on the system where the binary runs
  • can be changed by setting the environment variable OMP_NUM_THREADS
  • -par-report is very useful
Auto Parallel Experiment Outcome
  • 8 cores
  • 6 MPI
  • OMP_NUM_THREADS=2
  • -stack_temps -safe_cray_ptr -i_dynamic -convert big_endian -assume byterecl -i4 -r8 -w95 -O3 -inline-level=2
    • Total runtime 415 seconds
  • -stack_temps -safe_cray_ptr -i_dynamic -convert big_endian -assume byterecl -ftz -i4 -r8 -w95 -O3 -inline-level=2 -parallel
    • Total runtime 594 seconds
  • Have to include -parallel in LDFLAGS
Profile Guided Optimization (PGO)
  • Traditional static compilation model
  • Optimization decisions based on only an estimate of important execution characteristics.
  • Branch probabilities are estimated by assuming
    • that controlling conditions testing equality are less likely to succeed than conditions testing inequality.
  • Relative execution counts are based on static properties such as nesting depth.
  • These estimated execution characteristics are subsequently used to make optimization decisions
    • such as selecting an appropriate instruction layout,
    • procedure inlining
    • generating a sequential and vector version of a loop.
  • The quality of such decisions can substantially improve if more accurate execution characteristics are available, which becomes possible under profile-guided optimization.
PGO steps
  • Phase 1 (Compile)
    • mpiifort -O3 -prof-gen -prof-dir dirx
  • Phase 2 (Run code, collect profile)
    • run_CubedSphere_BMK2.sh > BMK2.out 2>&1
    • Produces .dyn files
      • comp_fv/49e5dea6_12099.dyn etc.
  • Phase 3 (Recompile)
    • mpiifort -O3 -prof-use -prof-dir dirx
    • ipo: remark #11000: performing multi-file optimizations
    • ipo-1: remark #11004: starting multi-object compilation
  • Phase 4 (Re-run code)
    • rerun_CubedSphere_BMK2.sh > BMK2.out 2>&1
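The same four phases condensed into one hedged sketch (application, source, and profile-directory names are placeholders standing in for the CubedSphere build and run scripts above):

  mpiifort -O3 -prof-gen -prof-dir ./pgo_data -o app.x *.f90    # Phase 1: instrumented build
  mpiexec -np 8 ./app.x                                         # Phase 2: training run writes .dyn files into ./pgo_data
  mpiifort -O3 -prof-use -prof-dir ./pgo_data -o app.x *.f90    # Phase 3: recompile using the collected profiles
  mpiexec -np 8 ./app.x                                         # Phase 4: production run with the profile-guided binary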
PGO Outcome mxm
  • -O3 -prof-gen: 96 seconds
  • -O3 -prof-use: 10 seconds
  • -O2: 27 seconds
Optimization Reports
  • -vec-report[n]
    • control amount of vectorizer diagnostic information
    • n=3 reports vectorized and non-vectorized loops plus any prohibiting data dependence information
  • -opt-report [n]
    • generate an optimization report to stderr
    • n=3 maximum report output
  • -opt-report-file=
    • specify the filename for the generated report
  • -opt-report-routine=
    • reports on routines containing the given name
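For example, the reporting options can be combined on one compile line (a sketch; file and routine names are hypothetical):

  ifort -O3 -vec-report3 -opt-report 3 -opt-report-file=solver.rpt -opt-report-routine=solve -c solver.f90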
Summary of MPI options
  • Stacks available
    • OFED 1.3.1 / OFED 1.4.1
  • MPI Implementations
    • Intel MPI 3.2
    • MVAPICH 1.0.1
    • MVAPICH2 1.2.6
    • Open MPI 1.3.1
  • Compilers
    • Intel compilers
      • intel-cce-10.1.017
      • intel-fce-10.1.017
    • PGI
    • Pathscale
    • gcc
Which MPI flavor
  • intel-mpi-3.1-64bit
  • intel-openmpi-1.2.6
  • intel-mvapich-1.0.1
  • intel-mvapich2-1.2rc2
  • gcc-openmpi-1.2.6
  • gcc-mvapich-1.0.1
  • gcc-mvapich2-1.2rc2
  • pathscale-openmpi-1.2.6
  • pathscale-mvapich-1.0.1
  • pathscale-mvapich2-1.2rc2
  • pgi-openmpi-1
  • pgi-mvapich-1.0.1
  • pgi-mvapich2-1.2rc2
  • ofed-1.4-pgi-openmpi-1.2.8
  • ofed-1.4-pgi-mvapich-1.1.0
  • ofed-1.4-pgi-mvapich2-1.2p1
  • ofed-1.4-gcc-openmpi-1.2.8
  • ofed-1.4-gcc-mvapich-1.1.0
  • ofed-1.4-gcc-mvapich2-1.2p1
  • ofed-1.4-pathscale-openmpi-1.2.8
  • ofed-1.4-pathscale-mvapich-1.1.0
  • ofed-1.4-pathscale-mvapich2-1.2p1
OpenFabrics Enterprise Distribution (OFED) 1.3.1/1.4
  • The OpenFabrics Alliance software stacks, OFED 1.3.1/1.4.x
  • Goal: develop, distribute and promote a
    • unified, transport-independent, open-source software stack
    • for RDMA-capable fabrics and networks
    • InfiniBand and Ethernet
    • developed for many hardware architectures and operating systems
    • Linux and Windows
    • for server and storage clustering and grid connectivity
    • optimized for performance (i.e., BW, low latency)
    • using transport-offload technologies available in adapter hardware
MVAPICH
  • MVAPICH
    • (MPI-1 over OpenFabrics/Gen2, OpenFabrics/Gen2-UD, uDAPL, InfiniPath, VAPI and TCP/IP)
  • MPI-1 implementation
  • Based on MPICH and MVICH
  • The latest release is MVAPICH 1.1 (includes MPICH 1.2.7).
  • It is available under BSD licensing.
MVAPICH2
  • MVAPICH2
  • MPI-2 over OpenFabrics-IB, OpenFabrics-iWARP, uDAPL and TCP/IP
  • MPI-2 implementation which includes all MPI-1 features.
  • Based on MPICH2 and MVICH.
  • The latest release is MVAPICH2 1.2 (includes MPICH2 1.0.7).
Open MPI: Version 1.3.1
  • http://www.open-mpi.org
  • High performance message passing library
  • Open source MPI-2 implementation
  • Developed and maintained by a consortium of academic, research, and industry partners
  • Many OS supported
MPIEXEC options
  • Two major areas:
  • DEVICE
  • PINNING
RUNTIME MPI Issues
  • shm
    • Shared-memory only (no sockets)
  • ssm
    • Combined sockets + shared memory (for clusters with SMP nodes)
  • rdma
    • RDMA-capable network fabrics including InfiniBand*, Myrinet* (via DAPL*)
  • rdssm
    • Combined sockets + shared memory + DAPL*
    • for clusters with SMP nodes and RDMA-capable network fabrics
Typical mpiexec command
  • mpiexec -genv I_MPI_DEVICE rdssm \
        -genv I_MPI_PIN 1 \
        -genv I_MPI_PIN_PROCESSOR_LIST 0,2-3,4 \
        -np 16 -perhost 4 a.out
  • -genv X Y : associate env var X with value Y for all MPI ranks.
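To confirm where the ranks actually land, Intel MPI's debug output can be raised (a sketch assuming I_MPI_DEBUG behaves as documented for this Intel MPI release; the binary name is a placeholder):

  mpiexec -genv I_MPI_DEVICE rdssm -genv I_MPI_PIN 1 -genv I_MPI_DEBUG 4 -np 8 -perhost 8 ./a.out
  # at this debug level the startup output should include the rank-to-core pinning map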
MPI_DEVICE for CubedSphere
  • ssm
    • 250 seconds
  • rdma
    • 250 seconds
  • rdssm
    • 250 seconds
Task Affinity
  • taskset
    • taskset -c 0,1,4,5 …….
  • numactl
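Illustrative invocations (the binary name is a placeholder): taskset pins a process to an explicit core list, while numactl can additionally bind its memory allocations to one socket's local NUMA node.

  taskset -c 0,1,4,5 ./a.out                    # restrict the process to cores 0, 1, 4, 5
  numactl --cpunodebind=0 --membind=0 ./a.out   # run on NUMA node 0 and allocate memory there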
Interactive Tools for Monitoring
  • top
  • mpstat
  • vmstat
  • iostat
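Typical invocations while a job is running (the 5-second intervals are arbitrary choices):

  top               # interactive per-process CPU and memory view
  mpstat -P ALL 5   # per-core utilization every 5 seconds
  vmstat 5          # memory, swap, and run-queue summary every 5 seconds
  iostat -x 5       # extended per-device I/O statistics every 5 seconds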
SMT
  • BIOS option, set at boot time
  • Run 2 threads at the same time per core
  • Share resources (execution units)
  • Take advantage of 4-wide execution engine
  • Keep it fed with multiple threads
  • Hide latency of a single thread
  • Most power efficient performance feature
  • Very low die area cost
  • Can provide significant performance benefit depending on application
  • Much more efficient than adding an entire core
  • Implications for Out of Order executions
  • Might be good for MPI + OpenMP
  • Might lead to extra BW pressure and pressure on the L1/L2/L3 caches
SMT + MPI
  • NOAA NCEP GFS code T190 (240 hour simulation)
  • SMT OFF
    • 9709 seconds
  • SMT ON TURBO ON
    • 7276 seconds
TURBO
  • Turbo mode boosts operating frequency based on thermal headroom
  • when the processor is operating below its peak power,
    • increase the clock speed of the active cores by one or more bins to increase performance.
  • Common reasons for operating below peak power are
    • one or more cores may be powered down
    • the active workload draws relatively little power (e.g. no floating point, or few memory accesses).
  • Active cores can increase their clock frequency in relatively coarse increments of 133MHz speed bins,
    • depending on the SKU
    • the available power
    • thermal headroom
    • other environmental factors.
SMT + Hybrid MPI + OpenMP
  • 8 MPI tasks
  • OMP_NUM_THREADS=1
  • OMP_NUM_THREADS=2
  • Potentially a good way to exploit SMT (see the sketch below)
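A sketch of such a run on one 8-core (16-thread SMT) Nehalem node, assuming the code was built with -openmp; the binary name is a placeholder:

  mpiexec -genv I_MPI_DEVICE shm -genv OMP_NUM_THREADS 2 -np 8 -perhost 8 ./hybrid.x
  # 8 MPI ranks x 2 OpenMP threads each = 16 hardware threads on an SMT-enabled node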
MPI optimization
  • Affinity
  • Mapping Tasks to Nodes
  • Mapping Tasks to Cores
  • Barriers
  • Collectives
  • Environment variables
  • Partially/Fully loaded nodes
SHM vs. SSM
  • shm: 411 seconds
  • ssm: 424 seconds
Events available for Oprofile
  • CPU_CLK_UNHALTED:
  • UNHALTED_REFERENCE_CYCLES:
  • INST_RETIRED_ANY_P:
  • LLC_MISSES:
  • LLC_REFS:
  • BR_MISS_PRED_RETIRED:
PAPI issues
  • Agree 100% that performance tools are desperately needed
  • Our LTC team has been actively driving the distros to add support
  • End 2009: a decision was made to drive perfmon2 as the preferred method
  • Have had some success driving it into the next major releases, RHEL6 and SLES11
  • Unfortunately, we (possibly) missed the first release of SLES11, so it will be in SP1
  • That would be the first time we could officially support it installed
  • If we run with the kernel patch, problems have to be reproduced on a non-patched system
  • This has worked for POWER Linux users at some pretty large sites
  • Use TDS systems as the vehicle to carry the patches and do some performance testing
  • SCU5 without the PAPI patch and SCU6 with??
  • If a kernel problem occurs that needs to be reproduced, it could just be rerun on SCU5??
Oprofile
  • set CUR_DIR = `pwd`
  • sudo rm -rf samples
  • echo "*** shutdown ***"
  • sudo opcontrol --shutdown
  • echo "*** start-deamon ***"
  • sudo opcontrol --verbose=all --start-daemon --no-vmlinux --session-dir=${CUR_DIR} --separate=thread --callgraph=10 --event=LLC_REFS:10000 --image=$EXE
  • sudo opcontrol --status
  • echo "*** start ***"
  • sudo opcontrol --start
  • setenv OMP_NUM_THREADS 1
  • mpiexec -genv I_MPI_DEVICE shm -perhost 8 -n ${NUMPRO} ${EXE}
  • sudo opcontrol --stop
  • echo "*** shutdown ***"
  • sudo opcontrol --shutdown
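Once the daemon has been shut down, the samples can be summarized with opreport (a sketch reusing the session directory and executable variables from the script above):

  sudo opreport --session-dir=${CUR_DIR} --symbols --threshold=1 ${EXE}
  # per-symbol sample counts for the profiled event (LLC_REFS above), filtered to symbols with >= 1% of samples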
I/O optimization
  • High Performance Filesystems
    • Striped disks
    • GPFS (parallel filesystem)
  • MIO
MIOSTAT Statistics Collection
  • set MIOSTAT = /home/kghosh/vmio/tools/bin/miostats
  • $MIOSTAT -v ./c2l.x
MIO optimized code execution
  • setenv MIO /home/kghosh/vmio/tools
  • setenv MIO_LIBRARY_PATH $MIO"/BLD/xLinux.64/lib"
  • setenv LD_PRELOAD $MIO"/BLD/xLinux.64/lib/libTKIO.so"
  • setenv TKIO_ALTLIBX "fv.x[$MIO/BLD/xLinux.64/lib/get_MIO_ptrs_64.so/abort]"
  • setenv MIO_STATS "./MIO.%{PID}.stats"
  • setenv MIO_FILES "*.nc [ trace/stats/mbytes | pf/cache=2g/page=2m/pref=2 | trace/stats/mbytes | async/nthread=2/naio=100/nchild=1 | trace/stats/mbytes ]"

MIO with C2L (CubeToLatLon)
  • BEFORE
  • Timestamp @ Start 14:00:45 Cumulative time 0.000 sec
  • Timestamp @ Stop 14:08:25 Cumulative time 460.451 sec
  • AFTER
  • MIO_FILES = "*.nc [ trace/stats/mbytes | pf/cache=2g/page=2m/pref=2 | trace/stats/mbytes | async/nthread=2/naio=100/nchild=1 | trace/stats/mbytes ]"
  • Timestamp @ Start 14:31:53 Cumulative time 0.004 sec
  • Timestamp @ Stop 14:34:04 Cumulative time 130.618 sec
GPFS I/O
  • Timestamp @ Start 10:14:44 Cumulative time 0.012 sec
  • Timestamp @ Stop 10:15:12 Cumulative time 27.835 sec
Non-invasive MPI Trace Tool from IBM
  • No recompile needed
  • Uses PMPI layer
  • mpiifort $(LDFLAGS) libmpi_trace.a *.o -o a.out
MPI TRACE output
  • Data for MPI rank 62 of 128:
  • -----------------------------------------------------------------
  • MPI Routine #calls avg. bytes time(sec)
  • -----------------------------------------------------------------
  • MPI_Comm_size 1 0.0 0.000
  • MPI_Comm_rank 1 0.0 0.000
  • MPI_Isend 114554 4106.8 0.953
  • MPI_Irecv 114554 4117.5 0.188
  • MPI_Wait 229108 0.0 5.190
  • MPI_Bcast 28 11.1 0.039
  • MPI_Barrier 2 0.0 0.003
  • MPI_Reduce 2 8.0 0.000
  • -----------------------------------------------------------------
  • MPI task 62 of 128 had the median communication time.
  • total communication time = 6.373 seconds.
  • total elapsed time = 34.825 seconds.
  • user cpu time = 34.799 seconds.
  • system time = 0.002 seconds.
  • maximum memory size = 30120 KBytes.
MPI TRACE OUTPUT
  • Message size distributions:
  • MPI_Isend #calls avg. bytes time(sec)
  • 114252 4096.0 0.911
  • 302 8192.0 0.042
  • MPI_Irecv #calls avg. bytes time(sec)
  • 113954 4096.0 0.186
  • 600 8192.0 0.002
  • MPI_Bcast #calls avg. bytes time(sec)
  • 24 4.0 0.034
  • 2 8.0 0.006
  • 2 100.0 0.000
  • MPI_Reduce #calls avg. bytes time(sec)
  • 2 8.0 0.000