

  1. Performance Issues: An Application Programmer's View. John Cownie, HPC Benchmark Engineer

  2. Agenda
  • Running 64-bit and 32-bit codes under AMD64 (SuSE Linux)
  • FPU and memory performance issues
  • OS and memory layout (1P, 2P, 4P)
  • Spectrum of application needs: memory vs. CPU
  • Benchmark examples: STREAM, HPL
  • Some real applications
  • Conclusions

  3. 64-bit OS & Application Interaction
  [Diagram: USER/KERNEL split – a 32-bit application (4GB expanded address space) and a 64-bit application (512GB or 8TB address space) both running as threads over a 64-bit operating system with 64-bit device drivers; 32-bit system calls pass through a translation layer.]
  32-bit Compatibility Mode
  • The 64-bit OS runs existing 32-bit apps with leading-edge performance.
  • No recompile required; the 32-bit code is directly executed by the CPU.
  • The 64-bit OS provides 32-bit libraries and a "thunking" translation layer for 32-bit system calls.
  64-bit Mode
  • Migrate only where warranted, and at the user's pace, to fully exploit AMD64.
  • A 64-bit OS requires all kernel-level programs & drivers to be ported.
  • Any program that is linked or plugged in to a 64-bit program (ABI level) must be ported to 64 bits.

  4. Increased Memory for 32-bit Applications
  [Diagram: virtual-memory maps contrasting a 32-bit server holding 4 GB DRAM with a 64-bit server holding 12 GB DRAM.]
  32-bit server, 4 GB DRAM:
  • The OS & each app share the small 32-bit VM space (app 0-2 GB, OS 2-4 GB).
  • The 32-bit OS & applications all share the 4GB of DRAM.
  • Leads to small dataset sizes & lots of paging.
  64-bit server, 12 GB DRAM:
  • Each 32-bit app has exclusive (not shared) use of its 32-bit VM space.
  • The 64-bit OS can allocate each application large dedicated portions of the 12GB DRAM.
  • The OS uses VM space way above 32 bits (the map extends to 256 TB).
  • Leads to larger dataset sizes & reduced paging.

  5. AMD64 Developer Tools
  Optimized compilers are reaching production quality. SPECint2000 on a 1.8 GHz AMD Opteron processor (http://www.aceshardware.com/):

  Compiler            OS                     Base   Peak
  Intel C/C++ 7.0     Windows Server 2003    1095   1170
  Intel C/C++ 7.0     Linux/x86-64           1081   1108
  Intel C/C++ 7.0     Linux (32-bit)         1062   1100
  GCC 3.3 (64-bit)    Linux/x86-64           1045
  GCC 3.3 (32-bit)    Linux/x86-64            980
  GCC 3.3 (32-bit)    Linux (32-bit)          960

  • GNU compilers: GCC 3.2.2 (32-bit and 64-bit); GCC 3.3 (optimized 64-bit)
  • PGI Workstation 5.0 beta: Windows and 64-bit Linux compilers; optimized Fortran 77/90, C, C++
  • Good flags: -O2 -fastsse

  6. Running 32-bit and 64-bit codes
  • With 64-bit addresses, memory can now be big (>4GB…):
    boris@quartet4:~> more /proc/meminfo
    total: used: free: shared: buffers: cached:
    Mem: 7209127936 285351936 6923776000 0 45547520 140636160
    Swap: 1077469184 0 1077469184
    MemTotal: 7040164 kB
  • The OS has both 32-bit and 64-bit libraries…
    boris@quartet4:/usr> ls
    X11 bin games lib local sbin src x86_64-suse-linux
    X11R6 include lib64 pgi share tmp
  • For gcc, 64-bit addressing is the default; use -m32 for 32-bit…
  • (Don't confuse 64-bit floating-point data operations with addressing and pointer lengths.)

  7. Running a 32-bit program
    boris@quartet4:~/c> gcc -m32 -o test_32bit sizeof_test.c
    boris@quartet4:~/c> ./test_32bit
    char is 1
    short is 2
    int is 4
    long is 4
    long long is 8
    unsigned long long is 8
    float is 4
    double is 8
    int * is 4

  8. Running a 64-bit program
  • Pointers (and long) are now 64 bits long:
    boris@quartet4:~/c> gcc -o test_64bit sizeof_test.c
    boris@quartet4:~/c> ./test_64bit
    char is 1
    short is 2
    int is 4
    long is 8
    long long is 8
    unsigned long long is 8
    float is 4
    double is 8
    int * is 8
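
  The slides show only the output of sizeof_test.c, not its source. A minimal sketch of what such a program could look like (the file name comes from the slides; the body is an assumed reconstruction):

    /* sizeof_test.c - print the sizes of the basic C types.
       Under the LP64 model used by 64-bit Linux, long and pointers
       grow to 8 bytes; compiled with -m32 they stay at 4. */
    #include <stdio.h>

    int main(void)
    {
        printf("char is %d\n", (int)sizeof(char));
        printf("short is %d\n", (int)sizeof(short));
        printf("int is %d\n", (int)sizeof(int));
        printf("long is %d\n", (int)sizeof(long));
        printf("long long is %d\n", (int)sizeof(long long));
        printf("unsigned long long is %d\n", (int)sizeof(unsigned long long));
        printf("float is %d\n", (int)sizeof(float));
        printf("double is %d\n", (int)sizeof(double));
        printf("int * is %d\n", (int)sizeof(int *));
        return 0;
    }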

  9. Compilers and flags
  • Intel icc/ifc: for 32-bit code compiled on the 64-bit OS, use -W,-m elf_i386 to tell the linker to use the 32-bit libraries.
  • Intel icc/ifc: avoid -xaW (it tests the CPU id); use -xW to enable SSE.
  • PGI pgcc/pgf90: the vectorizer generates prefetch instructions, and -Mnontemporal gives streaming-store instructions.
  • Absoft f90 looks promising.
  • GNU g77: the front end is limited, but the GNU backend is good.
  • GNU gcc 3.3 is best (the gcc33 perf rpm has faster libraries?).
  • GNU g++: good common code generator.
  • GNU gcc 3.2: good.
  The more compilers the better!

  10. The Portland Group Compiler Technology: PGI Compiler Additional Features
  • Plans to bundle useful libraries (see www.spec.org for SPEC numbers…):
  • MPICH – pre-configured libraries and utilities for ethernet-based x86 and AMD64/Linux clusters
  • PBS – Portable Batch System batch-queuing from NASA Ames and MRJ Technologies
  • ScaLAPACK – pre-compiled distributed-memory parallel math library
  • ACML – the AMD Core Math Library is planned to be included
  • Training – tutorials (OSC), exercises, examples and benchmarks for MPI, OpenMP and HPF programming

  11. Open Source Tools
  GNU means "GNU's Not UNIX™" and is the primary project of the Free Software Foundation (FSF), a non-profit organization committed to the creation of a large body of useful, free, source-code-available software.

  12. 32-bit vs 64-bit App Performance

  13. BLAS libraries
  • 3 different BLAS libraries support 32-bit and 64-bit code:
  • ACML (includes FFTs)
  • ATLAS
  • Goto
  • Currently Goto has the fastest DGEMM: ~88% of peak on 1P HPL.
  • Compare with BLASBENCH and pick the best one for your application (a call sketch follows below).
  • For FFTs, also consider FFTW.
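
  As an illustration of what "picking a BLAS" means in code, here is a minimal sketch of a DGEMM call through the standard CBLAS interface (the slides name the libraries but show no calling code; cblas.h and the link line are assumptions that depend on which library you install):

    /* dgemm_example.c - C = 1.0*A*B + 0.0*C for 3x3 matrices through
       the portable CBLAS interface, e.g. with ATLAS:
       gcc dgemm_example.c -lcblas -latlas */
    #include <stdio.h>
    #include <cblas.h>

    int main(void)
    {
        double A[9] = {1,2,3, 4,5,6, 7,8,9};
        double B[9] = {9,8,7, 6,5,4, 3,2,1};
        double C[9] = {0};
        int i;

        /* Row-major storage, no transposes, M = N = K = 3. */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    3, 3, 3, 1.0, A, 3, B, 3, 0.0, C, 3);

        for (i = 0; i < 9; i++)
            printf("%6.1f%s", C[i], (i % 3 == 2) ? "\n" : " ");
        return 0;
    }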

  14. Optimized Numerical Libraries: ACML
  • AMD and The Numerical Algorithms Group (NAG) jointly developed the AMD Core Math Library (ACML).
  • ACML includes:
  • Basic Linear Algebra Subroutines (BLAS) levels 1, 2 and 3
  • A wide variety of Fast Fourier Transforms (FFTs)
  • Linear Algebra Package (LAPACK)
  • ACML has:
  • Fortran and C interfaces
  • Highly optimized routines for the AMD64 instruction set
  • The ability to address single-, double-, single-complex and double-complex data types
  • It will be available for commercially available OSs.
  • ACML is freely downloadable from www.developwithamd.com/acml

  15. DGEMM relative performance
  [Chart: K. Goto's DGEMM reaches 88% of peak FPU performance.]

  16. Floating Point and Memory Performance

  17. Register Differences: x86-64 / x86-32
  [Diagram: the x86 register file (EAX with AL/AH, EDI, EIP, the x87/MMX0-7 stack) and the x86-64 additions – 64-bit RAX etc., R8-R15, and XMM8-XMM15 alongside the existing 128-bit SSE registers.]
  x86-64 adds:
  • 64-bit integer registers
  • 48-bit virtual addresses
  • 40-bit physical addresses
  • REX – register extensions:
  • Sixteen 64-bit integer registers
  • Sixteen 128-bit SSE registers
  • SSE2 instruction set:
  • Double-precision scalar and vector operations
  • 16x8- and 8x16-way vector MMX operations
  • (SSE1 was already added with the AMD Athlon MP.)

  18. Floating point hardware
  • 4-cycle-deep pipeline
  • Separate multiply and add paths
  • 64-bit (double precision): 2 flops/cycle (1 mul + 1 add)
  • 32-bit (single precision): 4 flops/cycle (2 muls + 2 adds)
  • A 2.0GHz clock (1 cycle = 0.5ns) gives…
  • Theoretical peak of 4Gflops double precision
  • SSE registers are 128 bits wide… but packed instructions only help single precision.
  • The pipeline depth and the separate mul/add paths mean that even register-to-register code is helped by loop unrolling (see the sketch below).
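
  To make the unrolling point concrete, here is a hedged sketch (not from the slides) of the kind of unrolled loop that keeps the separate multiply and add pipes busy:

    /* Dot product unrolled by four: the independent partial sums give
       the 4-cycle-deep FPU pipeline fresh work every cycle, so the
       multiplier and adder can run in parallel.  For brevity, n is
       assumed to be a multiple of 4. */
    double dot_unrolled(const double *x, const double *y, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;
        for (i = 0; i < n; i += 4) {
            s0 += x[i]     * y[i];
            s1 += x[i + 1] * y[i + 1];
            s2 += x[i + 2] * y[i + 2];
            s3 += x[i + 3] * y[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }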

  19. AMD Opteron™ Processor Architecture
  [Diagram: CPU core with system request queue (SRQ) and crossbar (XBAR); integrated memory controller (MCT) driving 128-bit DDR333 DRAM at 5.3 GB/s; three coherent HyperTransport™ links, each 3.2 GB/s per direction at a 1600 MT/s data rate (800 MHz dual data rate, 6.4 GB/s total).]

  20. Main memory hardware
  • Dual on-chip memory controllers
  • Remote memory is accessed via HyperTransport.
  • Bandwidth scales with more processors.
  • Latency is very good on 1P, 2P and 4P (cache probes on 2P, 4P).
  • Local memory latency is less than remote memory latency:
  • 2P machine: 1 hop (worst case)
  • 4P machine: 2 hops (worst case)
  • Memory DIMMs: 333MHz (PC2700) or 266MHz (PC2100)
  • Can interleave memory banks (BIOS)
  • Can interleave processor memory (BIOS)

  21. Integrated Memory Controller Performance
  Peak memory bandwidth:

  Memory          64-Bit DCT   128-Bit DCT
  DDR200 PC1600   1.6GB/s      3.2GB/s
  DDR266 PC2100   2.1GB/s      4.2GB/s
  DDR333 PC2700   2.7GB/s      5.33GB/s

  • Peak bandwidth and latency: performance improves by almost 20% compared to the AMD Athlon™ topology.
  • Idle latencies to first data:
  • 1P system: <80ns
  • 0-hop in a DP system: <80ns
  • 0-hop in a 4P system: ~100ns
  • 1-hop in an MP system: <115ns
  • 2-hop in an MP system: <150ns
  • 3-hop in an MP system: <190ns

  22. Integrated Memory Controller: Local versus Remote Memory Access
  • 0 hop: local memory access
  • 1 hop: remote-1 memory access
  • 2 hops: remote-2 memory access
  • Probe: coherency request between nodes
  • Diameter: maximum hop count between any pair of nodes
  • Average distance: average hop count between nodes
  [Diagram: 4P topologies (P0-P3) illustrating 0-hop, 1-hop and 2-hop accesses.]

  23. MP Memory Bandwidth Scalability

  24. How should the OS allocate memory?
  • To maximize local accesses?
  • To get the best bandwidth across all processors?
  • Different applications have different needs:
  • Scientific MPI codes already have a model of networked 1P machines.
  • Enterprise codes, databases and web servers have lots of threads; maybe throughput with scrambled (interleaved) memory is best?
  • SMP kernel plus processor interleaving
  • NUMA kernel plus memory-bank interleaving
  • The SuSE numactl utility allows a policy choice per process (see the sketch below).
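
  Besides the numactl command line, a process can choose a policy itself through the libnuma API that ships with the numactl package. A minimal sketch, assuming libnuma and its header are installed (link with -lnuma):

    /* numa_policy.c - pick a NUMA placement per allocation.
       numa_alloc_local() puts pages on the calling CPU's node
       (MPI-style locality); numa_alloc_interleaved() spreads pages
       round-robin over all nodes (throughput-style scrambling). */
    #include <stdio.h>
    #include <numa.h>

    int main(void)
    {
        size_t sz = 64 * 1024 * 1024;
        double *local, *spread;

        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        printf("nodes: 0..%d\n", numa_max_node());

        local  = numa_alloc_local(sz);        /* pages on my node   */
        spread = numa_alloc_interleaved(sz);  /* pages on all nodes */

        /* ... use the buffers ... */

        numa_free(local, sz);
        numa_free(spread, sz);
        return 0;
    }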

  25. MPI libraries (where do the shmem buffers go?)
  • Argonne MPICH: compile with gcc -O2 -funroll-all-loops
  • Myrinet, Quadrics and Dolphin also have MPI libraries based on MPICH.
  • On an MP machine, MPI uses shared memory within the box.
  • Where do the MPI message buffers go?
  • Currently MPICH just mallocs one chunk of space for all processors.
  • OK for 2P… not so good for a 4P machine (2 hops worst case… contention).
  • There is scope to improve MPI performance on a 4P NUMA machine with better buffer placement… (a ping-pong sketch follows below).
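
  A standard way to see what buffer placement costs is a ping-pong microbenchmark between two ranks on the same box; a minimal sketch using only core MPI calls (the message size and repetition count are arbitrary choices):

    /* pingpong.c - average round-trip time between ranks 0 and 1.
       Run with both ranks inside one MP box to time the shared-memory
       path; on a NUMA machine the answer shifts with where the
       processes and buffers land.
       Build: mpicc pingpong.c -o pingpong; run: mpirun -np 2 pingpong */
    #include <stdio.h>
    #include <mpi.h>

    #define NBYTES (64 * 1024)
    #define REPS   1000

    int main(int argc, char **argv)
    {
        static char buf[NBYTES];
        int rank, i;
        double t0, t1;
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        t0 = MPI_Wtime();
        for (i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("round trip: %.1f us\n", 1e6 * (t1 - t0) / REPS);
        MPI_Finalize();
        return 0;
    }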

  26. Synthetic Benchmarks

  27. Spectrum of application needs
  • Some codes are memory limited (BIG data).
  • Others are CPU bound (kernels and SMALL data).
  • ALL memory codes like low latency to memory!
  • Example in the BLAS libraries:
  • BLAS1 vector-vector … memory intensive … STREAM
  • BLAS2 matrix-vector
  • BLAS3 matrix-matrix … CPU intensive … DGEMM, HPL
  • A faster CPU will not help memory-bound problems…
  • PREFETCH can sometimes hide memory latency (on predictable memory access patterns); a sketch follows below.
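
  As an illustration of software prefetch on a predictable pattern, here is a hedged sketch using GCC's __builtin_prefetch (the prefetch distance of 64 elements is a tuning guess, not a value from the slides):

    /* Sum a large array while prefetching ahead of the loads.  The
       sequential access pattern is predictable, so the prefetches can
       overlap main-memory latency with the running computation. */
    #define PF_DIST 64   /* elements to run ahead; tune per machine */

    double sum_prefetch(const double *a, long n)
    {
        double s = 0.0;
        long i;
        for (i = 0; i < n; i++) {
            if (i + PF_DIST < n)
                __builtin_prefetch(&a[i + PF_DIST], 0, 0);  /* read, low locality */
            s += a[i];
        }
        return s;
    }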

  28. LMbench (memory latency)
  • LMbench has published comparative numbers.
  • The fastest latency is on the 1P Opteron machine – no HT cache probes.
  • Note that HP sometimes reports a "within cache line stride" latency number of 110.6. Opteron's "within cache line stride" latencies are 25ns for 16 bytes and 53ns for 32 bytes.

  29. Measured memory latency
  • Physical limit: at the speed of light, a signal barely crosses the board in 1ns.
  • A 2.0GHz clock ticks every 0.5ns…
  • L1 cache: 1.5ns
  • L2 cache: 7.7ns
  • Main memory: 89ns
  • Try to hide the big hit of main memory access.
  • Codes with predictable access patterns use PREFETCH (in various flavours) to hide latency; a measurement sketch follows below.
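
  Latencies like these are usually measured with a pointer-chasing loop, where each load depends on the previous one so nothing overlaps; a minimal sketch (array size and step count are arbitrary):

    /* latency.c - dependent-load (pointer-chasing) latency.  Sattolo's
       shuffle builds one random cycle through the array, so the chase
       visits every slot in cache- and prefetch-hostile order. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (8 * 1024 * 1024)   /* 64MB of indices: far beyond cache */

    int main(void)
    {
        size_t *chain = malloc(N * sizeof *chain);
        size_t i, j, tmp, steps = 100000000;
        clock_t t0, t1;

        for (i = 0; i < N; i++) chain[i] = i;
        for (i = N - 1; i > 0; i--) {        /* Sattolo: one big cycle */
            j = rand() % i;
            tmp = chain[i]; chain[i] = chain[j]; chain[j] = tmp;
        }

        t0 = clock();
        for (i = 0, j = 0; i < steps; i++)
            j = chain[j];                    /* each load waits on the last */
        t1 = clock();

        printf("%.1f ns/load (j=%lu)\n",
               1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / steps,
               (unsigned long)j);
        free(chain);
        return 0;
    }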

  30. High Performance Linpack
  • A CPU-bound problem (the memory system is not stressed).
  • Solves A x = b using LU factorization.
  • Result: the peak Gflops rate achieved for a matrix of size NxN.
  • Almost all the time is spent in DGEMM.
  • Uses MPI message passing.
  • The larger N the better – fill the memory on each node.
  • Used in the www.top500.org ranking of supercomputers.
  • N(half) is the problem size at which half the Nmax Gflops rate is reached – a measure of overhead.
  • The current number one machine is the Earth Simulator (40Tflops); Cray Red Storm (10,000 Opterons) has comparable peak. (The Gflops arithmetic is sketched below.)
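
  The reported Gflops rates follow from the conventional LU operation count of roughly 2/3·N³ + 2·N² flops; a small sketch of the arithmetic (the run time below is made up, chosen so the output matches the 12.06 Gflops 4P result on slide 33):

    /* hpl_gflops.c - turn an HPL problem size and wall-clock time into
       a Gflops rate using the conventional 2/3*N^3 + 2*N^2 flop count. */
    #include <stdio.h>

    double hpl_gflops(double n, double seconds)
    {
        double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
        return flops / seconds / 1e9;
    }

    int main(void)
    {
        /* hypothetical: N=28000 solved in ~1214 s gives ~12.06 Gflops */
        printf("%.2f Gflops\n", hpl_gflops(28000.0, 1214.0));
        return 0;
    }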

  31. Quartet: 4U 4P AMD Opteron™ MP Processor Platform
  [Diagram: four AMD Opteron™ processors (one boot uP) fully linked by 16x16 coherent HyperTransport, each with 200-333MHz 144-bit registered DDR at 6.4GB/s. I/O chain: two AMD-8131™ HyperTransport™ PCI-X tunnels (1.6GB/s and 800MB/s HyperTransport links; PCI-X slots at 64 bits @ 133/100/66MHz with hot plug; dual GbE over 1000BaseT; dual U320 SCSI; optional ATI Rage) and an AMD-8111™ HyperTransport™ I/O hub (EIDE, USB 2.0, AC'97, legacy PCI, LPC FLASH BIOS, SM bus, SIO, BMC with keyboard, mouse, hardware and fan monitoring, 10/100 Ethernet).]

  32. MPI vs threaded BLAS?
  • BLAS libraries can use thread-level parallelism to exploit an MP node.
  • MPI can treat the MP node's processors as separate machines talking via shmem.
  • Which is best?
  • A NUMA kernel allocates memory locally for each process.
  • But within the box, MPI on 4P has memory-placement issues.
  • MPI with single-threaded BLAS performs best with the NUMA kernel.
  • Mixing OpenMP and MPI is possible and maybe sensible.
  • Generally a static up-front decomposition feels better?

  33. High Performance Linpack: GOTO Library Results

  AMD Opteron™ system                                #P   Rmax(GFlops)   Nmax(order)   N1/2(order)   Rpeak(GFlops)   GFLOP/Proc   Rmax/Rpeak
  4P AMD Opteron 1.8GHz, 2GB/proc PC2700, 8GB total   4      12.06          28000          1008           14.4           3.02        83.8%
  2P AMD Opteron 1.8GHz, 2GB/proc PC2700, 4GB total   2       6.22          20617           672            7.2           3.11        86.4%
  1P AMD Opteron 1.8GHz, 2GB PC2700                   1       3.14          15400           336            3.6           3.14        87.1%

  • Optimized high-performance BLAS by Kazushige Goto: http://www.cs.utexas.edu/users/flame/goto
  • The GOTO results were with 64-bit SuSE 8.1 Linux Professional Edition with the NUMA kernel and the Myrinet MPIch-gm-1.2.5..10 message-passing library.

  34. HPL on a 16P (4x4P) Opteron cluster
  • Machine: 4x4P 1.8GHz, 2GB/processor, with a single Myrinet and gigabit ethernet link per box.
  • Goto DGEMM (single-threaded) and MPICH-GM.
  • Myrinet, 8 processors: N=40000, N(half)=2252, 81.3% of peak
  • Myrinet, 16 processors: N=41288, N(half)=3584, 80.5% of peak
  • Ethernet, 16 processors: N=54616, N(half)=5768, 78.1% of peak
  • A big run to show >4GB/processor working:
  • 4P 1.8GHz, 8GB/processor (32GB in all … 266MHz memory)
  • 4 processors: N=60144, N(half)=1123, 80.56% of peak

  35. STREAM
  • Measures sustainable memory bandwidth (in MB/s).
  • Simple kernels on large vectors (~50MB).
  • The vectors must be much bigger than cache.
  • Machine balance is defined as: (peak floating-point ops/cycle) / (sustained memory ops/cycle). A sketch of the Triad kernel follows below.
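
  For reference, the heart of STREAM's Triad kernel is only a few lines. A simplified sketch (the real benchmark, by John McCalpin, adds repetition, best-of timing and verification):

    /* Simplified STREAM Triad: a[i] = b[i] + q*c[i].  With arrays far
       larger than cache the loop runs at memory speed: 24 bytes moved
       per iteration (two reads, one write) against 2 flops. */
    #include <stdio.h>
    #include <time.h>

    #define N 2000000   /* three 16MB arrays of doubles */

    static double a[N], b[N], c[N];

    int main(void)
    {
        long i;
        double q = 3.0, secs;
        clock_t t0, t1;

        for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        t0 = clock();
        for (i = 0; i < N; i++)
            a[i] = b[i] + q * c[i];
        t1 = clock();

        secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        printf("Triad: %.0f MB/s\n", 24.0 * N / secs / 1e6);
        return 0;
    }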

  36. Compiling STREAM
  • The PGI compiler recognises the OpenMP threads directives and generates prefetch instructions and streaming stores:
    pgf77 -Mnontemporal -O3 -fastsse -Minfo=all -Bstatic -mp -c -o stream_d_f.o stream_d.f
    180, Parallel loop activated; static block iteration allocation
         Generating sse code for inner loop
         Generated prefetch instructions for 2 loads

  37. STREAM results
  • 4P Opteron 2.0GHz with 333MHz memory.
  • Compiled with pgf77 -O3 -mp -fastsse -Mnontemporal.
  • Flops are on 64-bit words (double precision).
  • Rates are in MB/sec.
  • Triad: ~310Mflops/processor (about the same as a Cray YMP?).

  38. Application Benchmarks

  39. Greedy – Travelling Salesman Problem
  • Solves the travelling salesman problem; it is sensitive to memory latency and bandwidth.
  • An example of the increasing importance of memory performance as the problem grows from small to large. (An illustrative sketch follows below.)
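
  The slides do not show the benchmark's source; as a reminder of what a greedy TSP heuristic does, here is an illustrative nearest-neighbour sketch (not the benchmark code). Its repeated scans over a large distance matrix are what make such codes sensitive to memory latency and bandwidth:

    /* Illustrative greedy (nearest-neighbour) TSP tour.  From the
       current city, always hop to the closest unvisited one.  For
       large n the n*n distance matrix dwarfs the caches, so each row
       scan is paced by the memory system, not the FPU. */
    #include <stdlib.h>

    double greedy_tour(const double *dist, int n)
    {
        char *seen = calloc(n, 1);
        double total = 0.0;
        int cur = 0, next, i, step;

        seen[0] = 1;
        for (step = 1; step < n; step++) {
            double best = 1e300;
            next = -1;
            for (i = 0; i < n; i++)          /* scan one matrix row */
                if (!seen[i] && dist[(size_t)cur * n + i] < best) {
                    best = dist[(size_t)cur * n + i];
                    next = i;
                }
            seen[next] = 1;
            total += best;
            cur = next;
        }
        total += dist[(size_t)cur * n];      /* close the tour */
        free(seen);
        return total;
    }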

  40. OPA – Parallel Ocean Model
  • A large scientific code from France…
  • It uses LAM-MPI.
  • Compiled in France with ifc 7.1; the binary was run in the USA using the same 32-bit version of LAM-MPI, the ifc runtime and the LD_LIBRARY_PATH settings.

  41. PARAPYR
  • Direct numerical simulation of turbulent combustion.
  • Single-processor performance on a 1P system.
  • SuSE AMD64 Linux, NUMA kernel.
  • Problem size (small): 340x340.
  • Mflops are double precision (64-bit).
  • The identical statically linked binary was also run on an Intel P4 2.8GHz (Dell).
  • Compiled with ifc 7.1 -r8 -ip:
  • Opteron 1.8GHz: 420 Mflops
  • P4 2.8GHz: 351 Mflops
  • (The best Opteron results, against beta Absoft and PGI compilers, are on the next slide.)

  42. PARAPYR: best 1.8GHz Opteron ifc results
  • 435Mflops: ifc -r8 -xW -static
  • 437Mflops: ifc -r8 -O2 -xW -ipo -static -FDO (-FDO: pass 1 = -prof_gen, pass 2 = -prof_use)

  43. Conclusions
  • Use an AMD64 64-bit OS with NUMA support.
  • 32-bit compiled applications run well.
  • Know your application's memory or CPU needs (so you know what to expect).
  • 64-bit compilers need work (as ever).
  • Competitive processors: Itanium 1.5GHz and Xeon 3.0GHz have higher peak FLOPs but relatively poor memory scaling.
  • Highly tuned benchmarks which make heavy use of floating point and which fit in cache, or make low use of the memory system, perform better on Itanium.
  • The Xeon memory system does not scale well.
  • Opteron has excellent memory-system performance and scalability – both bandwidth and latency.
  • Codes that depend on memory latency or bandwidth perform better on Opteron.
  • Codes with a mix of integer and floating point will perform better on Opteron.
  • Code that is not highly tuned will likely perform better on Opteron.
  • More upside to come from the MPI 4P memory layout (MPICH2?) and 64-bit compilers.

  44. AMD, the AMD Arrow logo, AMD Athlon, AMD Opteron, 3DNow! and combinations thereof, and AMD-8100, AMD-8111, AMD-8131 and AMD-8151 are trademarks of Advanced Micro Devices, Inc. HyperTransport is a licensed trademark of the HyperTransport Technology Consortium. Microsoft and Windows are registered trademarks of Microsoft Corporation in the U.S. and/or other jurisdictions. Pentium and MMX are registered trademarks of Intel Corporation in the U.S. and/or other jurisdictions. SPEC and SPECfp are registered trademarks of Standard Performance Evaluation Corporation in the U.S. and/or other jurisdictions. Other product and company names used in this presentation are for identification purposes only and may be trademarks of their respective companies.
