
Optimal coding practices for IBM POWER4 processors



  1. Optimal coding practices for IBM POWER4 processors: Getting the most out of AIX, xlf, and xlc. Steve Behling, IBM Corporation, sbehling@us.ibm.com

  2. Outline • Some hardware details • Some software discussions • My favorite hints • Questions

  3. Memory Hierarchy (faster but smaller at the top, slower but larger toward the bottom): CPU registers (1 cycle); cache (a cache miss costs 8-200 cycles, a TLB miss tens to hundreds of cycles); main memory; disk (~100,000 cycles); massive tape storage (you don't want to know).

  4. POWER4 processor chip layout • Contains two 64-bit processors (PowerPC architecture) • POWER4 has 1.4 MB (1440 KB) L2 cache; POWER4+ has 1.5 MB L2 cache • L3 cache directory on chip • All chip frequencies scale with processor frequency

  5. POWER4 Processor Features • High-frequency, speculative-execution, superscalar processor with out-of-order instruction execution capabilities • Eight independent execution units (capable of executing instructions in parallel) = superscalar • Two identical floating-point execution units, each capable of 2 floating-point operations (one fused multiply-add) per cycle • Two load/store execution units • Two fixed-point execution units • One branch execution unit • One condition register unit to perform logical operations on the condition register • Only one of the FPUs does divides
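
A quick check on the floating-point figures above: each of the two FPUs can complete one fused multiply-add (2 flops) per cycle, so peak is 4 flops per cycle per processor core. At an assumed clock of 1.3 GHz (an illustrative frequency, not stated in this material):

    2 FPUs × 2 flops/FMA × 1.3 GHz = 5.2 GFlops per core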

  6. POWER4 Instruction Issue Block Diagram

  7. p690 Multi-Chip Module (MCM)

  8. IBM 32 processor pSeries 690

  9. Cache Organization • L1 instruction cache: direct-mapped, 128-byte line; 64 KB per processor • L1 data cache: two-way set associative, 128-byte cache line; 32 KB per processor • Shared L2 cache: POWER4 mostly eight-way, some four-way; POWER4+ all eight-way; 1.4 MB per chip (POWER4), 1.5 MB per chip (POWER4+) • L3 cache: eight-way; two boot modes, 1 cache line per transfer or 4 cache lines per transfer; 128 MB per MCM

  10. Virtual Memory Manager • Virtual storage is the addressable memory space used by the AIX operating system • This linear, contiguous address space is mapped, by a combination of hardware and software, onto the physical memory of the machine and onto disk paging space(s) • Pages are 4096 bytes on POWER3 and earlier hardware • Pages on POWER4 can be 4096 bytes, 16 MB, or 256 MB (requires AIX 5.1.0.25)

  11. Translation Lookaside Buffer (TLB) • The TLB holds the information needed to translate virtual memory addresses into physical addresses • If the page is already in the TLB, the translation is essentially free • TLB misses are likely when using indirect addressing, for example: L=left_neighbor[i]; R=right_neighbor[i]; a[i] += b[i]*a[L] + c[i]*a[R]; • The cost of a TLB miss varies from ~25 cycles to possibly hundreds of cycles in unfavorable cases
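
A minimal C sketch of the indirect-addressing point above (the arrays, their sizes, and the neighbor tables are illustrative, not from the slides): the stride-1 loop revisits the same pages repeatedly, while the gather loop can touch a different page on every iteration and therefore miss the TLB far more often.

    /* Stride-1 access: consecutive elements share pages, so TLB entries are reused. */
    void stride1(double *a, const double *b, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] += b[i];
    }

    /* Indirect (gather) access: a[L] and a[R] may live on arbitrary pages, so each
       iteration can require a fresh virtual-to-physical translation. */
    void gather(double *a, const double *b, const double *c,
                const int *left_neighbor, const int *right_neighbor, int n)
    {
        for (int i = 0; i < n; i++) {
            int L = left_neighbor[i];
            int R = right_neighbor[i];
            a[i] += b[i] * a[L] + c[i] * a[R];
        }
    }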

  12. Hardware data prefetch • The IBM POWER4 has 8 hardware prefetch streams • 2 sequential cache-line accesses (forward or backward) establish a prefetch stream • Prefetch streams stop when they reach a page boundary • Prefetching can be encouraged using compiler directives or code changes • Prefetch streams only get established for loads • The PREFETCH_BY_LOAD() directive can be used for stores, for example:

      do 10 i=1,NCELL
!IBM$ PREFETCH_BY_LOAD(i+33)
         a(i)=0.0
   10 continue
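
There is no PREFETCH_BY_LOAD directive in C, but the same effect can be approximated by hand: issue a dummy load some distance ahead of the store so the hardware sees a load stream. This is a minimal sketch under that assumption; the look-ahead distance of 33 elements mirrors the Fortran example, and the volatile `sink` variable is only there to keep the compiler from removing the dummy load.

    /* Zero an array while loading elements ~33 ahead, so the prefetch engine
       (which watches loads) can establish a stream for the stores. */
    void zero_with_lookahead(double *a, int n)
    {
        volatile double sink;
        const int ahead = 33;           /* look-ahead distance, as in the Fortran example */

        for (int i = 0; i < n; i++) {
            if (i + ahead < n)
                sink = a[i + ahead];    /* dummy load to trigger a prefetch stream */
            a[i] = 0.0;
        }
        (void)sink;
    }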

  13. Coding for prefetch performance

  Example: dot product, 2 prefetch streams:

      double s;
      double *a, *b;
      ....
      s=0.0;
      for(i=0;i<N;i++)
          s = s + a[i]*b[i];

  Example: interleaved dot product, 6 prefetch streams:

      double s,s1,s2;
      double *a, *b;
      int onethird,twothird;
      ....
      s = s1 = s2 = 0.0;
      onethird = N/3;
      twothird = 2*onethird;
      for(i=0;i<onethird;i++) {
          s  = s  + a[i]*b[i];
          s1 = s1 + a[i+onethird]*b[i+onethird];
          s2 = s2 + a[i+twothird]*b[i+twothird];
      }
      for(i=3*onethird;i<N;i++)
          s = s + a[i]*b[i];
      s = s + s1 + s2;

  14. AIX Large Pages • 16 MB large pages help HPC application performance by eliminating TLB misses and by enhancing prefetch, since prefetch streams are reset at page boundaries • Typically a 5 to 15% improvement • Some start-up overhead, since each task gets a full 256 MB segment (16 large pages) • Deadly for scripts; may be bad for fork(), execlp() • If large pages are exhausted, jobs silently fall back to small pages • Watch with "vmstat -l"

  15. AIX Large Page Administration • AIX can set aside memory to be backed by large pages (typically 50%): vmtune -g nnn -L mmm, then bosboot -a and reboot • The application must be large-page enabled: ldedit -b lpdata a.out • The user must be enabled: chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE userid, or set the default in /etc/security/user

  16. TLB coverage POWER3: • TLB contained 256 entries. • TLB coverage is 1 MB (smaller than L2 cache) POWER4: • TLB contains 1024 entries • TLB coverage is 4 MB for small pages • TLB coverage is 16 GB for large pages
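
The coverage figures follow directly from the number of TLB entries times the page size each entry maps:

    POWER3: 256 entries × 4 KB/page = 1 MB
    POWER4, 4 KB pages: 1024 entries × 4 KB/page = 4 MB
    POWER4, 16 MB pages: 1024 entries × 16 MB/page = 16 GB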

  17. TLB example (xlf -WF,-DHPM ...)

      program stand
#ifdef HPM
#include "f_hpm.h"
#endif
      parameter (NCELL=400)
      common /mystuff/ a1,a2,a3
      real(8) a1(NCELL,NCELL,NCELL)
      real(8) a2(NCELL,NCELL,NCELL)
      real(8) a3(NCELL,NCELL,NCELL)
      real(8) time1,time2,rtc,etime,s
c
      a1 = 1.0d0
      a2 = 2.0d0
#ifdef HPM
      call f_hpminit(0,"Job")
      call f_hpmstart(1,"Total_routine")
#else
      time1=rtc()
#endif
      call sub1(a1,a2,a3,NCELL)
#ifdef HPM
      call f_hpmstop(1)
      call f_hpmterminate(0)
#else
      time2=rtc()
      etime=time2-time1
      print *,'Subroutine took ', etime,' seconds'
#endif
      end

  18. TLB subroutines and performance

      subroutine sub_nest(a1,a2,a3,n)
      parameter (NCELL=400)
      real(8) a1(NCELL,NCELL,NCELL)
      real(8) a2(NCELL,NCELL,NCELL)
      real(8) a3(NCELL,NCELL,NCELL)
      integer(4) n
      integer(4) i,j,k
      real(8) s
!
      s=1.1d0
      do 10 k=1,NCELL
      do 10 j=1,NCELL
      do 10 i=1,NCELL
         a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
   10 continue
      end

  19. TLB: performance on a 375 MHz POWER3 with 4 MB L2 cache

  Loop order k, j, i (inner loop is stride-1):

      do 10 k=1,NCELL
      do 10 j=1,NCELL
      do 10 i=1,NCELL
         a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
   10 continue

  Time = 21.5 s, 329.7 loads per TLB miss

  Loop order i, j, k:

      do 10 i=1,NCELL
      do 10 j=1,NCELL
      do 10 k=1,NCELL
         a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
   10 continue

  Time = 980.6 s, 0.667 loads per TLB miss

  Loop order k, i, j:

      do 10 k=1,NCELL
      do 10 i=1,NCELL
      do 10 j=1,NCELL
         a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
   10 continue

  Time = 178.0 s, 0.853 loads per TLB miss

  20. Favorite hints • Put "export AIXTHREAD_SCOPE=S" in your .profile • -g does not decrease optimization • First compile: -O2 -qarch=pwr4 -qtune=pwr4 -qmaxmem=-1 • C: use -qlibansi • Fortran: use xlf90 -qfixed • You are most likely to get within 5% of optimal performance using -O3 • You may need -qstrict to keep -O3 from altering floating-point semantics • Use -lmass if you use any intrinsics (sqrt, exp, **, etc.) • Try -O4, -qhot, -qalias=allptrs (C), etc. on individual routines • OpenMP: use guided scheduling; -qsmp=omp,noauto

  21. Favorite hints (cont.) • MPI codes run very well on SMP systems: set MP_SHARED_MEMORY=yes and MP_WAIT_MODE=poll • (MPICH ch_shmem is pretty good, too, if you build it with -O3 -qarch=pwr4 -qtune=pwr4, at least through 8 processors) • If you do lots of 64-bit integer arithmetic, use -q64 so you can exploit the PowerPC 64-bit integer hardware • Use "nmon" as a low-overhead, curses-based system monitor • dbx a.out core is OK, but TotalView is awesome • Don't use -bmaxdata with -q64 • Use -bmaxdata:0x80000000/dsa with -q32

  22. End

  23. L3 Cache (POWER4 only) • Four POWER4 chips are combined into a multi-chip module (MCM), and each MCM has a 128 MB level 3 cache • The L3 cache is eight-way set associative • The L3 cache may be bypassed if it is busy; consequence: data may not be where you think it is • On the p690, the L3 cache is shared system-wide

  24. Tuning Recommendation • POWER4: for optimal performance, block data for the L2 cache and structure the data access pattern for the L1 data cache
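
A minimal C sketch of what "blocking for the L2 cache" can look like, here for a matrix-matrix product. The block size NB is an assumed tuning parameter (chosen so the three NB-by-NB working blocks fit comfortably in the 1.4-1.5 MB L2), not a value taken from the slides.

    #define N   1000
    #define NB  64                  /* assumed block size; tune so ~3*NB*NB doubles fit in L2 */

    static double a[N][N], b[N][N], c[N][N];

    /* Blocked (tiled) matrix multiply: each block of a, b, and c is reused while
       it is still resident in cache, instead of streaming whole rows and columns. */
    void matmul_blocked(void)
    {
        for (int ii = 0; ii < N; ii += NB)
            for (int jj = 0; jj < N; jj += NB)
                for (int kk = 0; kk < N; kk += NB)
                    for (int i = ii; i < N && i < ii + NB; i++)
                        for (int j = jj; j < N && j < jj + NB; j++) {
                            double s = c[i][j];
                            for (int k = kk; k < N && k < kk + NB; k++)
                                s += a[i][k] * b[k][j];
                            c[i][j] = s;
                        }
    }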

  25. Use FMA for best performance

  A multiply-add counts as two floating-point operations, so, for example, a program doing only additions might run at half the MFlops rate of one doing alternating multiplies and adds.

      /* bad code */
      for(i=0; i<N; i++) a[i] = s*a[i];
      printf("I did the multiply loop.\n");
      for(i=0; i<N; i++) a[i] = b[i]+a[i];

      /* good code */
      for(i=0; i<N; i++) a[i] = b[i] + s*a[i];

  Note: C++ operator overloading can produce the "bad code" pattern; it requires careful examination.

  26. How to get the most MFlops • Operate within the L1 and L2 caches via blocking • Avoid TLB misses (use stride-1 access as much as possible) • Multiplies must be paired with adds or subtracts so that each FMA counts as two flops • FMAs must be independent, and at least eight in number, to keep the two pipes of depth four going (see the sketch below)
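
A minimal C sketch of the "at least eight independent FMAs" rule for a simple reduction (variable names and the cleanup handling are illustrative); the Fortran matrix-multiply example on the next two slides applies the same idea together with register blocking.

    /* Dot product with eight independent partial sums, so consecutive FMAs do not
       depend on one another and both floating-point pipes of depth four stay busy. */
    double dot8(const double *a, const double *b, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        double s4 = 0.0, s5 = 0.0, s6 = 0.0, s7 = 0.0;
        int i;

        for (i = 0; i + 8 <= n; i += 8) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
            s4 += a[i + 4] * b[i + 4];
            s5 += a[i + 5] * b[i + 5];
            s6 += a[i + 6] * b[i + 6];
            s7 += a[i + 7] * b[i + 7];
        }
        for (; i < n; i++)          /* cleanup when n is not a multiple of 8 */
            s0 += a[i] * b[i];

        return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
    }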

  27. Peak Mflops example

  Matrix multiply kernel:

      do i=ii,min(n,ii+nb-1)
        do j=jj,min(n,jj+nb-1)
          do k=kk,min(n,kk+nb-1)
            d(i,j)=d(i,j)+a(j,k)*b(k,i)
          enddo
        enddo
      enddo

  Same code, but with the scalar explicitly stated. Good, but load/store bound:

      do i=ii,min(n,ii+nb-1)
        do j=jj,min(n,jj+nb-1)
          s=d(i,j)
          do k=kk,min(n,kk+nb-1)
            s=s+a(j,k)*b(k,i)
          enddo
          d(i,j)=s
        enddo
      enddo

  28. Peak Mflops (cont.)

  5x4 hand unrolling to maximize FMA and register usage:

      do i=ii,min(n,ii+nb-1),5
        do j=jj,min(n,jj+nb-1),4
          s00=d(i+0,j+0)
          s10=d(i+1,j+0)
          s20=d(i+2,j+0)
          s30=d(i+3,j+0)
          s40=d(i+4,j+0)
          s01=d(i+0,j+1)
          s11=d(i+1,j+1)
          s21=d(i+2,j+1)
          s31=d(i+3,j+1)
          s41=d(i+4,j+1)
          s02=d(i+0,j+2)
          s12=d(i+1,j+2)
          s22=d(i+2,j+2)
          s32=d(i+3,j+2)
          s42=d(i+4,j+2)
          s03=d(i+0,j+3)
          s13=d(i+1,j+3)
          s23=d(i+2,j+3)
          s33=d(i+3,j+3)
          s43=d(i+4,j+3)
          do k=kk,min(n,kk+nb-1)
            s00=s00+a(j+0,k)*b(k,i+0)
            s10=s10+a(j+0,k)*b(k,i+1)
            s20=s20+a(j+0,k)*b(k,i+2)
            s30=s30+a(j+0,k)*b(k,i+3)
            s40=s40+a(j+0,k)*b(k,i+4)
            s01=s01+a(j+1,k)*b(k,i+0)
            s11=s11+a(j+1,k)*b(k,i+1)
            s21=s21+a(j+1,k)*b(k,i+2)
            s31=s31+a(j+1,k)*b(k,i+3)
            s41=s41+a(j+1,k)*b(k,i+4)
            s02=s02+a(j+2,k)*b(k,i+0)
            s12=s12+a(j+2,k)*b(k,i+1)
            s22=s22+a(j+2,k)*b(k,i+2)
            s32=s32+a(j+2,k)*b(k,i+3)
            s42=s42+a(j+2,k)*b(k,i+4)
            s03=s03+a(j+3,k)*b(k,i+0)
            s13=s13+a(j+3,k)*b(k,i+1)
            s23=s23+a(j+3,k)*b(k,i+2)
            s33=s33+a(j+3,k)*b(k,i+3)
            s43=s43+a(j+3,k)*b(k,i+4)
          enddo
          d(i+0,j+0)=s00
          d(i+1,j+0)=s10
          d(i+2,j+0)=s20
          d(i+3,j+0)=s30
          d(i+4,j+0)=s40
          d(i+0,j+1)=s01
          d(i+1,j+1)=s11
          d(i+2,j+1)=s21
          d(i+3,j+1)=s31
          d(i+4,j+1)=s41
          d(i+0,j+2)=s02
          d(i+1,j+2)=s12
          d(i+2,j+2)=s22
          d(i+3,j+2)=s32
          d(i+4,j+2)=s42
          d(i+0,j+3)=s03
          d(i+1,j+3)=s13
          d(i+2,j+3)=s23
          d(i+3,j+3)=s33
          d(i+4,j+3)=s43
        enddo
      enddo

  29. Avoid divides: only one FPU on POWER4 does divides!

  For simple cases, the compiler does this for you.

  Untuned:
      DO I=1,N
        A(I)=B(I)/C(I)
        P(I)=Q(I)/C(I)
      ENDDO

  Tuned:
      DO I=1,N
        OC=1.0/C(I)
        A(I)=B(I)*OC
        P(I)=Q(I)*OC
      ENDDO

  Clever method to replace 2 divides with 1 divide and 5 multiplies, and use both FPUs:

  Untuned:
      DO I=1,N
        A(I)=B(I)/C(I)
        P(I)=Q(I)/D(I)
      ENDDO

  Tuned:
      DO I=1,N
        OCD=1.0/(C(I)*D(I))
        A(I)=B(I)*D(I)*OCD
        P(I)=Q(I)*C(I)*OCD
      ENDDO
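
The algebra behind the clever method: both quotients are rewritten over the common denominator C(I)*D(I), so a single reciprocal serves both results:

    A(I) = B(I)/C(I) = B(I)*D(I) * (1/(C(I)*D(I))) = B(I)*D(I)*OCD
    P(I) = Q(I)/D(I) = Q(I)*C(I) * (1/(C(I)*D(I))) = Q(I)*C(I)*OCD

That is 1 divide and 5 multiplies per iteration in place of 2 divides.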

  30. Minimize expensive intrinsic calls

  Untuned:
      DO I=1,N
        DO J=1,N
          A(J,I)=B(J,I)*SIN(X(J))
        ENDDO
      ENDDO

  Tuned:
      DIMENSION SINX(N)
      ...
      DO J=1,N
        SINX(J)=SIN(X(J))
      ENDDO
      DO I=1,N
        DO J=1,N
          A(J,I)=B(J,I)*SINX(J)
        ENDDO
      ENDDO
