
Optimal coding practices for IBM POWER4 processors



  1. Optimal coding practices for IBM POWER4 processors: Getting the most out of AIX, xlf, and xlc. Steve Behling, IBM Corporation, sbehling@us.ibm.com

  2. Outline • Some hardware details • Some software discussions • My favorite hints • Questions

  3. Memory Hierarchy (faster but smaller at the top, slower but larger toward the bottom): CPU registers (1 cycle); cache (a cache miss costs 8-200 cycles, a TLB miss tens to hundreds of cycles); main memory; disk (~100,000 cycles); massive tape storage (you don't want to know).

  4. POWER4 processor chip layout • Contains two 64-bit processors (PowerPC architecture) • POWER4 has 1.4 MB (1440 KB) L2 cache; POWER4+ has 1.5 MB L2 cache • L3 cache directory on chip • All chip frequencies scale with processor frequency

  5. POWER4 Processor Features • High-frequency, speculative-execution, superscalar processor with out-of-order instruction execution capabilities • Eight independent execution units (capable of executing instructions in parallel) = superscalar • Two identical floating-point execution units, each capable of 2 floating-point operations (one fused multiply-add) per cycle • Two load/store execution units • Two fixed-point execution units • One branch execution unit • One condition register unit to perform logical operations on the condition register • Only one of the FPUs does divides
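
A quick check on the floating-point figures above: each of the two FPUs can complete one fused multiply-add (2 flops) per cycle, so peak is 4 flops per cycle per processor core. At an assumed clock of 1.3 GHz (an illustrative frequency, not stated in this material):

    2 FPUs × 2 flops/FMA × 1.3 GHz = 5.2 GFlops per core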

  6. POWER4 Instruction Issue Block Diagram

  7. p690 Multi-Chip Module (MCM)

  8. IBM 32 processor pSeries 690

  9. Cache Organization • L1 instruction cache: direct-mapped, 128-byte line; 64 KB per processor • L1 data cache: two-way set associative, 128-byte cache line; 32 KB per processor • Shared L2 cache: POWER4 mostly eight-way, some four-way; POWER4+ all eight-way; 1.4 MB per chip (POWER4), 1.5 MB per chip (POWER4+) • L3 cache: eight-way; two boot modes, 1 cache line per transfer or 4 cache lines per transfer; 128 MB per MCM

  10. Virtual Memory Manager • Virtual storage is the addressable memory space used by the AIX operating system • This linear, contiguous address space is mapped, by a combination of hardware and software, onto the physical memory of the machine and onto disk paging space(s) • Pages are 4096 bytes on POWER3 and earlier hardware • Pages on POWER4 can be 4096 bytes, 16 MB, or 256 MB (requires AIX 5.1.0.25)

  11. Translation Lookaside Buffer (TLB) • The TLB holds the information needed to translate virtual memory addresses into physical addresses • If the page is already in the TLB, the translation is essentially free • TLB misses are likely when using indirect addressing, for example: L=left_neighbor[i]; R=right_neighbor[i]; a[i] += b[i]*a[L] + c[i]*a[R]; • The cost of a TLB miss varies from ~25 cycles to possibly hundreds of cycles in unfavorable cases
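
A minimal C sketch of the indirect-addressing point above (the arrays, their sizes, and the neighbor tables are illustrative, not from the slides): the stride-1 loop revisits the same pages repeatedly, while the gather loop can touch a different page on every iteration and therefore miss the TLB far more often.

    /* Stride-1 access: consecutive elements share pages, so TLB entries are reused. */
    void stride1(double *a, const double *b, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] += b[i];
    }

    /* Indirect (gather) access: a[L] and a[R] may live on arbitrary pages, so each
       iteration can require a fresh virtual-to-physical translation. */
    void gather(double *a, const double *b, const double *c,
                const int *left_neighbor, const int *right_neighbor, int n)
    {
        for (int i = 0; i < n; i++) {
            int L = left_neighbor[i];
            int R = right_neighbor[i];
            a[i] += b[i] * a[L] + c[i] * a[R];
        }
    }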

  12. Hardware data prefetch • The IBM POWER4 has 8 hardware prefetch streams • 2 sequential cache-line accesses (forward or backward) establish a prefetch stream • Prefetch streams stop when they reach a page boundary • Prefetching can be encouraged using compiler directives or code changes • Prefetch streams only get established for loads • The PREFETCH_BY_LOAD() directive can be used for stores, for example:

      do 10 i=1,NCELL
!IBM$ PREFETCH_BY_LOAD(i+33)
         a(i)=0.0
   10 continue
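
There is no PREFETCH_BY_LOAD directive in C, but the same effect can be approximated by hand: issue a dummy load some distance ahead of the store so the hardware sees a load stream. This is a minimal sketch under that assumption; the look-ahead distance of 33 elements mirrors the Fortran example, and the volatile `sink` variable is only there to keep the compiler from removing the dummy load.

    /* Zero an array while loading elements ~33 ahead, so the prefetch engine
       (which watches loads) can establish a stream for the stores. */
    void zero_with_lookahead(double *a, int n)
    {
        volatile double sink;
        const int ahead = 33;           /* look-ahead distance, as in the Fortran example */

        for (int i = 0; i < n; i++) {
            if (i + ahead < n)
                sink = a[i + ahead];    /* dummy load to trigger a prefetch stream */
            a[i] = 0.0;
        }
        (void)sink;
    }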

  13. Coding for prefetch performance

  Example: dot product, 2 prefetch streams:

      double s;
      double *a, *b;
      ....
      s=0.0;
      for(i=0;i<N;i++)
          s = s + a[i]*b[i];

  Example: interleaved dot product, 6 prefetch streams:

      double s,s1,s2;
      double *a, *b;
      int onethird,twothird;
      ....
      s = s1 = s2 = 0.0;
      onethird = N/3;
      twothird = 2*onethird;
      for(i=0;i<onethird;i++) {
          s  = s  + a[i]*b[i];
          s1 = s1 + a[i+onethird]*b[i+onethird];
          s2 = s2 + a[i+twothird]*b[i+twothird];
      }
      for(i=3*onethird;i<N;i++)
          s = s + a[i]*b[i];
      s = s + s1 + s2;

  14. AIX Large Pages • 16 MB large pages help HPC application performance by eliminating TLB misses and by enhancing prefetch, since prefetch streams are reset at page boundaries • Typically a 5 to 15% improvement • Some start-up overhead, since each task gets a full 256 MB segment (16 large pages) • Deadly for scripts; may be bad for fork(), execlp() • If large pages are exhausted, jobs silently fall back to small pages • Watch with "vmstat -l"

  15. AIX Large Page Administration • AIX can set aside memory to be backed by large pages (typically 50%): vmtune -g nnn -L mmm, then bosboot -a and reboot • The application must be large-page enabled: ldedit -b lpdata a.out • The user must be enabled: chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE userid, or set the default in /etc/security/user

  16. TLB coverage POWER3: • TLB contained 256 entries. • TLB coverage is 1 MB (smaller than L2 cache) POWER4: • TLB contains 1024 entries • TLB coverage is 4 MB for small pages • TLB coverage is 16 GB for large pages
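
The coverage figures follow directly from the number of TLB entries times the page size each entry maps:

    POWER3: 256 entries × 4 KB/page = 1 MB
    POWER4, 4 KB pages: 1024 entries × 4 KB/page = 4 MB
    POWER4, 16 MB pages: 1024 entries × 16 MB/page = 16 GB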

  17. TLB example (xlf -WF,-DHPM ...)

      program stand
#ifdef HPM
#include "f_hpm.h"
#endif
      parameter (NCELL=400)
      common /mystuff/ a1,a2,a3
      real(8) a1(NCELL,NCELL,NCELL)
      real(8) a2(NCELL,NCELL,NCELL)
      real(8) a3(NCELL,NCELL,NCELL)
      real(8) time1,time2,rtc,etime,s
c
      a1 = 1.0d0
      a2 = 2.0d0
#ifdef HPM
      call f_hpminit(0,"Job")
      call f_hpmstart(1,"Total_routine")
#else
      time1=rtc()
#endif
      call sub1(a1,a2,a3,NCELL)
#ifdef HPM
      call f_hpmstop(1)
      call f_hpmterminate(0)
#else
      time2=rtc()
      etime=time2-time1
      print *,'Subroutine took ', etime,' seconds'
#endif
      end

  18. TLB subroutines and performance

      subroutine sub_nest(a1,a2,a3,n)
      parameter (NCELL=400)
      real(8) a1(NCELL,NCELL,NCELL)
      real(8) a2(NCELL,NCELL,NCELL)
      real(8) a3(NCELL,NCELL,NCELL)
      integer(4) n
      integer(4) i,j,k
      real(8) s
!
      s=1.1d0
      do 10 k=1,NCELL
      do 10 j=1,NCELL
      do 10 i=1,NCELL
         a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
   10 continue
      end

  19. TLB: performance on a 375 MHz POWER3 with 4 MB L2 cache

  Loop order k, j, i (inner loop is stride-1):

      do 10 k=1,NCELL
      do 10 j=1,NCELL
      do 10 i=1,NCELL
         a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
   10 continue

  Time = 21.5 s, 329.7 loads per TLB miss

  Loop order i, j, k:

      do 10 i=1,NCELL
      do 10 j=1,NCELL
      do 10 k=1,NCELL
         a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
   10 continue

  Time = 980.6 s, 0.667 loads per TLB miss

  Loop order k, i, j:

      do 10 k=1,NCELL
      do 10 i=1,NCELL
      do 10 j=1,NCELL
         a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
   10 continue

  Time = 178.0 s, 0.853 loads per TLB miss

  20. Favorite hints • Put "export AIXTHREAD_SCOPE=S" in your .profile • -g does not decrease optimization • First compile: -O2 -qarch=pwr4 -qtune=pwr4 -qmaxmem=-1 • C: use -qlibansi • Fortran: use xlf90 -qfixed • You are most likely to get within 5% of optimal performance using -O3 • You may need -qstrict to keep -O3 from altering floating-point semantics • Use -lmass if you use any intrinsics (sqrt, exp, **, etc.) • Try -O4, -qhot, -qalias=allptrs (C), etc. on individual routines • OpenMP: use guided scheduling; -qsmp=omp,noauto

  21. Favorite hints (cont.) • MPI codes run very well on SMP systems: set MP_SHARED_MEMORY=yes and MP_WAIT_MODE=poll • (MPICH ch_shmem is pretty good, too, if you build it with -O3 -qarch=pwr4 -qtune=pwr4, at least through 8 processors) • If you do lots of 64-bit integer arithmetic, use -q64 so you can exploit the PowerPC 64-bit integer hardware • Use "nmon" as a low-overhead, curses-based system monitor • dbx a.out core is OK, but TotalView is awesome • Don't use -bmaxdata with -q64 • Use -bmaxdata:0x80000000/dsa with -q32

  22. End

  23. L3 Cache (POWER4 only) • Four POWER4 chips are combined into a multi-chip module (MCM), and each MCM has a 128 MB level 3 cache • The L3 cache is eight-way set associative • The L3 cache may be bypassed if it is busy; consequence: data may not be where you think it is • On the p690, the L3 cache is shared system-wide

  24. Tuning Recommendation • POWER4: for optimal performance, block data for the L2 cache and structure the data access pattern for the L1 data cache
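
A minimal C sketch of what "blocking for the L2 cache" can look like, here for a matrix-matrix product. The block size NB is an assumed tuning parameter (chosen so the three NB-by-NB working blocks fit comfortably in the 1.4-1.5 MB L2), not a value taken from the slides.

    #define N   1000
    #define NB  64                  /* assumed block size; tune so ~3*NB*NB doubles fit in L2 */

    static double a[N][N], b[N][N], c[N][N];

    /* Blocked (tiled) matrix multiply: each block of a, b, and c is reused while
       it is still resident in cache, instead of streaming whole rows and columns. */
    void matmul_blocked(void)
    {
        for (int ii = 0; ii < N; ii += NB)
            for (int jj = 0; jj < N; jj += NB)
                for (int kk = 0; kk < N; kk += NB)
                    for (int i = ii; i < N && i < ii + NB; i++)
                        for (int j = jj; j < N && j < jj + NB; j++) {
                            double s = c[i][j];
                            for (int k = kk; k < N && k < kk + NB; k++)
                                s += a[i][k] * b[k][j];
                            c[i][j] = s;
                        }
    }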

  25. Use FMA for best performance

  A multiply-add counts as two floating-point operations, so, for example, a program doing only additions might run at half the MFlops rate of one doing alternating multiplies and adds.

      /* bad code */
      for(i=0; i<N; i++) a[i] = s*a[i];
      printf("I did the multiply loop.\n");
      for(i=0; i<N; i++) a[i] = b[i]+a[i];

      /* good code */
      for(i=0; i<N; i++) a[i] = b[i] + s*a[i];

  Note: C++ operator overloading can produce the "bad code" pattern; it requires careful examination.

  26. How to get the most MFlops • Operate within the L1 and L2 caches via blocking • Avoid TLB misses (use stride-1 access as much as possible) • Multiplies must be paired with adds or subtracts so that each FMA counts as two flops • FMAs must be independent, and at least eight in number, to keep the two pipes of depth four going (see the sketch below)
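
A minimal C sketch of the "at least eight independent FMAs" rule for a simple reduction (variable names and the cleanup handling are illustrative); the Fortran matrix-multiply example on the next two slides applies the same idea together with register blocking.

    /* Dot product with eight independent partial sums, so consecutive FMAs do not
       depend on one another and both floating-point pipes of depth four stay busy. */
    double dot8(const double *a, const double *b, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        double s4 = 0.0, s5 = 0.0, s6 = 0.0, s7 = 0.0;
        int i;

        for (i = 0; i + 8 <= n; i += 8) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
            s4 += a[i + 4] * b[i + 4];
            s5 += a[i + 5] * b[i + 5];
            s6 += a[i + 6] * b[i + 6];
            s7 += a[i + 7] * b[i + 7];
        }
        for (; i < n; i++)          /* cleanup when n is not a multiple of 8 */
            s0 += a[i] * b[i];

        return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
    }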

  27. Peak Mflops example

  Matrix multiply kernel:

      do i=ii,min(n,ii+nb-1)
        do j=jj,min(n,jj+nb-1)
          do k=kk,min(n,kk+nb-1)
            d(i,j)=d(i,j)+a(j,k)*b(k,i)
          enddo
        enddo
      enddo

  Same code, but with the scalar explicitly stated. Good, but load/store bound:

      do i=ii,min(n,ii+nb-1)
        do j=jj,min(n,jj+nb-1)
          s=d(i,j)
          do k=kk,min(n,kk+nb-1)
            s=s+a(j,k)*b(k,i)
          enddo
          d(i,j)=s
        enddo
      enddo

  28. Peak Mflops (cont.)

  5x4 hand unrolling to maximize FMA and register usage:

      do i=ii,min(n,ii+nb-1),5
        do j=jj,min(n,jj+nb-1),4
          s00=d(i+0,j+0)
          s10=d(i+1,j+0)
          s20=d(i+2,j+0)
          s30=d(i+3,j+0)
          s40=d(i+4,j+0)
          s01=d(i+0,j+1)
          s11=d(i+1,j+1)
          s21=d(i+2,j+1)
          s31=d(i+3,j+1)
          s41=d(i+4,j+1)
          s02=d(i+0,j+2)
          s12=d(i+1,j+2)
          s22=d(i+2,j+2)
          s32=d(i+3,j+2)
          s42=d(i+4,j+2)
          s03=d(i+0,j+3)
          s13=d(i+1,j+3)
          s23=d(i+2,j+3)
          s33=d(i+3,j+3)
          s43=d(i+4,j+3)
          do k=kk,min(n,kk+nb-1)
            s00=s00+a(j+0,k)*b(k,i+0)
            s10=s10+a(j+0,k)*b(k,i+1)
            s20=s20+a(j+0,k)*b(k,i+2)
            s30=s30+a(j+0,k)*b(k,i+3)
            s40=s40+a(j+0,k)*b(k,i+4)
            s01=s01+a(j+1,k)*b(k,i+0)
            s11=s11+a(j+1,k)*b(k,i+1)
            s21=s21+a(j+1,k)*b(k,i+2)
            s31=s31+a(j+1,k)*b(k,i+3)
            s41=s41+a(j+1,k)*b(k,i+4)
            s02=s02+a(j+2,k)*b(k,i+0)
            s12=s12+a(j+2,k)*b(k,i+1)
            s22=s22+a(j+2,k)*b(k,i+2)
            s32=s32+a(j+2,k)*b(k,i+3)
            s42=s42+a(j+2,k)*b(k,i+4)
            s03=s03+a(j+3,k)*b(k,i+0)
            s13=s13+a(j+3,k)*b(k,i+1)
            s23=s23+a(j+3,k)*b(k,i+2)
            s33=s33+a(j+3,k)*b(k,i+3)
            s43=s43+a(j+3,k)*b(k,i+4)
          enddo
          d(i+0,j+0)=s00
          d(i+1,j+0)=s10
          d(i+2,j+0)=s20
          d(i+3,j+0)=s30
          d(i+4,j+0)=s40
          d(i+0,j+1)=s01
          d(i+1,j+1)=s11
          d(i+2,j+1)=s21
          d(i+3,j+1)=s31
          d(i+4,j+1)=s41
          d(i+0,j+2)=s02
          d(i+1,j+2)=s12
          d(i+2,j+2)=s22
          d(i+3,j+2)=s32
          d(i+4,j+2)=s42
          d(i+0,j+3)=s03
          d(i+1,j+3)=s13
          d(i+2,j+3)=s23
          d(i+3,j+3)=s33
          d(i+4,j+3)=s43
        enddo
      enddo

  29. Avoid divides: only one FPU on POWER4 does divides!

  For simple cases, the compiler does this for you.

  Untuned:
      DO I=1,N
        A(I)=B(I)/C(I)
        P(I)=Q(I)/C(I)
      ENDDO

  Tuned:
      DO I=1,N
        OC=1.0/C(I)
        A(I)=B(I)*OC
        P(I)=Q(I)*OC
      ENDDO

  Clever method to replace 2 divides with 1 divide and 5 multiplies, and use both FPUs:

  Untuned:
      DO I=1,N
        A(I)=B(I)/C(I)
        P(I)=Q(I)/D(I)
      ENDDO

  Tuned:
      DO I=1,N
        OCD=1.0/(C(I)*D(I))
        A(I)=B(I)*D(I)*OCD
        P(I)=Q(I)*C(I)*OCD
      ENDDO
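
The algebra behind the clever method: both quotients are rewritten over the common denominator C(I)*D(I), so a single reciprocal serves both results:

    A(I) = B(I)/C(I) = B(I)*D(I) * (1/(C(I)*D(I))) = B(I)*D(I)*OCD
    P(I) = Q(I)/D(I) = Q(I)*C(I) * (1/(C(I)*D(I))) = Q(I)*C(I)*OCD

That is 1 divide and 5 multiplies per iteration in place of 2 divides.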

  30. Minimize expensive intrinsic calls

  Untuned:
      DO I=1,N
        DO J=1,N
          A(J,I)=B(J,I)*SIN(X(J))
        ENDDO
      ENDDO

  Tuned:
      DIMENSION SINX(N)
      ...
      DO J=1,N
        SINX(J)=SIN(X(J))
      ENDDO
      DO I=1,N
        DO J=1,N
          A(J,I)=B(J,I)*SINX(J)
        ENDDO
      ENDDO
