
Parallel Programming on the SGI Origin2000


Presentation Transcript


  1. Parallel Programming on the SGI Origin2000 Taub Computer Center, Technion Anne Weill-Zrahia With thanks to Moshe Goldberg (TCC) and Igor Zacharov (SGI) Mar 2005

  2. Parallel Programming on the SGI Origin2000 • Parallelization Concepts • SGI Computer Design • Efficient Scalar Design • Parallel Programming - OpenMP • Parallel Programming - MPI

  3. 3) Efficient Scalar Programming

  4. Remember to use “make”
Make: * implicit documentation * minimize compile time * 100% “oops free” * make / pmake / smake

OBJS = f1.o f2.o f3.o
FFLAGS = -O3 -r12k
LDFLAGS = -lm -r12k

all: pgm1

pgm1: $(OBJS)
<tab> f77 -o pgm1 $(OBJS) $(LDFLAGS)

f2.o: f2.f
<tab> f77 $(FFLAGS) -static -c f2.f

clean:
<tab> -rm -f $(OBJS) pgm1 core

  5. Speedup opportunities
A program may run slowly because not all resources are used efficiently:
• on the processor:
  * non-optimal scheduling of instructions (too many wait states)
• memory access:
  * the memory access pattern is not optimized for the architecture
  * not all data in a cache line is used (spatial locality)
  * data in the cache is not reused (temporal locality)
Performance analysis is used to diagnose the problem.
The compiler will attempt to optimize the program; however, this is not always possible:
  * the data representation can inhibit compiler optimization
  * the algorithm presentation can inhibit optimization
Often it is necessary to rewrite critical parts of the code (loops) so that the compiler can do better performance optimization. Understanding the optimization techniques helps you present code in a form the compiler can exploit effectively.

  6. Compiler optimization techniques
Here are some optimization techniques built into the compiler:
- Loop based:
  * loop interchange
  * outer and inner loop unrolling
  * cache blocking
  * loop fusion (merge) and fission (split)
- General:
  * procedure inlining
  * data and array padding
The algorithm should be presented in the program in such a way that the compiler can apply these optimization techniques, leading to the best performance on the specific computer.

  7. Some simple arithmetic replacements
      do i = 1,n
        a = sin(x(i))
        v(i) = 2.0*a
      enddo
Replace by:
      do i = 1,n
        v(i) = 2.0*sin(x(i))
      enddo

      do j = 1,m
        do k = 1,n
          v(k,j) = 2.0*(a(k)/b(j))
        enddo
      enddo
Replace by:
      do j = 1,m
        btemp = 2.0/b(j)
        do k = 1,n
          v(k,j) = btemp*a(k)
        enddo
      enddo

  8. Array Indexing
Arrays can be indexed in several ways. For example:
Explicit addressing:
      do j=1,m
        do k=1,n
          .. A(k+(j-1)*n) ..
        enddo
      enddo
Direct addressing:
      do j=1,m
        do k=1,n
          .. A(k,j) ..
        enddo
      enddo
Loop carried addressing:
      do j=1,m
        do k=1,n
          kk=kk+1
          .. A(kk) ..
        enddo
      enddo
Indirect addressing:
      do j=1,m
        do k=1,n
          .. A(index(k,j)) ..
        enddo
      enddo
• The addressing scheme will have an impact on performance
• Arrays should be accessed in the most natural, direct way for the compiler to apply loop optimization techniques

  9. Data storage in memory
Data storage is language dependent:
Fortran stores multi-dimensional arrays in column order: the leftmost index changes first in memory, so a(i,j) is followed by a(i+1,j), a(i+2,j), ...
C stores multi-dimensional arrays in row order: the rightmost index changes first in memory, so A[i][j] is followed by A[i][j+1], A[i][j+2], ...
For arrays that do not fit in the cache, accessing elements in storage order gives much faster performance.
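A short C sketch of this effect (the array size and the use of clock() are illustrative, not from the slides): it walks a large array first in storage (row) order and then across columns; on cache-based machines the first loop is typically much faster.

#include <stdio.h>
#include <time.h>

#define N 2000                      /* illustrative size, ~32 MB: larger than any cache */

static double a[N][N];

int main(void)
{
    double sum = 0.0;
    clock_t t0, t1, t2;
    int i, j;

    t0 = clock();
    for (i = 0; i < N; i++)         /* row order: C storage order, unit stride */
        for (j = 0; j < N; j++)
            sum += a[i][j];

    t1 = clock();
    for (j = 0; j < N; j++)         /* column order: stride of N doubles */
        for (i = 0; i < N; i++)
            sum += a[i][j];

    t2 = clock();
    printf("row order:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("column order: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    printf("sum = %g\n", sum);      /* use sum so the loops are not optimized away */
    return 0;
}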

  10. Fortran loop interchange
Original loop:
      do i=1,n
        do j=1,m
          c(i,j)=a(i,j)+b(i,j)
        enddo
      enddo
Interchanged loop:
      do j=1,m
        do i=1,n
          c(i,j)=a(i,j)+b(i,j)
        enddo
      enddo
(Diagram: storage order vs. access order over the n x m arrays.)
The distribution of data in memory is not changed, only the access pattern changes.
The compiler can do this automatically, but there are complicated cases.

  11. Index reversal
Original loop:
      do i=1,n
        do j=1,m
          c(i,j)=a(i,j)+b(j,i)
        enddo
      enddo
The access is wrong for A and C, but it is right for B. Interchange will be good for A and C, but bad for B.
Possible solution: index reversal of B – that is, B is stored transposed, so the reference b(j,i) becomes b(i,j). But this must be done everywhere in the program. (It must be done manually; the compiler will not do it.)
Interchanged loop + index reversal:
      do j=1,m
        do i=1,n
          c(i,j)=a(i,j)+b(i,j)
        enddo
      enddo

  12. Loop interchange in C
In C, the situation is the opposite of what it is in Fortran.
Original loop:
      for(j=0;j<m;j++)
        for(i=0;i<n;i++)
          c[i][j]=a[i][j]+b[j][i];
Addressing of c and a is wrong; addressing of b is correct.
Interchanged loop:
      for(i=0;i<n;i++)
        for(j=0;j<m;j++)
          c[i][j]=a[i][j]+b[j][i];
Index reversal loop:
      for(j=0;j<m;j++)
        for(i=0;i<n;i++)
          c[j][i]=a[j][i]+b[j][i];
The performance benefits in C are the same as in Fortran. In most practical situations, loop interchange (supported by the compiler) is easier to achieve than index reversal.

  13. Loop fusion
Loop fusion (merging two or more loops together):
• fusing loops that refer to the same data enhances temporal locality
• a larger loop body allows more effective scalar optimizations and instruction scheduling
Original loops:
      for (i=0;i<n;i++) a[i]=b[i]+1;
      for (i=0;i<n;i++) c[i]=a[i]/2;
      for (i=0;i<n;i++) d[i]=1/c[i+1];
Fused loops:
      for (i=0;i<n;i++) { a[i]=b[i]+1; c[i]=a[i]/2; }
      for (i=0;i<n;i++) d[i]=1/c[i+1];
More fusion, with peeling:
      a[0]=b[0]+1;
      c[0]=a[0]/2;
      for (i=1;i<n;i++) {
        a[i]=b[i]+1;
        c[i]=a[i]/2;
        d[i-1]=1/c[i];
      }
      d[n-1]=1/c[n];
• loop peeling can break data dependences when fusing loops
• sometimes temporary arrays can be replaced by scalars (manual only)
• the compiler will attempt to fuse loops if they are adjacent, that is, there is no code between the loops to be fused
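The slide mentions that temporary arrays can sometimes be replaced by scalars but does not show it. A minimal C sketch, assuming the intermediate array a[] is not read anywhere after the loop (the function name and types are illustrative):

#include <stddef.h>

/* Fused loop with the temporary array a[] replaced by a scalar.
   This is only legal if a[] is not needed after the loop (an assumption
   here); it is a manual transformation the compiler will not do for a
   user-visible array. */
void fuse_and_scalarize(size_t n, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        double ai = b[i] + 1.0;     /* the scalar stands in for a[i] */
        c[i] = ai / 2.0;
    }
}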

  14. Loop fission
Loop fission (splitting), or loop distribution: improve memory locality by splitting out loops that refer to different, independent arrays.
Original loop:
      for (i=1;i<n;i++) {
        a[i]=a[i]+b[i-1];
        b[i]=c[i-1]*x+y;
        c[i]=1/b[i];
        d[i]=sqrt(c[i]);
      }
Split loops:
      for (i=0;i<n-1;i++) {
        b[i+1]=c[i]*x+y;
        c[i+1]=1/b[i+1];
      }
      for (i=0;i<n-1;i++)
        a[i+1]=a[i+1]+b[i];
      for (i=0;i<n-1;i++)
        d[i+1]=sqrt(c[i+1]);

  15. Array placement effects
“Wrong” data placement in memory can lead to cache thrashing. The compiler has two techniques built in to avoid thrashing:
- array padding
- leading dimension extension
In principle, the leading dimension of an array should be an odd number – that is, if a multi-dimensional array has small dimensions (such as a(32,32,32)), the leading dimension should be an odd number, never a power of 2.
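A minimal C sketch of the same idea (the sizes and the pad of one element are illustrative; note that in C the rightmost dimension is the contiguous one, so that is the dimension to pad):

#include <stdio.h>

#define N 1024

/* Power-of-2 row length: walking down a column visits addresses that are
   N*sizeof(double) = 8192 bytes apart, so they land on only a few cache
   sets and keep evicting each other. */
static double a_bad[N][N];

/* Padding the contiguous (rightmost) dimension by one element makes the
   column stride 8200 bytes, which spreads the accesses over all the sets. */
static double a_pad[N][N + 1];

int main(void)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)          /* column-wise walk of both arrays */
        for (int i = 0; i < N; i++)
            s += a_bad[i][j] + a_pad[i][j];
    printf("%g\n", s);
    return 0;
}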

  16. Single CPU RISC memory levels (diagram: CPU -> cache -> main memory)

  17. RISC memory levels, single CPU (diagram: CPU -> cache -> main memory)

  18. Direct mapped cache: thrashing
      common // a(4096),b(4096)
      do i=1,n
        prod = prod + a(i)*b(i)
      enddo
(Diagram: a(1)..a(4096) followed by b(1)..b(4096) in virtual memory; a 16 KB direct-mapped cache with 4-word cache lines.)
Because the two arrays are exactly 16 KB apart, a(i) and b(i) map to the same cache line, and each reference evicts the other.
Thrashing: every memory reference is a cache miss.
The rule: avoid leading dimensions that are a power of 2!
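A small C sketch of the underlying arithmetic, assuming the 16 KB direct-mapped cache, 4-word (16-byte) lines and 4-byte array elements from the slide; the address computation is simplified to byte offsets within the common block:

#include <stdio.h>

#define CACHE_BYTES (16 * 1024)     /* 16 KB direct-mapped cache (from the slide) */
#define LINE_BYTES  16              /* 4 words of 4 bytes per cache line */
#define NWORDS      4096            /* a(4096) directly followed by b(4096) */

int main(void)
{
    for (int i = 0; i < 3; i++) {
        long addr_a = 4L * i;                    /* byte offset of a(i+1) */
        long addr_b = 4L * (NWORDS + i);         /* byte offset of b(i+1) */
        long line_a = (addr_a % CACHE_BYTES) / LINE_BYTES;
        long line_b = (addr_b % CACHE_BYTES) / LINE_BYTES;
        printf("i=%d: a -> line %ld, b -> line %ld\n", i + 1, line_a, line_b);
    }
    return 0;   /* a(i) and b(i) always map to the same line: every access misses */
}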

  19. Array padding: example
      common // a(1024,1024),b(1024,1024),c(1024,1024)
      do j=1,1024
        do i=1,1024
          a(i,j)=a(i,j)+b(i,j)*c(i,j)
        enddo
      enddo
Addr[C(1,1)] = Addr[B(1,1)] + 1024*1024*4
Position in the cache: C(1,1) falls on top of B(1,1), since (1024*1024*4) mod 32K = 0.
With padding:
      common // a(1024,1024),pad1(129),b(1024,1024),pad2(129),c(1024,1024)
      do j=1,1024
        do i=1,1024
          a(i,j)=a(i,j)+b(i,j)*c(i,j)
        enddo
      enddo
Addr[C(1,1)] = Addr[B(1,1)] + 1024*1024*4 + 129*4
Position in the cache: C(1,1) now falls where B(129,1) was.
• Padding causes the cache lines to be placed in different places
• The compiler will try to do padding automatically

  20. Dangers of array padding
• The compiler will automatically pad local data
• At “-O3” optimization, the compiler will pad common blocks
  * all routines with common blocks must be compiled with “-O3”, otherwise the compiler will not perform this optimization
  * padding of common blocks is safe as long as the Fortran standard is not violated, as in this example, which runs off the end of a in order to zero b as well:
      subroutine sub
      common // a(512,512),b(512,512)
      do i=1,2*512*512
        a(i,1)=0.0
      enddo
      return
      end
• The remedy is to fix the violation, or not to use this optimization – either by compiling with lower optimization or by using the compiler flag -OPT:reorg_common=off

  21. Loop unrolling
Loop unrolling: perform multiple loop iterations at the same time.
      do i=1,n,1
        ..(i)..
      enddo
becomes
      do i=1,n,unroll
        ..(i)..
        ..(i+1)..
        ..(i+2)..
        ..
        ..(i+unroll-1)..
      enddo
Cleanup:
      do i=n-mod(n,unroll)+1,n
        ..(i)..
      enddo
Advantages of loop unrolling:
  * more opportunities for super-scalar code
  * more data reuse
  * exploits the presence of cache lines
  * reduction in loop overhead
Disadvantages of loop unrolling:
  * cleanup code required
NOTE: the compiler will unroll code automatically, based on an estimate of how much time the loop body will take.
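A concrete C sketch of the transformation (the dot-product kernel, the unroll factor of 4 and the use of four partial sums are illustrative, not from the slides):

#include <stddef.h>

/* Dot product unrolled by 4, with a cleanup loop for the leftover elements. */
double dot_unrolled(size_t n, const double *x, const double *y)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;

    for (i = 0; i + 4 <= n; i += 4) {   /* main unrolled loop */
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)                  /* cleanup loop */
        s0 += x[i] * y[i];

    return s0 + s1 + s2 + s3;
}

The four partial sums give the superscalar floating-point units independent operations to overlap; reassociating the sum this way can change the rounded result slightly.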

  22. Blocking for cache (tiling)
Blocking for cache:
  * an optimization that is good for data sets that do not fit into the data cache
  * a way to increase spatial locality of reference (that is, exploit full cache lines)
  * a way to increase temporal locality of reference (that is, improve data reuse)
  * it is mostly beneficial with multi-dimensional arrays
      do i=1,n
        ..(i)..
      enddo
becomes
      do il=1,n,nb
        do i=il,min(il+nb-1,n)
          ..(i)..
        enddo
      enddo
Only “nb” elements at a time of the inner loop are active.

  23. Principle of blocking
      do i=1,n
        a(i)=i
      enddo
Blocked:
      do i1=1,n,iblk
        i2=i1+iblk-1
        do i=i1,i2
          a(i)=i
        enddo
      enddo
Blocked, with the last block clipped to n:
      do i1=1,n,iblk
        i2=i1+iblk-1
        if (i2.gt.n) i2=n
        do i=i1,i2
          a(i)=i
        enddo
      enddo

  24. Blocking example: transpose
      do j=1,n
        do i=1,n
          a(i,j)=b(j,i)
        enddo
      enddo
Either A or B is accessed in non-unit stride: bad reuse of data.
Blocking the loops for cache will do the transpose block by block, reusing the elements in the blocks:
      do jt=1,n,jtblk
        do it=1,n,itblk
          do j=jt,jt+jtblk-1
            do i=it,it+itblk-1
              a(i,j)=b(j,i)
            enddo
          enddo
        enddo
      enddo
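The Fortran version above assumes n is a multiple of the block sizes; a hedged C sketch of the same blocked transpose with the edge blocks clipped (the block size of 64 and the helper name are illustrative):

#include <stddef.h>

#define BLK 64                      /* illustrative block size; tune to the cache */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked transpose a(i,j) = b(j,i) for row-major n x n arrays,
   with the edge blocks clipped when n is not a multiple of BLK. */
void transpose_blocked(size_t n, double *a, const double *b)
{
    for (size_t jt = 0; jt < n; jt += BLK)
        for (size_t it = 0; it < n; it += BLK)
            for (size_t j = jt; j < min_sz(jt + BLK, n); j++)
                for (size_t i = it; i < min_sz(it + BLK, n); i++)
                    a[i * n + j] = b[j * n + i];
}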

  25. A recent example: matrix multiply

  26. Matrix multiply: remove the “if” from the loop
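The code and profiles for this and the following slides were images and did not survive in the transcript. As a generic, hedged illustration of the transformation the title describes (the function and the condition are invented for the example), a loop-invariant test can be hoisted out of the loop:

/* Before: the test is loop-invariant but is evaluated in every iteration,
   and it keeps the compiler from unrolling and scheduling the loop well. */
void scale_before(int n, double s, double *x, int multiply)
{
    for (int i = 0; i < n; i++) {
        if (multiply)
            x[i] = s * x[i];
        else
            x[i] = s + x[i];
    }
}

/* After: the "if" is hoisted out of the loop, leaving two simple loops. */
void scale_after(int n, double s, double *x, int multiply)
{
    if (multiply) {
        for (int i = 0; i < n; i++)
            x[i] = s * x[i];
    } else {
        for (int i = 0; i < n; i++)
            x[i] = s + x[i];
    }
}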

  27. Profile -- original

  28. Profile – move “if” statement

  29. Exercise 2 -- Matrix loop order

  30. Procedure inlining
Inlining: replace a function (or subroutine) call by its source.
      do i=1,n
        call dowork(a(i),c(i))
      enddo

      subroutine dowork(x,y)
      y=1.0+x*(1.0+x*0.5)
      end
After inlining:
      do i=1,n
        c(i)=1.0+a(i)*(1.0+a(i)*0.5)
      enddo
Advantages:
  * increased opportunities for optimizations
  * more opportunities for loop nest optimizations
  * reduced call overhead (minor)
Inhibitors to inlining:
  * mismatched arguments (type or shape)
  * no inlining across languages (C to Fortran)
  * no static (SAVE) variables
  * no recursive routines
  * no functions with alternate entries
  * no nested subroutines (as in F90)
Candidates for inlining are modules that:
  * are “small” – not much source code
  * are called many times (say, in a loop)
  * do not take much time per call

  31. A simple matrix multiplication (triple-nested loop)
      subroutine mm(m,n,p,a,lda,b,ldb,c,ldc)
      integer m,n,p,lda,ldb,ldc
      dimension a(lda,p),b(ldb,n),c(ldc,n)
      do i=1,m
        do j=1,n
          do k=1,p
            c(i,j)=c(i,j)+a(i,k)*b(k,j)
          end do
        end do
      end do
      end
Try to speed it up!

  32. 1) Loop reversal
Original loop:
      do i=1,m
        do j=1,n
          do k=1,p
            c(i,j)=c(i,j)+a(i,k)*b(k,j)
          end do
        end do
      end do
Reverse the loop order and pull out the loop constant b(k,j):
      do j=1,n
        do k=1,p
          t=b(k,j)
          do i=1,m
            c(i,j)=c(i,j)+a(i,k)*t
          end do
        end do
      end do

  33. 2) Inner loop unrolling
Before:
      do j=1,n
        do k=1,p
          t=b(k,j)
          do i=1,m
            c(i,j)=c(i,j)+a(i,k)*t
          end do
        end do
      end do
Unroll the inner loop (with a cleanup loop):
      do j=1,n
        do k=1,p
          t=b(k,j)
          do i=1,(m-4)+1,4
            c(i+0,j)=c(i+0,j)+a(i+0,k)*t
            c(i+1,j)=c(i+1,j)+a(i+1,k)*t
            c(i+2,j)=c(i+2,j)+a(i+2,k)*t
            c(i+3,j)=c(i+3,j)+a(i+3,k)*t
          end do
          do i=i,m
            c(i,j)=c(i,j)+a(i,k)*t
          end do
        end do
      end do
(1) Reduces loop overhead
(2) Sometimes improves data reuse

  34. 3) Middle loop unrolling
Before:
      do j=1,n
        do k=1,p
          t=b(k,j)
          do i=1,m
            c(i,j)=c(i,j)+a(i,k)*t
          end do
        end do
      end do
Unroll the middle loop (with a cleanup loop):
      do j=1,n
        do k=1,(p-4)+1,4
          t0=b(k+0,j)
          t1=b(k+1,j)
          t2=b(k+2,j)
          t3=b(k+3,j)
          do i=1,m
            c(i,j)=c(i,j)+a(i,k+0)*t0
     $             +a(i,k+1)*t1
     $             +a(i,k+2)*t2
     $             +a(i,k+3)*t3
          end do
        end do
        do k=k,p
          t0=b(k,j)
          do i=1,m
            c(i,j)=c(i,j)+a(i,k)*t0
          end do
        end do
      end do
(1) Fewer c(i,j) load/store operations
(2) Better locality of b(k,j) references

  35. 4) Outer loop unrolling
Before:
      do j=1,n
        do k=1,p
          t=b(k,j)
          do i=1,m
            c(i,j)=c(i,j)+a(i,k)*t
          end do
        end do
      end do
Unroll the outer loop (with a cleanup loop):
      do j=1,(n-4)+1,4
        do k=1,p
          t0=b(k,j+0)
          t1=b(k,j+1)
          t2=b(k,j+2)
          t3=b(k,j+3)
          do i=1,m
            c(i,j+0)=c(i,j+0)+a(i,k)*t0
            c(i,j+1)=c(i,j+1)+a(i,k)*t1
            c(i,j+2)=c(i,j+2)+a(i,k)*t2
            c(i,j+3)=c(i,j+3)+a(i,k)*t3
          end do
        end do
      end do
      do j=j,n
        do k=1,p
          t0=b(k,j)
          do i=1,m
            c(i,j)=c(i,j)+a(i,k)*t0
          end do
        end do
      end do
Improvement because of the reuse of a(i,k) in the loop.

  36. 5) Loop blocking
Original code:
      do j=1,n
        do k=1,p
          t=b(k,j)
          do i=1,m
            c(i,j)=c(i,j)+a(i,k)*t
          end do
        end do
      end do
Blocked loop:
      do j1=1,n,jblk
        j2=j1+jblk-1
        if (j2.gt.n) j2=n
        do k1=1,p,kblk
          k2=k1+kblk-1
          if (k2.gt.p) k2=p
          do i1=1,m,iblk
            i2=i1+iblk-1
            if (i2.gt.m) i2=m
            do j=j1,j2
              do k=k1,k2
                t=b(k,j)
                do i=i1,i2
                  c(i,j)=c(i,j)+a(i,k)*t
                end do
              end do
            end do
          end do
        end do
      end do
Improves locality of reference (removes out-of-cache memory references).
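For reference, a hedged C rendering of the same final kernel (not from the slides; row-major layout means the unit-stride loop is the one over j, and the block sizes and square-matrix assumption are illustrative):

#include <stddef.h>

#define IB 64
#define KB 64
#define JB 64                       /* illustrative block sizes; tune to the cache */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked C = C + A*B for row-major n x n matrices. The innermost loop runs
   over j so that c[i][j] and b[k][j] are accessed with unit stride (the
   mirror image of the Fortran version), and a[i][k] is a loop constant. */
void matmul_blocked(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i1 = 0; i1 < n; i1 += IB)
        for (size_t k1 = 0; k1 < n; k1 += KB)
            for (size_t j1 = 0; j1 < n; j1 += JB)
                for (size_t i = i1; i < min_sz(i1 + IB, n); i++)
                    for (size_t k = k1; k < min_sz(k1 + KB, n); k++) {
                        double t = a[i * n + k];
                        for (size_t j = j1; j < min_sz(j1 + JB, n); j++)
                            c[i * n + j] += t * b[k * n + j];
                    }
}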

  37. Optimizations - summary
Scalar optimization:
  * improving processor utilization by code transformation and by grouping independent instructions
  * improving memory access by modifying loop nests to take better advantage of the memory hierarchy
Compilers are good at instruction-level optimizations and loop transformations. However, there are differences between the languages:
  * F77 is easiest for the compiler to work with
  * C is more difficult
  * F90/C++ are the most complex to optimize
The user is responsible for presenting the code in a way that allows compiler optimizations:
  * don't violate the language standard
  * write clean and clear code
  * consider the data structures for (false) sharing and alignment
  * consider the data structures in terms of data dependencies
  * use the most natural presentation of algorithms in multi-dimensional arrays

  38. Exercise 3 -- loop unroll/block

  39. Compiler switches and options
The compiler is the primary tool of program optimization:
  * structure of the compiler and the compilation process
  * compiler optimizations
  * steering the compilation – compiler options
  * structure of the run-time libraries and scientific libraries
  * computational domain and computation accuracy

  40. The Compiler
The compiler manages the resources of the computer:
  * registers
  * integer/floating-point execution units
  * load/store/prefetch for data flow in and out of the processor
Knowledge of the implementation details of the processor and system architecture is built into the compiler.
(Diagram: the user program (C/C++/Fortran) is translated through a high-level representation, an intermediate representation and a low-level representation down to machine instructions.)
Along the way the compiler solves:
  * data dependencies
  * control flow dependencies
  * parallelism
  * compacting the code
  * optimal scheduling

  41. MIPSpro compiler components
(Diagram: the f77/f90 and cc/CC front ends translate source to WHIRL; the inter-procedure analyzer, loop nest analyzer, parallel optimizer and code generator operate on WHIRL; the linker produces the executable object.)
  * There are no source-to-source optimizers or compilers
  * source code is translated to the WHIRL intermediate language
    - the same intermediate is used for different levels of interpretation
    - WHIRL2F and WHIRL2C translate back into Fortran or C
  * the inter-procedural analyzer requires a final translation at link time

  42. Compiler optimizations
- Global optimizer:
  * dead code elimination
  * copy propagation
  * loop normalization
  * memory alias analysis
  * strength reduction
- Loop nest optimizer:
  * loop unrolling
  * loop interchange
  * loop fusion/fission
  * loop blocking
  * memory prefetch
  * padding local variables
- Code generation:
  * inner loop unrolling
  * if-conversion
  * read/write optimization
  * recurrence breaking
  * instruction scheduling
- Inter-procedural analyzer:
  * cross-file function inlining
  * dead function elimination
  * dead variable elimination
  * padding common variables
  * constant propagation
- Automatic parallelizer:
  * loop-level work distribution

  43. SGI architecture, ABI, languages
- Instruction Set Architecture (ISA):
  * mips4 (r10000, r12000, r14000 processors)
  * mips3 (r4400)
  * mips2 (r3000, r4000), uses the old compilers
- ABI (Application Binary Interface):
  * -n32 (32-bit pointers, 4-byte integers, 4-byte reals)
  * -64 (64-bit pointers, 4-byte integers, 4-byte reals)
Languages:
- Fortran 77
- Fortran 90
- C
- C++

  44. Optimization levels
-O0           turn off all optimizations
-O1           only local optimizations
-O2 or -O     extensive but conservative optimizations
-O3           extensive optimizations, may sometimes introduce errors
-ipa          inter-procedural analysis (-O2 and -O3 only)
-pfa or -mp   automatic parallelization
-g or -g3     debugging switches (-g forces -O0; -g3 allows debugging with -O3)

  45. Compiler man pages
Primary man pages:
  man f77(1), f90(1), cc(1), CC(1), ld(1)
Some of the option groups are large and have been given their own man pages:
  man opt(5)
  man lno(5)
  man ipa(5)
  man DEBUG_GROUP
  man mp(3F)
  man pe_environ(5)
  man sigfpe(3C)

  46. Options: ABI and ISA
Option           Functionality
-n32             MIPSpro compiler, 32-bit addressing
-64              MIPSpro compiler, 64-bit addressing
-o32 / -32       old ucode compiler, 32-bit addressing
-mips[1234]      ISA; -mips[12] implies the old ucode compiler
Two other ways to define the ABI and ISA:
  * the environment variable SGI_ABI can be set to -n32 or -64
  * ABI/ISA/processor/optimization can be set in the file ~/compiler.defaults or /etc/compiler.defaults; the location of the file can also be defined by the COMPILER_DEFAULTS_PATH environment variable. A typical line in the defaults file:
    DEFAULT:abi=n32:isa=mips4:proc=r14000:arit=3:opt=O3
For example:
    f77 -o prog -n32 -mips4 -r12000 -O3 source.f

  47. Some compiler options
Option           Functionality
-d8 / -d16       DOUBLE PRECISION variables as 8 or 16 bytes
-r8              REAL is REAL*8 and COMPLEX is COMPLEX*16 (explicit sizes are preserved – REAL*4 remains 32 bit)
-i8              convert INTEGER to INTEGER*8 and LOGICAL to 8 bytes
-static          local variables are allocated in fixed (static) locations
-col[72|120]     source line is 72 or 120 columns
-g or -g3        create a symbol table for debugging
-Dname           define name for the pre-processor
-Idir            add include directory dir
-alignN          force alignment on bit boundary N = 8, 16, etc.
-version         show the compiler version
-show            compiler in verbose mode, display all switches
