

Computer Architecture for Medical Applications. Pipelining & Single Instruction Multiple Data – two driving factors of single core performance. Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center; Dietmar Fey, Department for Computer Science.


Presentation Transcript


  1. Computer Architecture for Medical Applications: Pipelining & Single Instruction Multiple Data – two driving factors of single core performance. Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center. Dietmar Fey, Department for Computer Science. CAMA 2013 - D. Fey and G. Wellein

  2. A different view on computer architecture

  3. From high level code to macro-/microcode execution

  sum=0.d0
  do i=1, N
    sum = sum + A(i)
  enddo
  …

  (Diagram: the compiler maps sum to register xmm1 and places A(i) (incl. LD), i (loop counter) and N in registers; the ADD execution unit adds the 1st argument to the 2nd argument and stores the result in the 2nd argument.)

  4. How does high level code interact with execution units
  • Many hardware execution units:
    • LOAD (STORE) operands from L1 cache (register) to register (memory)
    • Floating Point (FP) MULTIPLY and ADD
    • Various integer units
  • Execution units may work in parallel → “superscalar” processor
  • Two important concepts at the hardware level: pipelining + SIMD

  sum=0.d0
  do i=1, N
    sum = sum + A(i)
  enddo
  …

  5. Microprocessors – Pipelining

  6. Introduction: Moore’s law
  • 1965: G. Moore claimed that the #transistors on a processor chip doubles every 12-24 months
  • Examples: Intel Nehalem EX: 2.3 billion transistors; nVIDIA FERMI: 3 billion transistors

  7. Introduction: Moore’s law → faster cycles and beyond
  • Moore’s law → transistors are getting smaller → run them faster
  • Faster clock speed → reduce complexity of instruction execution → pipelining of instructions
  • Increasing transistor count and clock speed allows / requires architectural changes:
    • Pipelining
    • Superscalarity
    • SIMD / vector ops
    • Multi-core / threading
    • Complex on-chip caches
  (Plot: Intel x86 clock speed over time.)

  8. Pipelining of arithmetic/functional units
  • Idea:
    • Split a complex instruction into several simple / fast steps (stages)
    • Each step takes the same amount of time, e.g. a single cycle
    • Execute different steps of different instructions at the same time (in parallel)
  • Allows for shorter cycle times (simpler logic circuits), e.g.:
    • floating point multiplication takes 5 cycles, but
    • the processor can work on 5 different multiplications simultaneously
    • one result at each cycle after the pipeline is full
  • Drawbacks:
    • Pipeline must be filled - startup times (#instructions >> pipeline stages)
    • Efficient use of pipelines requires a large number of independent instructions → instruction level parallelism
    • Requires complex instruction scheduling by compiler/hardware - software pipelining / out-of-order execution
  • Pipelining is widely used in modern computer architectures

  9. Interlude: possible stages for Multiply
  • Real numbers can be represented as mantissa and exponent in a “normalized” representation: s * 0.m * 10^e with
    sign s = {-1, 1}
    mantissa m, which does not contain 0 in the leading digit
    exponent e, some positive or negative integer
  • Multiply two real numbers r1 * r2 = r3, with r1 = s1 * 0.m1 * 10^e1 and r2 = s2 * 0.m2 * 10^e2:
    s1 * 0.m1 * 10^e1 * s2 * 0.m2 * 10^e2 = (s1*s2) * (0.m1*0.m2) * 10^(e1+e2)
  • Normalize the result: s3 * 0.m3 * 10^e3

  10. 5-stage Multiplication Pipeline: A(i) = B(i)*C(i); i = 1,...,N

  Cycle:                1         2         3         4         5         6         ...  N+4
  Separate Mant./Exp.   B(1)C(1)  B(2)C(2)  B(3)C(3)  B(4)C(4)  B(5)C(5)  B(6)C(6)  ...
  Mult. Mantissa                  B(1)C(1)  B(2)C(2)  B(3)C(3)  B(4)C(4)  B(5)C(5)  ...
  Add Exponents                             B(1)C(1)  B(2)C(2)  B(3)C(3)  B(4)C(4)  ...
  Normalize Result                                    B(1)C(1)  B(2)C(2)  B(3)C(3)  ...
  Insert Sign                                                   A(1)      A(2)      ...  A(N)

  The first result is available after 5 cycles (= latency of the pipeline)! After that, one result is completed in each cycle.

  11. 5-stage Multiplication Pipeline: A(i) = B(i)*C(i); i = 1,...,N
  Wind-up/-down phases: empty pipeline stages

  12. Pipelining: speed-up and throughput
  • Assume a general m-stage pipe, i.e. the pipeline depth is m. Speed-up of pipelined vs. non-pipelined execution at the same clock speed:
    Tseq / Tpipe = (m*N) / (N+m) ≈ m for large N (N >> m)
  • Throughput of pipelined execution (= average results per cycle) when executing N instructions in a pipeline with m stages:
    N / Tpipe(N) = N / (N+m) = 1 / (1 + m/N)
  • Throughput for large N: N / Tpipe(N) ≈ 1
  • Number of independent operations (Nc) required to achieve Tp results per cycle:
    Tp = 1 / (1 + m/Nc)  →  Nc = Tp * m / (1 - Tp); for Tp = 0.5: Nc = m

  13. Throughput as a function of pipeline stages
  (Plot: throughput vs. number of independent instructions for several m = #pipeline stages; the 90% pipeline efficiency points are marked.)

  14. Software pipelining
  Assumption: instructions block execution if their operands are not available.

  Fortran code:
  do i=1,N
    a(i) = a(i) * c
  end do

  Example latencies:
  load a[i]           Load operand to register (4 cycles)
  mult a[i] = c,a[i]  Multiply a(i) with c (2 cycles); a[i], c in registers
  store a[i]          Write back result from register to mem./cache (2 cycles)
  branch.loop         Increase loop counter as long as i is less equal N (0 cycles)

  Simple pseudo code:
  loop: load a[i]
        mult a[i] = c, a[i]
        store a[i]
        branch.loop

  Optimized pseudo code:
  loop: load a[i+6]
        mult a[i+2] = c, a[i+2]
        store a[i]
        branch.loop

  15. Software pipelining: a[i] = a[i]*c; N = 12
  Naive instruction issue: each iteration runs load a[i], mult a[i]=c,a[i], store a[i] strictly in sequence → T = 96 cycles.
  Optimized instruction issue (software pipelined): a prolog fills the pipeline (load a[1] … load a[5] before the first mult a[1]=c,a[1] issues), the kernel then issues one load, one mult and one store per cycle (e.g. load a[8], mult a[4]=c,a[4], store a[2] in the same cycle), and an epilog drains the remaining mults and stores → T = 19 cycles.

  16. Efficient use of pipelining
  • Software pipelining can be done by the compiler, but efficient reordering of the instructions requires deep insight into the application (data dependencies) and the processor (latencies of functional units)
  • Re-ordering of instructions can also be done at runtime by out-of-order (OOO) execution
  • (Potential) dependencies within the loop body may prevent efficient software pipelining or OOO execution, e.g.:

  No dependency:
  do i=1,N
    a(i) = a(i) * c
  end do

  Pseudo-dependency:
  do i=1,N-1
    a(i) = a(i+1) * c
  end do

  Dependency:
  do i=2,N
    a(i) = a(i-1) * c
  end do

  17. Pipelining: Data dependencies

  18. Pipelining: Data dependencies

  19. Pipelining: Data dependencies: a[i] = a[i-1]*c; N = 12
  Naive instruction issue: T = 96 cycles.
  Optimized instruction issue: after load a[1], each mult a[i]=c,a[i-1] must wait for the previous mult to complete, so only the stores can overlap (prolog + kernel) → T = 26 cycles.
  The length of the MULT pipeline determines the throughput.

  20. Fill the pipeline with independent recursive streams
  Intel Sandy Bridge (desktop): 4 cores; 3.5 GHz; SMT. MULT pipeline depth: 5 stages → 1 F / 5 cycles for a recursive update.

  Thread 0:
  do i=1,N
    A(i)=A(i-1)*s
    B(i)=B(i-1)*s
    C(i)=C(i-1)*s
    D(i)=D(i-1)*s
    E(i)=E(i-1)*s
  enddo

  The MULT pipe then holds e.g. B(2)*s, A(2)*s, E(1)*s, D(1)*s, C(1)*s at the same time: 5 independent updates on a single core!

  21. Pipelining: beyond multiplication
  • Typical number of pipeline stages: 2-5 for the hardware pipelines on modern CPUs
  • x86 processors (AMD, Intel): 1 MULT & 1 ADD unit per processor core
  • No hardware for div / sqrt / exp / sin … → expensive instructions
  • (Table on slide: “FP costs” in cycles per instruction for the Intel Core2 architecture.)
  • Other instructions are also pipelined, e.g. LOAD operand to register (4 cycles)

  22. Pipelining: potential problems (1)

  void scale_shift(double *A, double *B, double *C, int n) {
    for (int i = 0; i < n; ++i)
      C[i] = A[i] + B[i];
  }

  • Hidden data dependencies:
    • C/C++ allows “pointer aliasing”, i.e. A = &C[-1]; B = &C[-2] → C[i] = C[i-1] + C[i-2] → dependency!
    • The compiler cannot resolve potential pointer aliasing conflicts on its own!
  • If no pointer aliasing occurs, tell the compiler, e.g.
    • use the –fno-alias switch for the Intel compiler
    • pass arguments as (double *restrict A, …) (C99 standard only)

  23. Pipelining: potential problems (2)

  do i=1, N
    call elementprod(A(i), B(i), psum)
    C(i) = psum
  enddo
  …
  function elementprod(a, b, psum)
  …
  psum = a*b

  • Simple subroutine/function calls within a loop → inline subroutines! (can be done by the compiler…)

  do i=1, N
    psum = A(i)*B(i)
    C(i) = psum
  enddo

  24. Pipelining: potential problems (3a)
  Can we use pipelining here, or does this cost us 8*3 cycles (assuming a 3-stage ADD pipeline)?

  25. Pipelining: potential problems (3b)
  • More general – “reduction operations”?
  • Benchmark: run the assembly language kernel of the loop below with N = 32, 64, 128, …, 4096 on a processor with
    • 3.5 GHz clock speed → ClSp = 3500 Mcycle/s
    • 1 pipelined ADD unit (latency 3 cycles)
    • 1 pipelined LOAD unit (latency 4 cycles)

  sum=0.d0
  do i=1, N
    sum = sum + A(i)
  enddo
  …

  (Diagram: sum in register xmm1; A(i) (incl. LD); i (loop counter); N; ADD 1st argument to 2nd argument and store result in 2nd argument.)
  Expectation: 1 cycle per iteration (after 7 iterations)

  26. Pipelining: potential problems (4)
  • Expected performance: throughput * clock speed
  • Throughput: N/T(N) = N / (L+N). Assumption: L is the total latency of one iteration, and one result per cycle is delivered after pipeline startup. Total runtime: L+N cycles.
  • Total latency: L = 4 cycles (LOAD) + 3 cycles (ADD) = 7 cycles
  • Performance for N iterations: 3500 MHz * (N / (L+N)) iterations/cycle
  • Maximum performance (N → ∞): 3500 Mcycle/s * 1 iteration/cycle = 3500 MIterations/s
  (Diagram: LOAD pipeline working on A(i), A(i-1), A(i-2), A(i-3); ADD pipeline working on A(i-4), A(i-5), A(i-6).)

  27. Pipelining: potential problems (5)

  sum=0.d0
  do i=1, N
    sum = sum + A(i)
  enddo
  …

  Throughput here: N/T(N) = N / (L+3*N). Why? There is a dependency on sum: the next instruction needs to wait for the completion of the previous one → only 1 out of 3 ADD stages is active → 3 cycles per iteration.

  28. Pipelining: potential problems (6)
  • Increase pipeline utilization by “loop unrolling”: “2-way Modulo Variable Expansion” (N is even)

  sum1=0.d0
  sum2=0.d0
  do i=1, N, 2
    sum1 = sum1 + A(i)
    sum2 = sum2 + A(i+1)
  enddo
  sum = sum1 + sum2

  2 out of 3 pipeline stages can be filled → 2 results every 3 cycles → 1.5 cycles/iteration

  29. Pipelining: potential problems (7)
  • 4-way Modulo Variable Expansion (MVE) to get the best performance (in principle 3-way should do as well)
  • The sum is split up into 4 independent partial sums
  • The compiler can do that, if it is allowed to do so…
  • Computer floating point arithmetic is not associative! If you require a binary exact result (-fp-model strict), the compiler is not allowed to do this transformation
  • L = (7 + 3*3) cycles (cf. prev. slide)

  “4-way MVE”:
  Nr=4*(N/4)
  sum1=0.d0
  sum2=0.d0
  sum3=0.d0
  sum4=0.d0
  do i=1, Nr, 4
    sum1 = sum1 + A(i)
    sum2 = sum2 + A(i+1)
    sum3 = sum3 + A(i+2)
    sum4 = sum4 + A(i+3)
  enddo
  do i=Nr+1, N    ! remainder loop
    sum1 = sum1 + A(i)
  enddo
  sum = sum1 + sum2 + sum3 + sum4

  30. Pipelining: the instruction pipeline
  • Besides arithmetic & functional units, instruction execution itself is also pipelined, e.g. one instruction performs at least 3 steps: Fetch instruction from L1I → Decode instruction → Execute instruction
  • Hardware pipelining on the processor (all units can run concurrently): while instruction 1 executes, instruction 2 is decoded and instruction 3 is fetched, and so on (diagram on slide)
  • Branches can stall this pipeline! (speculative execution, predication)
  • Each unit is pipelined itself (cf. Execute = multiply pipeline)

  31. Pipelining: the instruction pipeline
  • Problem: unpredictable branches to other instructions
  • Assume the result determines the next instruction: the fetch of instruction 2 can only start after instruction 1 has executed, leaving pipeline stages empty in between (diagram on slide)

  32. Microprocessors – Superscalar

  33. Superscalar processors
  • Superscalar processors provide additional hardware (i.e. transistors) to execute multiple instructions per cycle!
  • Parallel hardware components / pipelines are available to
    • fetch / decode / issue multiple instructions per cycle (typically 3-6 per cycle)
    • perform multiple integer / address calculations per cycle (e.g. 6 integer units on Itanium2)
    • load (store) multiple operands (results) from (to) cache per cycle (typically one load AND one store per cycle)
    • perform multiple floating point instructions per cycle (typically 2 floating point instructions per cycle, e.g. 1 MULT + 1 ADD)
  • On superscalar RISC processors, out-of-order (OOO) execution hardware is available to optimize the usage of the parallel hardware

  34. Superscalar processors – instruction level parallelism
  • Multiple units enable the use of Instruction Level Parallelism (ILP): the instruction stream is “parallelized” on the fly
  • Issuing m concurrent instructions per cycle: m-way superscalar
  • Modern processors are 3- to 6-way superscalar & can perform 2 or 4 floating point operations per cycle
  (Diagram: a 4-way “superscalar” pipeline fetching, decoding and executing 4 instructions in parallel in each cycle.)

  35. Superscalar processors – ILP in action
  “3-way Modulo Variable Expansion” (N is a multiple of 3):

  sum1=0.d0  ! → reg. R11
  sum2=0.d0  ! → reg. R12
  sum3=0.d0  ! → reg. R13
  do i=1, N, 3
    sum1 = sum1 + A(i)
    sum2 = sum2 + A(i+1)
    sum3 = sum3 + A(i+2)
  enddo
  sum = sum1 + sum2 + sum3

  • Complex register management is not shown (R4 contains A(i-4)); the register set holds A(i)→R0, A(i-1)→R1, A(i-2)→R2, A(i-3)→R3, while the ADD unit executes R11=R11+R4, R12=R12+R5, R13=R13+R6
  • 2-way superscalar: 1 LOAD instruction + 1 ADD instruction completed per cycle
  • Often cited metrics for superscalar processors:
    • Instructions Per Cycle: IPC = 2 above
    • Cycles Per Instruction: CPI = 0.5 above

  36. Superscalar processor – Intel Nehalem design
  • Decode & issue a max. of 4 instructions per cycle: IPC = 4 → min. CPI = 0.25 cycles/instruction
  • Parallel units:
    • FP ADD & FP MULT (work in parallel)
    • LOAD + STORE (work in parallel)
  • Max. FP performance: 1 ADD + 1 MULT instruction per cycle
  • Max. performance: A(i) = r0 + r1 * B(i)
  • 1/2 of max. FP performance: A(i) = r1 * B(i)
  • 1/3 of max. FP performance: A(i) = A(i) + B(i) * C(i)

  37. Microprocessors – Single Instruction Multiple Data (SIMD) processing
  Basic idea: apply the same instruction to multiple operands in parallel

  38. SIMD-processing – Basics
  • Single Instruction Multiple Data (SIMD) instructions allow the concurrent execution of the same operation on “wide” registers
  • x86_64 SIMD instruction sets:
    • SSE: register width = 128 bit → 2 double (4 single) precision FP operands
    • AVX: register width = 256 bit → 4 double (8 single) precision FP operands
  • “Scalar” (non-SIMD) execution: 1 single/double operand, i.e. only the lower 64 bit (32 bit) of the registers are used
  • Integer operands: SSE can be configured very flexibly: 1 x 128 bit, …, 16 x 8 bit
  • AVX: no support for using the 256 bit register width for integer operations
  • SIMD execution = “vector” execution
  • If the compiler has vectorized a loop → SIMD instructions are used

  39. SIMD-processing – Basics
  • Example: adding two registers holding double precision floating point operands using a 256 bit register (AVX):
    Scalar execution: R2 = ADD [R0,R1] operates on the lower 64 bit only: C[0] = A[0] + B[0]
    SIMD execution: R2 = V64ADD [R0,R1] adds all four 64 bit slots at once: C[0..3] = A[0..3] + B[0..3]
  • If 128 bit SIMD instructions (SSE) are executed → only half of the register width is used

  40. SIMD-processing – Basics
  • Steps (done by the compiler) for “SIMD-processing”:

  for(int i=0; i<n; i++)
    C[i] = A[i] + B[i];

  “Loop unrolling”:
  for(int i=0; i<n; i+=4){
    C[i]   = A[i]   + B[i];
    C[i+1] = A[i+1] + B[i+1];
    C[i+2] = A[i+2] + B[i+2];
    C[i+3] = A[i+3] + B[i+3];
  }
  //remainder loop handling

  “Pseudo-assembler”:
  LABEL1:
    VLOAD R0 ← A[i]      // load 256 bits starting at the address of A[i] into R0
    VLOAD R1 ← B[i]
    V64ADD [R0,R1] → R2  // add the corresponding 64 bit entries of R0 and R1, store the 4 results in R2
    VSTORE R2 → C[i]     // store R2 (256 bits) to the address starting at C[i]
    i ← i+4
    i<(n-4)? JMP LABEL1
  //remainder loop handling

  41. SIMD-processing – Basics
  • No SIMD-processing for loops with data dependencies:

  for(int i=0; i<n; i++)
    A[i] = A[i-1]*s;

  • “Pointer aliasing” may prevent the compiler from SIMD-processing:

  void scale_shift(double *A, double *B, double *C, int n) {
    for (int i = 0; i < n; ++i)
      C[i] = A[i] + B[i];
  }

  • C/C++ allows that A = &C[-1] and B = &C[-2] → C[i] = C[i-1] + C[i-2]: dependency → no SIMD-processing
  • If no pointer aliasing is used, tell the compiler, e.g. use the –fno-alias switch for the Intel compiler → SIMD-processing

  42. SIMD-processing – Basics
  • SIMD-processing of a vector sum:

  double s = 0.0;
  for(int i=0; i<n; i++)
    s = s + A[i];

  • The data dependency on s must be resolved for SIMD-processing (assume AVX). The compiler does this transformation (Modulo Variable Expansion) – if the programmer allows it to do so! (e.g. use –O3 instead of –O1)

  double s0=0.0, s1=0.0, s2=0.0, s3=0.0;
  for(int i=0; i<n; i+=4){
    s0 = s0 + A[i];
    s1 = s1 + A[i+1];
    s2 = s2 + A[i+2];
    s3 = s3 + A[i+3];
  }
  //remainder
  s = s0+s1+s2+s3;

  • In registers: R0 starts as (0.d0, 0.d0, 0.d0, 0.d0); each step performs V64ADD(R0,R1) → R0; at the end a “horizontal” ADD sums up the 4 64 bit entries of R0

  43. SIMD-processing: what about pipelining?!

  R0 ← (0.0, 0.0, 0.0, 0.0)
  do i=1, N, 4
    VLOAD A(i:i+3) → R1
    V64ADD(R0,R1) → R0
  enddo
  sum ← HorizontalADD(R0)

  Need to do another MVE step to fill the pipeline stages: the “vertical adds” R0 = R0 + A(1:4), R0 = R0 + A(5:8), R0 = R0 + A(9:12), … all depend on R0.

  R0 ← (0.0, 0.0, 0.0, 0.0)
  R1 ← (0.0, 0.0, 0.0, 0.0)
  R2 ← (0.0, 0.0, 0.0, 0.0)
  do i=1, N, 12
    VLOAD A(i:i+3)    → R3
    VLOAD A(i+4:i+7)  → R4
    VLOAD A(i+8:i+11) → R5
    V64ADD(R0,R3) → R0
    V64ADD(R1,R4) → R1
    V64ADD(R2,R5) → R2
  enddo
  …
  V64ADD(R0,R1) → R0
  V64ADD(R0,R2) → R0
  sum ← HorizontalADD(R0)

  44. SIMD-processing: what about pipelining?!
  (Plot: performance vs. unrolling factor of the vectorized code, double precision.)
  • 1 AVX iteration performs 4 (successive) i-iterations
  • Performance: 4x higher than the “scalar” version
  • Start-up phase much longer…

  45. Compiler generated AVX code (loop body)
  • Baseline version (“scalar”): no pipelining – no SIMD → 3 cycles / iteration
  • Compiler generated “AVX version” (-O3 –xAVX):
    • SIMD processing: vaddpd %ymm8 → 4 dp operands (4-way unrolling)
    • Pipelining: 8-way MVE of the SIMD code → 0.25 cycles / iteration
    • 32-way unrolling in total

  46. SIMD processing – vector sum (double precision) – 1 core
  (Plot: performance vs. location of the “input data” (A[]) in the memory hierarchy; curves: Peak, Scalar, Plain, AVX/SIMD.)
  • SIMD: most impact if data is close to the core – other bottlenecks stay the same!
  • Scalar: code execution in the core is slower than any data transfer
  • Plain: no SIMD but 4-way MVE
  • AVX/SIMD: full benefit only if data is in the L1 cache

  47. Data parallel SIMD processing
  • Requires independent vector-like operations (“data parallel”)
  • The compiler is required to generate “vectorized” code → check the compiler output
  • Check for the use of “packed SIMD” instructions at runtime (likwid) or in the assembly code
  • Packed SIMD may require alignment constraints, e.g. 16-byte alignment for efficient load/store on Intel Core2 architectures
  • Check also for SIMD LOAD / STORE instructions
  • The use of packed SIMD instructions reduces the overall number of instructions (typical theoretical max. of 4 instructions / cycle) → SIMD code may improve performance but reduce CPI!

  48. Data parallel SIMD processing: boosting performance
  • Putting it all together: a modern x86_64 based Intel / AMD processor:
    • One FP MULTIPLY and one FP ADD pipeline can run in parallel, each with a throughput of one FP instruction/cycle (FMA units on AMD Interlagos) → maximum 2 FP instructions/cycle
    • Each pipeline operates on 128 (256) bit registers for packed SSE (AVX) instructions → 2 (4) double precision FP operations per SSE (AVX) instruction → 4 (8) FP operations / cycle (1 MULT & 1 ADD on 2 (4) operands)
  • Peak performance of a 3 GHz CPU (core):
    SSE: 12 GFlop/s or AVX: 24 GFlop/s (double precision)
    SSE: 24 GFlop/s or AVX: 48 GFlop/s (single precision)
  • BUT for “scalar” code: 6 GFlop/s (double and single precision)!

  49. Maximum floating point (FP) performance: Pcore = F * S * n
  F  FP instructions per cycle: 2 (1 MULT and 1 ADD)
  S  FP ops / instruction: 4 (dp) / 8 (sp) (256 bit SIMD registers – “AVX”)
  n  clock speed: ∽2.5 GHz
  → P = 20 GF/s (dp) / 40 GF/s (sp)
  There is no single driving force for single core performance!
  Scalar (non-SIMD) execution: S = 1 FP op/instruction (dp / sp) → P = 5 GF/s (dp / sp)

  50. SIMD registers: floating point (FP) data and beyond
  • Possible data types in an SSE register (not to scale):
    integer: 16 x 8 bit, 8 x 16 bit, 4 x 32 bit, 2 x 64 bit, 1 x 128 bit
    floating point: 4 x 32 bit, 2 x 64 bit
  • AVX only applies to FP data: 8 x 32 bit, 4 x 64 bit
