Lecture 16

Lecture 16. SSE vector processing: SIMD Multimedia Extensions. We've seen how we can apply multithreading to speed up the cardiac simulator, but there is another kind of parallelism available to us: SSE.



Presentation Transcript


  1. Lecture 16: SSE vector processing. SIMD Multimedia Extensions

  2. Improving performance with SSE
• We've seen how we can apply multithreading to speed up the cardiac simulator
• But there is another kind of parallelism available to us: SSE
Scott B. Baden / CSE 160 / Wi '16

  3. Hardware Control Mechanisms
Flynn's classification (1966): how do the processors issue instructions?
• SIMD: Single Instruction, Multiple Data. Processing elements (PEs) execute a global instruction stream in lock-step, driven by a single control unit over an interconnect.
• MIMD: Multiple Instruction, Multiple Data. Clusters and servers; each PE has its own control unit, and the processors execute instruction streams independently.
[Figure: block diagrams of the SIMD and MIMD organizations: PEs, control units, interconnect]

  4. SIMD (Single Instruction Multiple Data)
• Operate on regular arrays of data
• Two landmark SIMD designs
  • ILLIAC IV (1960s)
  • Connection Machine 1 and 2 (1980s)
• Vector computer: Cray-1 (1976)
• Intel and others support SIMD for multimedia and graphics
  • SSE: Streaming SIMD Extensions; Altivec
  • Operations defined on vectors
• GPUs, Cell Broadband Engine (Sony Playstation)
• Reduced performance on data-dependent or irregular computations

forall i = 0:N-1
    p[i] = a[i] * b[i]

forall i = 0:n-1
    x[i] = y[i] + z[K[i]]
end forall

forall i = 0:n-1
    if (x[i] < 0) then
        y[i] = -x[i]
    else
        y[i] = x[i]
    endif
end forall

[Figure: element-wise vector multiply p = a * b]

  5. Are SIMD processors general purpose? A. Yes B. No

  6. Are SIMD processors general purpose? A. Yes B. No

  7. What kind of parallelism does multithreading provide? A. MIMD B. SIMD

  8. What kind of parallelism does multithreading provide? A. MIMD B. SIMD

  9. Streaming SIMD Extensions
• SIMD instruction set on short vectors
• SSE: SSE3 on Bang, but most will need only SSE2
  See https://goo.gl/DIokKj and https://software.intel.com/sites/landingpage/IntrinsicsGuide
• Bang: 8 x 128-bit vector registers (newer CPUs have 16)

for i = 0:N-1 { p[i] = a[i] * b[i]; }

[Figure: element-wise vector multiply p = a * b; a 128-bit register packs multiple operands: doubles, floats, ints, etc.]

  10. SSE Architectural support
• SSE2, SSE3, SSE4, AVX
• Vector operations on short vectors: add, subtract, shuffling (handles conditionals)
• 128-bit load/store; data transfer: load/store
• SSE2+: 16 XMM registers (128 bits)
  • These are in addition to the conventional registers and are treated specially
• See the Intel intrinsics guide:
  software.intel.com/sites/landingpage/IntrinsicsGuide
• May need to invoke compiler options depending on level of optimization

  11. C++ intrinsics
• C++ functions and data types that map directly onto 1 or more machine instructions
• Supported by all major compilers
• The interface provides 128-bit data types and operations on those data types
  • __m128 (float)
  • __m128d (double)
• Data movement and initialization
  • _mm_load_pd (aligned load)
  • _mm_store_pd
  • _mm_loadu_pd (unaligned load)
• Data may need to be aligned

__m128d vec1, vec2, vec3;
for (i = 0; i < N; i += 2) {
    vec1 = _mm_load_pd(&b[i]);
    vec2 = _mm_load_pd(&c[i]);
    vec3 = _mm_div_pd(vec1, vec2);
    vec3 = _mm_sqrt_pd(vec3);
    _mm_store_pd(&a[i], vec3);
}

  12. How do we vectorize?
• Original code
  double a[N], b[N], c[N];
  for (i = 0; i < N; i++) { a[i] = sqrt(b[i] / c[i]); }
• Identify vector operations, reduce loop bound
  for (i = 0; i < N; i += 2)
      a[i:i+1] = vec_sqrt(b[i:i+1] / c[i:i+1]);
• The vector instructions
  __m128d vec1, vec2, vec3;
  for (i = 0; i < N; i += 2) {
      vec1 = _mm_load_pd(&b[i]);
      vec2 = _mm_load_pd(&c[i]);
      vec3 = _mm_div_pd(vec1, vec2);
      vec3 = _mm_sqrt_pd(vec3);
      _mm_store_pd(&a[i], vec3);
  }

  13. Performance
• Without SSE vectorization: 0.777 sec.
• With SSE vectorization: 0.454 sec.
• Speedup due to vectorization: 1.7x
• Code: $PUB/Examples/SSE/Vec

Original:
double *a, *b, *c;
for (i = 0; i < N; i++) {
    a[i] = sqrt(b[i] / c[i]);
}

Vectorized:
double *a, *b, *c;
__m128d vec1, vec2, vec3;
for (i = 0; i < N; i += 2) {
    vec1 = _mm_load_pd(&b[i]);
    vec2 = _mm_load_pd(&c[i]);
    vec3 = _mm_div_pd(vec1, vec2);
    vec3 = _mm_sqrt_pd(vec3);
    _mm_store_pd(&a[i], vec3);
}

  14. The assembler code
Vectorized source:
double *a, *b, *c;
__m128d vec1, vec2, vec3;
for (i = 0; i < N; i += 2) {
    vec1 = _mm_load_pd(&b[i]);
    vec2 = _mm_load_pd(&c[i]);
    vec3 = _mm_div_pd(vec1, vec2);
    vec3 = _mm_sqrt_pd(vec3);
    _mm_store_pd(&a[i], vec3);
}

Scalar source (double *a, *b, *c; a[i] = sqrt(b[i] / c[i]);) compiles to:
.L12:
    movsd   xmm0, QWORD PTR [r12+rbx]
    divsd   xmm0, QWORD PTR [r13+0+rbx]
    sqrtsd  xmm1, xmm0
    ucomisd xmm1, xmm1            # checks for illegal sqrt
    jp      .L30
    movsd   QWORD PTR [rbp+0+rbx], xmm1
    add     rbx, 8                # ivtmp.135
    cmp     rbx, 16384
    jne     .L12

  15. What prevents vectorization
• Interrupted flow out of the loop
  for (i = 0; i < n; i++) {
      a[i] = b[i] + c[i];
      maxval = (a[i] > maxval ? a[i] : maxval);
      if (maxval > 1000.0) break;
  }
  Loop not vectorized/parallelized: multiple exits
• This loop will vectorize
  for (i = 0; i < n; i++) {
      a[i] = b[i] + c[i];
      maxval = (a[i] > maxval ? a[i] : maxval);
  }

  16. SSE2 Cheat sheet (load and store)
• xmm: one operand is a 128-bit SSE2 register
• mem/xmm: other operand is in memory or an SSE2 register
• {SS} Scalar Single-precision FP: one 32-bit operand in a 128-bit register
• {PS} Packed Single-precision FP: four 32-bit operands in a 128-bit register
• {SD} Scalar Double-precision FP: one 64-bit operand in a 128-bit register
• {PD} Packed Double-precision FP: two 64-bit operands in a 128-bit register
• {A} 128-bit operand is aligned in memory
• {U} the 128-bit operand is unaligned in memory
• {H} move the high half of the 128-bit operand
• {L} move the low half of the 128-bit operand
Credit: Krste Asanovic & Randy H. Katz
