Discover how Visual C++ 2012 can fully utilize massive hardware resources, from Ivy Bridge's 1.4 billion transistors to Tegra 3's multiple cores. This presentation by Jim Radigan and Don McCrady goes under the covers to show how C++ can exploit hardware capabilities through PPL (Parallel Patterns Library) and AMP (Accelerated Massive Parallelism). Learn about the forms of parallelism (super scalar, vector, and more) that can improve your code's performance, and explore the native C++ renaissance.
It’s all about performance: Using Visual C++ 2012 to maximize your hardware
Jim Radigan - Dev Lead/Architect, C++ Optimizer
Don McCrady - Dev Lead, C++ AMP
#211 3-013
Mission: Go under the covers. Make folks aware of massive hardware resources, then show how C++ exploits them by covering PPL, AMP, or “doing nothing”!
Ivy Bridge 1.4 Billion Transistors
Going Native • You’ve been hearing about the native C++ renaissance • This is what it’s all about: exploiting the hardware
Ivy Bridge C++ PPL AMP
Agenda $87.7 B • 1. Hardware • 2. C++ auto vec+par • 3. C++ PPL • 4. C++ AMP $100 .0B +
Hardware – Forms of Parallelism • Super Scalar • Vector • Vector + Parallel • SPMD
Super Scalar – instruction level parallelism • 20% of ILP resides in a basic block • 60% resides across two adjacent “basic blocks” • 20% is scattered through the rest of the code
Super Scalar - needs speculative execution
bar:
  140: r0 = 42
  141: r1 = r0 + r3
  142: M[r1] = r1
  143: zflag = r3 - r2
  144: jz foo
  145: …
  …
  188:
  189:
foo:
  190: r4 = 0
  191: r5 = 0
  192: r6 = 0
  193: M[r1] = 0
Super Scalar - Path of certain execution
foo:
  190: r4 = 0
bar:
  140: r0 = 42
  141: r1 = r0 + r3
  142: M[r1] = r1
  143: zflag = r3 - r2
  144: jz foo
  191: r5 = 0
  192: r6 = 0
  193: M[r1] = 0
When there are NO branches between a micro-op and its retirement to the visible architectural state, it’s no longer speculative
Super Scalar – enables C++ vectorization
void foo(int *A, int *B, int *C) {
  if ( (__isa_available == 2)      … SSE 4.2?
    && (&A[1000] < &B[0])          … pointer overlap
    && (&A[1000] < &C[0]) ) {      … pointer overlap
    … FAST VECTOR/PARALLEL LOOP …
  } else
    … SEQUENTIAL LOOP …
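The guarded loop on this slide can be sketched in portable C++. This is only an illustration of the idea, not the code the compiler emits: the names `disjoint` and `add_arrays` are made up here, the ISA check is left out, and `std::less` is used because comparing unrelated pointers with `<` is not well defined in standard C++.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Returns true when the ranges [a, a+n) and [b, b+n) do not overlap.
// std::less gives a well-defined total order even for unrelated pointers.
inline bool disjoint(const int* a, const int* b, std::size_t n) {
    std::less<const int*> before;
    return !before(a, b + n) || !before(b, a + n);
}

// Hypothetical two-version loop for A[i] = B[i] + C[i].
// Returns true if the "fast" version (safe to vectorize) was taken.
bool add_arrays(int* A, const int* B, const int* C, std::size_t n) {
    assert(A && B && C);
    if (disjoint(A, B, n) && disjoint(A, C, n)) {
        // No overlap: iterations are independent, so a compiler is free
        // to execute several of them at once with vector instructions.
        for (std::size_t i = 0; i < n; ++i) A[i] = B[i] + C[i];
        return true;
    }
    // Possible overlap: keep the strict one-element-at-a-time order.
    for (std::size_t i = 0; i < n; ++i) A[i] = B[i] + C[i];
    return false;
}
```

Both branches contain the same source loop; the point is that only the guarded branch may legally be reordered. Calling `add_arrays(buf + 1, buf, buf, 4)` on a single buffer takes the sequential path because the output overlaps the inputs.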
Vector: “addps xmm1, xmm0” computes xmm1 = xmm1 + xmm0
Vector - CPU • SCALAR (1 operation): add r3, r1, r2 computes r3 = r1 + r2 • VECTOR (N operations, N = vector length): vadd v3, v1, v2 computes v3 = v1 + v2 element by element
Arrays of Parallel Threads - SPMD • All threads run the same code (SPMD) • Each thread has an ID (threadID = 0 1 2 3 4 5 6 7) that it uses to compute memory addresses and make control decisions
  …
  float x = input[threadID];
  float y = func(x);
  output[threadID] = y;
  …
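A minimal CPU-side sketch of the SPMD pattern, using `std::thread` rather than any GPU API. The body of `func` is a placeholder (the slide does not define it), and `spmd_body` and `run_spmd` are names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder for the slide's func(x); the real function is not given.
inline float func(float x) { return 2.0f * x + 1.0f; }

// Every thread runs this same body; only thread_id differs.
void spmd_body(std::size_t thread_id,
               const std::vector<float>& input,
               std::vector<float>& output) {
    assert(thread_id < input.size());
    float x = input[thread_id];   // the ID selects the memory address
    float y = func(x);
    output[thread_id] = y;        // each thread writes a distinct element
}

// Launches one thread per element and waits for all of them.
std::vector<float> run_spmd(const std::vector<float>& input) {
    std::vector<float> output(input.size());
    std::vector<std::thread> threads;
    threads.reserve(input.size());
    for (std::size_t id = 0; id < input.size(); ++id)
        threads.emplace_back([&, id] { spmd_body(id, input, output); });
    for (auto& t : threads) t.join();
    return output;
}
```

One OS thread per element is far too heavy for real work; it is used here only to make the "same code, different ID" structure visible.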
Agenda • 1. Hardware • 2. C++ auto vec+par • 3. C++ PPL • 4. C++ AMP
C++ Vectorizer – VS2012 Compiler Super Scalar Vector Vector + Parallel
Simple vector add loop
for (i = 0; i < 1000; i++)
  A[i] = B[i] + C[i];
Compiler looks across loop iterations!
for (i = 0; i < 1000/4; i++) {
  movps xmm0, [ecx]
  movps xmm1, [eax]
  addps xmm0, xmm1
  movps [edx], xmm0
}
Compiler or “Do it yourself” C++
void add(float* A, float* B, float* C, int size) {
  for (int i = 0; i < size/4; ++i) {
    p_v1 = _mm_loadu_ps(A);
    p_v2 = _mm_loadu_ps(B);
    res = _mm_add_ps(p_v1, p_v2);
    _mm_store_ps(C, res);
    ….
C++ or Klingon
Vector - all loads before all stores: “addps xmm1, xmm0” computes xmm1 = xmm1 + xmm0
Legal to vectorize?
FOR ( j = 2; j <= 5; j++) A( j ) = A( j-1 ) + A( j+1 )
Not Equal !!
A(2:5) = A(1:4) + A(3:6)
A(3) = ?
Vector Semantics • ALL loads before ALL stores
A(2:5) = A(1:4) + A(3:6)
VR1 = LOAD(A(1:4))
VR2 = LOAD(A(3:6))
VR3 = VR1 + VR2 // A(3) = F( A(2), A(4) )
STORE(A(2:5)) = VR3
Vector Semantics • Instead - load store load store ...
FOR ( j = 2; j <= 257; j++) A( j ) = A( j-1 ) + A( j+1 )
A(2) = A(1) + A(3)
A(3) = A(2) + A(4) // A(3) = F( A(1), A(2), A(3), A(4) )
A(4) = A(3) + A(5)
A(5) = A(4) + A(6)
…
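The difference between the two orderings is easy to check by running both. The sketch below models "all loads before all stores" by reading from a snapshot of the array; `scalar_order` and `vector_order` are names invented for this example, and index 0 is unused so that indices line up with the slides' 1-based A(j).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar semantics: load, store, load, store ...
// Each iteration sees the values stored by earlier iterations.
std::vector<int> scalar_order(std::vector<int> A, std::size_t lo, std::size_t hi) {
    assert(lo >= 1 && hi + 1 < A.size());
    for (std::size_t j = lo; j <= hi; ++j)
        A[j] = A[j - 1] + A[j + 1];
    return A;
}

// Vector semantics: ALL loads happen before ANY store,
// modeled here by reading from an untouched copy of A.
std::vector<int> vector_order(std::vector<int> A, std::size_t lo, std::size_t hi) {
    assert(lo >= 1 && hi + 1 < A.size());
    const std::vector<int> old = A;
    for (std::size_t j = lo; j <= hi; ++j)
        A[j] = old[j - 1] + old[j + 1];
    return A;
}
```

With A = {0, 1, 2, 3, 4, 5, 6, 7} and j running from 2 to 5, the scalar loop produces A(3) = 8 (it uses the freshly written A(2) = 4), while the vector form produces A(3) = 6 (it uses the original A(2) = 2). That mismatch is why this loop is not legal to vectorize as written.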
Doubled the optimizer A ( a1 * I + c1 ) ?= A ( a2 * I’ + c2)
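The question on this slide is whether two subscripts, a1*I + c1 and a2*I' + c2, can ever name the same element. One classic conservative check is the GCD test, sketched below. It is shown only as an example of this kind of analysis; the slides do not say which tests the Visual C++ optimizer uses, and `may_depend` is a name made up here.

```cpp
#include <cassert>
#include <cstdlib>

// Greatest common divisor of |a| and |b| (Euclid's algorithm).
inline long gcd_abs(long a, long b) {
    a = std::labs(a);
    b = std::labs(b);
    while (b != 0) {
        long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// GCD test: a1*i + c1 == a2*i2 + c2 has an integer solution (i, i2)
// only if gcd(a1, a2) divides (c2 - c1). Loop bounds are ignored, so
// "true" means "a dependence is possible", not "a dependence exists".
bool may_depend(long a1, long c1, long a2, long c2) {
    const long g = gcd_abs(a1, a2);
    if (g == 0) return c1 == c2;   // both subscripts are constants
    assert(g > 0);
    return (c2 - c1) % g == 0;
}
```

For example, `may_depend(2, 0, 2, 1)` is false: A[2*i] and A[2*i+1] touch even and odd elements and never collide. `may_depend(1, 0, 1, -1)` is true: A[i] and A[i-1] can refer to the same element in different iterations.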
Legal vect+par? Complex C++. Not just arrays!
for (size_t j = 0; j < numBodies; j++) {
  D3DXVECTOR4 r;
  r.x = A[j].pos.x - pos.x;
  r.y = A[j].pos.y - pos.y;
  r.z = A[j].pos.z - pos.z;
  float distSqr = r.x*r.x + r.y*r.y + r.z*r.z;
  distSqr += softeningSquared;
  float invDist = 1.0f / sqrt(distSqr);
  float invDistCube = invDist * invDist * invDist;
  float s = fParticleMass * invDistCube;
  acc.x += r.x * s;
  acc.y += r.y * s;
  acc.z += r.z * s;
}
Hard! Compiler reports why it failed to vectorize or parallelize
cl /Qvec-report:2 /O2 t.cpp
cl /Qpar-report:2 /O2 t.cpp
Parallelism + vector
void foo() {
  #pragma loop(hint_parallel(4))
  for (int i=0; i<1000; i++)
    A[i] = B[i] + C[i];
}
Vectorized and parallel
void foo() {
  CompilerParForLib(0, 1000, 4, &foo$par1, A, B, C);
}
foo$par1(int T1, int T2, int *A, int *B, int *C) {
  for (int i=T1; i<T2; i+=4)
    movps xmm0, [ecx]
    movps xmm1, [eax]
    addps xmm0, xmm1
    movps [edx], xmm0
}
Runtime
• foo$par1(0, 249, A, B, C); core 1 instr
• foo$par1(250, 499, A, B, C); core 2 instr
• foo$par1(500, 749, A, B, C); core 3 instr
• foo$par1(750, 999, A, B, C); core 4 instr
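The outlining shown on this slide can be imitated by hand in standard C++. In the sketch below, `loop_chunk` plays the role of the outlined body (`foo$par1`) and `parallel_add` plays the role of the runtime call; both names, the half-open [begin, end) ranges, and the use of `std::thread` are assumptions of this example, not the compiler's actual runtime.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Outlined loop body: handles iterations in [begin, end).
// Within a chunk the loop is still a plain candidate for vectorization.
void loop_chunk(int begin, int end, int* A, const int* B, const int* C) {
    for (int i = begin; i < end; ++i)
        A[i] = B[i] + C[i];
}

// Splits [0, n) into contiguous chunks and runs each on its own thread.
void parallel_add(int n, int workers, int* A, const int* B, const int* C) {
    assert(n >= 0 && workers > 0);
    const int chunk = (n + workers - 1) / workers;   // ceiling division
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        const int begin = w * chunk;
        const int end = std::min(n, begin + chunk);
        if (begin >= end) break;                     // nothing left to hand out
        pool.emplace_back(loop_chunk, begin, end, A, B, C);
    }
    for (auto& t : pool) t.join();
}
```

With n = 1000 and four workers, the chunks are [0, 250), [250, 500), [500, 750), and [750, 1000), matching the four per-core calls listed on the slide.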
The Bigger Picture [Diagram: four scalar units (SCLR UNIT) and four vector units (VECT UNIT)]
Vector + parallel Demo • Dev10/Win7 - fully optimized: novec_concrt.avi • Dev11/Win8 – fully optimized: vec_omp.avi
Not your grandfather’s vectorizer for (k = 1; k <= M; k++) { mc[k] = mpp[k-1] + tpmm[k-1]; if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc; if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc; if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc; mc[k] += ms[k]; if (mc[k] < -INFTY) mc[k] = -INFTY; for (k = 1; k <= M; k++) { dc[k] = dc[k-1] + tpdd[k-1]; if((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; if(dc[k] < -INFTY) dc[k] = -INFTY; dc[k] = dc[k-1] + tpdd[k-1]; if((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; if(dc[k] < -INFTY) dc[k] = -INFTY; if(k < M) { ic[k] = mpp[k] + tpmi[k]; if((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if(ic[k] < -INFTY) ic[k] = -INFTY; } } for(k = 1; k <= M; k++) { if(k < M) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; } } for(k = 1; k < M; k++) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; }
Vector control flow algebra • Identify source code that looks like this (in a loop): if (X > Y) { Y = X; } • The vectorizer can then create: Y = MAX(X, Y)
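The rewrite from a branch to MAX can be checked in ordinary C++. The two functions below (names chosen for this sketch) should give identical results. `std::max(Y[i], X[i])` is written with Y first because it returns its first argument unless that argument is less than the second, which is exactly the condition in the branch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Original form: a data-dependent branch in every iteration.
void update_branchy(const std::vector<float>& X, std::vector<float>& Y) {
    assert(X.size() == Y.size());
    for (std::size_t i = 0; i < X.size(); ++i)
        if (X[i] > Y[i]) Y[i] = X[i];
}

// Rewritten form: straight-line code with no control flow in the body,
// which maps naturally onto a vector max instruction.
void update_max(const std::vector<float>& X, std::vector<float>& Y) {
    assert(X.size() == Y.size());
    for (std::size_t i = 0; i < X.size(); ++i)
        Y[i] = std::max(Y[i], X[i]);
}
```

Whether a given compiler actually emits a vector max for either form depends on the target and the flags; the sketch only shows that the two forms compute the same thing.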
Vector control flow: “pmax xmm1, xmm0” computes xmm1 = max(xmm1, xmm0)
Not your grandfather’s vectorizer
if ( __isa_available > SSE2 && NO_ALIASING ) {
  for (k = 1; k <= M; k++) {   // Vector loop
    mc[k] = mpp[k-1] + tpmm[k-1];
    mc[k] = MAX(ip[k-1] + tpim[k-1], mc[k]);
    mc[k] = MAX(dpp[k-1] + tpdm[k-1], mc[k]);
    mc[k] = MAX(xmb + bp[k], mc[k]);
    mc[k] = MAX(mc[k], -INFTY);
  }
  for (k = 1; k <= M; k++) {   // Scalar loop
    dc[k] = dc[k-1] + tpdd[k-1];
    dc[k] = MAX(mc[k-1] + tpmd[k-1], dc[k]);
    dc[k] = MAX(dc[k], -INFTY);
  }
  for (k = 1; k <= M; k++) {   // Vector loop
    ic[k] = mpp[k] + tpmi[k];
    ic[k] = MAX(ip[k] + tpii[k], ic[k]);
    ic[k] += is[k];
    ic[k] = MAX(ic[k], -INFTY);
  }
}
Vector Math Library 15x faster
Vectorization – targeting vector library
for (i=0; i<n; i++) {
  a[i] = a[i] + b[i];
  a[i] = sin(a[i]);
}
NEW Run-Time Library, HW SIMD instruction
for (i=0; i<n; i=i+VL) {
  a(i: i+VL-1) = a(i: i+VL-1) + b(i: i+VL-1);
  a(i: i+VL-1) = _svml_Sin(a(i: i+VL-1));
}
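The strip-mined loop on this slide can be written out in plain C++ to show its shape: a main loop that advances VL elements at a time, followed by a scalar loop for the leftover elements (the slide omits that remainder). `VL = 4`, `sin_lanes`, and `add_then_sin` are assumptions of this sketch; `sin_lanes` merely stands in for a vector math routine and is not the library's implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr std::size_t VL = 4;   // assumed vector length (4 floats)

// Stand-in for a vector math routine: applies sin to VL adjacent values.
inline void sin_lanes(float* v) {
    for (std::size_t lane = 0; lane < VL; ++lane)
        v[lane] = std::sin(v[lane]);
}

// a[i] = sin(a[i] + b[i]), processed VL elements per step.
void add_then_sin(std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::size_t i = 0;
    for (; i + VL <= n; i += VL) {              // full vector-width strips
        for (std::size_t lane = 0; lane < VL; ++lane)
            a[i + lane] += b[i + lane];
        sin_lanes(&a[i]);
    }
    for (; i < n; ++i)                          // scalar remainder
        a[i] = std::sin(a[i] + b[i]);
}
```

For an array of 10 elements the main loop handles indices 0 through 7 in two strips and the remainder loop handles indices 8 and 9.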
Parallel and vector – on by default
for (int i = 0; i < _countof(a); ++i) {
  float dp = 0.0f;
  for (int j = 0; j < _countof(a); ++j) {
    float fj = (float)j;
    dp += sin(fj) * exp(fj);
  }
  a[i] = dp;
}
Pragma
Foo (float *a, float *b, float *c) {
  #pragma loop(hint_parallel(N))
  for (auto i=0; i<N; i++) {
    *c++ = (*a++) * bar(b++);
  }
}
Use simple directives. Pointers and procedure calls with escaped pointers prevent analysis for auto-parallelization.
16x speedup – unmodified C++ • Scheduling • Static • Dynamic
…and compiler selects scheduler strategy
for (int l = top; l < bottom; l++) {
  for (int m = left; m < right; m++) {
    int y = *(blurredImage + (l*dimX) + m);
    ySourceRed += (unsigned int) (y & 0x00FF0000) >> 16;
    ySourceGreen += (unsigned int) (y & 0x0000ff00) >> 8;
    ySourceBlue += (unsigned int) (y & 0x000000FF);
    averageCount++;
  }
}
Software – “no magic bullet”
C++ PPL - CPU
parallel_for (0, 1000, 1, [&](int i) { A[i] = B[i] + C[i]; } );
C++ AMP - GPU
parallel_for_each ( e, [&] (index<2> idx) restrict(amp) { c[idx] = b[idx] + a[idx]; } );
copy(c, pC);
C++ vectorizer - CPU
for (int i=0; i<1000; i++) A[i] = B[i] + C[i];
Software
C++ PPL
parallel_for (0, 1000, 1, [&](int i) { } );
C++ AMP
parallel_for_each( e, [&] (index<2> idx) restrict(amp) { } );
copy(c, pC);
C++ vectorizer
for (int i=0; i<1000; i++)
Built with C++ • Windows 8 SQL Office • Mission critical correctness and compile time
PPL for C++ • Parallel Patterns Library
3 PPL constructs – simple but huge value
parallel_invoke(
  [&]{ quicksort(a, left, i-1, comp); },
  [&]{ quicksort(a, i+1, right, comp); } );
parallel_for (0, 100, 1, [&](int i) { /* … */ } );
vector<int> vec;
parallel_for_each (vec.begin(), vec.end(), [&](int& i) { /* ... */ });
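PPL itself ships with Visual C++, so the sketch below uses `std::async` from the standard library as a portable stand-in for the fork-join shape of `parallel_invoke`. It is not PPL and does not use PPL's scheduler. The partition scheme, the `depth` cutoff, and the absence of the slide's `comp` argument are choices made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <vector>

// Sorts a[left..right] (inclusive bounds, as on the slide).
// `depth` limits how many recursion levels fork a new task.
void quicksort(std::vector<int>& a, int left, int right, int depth = 3) {
    if (left >= right) return;
    assert(left >= 0 && right < static_cast<int>(a.size()));

    // Lomuto partition around the last element.
    const int pivot = a[right];
    int i = left;
    for (int j = left; j < right; ++j)
        if (a[j] < pivot) std::swap(a[i++], a[j]);
    std::swap(a[i], a[right]);

    if (depth > 0) {
        // Fork: sort the left half on another thread while this thread
        // sorts the right half. The halves never share an element.
        auto lhs = std::async(std::launch::async, [&a, left, i, depth] {
            quicksort(a, left, i - 1, depth - 1);
        });
        quicksort(a, i + 1, right, depth - 1);
        lhs.get();   // Join: wait for the left half.
    } else {
        quicksort(a, left, i - 1, 0);
        quicksort(a, i + 1, right, 0);
    }
}
```

A call such as `quicksort(v, 0, static_cast<int>(v.size()) - 1)` sorts the whole vector. The depth cutoff keeps the number of threads bounded; a task-based runtime like PPL handles that concern through its scheduler instead.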