
It’s all about performance: Using Visual C++ 2012 to maximize your hardware


Presentation Transcript


  1. It’s all about performance: Using Visual C++ 2012 to maximize your hardware. Jim Radigan – Dev Lead/Architect, C++ Optimizer. Don McCrady – Dev Lead, C++ AMP. Session 3-013.

  2. Mission: go under the covers, make folks aware of the massive hardware resources available, then show how C++ exploits them… by covering PPL, AMP, or “doing nothing”!

  3. 3,100,000 Transistors

  4. Ivy Bridge 1.4 Billion Transistors

  5. Tegra 3 – 5 cores / 128-bit vector instructions

  6. Going Native • You’ve been hearing about the native C++ renaissance • This is what it’s all about – exploiting the hardware

  7. Ivy Bridge C++  PPL  AMP

  8. Agenda • 1. Hardware • 2. C++ auto vec+par • 3. C++ PPL • 4. C++ AMP

  9. Hardware – Forms of Parallelism • Super Scalar • Vector • Vector + Parallel • SPMD

  10. Super Scalar

  11. Super Scalar – instruction-level parallelism • 20% of ILP resides within a basic block • 60% resides across two adjacent basic blocks • 20% is scattered throughout the rest of the code

  12. Super Scalar – needs speculative execution
    bar:
    140: r0 = 42
    141: r1 = r0 + r3
    142: M[r1] = r1
    143: zflag = r3 – r2
    144: jz foo
    145: …
         …
    188: …
    189: …
    foo:
    190: r4 = 0
    191: r5 = 0
    192: r6 = 0
    193: M[r1] = 0

  13. Super Scalar – path of certain execution
    foo: 190: r4 = 0
    bar: 140: r0 = 42
    141: r1 = r0 + r3
    142: M[r1] = r1
    143: zflag = r3 – r2
    144: jz foo
    191: r5 = 0
    192: r6 = 0
    193: M[r1] = 0
    When there are NO branches between a micro-op and its retirement to the visible architectural state, it is no longer speculative.

  14. Super Scalar – enables C++ vectorization
    VOID FOO(int *A, int *B, int *C) {
      IF ( (_ISA_AVAILABLE == 2)        … SSE 4.2?
        && (&A[1000] < &B[0])           … pointer overlap
        && (&A[1000] < &C[0]) ) {
          … FAST VECTOR/PARALLEL LOOP …
      } ELSE {
          … SEQUENTIAL LOOP …
      }
    }
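A minimal C++ sketch of the guarded dispatch that slide is describing, with the compiler-generated checks written out by hand as an assumption: the ISA test uses __cpuid, the overlap test compares array extents, and both loop bodies stand in for the fast and slow versions the compiler actually emits.

    #include <intrin.h>   // MSVC: __cpuid

    static bool sse42_available() {
        int info[4];
        __cpuid(info, 1);
        return (info[2] & (1 << 20)) != 0;       // ECX bit 20 => SSE4.2
    }

    // Hand-written version of the guard: take the fast path only when the
    // ISA is present and the destination cannot overlap either source.
    void foo(int *A, int *B, int *C, int n) {
        bool no_overlap = (A + n <= B || B + n <= A) &&
                          (A + n <= C || C + n <= A);
        if (sse42_available() && no_overlap) {
            for (int i = 0; i < n; ++i)          // fast vector/parallel loop
                A[i] = B[i] + C[i];              // (the copy the compiler vectorizes)
        } else {
            for (int i = 0; i < n; ++i)          // sequential fallback
                A[i] = B[i] + C[i];
        }
    }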

  15. Vector: “addps xmm1, xmm0” – a packed add; each of the four lanes of xmm0 is added to the corresponding lane of xmm1.

  16. Vector – CPU • SCALAR (1 operation): add r3, r1, r2 • VECTOR (N operations): vadd v3, v1, v2, which processes “vector length” elements at once

  17. CPU Vector + Parallel - 4 Cores x vectors

  18. Arrays of Parallel Threads – SPMD threadID: 0 1 2 3 4 5 6 7 … float x = input[threadID]; float y = func(x); output[threadID] = y; … • All threads run the same code (SPMD) • Each thread has an ID that it uses to compute memory addresses and make control decisions
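A minimal sketch of that SPMD pattern on the CPU, using std::thread purely for illustration (a GPU runs the same shape with thousands of much lighter threads); func and the array size here are invented for the example.

    #include <cmath>
    #include <thread>
    #include <vector>

    // Stand-in for the per-element work each thread performs.
    static float func(float x) { return std::sin(x) * x; }

    int main() {
        const int N = 8;
        std::vector<float> input(N, 1.0f), output(N);

        std::vector<std::thread> threads;
        for (int threadID = 0; threadID < N; ++threadID) {
            threads.emplace_back([&, threadID] {
                float x = input[threadID];   // the ID picks the memory address
                float y = func(x);
                output[threadID] = y;        // every thread runs the same code
            });
        }
        for (auto& t : threads) t.join();
    }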

  19. Agenda • 1. Hardware • 2. C++ auto vec+par • 3. C++ PPL • 4. C++ AMP

  20. C++ Vectorizer – VS2012 compiler: Super Scalar, Vector, Vector + Parallel

  21. Simple vector add loop
    for (i = 0; i < 1000; i++) A[i] = B[i] + C[i];
    becomes
    for (i = 0; i < 1000/4; i++) {
        movps xmm0, [ecx]
        movps xmm1, [eax]
        addps xmm0, xmm1
        movps [edx], xmm0
    }
    The compiler looks across loop iterations!

  22. Compiler or “do it yourself” C++
    void add(float* A, float* B, float* C, int size) {
        for (int i = 0; i < size/4; ++i) {
            p_v1 = _mm_loadu_ps(A + 4*i);
            p_v2 = _mm_loadu_ps(B + 4*i);
            res  = _mm_add_ps(p_v1, p_v2);
            _mm_storeu_ps(C + 4*i, res);
            …
    C++ or Klingon?
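A self-contained, compilable version of that “do it yourself” loop, assuming size need not be a multiple of 4, so a scalar remainder loop is added:

    #include <xmmintrin.h>   // SSE intrinsics

    void add(const float* A, const float* B, float* C, int size) {
        int i = 0;
        for (; i + 4 <= size; i += 4) {
            __m128 va = _mm_loadu_ps(A + i);   // load 4 floats from A
            __m128 vb = _mm_loadu_ps(B + i);   // load 4 floats from B
            __m128 vc = _mm_add_ps(va, vb);    // 4 adds in one instruction
            _mm_storeu_ps(C + i, vc);          // store 4 results
        }
        for (; i < size; ++i)                  // scalar remainder
            C[i] = A[i] + B[i];
    }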

  23. Vector – all loads before all stores: “addps xmm1, xmm0” reads both source registers before the result is written back into xmm1.

  24. Legal to vectorize?
    FOR (j = 2; j <= 5; j++)  A(j) = A(j-1) + A(j+1)
    Not equal!!
    A(2:5) = A(1:4) + A(3:6)
    A(3) = ?

  25. Vector Semantics • ALL loads before ALL stores
    A(2:5) = A(1:4) + A(3:6)
    VR1 = LOAD(A(1:4))
    VR2 = LOAD(A(3:6))
    VR3 = VR1 + VR2        // A(3) = F(A(2), A(4))
    STORE(A(2:5)) = VR3

  26. Vector Semantics • Instead, the scalar loop does load, store, load, store, …
    FOR (j = 2; j <= 257; j++)  A(j) = A(j-1) + A(j+1)
    A(2) = A(1) + A(3)
    A(3) = A(2) + A(4)     // A(3) = F(A(1), A(2), A(3), A(4))
    A(4) = A(3) + A(5)
    A(5) = A(4) + A(6)
    …
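A small worked check of slides 24–26, assuming float data and the 4-element slice above: the scalar loop feeds each freshly written A(j-1) into the next iteration, while “all loads before all stores” semantics reads only the original values, so A(3) comes out differently.

    #include <cstdio>

    int main() {
        float scalar[8] = {0, 1, 2, 3, 4, 5, 6, 7};
        float vec[8]    = {0, 1, 2, 3, 4, 5, 6, 7};

        // Scalar semantics: A(j) = A(j-1) + A(j+1), one element at a time.
        for (int j = 2; j <= 5; ++j)
            scalar[j] = scalar[j - 1] + scalar[j + 1];

        // Vector semantics: load A(1:4) and A(3:6) first, then store A(2:5).
        float lhs[4], rhs[4];
        for (int k = 0; k < 4; ++k) { lhs[k] = vec[1 + k]; rhs[k] = vec[3 + k]; }
        for (int k = 0; k < 4; ++k) vec[2 + k] = lhs[k] + rhs[k];

        // scalar A(3) uses the freshly written A(2); vector A(3) uses the old one.
        printf("scalar A(3) = %g, vector A(3) = %g\n", scalar[3], vec[3]);
    }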

  27. Doubled the optimizer – dependence analysis has to decide whether A(a1 * I + c1) and A(a2 * I’ + c2) can ever refer to the same element.

  28. Legal to vectorize + parallelize? Complex C++, not just arrays!
    for (size_t j = 0; j < numBodies; j++) {
        D3DXVECTOR4 r;
        r.x = A[j].pos.x - pos.x;
        r.y = A[j].pos.y - pos.y;
        r.z = A[j].pos.z - pos.z;
        float distSqr = r.x*r.x + r.y*r.y + r.z*r.z;
        distSqr += softeningSquared;
        float invDist = 1.0f / sqrt(distSqr);
        float invDistCube = invDist * invDist * invDist;
        float s = fParticleMass * invDistCube;
        acc.x += r.x * s;
        acc.y += r.y * s;
        acc.z += r.z * s;
    }

  29. Hard! The compiler reports why it failed to vectorize or parallelize:
    cl /Qvect-report:2 /O2 t.cpp
    cl /Qpar-report:2 /O2 t.cpp
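A small t.cpp to try those switches on, written for this transcript rather than taken from the talk: the first loop is a straightforward vectorization candidate, while the second has a data-dependent exit and an opaque call, which typically show up in the reports as reasons a loop was not vectorized or parallelized.

    // t.cpp – compile with:  cl /c /O2 /Qpar /Qvect-report:2 /Qpar-report:2 t.cpp
    void scale(float* a, const float* b, int n) {
        for (int i = 0; i < n; ++i)
            a[i] = 2.0f * b[i];             // simple candidate loop
    }

    bool keep_going();                      // defined elsewhere; opaque to the optimizer

    void guarded_fill(float* a, int n) {
        for (int i = 0; i < n; ++i) {
            if (!keep_going()) break;       // data-dependent exit
            a[i] = 0.0f;
        }
    }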

  30. Vector + Parallel “4 Cores x vector of 4 ops”

  31. Parallelism + vector
    Source:
    void foo() {
        #pragma loop(hint_parallel(4))
        for (int i = 0; i < 1000; i++)
            A[i] = B[i] + C[i];
    }
    Compiler-generated, vectorized and parallel:
    void foo() {
        CompilerParForLib(0, 1000, 4, &foo$par1, A, B, C);
    }
    foo$par1(int T1, int T2, int *A, int *B, int *C) {
        for (int i = T1; i < T2; i += 4)
            movps xmm0, [ecx]
            movps xmm1, [eax]
            addps xmm0, xmm1
            movps [edx], xmm0
    }
    Runtime calls:
    foo$par1(0, 249, A, B, C);     // core 1
    foo$par1(250, 499, A, B, C);   // core 2
    foo$par1(500, 749, A, B, C);   // core 3
    foo$par1(750, 999, A, B, C);   // core 4

  32. The Bigger Picture – multiple cores, each pairing a scalar unit (SCLR UNIT) with a vector unit (VECT UNIT)

  33. Vector + parallel demo • Dev10/Win7, fully optimized: novec_concrt.avi • Dev11/Win8, fully optimized: vec_omp.avi

  34. Not your grandfather’s vectorizer
    for (k = 1; k <= M; k++) {
        mc[k] = mpp[k-1] + tpmm[k-1];
        if ((sc = ip[k-1]  + tpim[k-1]) > mc[k]) mc[k] = sc;
        if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
        if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
        mc[k] += ms[k];
        if (mc[k] < -INFTY) mc[k] = -INFTY;
    }
    for (k = 1; k <= M; k++) {
        dc[k] = dc[k-1] + tpdd[k-1];
        if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
        if (dc[k] < -INFTY) dc[k] = -INFTY;
    }
    for (k = 1; k < M; k++) {
        ic[k] = mpp[k] + tpmi[k];
        if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
        ic[k] += is[k];
        if (ic[k] < -INFTY) ic[k] = -INFTY;
    }

  35. Vector control flow algebra • Identify source code that looks like this (in a loop): if (X > Y) { Y = X; } • The vectorizer can then create: Y = MAX(X, Y)

  36. Vector control flow: “pmax xmm1, xmm0” – a packed MAX of the lanes of xmm0 and xmm1, written into xmm1.

  37. Not your grandfather’s vectorizer
    if (__isa_available > SSE2 && NO_ALIASING) {
        for (k = 1; k <= M; k++) {        // vector loop
            mc[k] = mpp[k-1] + tpmm[k-1];
            mc[k] = MAX(ip[k-1]  + tpim[k-1], mc[k]);
            mc[k] = MAX(dpp[k-1] + tpdm[k-1], mc[k]);
            mc[k] = MAX(xmb + bp[k], mc[k]);
            mc[k] = MAX(mc[k], -INFTY);
        }
        for (k = 1; k <= M; k++) {        // scalar loop (dc[k] depends on dc[k-1])
            dc[k] = dc[k-1] + tpdd[k-1];
            dc[k] = MAX(mc[k-1] + tpmd[k-1], dc[k]);
            dc[k] = MAX(dc[k], -INFTY);
        }
        for (k = 1; k <= M; k++) {        // vector loop
            ic[k] = mpp[k] + tpmi[k];
            ic[k] = MAX(ip[k] + tpii[k], ic[k]);
            ic[k] += is[k];
            ic[k] = MAX(ic[k], -INFTY);
        }
    }
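A sketch of that rewrite in user-level C++ (the compiler performs it internally): the branchy form, the MAX form, and a hand-vectorized form using _mm_max_ps, the single-precision counterpart of the slide’s pmax. For brevity n is assumed to be a multiple of 4 in the SSE version.

    #include <algorithm>
    #include <xmmintrin.h>

    void clamp_max_branchy(float* Y, const float* X, int n) {
        for (int i = 0; i < n; ++i)
            if (X[i] > Y[i]) Y[i] = X[i];            // branchy form
    }

    void clamp_max_rewritten(float* Y, const float* X, int n) {
        for (int i = 0; i < n; ++i)
            Y[i] = std::max(X[i], Y[i]);             // Y = MAX(X, Y), branch-free
    }

    void clamp_max_sse(float* Y, const float* X, int n) {   // n % 4 == 0 assumed
        for (int i = 0; i < n; i += 4) {
            __m128 x = _mm_loadu_ps(X + i);
            __m128 y = _mm_loadu_ps(Y + i);
            _mm_storeu_ps(Y + i, _mm_max_ps(x, y));  // 4-wide MAX in one op
        }
    }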

  38. Vector math library – 15x faster

  39. Vectorization – targeting the vector math library
    for (i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
        a[i] = sin(a[i]);
    }
    becomes, using the new run-time library’s HW SIMD instructions:
    for (i = 0; i < n; i = i + VL) {
        a(i : i+VL-1) = a(i : i+VL-1) + b(i : i+VL-1);
        a(i : i+VL-1) = _svml_Sin(a(i : i+VL-1));
    }

  40. Parallel and vector – on by default
    for (int i = 0; i < _countof(a); ++i) {
        float dp = 0.0f;
        for (int j = 0; j < _countof(a); ++j) {
            float fj = (float)j;
            dp += sin(fj) * exp(fj);
        }
        a[i] = dp;
    }

  41. Pragma
    void Foo(float *a, float *b, float *c) {
        #pragma loop(hint_parallel(N))
        for (auto i = 0; i < N; i++) {
            *c++ = (*a++) * bar(b++);
        }
    }
    Use simple directives – pointers and procedure calls with escaped pointers prevent the analysis needed for auto-parallelization.
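A compilable sketch of both shapes, with bar() as the slide’s opaque callee (declared but not defined here) and a constant thread count because hint_parallel takes a literal; the second function is an index-based rewrite with no escaping pointers and no opaque call, which gives the automatic analysis a much better chance even without the pragma.

    float bar(float* p);                          // opaque callee from the slide

    // Shape from the slide: pointers are bumped and b escapes into bar(),
    // so the pragma is used to assert the iterations are independent.
    // Build with /O2 /Qpar for the auto-parallelizer to act on the hint.
    void Foo(float* a, float* b, float* c, int N) {
    #pragma loop(hint_parallel(4))
        for (int i = 0; i < N; i++)
            *c++ = (*a++) * bar(b++);
    }

    // Index-based rewrite with the work inlined: no escaped pointers, no
    // pointer bumping, no opaque call left inside the loop.
    void Foo_indexed(float* a, float* b, float* c, int N) {
        for (int i = 0; i < N; i++)
            c[i] = a[i] * (b[i] * 2.0f + 1.0f);   // stand-in for bar()'s work
    }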

  42. 16x speedup – unmodified C++ • Scheduling • Static • Dynamic

  43. …and the compiler selects the scheduling strategy
    for (int l = top; l < bottom; l++) {
        for (int m = left; m < right; m++) {
            int y = *(blurredImage + (l*dimX) + m);
            ySourceRed   += (unsigned int)(y & 0x00FF0000) >> 16;
            ySourceGreen += (unsigned int)(y & 0x0000ff00) >> 8;
            ySourceBlue  += (unsigned int)(y & 0x000000FF);
            averageCount++;
        }
    }

  44. Software – “no magic bullet”
    C++ vectorizer (CPU):
    for (int i = 0; i < 1000; i++) A[i] = B[i] + C[i];
    C++ PPL (CPU):
    parallel_for(0, 1000, 1, [&](int i) { A[i] = B[i] + C[i]; });
    C++ AMP (GPU):
    parallel_for_each(e, [=](index<2> idx) restrict(amp) { c[idx] = b[idx] + a[idx]; });
    copy(c, pC);
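A self-contained C++ AMP version of the vector add, written as a 1-D example for this transcript (the slide’s e, a, b, c and pC are 2-D); the sizes and names are made up.

    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    int main() {
        const int n = 1024;
        std::vector<float> pA(n, 1.0f), pB(n, 2.0f), pC(n);

        array_view<const float, 1> a(n, pA);
        array_view<const float, 1> b(n, pB);
        array_view<float, 1> c(n, pC);
        c.discard_data();                          // no need to copy pC to the GPU

        parallel_for_each(c.extent, [=](index<1> idx) restrict(amp) {
            c[idx] = a[idx] + b[idx];              // runs once per element, SPMD-style
        });

        c.synchronize();                           // copy results back into pC
    }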

  45. Software
    C++ vectorizer: for (int i = 0; i < 1000; i++) …
    C++ PPL: parallel_for(0, 1000, 1, [&](int i) { … });
    C++ AMP: parallel_for_each(e, [=](index<2> idx) restrict(amp) { … }); copy(c, pC);

  46. Built with C++ • Windows 8, SQL, Office • Mission-critical correctness and compile time

  47. PPL for C++ • Parallel Patterns Library

  48. 3 PPL constructs – simple but huge value
    parallel_invoke(
        [&]{ quicksort(a, left, i-1, comp); },
        [&]{ quicksort(a, i+1, right, comp); });
    parallel_for(0, 100, 1, [&](int i) { /* … */ });
    vector<int> vec;
    parallel_for_each(vec.begin(), vec.end(), [&](int& i) { /* ... */ });
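A minimal, self-contained program exercising all three constructs; the data and the work inside each lambda are invented for the example.

    #include <ppl.h>
    #include <vector>
    using namespace concurrency;

    int main() {
        std::vector<int> vec(1000, 1);
        std::vector<int> A(1000), B(1000, 2), C(1000, 3);

        // parallel_invoke: run independent chunks of work concurrently.
        parallel_invoke(
            [&] { for (int i = 0;   i < 500;  ++i) A[i] = B[i] + C[i]; },
            [&] { for (int i = 500; i < 1000; ++i) A[i] = B[i] + C[i]; });

        // parallel_for: parallel loop over an index range with a step of 1.
        parallel_for(0, 1000, 1, [&](int i) { A[i] *= 2; });

        // parallel_for_each: parallel loop over an iterator range.
        parallel_for_each(vec.begin(), vec.end(), [](int& x) { x += 1; });
    }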
