
GPU computing with C++ AMP


Presentation Transcript


  1. GPU computing with C++ AMP Kenneth Domino Domem Technologies, Inc. http://domemtech.com October 24, 2011 NB: This presentation is based on Daniel Moth’s presentation at http://channel9.msdn.com/Events/BUILD/BUILD2011/TOOL-802T

  2. CPUs vs GPUs today CPU • Low memory bandwidth • Higher power consumption • Medium level of parallelism • Deep execution pipelines • Random accesses • Supports general code • Mainstream programming

  3. CPUs vs GPUs today GPU • High memory bandwidth • Lower power consumption • High level of parallelism • Shallow execution pipelines • Sequential accesses • Supports data-parallel code • Niche programming

  4. Tomorrow… • CPUs and GPUs coming closer together… • …nothing settled in this space, things still in motion… • C++ AMP is designed as a mainstream solution not only for today, but also for tomorrow image source: AMD

  5. C++ AMP • Part of Visual C++ • Visual Studio integration • STL-like library for multidimensional data • Builds on Direct3D • (performance, productivity, portability)

  6. C++ AMP vs. CUDA vs. OpenCL
  • C++ AMP – Wow, classes! Lambda functions (1936 Church – back to the future)
  • OpenCL – stone knives and bearskins (C99!! YUCK!)

  7. Hello World: Array Addition How do we take the serial code below that runs on the CPU and convert it to run on an accelerator like the GPU?

  void AddArrays(int n, int * pA, int * pB, int * pC)
  {
      for (int i = 0; i < n; i++) {
          pC[i] = pA[i] + pB[i];
      }
  }

  8. Hello World: Array Addition

  Serial (CPU):

  void AddArrays(int n, int * pA, int * pB, int * pC)
  {
      for (int i = 0; i < n; i++) {
          pC[i] = pA[i] + pB[i];
      }
  }

  C++ AMP (accelerator):

  #include <amp.h>
  using namespace concurrency;

  void AddArrays(int n, int * pA, int * pB, int * pC)
  {
      array_view<int,1> a(n, pA);
      array_view<int,1> b(n, pB);
      array_view<int,1> sum(n, pC);
      parallel_for_each( sum.grid, [=](index<1> i) restrict(direct3d) {
          sum[i] = a[i] + b[i];
      } );
  }

  9. Basic Elements of C++ AMP coding

  void AddArrays(int n, int * pA, int * pB, int * pC)
  {
      array_view<int,1> a(n, pA);
      array_view<int,1> b(n, pB);
      array_view<int,1> sum(n, pC);
      parallel_for_each( sum.grid, [=](index<1> idx) restrict(direct3d) {
          sum[idx] = a[idx] + b[idx];
      } );
  }

  • restrict(direct3d): tells the compiler to check that this code can execute on Direct3D hardware (aka an accelerator)
  • parallel_for_each: executes the lambda on the accelerator, once per thread
  • grid: the number and shape of threads to execute the lambda
  • array_view: wraps the data to operate on the accelerator; array_view variables captured by the lambda have their associated data copied to the accelerator (on demand)
  • index: the thread ID that is running the lambda, used to index into data
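
  Putting the elements above together, here is a minimal, complete program in the Developer Preview syntax this talk uses (at RTM, restrict(direct3d) became restrict(amp) and grid was folded into extent); the array contents are illustrative:

  #include <amp.h>
  #include <iostream>
  using namespace concurrency;

  int main()
  {
      const int n = 16;
      int a[n], b[n], c[n];
      for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

      array_view<const int,1> av(n, a);   // const: input only, never copied back
      array_view<const int,1> bv(n, b);
      array_view<int,1> cv(n, c);

      parallel_for_each(cv.grid, [=](index<1> idx) restrict(direct3d) {
          cv[idx] = av[idx] + bv[idx];
      });

      cv.synchronize();                   // copy results back into c
      std::cout << c[n - 1] << std::endl; // prints 45
      return 0;
  }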

  10. grid<N>, extent<N>, and index<N> • index<N> • represents an N-dimensional point • extent<N> • number of units in each dimension of an N-dimensional space • grid<N> • origin (index<N>) plus extent<N> • N can be any number

  11. Examples: grid, extent, and index

  index<1> i(2);     index<2> i(0,2);    index<3> i(2,0,1);
  extent<1> e(6);    extent<2> e(3,4);   extent<3> e(3,2,2);
  grid<1> g(e);      grid<2> g(e);       grid<3> g(e);

  12. array<T,N> • Multi-dimensional array of rank N with element type T • Storage lives on the accelerator

  vector<int> v(96);
  extent<2> e(8,12);                      // e[0] == 8; e[1] == 12;
  array<int,2> a(e, v.begin(), v.end());

  // in the body of my lambda
  index<2> i(3,9);                        // i[0] == 3; i[1] == 9;
  int o = a[i];  // or a[i] = 16;
  //int o = a(i[0], i[1]);
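
  Because the storage lives on the accelerator, results must be copied back explicitly; a minimal sketch, assuming the concurrency::copy overload from an array to a host iterator:

  #include <amp.h>
  #include <vector>
  using namespace concurrency;

  void DoubleAll()
  {
      std::vector<int> v(96, 1);
      extent<2> e(8, 12);
      array<int,2> a(e, v.begin(), v.end());  // data copied in at construction

      parallel_for_each(a.grid, [&a](index<2> idx) restrict(direct3d) {
          a[idx] *= 2;                        // arrays are captured by reference
      });

      copy(a, v.begin());                     // explicit copy back to the host
  }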

  13. array_view<T,N> • View on existing data on the CPU or GPU • array_view<T,N> • array_view<const T,N>

  vector<int> v(10);
  extent<2> e(2,5);
  array_view<int,2> a(e, v);
  // the above two lines can also be written
  // array_view<int,2> a(2,5,v);
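
  A minimal sketch of writing through a view; the writes reach the vector when synchronize() is called (or when the view is destroyed):

  #include <amp.h>
  #include <vector>
  using namespace concurrency;

  void FillThroughView()
  {
      std::vector<int> v(10);
      array_view<int,2> a(2, 5, v);          // 2x5 view over flat host data

      parallel_for_each(a.grid, [=](index<2> idx) restrict(direct3d) {
          a[idx] = idx[0] * 5 + idx[1];      // row-major linear position
      });

      a.synchronize();                       // v now holds 0..9
  }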

  14. Data Classes Comparison

  array<T,N>:
  • Rank at compile time
  • Extent at runtime
  • Rectangular
  • Dense
  • Container for data
  • Explicit copy
  • Capture by reference [&]

  array_view<T,N>:
  • Rank at compile time
  • Extent at runtime
  • Rectangular
  • Dense in one dimension
  • Wrapper for data
  • Implicit copy
  • Capture by value [=]

  KED Note: array_view seems faster than array<>. Could it be because of the on-demand copy feature of array_view?

  15. parallel_for_each • Executes the lambda for each point in the extent • As-if synchronous in terms of visible side-effects

  parallel_for_each(
      g,  // g is of type grid<N>
      [ ](index<N> idx) restrict(direct3d) {
          // kernel code
      }
  );

  16. restrict(…) • Applies to functions (including lambdas) • Why restrict • Target-specific language restrictions • Optimizations or special code-gen behavior • Future-proofing • Functions can have multiple restrictions • In 1st release we are implementing direct3d and cpu • cpu – the implicit default

  17. restrict(direct3d) restrictions • Can only call other restrict(direct3d) functions • All functions must be inlinable • Only direct3d-supported types • int, unsigned int, float, double • structs & arrays of these types • Pointers and References • Lambdas cannot capture by reference¹, nor capture pointers • References and single-indirection pointers supported only as local variables and function arguments

  18. restrict(direct3d) restrictions • No • recursion • 'volatile' • virtual functions • pointers to functions • pointers to member functions • pointers in structs • pointers to pointers • No • goto or labeled statements • throw, try, catch • globals or statics • dynamic_cast or typeid • asm declarations • varargs • unsupported types • e.g. char, short, long double

  19. Example: restrict overloading

  double bar( double ) restrict(cpu,direct3d); // 1: same code for both
  double cos( double );                        // 2a: general code
  double cos( double ) restrict(direct3d);     // 2b: specific code

  void SomeMethod(array<double> c)
  {
      parallel_for_each( c.grid, [=](index<1> idx) restrict(direct3d) {
          //…
          double d1 = bar(c[idx]); // ok
          double d2 = cos(c[idx]); // ok, chooses direct3d overload
          //…
      });
  }

  20. accelerator, accelerator_view
  • accelerator
    • e.g. DX11 GPU, REF
    • e.g. CPU
  • accelerator_view
    • a context for scheduling and memory management
  • Data transfers
    • between accelerator and host
    • could be optimized away for integrated memory architecture

  [Diagram: host CPUs and system memory connected over PCIe to accelerator GPUs, each with its own GPU memory]

  21. Example: accelerator

  // Identify an accelerator based on Windows device ID
  accelerator myAcc("PCI\\VEN_1002&DEV_9591&CC_0300");
  // …or enumerate all accelerators (not shown)

  // Allocate an array on my accelerator
  array<int> myArray(10, myAcc.default_view);

  // …or launch a kernel on my accelerator
  parallel_for_each(myAcc.default_view, myArrayView.grid, ...);
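
  The enumeration elided above could look like the following sketch; accelerator::get_all(), description, and is_emulated are the RTM spellings of this query, which may differ in the Developer Preview the talk uses:

  #include <amp.h>
  #include <iostream>
  #include <vector>
  using namespace concurrency;

  void ListAccelerators()
  {
      std::vector<accelerator> accs = accelerator::get_all();
      for (const accelerator& acc : accs) {
          std::wcout << acc.description
                     << (acc.is_emulated ? L" (emulated)" : L"")
                     << std::endl;
      }
  }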

  22. C++ AMP at a Glance (so far) • restrict(direct3d, cpu) • parallel_for_each • class array<T,N> • class array_view<T,N> • class index<N> • class extent<N>, grid<N> • class accelerator • class accelerator_view

  23. Achieving maximum performance gains
  • Schedule threads in tiles
    • Avoid thread index remapping
    • Gain ability to use tile static memory
  • parallel_for_each overload for tiles accepts
    • tiled_grid<D0>, tiled_grid<D0, D1>, or tiled_grid<D0, D1, D2>
    • a lambda which accepts
    • tiled_index<D0>, tiled_index<D0, D1>, or tiled_index<D0, D1, D2>

  extent<2> e(8,6);
  grid<2> g(e);
  g.tile<2,2>()
  g.tile<4,3>()
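
  A sketch of the 8x6 example above; note that both tile shapes shown divide the grid exactly, which is required (see slide 50):

  #include <amp.h>
  using namespace concurrency;

  void TileDemo()
  {
      extent<2> e(8, 6);
      grid<2> g(e);
      // 2x2 tiles: (8/2) x (6/2) = 12 tiles of 4 threads each
      parallel_for_each(g.tile<2,2>(),
          [=](tiled_index<2,2> t_idx) restrict(direct3d) {
              // t_idx.global, t_idx.local, t_idx.tile, t_idx.tile_origin
          });
  }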

  24. tiled_grid, tiled_index
  • Given

  array_view<int,2> data(8, 6, p_my_data);
  parallel_for_each(
      data.grid.tile<2,2>(),
      [=] (tiled_index<2,2> t_idx) … { … });

  • When the lambda is executed by the thread at global index (6,3):
    • t_idx.global = index<2>(6,3)
    • t_idx.local = index<2>(0,1)
    • t_idx.tile = index<2>(3,1)
    • t_idx.tile_origin = index<2>(6,2)

  25. tile_static, tile_barrier • Within the tiled parallel_for_each lambda we can use • tile_static storage class for local variables • indicates that the variable is allocated in fast cache memory • i.e. shared by each thread in a tile of threads • only applicable in restrict(direct3d) functions • class tile_barrier • synchronize all threads within a tile • e.g. t_idx.barrier.wait();
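
  A minimal sketch combining tile_static and tile_barrier: a per-tile sum reduction. TiledSum, p_in, and p_out are hypothetical names, and n is assumed to be a multiple of TS:

  #include <amp.h>
  using namespace concurrency;

  void TiledSum(int n, const int * p_in, int * p_out)
  {
      static const int TS = 64;
      array_view<const int,1> in(n, p_in);
      array_view<int,1> out(n / TS, p_out);     // one partial sum per tile

      parallel_for_each(in.grid.tile<TS>(),
          [=](tiled_index<TS> t_idx) restrict(direct3d) {
              tile_static int shared[TS];       // fast per-tile storage
              int l = t_idx.local[0];
              shared[l] = in[t_idx.global];
              t_idx.barrier.wait();             // all loads visible tile-wide
              for (int stride = TS / 2; stride > 0; stride /= 2) {
                  if (l < stride)
                      shared[l] += shared[l + stride];
                  t_idx.barrier.wait();         // every thread reaches this
              }
              if (l == 0)
                  out[t_idx.tile] = shared[0];  // tile index names the slot
          });
      out.synchronize();
  }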

  26. C++ AMP at a Glance • restrict(direct3d, cpu) • parallel_for_each • class array<T,N> • class array_view<T,N> • class index<N> • class extent<N>, grid<N> • class accelerator • class accelerator_view • tile_static storage class • class tiled_grid< , , > • class tiled_index< , , > • class tile_barrier

  27. EXAMPLES!!! • The beloved, ubiquitous matrix multiplication • Bitonic sort

  28. Example: Matrix Multiplication

  29. Example: Matrix Multiplication

  Serial (CPU):

  void Multiply_Serial(Matrix * C, Matrix * A, Matrix * B)
  {
      int wA = A->cols; int hA = A->rows;
      int wB = B->cols; int hB = B->rows;
      int wC = C->cols; int hC = C->rows;
      for (int gr = 0; gr < hA; ++gr)       // row
          for (int gc = 0; gc < wB; ++gc) { // col
              float sum = 0;
              for (int k = 0; k < hB; ++k)
                  sum += Data(A)[gr * wA + k] * Data(B)[k * wB + gc];
              Data(C)[gr * wC + gc] = sum;
          }
  }

  C++ AMP (simple):

  void MultiplySimple(Matrix * C, Matrix * A, Matrix * B)
  {
      int wA = A->cols; int hA = A->rows;
      int wB = B->cols; int hB = B->rows;
      int wC = C->cols; int hC = C->rows;
      array_view<const float,1> a(hA * wA, Data(A));
      array_view<const float,1> b(hB * wB, Data(B));
      array_view<writeonly<float>,1> c(hC * wC, Data(C));
      extent<2> e(hC, wC);
      grid<2> g(e);
      parallel_for_each(g, [=](index<2> idx) restrict(direct3d) {
          int gr = idx[0]; int gc = idx[1];
          float sum = 0.0f;
          for (int k = 0; k < hB; k++)
              sum += a[gr * wA + k] * b[k * wB + gc];
          c[gr * wC + gc] = sum;
      });
  }
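
  The Matrix type and Data(...) accessor are not shown on the slides; the following hypothetical definitions and driver are one shape they are consistent with, using the dimensions from the performance slide:

  // Hypothetical host-side definitions matching the usage above.
  struct Matrix { int rows, cols; float * data; };
  #define Data(m) ((m)->data)

  void RunMultiply()
  {
      // A (480 x 640) x B (640 x 960) = C (480 x 960)
      Matrix A = { 480, 640, new float[480 * 640] };
      Matrix B = { 640, 960, new float[640 * 960] };
      Matrix C = { 480, 960, new float[480 * 960] };
      // … fill A and B …
      MultiplySimple(&C, &A, &B);
  }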

  30. Example: Matrix Multiplication

  31. Example: Matrix Multiplication (tiled, shared memory)

  void MultiplyTiled(Matrix * C, Matrix * A, Matrix * B)
  {
      int wA = A->cols; int hA = A->rows;
      int wB = B->cols; int hB = B->rows;
      int wC = C->cols; int hC = C->rows;
      array_view<const float,1> a(hA * wA, Data(A));
      array_view<const float,1> b(hB * wB, Data(B));
      array_view<writeonly<float>,1> c(hC * wC, Data(C));
      extent<2> e(hC, wC);
      grid<2> g(e);
      const int TS = 16;
      parallel_for_each(g.tile<TS,TS>(), [=](tiled_index<TS,TS> idx) restrict(direct3d) {
          int lr = idx.local[0];  int lc = idx.local[1];
          int gr = idx.global[0]; int gc = idx.global[1];
          float sum = 0.0f;
          for (int i = 0; i < hB; i += TS) {
              tile_static float locA[TS][TS], locB[TS][TS];
              locA[lr][lc] = a[gr * wA + lc + i];
              locB[lr][lc] = b[(lr + i) * wB + gc];
              idx.barrier.wait();
              for (int k = 0; k < TS; k++)
                  sum += locA[lr][k] * locB[k][lc];
              idx.barrier.wait();
          }
          c[gr * wC + gc] = sum;
      });
  }

  32. Performance

  C++ AMP performs as well as CUDA or OpenCL for matrix multiplication. Random matrices A (480 x 640) x B (640 x 960) = C (480 x 960), single-precision floats. Environment: NVIDIA GeForce GTX 470, an ATI Radeon HD 6450 (not used), and an Intel Q6600 @ 2.51 GHz (overclocked), 4 GB RAM, running Windows 7 64-bit. 10 runs, first run thrown out (for a total of 9). Average ± S.E. of sample. NB: this comparison pits a Developer Preview (C++ AMP) against RTM releases (CUDA, OpenCL).

  33.–43. Bitonic sort [eleven image-only slides stepping through the bitonic sorting network]

  44. Bitonic sort

  C++ AMP:

  void bitonicSortSimple(int * data, int length)
  {
      unsigned int log2length = log2(length);
      unsigned int checklength = pow2(log2length);
      static const int TS = 1024;
      array_view<int,1> a(length, data);
      for (int phase = 0; phase < log2length; ++phase) {
          int compares = length / 2;
          extent<1> e(compares);
          grid<1> g(e);
          unsigned int phase2 = pow2((unsigned int)phase);
          parallel_for_each(g.tile<TS>(), [phase2, a](tiled_index<TS> idx) restrict(direct3d) {
              int ig = idx.global[0];
              int cross, paired;
              orange_box(ig, phase2, cross, paired);
              swap(a[cross], a[paired]);
          });
          for (int level = phase - 1; level >= 0; --level) {
              unsigned int level2 = pow2((unsigned int)level);
              parallel_for_each(g.tile<TS>(), [level2, a](tiled_index<TS> idx) restrict(direct3d) {
                  int ig = idx.global[0];
                  int cross, paired;
                  red_box(ig, level2, cross, paired);
                  swap(a[cross], a[paired]);
              });
          }
      }
  }

  Sequential (CPU):

  void bitonicSortSequential(int * data, int length)
  {
      unsigned int log2length = log2(length);
      unsigned int checklength = pow2(log2length);
      for (int phase = 0; phase < log2length; ++phase) {
          int compares = length / 2;
          unsigned int phase2 = pow2((unsigned int)phase);
          for (int ig = 0; ig < compares; ++ig) {
              int cross, paired;
              orange_box(ig, phase2, cross, paired);
              swap(data[cross], data[paired]);
          }
          for (int level = phase - 1; level >= 0; --level) {
              unsigned int level2 = pow2((unsigned int)level);
              for (int ig = 0; ig < compares; ++ig) {
                  int cross, paired;
                  red_box(ig, level2, cross, paired);
                  swap(data[cross], data[paired]);
              }
          }
      }
  }
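
  The helpers pow2, log2, orange_box, red_box, and swap are assumed but never shown. The following is a hedged reconstruction based on the standard bitonic network: an orange box compares mirrored pairs within a block, a red box compares pairs a fixed distance apart, and swap must be a conditional compare-exchange for the sort to work. (pow2 and log2 shadow the C math functions of the same name.)

  unsigned int pow2(unsigned int e) { return 1u << e; }
  unsigned int log2(unsigned int n) { unsigned int r = 0; while (n >>= 1) ++r; return r; }

  // Orange box: mirror-compare within a block of 2*phase2 elements.
  void orange_box(int ig, unsigned int phase2, int & cross, int & paired) restrict(cpu,direct3d)
  {
      int base = (ig / phase2) * (phase2 * 2);
      int off  = ig % phase2;
      cross  = base + off;
      paired = base + 2 * phase2 - 1 - off;
  }

  // Red box: compare elements level2 apart within a block of 2*level2.
  void red_box(int ig, unsigned int level2, int & cross, int & paired) restrict(cpu,direct3d)
  {
      int base = (ig / level2) * (level2 * 2);
      int off  = ig % level2;
      cross  = base + off;
      paired = cross + level2;
  }

  // The "swap" is really a compare-exchange: order the pair ascending.
  void swap(int & x, int & y) restrict(cpu,direct3d)
  {
      if (x > y) { int t = x; x = y; y = t; }
  }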

  45. Performance

  C++ AMP performs almost as well as CUDA for (naïve) bitonic sort. Array of length 1024 * 32 * 16 * 16 = 8388608 integers. Environment: NVIDIA GeForce GTX 470, an ATI Radeon HD 6450 (not used), and an Intel Q6600 @ 2.51 GHz (overclocked), 4 GB RAM, running Windows 7 64-bit. 10 runs, first run thrown out (for a total of 9). Average ± S.E. of sample.

  46. Bitonic sort with subsets of bitonic merge

  Peters, H., O. Schulz-Hildebrandt, et al. (2011). "Fast in-place, comparison-based sorting with CUDA: a study with bitonic sort." Concurrency and Computation: Practice and Experience 23(7): 681-693.

  47. Optimized bitonic sort
  • Optimization: compute multiple compares per thread using subsets of the bitonic merge of a given degree.
  • Load the portion of the subset bitonic merge into registers.
  • Perform swaps in registers.
  • Store registers back to global memory.
  • NB: Must manually unroll the loops – bug in the compiler with #pragma unroll!

  for (int i = 0; i < msize; ++i)
      mm[i] = a[base + memory[i]];
  for (int i = 0; i < csize; i += 2) {
      int cross = normalized_compares[i];
      int paired = normalized_compares[i+1];
      swap(mm[cross], mm[paired]);
  }
  for (int i = 0; i < msize; ++i)
      a[base + memory[i]] = mm[i];

  48. Performance

  C++ AMP performs almost as well as CUDA for (naïve) bitonic sort. Array of length 1024 * 32 * 16 * 16 = 8388608 integers. Environment: NVIDIA GeForce GTX 470, an ATI Radeon HD 6450 (not used), and an Intel Q6600 @ 2.51 GHz (overclocked), 4 GB RAM, running Windows 7 64-bit. 10 runs, first run thrown out (for a total of 9). Average ± S.E. of sample.

  49. Issues
  • All GPU memory must be an array_view<> (or array<>), so you must use an array index!!!
    foo_bar[…] = …;
  • Pointer to struct/class unavailable, but struct/class is OK:

  class foo {};
  foo f;
  …
  parallel_for_each…
      f.bar();       // cannot convert 'this' pointer from 'const foo' to 'foo &'

  foo f;
  array_view<foo,1> xxx(1, &f);
  parallel_for_each…
      xxx[0].bar();  // OK, and must use []'s – YUCK!
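
  A complete sketch of the workaround above (foo, bar, and CallBar are illustrative names):

  #include <amp.h>
  using namespace concurrency;

  class foo
  {
  public:
      int value;
      void bar() restrict(cpu,direct3d) { value *= 2; }
  };

  void CallBar()
  {
      foo f = { 21 };
      array_view<foo,1> xxx(1, &f);
      parallel_for_each(xxx.grid, [=](index<1> idx) restrict(direct3d) {
          xxx[idx].bar();            // OK: accessed through the view
      });
      xxx.synchronize();             // f.value is now 42
  }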

  50. Issues
  • Easy to forget that after parallel_for_each, results must be fetched via:
    1) calling array_view::synchronize() on the variable;
    2) destruction of the array_view variable; or
    3) an explicit copy for array<>.
  • Take care in comparing CUDA/OpenCL with C++ AMP:
    1) C++ AMP may not pick the optimal accelerator if you have two GPUs;
    2) C++ AMP picks the tile size if you don't specify it. If you do pick a tile size, it must divide the grid exactly!
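
  A sketch showing the three fetch paths side by side; FetchResults and the values are illustrative:

  #include <amp.h>
  #include <vector>
  using namespace concurrency;

  void FetchResults()
  {
      std::vector<float> v(256);
      array_view<float,1> av(256, v);
      parallel_for_each(av.grid, [=](index<1> i) restrict(direct3d) {
          av[i] = 1.0f;
      });
      av.synchronize();      // 1) explicit synchronize: v is now up to date
                             // 2) …or let av be destroyed at end of scope

      array<float,1> arr(256);
      parallel_for_each(arr.grid, [&arr](index<1> i) restrict(direct3d) {
          arr[i] = 2.0f;
      });
      copy(arr, v.begin());  // 3) explicit copy for array<>
  }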
