
Parallelism & Algorithm Acceleration



  1. Parallelism & Algorithm Acceleration M. Wijtvliet Embedded Computer Architecture 5KK73

  2. But first … • Thursday: hand-out of the GPU assignment • How many of you have a laptop with an OpenCL- or CUDA-capable video card?

  3. Today’s topics • The importance of memory access patterns • Vectorisation and access patterns • Strided accesses on GPUs • Data re-use on GPUs and FPGAs • Classifying memory access patterns • Berkeley’s ‘7 dwarfs’ • Algorithmic species • Algorithmic skeletons • Algorithmic skeletons for accelerators 5KK73 | Slides: C. Nugteren

  4. Vector-SIMD execution SIMD processes multiple scalar operations concurrently

  Source loop:

      for (i=0; i<N; i++) c[i] = a[i] + b[i];

  Scalar version (N iterations):

      ld  r1, addr1
      ld  r2, addr2
      add r3, r1, r2
      st  r3, addr3

  Vector version (N/4 iterations):

      ldv  vr1, addr1
      ldv  vr2, addr2
      addv vr3, vr1, vr2
      stv  vr3, addr3

  5KK73 | Slides: C. Nugteren

  5. Vector-SIMD execution A single instruction being executed: • By multiple processing engines (ALUs, PEs, cores, nodes) • Concurrently in lockstep (no synchronization) • On multiple data elements Present in a wide range of architectures • SIMD, GPU, AVX, SSE, NEON, Xetal, etc. Type of parallelism that is easy and cheap to implement • No coherence problem • No lock problem Caveat: Hard to program and/or easy to lose many factors of performance 5KK73 | Slides: C. Nugteren [Slides taken from P. Sadayappan]

  6. How to use SIMD instructions? Pick your favourite: • Vectorising compiler (ICC, recent GCC versions) • Macros or intrinsics • Assembly

  Source loop:

      for (i=0; i<N; i++) c[i] = a[i] + b[i];

  With SSE intrinsics:

      __m128 rA, rB, rC;
      for (int i = 0; i < N; i += 4) {
        rA = _mm_load_ps(&a[i]);
        rB = _mm_load_ps(&b[i]);
        rC = _mm_add_ps(rA, rB);
        _mm_store_ps(&c[i], rC);
      }

  Compiler-generated assembly:

      ..B8.5
        movaps a(,%rdx,4), %xmm0
        addps  b(,%rdx,4), %xmm0
        movaps %xmm0, c(,%rdx,4)
        addq   $4, %rdx
        cmpq   %rdi, %rdx
        jl     ..B8.5

  5KK73 | Slides: C. Nugteren [Slides taken from P. Sadayappan]
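  For reference, a self-contained version of the intrinsics variant (a sketch: the array size N, its being a multiple of 4, and the 16-byte alignment via C11 aligned_alloc are assumptions, since _mm_load_ps/_mm_store_ps require aligned operands; compile with e.g. gcc -std=c11 -msse):

      #include <stdio.h>
      #include <stdlib.h>
      #include <xmmintrin.h>   /* SSE intrinsics */

      int main(void) {
          enum { N = 1024 };   /* assumed multiple of 4 */
          float *a = aligned_alloc(16, N * sizeof(float));
          float *b = aligned_alloc(16, N * sizeof(float));
          float *c = aligned_alloc(16, N * sizeof(float));
          for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 1.0f; }

          for (int i = 0; i < N; i += 4) {
              __m128 rA = _mm_load_ps(&a[i]);   /* load 4 floats at once */
              __m128 rB = _mm_load_ps(&b[i]);
              __m128 rC = _mm_add_ps(rA, rB);   /* 4 additions in 1 instruction */
              _mm_store_ps(&c[i], rC);          /* store 4 floats at once */
          }

          printf("c[10] = %f\n", c[10]);        /* expect 11.0 */
          free(a); free(b); free(c);
          return 0;
      }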

  7. What is the performance impact? Properties of the example: • Stride-1 accesses to array a • Inner loop has independent operations (no loop-carried dependences) • Array a resides in the L1 cache (12.5 KB)

      for (i=0; i<N; i++) a[i] = a[i] + 1;

  Performance in GOPS on a CPU with a 128-bit wide SIMD unit: [chart not reproduced in this transcript]

  5KK73 | Slides: C. Nugteren [Slides taken from P. Sadayappan]

  8. Strided accesses (1/2) Properties of the example: • Stride-16 accesses to array a • Inner loop has independent operations • Array a resides in the L1 cache

      for (i=0; i<N; i+=16) a[i] = a[i] + 1;

  Performance in GOPS on a CPU with a 128-bit wide SIMD unit: [chart not reproduced in this transcript]

  Why no performance gain? • Operands are not contiguous in memory • Multiple loads/stores and vector pack/unpack operations are required • GCC does not auto-vectorise this loop • ICC vectorises it, but without gains

  5KK73 | Slides: C. Nugteren [Slides taken from P. Sadayappan]
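  To see where the pack/unpack overhead comes from, here is a hypothetical manual vectorisation of the strided loop (an illustration, not from the slides): each 4-wide vector must be assembled from four non-contiguous loads and scattered back with four scalar stores, which erases the benefit of the single vector add.

      #include <xmmintrin.h>

      /* Sketch: manually vectorised stride-16 update of a[]. */
      void stride16_add_one(float *a, int n) {
          for (int i = 0; i + 48 < n; i += 64) {    /* 4 strided elements per step */
              /* pack four non-contiguous operands into one register... */
              __m128 r = _mm_set_ps(a[i + 48], a[i + 32], a[i + 16], a[i]);
              r = _mm_add_ps(r, _mm_set1_ps(1.0f)); /* the one 4-wide add */
              /* ...then unpack and scatter the results back one by one */
              float tmp[4];
              _mm_storeu_ps(tmp, r);
              a[i]      = tmp[0];
              a[i + 16] = tmp[1];
              a[i + 32] = tmp[2];
              a[i + 48] = tmp[3];
          }
      }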

  9. Strided accesses (2/2) Generalised example (still L1-resident):

      for (i=0; i<N; i+=STRIDE) a[i] = a[i] + 1;

  Performance in GOPS on a CPU with a 128-bit wide SIMD unit, as a function of STRIDE: [chart not reproduced in this transcript]

  5KK73 | Slides: C. Nugteren [Slides taken from P. Sadayappan]

  10. Dependent operations Properties of the example: • Stride-1 accesses to array a • Inner loop has dependent operations • Array a resides in the L1 cache

      for (i=1; i<N; i++) a[i] = a[i-1] + 1;

  Performance in GOPS on a CPU with a 128-bit wide SIMD unit: [chart not reproduced in this transcript]

  Why no performance gain? • Iteration i depends on iteration i-1 (a loop-carried dependence) • The inner loop cannot be parallelised

  5KK73 | Slides: C. Nugteren [Slides taken from P. Sadayappan]

  11. L1 versus main memory Properties of the example: • Stride-1 accesses to array a • Inner loop has independent operations • Array a resides in main memory (DRAM)

      for (i=0; i<10000*N; i++) a[i] = a[i] + 1;

  Performance in GOPS on a CPU with a 128-bit wide SIMD unit: [chart not reproduced in this transcript]

  Why is performance limited? • The code has become memory-bandwidth bound • This is explained by the “roofline model”

  5KK73 | Slides: C. Nugteren [Slides taken from P. Sadayappan]
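  A quick back-of-the-envelope check (with illustrative numbers, not taken from the slides): the loop performs 1 operation per element while moving 8 bytes (one 4-byte load plus one 4-byte store), an arithmetic intensity of 1/8 op/byte. With, say, 20 GB/s of sustained DRAM bandwidth, the roofline caps throughput at 20 × 1/8 = 2.5 GOPS, no matter how wide the SIMD unit is or how many cores are used.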

  12. Multi-core scaling L1-resident version:

      #pragma omp parallel for
      for (i=0; i<N; i++) a[i] = a[i] + 1;

  DRAM-resident version:

      #pragma omp parallel for
      for (i=0; i<10000*N; i++) a[i] = a[i] + 1;

  5KK73 | Slides: C. Nugteren [Slides taken from P. Sadayappan]
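  A self-contained version of the L1-resident variant for experimentation (a sketch; the array size, repetition count, and timing via omp_get_wtime are assumptions; compile with e.g. gcc -O2 -fopenmp):

      #include <stdio.h>
      #include <omp.h>

      #define N (1 << 12)            /* 16 KB of floats: small enough for L1 */

      int main(void) {
          static float a[N];
          double t0 = omp_get_wtime();
          for (int rep = 0; rep < 100000; rep++) {
              #pragma omp parallel for   /* split iterations across cores */
              for (int i = 0; i < N; i++)
                  a[i] = a[i] + 1.0f;
          }
          double t1 = omp_get_wtime();
          printf("%.3f s, a[0] = %f\n", t1 - t0, a[0]);
          return 0;
      }

  Note that with such a tiny loop body the fork/join overhead of the pragma can dominate, which is one reason multi-core scaling on L1-resident data is not automatic.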

  13. Lessons learned from vectorisation Vectorisation and parallelisation are important • Significant speed-ups can be obtained... • ...depending on the memory access patterns! Performance depends on the memory access pattern • Strided accesses • Dependent / independent operations • Size of data structures Performance / implementation will differ per architecture • Vector width and data types • L1-resident or not (L1 cache size, DRAM bandwidth, etc.) Bottom line: let’s take a closer look at memory access patterns 5KK73 | Slides: C. Nugteren

  14. Strided accesses on GPUs Performance in GB/s on a Tesla C2050: [chart not reproduced in this transcript]

      __global__ void stride_copy(float* out, float* in) {
        int id = blockIdx.x*blockDim.x + threadIdx.x;
        out[id*STRIDE] = in[id*STRIDE];
      }

  Why does performance deteriorate as STRIDE grows? • Memory accesses are no longer coalesced • Not all data in the fetched cache-lines is used

  5KK73 | Slides: C. Nugteren
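  For context, a minimal host-side harness that measures the effective bandwidth of this kernel (a sketch; the element count, the block size of 256, and the STRIDE value are assumptions; the buffers are left uninitialised since only bandwidth is measured):

      #include <cstdio>

      #define STRIDE 2           /* assumed; vary from 1 upwards */
      #define N (1 << 22)        /* number of elements copied */

      __global__ void stride_copy(float* out, float* in) {
          int id = blockIdx.x * blockDim.x + threadIdx.x;
          out[id * STRIDE] = in[id * STRIDE];
      }

      int main() {
          float *in, *out;
          cudaMalloc((void**)&in,  (size_t)N * STRIDE * sizeof(float));
          cudaMalloc((void**)&out, (size_t)N * STRIDE * sizeof(float));

          cudaEvent_t start, stop;
          cudaEventCreate(&start); cudaEventCreate(&stop);

          cudaEventRecord(start);
          stride_copy<<<N / 256, 256>>>(out, in);
          cudaEventRecord(stop);
          cudaEventSynchronize(stop);

          float ms = 0.0f;
          cudaEventElapsedTime(&ms, start, stop);
          /* one read and one write of 4 bytes per thread */
          printf("%.1f GB/s\n", 2.0 * N * sizeof(float) / ms / 1e6);

          cudaFree(in); cudaFree(out);
          return 0;
      }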

  15. Data-reuse on GPUs Properties of the example: • Each data element is used 3 times (data-reuse) • Memory bandwidth is the limiting performance factor • Use the GPU’s scratchpad (shared) memory to benefit from reuse • Newer GPUs use caches to benefit automatically • Expected performance gain: up to 2x

      __global__ void filter(float* out, float* in) {
        int id = blockIdx.x*blockDim.x + threadIdx.x;
        out[id] = 0.33 * (in[id-1] + in[id] + in[id+1]);
      }

  [Figure: elements in[id-1], in[id], in[id+1] are reused by neighbouring threads to produce out[id]]

  5KK73 | Slides: C. Nugteren
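  The 2x bound follows from counting memory traffic: without reuse, each output element costs three reads and one write (four transfers), while with perfect reuse each input element is read once and each output element written once (two transfers), so a bandwidth-bound kernel can improve by at most 4/2 = 2x.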

  16. Today’s topics • The importance of memory access patterns • Vectorisation and access patterns • Strided accesses on GPUs • Data re-use on GPUs and FPGAs • Classifying memory access patterns • Berkeley’s ‘7 dwarfs’ • Algorithmic species • Algorithmic skeletons • Algorithmic skeletons for accelerators 5KK73 | Slides: C. Nugteren

  17. Classifying program code Berkeley’s ‘7 dwarfs’ of computation (since extended to 13 ‘motifs’): • Dense Linear Algebra • Sparse Linear Algebra • Spectral Methods • N-Body Methods • Structured Grids • Unstructured Grids • MapReduce • Combinational Logic • Graph Traversal • Dynamic Programming • Backtrack and Branch-and-Bound • Graphical Models • Finite State Machines More information: http://view.eecs.berkeley.edu (“A View From Berkeley”) 5KK73 | Slides: C. Nugteren

  18. Classifying memory access patterns Berkeley’s dwarfs are • High-level and intuitive, but... • ...they don’t capture all relevant details of memory access patterns • Neither formalised nor exact: the classes are based on a textual description Can we do better? • Introducing ‘algorithmic species’ • A classification of code based on memory access patterns 5KK73 | Slides: C. Nugteren

  19. Algorithmic species examples (1/3) Basic ‘forall’ matrix copy • In each (i,j) iteration one data element is read from M • In each (i,j) iteration one data element is written to R

      for (i=0; i<64; i++) {
        for (j=0; j<128; j++) {
          R[i][j] = 2 * M[i][j];
        }
      }

  Species: M[0:63,0:127]|element → R[0:63,0:127]|element

  5KK73 | Slides: C. Nugteren

  20. Algorithmic species examples (2/3) Matrix-vector multiplication • In each i iteration a row of M and the full vector v are read • In each i iteration one element of the result vector r is produced

      for (i=0; i<64; i++) {
        r[i] = 0;
        for (j=0; j<128; j++) {
          r[i] += M[i][j] * v[j];
        }
      }

  Species: M[0:63,0:127]|chunk(-,0:127) + v[0:127]|full → r[0:63]|element

  5KK73 | Slides: C. Nugteren

  21. Algorithmic species examples (3/3) Filter with data-reuse • In each i iteration three neighbouring elements of a are read • In each i iteration one element of m is produced

      for (i=1; i<128-1; i++) {
        m[i] = 0.33 * (a[i-1] + a[i] + a[i+1]);
      }

  Species: a[1:126]|neighbourhood(-1:1) → m[1:126]|element

  5KK73 | Slides: C. Nugteren

  22. How can we use a classification? Consider the earlier GPU ‘filter’ example: • Each data element is used 3 times (data-reuse) • Use the GPU’s scratchpad (shared) memory to benefit from reuse • What if we had an optimised, pre-implemented ‘skeleton’ (template) for such neighbourhood-type computations?

      __global__ void filter(float* out, float* in) {
        int id = blockIdx.x*blockDim.x + threadIdx.x;
        out[id] = 0.33 * (in[id-1] + in[id] + in[id+1]);
      }

  [Figure: elements in[id-1], in[id], in[id+1] are reused by neighbouring threads to produce out[id]]

  5KK73 | Slides: C. Nugteren

  23. Using algorithmic skeletons

  (user input)

      <args>        = float* out, float* in
      <computation> = 0.33 * (in[i-1] + in[i] + in[i+1])
      <input>       = in
      <output>      = out
      <type>        = float

  (simplified skeleton)

      __global__ void neighbourhood_skeleton(<args>) {
        int id  = blockIdx.x*blockDim.x + threadIdx.x;
        int sid = threadIdx.x;
        // Load into local (shared) memory
        __shared__ <type> smem[512];
        smem[sid] = <input>[id];
        __syncthreads();
        // Perform the computation
        <type> res = <computation>;
        <output>[id] = res;
      }

  (instantiated skeleton)

      __global__ void filter(float* out, float* in) {
        int id  = blockIdx.x*blockDim.x + threadIdx.x;
        int sid = threadIdx.x;
        // Load into local (shared) memory
        __shared__ float smem[512];
        smem[sid] = in[id];
        __syncthreads();
        // Perform the computation
        float res = 0.33 * (smem[sid-1] + smem[sid] + smem[sid+1]);
        out[id] = res;
      }

  5KK73 | Slides: C. Nugteren
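  Note that the simplified skeleton ignores block boundaries: threads with sid == 0 or sid == blockDim.x - 1 read shared-memory slots that were never filled. A complete skeleton would also load halo elements, along these lines (a sketch under the assumption of 512-thread blocks, not the actual Bones-generated code):

      __global__ void filter_with_halo(float* out, float* in, int n) {
          int id  = blockIdx.x * blockDim.x + threadIdx.x;
          int sid = threadIdx.x + 1;          // shift by 1: slot 0 is the left halo
          __shared__ float smem[512 + 2];     // blockDim.x == 512 assumed, +2 halo slots

          smem[sid] = in[id];
          if (threadIdx.x == 0 && id > 0)     // first thread loads the left halo
              smem[0] = in[id - 1];
          if (threadIdx.x == blockDim.x - 1 && id < n - 1)
              smem[sid + 1] = in[id + 1];     // last thread loads the right halo
          __syncthreads();

          if (id > 0 && id < n - 1)           // skip the outermost elements
              out[id] = 0.33f * (smem[sid - 1] + smem[sid] + smem[sid + 1]);
      }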

  24. ‘Local means’ denoising Average over a 3x3 or 5x5 area 5KK73 | Slides: G.J. van den Braak

  25. Non-local means denoising Look for similar pixels in a large window (21x21) Determine similarity using a small (3x3) patch 5KK73 | Slides: G.J. van den Braak

  26. Species, Skeletons, A-Darwin and Bones Toolflow: sequential C code → ‘A-Darwin’ (algorithmic species extraction tool) → species-annotated C code → ‘Bones’ (skeleton-based compiler) → CPU-OpenMP / CPU-OpenCL-AMD / CPU-OpenCL-Intel / GPU-OpenCL-AMD / GPU-CUDA. A-Darwin and Bones are available via: • https://github.com/CNugteren/bones • https://github.com/gjvdbraak/bones 5KK73 | Slides: G.J. van den Braak

  27. Non-local means – implementation (±15k FLOPs per pixel!)

      for (int py = 11; py < 501; py++) {          // all pixels in the image
        for (int px = 11; px < 501; px++) {
          float Cp = 0, sum = 0;
          for (int qy = -10; qy <= 10; qy++) {     // all pixels in a window
            for (int qx = -10; qx <= 10; qx++) {
              float d = 0;
              for (int fy = -1; fy <= 1; fy++) {   // all pixels in a patch
                for (int fx = -1; fx <= 1; fx++) {
                  float pix_p = in[py + fy][px + fx];
                  float pix_q = in[py + qy + fy][px + qx + fx];
                  float delta = pix_p - pix_q;
                  d += delta * delta;
                }
              }
              float dd = (1.0f / 9.0f) * d;
              float w = expf(-1.0f * fmaxf(dd - 2.0f*S*S, 0.0f) / (H*H));
              Cp += w;
              sum += in[py + qy][px + qx] * w;
            }
          }
          out[py][px] = (1.0f / Cp) * sum;
        }
      }

  Implementation based on: http://www.ipol.im/pub/art/2011/bcm_nlm/ 5KK73 | Slides: G.J. van den Braak
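  The ±15k figure follows directly from the loop structure: 21 × 21 = 441 window positions per pixel, each evaluating a 3 × 3 patch (9 iterations × 3 FLOPs for the squared difference = 27 FLOPs) plus a handful of operations for the weight and the accumulation, i.e. roughly 441 × 34 ≈ 15k floating-point operations per output pixel.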

  28. Species classification (A-Darwin) A-Darwin extracts the species from the sequential C code and emits species-annotated C code for the skeleton-based compiler (targets: GPU-CUDA, CPU-OpenMP):

      #pragma species copyin
          in[10:501, 10:501]
          in[ 0:511,  0:511]
          in[ 1:510,  1:510]
      #pragma species kernel
          in[10:501, 10:501]|chunk(-1:1, -1:1)
          in[ 0:511,  0:511]|chunk(-11:11, -11:11)
          in[ 1:510,  1:510]|chunk(-10:10, -10:10)
          -> out[11:500, 11:500]|element
      #pragma species copyout
          out[11:500, 11:500]

  5KK73 | Slides: G.J. van den Braak

  29. Test setup • CPU: Intel Core i7-4770 @ 3.4GHz • GPU: Nvidia GTX 760 (Kepler) • 512 x 512 grayscale image • 21 x 21 search window • 3 x 3 patch size 5KK73 | Slides: G.J. van den Braak

  30. Results – Bones GPU [results chart not reproduced in this transcript] CUDA kernel argument variants: • float * in — a general read/write pointer • float const * const __restrict__ in — a unique read-only pointer, accessed via the data cache 5KK73 | Slides: G.J. van den Braak
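  To make the difference concrete, the two kernel signatures would look as follows (a sketch; on Kepler-class GPUs such as the GTX 760, the compiler can only route loads through the read-only data cache when it can prove a pointer is read-only and does not alias the output, which is exactly what the const and __restrict__ qualifiers establish):

      // Variant 1: the compiler must assume 'in' and 'out' may alias,
      // so loads from 'in' take the normal memory path.
      __global__ void filter(float* out, float* in);

      // Variant 2: 'in' is read-only and guaranteed not to alias 'out',
      // so loads from 'in' can be served by the read-only data cache.
      __global__ void filter(float* __restrict__ out,
                             float const* const __restrict__ in);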

  31. Today’s topics • The importance of memory access patterns • Vectorisation and access patterns • Strided accesses on GPUs • Data re-use on GPUs and FPGAs • Classifying memory access patterns • Berkeley’s ‘7 dwarfs’ • Algorithmic species • Algorithmic skeletons • Algorithmic skeletons for accelerators 5KK73 | Slides: M. Wijtvliet

  32. During development of your application you see something like this… [Profile chart: execution time broken down over Functions A–E] • …and you have already applied all possible optimisations, but you still don’t meet the requirements… 5KK73 | Slides: M. Wijtvliet

  33. Contents • Accelerators • Introduction FPGAs • High Level Synthesis • Software skeletons • Hardware skeletons • MAMPSx 5KK73 | Slides: M. Wijtvliet

  34. Accelerators • What is an accelerator? • Some part of the application is put on (dedicated) hardware • To speed up the execution • To make the application more energy efficient • Usually good for algorithms with a high level of parallelism or pipelining [Diagram: CPU coupled to an accelerator] 5KK73 | Slides: M. Wijtvliet

  35. Accelerators • Spatial parallelism:

      for (i=0; i<8; i++) { A[i] = B[i] + 2*C[i]; }

  [Diagram: eight parallel adders computing A[0]…A[7] from B[0]…B[7] and C[0]…C[7] in the same cycle]

  • Pipelining:

      for (i=0; i<3; i++) { A[i+1] = A[i] + B[i]; }

  [Diagram: a chain of adders producing A[1]…A[3] from A[0] and B[0]…B[2]]

  5KK73 | Slides: M. Wijtvliet
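  The distinction is easiest to see by writing both loops out as functions (an illustration, not from the slides): in the first, every statement is independent, so an FPGA can instantiate eight adder/multiplier pairs that all fire in the same cycle; in the second, each statement consumes the previous result, so the hardware becomes a chain of adder stages.

      /* Spatial parallelism: all eight updates are independent,
         so eight hardware adders can execute them concurrently. */
      void spatial(float A[8], const float B[8], const float C[8]) {
          for (int i = 0; i < 8; i++)
              A[i] = B[i] + 2*C[i];
      }

      /* Pipelining: each update consumes the previous result, so the
         three additions map onto a chain of adder stages. */
      void pipelined(float A[4], const float B[3]) {
          A[1] = A[0] + B[0];
          A[2] = A[1] + B[1];
          A[3] = A[2] + B[2];
      }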

  36. Accelerators • Can be implemented on: • FPGA • ASIC • ASIP • GPU • Each with their own strengths and weaknesses: • Performance • Flexibility • Energy efficiency 5KK73 | Slides: M. Wijtvliet

  37. Accelerators • Often in cooperation with a normal CPU (or MCU). • Now also increasingly used on Systems-on-Chip (SoC). 5KK73 | Slides: M. Wijtvliet

  38. Accelerators • When is it useful to make an accelerator? • Profiling the application shows that a large number of cycles is spent in a certain function • Communication and synchronisation overhead is not significant • Again: parallelism (and data dependencies) 5KK73 | Slides: M. Wijtvliet
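  This is Amdahl’s law in practice. A quick worked example (illustrative numbers): if profiling shows that 80% of the cycles are spent in one function and an accelerator speeds that function up by 10x, the overall speedup is 1 / (0.2 + 0.8/10) ≈ 3.6x, and no accelerator, however fast, can push it beyond 1 / 0.2 = 5x.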

  39. Introduction FPGAs [Chart: CPU, GPU, FPGA, ASIP and ASIC compared on performance, flexibility and unit cost] 5KK73 | Slides: M. Wijtvliet

  40. Introduction FPGAs • Consist of many logic blocks that can be connected. • Logic blocks contain logic gates, flip-flops, look-up-tables. • FPGAs can also contain DSPs, RAM blocks, etc. 5KK73 | Slides: M. Wijtvliet

  41. Introduction FPGAs • Interconnects 5KK73 | Slides: M. Wijtvliet

  42. Introduction FPGAs • Inside a Configurable Logic Block (CLB) 5KK73 | Slides: M. Wijtvliet

  43. Introduction FPGAs • Logic cells 5KK73 | Slides: M. Wijtvliet

  44. Introduction FPGAs • Many more varieties exist 5KK73 | Slides: M. Wijtvliet

  45. Introduction FPGAs • Some logic cells contain RAM cells or shift registers. 5KK73 | Slides: M. Wijtvliet

  46. Introduction FPGAs • Special blocks, also called “hard macros” • DSPs • Blocks of RAM/ROM… • Complete CPUs 5KK73 | Slides: M. Wijtvliet

  47. Introduction FPGAs • More complex systems 5KK73 | Slides: M. Wijtvliet

  48. Introduction FPGAs • By configuring the interconnect, logic blocks can be connected together. • In this way, almost any digital circuit can be made. • Some FPGAs can be partially reconfigured at runtime. • FPGAs are often used for ASIC prototyping. 5KK73 | Slides: M. Wijtvliet

  49. Introduction FPGAs • You don’t program instructions… • …but describe how logic elements are connected and how they are configured • Inherently concurrent • Verilog & VHDL • Clock and timing issues

      reg [1:0] A, B;
      reg Clk;

      initial begin
        A = 1; B = 2; Clk = 0;
      end

      // Non-blocking assignments: both right-hand sides are sampled
      // before either register updates, so A and B swap every clock edge.
      always @(posedge Clk) begin
        A <= B;
        B <= A;
      end

  5KK73 | Slides: M. Wijtvliet

  50. Introduction FPGAs • Hardware is described in Verilog, VHDL or another RTL language. • Get the functionality correct. • Get the timing correct. • Debugging can be tricky. • Isn’t there an easier way? 5KK73 | Slides: M. Wijtvliet
