

  1. Introduction to OpenCL* Ohad Shacham, Intel Software and Services Group. Thanks to Elior Malul, Arik Narkis, and Doron Singer

  2. Evolution of OpenCL* • Sequential Programs
     int main() {
         // read input
         scalar_mul(…);
         return 0;
     }
     void scalar_mul(int n, const float *a, const float *b, float *c) {
         int i;
         for (i = 0; i < n; i++)
             c[i] = a[i] * b[i];
     }

  3. Evolution of OpenCL* • Multi-threaded Programs
     int main() {
         // read input
         pthread_create(…, scalar_mul, …);
         scalar_mul(n/2, …);
         pthread_join(…);
         return 0;
     }
     void scalar_mul(int n, const float *a, const float *b, float *c) {
         int i;
         for (i = 0; i < n; i++)
             c[i] = a[i] * b[i];
     }

  4. Problems – concurrent programs • Writing concurrent programs is hard • Concurrent algorithms • Threads • Work balancing • Need to update programs when adding new cores to the system • Data races, livelocks, deadlocks • Fixing bugs in concurrent programs is even harder

  5. Evolution of OpenCL* • Vector instruction utilization
     int main() {
         // read input
         scalar_mul(…);
         return 0;
     }
     void scalar_mul(int n, const float *a, const float *b, float *c) {
         int i;
         for (i = 0; i < n; i += 4) {
             __m128 a_vec = _mm_load_ps(a + i);
             __m128 b_vec = _mm_load_ps(b + i);
             __m128 c_vec = _mm_mul_ps(a_vec, b_vec);
             _mm_store_ps(c + i, c_vec);
         }
     }

  6. Problems – vector instruction usage • Utilizing vector instructions is also not a trivial task • Vendor-dependent code • Usage is not future proof • Newer, more efficient instructions • Wider vector registers

  7. GPGPU GPGPU stands for General-Purpose computation on Graphics Processing Units (GPUs). GPUs are high-performance many-core processors that can be used to accelerate a wide range of applications (www.gpgpu.org) Photo taken from: http://folding.stanford.edu/English/FAQ-NVIDIA

  8. GPUs utilization • Many cores can be utilized for computation • GPUs became programmable – GPGPU • CUDA* • Problems • Each vendor has its own language • Requires tweaking to get performance • How can I run on both CPUs and GPUs?

  9. What do we need? • Heterogeneous • Automatically utilizes all available processing units • Portable • High Performance • Utilize Hardware characteristics • Future Proof • Abstract concurrency from the user

  10. OpenCL* – heterogeneous computing Diagram based on deck presented in OpenCL* BOF at SIGGRAPH 2010 by Neil Trevett, NVIDIA, OpenCL* Chair

  11. OpenCL* in a nutshell • An OpenCL* application consists of two parts: • A set of C APIs for compiling and running OpenCL* “Kernels” • Code that is executed on the device by the OpenCL* runtime
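  To make the two parts concrete, here is a minimal sketch (added here, not in the original slides): the device code travels as a plain C string inside the host program.

     /* Device part: OpenCL* C kernel, held as a string on the host */
     const char *src =
         "__kernel void array_mul(__global const float *a,  "
         "                        __global const float *b,  "
         "                        __global float *c) {      "
         "    int id = get_global_id(0);                    "
         "    c[id] = a[id] * b[id];                        "
         "}                                                 ";
     /* Host part: plain C; the string is handed to
        clCreateProgramWithSource() and compiled at runtime */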

  12. Data parallelism A fundamental pattern in high-performance parallel algorithms Applying the same computation logic across multiple data elements (diagram: the statement C[i] = A[i] * B[i] executed for every index i = 0 … N-1, all instances independent of each other)

  13. Data parallelism Usage • Client machines • Video transcoding and editing • Pro image editing • Facial recognition • Workstations • CAD tools • 3D data content creation • Servers • Science and simulations • Medical imaging • Oil & Gas • Finance (e.g., Black-Scholes) • …

  14. OpenCL* kernel example
     void array_mul(int n, const float *a, const float *b, float *c) {
         int i;
         for (i = 0; i < n; i++)
             c[i] = a[i] * b[i];
     }
     __kernel void array_mul(__global const float *a,
                             __global const float *b,
                             __global float *c) {
         int id = get_global_id(0);
         c[id] = a[id] * b[id];
     }

  15. OpenCL* kernel example
     __kernel void array_mul(__global const float *a,
                             __global const float *b,
                             __global float *c) {
         int id = get_global_id(0);
         c[id] = a[id] * b[id];
     }
     (diagram: each value of get_global_id(0) selects one element of a and b and produces one element of c)

  16. Execution Model (diagram: the Global work space is divided into WorkGroups; each WorkGroup contains WorkItems)
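  Concretely, a small sketch (mine, not the deck's) of how a kernel sees this hierarchy through the built-in index functions:

     __kernel void indices(__global int *out) {
         int g  = get_global_id(0);   // position in the global range
         int wg = get_group_id(0);    // which work-group this item belongs to
         int l  = get_local_id(0);    // position within the work-group
         // with no global offset: g == wg * get_local_size(0) + l
         out[g] = wg * get_local_size(0) + l;
     }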

  17. The OpenCL* model • OpenCL* runtime is invoked on the host CPU (using the OpenCL* API) • Choose target device(s) for parallel computation • Data-parallel functions, called Kernels, are compiled (on the host) • Compiled for specific target devices (CPU, GPU, etc.) • Data chunks (called Buffers) are moved across devices • Kernel “commands” are queued for execution on target devices • Asynchronous execution
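  As an illustration of that flow, a condensed host-side sketch (not from the deck; error checks omitted, and the kernel string src follows the earlier example):

     cl_platform_id platform;  cl_device_id device;
     clGetPlatformIDs(1, &platform, NULL);
     clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL); // choose device
     cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
     cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);
     cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
     clBuildProgram(prog, 1, &device, NULL, NULL, NULL);                 // JIT compile
     cl_kernel k = clCreateKernel(prog, "array_mul", NULL);
     cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  n * sizeof(float), a, NULL);           // move data
     /* ... bufB and bufC created the same way ... */
     clSetKernelArg(k, 0, sizeof(cl_mem), &bufA);                        // and args 1, 2
     size_t global = n;
     clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL); // async execution
     clEnqueueReadBuffer(q, bufC, CL_TRUE, 0, n * sizeof(float), c, 0, NULL, NULL);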

  18. The OpenCL* C language • Derived from ISO C99 • A few restrictions, e.g., no recursion or function pointers • Short vector types, e.g., float4, short2, int16 • Built-in functions • math (e.g., sin), geometric, common (e.g., min, clamp)
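  For example (my sketch, not from the deck), a kernel using a short vector type and the built-in clamp:

     __kernel void saturate4(__global const float4 *in, __global float4 *out) {
         int id = get_global_id(0);
         float4 v = in[id];                 // one float4 carries four lanes
         out[id] = clamp(v, 0.0f, 1.0f);    // built-in 'common' function
     }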

  19. OpenCL* key features • Unified programming model for all devices • Develop once, run everywhere • Designed for massive data-parallelism • Implicitly takes care of threading and intrinsics for optimal performance

  20. OpenCL* key features • Dynamic compilation model (Just In Time – JIT) • Future proof, provided vendors update their implementations • Enables heterogeneous computing • A clever application can use all resources of the platform simultaneously

  21. Benefits to User • Hardware abstraction • write once, run everywhere • Cross devices, cross vendors • Automatic parallelization • Good tradeoff between development simplicity and performance • Future proof optimizations • Open standard • Supported by many vendors

  22. Benefits to Hardware Vendor • Enables good hardware ‘time to market’ • Programming model enables good hardware utilization • Applications are automatically portable and future proof • JIT compilation

  23. OpenCL* Cons • Low level – based on C99 • No heap! • Lean framework • An expert tool • In terms of both correctness and performance • OpenCL* is not performance portable • Tweaking is needed for each vendor • Future specs and implementations may eliminate the need for tweaking

  24. Vector dot multiplication
     void vectorDotMul(int *vecA, int *vecB, int size, int *result) {
         *result = 0;
         for (int i = 0; i < size; ++i)
             *result += vecA[i] * vecB[i];
     }

  25. Single work item (diagram: one work item walks the arrays sequentially, multiplying element pairs and accumulating the running sum 2, 4, 6, 8, 10, 12, 14, 16)

  26. Vector dot multiplication in OpenCL*
     __kernel void vectorDotMul(__global int *vecA, __global int *vecB,
                                int size, __global int *result) {
         if (get_global_id(0) == 0) {
             *result = 0;
             for (int i = 0; i < size; ++i)
                 *result += vecA[i] * vecB[i];
         }
     }

  27. Single work group (diagram: the arrays are split among the work items of one work-group; each work item computes a partial sum of its chunk, and the partial sums are then combined into the final result, 16)

  28. __kernel void vectorDotMul(__global int *vecA, __global int *vecB,
                                 int size, __global int *result) {
         int id = get_local_id(0);
         __local volatile int partialSum[MAX_SIZE];
         int localSize = get_local_size(0);
         int work = size / localSize;
         int start = id * work;
         int end = start + work;
         // Work item calculation
         partialSum[id] = 0;
         for (int j = start; j < end; ++j)
             partialSum[id] += vecA[j] * vecB[j];
         barrier(CLK_LOCAL_MEM_FENCE);
         // Reduction
         if (id == 0) {
             *result = 0;
             for (int i = 0; i < localSize; ++i)
                 *result += partialSum[i];
         }
     }

  29. Efficient reduction (diagram: instead of one work item summing all partial results serially, pairs of partial sums are added in parallel, halving the count at each step until the final value, 16, remains)
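  A sketch of what that tree-style reduction could look like in OpenCL* C (my reconstruction, assuming MAX_SIZE is defined and the work-group size is a power of two):

     __kernel void vectorDotMul(__global const int *vecA, __global const int *vecB,
                                int size, __global int *result) {
         int id = get_local_id(0);
         int localSize = get_local_size(0);
         __local int partialSum[MAX_SIZE];
         int sum = 0;
         for (int j = id; j < size; j += localSize)   // per-item partial products
             sum += vecA[j] * vecB[j];
         partialSum[id] = sum;
         barrier(CLK_LOCAL_MEM_FENCE);
         // log2(localSize) halving steps replace the serial final loop
         for (int stride = localSize / 2; stride > 0; stride /= 2) {
             if (id < stride)
                 partialSum[id] += partialSum[id + stride];
             barrier(CLK_LOCAL_MEM_FENCE);
         }
         if (id == 0)
             *result = partialSum[0];
     }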

  30. Vectorization • Processors provide vector units • SIMD on CPUs • Warps on GPUs • Utilized to perform several operations in parallel • Arithmetic operations • Binary operations • Memory operations

  31. Loop vectorization
     void mul(int size, int *a, int *b, int *c) {
         for (int i = 0; i < size; ++i) {
             c[i] = a[i] * b[i];
         }
     }

  32. Loop vectorization
     void mul(int size, int *a, int *b, int *c) {
         for (int i = 0; i < size; i += 4) {
             c[i] = a[i] * b[i];
             c[i+1] = a[i+1] * b[i+1];
             c[i+2] = a[i+2] * b[i+2];
             c[i+3] = a[i+3] * b[i+3];
         }
     }

  33. Loop vectorization
     void mul(int size, int *a, int *b, int *c) {
         for (int i = 0; i < size; i += 4) {
             __m128 a_vec = _mm_load_ps(a + i);
             __m128 b_vec = _mm_load_ps(b + i);
             __m128 c_vec = _mm_mul_ps(a_vec, b_vec);
             _mm_store_ps(c + i, c_vec);
         }
     }

  34. Automatic loop vectorization • Is there a dependency between a, b, and c?
     void mul(int size, int *a, int *b, int *c) {
         for (int i = 0; i < size; ++i) {
             c[i] = a[i] * b[i];
         }
     }

  35. Automatic loop vectorization
     void mul(int size, int *a, int *b, int *c) {
         for (int i = 0; i < size; ++i) {
             c[i] = a[i] * b[i];
         }
     }
     (diagram: b and c point into overlapping memory)

  36. Automatic loop vectorization
     void mul(int size, int *a, int *b, int *c) {
         for (int i = 0; i < size; i += 4) {
             c[i] = a[i] * b[i];
             c[i+1] = a[i+1] * b[i+1];
             c[i+2] = a[i+2] * b[i+2];
             c[i+3] = a[i+3] * b[i+3];
         }
     }
     (diagram: with b and c overlapping, the combined iterations read elements of b that the scalar loop would already have written through c, so the transformation is unsafe)
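  As an aside (not on the slide): C99 lets the programmer rule out such aliasing with the restrict qualifier, which is one way to make the scalar loop safely auto-vectorizable:

     void mul(int size, const int *restrict a,
              const int *restrict b, int *restrict c) {
         // restrict promises the compiler that a, b, and c do not overlap,
         // so the iterations are independent and safe to vectorize
         for (int i = 0; i < size; ++i)
             c[i] = a[i] * b[i];
     }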

  37. Automatic vectorization in OpenCL*
     __kernel void mul(int size, __global int *a, __global int *b, __global int *c) {
         int id = get_global_id(0);
         c[id] = a[id] * b[id];
     }

  38. Automatic vectorization in OpenCL*
     for (int id = workGroupIdStart; id < workGroupIdEnd; ++id) {
         c[id] = a[id] * b[id];
     }

  39. Automatic vectorization in OpenCL*
     for (int id = workGroupIdStart; id < workGroupIdEnd; id += 4) {
         c[id] = a[id] * b[id];
         c[id+1] = a[id+1] * b[id+1];
         c[id+2] = a[id+2] * b[id+2];
         c[id+3] = a[id+3] * b[id+3];
     }

  40. Automatic vectorization in OpenCL*
     for (int id = workGroupIdStart; id < workGroupIdEnd; id += 4) {
         __m128 a_vec = _mm_load_ps(a + id);
         __m128 b_vec = _mm_load_ps(b + id);
         __m128 c_vec = _mm_mul_ps(a_vec, b_vec);
         _mm_store_ps(c + id, c_vec);
     }

  41. Single work group (diagram repeated from slide 27: each work item sums a contiguous chunk of the arrays, then the partial sums are combined into 16)

  42. Vectorizer friendly (diagram: the work items instead read interleaved elements on each step, so neighboring work items touch consecutive memory locations)

  43. __kernel void vectorDotMul(__global int *vecA, __global int *vecB,
                                 int size, __global int *result) {
         int id = get_local_id(0);
         __local volatile int partialSum[MAX_SIZE];
         int localSize = get_local_size(0);
         // Work item calculation: stride by the work-group size so that
         // consecutive work items access consecutive elements
         partialSum[id] = 0;
         for (int j = id; j < size; j += localSize)
             partialSum[id] += vecA[j] * vecB[j];
         barrier(CLK_LOCAL_MEM_FENCE);
         // Reduction
         if (id == 0) {
             *result = 0;
             for (int i = 0; i < localSize; ++i)
                 *result += partialSum[i];
         }
     }

  44. Predication
     __kernel void mul(int size, __global int *a, __global int *b, __global int *c) {
         int id = get_global_id(0);
         if (id > 6) {
             c[id] = a[id] * b[id];
         } else {
             c[id] = a[id] + b[id];
         }
     }

  45. Predication
     for (int id = workGroupIdStart; id < workGroupIdEnd; id += 4) {
         if (id > 6) {
             c[id] = a[id] * b[id];
         } else {
             c[id] = a[id] + b[id];
         }
     }
     How can we vectorize the loop?

  46. Predication
     for (int id = workGroupIdStart; id < workGroupIdEnd; id += 4) {
         bool mask = (id > 6);
         int c1 = a[id] * b[id];
         int c2 = a[id] + b[id];
         c[id] = mask ? c1 : c2;
     }

  47. Predication
     for (int id = workGroupIdStart; id < workGroupIdEnd; id += 4) {
         __m128i idVec = /* vector of consecutive ids */;
         __m128 mask = _mm_castsi128_ps(_mm_cmpgt_epi32(idVec, Vec6));
         __m128 a_vec = _mm_load_ps(a + id);
         __m128 b_vec = _mm_load_ps(b + id);
         __m128 c1_vec = _mm_mul_ps(a_vec, b_vec);
         __m128 c2_vec = _mm_add_ps(a_vec, b_vec);
         __m128 c3_vec = _mm_blendv_ps(c2_vec, c1_vec, mask); // c1 where id > 6
         _mm_store_ps(c + id, c3_vec);
     }

  48. General tweaking • Consecutive memory accesses • SIMD, warps • How can we vectorize code with control flow? • Can we somehow create efficient code with control flow? • Uniform CF • CF that diverges at SIMD-width granularity • Enough work groups to utilize the machine
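  To illustrate the "uniform CF" point (my example, not the deck's): a branch on a kernel argument takes the same direction for every work item, so the vectorized code needs no masking at all:

     __kernel void mul(int size, int useAdd,
                       __global const int *a, __global const int *b, __global int *c) {
         int id = get_global_id(0);
         if (useAdd)                   // uniform: same outcome for all work items
             c[id] = a[id] + b[id];
         else                          // contrast with the divergent 'if (id > 6)'
             c[id] = a[id] * b[id];
     }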

  49. Architecture tweaking • CPU • Locality • No local memory (also slow in some GPUs) • Enough compute for a work group • Overcome thread creation overhead • GPU • Use local memory • Avoid bank conflicts
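  For the "avoid bank conflicts" point, a common trick (my sketch; the transpose kernel and TILE value are illustrative, assuming a power-of-two number of local-memory banks) is to pad the local array so column accesses land in different banks:

     #define TILE 32
     __kernel void transpose(__global const float *in, __global float *out, int n) {
         __local float tile[TILE][TILE + 1];  // +1 column breaks the power-of-two stride
         int x = get_global_id(0), y = get_global_id(1);
         int lx = get_local_id(0), ly = get_local_id(1);
         tile[ly][lx] = in[y * n + x];        // coalesced read into local memory
         barrier(CLK_LOCAL_MEM_FENCE);
         int gx = get_group_id(1) * TILE + lx; // swap the group indices
         int gy = get_group_id(0) * TILE + ly;
         out[gy * n + gx] = tile[lx][ly];      // coalesced write of the transposed tile
     }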

  50. Conclusion • OpenCL* is an open standard that lets developers: • Write the same code for any type of processor • Use all available resources of a platform in their application • Automatic parallelism • OpenCL* applications are automatically portable and forward compatible • OpenCL* is still an expert tool • OpenCL* is not performance portable • Tweaking is still needed for each vendor
