
OpenCL



Presentation Transcript


  1. OpenCL Ryan Renna

  2. Overview Introduction History Anatomy of OpenCL Execution Model Memory Model Implementation Applications The Future

  3. Goals Knowledge that is transferable to all APIs Overview of concepts rather than API specific terminology Avoid coding examples as much as possible

  4. Introduction

  5. What is OpenCL? A Language: Open Computing Language, it's C-like! Executes code across mixed platforms consisting of CPUs, GPUs and other processors. An API: Runs on the "Host" to manipulate and control OpenCL objects and code. Deals with devices as abstract processing units.

  6. Why Use GPUs? • Modern GPUs are made up of highly parallelizable processing units, named "Stream Processors" • Modern PCs all have dedicated GPUs which sit idle for most day-to-day processing • This strategy is known as "General-Purpose Computation on Graphics Processing Units", or GPGPU

  7. The Stream Processor Any device capable of Stream Processing, a concept related to SIMD. Given a set of data (the stream), a series of functions (called kernel functions) are applied to each element. On-chip memory is used to minimize external memory bandwidth. Did you know: the Cell processor, developed by Sony, Toshiba & IBM, is a Stream Processor?

  8. Streams Most commonly 2D grids (textures). Maps well to matrix algebra, image processing, physics simulations, etc. Did you know: the latest ATI card has 1600 individual Stream Processors?

  9. Kernel Functions Traditional sequential method:

  for (int i = 0; i < 100 * 4; i++) { result[i] = source0[i] + source1[i]; }

  The same process, using the kernel "vector_sum":

  for (int el = 0; el < 100; el++) { vector_sum(result[el], source0[el], source1[el]); }

  10. An "Open" Computing Language Multiple-CPU machines with multiple GPUs, all from different vendors, can work together.

  11. History

  12. GPGPU General-Purpose Computation on Graphics Processing Units. Coined in 2002, with the rise of using GPUs for non-graphics applications. Hardware-specific GPGPU APIs have been created: CUDA (NVidia, 2007), Close To Metal (ATI, 2006).


  14. The Next Step OpenCL: Developed by Apple, which collaborated with AMD, Intel, IBM and NVidia to refine the proposal, then submitted it to the Khronos Group. The specification for OpenCL 1.0 was finished 5 months later.

  15. You may remember me from such open standards as… OpenGL – 2D and 3D graphics API. OpenAL – 3D audio API. OpenGL ES – OpenGL for embedded systems; used in all smartphones. Collada – XML-based schema for storing 3D assets.

  16. Anatomy of OpenCL

  17. API – Platform Layer Host – Container of Contexts; represents the computer itself. Context – Collection of compute devices; enables memory sharing across devices. Compute Device – A processor that executes data-parallel programs; contains Compute Units. Compute Unit – A processing element. Example: a core of a CPU. Queues – Submit work to a compute device; can be in-order or out-of-order.

  18. Host Example A host computer with one device group: a dual-core CPU and a GPU with 8 Stream Processors.

  19. API – Runtime Layer • Memory Objects • Buffers – blocks of memory, accessed as arrays, pointers or structs • Images – 2D or 3D images. Caveat: each image can be read or written in a kernel, but not both. • Executable Objects • Kernel – a data-parallel function that is executed by a compute device • Program – a group of kernels and functions • Synchronization: Events

  20. Example Flow [Diagram: an in-order queue and an out-of-order queue submitting work to a compute device]

  21. Execution Model of OpenCL

  22. N-D Space • The N-dimensional computation domain is called the N-D Space (the NDRange); it defines the total number of elements of execution • These define the Global Dimensions • Each element of execution, representing an instance of a kernel, is called a work-item • Work-items are grouped into local workgroups • Workgroup size is defined by the Local Dimensions

  23. Work-Items • Global work-items don't belong to a workgroup and run in parallel independently (no synchronization) • Local work-items can be synchronized within a workgroup, and share workgroup memory • Each work-item runs as its own thread • Thousands of lightweight threads can be running at a time, managed by the device • Each work-item is assigned a unique global id and a local id within its workgroup, and each workgroup is assigned a workgroup id

  24. Example – Image Filter Executed on a 128 x 128 image, our Global Dimensions are 128, 128, so we have 16,384 work-items in total. We can then define Local Dimensions of 32, 32 (Local Dimensions must divide the Global Dimensions evenly, so a choice like 30, 30 would not be valid). Since workgroups are executed together, and work-items can only be synchronized within workgroups, picking your Global and Local Dimensions is problem specific. If we asked for the local id of work-item 33 along the first dimension, we'd receive 1, as it's the 2nd work-item of the 2nd workgroup.

  25. Memory Model of OpenCL

  26. Memory Model Private – per work-item. Local – shared within a workgroup. Global/Constant – per device, not synchronized. Host Memory – on the host. [Diagram: a Compute Device containing Compute Units; each Work Item has Private memory, each Compute Unit has Local Memory, the device has Global/Constant Memory, and the Host has Host Memory]

  27. Intermission

  28. Implementation

  29. Identifying Parallelizable Routines • Key thoughts: • Work-items should be independent of each other • Work-items within a workgroup share data, but separate workgroups cannot be synchronized, so they cannot depend on each other's results • Find tasks that are independent and highly repeated; pay attention to loops • Transferring data over a PCI bus has overhead, so parallelization is only justified for large data sets, or ones with lots of mathematical computation

  30. An Example – Class Average • Let’s imagine we were writing an application that computed the class average • There are two tasks we’d need to perform: • Compute the final grade for each student • Obtain a class average by averaging the final grades

  31. An Example – Class Average • The two tasks: • Compute the final grade for each student (parallelizable) • Obtain a class average by averaging the final grades (non-parallelizable)

  32. Pseudo Code • Compute the final grade for each student:

  foreach (student in class) {
    grades = student.getGrades();
    sum = 0; count = 0;
    foreach (grade in grades) { sum += grade; count++; }
    student.averageGrade = sum / count;
  }

  33. Pseudo Code • The body of that loop can be isolated into a kernel:

  __kernel void calcGrade(__global const float* input, __global float* output) {
    int i = get_global_id(0);
    // Do work on class[i]
  }

  34. Determining the Data Dimensions First decide how to represent your problem; this will tell you the dimensionality of your Global and Local dimensions. Global dimensions are problem specific. Local dimensions are algorithm specific. Local dimensions must have the same number of dimensions as Global, and must divide the global space evenly. Passing NULL as the workgroup size argument will let OpenCL pick the most efficient setup, but no synchronization will be possible between work-items.

  35. Execution Steps (Warning! Code ahead) An OpenCL calculation needs to perform 6 key steps: 1. Initialization 2. Allocate Resources 3. Create Programs/Kernels 4. Execution 5. Read the Result(s) 6. Clean Up

  36. Initialization Store the kernel in a string/char array:

  const char* kernel_source =
    "__kernel void calcGrade(__global const float* input, __global float* output) {"
    "  int i = get_global_id(0);"
    "  // Do work on class[i]"
    "}";

  37. Initialization Selecting a device and creating a context in which to run the calculation:

  cl_int err;
  cl_context context;
  cl_device_id devices;
  cl_command_queue cmd_queue;
  err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &devices, NULL);
  context = clCreateContext(0, 1, &devices, NULL, NULL, &err);
  cmd_queue = clCreateCommandQueue(context, devices, 0, NULL);

  38. Allocation Allocate memory on the device and push the data to it:

  cl_mem ax_mem = clCreateBuffer(context, CL_MEM_READ_ONLY, atom_buffer_size, NULL, NULL);
  err = clEnqueueWriteBuffer(cmd_queue, ax_mem, CL_TRUE, 0, atom_buffer_size, (void*)values, 0, NULL, NULL);

  39. Program/Kernel Creation Programs are built from source, and kernels are extracted from the built program:

  cl_program program[1];
  cl_kernel kernel[1];
  program[0] = clCreateProgramWithSource(context, 1, (const char**)&kernel_source, NULL, &err);
  err = clBuildProgram(program[0], 0, NULL, NULL, NULL, NULL);
  kernel[0] = clCreateKernel(program[0], "calcGrade", &err);

  40. Execution Arguments to the kernel are set and the kernel is executed over all the data:

  size_t global_work_size[1], local_work_size[1];
  global_work_size[0] = x;
  local_work_size[0] = x / 2;
  err = clSetKernelArg(kernel[0], 0, sizeof(cl_mem), &values);
  err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 1, NULL, global_work_size, local_work_size, 0, NULL, NULL);

  41. Read the Result(s) We read the results back to the Host:

  err = clEnqueueReadBuffer(cmd_queue, val_mem, CL_TRUE, 0, grid_buffer_size, val, 0, NULL, NULL);

  Note: if we were working on images, clEnqueueReadImage() would be called instead.

  42. Clean Up Clean up memory and release all OpenCL objects. You can check each object's OpenCL reference count and ensure it equals zero:

  clReleaseKernel(kernel[0]);
  clReleaseProgram(program[0]);
  clReleaseCommandQueue(cmd_queue);
  clReleaseContext(context);

  43. Advanced Techniques Instead of finding the first GPU, we could create a context out of all OpenCL devices, or dynamically decide which devices and dimensions would perform best. Debugging can be done more efficiently on the CPU than on a GPU; printf calls will work inside a kernel run on the CPU.

  44. Applications

  45. Applications Raytracing, Weather forecasting, Climate research, Physics simulations, Computational finance, Computer vision, Signal processing, Speech processing, Cryptography / Cryptanalysis, Neural networks, Database operations… Many more!

  46. The Future

  47. OpenGL Interoperability OpenCL + OpenGL: efficient inter-API communication. OpenCL efficiently shares resources with OpenGL (doesn't copy). OpenCL objects can be created from OpenGL objects. OpenGL 4.0 has been designed to align both standards to work closely together.

  48. Competitor • DirectCompute by Microsoft • Bundled with DirectX 11 • Requires a DX10 or DX11 graphics card • Requires Windows Vista or 7 • Close to OpenCL feature-wise • Internet Explorer 9 and Firefox 3.7 both use DirectX to speed up DOM tree rendering (Windows only)

  49. Overview • With OpenCL • Leverage CPUs, GPUs and other processors to accelerate parallel computation • Get dramatic speedups for computationally intensive applications • Write accelerated portable code across different devices and architectures

  50. Getting Started… • ATI Stream SDK • Support for OpenCL/OpenGL interoperability • Support for OpenCL/DirectX interoperability • http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx • CudaToolkit • http://developer.nvidia.com/object/cuda_3_0_downloads.html • OpenCL.NET • OpenCL Wrapper for .NET languages • http://www.hoopoe-cloud.com/Solutions/OpenCL.NET/Default.aspx
