
GPU Computing with Matlab® @ CBI Laboratory



  1. GPU Computing with Matlab® @ CBI Laboratory

  2. Overview • GPU History & Hardware • GPU History • CPU vs. GPU Hardware • Parallelism Design Points • GPU Software Infrastructure ( CUDA ) • Matlab Parallel Computing Toolbox, GPU Computing • GPU nodes @ CBI Lab • Examples • Additional Features

  3. GPU History: the classic rendering problem. A 3D object model (e.g. a circle of radius R centered at (x,y,z), colored blue), a light source at some (x,y,z), and a 2-dimensional screen. Goal: for each pixel (X,Y) on the screen, answer the question "what is my (R,G,B) value?"

  4. GPU History: the same setup (3D object model, light source, 2-dimensional screen) with two key observations: much parallelism is available, since each pixel can be computed independently, and the screen refresh rate is far below the processor clock rate (refresh rate << clock rate).

  5. GPU History: the GPU model is an assembly-line concept. High latency BUT high throughput: each stage of the pipeline works on a different piece of data at the same time, so results stream out continuously even though any single result takes a while to produce.

  6. GPU History: rendering is a pipeline of matrix multiplications over streams of triangles and vertices. For example: rotation (3D to 3D), then translation/rotation/scaling (3D to 3D), then 3D-to-2D perspective projection onto the screen. Each triangle is 3 vertices: (x1,y1,z1), (x2,y2,z2), (x3,y3,z3). These are many independent computations: the more calculators available, the more points we can move around in the same amount of time.
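A minimal Matlab sketch (added here, not from the original slides) of why this pipeline parallelizes so well: one rotation matrix transforms an arbitrarily large batch of vertices in a single matrix multiply, and each output column is computed independently of the others.

    % Rotate N vertices about the z-axis with one matrix multiply.
    theta = pi/4;                      % example rotation angle
    Rz = [cos(theta) -sin(theta) 0;    % 3x3 rotation matrix about z
          sin(theta)  cos(theta) 0;
          0           0          1];
    V  = rand(3, 1e6);                 % 3-by-N: one vertex per column
    Vr = Rz * V;                       % all N vertices rotated; columns are independent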

  7. GPU History: the same pipeline of matrix multiplications raises a question: why must we be limited to performing a single, fixed type of function? The answer marks the start of general-purpose GPU computing (GPGPU): allow the programmer to create custom functions (a.k.a. kernels) that run in parallel.

  8. GPU vs. CPU: different goals. Think of a fast-food restaurant vs. anywhere there are long lines of people waiting. One design targets lower latency with good throughput; the other accepts higher latency for exceptionally high throughput. Which column maps to the CPU and which to the GPU?
  Column A (Lower Latency, Good Throughput):
  • An individual waits as little as possible in line.
  • Workers are always kept busy by having large local caches of supplies, both at the store and at the work counters.
  • Subdivide 1 task into smaller tasks and increase the speed of each smaller task (ILP & pipelining).
  • Try to find parallelism within 1 task (out-of-order execution).
  • Try to predict what people may order, to get a head start (branch prediction).
  • Optimizing for minimum wait time for a single user uses up resources (workers, plus space where you could have put more workers).
  Column B (Higher Latency, Exceptionally High Throughput):
  • An individual may need to wait a long time in line, but many more people go through the system during the course of a day.
  • Workers are always kept busy: even if the current person, say, forgets a document and needs to wait for someone to deliver it, there are many more people waiting in line.
  • More workers, with smaller desks per worker.
  • Use as much of the building space as possible to add workers.

  9. GPU vs. CPU: the answer. The GPU is Column B: higher latency, exceptionally high throughput. Any one person may wait a long time, but workers stay busy, desks are small, and every bit of building space goes to more workers, so far more people get through per day. The CPU is Column A: lower latency, good throughput. It minimizes each individual's wait via large caches, ILP & pipelining, out-of-order execution, and branch prediction, at the cost of resources that could have gone to more workers.

  10. Parallelism Design Points
  • Key: focus on dependency analysis. How much of your program is independent determines the potential parallelism (Amdahl's law), for a fixed amount of work in the parallel section.
  • Gustafson's law: do more work within the parallel sections. (Both laws are stated in symbols below.)
  • Data transfer vs. compute (arithmetic intensity): the cost of moving data from CPU to GPU must be taken into account. The GPU may provide a large benefit when compute >> data I/O.
  • Analogy: going to the store to get 100 items with 10 workers, you ideally want to make 1 trip for all 100 items. Even if all 10 workers fetch their items in parallel, there is not much benefit if you make 10 round trips.
  • Resource contention
  • Data transfer bandwidth
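In symbols (a standard statement of both laws, added here for reference): with p the parallelizable fraction of the work, s the speedup of the parallel part, and N the number of workers,

    S_{\mathrm{Amdahl}}(s) = \frac{1}{(1 - p) + p/s}, \qquad
    S_{\mathrm{Gustafson}}(N) = (1 - p) + pN

Amdahl's law caps the speedup of a fixed workload at 1/(1 - p) no matter how large s gets; Gustafson's law says scaled speedup keeps growing when the parallel portion of the work grows with N.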

  11. Parallelism Design Points
  • Resource limits (memory, disk)
  • Hardware limits: memory cache-line sizes, memory alignment issues, disk block sizes, cache sizes, # of queues, etc.
  • Physical data organization (e.g. row-major vs. column-major; see the sketch after this list)
  • Conditional (if-else) minimization: ideally you would have 0 if-statements in your functions, but that is not always feasible for algorithm correctness.
  • Synchronization: algorithm correctness often requires some type of synchronization.
  • Many more variables affect parallelism at the function, program, and system level: a function may be highly parallelizable, yet achieving a good overall solution may require looking at parallelism at several different levels.
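Matlab itself stores arrays in column-major order, so the data-organization bullet has a direct, testable consequence: traversing a matrix down its columns touches contiguous memory, while traversing across rows strides through it. A minimal sketch (timings are machine-dependent):

    A = rand(8000);                    % 8000-by-8000 test matrix
    tic                                % column-wise: contiguous access
    s1 = 0;
    for j = 1:size(A,2), s1 = s1 + sum(A(:,j)); end
    tCol = toc;
    tic                                % row-wise: strided access
    s2 = 0;
    for i = 1:size(A,1), s2 = s2 + sum(A(i,:)); end
    tRow = toc;
    fprintf('column-wise %.3f s, row-wise %.3f s\n', tCol, tRow);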

  12. GPU Hardware: Fermi architecture [16] (figure: architecture diagram). Many resources are available at www.nvidia.com

  13. GPU Hardware: Fermi architecture [16] (figure: architecture diagram, continued). Many resources are available at www.nvidia.com

  14. GPU Software Infrastructure: CUDA (Compute Unified Device Architecture). The software stack, from top to bottom:
  • Applications (e.g. Matlab)
  • CUDA C/C++: NVCC compiler + utilities (nvprof, visual profiler)
  • PTX (Parallel Thread eXecution): assembly language for a virtual machine
  • CUBIN (CUDA binary)
  • CUDA libraries, CUDA Runtime API, CUDA Driver
  • Operating system (Linux, Windows, etc.)
  • GPU card(s) & system board with CPU, buses (PCIe), ...

  15. GPU Software Infrastructure: the CUDA software model is an abstraction of the hardware, with a software-to-hardware mapping:
  • Streams: compute & data-transfer queues (order is guaranteed within a single stream) → GPU1, GPU2, ...
  • Grids: all threads in a grid run the same kernel (a.k.a. function) → GPU1, GPU2, ...
  • Blocks: groups of cooperating threads → SM (streaming multiprocessor); 32 compute cores per SM in the Fermi architecture. Blocks should be viewed as self-contained work units.
  • Warps: groups of 32 threads → SM; the basic unit of execution: 32 threads running the same instruction in the same amount of time.
  • Threads: an execution context (keeps track of a core's state) → compute core.
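This mapping can be inspected from Matlab: the object returned by gpuDevice exposes the corresponding hardware limits. A short sketch using documented Parallel Computing Toolbox property names:

    g = gpuDevice();          % select & query the current GPU
    g.MultiprocessorCount     % number of SMs (14 on the M2070)
    g.SIMDWidth               % warp size (32)
    g.MaxThreadsPerBlock      % limit on threads per block
    g.MaxGridSize             % limit on blocks per grid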

  16. Matlab Parallel Computing Toolbox, GPU Computing. With each release, more and more functions are enabled for transparent GPU support. Key functions:
  • gpuDevice(#), gpuDeviceCount(), reset(gpuDevice(#)), wait()
  • gpuArray(), gather(), existsOnGPU()
  • arrayfun(), bsxfun()
  • parallel.gpu.CUDAKernel(), feval(), setConstantMemory()
  • Many GPU-enabled built-in functions, e.g. fft, ...; check with: methods('gpuArray')
  A minimal workflow sketch follows.
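A minimal end-to-end sketch of the workflow these functions support (the sizes and the sigmoid are arbitrary examples):

    G = gpuArray(rand(4000));                  % copy host data to the GPU
    H = G * G + 2;                             % overloaded operators execute on the GPU
    F = arrayfun(@(x) 1./(1 + exp(-x)), H);    % element-wise function, run on the GPU
    r = gather(F);                             % copy the result back to the host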

  17. Matlab Parallel Computing Toolbox, GPU Computing. Many built-in functions are GPU-enabled, e.g. fft, fft2, .... Try running >> methods('gpuArray') to see the full list of supported functions.
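For instance, a short sketch of the GPU fft: the call is syntactically identical to the CPU version, and only the input type changes.

    x  = rand(2^20, 1, 'single');   % host data
    gx = gpuArray(x);               % move it to the GPU
    gf = fft(gx);                   % GPU-accelerated FFT, same syntax as CPU fft
    f  = gather(gf);                % bring the spectrum back to the host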

  18. GPU Nodes @ CBI Lab. Nvidia M2070: Fermi architecture, 448 CUDA cores, 14 multiprocessors, 32 CUDA cores per multiprocessor. Two modes: interactive & batch.
  • Interactive (use for development):
    $ ssh -Y username@cheetah.cbi.utsa.edu
    $ qlogin -q gpu.q -l gpuonly
    $ matlab &
  • Batch mode (for production runs), via a job script:
    #!/bin/bash
    #$ -q gpu.q
    #$ -l gpuonly
  [Source: http://www.cbi.utsa.edu/faq/sge/gpu] Putty + Xming can be used to access the Matlab GUI from a Windows system: http://cbi.utsa.edu/faq/xforwarding
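A fuller job script along these lines might look as follows; the script name mygpujob.m and the -cwd option are illustrative assumptions, while matlab -nodisplay -r is the standard way to run Matlab non-interactively:

    #!/bin/bash
    #$ -q gpu.q
    #$ -l gpuonly
    #$ -cwd
    # Run the (hypothetical) file mygpujob.m, then quit Matlab:
    matlab -nodisplay -r "mygpujob; exit"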

  19. GPU Nodes @ CBI Lab. Matlab GUI access is also available from Windows, using Putty + X11 forwarding with Xming: qlogin -q gpu.q -l gpuonly

  20. GPU Nodes @ CBI Lab. Useful commands on a GPU node: matlab & (launch Matlab in the background), nvidia-smi (report GPU status and utilization), top (report CPU/process activity), and, from within Matlab, >> gpuDevice(#) (query GPU #).

  21. GPU Nodes @ CBI Lab

  22. GPU Nodes @ CBI Lab. M2070: Fermi architecture, 448 CUDA cores, 14 multiprocessors, 32 CUDA cores per multiprocessor.

  23. Built-in function support for GPU. Quickly solving sets of linear equations has applications throughout science & engineering. Consider the system:
  • 4x + y - 2z = 0
  • 2x - 3y + 3z = 9
  • -6x - 2y + z = 0
  In matrix form this is A*x = b:
    A = [4 1 -2; 2 -3 3; -6 -2 1];
    b = [0; 9; 0];
    x = A\b;   % x = [0.75; -2; 0.5]
  Checking the solution against each equation (each must match if the solution is correct):
    4*0.75 + (-2) - 2*0.5 = 0 ✓
    2*0.75 - 3*(-2) + 3*0.5 = 9 ✓
    -6*0.75 - 2*(-2) + 0.5 = 0 ✓
  The \ (backslash) operator is one of many functions that work on gpuArray data types.
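Moving the same solve onto the GPU is only a change of array type. A minimal sketch (a 3-by-3 system is far too small to benefit, but the identical code scales to large A):

    A  = gpuArray([4 1 -2; 2 -3 3; -6 -2 1]);
    b  = gpuArray([0; 9; 0]);
    gx = A \ b;          % backslash executes on the GPU for gpuArray inputs
    x  = gather(gx);     % x = [0.75; -2; 0.5]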

  24. Many Additional Features
  • Using Matlab with the GPU in batch mode via a job script
  • Calling .cu / .ptx code directly from Matlab (see the sketch below)
  • Using the GPU from C/C++ code directly with the MEX interface
  • These allow incorporating custom GPU code into Matlab, as well as using Nvidia Nsight and the Nvidia Visual Profiler for custom GPU algorithm development.
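A hedged sketch of the .cu/.ptx route: the kernel name addVec, its file names, and its C signature are hypothetical (the .ptx would come from nvcc -ptx addVec.cu); the Matlab calls are the documented CUDAKernel interface.

    % Hypothetical kernel in addVec.cu:
    %   __global__ void addVec(float *c, const float *a, const float *b, int n)
    k = parallel.gpu.CUDAKernel('addVec.ptx', 'addVec.cu');
    n = 1e6;
    k.ThreadBlockSize = [256, 1, 1];            % threads per block
    k.GridSize        = [ceil(n/256), 1, 1];    % blocks per grid
    a = gpuArray(rand(n, 1, 'single'));
    b = gpuArray(rand(n, 1, 'single'));
    c = gpuArray.zeros(n, 1, 'single');         % output buffer on the GPU
    c = feval(k, c, a, b, n);                   % launch the kernel
    result = gather(c);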

  25. Demo: an example of Matlab code running on a GPU system.
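The transcript does not include the demo code itself, but a plausible shape for it, assuming a simple CPU-vs-GPU matrix-multiply comparison, is sketched below. Note wait(gpuDevice), which forces the asynchronous GPU work to finish before toc reads the timer.

    n = 4000;
    A = rand(n, 'single');  B = rand(n, 'single');
    tic;  C = A * B;  tCPU = toc;                       % CPU timing
    gA = gpuArray(A);  gB = gpuArray(B);
    tic;  gC = gA * gB;  wait(gpuDevice);  tGPU = toc;  % GPU timing, incl. sync
    fprintf('CPU %.3f s, GPU %.3f s, speedup %.1fx\n', tCPU, tGPU, tCPU/tGPU);
    err = max(max(abs(gather(gC) - C)));                % sanity check vs. CPU result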

  26. Appendix. Many applications are being enabled for GPU acceleration, e.g. NAMD for molecular dynamics:
  http://www.nvidia.com/object/gpu-applications.html
  http://www.nvidia.com/content/tesla/pdf/gpu-accelerated-applications-for-hpc.pdf
  C/C++/Fortran library: Accelereyes ArrayFire:
  https://developer.nvidia.com/accelereyes-arrayfire
  http://www.accelereyes.com/examples/case_studies

  27. Appendix. CUDA internals: Valgrind + KCachegrind visualization of libcudart.so (figure).

  28. Appendix. CUDA internals: Valgrind + KCachegrind visualization of libcudart.so (figure, continued).

  29. References
  [1] http://www.mathworks.com/help/distcomp/release-notes.html
  [2] http://www.mathworks.com/help/distcomp/examples/benchmarking-a-b-on-the-gpu.html
  [3] http://www.mathworks.com/help/distcomp/examples/illustrating-three-approaches-to-gpu-computing-the-mandelbrot-set.html
  [4] http://www.mathworks.com/help/distcomp/executing-cuda-or-ptx-code-on-the-gpu.html
  [5] http://www.nvidia.com/docs/IO/105880/DS-Tesla-M-Class-Aug11.pdf
  [6] http://en.wikipedia.org/wiki/Nvidia_Tesla#cite_note-11
  [7] http://en.wikipedia.org/wiki/Rasterisation
  [8] http://en.wikipedia.org/wiki/Perspective_projection#Perspective_projection
  [9] http://en.wikipedia.org/wiki/GPGPU
  [10] http://www.cbi.utsa.edu/faq/sge/gpu
  [11] http://medim.sth.kth.se/6l2872/F/F11c.pdf (FFT registration)
  [12] http://medim.sth.kth.se/6l2872/F/F11c.pdf
  [13] http://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v05.pdf
  [14] http://www.nvidia.com/docs/IO/122874/K20-and-K20X-application-performance-technical-brief.pdf
  [15] http://en.wikipedia.org/wiki/Nvidia_Tesla
  [16] http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
  [17] http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
  [18] https://www.udacity.com/wiki/cs344/Lesson_1_-_The_GPU_Programming_Model#latency-vs-bandwidth
  [19] https://www.udacity.com/wiki/cs344
  [20] http://www.computingbook.org/FullText.pdf
  [21] http://en.wikipedia.org/wiki/Dynamic_random-access_memory
  [22] http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2009/lec08-cache.html
  [23] http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/computer-architecture-2012/lec03-fastest.html
  [24] http://en.wikipedia.org/wiki/Gustafson%27s_law
  [25] http://archive.hpcwire.com/hpc/705814.html
  [26] http://www.johngustafson.net/pubs/pub13/amdahl.pdf
  [27] http://spartan.cis.temple.edu/shi/public_html/docs/amdahl/amdahl.html
  [28] http://software.intel.com/en-us/articles/amdahls-law-gustafsons-trend-and-the-performance-limits-of-parallel-applications

  30. Acknowledgements • This project received computational, research & development, and software design/development support from the Computational System Biology Core / Computational Biology Initiative, funded by the National Institute on Minority Health and Health Disparities (G12MD007591) of the National Institutes of Health. URL: http://www.cbi.utsa.edu

  31. Contact Us http://cbi.utsa.edu
