
NVIDIA Kepler Architecture



  1. NVIDIA Kepler Architecture Paul Bissonnette, Rizwan Mohiuddin, Ajith Herga

  2. Compute Unified Device Architecture • Hybrid CPU/GPU Code • Low latency code is run on CPU • Result immediately available • High latency, high throughput code is run on GPU • Result on bus • GPU has many more cores than CPU
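The CPU/GPU split on this slide can be illustrated with a minimal CUDA C sketch (a hypothetical SAXPY kernel, not from the deck; assumes a CUDA-capable GPU and unified memory, which arrived with CUDA 6 on Kepler-class hardware): the CPU handles setup and launch, the GPU does the high-throughput arithmetic, and the result travels back across the bus.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// High-throughput work: each GPU thread handles one element.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  // unified memory keeps the sketch short
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Low-latency code (setup, launch) runs on the CPU; the kernel runs on the GPU.
    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);
    cudaDeviceSynchronize();              // result becomes visible only after the bus/sync

    printf("y[0] = %f\n", y[0]);          // 3*1 + 2 = 5
    cudaFree(x); cudaFree(y);
    return 0;
}
```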

  3. CPU/GPU Code [Diagram: a CUDA program is split into GPU routines and CPU routines; NVCC compiles the GPU routines into a GPU object, GCC compiles the CPU routines into a CPU object, and the linker combines both into the CUDA binary.]

  4. Execution Model (Overview) [Diagram: control alternates between CPU and GPU via RPC-style calls; intermediate results return to the CPU, which launches further GPU work until the final result is ready.]

  5. Execution Model (GPU) [Diagram: a thread grid contains several thread blocks, each made up of threads; each thread block runs on one streaming multiprocessor, and the grid spans the graphics card.]

  6. Execution Model (GPU) • Each procedure runs as a “kernel” • An instance of a kernel runs on a thread block • A thread block executes on a single streaming multiprocessor • All instances of a particular kernel form a thread grid • A thread grid executes on a single graphics card, across several streaming multiprocessors
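The kernel/block/grid hierarchy above maps directly onto the launch syntax. A small sketch (hypothetical kernel name, not from the deck) showing how each thread locates itself within its block and within the whole grid:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread combines its block index and thread index into a
// unique position within the thread grid.
__global__ void whereAmI(int *out) {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    out[globalId] = globalId;
}

int main() {
    const int blocks = 4, threadsPerBlock = 8;  // grid = 4 thread blocks of 8 threads
    int *out;
    cudaMallocManaged(&out, blocks * threadsPerBlock * sizeof(int));

    // All instances of the kernel launched here form one thread grid;
    // each thread block is scheduled onto one streaming multiprocessor.
    whereAmI<<<blocks, threadsPerBlock>>>(out);
    cudaDeviceSynchronize();

    printf("thread grid size: %d\n", blocks * threadsPerBlock);  // 32
    cudaFree(out);
    return 0;
}
```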

  7. Thread Cooperation • Multiple levels of sharing • Thread blocks are similar to MPI groups

  8. GPU Execution of Kernels • In Kepler, threads can spawn new thread blocks/grids • Less time spent on the CPU • More natural recursion • A parent grid completes only after its child grids complete

  9. CUDA Languages • CUDA C/C++ and CUDA Fortran • Scientific computing • Highly parallel applications • NVIDIA specific (unlike OpenCL) • Specialized for specific tasks • Highly optimized single precision floating point • Specialized data sharing instructions within thread blocks

  10. Hyper-Q • Without Hyper-Q: only one hardware work queue is available, so the GPU can receive work from just one source at a time. • This makes it difficult for a single CPU core to keep the GPU busy.

  11. With Hyper-Q: • Allows connections from multiple CUDA streams, Message Passing Interface (MPI) processes, or multiple threads of the same process. • 32 concurrent hardware work queues, so the GPU can receive work from up to 32 processes at the same time. • Up to 3× performance increase over Fermi.
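The bullets above can be sketched as a multi-stream launch (hypothetical kernel and buffer names, not from the deck). On Fermi these streams would funnel into one hardware queue; on Kepler, Hyper-Q gives each its own queue so the kernels can execute concurrently:

```cuda
#include <cuda_runtime.h>

__global__ void busyWork(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 100; ++k)
            data[i] = data[i] * 1.0001f + 0.5f;
}

int main() {
    const int nStreams = 8, n = 1 << 16;
    cudaStream_t streams[nStreams];
    float *buf[nStreams];
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
        // With Hyper-Q each stream feeds its own hardware work queue,
        // so these independent kernels need not serialize.
        busyWork<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(buf[s]);
    }
    return 0;
}
```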

  12. Removes the false dependencies between streams that Fermi's single work queue imposed.

  13. Dynamic Parallelism • Without dynamic parallelism: • Data travels back and forth between the CPU and GPU many times. • This is because the GPU cannot create more work for itself based on the data.

  14. With Dynamic Parallelism: • The GPU can generate work for itself based on intermediate results, without involving the CPU. • Permits dynamic run-time decisions. • Leaves the CPU free to do other work, conserving power.
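A minimal dynamic-parallelism sketch of the idea above (hypothetical kernels, not from the deck; requires compute capability 3.5+ and relocatable device code, e.g. `nvcc -arch=sm_35 -rdc=true`): the parent kernel inspects an intermediate result on the device and decides, without a CPU round trip, whether to launch a child grid.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void childKernel(int value) {
    if (threadIdx.x == 0)
        printf("child grid launched for value %d\n", value);
}

__global__ void parentKernel(const int *data) {
    // Run-time decision made entirely on the GPU: more work is
    // generated only if the intermediate result warrants it.
    if (threadIdx.x == 0 && data[0] > 100)
        childKernel<<<1, 32>>>(data[0]);
}

int main() {
    int *d;
    cudaMalloc(&d, sizeof(int));
    int h = 200;  // stand-in for an intermediate result
    cudaMemcpy(d, &h, sizeof(int), cudaMemcpyHostToDevice);

    parentKernel<<<1, 1>>>(d);
    cudaDeviceSynchronize();  // a parent grid completes only after its child grids

    cudaFree(d);
    return 0;
}
```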

  15. Application Example: Adaptive Grid Simulation

  16. Application Example: Quicksort [Diagram: the CPU launches the initial quicksort kernel; on the GPU, computation streams then spawn further streams for the recursive calls.]

  17. CPU-GPU Stack Exchange • Runs on the CPU • Loops based on intermediate results • Checks whether the GPU has returned any more intermediate results • The CPU spawns a stream to be computed on the GPU
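The quicksort pattern sketched in the slides, where the CPU launches once and streams spawn streams on the device, looks roughly like this (a simplified sketch in the spirit of NVIDIA's dynamic-parallelism quicksort sample; helper names are hypothetical, and real implementations cap recursion depth and fall back to a simple sort for small partitions):

```cuda
// Requires compute capability 3.5+ and: nvcc -arch=sm_35 -rdc=true
#include <cstdio>
#include <cuda_runtime.h>

__device__ int partition(int *a, int lo, int hi) {
    int pivot = a[hi], i = lo - 1;
    for (int j = lo; j < hi; ++j)
        if (a[j] < pivot) { ++i; int t = a[i]; a[i] = a[j]; a[j] = t; }
    int t = a[i + 1]; a[i + 1] = a[hi]; a[hi] = t;
    return i + 1;
}

__global__ void quicksort(int *a, int lo, int hi) {
    if (lo >= hi) return;
    int p = partition(a, lo, hi);
    // Each half becomes a new grid in its own device-side stream,
    // so the two recursive calls can run concurrently
    // ("streams spawning streams").
    cudaStream_t left, right;
    cudaStreamCreateWithFlags(&left, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&right, cudaStreamNonBlocking);
    quicksort<<<1, 1, 0, left>>>(a, lo, p - 1);
    quicksort<<<1, 1, 0, right>>>(a, p + 1, hi);
    cudaStreamDestroy(left);
    cudaStreamDestroy(right);
}

int main() {
    const int n = 8;
    int h[n] = {7, 2, 9, 4, 1, 8, 3, 6};
    int *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    quicksort<<<1, 1>>>(d, 0, n - 1);  // the CPU launches quicksort once
    cudaDeviceSynchronize();           // waits for all recursively spawned grids

    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%d ", h[i]);
    cudaFree(d);
    return 0;
}
```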

  18. Memory Organization

  19. Memory Organization

  20. Core Stream

  21. Stream Processor

  22. Kepler Architecture

  23. Scheduling

  24. Warp Scheduler

  25. Thread Block level/Grid Scheduling

  26. References
  • NVIDIA Whitepapers
  • http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf
  • http://developer.download.nvidia.com/assets/cuda/files/CUDADownloads/TechBrief_Dynamic_Parallelism_in_CUDA.pdf
  • NVIDIA Keynote Presentation
  • http://www.youtube.com/watch?v=TxtZwW2Lf-w
  • Georgia Tech Presentation
  • http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf
  • http://www.anandtech.com/show/6446/nvidia-launches-tesla-k20-k20x-gk110-arrives-at-last/4
  • http://gpuscience.com/code-examples/tesla-k20-gpu-quicksort-with-dynamic-parallelism
