
GPUs and GPU Programming Bharadwaj Subramanian, Apollo Ellis


Presentation Transcript


  1. GPUs and GPU Programming • Bharadwaj Subramanian, Apollo Ellis • Imagery taken from Nvidia Dawn Demo • Slides on GPUs, CUDA and Programming Models by Apollo Ellis • Slides on OpenCL by Bharadwaj Subramanian

  2. A GPU is a Multi-core Architecture • High throughput is prioritized over low-latency single-task execution • Large collection of fixed-function and software-programmable resources • Applications • Originally graphics • Data-parallel compute with CUDA

  3. Graphics Pipeline • A virtual scene is rendered through a virtual camera • Direct3D and OpenGL formulate the process as a pipeline of operations on fundamental entities • Vertices • Primitives • Fragments • Pixels • Data flows between pipeline stages as streams of these entities.

  4. Graphics Pipeline

  5. Graphics Pipeline • GPU Front End • Otherwise known as the Vertex Generator • Takes in vertex descriptors: location plus type (line, triangle, quad, polygon) • Attributes (normal, texture coordinate, color, etc.) • Performs a prefetch on the vertex data and constructs a vertex stream.

  6. Graphics Pipeline

  7. Graphics Pipeline • Vertex Processing • Programmable vertex shaders execute here • Typically converts from world space to camera space • Languages include Cg and HLSL • Primitive Assembly • Converts from vertices to primitives • Rasterization • Samples primitives in screen space • Fragment Generator

  8. Graphics Pipeline

  9. Graphics Pipeline • Fragment Processing • Programmable fragment shaders execute here • Texture lookup and light interaction calculation • Cg and HLSL • ROP • Raster Operations (depth buffer cull, alpha blend) • Calculates each fragment’s contribution to given pixels

  10. Graphics Pipeline

  11. Shader Programming • Fragment or vertex processing is defined by shader programs written in Cg, GLSL, or HLSL • Compiled at runtime to binary • Or compiled offline and then transformed at runtime • A C-like function that processes a single input and output in isolation • Shaders run in parallel on multiple cores • Wide SIMD instructions due to instruction streaming

  12. Parallel Processing and Encapsulation • Task parallelism is available across stages • E.g. vertices are processed while fragments are processed • Data parallelism is available across stream entities • Each entity is independent of the others because of the offloading of work onto the fixed-function units • Fixed-function units encapsulate hard-to-parallelize work in optimized hardware components

  13. Still A Scheduling Problem • Processing and on-chip resources must be dynamically reallocated to pipeline stages • Depends on the current loads at the different stages • Determining whether different stages get more cores or more cache becomes an issue • Hardware multithreading provides a solution to thread stalls and distributes resources more evenly • Can we make this a more general model?

  14. CUDA • CUDA is a more general data-parallel model • No pipeline • Clusters of threads • Scatter operations (multiple writes) • Gather operations (multiple reads) • Application-based decomposition of threads • Threads can share data and communicate with each other
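
A minimal CUDA sketch of the gather and scatter operations mentioned above (the kernel and array names are illustrative, not from the slides): each thread reads from one computed address and writes to another, access patterns the fixed graphics pipeline does not allow.

    // Each thread gathers from an arbitrary source index (multiple read)
    // and scatters to an arbitrary destination index (multiple write).
    __global__ void gather_scatter(const float* in, float* out,
                                   const int* src_idx, const int* dst_idx,
                                   int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = in[src_idx[i]];   // gather: read from a computed address
            out[dst_idx[i]] = v;        // scatter: write to a computed address
        }
    }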

  15. CUDA Programming Model • The GPU is viewed as a coprocessor with its own DRAM and many parallel threads • Data-parallel portions of applications can be offloaded onto this coprocessor • C on the GPU • Global and shared variables • Pointers and explicit memory allocation • OpenGL and DirectX interoperability
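
A hedged sketch of the coprocessor workflow described above, using the CUDA runtime API (the kernel name, sizes, and scale factor are made up for illustration): the host explicitly allocates device DRAM, copies data over, offloads the data-parallel work as a kernel launch, and copies the result back.

    #include <cuda_runtime.h>

    // Illustrative kernel: each thread scales one element.
    __global__ void scale(float* data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;
        float* h_data = new float[n];                 // host (CPU) buffer
        for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

        float* d_data = nullptr;                      // device (GPU) buffer
        cudaMalloc(&d_data, n * sizeof(float));       // explicit allocation in GPU DRAM
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

        scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);   // offload data-parallel work

        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_data);
        delete[] h_data;
        return 0;
    }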

  16. CUDA Architecture = Tesla

  17. Tesla Architecture • Scalable array of multithreaded Streaming Multiprocessors (SMs) • 768 to 12,288 concurrent threads

  18. CUDA Kernels • C/C++ simple functions or full programs • Consist of thread blocks and grids • Thread block • Set of concurrent threads that cooperate through barriers and shared memory • Grid • Set of thread blocks that are independent from each other • Multiple grids per kernel
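
A small sketch of the cooperation the slide describes, assuming 256 threads per block (the reduction kernel below is illustrative): threads within one block share data through __shared__ memory and synchronize at barriers, while the blocks of the grid remain independent of each other.

    // Block-level sum: threads in a block cooperate through shared memory
    // and __syncthreads(); each block independently writes one partial sum.
    // Assumes the kernel is launched with 256 threads per block.
    __global__ void block_sum(const float* in, float* block_out, int n)
    {
        __shared__ float partial[256];              // visible to this block only
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        partial[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                            // barrier: wait for all loads

        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) partial[tid] += partial[tid + stride];
            __syncthreads();                        // barrier after each step
        }
        if (tid == 0) block_out[blockIdx.x] = partial[0];   // one result per block
    }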

  19. Syntax Example • __global__ void my_par_func(float a) { /* do something with a */ } • int dimGrid = 256, dimBlock = 256; • my_par_func<<<dimGrid, dimBlock>>>(5.0f);

  20. Execution • SIMT (Single Instruction, Multiple Thread) model: the scheduler schedules warps, sets of concurrent threads, onto SM units • Each warp is scheduled independently of other warps • If a warp's threads diverge in control flow, each path is executed in turn, with the threads not on that path turned off • No recursion is allowed, due to stack space issues
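
A tiny illustrative kernel for the divergence case described above (names are made up): threads of the same warp take different branches, so under SIMT each path is executed in turn with the threads not on that path masked off.

    // Even and odd lanes of a warp take different paths; the hardware runs
    // one path while the other lanes sit idle, then runs the other path.
    // (Bounds check omitted for brevity.)
    __global__ void divergent(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0)
            data[i] = data[i] * 2.0f;
        else
            data[i] = data[i] + 1.0f;
    }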

  21. SIMD vs SIMT • CUDA utilizes the wide SIMD units • However SIMD is not exposed to the programmer • Instead SIMD units are used by multiple threads at once • SIMT utilizes SIMD

  22. CUDA Wrap Up • More general model using the same hardware • GPU is a CUDA coprocessor • Tesla architecture: 768 to 12,000+ threads • C/C++ syntax • Serialized branching on divergence • No recursion • SIMD used by SIMT

  23. Another Model: GRAMPS • General Runtime Architecture for Multicore Parallel Systems • A programming model for graphics pipelines • Allows for custom pipelines mixing fixed-function and programmable stages • Data is exchanged using queues and buffers • Motivation comes from hybrid applications • e.g. REYES, rasterization, and ray tracing

  24. Execution Graphs • Analog of a GPU pipeline • Made up of stages • Provides scheduling information • Not limited to execution DAGs • Cycles are not forbidden • Forward progress is not guaranteed • The flexibility presumably outweighs the loss of well-behaved-program guarantees

  25. Execution Graphs

  26. Stages • Types: SHADER, THREAD, FIXED-FUNCTION • Operate asynchronously, exposing parallelism • Indicate similarities in data access and execution characteristics for efficient processing • Useful when the benefits of coherent execution outweigh the cost of deferred processing

  27. Shader • Short-lived, run-to-completion computations • Per-element programs • A push operation is introduced for conditional output • Otherwise, queue inputs and outputs are managed automatically • Shader instances are scheduled in packets, similar to GPU execution

  28. Threads and Fixed Function • Threads • Similar to CPU threads, designed for task parallelism • Must be manually parallelized by the application • Useful for repacking data between Shader stages and for processing bulk chunks of data where sharing or cross-communication is needed • Fixed Function • Wrappers around hardware units

  29. Buffers and Queues • Buffers • Essentially shared memory across stages • Queues • Packets are the primitive data format of a queue, defined at queue creation • Opaque packets: for data chunks that need not be interpreted • Collection packets: for shader group dispatch • Queue manipulation • Thread/Fixed-function stages • Shader stages

  30. Thread and Fixed-Function Stages • reserve-commit • reserve: returns to the caller a reference to one or more contiguous packets; a reservation is also acquired • commit: a notification that releases the referenced data back to the system • An input commit means the packet has been consumed • An output commit means the packet can go downstream

  31. Shader Stages • Queue ops are transparent to the user • As input packets arrive, output reservations are obtained • When all shader instances for a collection packet are done, the commits happen automatically • Queue sets are introduced • Groups of queues viewed as a single queue for sharing among shaders

  32. Summary GRAMPS • The application creates stages, queues, and buffers • Queues and buffers are bound to stages • Computation proceeds according to the execution graph • Execution graphs are fully programmable • Work is aggregated dynamically at runtime

  33. Bonus CUDA Killer App • Ray Tracing • Cast rays from a virtual camera into a virtual scene • Bounce them around until a light source is reached • Color the pixel appropriately • Hierarchical Traversal • Divide the scene primitives into nodes, storing them in a tree • Could even be an octree, as in Barnes-Hut, but with triangles • Scope • We assume the hierarchy is in place • Focusing only on the ray traversal of the hierarchy and the intersecting of primitives • We label traversal and intersection together as trace() • trace() is in fact bottlenecked on the CUDA architecture

  34. Bonus CUDA Killer App • trace() on CUDA architectures is not bottlenecked by compute or by memory bandwidth • At least not entirely • What could this other bottleneck be? • What could we do to solve it? • Consider: • One ray traverses the tree and intersects a given number of times • Another ray in the packet takes a different branch in the tree and intersects a different number of times

  35. Bonus CUDA Killer App • Remember, the CUDA architecture is optimized for homogeneity across computations • Data parallelism • Unfortunately this also requires compute parallelism, or homogeneous work units • Ray tracing is ill-suited in this respect • Bottleneck = load balancing

  36. Bonus CUDA Killer App • How to resolve this? • Persistent threads • Atomic operations • Global work queue • Threads fetch work as it becomes available • Old model = launch warps with equal work • New model = distribute the work as necessary
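
A hedged sketch of the persistent-threads idea (the queue layout, counter, and kernel name are illustrative, not Aila and Laine's actual implementation): a fixed number of long-lived threads loop, each atomically claiming the next ray from a global counter, so threads with cheap rays immediately fetch more work instead of idling while expensive rays finish.

    // Global work-queue head; must be zeroed from the host (e.g. via
    // cudaMemcpyToSymbol) before the kernel is launched.
    __device__ int g_next_ray;

    __global__ void persistent_trace(int num_rays, float* results)
    {
        while (true) {
            int ray_id = atomicAdd(&g_next_ray, 1);   // claim the next unit of work
            if (ray_id >= num_rays)
                break;                                 // queue drained: thread retires

            // ... traverse the hierarchy and intersect primitives for ray_id ...
            results[ray_id] = 0.0f;                    // placeholder for the ray's result
        }
    }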

  37. CUDA Killer App Summary • Ray Tracing in CUDA is bottlenecked at Load Balancing • Most likely among other things • Ray computation diverges • The CUDA model is failing here • In this case we design a model around a model • It is just a specialization

  38. References • Nickolls, J., Buck, I., Garland, M., and Skadron, K. 2008. Scalable parallel programming with CUDA. In ACM SIGGRAPH 2008 Classes (Los Angeles, California, August 11-15, 2008). SIGGRAPH '08. ACM, New York, NY, 1-14. DOI= http://doi.acm.org/10.1145/1401132.1401152 • Fatahalian, K. and Houston, M. 2008. GPUs: a closer look. In ACM SIGGRAPH 2008 Classes (Los Angeles, California, August 11-15, 2008). SIGGRAPH '08. ACM, New York, NY, 1-11. DOI= http://doi.acm.org/10.1145/1401132.1401147 • Buck, I. 2007. GPU computing with NVIDIA CUDA. In ACM SIGGRAPH 2007 Courses (San Diego, California, August 05-09, 2007). SIGGRAPH '07. ACM, New York, NY, 6. DOI= http://doi.acm.org/10.1145/1281500.1281647 • Jeremy Sugerman, Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan: GRAMPS: A programming model for graphics pipelines. ACM Trans. Graph. 28(1), 2009 • Timo Aila, Samuli Laine: Understanding the Efficiency of Ray Traversal on GPUs (2009)
