
Many-Core Programming with GRAMPS. Jeremy Sugerman, Stanford PPL Retreat, November 21, 2008.


Presentation Transcript


  1. Many-Core Programming with GRAMPS. Jeremy Sugerman, Stanford PPL Retreat, November 21, 2008

  2. Introduction • Collaborators: Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan • Initial work appearing in ACM TOG in January 2009 • Our starting point: CPU and GPU trends… and collision? • Two research areas: the HW/SW interface / programming model, and the future graphics API

  3. Background • Problem statement / requirements: build a programming model / primitives / building blocks to drive efficient development for, and usage of, future many-core machines; handle homogeneous and heterogeneous systems with programmable cores and fixed-function units • Status quo: GPU pipeline (good for GL, otherwise hard); CPU / C run-time (no guidance, fast is hard)

  4. GRAMPS • [Diagram: two example GRAMPS graphs, a raster graphics pipeline (Rasterize → Shade → FB Blend → Frame Buffer, linked by input/output fragment queues) and a ray tracer (Camera → Ray Queue → Intersect → Ray Hit Queue → Shade → Fragment Queue → FB Blend → Frame Buffer); legend: thread stage, shader stage, fixed-function stage, queue, stage output] • Apps: graphs of stages and queues • Producer-consumer, task, data-parallelism • Initial focus on real-time rendering
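To make the "graphs of stages and queues" idea concrete, here is a minimal, self-contained C++ sketch of how an application graph might be described as data. All of the types and names (Stage, Queue, Graph, addStage, connect) are invented for illustration; this is not the GRAMPS API, just a data-structure view of the raster pipeline from the slide.

// Illustrative sketch only: these types are hypothetical stand-ins,
// not the real GRAMPS API. A GRAMPS app is a graph of stages
// connected by bounded queues.
#include <cstdio>
#include <string>
#include <vector>

enum class StageKind { Thread, Shader, FixedFunction };

struct Stage {
    std::string name;
    StageKind   kind;
};

struct Queue {
    std::string name;
    size_t      maxPackets;   // specified maximum capacity
    int         producer;     // index into stages
    int         consumer;     // index into stages
};

struct Graph {
    std::vector<Stage> stages;
    std::vector<Queue> queues;

    int addStage(std::string n, StageKind k) {
        stages.push_back({std::move(n), k});
        return (int)stages.size() - 1;
    }
    void connect(std::string n, size_t cap, int from, int to) {
        queues.push_back({std::move(n), cap, from, to});
    }
};

int main() {
    // The raster-graphics example from the slide: Rasterize -> Shade -> FB Blend.
    Graph g;
    int rast  = g.addStage("Rasterize", StageKind::FixedFunction);
    int shade = g.addStage("Shade",     StageKind::Shader);
    int blend = g.addStage("FB Blend",  StageKind::FixedFunction);

    g.connect("Input Fragment Queue",  64, rast,  shade);
    g.connect("Output Fragment Queue", 64, shade, blend);

    for (const Queue& q : g.queues)
        std::printf("%s --[%s, max %zu packets]--> %s\n",
                    g.stages[q.producer].name.c_str(), q.name.c_str(),
                    q.maxPackets, g.stages[q.consumer].name.c_str());
    return 0;
}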

  5. Design Goals • Large application scope: preferable to roll-your-own • High performance: competitive with roll-your-own • Optimized implementations: informs HW design • Multi-platform: suits a variety of many-core systems • Also, tunable: expert users can optimize their apps

  6. As a Graphics Evolution • Not (unthinkably) radical for ‘graphics’ • Like fixed → programmable shading • Pipeline undergoing massive shake up • Diversity of new parameters and use cases • Bigger picture than ‘graphics’ • Rendering is more than GL/D3D • Compute is more than rendering • Some ‘GPUs’ are losing their innate pipeline

  7. As a Compute Evolution (1) • Sounds like streaming: Execution graphs, kernels, data-parallelism • Streaming: “squeeze out every FLOP” • Goals: bulk transfer, arithmetic intensity • Intensive static analysis, custom chips (mostly) • Bounded space, data access, execution time

  8. As a Compute Evolution (2) • GRAMPS: “interesting apps are irregular” • Goals: dynamic, data-dependent code; aggregate work at run-time; heterogeneous commodity platforms • Streaming techniques fit naturally when applicable

  9. GRAMPS’ Role • A ‘graphics pipeline’ is now an app! • Target users: engine/pipeline/run-time authors, savvy hardware-aware systems developers. • Compared to status quo: • More flexible, lower level than a GPU pipeline • More guidance than bare metal • Portability in between • Not domain specific

  10. GRAMPS Entities (1) • Data access via windows into queues/memory • Queues: dynamically allocated / managed; ordered or unordered; specified max capacity (could also spill); two types, Opaque and Collection • Buffers: random access; pre-allocated; RO, RW Private, RW Shared (not supported)
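A minimal sketch of the "windows into queues" idea, under the assumption that a producer reserves a window of packet slots, fills them in place, and commits, while a consumer reads a committed window and releases it. PacketQueue and its reserve/commit/window/release methods are hypothetical, not GRAMPS calls, and this toy version neither wraps around nor spills.

// Hypothetical reserve/commit "window" semantics over a bounded packet
// queue. Simplified: linear storage, single producer, single consumer.
#include <cstdio>
#include <vector>

struct Packet { float data[4]; };

class PacketQueue {
    std::vector<Packet> slots_;
    size_t head_ = 0, committed_ = 0;
public:
    explicit PacketQueue(size_t maxPackets) : slots_(maxPackets) {}

    // Producer side: reserve space for `count` packets, or fail if full.
    Packet* reserve(size_t count) {
        if (committed_ + count > slots_.size()) return nullptr;
        return &slots_[committed_];
    }
    void commit(size_t count) { committed_ += count; }

    // Consumer side: a window over the committed-but-unread packets.
    size_t available() const { return committed_ - head_; }
    const Packet* window() const { return &slots_[head_]; }
    void release(size_t count) { head_ += count; }
};

int main() {
    PacketQueue q(16);
    if (Packet* w = q.reserve(2)) {          // producer fills in place
        w[0].data[0] = 1.0f; w[1].data[0] = 2.0f;
        q.commit(2);
    }
    const Packet* r = q.window();            // consumer reads in place
    std::printf("consumed %zu packets, first = %.1f\n", q.available(), r[0].data[0]);
    q.release(q.available());
    return 0;
}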

  11. GRAMPS Entities (2) • Queue Sets: independent sub-queues • Instanced parallelism plus mutual exclusion • Hard to fake with just multiple queues
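A rough sketch of what a queue set buys you: one logical queue split into independent sub-queues (keyed here by screen tile), so many consumer instances can run in parallel while each sub-queue is still drained by at most one consumer at a time. QueueSet, Packet, and drainOne are invented names; this is plain C++ with mutexes, not the GRAMPS primitive.

// Sketch of the queue-set idea: instanced parallelism across sub-queues,
// mutual exclusion within each sub-queue.
#include <cstdio>
#include <deque>
#include <mutex>
#include <vector>

struct Packet { int tile; int payload; };

class QueueSet {
    struct Sub { std::mutex lock; std::deque<Packet> items; };
    std::vector<Sub> subs_;
public:
    explicit QueueSet(size_t n) : subs_(n) {}

    // Producers may target any sub-queue (here: keyed by tile).
    void push(const Packet& p) {
        Sub& s = subs_[p.tile % subs_.size()];
        std::lock_guard<std::mutex> g(s.lock);
        s.items.push_back(p);
    }

    // A consumer instance claims one sub-queue exclusively and drains it;
    // per-key mutual exclusion falls out of the per-sub-queue lock.
    bool drainOne(size_t subIndex) {
        Sub& s = subs_[subIndex];
        std::unique_lock<std::mutex> g(s.lock, std::try_to_lock);
        if (!g.owns_lock() || s.items.empty()) return false;
        while (!s.items.empty()) {
            Packet p = s.items.front();
            s.items.pop_front();
            std::printf("tile %d: consumed %d\n", p.tile, p.payload);
        }
        return true;
    }
};

int main() {
    QueueSet qs(4);
    for (int i = 0; i < 8; ++i) qs.push({i % 4, i});
    for (size_t t = 0; t < 4; ++t) qs.drainOne(t);
    return 0;
}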

  12. Design Goals (Reminder) Large Application Scope High Performance Optimized Implementations Multi-Platform (Tunable)

  13. What We’ve Built (System)

  14. GRAMPS Scheduling • Static inputs: application graph topology; per-queue packet (‘chunk’) size; per-queue maximum depth / high-watermark • Dynamic inputs (currently ignored): current per-queue depths; average execution time per input packet • Simple policy: run consumers, pre-empt producers
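The "run consumers, pre-empt producers" policy can be illustrated with a small sketch: prefer the stage that consumes from the fullest queue (relative to its high-watermark), and refuse to run a producer whose output queue has reached its watermark. This is a paraphrase for illustration; SchedQueue, pickNextStage, and producerMayRun are hypothetical names, and the real GRAMPS scheduler is certainly more involved.

// Minimal consumer-first scheduling sketch over static per-queue limits.
#include <cstdio>
#include <string>
#include <vector>

struct SchedQueue {
    std::string name;
    size_t depth;          // current number of queued packets
    size_t highWatermark;  // static per-queue limit
    int producer;          // stage indices
    int consumer;
};

int pickNextStage(const std::vector<SchedQueue>& queues) {
    int best = -1;
    double bestFill = 0.0;
    for (const SchedQueue& q : queues) {
        double fill = (double)q.depth / (double)q.highWatermark;
        // Fullest queue (relative to its watermark) wins: drain it first.
        if (q.depth > 0 && fill >= bestFill) { bestFill = fill; best = q.consumer; }
    }
    return best;  // -1 means nothing to consume; fall back to producers
}

bool producerMayRun(const std::vector<SchedQueue>& queues, int stage) {
    for (const SchedQueue& q : queues)
        if (q.producer == stage && q.depth >= q.highWatermark)
            return false;  // output full: pre-empt this producer
    return true;
}

int main() {
    // Stage indices: 0 = Camera, 1 = Intersect, 2 = Shade.
    std::vector<SchedQueue> queues = {
        {"Ray Queue",     48, 64, 0, 1},
        {"Ray Hit Queue", 60, 64, 1, 2},
    };
    std::printf("next stage to run: %d\n", pickNextStage(queues));
    std::printf("may Camera (0) run? %d\n", (int)producerMayRun(queues, 0));
    return 0;
}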

  15. GRAMPS Scheduler Organization • Tiered Scheduler: Tier-N, Tier-1, Tier-0 • Tier-N only wakes idle units, no rebalancing • All Tier-1s compete for all queued work. • ‘Fat’ cores: software tier-1 per core, tier-0 per thread • ‘Micro’ cores: single shared hardware tier-1+0

  16. What We’ve Built (Apps) • [Diagram: Direct3D pipeline (with ray-tracing extension): vertex buffers and input vertex queues 1…N feed IA 1…N → VS 1…N into primitive queues, then Rast → PS → OM → RO and the frame buffer via a fragment queue and a sample queue set; the extension routes PS output through a ray queue to a Trace stage and a ray hit queue consumed by PS2] • [Diagram: ray-tracing graph: Tiler → Tile Queue → Sampler → Sample Queue → Camera → Ray Queue → Intersect → Ray Hit Queue → Shade → Fragment Queue → FB Blend → Frame Buffer] • Legend: thread stage, shader stage, fixed-function stage, queue, stage output, push output

  17. Initial Renderer Results • Queues are small (< 600 KB CPU, < 1.5 MB GPU) • Parallelism is good (at least 80%, all but one 95+%)

  18. Scheduling Can Clearly Improve

  19. Taking Stock: High-level Questions • Is GRAMPS a suitable GPU evolution? • Enable pipeline competitive with bare metal? • Enable innovation: advanced / alternative methods? • Is GRAMPS a good parallel compute model? • Does it fulfill our design goals?

  20. Possible Next Steps • Simulation / hardware fidelity improvements: memory model, locality • GRAMPS run-time improvements: scheduling, run-time overheads • GRAMPS API extensions: on-the-fly graph modification, data sharing • More applications / workloads: REYES, physics, finance, AI, …; lazy/adaptive/procedural data generation

  21. Design Goals (Revisited) • Application scope: okay; only (multiple) renderers • High performance: so-so; limited simulation detail • Optimized implementations: good • Multi-platform: good • (Tunable: good, but that’s a separate talk) • Strategy: broaden the available apps and use them to drive performance and simulation work for now

  22. Digression: Some Kinds of Parallelism • Task (divide) and data (conquer) parallelism: subdivide the algorithm into a DAG (or graph) of kernels; data is long lived and manipulated in-place; kernels are ephemeral and stateless; kernels only get input at entry/creation • Producer-consumer (pipeline) parallelism: data is ephemeral, processed as it is generated; bandwidth or storage costs prohibit accumulation
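A tiny, GRAMPS-agnostic example of the producer-consumer style described above: two plain C++ threads share a deliberately small bounded buffer, so data is processed as it is generated rather than accumulated.

// Producer-consumer (pipeline) parallelism with a bounded buffer.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

int main() {
    const int kItems = 8, kCapacity = 2;  // tiny bound: forces interleaving
    std::queue<int> buf;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    std::thread producer([&] {
        for (int i = 0; i < kItems; ++i) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return (int)buf.size() < kCapacity; });
            buf.push(i);            // work is handed off as soon as it exists
            cv.notify_all();
        }
        std::lock_guard<std::mutex> lk(m);
        done = true;
        cv.notify_all();
    });

    std::thread consumer([&] {
        while (true) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !buf.empty() || done; });
            if (buf.empty()) break; // producer finished and queue drained
            int v = buf.front(); buf.pop();
            cv.notify_all();
            lk.unlock();
            std::printf("consumed %d\n", v);
        }
    });

    producer.join();
    consumer.join();
    return 0;
}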

  23. Three New Graphs • “App” 1: MapReduce run-time; a popular parallelism-rich idiom that enables a variety of useful apps • App 2: cloth simulation (rendering physics), inspired by the PhysBAM cloth simulation; demonstrates basic mechanics and collision detection; the graph is still very much a work in progress… • App 3: real-time REYES-like renderer (Kayvon)

  24. MapReduce: Specific Flavour • “ProduceReduce”: minimal simplifications / constraints • Produce/Split (1:N) • Map (1:N) • (Optional) Combine (N:1) • Reduce (N:M, where M << N or often M=1) • Sort (N:N conceptually; implementations vary) • (Aside: REYES is MapReduce, OpenGL is MapCombine)
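To show the shape (not any particular implementation), here is a sequential word-count sketch with the ProduceReduce stages written as ordinary functions over vectors; the 1:N / N:1 / N:M arities mirror the slide, and all of the function names are invented.

// Sequential illustration of Produce/Split, Map, Combine/Reduce, and Sort.
#include <algorithm>
#include <cstdio>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using Tuple = std::pair<std::string, int>;

// Produce/Split (1:N): one input document becomes N words.
std::vector<std::string> produce(const std::string& doc) {
    std::vector<std::string> words;
    std::istringstream in(doc);
    for (std::string w; in >> w; ) words.push_back(w);
    return words;
}

// Map (1:N): each word emits one (word, 1) tuple.
std::vector<Tuple> map_stage(const std::vector<std::string>& words) {
    std::vector<Tuple> out;
    for (const auto& w : words) out.push_back({w, 1});
    return out;
}

// Combine / Reduce (N:M): sum counts per key; M = number of distinct keys.
std::vector<Tuple> reduce_stage(const std::vector<Tuple>& in) {
    std::map<std::string, int> acc;
    for (const auto& t : in) acc[t.first] += t.second;
    return std::vector<Tuple>(acc.begin(), acc.end());
}

int main() {
    auto tuples = reduce_stage(map_stage(produce("a b a c b a")));
    // Sort (N:N): order the final tuples by descending count.
    std::sort(tuples.begin(), tuples.end(),
              [](const Tuple& x, const Tuple& y) { return x.second > y.second; });
    for (const auto& t : tuples) std::printf("%s: %d\n", t.first.c_str(), t.second);
    return 0;
}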

  25. MapReduce Graph • Map output is a dynamically instanced queue set. • Combine might motivate a formal reduction shader. • Reduce is an (automatically) instanced thread stage. • Sort may actually be parallelized. • [Diagram: Produce → Initial Tuples → Map → Intermediate Tuples → Combine (optional) → Intermediate Tuples → Reduce → Final Tuples → Sort → Output; legend: thread stage, shader stage, queue, stage output, push output]

  26. Cloth Simulation Graph • Update is not producer-consumer! • Broad Phase will actually be either a (weird) shader or multiple thread instances. • Fast Recollide details are also TBD. • [Diagram: collision detection stages (Broad Collide, Narrow Collide, Fast Recollide, Resolve) and an Update Mesh stage connected by queues of Moved Nodes, Candidate Pairs, Collisions, BVH Nodes, Resolutions, and Proposed Updates; legend: thread stage, shader stage, queue, stage output, push output]

  27. That’s All Folks • Thank you for listening. Any questions? • Actively interested in new collaborators: owners of or experts in some application domain (or engine / run-time system / middleware), and anyone interested in scheduling or details of possible hardware / core configurations • TOG Paper: http://graphics.stanford.edu/papers/gramps-tog/

  28. Backup Slides / More Details

  29. Designing a Good Graph • Efficiency requires “large chunks of coherent work” • Stages separate coherency boundaries: frequency of computation (fan-out / fan-in), memory access coherency, execution coherency • Queues allow repacking, re-sorting of work from one coherency regime to another
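A small illustration of the repacking point: a chunk of ray hits pulled from a queue is re-sorted by material ID so shading runs over coherent batches. RayHit and shadeChunk are invented for the example; nothing here is GRAMPS-specific.

// Repack one chunk of queued work by a coherency key before dispatch.
#include <algorithm>
#include <cstdio>
#include <vector>

struct RayHit { int materialId; float t; };

// Sort hits so that those sharing a material are adjacent; each run can
// then be shaded as one coherent batch.
void shadeChunk(std::vector<RayHit> chunk) {
    std::sort(chunk.begin(), chunk.end(),
              [](const RayHit& a, const RayHit& b) { return a.materialId < b.materialId; });
    for (size_t i = 0; i < chunk.size(); ) {
        size_t j = i;
        while (j < chunk.size() && chunk[j].materialId == chunk[i].materialId) ++j;
        std::printf("shade %zu hits with material %d\n", j - i, chunk[i].materialId);
        i = j;
    }
}

int main() {
    shadeChunk({{2, 0.5f}, {0, 1.0f}, {2, 0.3f}, {1, 0.9f}, {0, 0.2f}});
    return 0;
}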

  30. GRAMPS Interfaces • Host/Setup: create execution graph • Thread: stateful, singleton • Shader: data-parallel, auto-instanced
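A hedged sketch of the two stage flavours, with invented signatures (shadeFragment, BlendStage) rather than the real GRAMPS interface: the shader stage is a stateless per-element function the run-time could auto-instance, while the thread stage is a stateful singleton that loops over its input.

// Shader vs. thread stage, modelled with plain C++ for illustration.
#include <cstdio>
#include <vector>

struct Fragment { float r, g, b; };

// Shader stage: data-parallel, auto-instanced. No state across elements,
// so the run-time is free to run many copies over one input packet.
Fragment shadeFragment(const Fragment& in) {
    return {in.r * 0.5f, in.g * 0.5f, in.b * 0.5f};
}

// Thread stage: stateful singleton. It loops over its input queue and
// keeps persistent state (here, a running count of processed fragments).
struct BlendStage {
    int packetsSeen = 0;
    void run(const std::vector<Fragment>& inputQueue) {
        for (const Fragment& f : inputQueue) {
            ++packetsSeen;
            std::printf("blend (%.2f, %.2f, %.2f), total %d\n",
                        f.r, f.g, f.b, packetsSeen);
        }
    }
};

int main() {
    std::vector<Fragment> fragments = {{1, 0, 0}, {0, 1, 0}};
    std::vector<Fragment> shaded;
    for (const Fragment& f : fragments)          // stand-in for auto-instancing
        shaded.push_back(shadeFragment(f));
    BlendStage blend;                             // singleton thread stage
    blend.run(shaded);
    return 0;
}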

  31. GRAMPS Graph Portability • Portability really means performance. • Less portable than GL/D3D: the GRAMPS graph is (more) hardware sensitive • More portable than bare metal: enforces modularity; best case, it just works; worst case, it saves boilerplate

  32. Possible Next Steps: Implementation • Better scheduling: less bursty, better slot filling; dynamic priorities; handle graphs with loops better • More detailed costs: bill for scheduling decisions; bill for (internal) synchronization • More statistics

  33. Possible Next Steps: API • Important: graph modification (state change) • Probably: data sharing / ref-counting • Maybe: blocking inter-stage calls (join) • Maybe: intra/inter-stage synchronization primitives

  34. Possible Next Steps: New Workloads • REYES, hybrid graphics pipelines • Image / video processing • Game physics: collision detection or particles • Physics and scientific simulation • AI, finance, sort, search or database query, … • Heavy dynamic data manipulation: k-D tree / octree / BVH build; lazy/adaptive/procedural tree or geometry
