Programming Many-Core Systems with GRAMPS
Presentation Transcript

  1. Programming Many-Core Systems with GRAMPS (Jeremy Sugerman, 14 May 2010)

  2. The single fast core era is over
     • Trends:
       • Changing metrics: ‘scale out’, not just ‘scale up’
       • Increasing diversity: many different mixes of ‘cores’
     • Today’s (and tomorrow’s) machines: commodity, heterogeneous, many-core
     • Problem: How does one program all this complexity?!

  3. High-level programming models
     • Two major advantages over threads & locks:
       • Constructs to express/expose parallelism
       • Scheduling support to help manage concurrency, communication, and synchronization
     • Widespread in research and industry: OpenGL/Direct3D, SQL, Brook, Cilk, CUDA, OpenCL, StreamIt, TBB, …

  4. My biases: workloads
     • Interesting applications have irregularity
     • Large bundles of coherent work are efficient
     • The producer-consumer idiom is important
     Goal: Rebuild coherence dynamically by aggregating related work as it is generated.

  5. My target audience
     • Highly informed, but (good) lazy
     • Understands the hardware and best practices
     • Dislikes rote; prefers power over constraints
     Goal: Let systems-savvy developers efficiently develop programs that efficiently map onto their hardware.

  6. Contributions: Design of GRAMPS
     • Programs are graphs of stages and queues
     • Queues: maximum capacities, packet sizes
     • Stages: no, limited, or total automatic parallelism; fixed, variable, or reduction (in-place) outputs
     (Figure: simple graphics pipeline)

  7. Contributions: Implementation
     • Broad application scope: rendering, MapReduce, image processing, …
     • Multi-platform applicability: GRAMPS runtimes for three architectures
     • Performance: scale-out parallelism, controlled data footprint; compares well to schedulers from other models
     • (Also: tunable)

  8. Outline
     • GRAMPS overview
     • Study 1: Future graphics architectures
     • Study 2: Current multi-core CPUs
     • Comparison with schedulers from other parallel programming models

  9. GRAMPS Overview

  10. GRAMPS
     • Programs are graphs of stages and queues
       • Expose the program structure
       • Leave the program internals unconstrained

  11. Writing a GRAMPS program
     • Design the application graph and queues
     • Design the stages
     • Instantiate and launch
     (Figure: Cookie Dough Pipeline. Credit: http://www.foodnetwork.com/recipes/alton-brown/the-chewy-recipe/index.html)
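The design-the-graph / design-the-stages / launch workflow can be sketched in miniature. This is not the real GRAMPS API; the `Queue` class, stage functions, and `SENTINEL` convention below are illustrative assumptions, using Python threads to stand in for the runtime:

```python
import queue
import threading

class Queue:
    """A bounded queue that carries fixed-size packets between stages."""
    def __init__(self, capacity, packet_size):
        self.packet_size = packet_size
        self._q = queue.Queue(maxsize=capacity)

    def push(self, packet):
        self._q.put(packet)          # blocks when the queue is at capacity

    def pop(self):
        return self._q.get()         # blocks until a packet is available

SENTINEL = None                      # end-of-stream marker (an assumption)

def producer(out_q):
    """First stage: emit packets of work."""
    for base in range(0, 8, out_q.packet_size):
        out_q.push(list(range(base, base + out_q.packet_size)))
    out_q.push(SENTINEL)

def consumer(in_q, results):
    """Second stage: consume packets and accumulate a result."""
    while True:
        packet = in_q.pop()
        if packet is SENTINEL:
            break
        results.extend(x * x for x in packet)

# Instantiate the graph (one queue, two stages) and launch it.
q = Queue(capacity=2, packet_size=4)
results = []
threads = [threading.Thread(target=producer, args=(q,)),
           threading.Thread(target=consumer, args=(q, results))]
for t in threads: t.start()
for t in threads: t.join()
```

The bounded capacity is what keeps the producer from racing arbitrarily far ahead of the consumer, which is the footprint-control idea the later slides measure.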

  12. Queues
     • Bounded size, operate at “packet” granularity
     • “Opaque” and “Collection” packets
     • GRAMPS can optionally preserve ordering
       • Required for some workloads, but adds overhead

  13. Thread (and Fixed) stages
     • Preemptible, long-lived, stateful
     • Often merge, compare, or repack inputs
     • Queue operations: Reserve/Commit
     • (Fixed: Thread stages in custom hardware)
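The Reserve/Commit idiom a Thread stage uses can be illustrated with a toy queue: Reserve claims a window of packets for in-place work, and Commit finishes with it. The class and the `keep=` repacking parameter are assumptions for illustration, not the real GRAMPS interface:

```python
import threading
from collections import deque

class ReserveCommitQueue:
    """Toy queue exposing a Reserve/Commit window, as a sketch only."""
    def __init__(self):
        self._lock = threading.Lock()
        self._packets = deque()

    def push(self, packet):
        with self._lock:
            self._packets.append(packet)

    def reserve(self, count):
        """Claim up to `count` packets for the stage to work on."""
        with self._lock:
            return [self._packets.popleft()
                    for _ in range(min(count, len(self._packets)))]

    def commit(self, window, keep=()):
        """Finish with a reserved window; re-push any repacked packets."""
        for packet in keep:
            self.push(packet)

q = ReserveCommitQueue()
for p in ([1, 2], [3, 4], [5, 6]):
    q.push(p)

window = q.reserve(2)              # stage examines two packets
merged = window[0] + window[1]     # a "merge/repack" style operation
q.commit(window, keep=[merged])    # commit the merged result back
```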

  14. Shader stages
     • Automatically parallelized: a horde of non-preemptible, stateless instances
     • Pre-reserve / post-commit
     • Push: variable/conditional output support
       • GRAMPS coalesces elements into full packets
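The push/coalescing behavior can be sketched as follows: shader instances emit individual elements conditionally, and a collector gathers them into full packets before enqueueing. The `CoalescingOutput` class is a hypothetical stand-in for the runtime's coalescing machinery:

```python
class CoalescingOutput:
    """Buffers pushed elements and emits them as full packets."""
    def __init__(self, packet_size, out_packets):
        self.packet_size = packet_size
        self.out_packets = out_packets   # downstream queue (a plain list here)
        self._partial = []

    def push(self, element):
        """Buffer one element; emit a packet once it fills up."""
        self._partial.append(element)
        if len(self._partial) == self.packet_size:
            self.out_packets.append(self._partial)
            self._partial = []

    def flush(self):
        """Emit any trailing partial packet when the stage drains."""
        if self._partial:
            self.out_packets.append(self._partial)
            self._partial = []

out = []
sink = CoalescingOutput(packet_size=4, out_packets=out)

# A data-parallel shader body with conditional output: keep only evens.
for x in range(10):
    if x % 2 == 0:
        sink.push(x)
sink.flush()
# out is now [[0, 2, 4, 6], [8]]: one full packet plus the flushed remainder
```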

  15. Queue sets: Mutual exclusion
     • Independent exclusive (serial) subqueues
     • Created statically or on first output
     • Densely or sparsely indexed
     • Bonus: Automatically instanced Thread stages
     (Figure: Cookie Dough Pipeline)

  16. Queue sets: Mutual exclusion (continued)
     • Independent exclusive (serial) subqueues
     • Created statically or on first output
     • Densely or sparsely indexed
     • Bonus: Automatically instanced Thread stages
     (Figure: Cookie Dough Pipeline with queue set)
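The queue-set idea, one logical queue split into independent serial subqueues so packets sharing a key are handled exclusively while different keys proceed in parallel, can be sketched like this (class and method names are assumptions; subqueues here are created on first output via `defaultdict`):

```python
from collections import defaultdict, deque

class QueueSet:
    """One logical queue split into per-key serial subqueues."""
    def __init__(self):
        self._subqueues = defaultdict(deque)   # created on first output

    def push(self, key, packet):
        self._subqueues[key].append(packet)

    def drain(self, key):
        """Serially consume one subqueue (mutual exclusion per key)."""
        sub = self._subqueues[key]
        while sub:
            yield sub.popleft()

qs = QueueSet()
for i in range(6):
    qs.push(i % 2, i)        # two keys -> two independent subqueues

# Each drain is serial within its key, but the two keys are independent
# and could be handled by two instanced Thread stages concurrently.
evens = list(qs.drain(0))    # [0, 2, 4]
odds = list(qs.drain(1))     # [1, 3, 5]
```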

  17. A few other tidbits
     • In-place Shader stages / coalescing inputs
     • Instanced Thread stages
     • Queues as barriers / read all-at-once

  18. Formative influences
     • The Graphics Pipeline, early GPGPU
     • “Streaming”
     • Work-queues and task-queues

  19. Study 1: Future Graphics Architectures (with Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan; appeared in ACM Transactions on Graphics, January 2009)

  20. Graphics is a natural first domain
     • Table stakes for commodity parallelism
     • GPUs are full of heterogeneity
     • Poised to transition from a fixed/configurable pipeline to a programmable one
     • We have a lot of experience in it

  21. The Graphics Pipeline in GRAMPS
     • Graph and setup are (application) software
       • Can be customized or completely replaced
       • Like the transition to programmable shading
     • Not (unthinkably) radical
     • Fits current hardware: FIFOs, cores, rasterizer, …

  22. Reminder: Design goals
     • Broad application scope
     • Multi-platform applicability
     • Performance: scale-out, footprint-aware

  23. The Experiment
     • Three renderers: rasterization, ray tracer, hybrid
     • Two simulated future architectures
     • A simple scheduler for each

  24. Scope: Two(-plus) renderers
     (Figures: Rasterization Pipeline with ray tracing extension; Ray Tracing Graph)

  25. Platforms: Two simulated systems
     • CPU-Like: 8 fat cores, rasterizer
     • GPU-Like: 1 fat core, 4 micro cores, rasterizer, scheduler

  26. Performance: Metrics
     “Maximize machine utilization while keeping working sets small”
     • Priority #1: Scale-out parallelism
       • Parallel utilization
     • Priority #2: ‘Reasonable’ bandwidth / storage
       • Worst-case total footprint of all queues
       • Inherently a trade-off versus utilization

  27. Performance: Scheduling
     Simple prototype scheduler (both platforms):
     • Static stage priorities
     • Only preempt on Reserve and Commit
     • No dynamic weighting of current queue sizes
     (Figure: graph annotated with stage priorities, lowest to highest)
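A static-priority scheduler of this flavor can be sketched in a few lines: always run the highest-priority stage that has input available, so downstream consumers drain eagerly and queue footprint stays small. The structure and names are illustrative, not the prototype's actual code:

```python
from collections import deque

def run_static_priority(stages):
    """stages: list of (priority, input_queue, body) tuples.
    body(packet) may return (queue, packet) pairs to enqueue downstream."""
    while True:
        runnable = [s for s in stages if s[1]]          # nonempty input
        if not runnable:
            break
        _, in_q, body = max(runnable, key=lambda s: s[0])
        for out_q, packet in (body(in_q.popleft()) or []):
            out_q.append(packet)

# Two-stage pipeline: a producer feeding a higher-priority consumer.
mid, done = deque(), []
producer = (0, deque(range(4)), lambda x: [(mid, x * 2)])
consumer = (1, mid, lambda x: done.append(x))
run_static_priority([producer, consumer])
# Because the consumer outranks the producer, the intermediate queue
# never holds more than one packet at a time.
```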

  28. Performance: Results
     • Utilization: 95+% for all but the rasterized fairy scene (~80%)
     • Footprint: < 600KB CPU-like, < 1.5MB GPU-like
     • Surprised how well the simple scheduler worked
     • Maintaining order costs footprint

  29. Study 2: Current Multi-core CPUs (with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)

  30. Reminder: Design goals
     • Broad application scope
     • Multi-platform applicability
     • Performance: scale-out, footprint-aware

  31. The Experiment
     • 9 applications, 13 configurations
     • One (more) architecture: multi-core x86
       • It’s real (no simulation here)
     • Built with pthreads, locks, and atomics
     • Per-pthread task-priority-queues with work-stealing
     • More advanced scheduling

  32. Scope: Application bonanza
     • GRAMPS: Ray tracer (0, 1 bounce), Spheres (no rasterization, though)
     • MapReduce: Hist (reduce/combine), LR (reduce/combine), PCA
     • Cilk(-like): Mergesort
     • CUDA: Gaussian, SRAD
     • StreamIt: FM, TDE

  33. Scope: Many different idioms
     (Figures: graphs for Ray Tracer, FM, MapReduce, SRAD, Merge Sort)

  34. Platform: 2x quad-core Nehalem (native: 8 HyperThreaded Core i7 cores)
     • Queues: copy in/out, global (shared) buffer
     • Threads: user-level scheduled contexts
     • Shaders: create one task per input packet

  35. Performance: Metrics (reminder)
     “Maximize machine utilization while keeping working sets small”
     • Priority #1: Scale-out parallelism
     • Priority #2: ‘Reasonable’ bandwidth / storage

  36. Performance: Scheduling
     • Static per-stage priorities (still)
     • Work-stealing task-priority-queues
     • Eagerly create one task per packet (naïve)
     • Keep running stages until a low watermark
     • (Limited dynamic weighting of queue depths)
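The combination of per-worker task-priority-queues and work-stealing can be sketched as follows. Each worker prefers its own highest-priority task and steals from another worker when its local queue runs dry; the `Worker` class and the round-robin driver loop are illustrative assumptions, not the runtime's actual structure:

```python
import heapq
import random

class Worker:
    """Per-thread task-priority-queue (higher priority pops first)."""
    def __init__(self):
        self._heap = []          # (negated priority, seq, task) entries
        self._seq = 0

    def push(self, priority, task):
        heapq.heappush(self._heap, (-priority, self._seq, task))
        self._seq += 1

    def pop(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

def run(workers, rng):
    """Round-robin the workers; a dry worker steals from a random victim."""
    executed = []
    while any(w._heap for w in workers):
        for w in workers:
            task = w.pop()
            if task is None:                      # local queue dry: steal
                victims = [v for v in workers if v._heap]
                task = rng.choice(victims).pop() if victims else None
            if task is not None:
                executed.append(task)
    return executed

rng = random.Random(0)
a, b = Worker(), Worker()
a.push(1, "produce-0"); a.push(2, "consume-0")   # consumer stage outranks producer
b.push(1, "produce-1")
order = run([a, b], rng)
```

Giving consumer-stage tasks higher priority than producer-stage tasks is how static stage priorities and stealing combine to keep footprint down while still load-balancing.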

  37. Performance: Good scale-out
     • (Footprint: good; detail a little later)
     (Figure: parallel speedup versus hardware threads)

  38. Performance: Low overheads
     (Figure: execution time breakdown, 8 cores / 16 hyperthreads, percentage of execution)
     • ‘App’ and ‘Queue’ time are both useful work.

  39. Comparison with Other Schedulers (with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)

  40. Three archetypes
     • Task-Stealing (Cilk, TBB):
       • Low overhead with fine-granularity tasks
       • No producer-consumer, priorities, or data-parallel support
     • Breadth-First (CUDA, OpenCL):
       • Simple scheduler (one stage at a time)
       • No producer-consumer, no pipeline parallelism
     • Static (StreamIt / Streaming):
       • No runtime scheduler; complex schedules
       • Cannot adapt to irregular workloads

  41. GRAMPS is a natural framework

  42. The Experiment
     • Re-use the exact same application code
     • Modify the scheduler per archetype:
       • Task-Stealing: unbounded queues, no priorities, (amortized) preempt to child tasks
       • Breadth-First: unbounded queues, one stage at a time, top-to-bottom
       • Static: unbounded queues, offline per-thread schedule using SAS / SGMS

  43. Seeing is believing (ray tracer)
     (Figures: execution traces for GRAMPS, Breadth-First, Task-Stealing, and Static (SAS))

  44. Comparison: Execution time
     (Figure: time breakdown for GRAMPS, Task-Stealing, Breadth-First; percentage of time)
     • Mostly similar: good parallelism, load balance

  45. Comparison: Execution time
     (Figure: time breakdown for GRAMPS, Task-Stealing, Breadth-First; percentage of time)
     • Breadth-first can exhibit load imbalance

  46. Comparison: Execution time
     (Figure: time breakdown for GRAMPS, Task-Stealing, Breadth-First; percentage of time)
     • Task-stealing can ping-pong and cause contention

  47. Comparison: Footprint
     (Figure: relative packet footprint versus GRAMPS, log scale)
     • Breadth-First is pathological (as expected)

  48. Footprint: GRAMPS & Task-Stealing
     (Figures: relative packet footprint; relative task footprint)

  49. Footprint: GRAMPS & Task-Stealing
     GRAMPS gets insight from the graph:
     • (Application-specified) queue bounds
     • Group tasks by stage for priority and preemption
     (Figures: MapReduce and Ray Tracer footprints)

  50. Static scheduling is challenging
     (Figures: execution time; packet footprint)
     • Generating good static schedules is *hard*.
     • Static schedules are fragile:
       • Small mismatches compound
       • The hardware itself is dynamic (cache traffic, IRQs, …)
     • Limited upside: dynamic scheduling is cheap!