Programming Many-Core Systems with GRAMPS

Jeremy Sugerman
14 May 2010

Presentation Transcript
The single fast core era is over
  • Trends:
    • Changing metrics: ‘scale out’, not just ‘scale up’
    • Increasing diversity: many different mixes of ‘cores’
  • Today’s (and tomorrow’s) machines: commodity, heterogeneous, many-core

Problem: How does one program all this complexity?!

High-level programming models
  • Two major advantages over threads & locks
    • Constructs to express/expose parallelism
    • Scheduling support to help manage concurrency, communication, and synchronization
  • Widespread in research and industry: OpenGL/Direct3D, SQL, Brook, Cilk, CUDA, OpenCL, StreamIt, TBB, …
My biases: workloads
  • Interesting applications have irregularity
  • Large bundles of coherent work are efficient
  • Producer-consumer idiom is important

Goal: Rebuild coherence dynamically by aggregating related work as it is generated.

My target audience
  • Highly informed, but (good) lazy
    • Understands the hardware and best practices
    • Dislikes rote; prefers power over constraints

Goal: Let systems-savvy developers efficiently develop programs that efficiently map onto their hardware.

Contributions: Design of GRAMPS
  • Programs are graphs of stages and queues
  • Queues:
    • Maximum capacities, Packet sizes
  • Stages:
    • No, limited, or total automatic parallelism
    • Fixed, variable, or reduction (in-place) outputs

[Figure: Simple Graphics Pipeline]

Contributions: Implementation
  • Broad application scope:
    • Rendering, MapReduce, image processing, …
  • Multi-platform applicability:
    • GRAMPS runtimes for three architectures
  • Performance:
    • Scale-out parallelism, controlled data footprint
    • Compares well to schedulers from other models
  • (Also: Tunable)
Outline
  • GRAMPS overview
  • Study 1: Future graphics architectures
  • Study 2: Current multi-core CPUs
  • Comparison with schedulers from other parallel programming models
GRAMPS
  • Programs are graphs of stages and queues
    • Expose the program structure
    • Leave the program internals unconstrained
Writing a GRAMPS program
  • Design the application graph and queues:
  • Design the stages
  • Instantiate and launch.

[Figure: Cookie Dough pipeline] (image credit: http://www.foodnetwork.com/recipes/alton-brown/the-chewy-recipe/index.html)
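The three steps above boil down to a small amount of setup code. The sketch below is purely illustrative: the header, namespace, and function names (gramps.h, gramps::Graph, addQueue, addThreadStage, addShaderStage, launch) are hypothetical stand-ins, not the actual GRAMPS interfaces; the graph mirrors the cookie dough example.

```cpp
// Hypothetical sketch of building and launching a GRAMPS-style graph.
// All gramps::* names are illustrative stand-ins, not the real API.
#include "gramps.h"   // hypothetical header

// Stage bodies (app-specific; declared here, defined elsewhere).
void MixerMain(gramps::ThreadContext& ctx);    // stateful Thread stage
void ShapeKernel(gramps::ShaderContext& ctx);  // data-parallel Shader stage
void OvenMain(gramps::ThreadContext& ctx);     // stateful Thread stage

int main() {
    gramps::Graph graph;

    // 1. Design the application graph and queues: bounded queues that carry
    //    fixed-size packets between stages.
    gramps::Queue* doughQ  = graph.addQueue("dough",   /*packetBytes=*/4096,
                                            /*maxPackets=*/32);
    gramps::Queue* cookieQ = graph.addQueue("cookies", 4096, 32);

    // 2. Design the stages and wire them to their input/output queues.
    graph.addThreadStage("Mixer",  /*inputs=*/{},        /*outputs=*/{doughQ},  MixerMain);
    graph.addShaderStage("Shaper", /*inputs=*/{doughQ},  /*outputs=*/{cookieQ}, ShapeKernel);
    graph.addThreadStage("Oven",   /*inputs=*/{cookieQ}, /*outputs=*/{},        OvenMain);

    // 3. Instantiate and launch; returns once all stages finish and queues drain.
    graph.launch();
    return 0;
}
```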

Queues
  • Bounded size, operate at “packet” granularity
    • “Opaque” and “Collection” packets
  • GRAMPS can optionally preserve ordering
    • Required for some workloads, adds overhead
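To make “packet granularity” concrete, a packet can be pictured as a fixed-size buffer that is either opaque to the runtime or a Collection whose element count the runtime can inspect. The layouts below are an illustration only, not the actual GRAMPS packet format.

```cpp
// Illustrative packet layouts only; sizes and fields are not the real format.
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPacketBytes = 4096;

// Opaque packet: the runtime moves it between queues but never looks inside.
struct OpaquePacket {
    std::uint8_t bytes[kPacketBytes];
};

// Collection packet: a small header the runtime understands (a count of valid
// elements), followed by the elements that Shader instances consume.
struct CollectionPacket {
    std::uint32_t numElements;   // how many element slots are valid
    std::uint32_t elementBytes;  // size of each element
    std::uint8_t  data[kPacketBytes - 2 * sizeof(std::uint32_t)];
};
```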
Thread (and Fixed) stages
  • Preemptible, long-lived, stateful
    • Often merge, compare, or repack inputs
  • Queue operations: Reserve/Commit
  • (Fixed: Thread stages in custom hardware)
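A Thread stage’s Reserve/Commit discipline can be pictured as the loop below. The call names (reserveInput, commitOutput, and so on) are hypothetical, but the structure matches the description above: the stage is stateful and long-lived, and the runtime can only preempt it at queue operations.

```cpp
// Hypothetical Thread stage body illustrating the Reserve/Commit idiom.
// All gramps::* names are illustrative stand-ins for the real API.
#include "gramps.h"   // hypothetical header

void mergeAndRepack(const gramps::Window& in, gramps::Window& out);  // app-specific

void MergeMain(gramps::ThreadContext& ctx) {
    for (;;) {
        // Reserve a window of input packets; the runtime may preempt here.
        gramps::Window in = ctx.reserveInput(/*queue=*/0, /*numPackets=*/2);
        if (in.empty()) break;   // upstream stages have finished

        // Reserve space on the output queue (may also block / preempt).
        gramps::Window out = ctx.reserveOutput(/*queue=*/0, /*numPackets=*/1);

        mergeAndRepack(in, out);  // stateful, application-specific work

        // Commit returns the reserved packets to the runtime.
        ctx.commitInput(in);
        ctx.commitOutput(out);
    }
}
```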
Shader stages
  • Automatically parallelized:
    • Horde of non-preemptible, stateless instances
    • Pre-reserve/post-commit
  • Push: Variable/conditional output support
    • GRAMPS coalesces elements into full packets
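A Shader stage, by contrast, is written as a stateless kernel that the runtime instances once per input packet, with push used for conditional output. Again, every name below (types, context calls, helpers) is a hypothetical stand-in.

```cpp
// Hypothetical Shader stage kernel: one stateless, non-preemptible instance
// runs per input packet. push() emits a variable number of elements that
// the runtime coalesces into full packets. Names are illustrative only.
#include <cstdint>
#include "gramps.h"   // hypothetical header

struct Ray { /* origin, direction, ... */ };
struct Hit { /* position, normal, ... */ };
Ray  loadRay(const gramps::CollectionPacket& p, std::uint32_t i);  // app-specific
bool traceRay(const Ray& r, Hit* h);                               // app-specific
gramps::Element makeShadingWork(const Hit& h);                     // app-specific

void ShadeKernel(gramps::ShaderContext& ctx) {
    const gramps::CollectionPacket& in = ctx.inputPacket();
    for (std::uint32_t i = 0; i < in.numElements; ++i) {
        Ray ray = loadRay(in, i);
        Hit hit;
        if (traceRay(ray, &hit)) {
            // Conditional output: only rays that hit something produce work.
            ctx.push(/*outputQueue=*/0, makeShadingWork(hit));
        }
    }
    // No explicit Reserve/Commit: the runtime pre-reserves the input packet
    // and post-commits any pushed output around each instance.
}
```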
Queue sets: Mutual exclusion
[Figure: Cookie Dough pipeline]
  • Independent exclusive (serial) subqueues
    • Created statically or on first output
    • Densely or sparsely indexed
  • Bonus: Automatically instanced Thread stages
Queue sets: Mutual exclusion (continued)
[Figure: Cookie Dough pipeline, with a queue set]
  • Independent exclusive (serial) subqueues
    • Created statically or on first output
    • Densely or sparsely indexed
  • Bonus: Automatically instanced Thread stages
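As a concrete illustration in the style of the earlier sketches, a binning stage can route each element to an exclusive subqueue by key; each subqueue is drained serially, and the consuming Thread stage can be instanced once per active subqueue. None of the names below are the real GRAMPS API.

```cpp
// Hypothetical queue-set sketch (illustrative API, not the real one): output
// is routed to a subqueue chosen by a key; each subqueue is consumed serially
// (mutual exclusion), and the consuming Thread stage is auto-instanced.
#include <cstdint>
#include "gramps.h"   // hypothetical header

void BucketKernel(gramps::ShaderContext& ctx);   // defined below
void ReduceMain(gramps::ThreadContext& ctx);     // app-specific, one per subqueue

void buildBinningGraph(gramps::Graph& graph, gramps::Queue* samples) {
    // 256 densely indexed exclusive subqueues (e.g., one per bin group).
    gramps::QueueSet* bins =
        graph.addQueueSet("bins", /*packetBytes=*/4096, /*maxPackets=*/64,
                          /*numSubqueues=*/256);
    graph.addShaderStage("Bucket", {samples}, {bins}, BucketKernel);
    graph.addThreadStage("Reduce", {bins}, {}, ReduceMain);  // auto-instanced
}

void BucketKernel(gramps::ShaderContext& ctx) {
    const gramps::CollectionPacket& in = ctx.inputPacket();
    for (std::uint32_t i = 0; i < in.numElements; ++i) {
        std::uint32_t value = ctx.loadElement<std::uint32_t>(in, i);
        // The key selects which exclusive subqueue receives this element.
        ctx.pushToSubqueue(/*outputQueue=*/0, /*key=*/value & 0xFF, value);
    }
}
```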
A few other tidbits
  • In-place Shader stages / coalescing inputs
  • Instanced Thread stages
  • Queues as barriers / read all-at-once
Formative influences
  • The Graphics Pipeline, early GPGPU
  • “Streaming”
  • Work-queues and task-queues

Study 1: Future Graphics Architectures

(with Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, and Pat Hanrahan; appeared in ACM Transactions on Graphics, January 2009)

Graphics is a natural first domain
  • Table stakes for commodity parallelism
  • GPUs are full of heterogeneity
  • Poised to transition from fixed/configurable pipeline to programmable
  • We have a lot of experience in it
The Graphics Pipeline in GRAMPS
  • Graph, setup are (application) software
    • Can be customized or completely replaced
  • Like the transition to programmable shading
    • Not (unthinkably) radical
  • Fits current hw: FIFOs, cores, rasterizer, …
Reminder: Design goals
  • Broad application scope
  • Multi-platform applicability
  • Performance: scale-out, footprint-aware
The Experiment
  • Three renderers:
    • Rasterization, Ray Tracer, Hybrid
  • Two simulated future architectures
    • Simple scheduler for each
Platforms: Two simulated systems

  • CPU-Like: 8 fat cores, plus a rasterizer unit
  • GPU-Like: 1 fat core, 4 micro cores, plus rasterizer and scheduler units

Performance: Metrics

“Maximize machine utilization while keeping working sets small”

  • Priority #1: Scale-out parallelism
    • Parallel utilization
  • Priority #2: ‘Reasonable’ bandwidth / storage
    • Worst case total footprint of all queues
    • Inherently a trade-off versus utilization
Performance: Scheduling

Simple prototype scheduler (both platforms):

  • Static stage priorities:
  • Only preempt on Reserve and Commit
  • No dynamic weighting of current queue sizes

[Figure: static stage priorities, ordered from lowest to highest]
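In other words, a scheduling decision is made only when a stage blocks at Reserve or Commit, and the choice is simply the highest-priority stage with runnable work. A minimal, self-contained sketch of that selection rule (illustrative, not the simulator's code):

```cpp
// Illustrative only: the prototype policy reduces to "run the runnable stage
// with the highest static priority", re-evaluated only at Reserve/Commit.
#include <array>
#include <cstddef>

constexpr std::size_t kNumStages = 8;   // illustrative pipeline depth

// pending[s] = runnable packets for stage s; larger s = higher static priority.
int pickStage(const std::array<std::size_t, kNumStages>& pending) {
    for (int s = static_cast<int>(kNumStages) - 1; s >= 0; --s) {
        if (pending[static_cast<std::size_t>(s)] > 0) return s;
    }
    return -1;   // nothing runnable
}
```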

Performance: Results
  • Utilization: 95+% for all but rasterized fairy (~80%).
  • Footprint: < 600KB CPU-like, < 1.5MB GPU-like
  • Surprised how well the simple scheduler worked
  • Maintaining order costs footprint

Study 2: Current Multi-core CPUs

(with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)

Reminder: Design goals
  • Broad application scope
  • Multi-platform applicability
  • Performance: scale-out, footprint-aware
The Experiment
  • 9 applications, 13 configurations
  • One (more) architecture: multi-core x86
    • It’s real (no simulation here)
    • Built with pthreads, locks, and atomics
      • Per-pthread task-priority-queues with work-stealing
    • More advanced scheduling
Scope: Application bonanza
  • GRAMPS: Ray tracer (0, 1 bounce), Spheres (no rasterization, though)
  • MapReduce: Hist (reduce / combine), LR (reduce / combine), PCA
  • Cilk(-like): Mergesort
  • CUDA: Gaussian, SRAD
  • StreamIt: FM, TDE
Platform: 2xQuad-core Nehalem
  • Queues: copy in/out, global (shared) buffer
  • Threads: user-level scheduled contexts
  • Shaders: create one task per input packet

Native: 8 HyperThreaded Core i7 cores
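“User-level scheduled contexts” means each Thread stage runs on its own lightweight context that yields back to the worker's scheduler when it blocks at a queue. The POSIX ucontext sketch below is only one way to picture this; it is not claimed to be the runtime's actual mechanism.

```cpp
// Sketch of a Thread stage on a user-level context: the stage swaps back to
// the worker's scheduler context when it "blocks" at a queue operation.
// POSIX ucontext is used purely for illustration.
#include <ucontext.h>
#include <vector>

namespace {
ucontext_t schedCtx;                       // the worker's scheduler context
ucontext_t stageCtx;                       // one Thread stage's context
std::vector<char> stageStack(256 * 1024);  // stack for the stage context

void stageBody() {
    // ... run until an input Reserve cannot be satisfied ...
    swapcontext(&stageCtx, &schedCtx);     // yield to the scheduler
    // ... resumed here later, once packets have arrived ...
}
}  // namespace

void runStageOnce() {
    getcontext(&stageCtx);
    stageCtx.uc_stack.ss_sp   = stageStack.data();
    stageCtx.uc_stack.ss_size = stageStack.size();
    stageCtx.uc_link          = &schedCtx;          // return here when done
    makecontext(&stageCtx, stageBody, 0);
    swapcontext(&schedCtx, &stageCtx);              // run stage until it yields
}
```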

Performance: Metrics (reminder)

“Maximize machine utilization while keeping working sets small”

  • Priority #1: Scale-out parallelism
  • Priority #2: ‘Reasonable’ bandwidth / storage
Performance: Scheduling
  • Static per-stage priorities (still)
  • Work-stealing task-priority-queues
  • Eagerly create one task per packet (naïve)
  • Keep running stages until a low watermark
    • (Limited dynamic weighting of queue depths)
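Taken together, this is per-worker priority task queues with stealing plus a low-watermark hysteresis: a worker keeps draining its current stage while the backlog is deep, and otherwise falls back to strict static priorities. A self-contained sketch of that selection rule (illustrative values, not the runtime's code):

```cpp
// Illustrative sketch of "static priorities + low watermark": stay on the
// current stage while its backlog is deep, otherwise pick the highest-priority
// stage with runnable work (and steal from another worker if there is none).
#include <array>
#include <cstddef>

constexpr std::size_t kNumStages    = 8;   // illustrative
constexpr std::size_t kLowWatermark = 4;   // illustrative threshold

// pending[s] = runnable tasks for stage s on this worker;
// larger s = higher static priority.
int pickNextStage(const std::array<std::size_t, kNumStages>& pending,
                  int currentStage) {
    // Hysteresis: keep running the current stage while its backlog is deep.
    if (currentStage >= 0 &&
        pending[static_cast<std::size_t>(currentStage)] > kLowWatermark) {
        return currentStage;
    }
    // Otherwise fall back to strict static priorities.
    for (int s = static_cast<int>(kNumStages) - 1; s >= 0; --s) {
        if (pending[static_cast<std::size_t>(s)] > 0) return s;
    }
    return -1;   // no local work: time to steal from another worker
}
```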
Performance: Good scale-out
  • (Footprint: Good; detail a little later)

[Figure: parallel speedup vs. hardware threads]

Performance: Low overheads

[Figure: execution time breakdown (8 cores / 16 hyperthreads), as a percentage of execution]

  • ‘App’ and ‘Queue’ time are both useful work.


Comparison with Other Schedulers

(with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)

Three archetypes
  • Task-Stealing: (Cilk, TBB)
    • Low overhead with fine granularity tasks
    • No producer-consumer, priorities, or data-parallel
  • Breadth-First: (CUDA, OpenCL)
    • Simple scheduler (one stage at a time)
    • No producer-consumer, no pipeline parallelism
  • Static: (StreamIt / Streaming)
    • No runtime scheduler; complex schedules
    • Cannot adapt to irregular workloads
The Experiment
  • Re-use the exact same application code
  • Modify the scheduler per archetype:
    • Task-Stealing: Unbounded queues, no priority, (amortized) preempt to child tasks
    • Breadth-First: Unbounded queues, stage at a time, top-to-bottom
    • Static: Unbounded queues, offline per-thread schedule using SAS / SGMS
Seeing is believing (ray tracer)

[Figure: ray tracer execution visualizations for GRAMPS, Breadth-First, Task-Stealing, and Static (SAS)]
Comparison: Execution time

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]

  • Mostly similar: good parallelism, load balance
Comparison: Execution time (continued)

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]

  • Breadth-First can exhibit load imbalance
Comparison: Execution time (continued)

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]

  • Task-Stealing can ping-pong and cause contention

Comparison: Footprint

[Figure: relative packet footprint versus GRAMPS (log scale)]

  • Breadth-First is pathological (as expected)
Footprint: GRAMPS & Task-Stealing

[Figure: packet footprint for the ray tracer and MapReduce applications]

GRAMPS gets insight from the graph:

  • (Application-specified) queue bounds
  • Group tasks by stage for priority, preemption
Static scheduling is challenging

[Figure: packet footprint and execution time for static schedules]

  • Generating good Static schedules is *hard*.
  • Static schedules are fragile:
    • Small mismatches compound
    • Hardware itself is dynamic (cache traffic, IRQs, …)
  • Limited upside: dynamic scheduling is cheap!
Discussion (for multi-core CPUs)
  • Adaptive scheduling is the obvious choice.
    • Better load-balance / handling of irregularity
  • Semantic insight (app graph) gives a big advantage in managing footprint.
  • More cores, development maturity → more complex graphs and thus more advantage.
Contributions Revisited
  • GRAMPS programming model design
    • Graph of heterogeneous stages and queues
  • Good results from actual implementation
    • Broad scope: Wide range of applications
    • Multi-platform: Three different architectures
    • Performance: High parallelism, good footprint
Anecdotes and intuitions
  • Structure helps: an explicit graph is handy.
  • Simple (principled) dynamic scheduling works.
  • Queues impedance-match heterogeneity.
  • Graphs with cycles and push both paid off.
  • (Also: Paired instrumentation and visualization help enormously)
Conclusion: Future trends revisited
  • Core counts are increasing
    • Parallel programming models
  • Memory and bandwidth are precious
    • Working set, locality (i.e., footprint) management
  • Power, performance driving heterogeneity
    • All ‘cores’ need to communicate, interoperate
  • GRAMPS fits them well.
Thanks
  • Eric, for agreeing to make this happen.
  • Christos, for throwing helpers at me.
  • Kurt, Mendel, and Pat, for, well, a lot.
  • John Gerth, for tireless computer servitude.
  • Melissa (and Heather and Ada before her)
Thanks
  • My practice audiences
  • My many collaborators
  • Daniel, Kayvon, Mike, Tim
  • Supporters at NVIDIA, ATI/AMD, Intel
  • Supporters at VMware
  • Everyone who entertained, informed, challenged me, and made me think
Thanks
  • My funding agencies:
    • Rambus Stanford Graduate Fellowship
    • Department of the Army Research
    • Stanford Pervasive Parallelism Laboratory
Q&A
  • Thank you for listening!
  • Questions?
Tunability
  • Diagnosis:
    • Raw counters, statistics, logs
    • Grampsviz
  • Optimize / Control:
    • Graph topology (e.g., sort-middle vs. sort-last)
    • Queue watermarks (e.g., 10x win for ray tracing)
    • Packet size: Match SIMD widths, share data
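All of these knobs are ordinary parameters chosen when the graph is built, so tuning is an edit to the setup code rather than to the stages. A tiny illustration, continuing the earlier hypothetical sketches (names and values made up):

```cpp
// Hypothetical illustration: packet size and queue depth are graph-construction
// parameters, so the knobs above are tuned here. Names and values are made up.
#include <cstddef>
#include "gramps.h"   // hypothetical header

void buildTunedQueue(gramps::Graph& graph) {
    constexpr std::size_t kElementBytes = 32;   // e.g., one ray
    constexpr std::size_t kSimdWidth    = 16;   // pack to match SIMD width
    constexpr std::size_t kBaseDepth    = 8;

    graph.addQueue("samples",
                   /*packetBytes=*/ kSimdWidth * kElementBytes,  // packet-size knob
                   /*maxPackets=*/  10 * kBaseDepth);            // queue-depth knob
}
```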
Tunability: Grampsviz (1)
  • GPU-Like: Rasterization pipeline
Tunability: Grampsviz (2)
  • CPU-Like: Histogram (MapReduce)

[Figure: Grampsviz view showing the Reduce and Combine stages]

Tunability: Knobs
  • Graph topology/design: sort-middle vs. sort-last
  • Sizing critical queues

[Figure: sort-middle and sort-last graph variants]
A few other tidbits
  • In-place Shader stages / coalescing inputs
  • Instanced Thread stages
  • Queues as barriers / read all-at-once

[Figure: Image Histogram pipeline]

Performance: Good scale-out

[Figure: parallel speedup vs. hardware threads]

  • (Footprint: Good; detail a little later)
Seeing is believing (ray tracer)

[Figure: ray tracer execution visualizations for GRAMPS, Task-Stealing, Static (SAS), and Breadth-First]
Comparison: Execution time

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]

  • Small ‘Sched’ time, even with large graphs