Programming Many-Core Systems with GRAMPS

Jeremy Sugerman

14 May 2010


The single fast core era is over

  • Trends:

    • Changing Metrics: ‘scale out’, not just ‘scale up’

    • Increasing diversity: many different mixes of ‘cores’

  • Today’s (and tomorrow’s) machines: commodity, heterogeneous, many-core

  • Problem: How does one program all this complexity?!


High-level programming models

  • Two major advantages over threads & locks

    • Constructs to express/expose parallelism

    • Scheduling support to help manage concurrency, communication, and synchronization

  • Widespread in research and industry: OpenGL/Direct3D, SQL, Brook, Cilk, CUDA, OpenCL, StreamIt, TBB, …


My biases: workloads

  • Interesting applications have irregularity

  • Large bundles of coherent work are efficient

  • Producer-consumer idiom is important

    Goal: Rebuild coherence dynamically by aggregating related work as it is generated.


My target audience

  • Highly informed, but (good) lazy

    • Understands the hardware and best practices

    • Dislikes rote; prefers power over constraints

      Goal: Let systems-savvy developers efficiently develop programs that efficiently map onto their hardware.


Contributions: Design of GRAMPS

  • Programs are graphs of stages and queues

  • Queues:

    • Maximum capacities, Packet sizes

  • Stages:

    • No, limited, or total automatic parallelism

    • Fixed, variable, or reduction (in-place) outputs

[Figure: Simple Graphics Pipeline]


Contributions: Implementation

  • Broad application scope:

    • Rendering, MapReduce, image processing, …

  • Multi-platform applicability:

    • GRAMPS runtimes for three architectures

  • Performance:

    • Scale-out parallelism, controlled data footprint

    • Compares well to schedulers from other models

  • (Also: Tunable)


Outline

  • GRAMPS overview

  • Study 1: Future graphics architectures

  • Study 2: Current multi-core CPUs

  • Comparison with schedulers from other parallel programming models


GRAMPS Overview


GRAMPS

  • Programs are graphs of stages and queues

    • Expose the program structure

    • Leave the program internals unconstrained


Writing a GRAMPS program

  • Design the application graph and queues:

[Figure: Cookie Dough Pipeline]

  • Design the stages

  • Instantiate and launch.

Credit: http://www.foodnetwork.com/recipes/alton-brown/the-chewy-recipe/index.html
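
To make the three steps concrete, here is a minimal self-contained C++ sketch of the same shape: one queue connecting a producer stage to a consumer stage, instantiated and launched from main. It uses only the standard library; the types and names are illustrative, not the GRAMPS API.

```cpp
// Illustration only -- not the GRAMPS API. It mirrors the three steps above:
// (1) design the graph and queues, (2) design the stages, (3) instantiate and launch.
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Step 1: a queue type that connects two stages (unbounded here for brevity;
// GRAMPS queues are bounded and move whole packets).
template <typename Packet>
class StageQueue {
public:
    void push(Packet p) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(std::move(p));
        cv_.notify_one();
    }
    bool pop(Packet& out) {  // returns false once the queue is closed and drained
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }
    void close() {
        std::lock_guard<std::mutex> lock(m_);
        closed_ = true;
        cv_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Packet> q_;
    bool closed_ = false;
};

// Step 2: stages are just code that reads and writes queues.
void makeDough(StageQueue<int>& out) {       // producer stage ("make dough")
    for (int i = 0; i < 16; ++i) out.push(i);
    out.close();
}
void bake(StageQueue<int>& in) {             // consumer stage ("bake")
    int packet;
    while (in.pop(packet)) std::printf("baked packet %d\n", packet);
}

// Step 3: instantiate the graph and launch it.
int main() {
    StageQueue<int> doughToOven;
    std::thread producer(makeDough, std::ref(doughToOven));
    std::thread consumer(bake, std::ref(doughToOven));
    producer.join();
    consumer.join();
    return 0;
}
```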


Queues

  • Bounded size, operate at “packet” granularity

    • “Opaque” and “Collection” packets

  • GRAMPS can optionally preserve ordering

    • Required for some workloads, adds overhead
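
For a rough picture of the two packet flavors, here is a sketch with assumed names and layouts (not the GRAMPS packet formats): an Opaque packet is a fixed-size blob the runtime never inspects, while a Collection packet carries a header plus individually addressable elements.

```cpp
// Assumed layouts, for illustration only (not the GRAMPS packet formats).
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPacketBytes = 1024;    // queues move data at packet granularity

// Opaque packet: the runtime moves it between stages but never looks inside.
struct OpaquePacket {
    std::uint8_t payload[kPacketBytes];
};

// Collection packet: a small header plus a set of individually addressable elements,
// which is what lets Shader stages run one instance per element and lets the runtime
// coalesce partially full packets.
struct CollectionPacket {
    std::uint32_t numElements;                // how many element slots are valid
    std::uint32_t elementBytes;               // size of each element
    std::uint8_t  elements[kPacketBytes];     // element storage
};
```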


Thread (and Fixed) stages

  • Preemptible, long-lived, stateful

    • Often merge, compare, or repack inputs

  • Queue operations: Reserve/Commit

  • (Fixed: Thread stages in custom hardware)
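
The Reserve/Commit idiom can be pictured with a toy bounded ring buffer: reserve a slot, work on the packet in place, then commit so it becomes visible to the other side. A single-producer / single-consumer sketch, for illustration only (not the GRAMPS runtime):

```cpp
// Toy single-producer / single-consumer ring buffer with two-phase reserve/commit,
// in the spirit of the Thread-stage queue operations. Illustration only.
#include <array>
#include <condition_variable>
#include <cstddef>
#include <mutex>

template <typename Packet, std::size_t Capacity>
class BoundedQueue {
public:
    // Producer: block until a slot is free, then fill the returned packet in place.
    Packet* reserveForWrite() {
        std::unique_lock<std::mutex> lock(m_);
        notFull_.wait(lock, [&] { return count_ < Capacity; });
        return &slots_[tail_];
    }
    void commitWrite() {            // make the reserved packet visible downstream
        std::lock_guard<std::mutex> lock(m_);
        tail_ = (tail_ + 1) % Capacity;
        ++count_;
        notEmpty_.notify_one();
    }
    // Consumer: block until a packet is ready, read it in place, then release it.
    Packet* reserveForRead() {
        std::unique_lock<std::mutex> lock(m_);
        notEmpty_.wait(lock, [&] { return count_ > 0; });
        return &slots_[head_];
    }
    void commitRead() {             // free the slot so the producer can reuse it
        std::lock_guard<std::mutex> lock(m_);
        head_ = (head_ + 1) % Capacity;
        --count_;
        notFull_.notify_one();
    }
private:
    std::array<Packet, Capacity> slots_;
    std::size_t head_ = 0, tail_ = 0, count_ = 0;
    std::mutex m_;
    std::condition_variable notEmpty_, notFull_;
};
```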


Shader stages

  • Automatically parallelized:

    • Horde of non-preemptible, stateless instances

    • Pre-reserve/post-commit

  • Push: Variable/conditional output support

    • GRAMPS coalesces elements into full packets
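
Push output can be pictured as a coalescing buffer: each Shader instance emits zero or more elements, and the runtime packs them into full packets before they travel downstream. A single-threaded sketch of that behavior (illustrative only; the real runtime coalesces across many instances):

```cpp
// Single-threaded sketch of push-style output coalescing. Illustration only; the
// GRAMPS runtime coalesces output from many Shader instances at once.
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

template <typename Element>
class Coalescer {
public:
    Coalescer(std::size_t elementsPerPacket,
              std::function<void(std::vector<Element>)> emitPacket)
        : elementsPerPacket_(elementsPerPacket), emitPacket_(std::move(emitPacket)) {}

    // Called by a shader instance for each element it conditionally produces.
    void push(Element e) {
        buffer_.push_back(std::move(e));
        if (buffer_.size() == elementsPerPacket_) flush();  // full packet: send it on
    }
    // Called when input is exhausted so a partial packet is not lost.
    void flush() {
        if (buffer_.empty()) return;
        emitPacket_(std::move(buffer_));
        buffer_.clear();
    }
private:
    std::size_t elementsPerPacket_;
    std::function<void(std::vector<Element>)> emitPacket_;
    std::vector<Element> buffer_;
};
```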


Queue sets: Mutual exclusion

[Figure: Cookie Dough Pipeline]

  • Independent exclusive (serial) subqueues

    • Created statically or on first output

    • Densely or sparsely indexed

  • Bonus: Automatically instanced Thread stages


Queue sets: Mutual exclusion

[Figure: Cookie Dough (with queue set)]

  • Independent exclusive (serial) subqueues

    • Created statically or on first output

    • Densely or sparsely indexed

  • Bonus: Automatically instanced Thread stages
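
One way to picture a queue set: output is routed to one of N subqueues, either by naming the bin directly or by hashing a sparse key, so work for the same bin is serialized while different bins proceed in parallel. A routing-only sketch with assumed names (no bounding or locking shown):

```cpp
// Routing-only sketch of a queue set (names assumed; not the GRAMPS API). Each
// subqueue is drained serially by one consumer -- e.g. an automatically instanced
// Thread stage -- so packets for the same bin never run concurrently.
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

template <typename Packet>
struct QueueSet {
    explicit QueueSet(std::size_t numSubqueues) : subqueues(numSubqueues) {}

    // Densely indexed: the producer names the bin directly.
    void pushTo(std::size_t bin, Packet p) {
        subqueues[bin].push_back(std::move(p));
    }
    // Sparsely indexed: hash an application key onto a bin.
    void pushByKey(std::size_t key, Packet p) {
        pushTo(std::hash<std::size_t>{}(key) % subqueues.size(), std::move(p));
    }

    std::vector<std::vector<Packet>> subqueues;   // stand-ins for real bounded queues
};
```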


A few other tidbits

  • In-place Shader stages / coalescing inputs

  • Instanced Thread stages

  • Queues as barriers / read all-at-once


Formative influences

  • The Graphics Pipeline, early GPGPU

  • “Streaming”

  • Work-queues and task-queues


Study 1: Future Graphics Architectures

(with Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan; appeared in ACM Transactions on Graphics, January 2009)


Graphics is a natural first domain

  • Table stakes for commodity parallelism

  • GPUs are full of heterogeneity

  • Poised to transition from fixed/configurable pipeline to programmable

  • We have a lot of experience in it


The Graphics Pipeline in GRAMPS

  • Graph and setup are (application) software

    • Can be customized or completely replaced

  • Like the transition to programmable shading

    • Not (unthinkably) radical

  • Fits current hw: FIFOs, cores, rasterizer, …


Reminder: Design goals

  • Broad application scope

  • Multi-platform applicability

  • Performance: scale-out, footprint-aware


The Experiment

  • Three renderers:

    • Rasterization, Ray Tracer, Hybrid

  • Two simulated future architectures

    • Simple scheduler for each


Scope: Two(-plus) renderers

[Figures: Rasterization Pipeline (with ray tracing extension), Ray Tracing Extension, Ray Tracing Graph]


Platforms: Two simulated systems

  • CPU-Like: 8 Fat Cores, Rast

  • GPU-Like: 1 Fat Core, 4 Micro Cores, Rast, Sched


Performance— Metrics

“Maximize machine utilization while keeping working sets small”

  • Priority #1: Scale-out parallelism

    • Parallel utilization

  • Priority #2: ‘Reasonable’ bandwidth / storage

    • Worst case total footprint of all queues

    • Inherently a trade-off versus utilization
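
Read literally, the worst-case bound is the sum over queues of maximum capacity times packet size; here is a one-function sketch of that bookkeeping, with assumed inputs:

```cpp
// Bookkeeping sketch: worst case, every queue is full, so the bound is the sum over
// queues of (maximum capacity in packets) x (packet size in bytes). Inputs assumed.
#include <cstddef>
#include <vector>

struct QueueSpec {
    std::size_t capacityPackets;   // application-specified maximum capacity
    std::size_t packetBytes;       // packet size for this queue
};

std::size_t worstCaseFootprintBytes(const std::vector<QueueSpec>& queues) {
    std::size_t total = 0;
    for (const QueueSpec& q : queues) total += q.capacityPackets * q.packetBytes;
    return total;
}
```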


Performance— Scheduling

Simple prototype scheduler (both platforms):

  • Static stage priorities:

  • Only preempt on Reserve and Commit

  • No dynamic weighting of current queue sizes

[Figure: fixed stage priorities across the graph, from (Lowest) to (Highest)]
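
In outline, the policy is: whenever the running stage blocks at a Reserve or Commit, switch to the highest-priority stage that currently has work. A compact paraphrase in code (not the simulator's scheduler):

```cpp
// Paraphrase of the policy above, not the simulator's code: stage priorities are
// fixed when the graph is built, and a core only re-picks a stage when the one it
// is running blocks at a Reserve or Commit.
#include <vector>

struct Stage {
    int  priority;   // static, assigned at graph build time
    bool runnable;   // inputs available and room in the output queue
};

const Stage* pickNextStage(const std::vector<Stage>& stages) {
    const Stage* best = nullptr;
    for (const Stage& s : stages) {
        if (s.runnable && (best == nullptr || s.priority > best->priority)) best = &s;
    }
    return best;   // nullptr: nothing runnable, so the core idles until a queue changes
}
```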


Performance— Results

  • Utilization: 95+% for all but rasterized fairy (~80%).

  • Footprint: < 600KB CPU-like, < 1.5MB GPU-like

  • Surprised how well the simple scheduler worked

  • Maintaining order costs footprint


Study 2: Current Multi-core CPUs

(with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)


Reminder: Design goals

  • Broad application scope

  • Multi-platform applicability

  • Performance: scale-out, footprint-aware


The Experiment

  • 9 applications, 13 configurations

  • One (more) architecture: multi-core x86

    • It’s real (no simulation here)

    • Built with pthreads, locks, and atomics

      • Per-pthread task-priority-queues with work-stealing

    • More advanced scheduling


Scope: Application bonanza

  • GRAMPS: Ray tracer (0, 1 bounce), Spheres (no rasterization, though)

  • MapReduce: Hist (reduce / combine), LR (reduce / combine), PCA

  • Cilk(-like): Mergesort

  • CUDA: Gaussian, SRAD

  • StreamIt: FM, TDE


Scope: Many different idioms

[Figure: application graphs for the Ray Tracer, FM, MapReduce, SRAD, and Merge Sort]


Platform: 2xQuad-core Nehalem

  • Queues: copy in/out, global (shared) buffer

  • Threads: user-level scheduled contexts

  • Shaders: create one task per input packet

  • Native: 8 HyperThreaded Core i7’s


Performance— Metrics (Reminder)

“Maximize machine utilization while keeping working sets small”

  • Priority #1: Scale-out parallelism

  • Priority #2: ‘Reasonable’ bandwidth / storage


Performance– Scheduling

  • Static per-stage priorities (still)

  • Work-stealing task-priority-queues

  • Eagerly create one task per packet (naïve)

  • Keep running stages until a low watermark

    • (Limited dynamic weighting of queue depths)
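
Putting those pieces together: each worker owns a priority queue of tasks keyed by static stage priority and falls back to stealing when it runs dry. A deliberately simplified sketch of the selection logic (no locks, no watermark checks, arbitrary victim choice):

```cpp
// Deliberately simplified, single-threaded sketch of per-worker task-priority-queues
// with stealing. Not the GRAMPS runtime: locking, watermarks, and victim selection
// are omitted to show only the shape of the policy.
#include <cstddef>
#include <optional>
#include <queue>
#include <vector>

struct Task {
    int stagePriority;      // static priority from the application graph
    std::size_t packetId;   // one task is created per input packet
    bool operator<(const Task& other) const { return stagePriority < other.stagePriority; }
};

struct Worker {
    std::priority_queue<Task> tasks;   // highest stage priority on top
};

std::optional<Task> nextTask(std::vector<Worker>& workers, std::size_t self) {
    if (!workers[self].tasks.empty()) {                 // run local work first
        Task t = workers[self].tasks.top();
        workers[self].tasks.pop();
        return t;
    }
    for (std::size_t v = 0; v < workers.size(); ++v) {  // otherwise steal from a victim
        if (v != self && !workers[v].tasks.empty()) {
            Task t = workers[v].tasks.top();
            workers[v].tasks.pop();
            return t;
        }
    }
    return std::nullopt;                                // nothing anywhere: go idle
}
```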


Performance– Good Scale-out

  • (Footprint: Good; detail a little later)

[Figure: parallel speedup vs. hardware threads]


Performance– Low Overheads

  • ‘App’ and ‘Queue’ time are both useful work.

[Figure: execution time breakdown (8 cores / 16 hyperthreads), as a percentage of execution]


Comparison with Other Schedulers

(with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)


Three archetypes

  • Task-Stealing: (Cilk, TBB)

    • Low overhead with fine granularity tasks

    • No producer-consumer, priorities, or data-parallel support

  • Breadth-First: (CUDA, OpenCL)

    • Simple scheduler (one stage at a time)

    • No producer-consumer, no pipeline parallelism

  • Static: (StreamIt / Streaming)

    • No runtime scheduler; complex schedules

    • Cannot adapt to irregular workloads


GRAMPS is a natural framework


The Experiment

  • Re-use the exact same application code

  • Modify the scheduler per archetype:

    • Task-Stealing: Unbounded queues, no priority, (amortized) preempt to child tasks

    • Breadth-First: Unbounded queues, stage at a time, top-to-bottom

    • Static: Unbounded queues, offline per-thread schedule using SAS / SGMS


Seeing is believing (ray tracer)

[Figure: ray tracer execution visualizations under GRAMPS, Breadth-First, Task-Stealing, and Static (SAS) scheduling]


Comparison: Execution time

  • Mostly similar: good parallelism, load balance

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]


Comparison: Execution time

  • Breadth-first can exhibit load-imbalance

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]


Comparison: Execution time

  • Task-stealing can ping-pong, cause contention

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]


Comparison: Footprint

  • Breadth-First is pathological (as expected)

[Figure: relative packet footprint versus GRAMPS (log scale)]


Footprint: GRAMPS & Task-Stealing

[Figures: relative packet footprint and relative task footprint]


Footprint: GRAMPS & Task-Stealing

GRAMPS gets insight from the graph:

  • (Application-specified) queue bounds

  • Group tasks by stage for priority, preemption

[Figure: footprint detail for the MapReduce and Ray Tracer applications]


Static scheduling is challenging

[Figures: packet footprint and execution time under Static scheduling]

  • Generating good Static schedules is *hard*.

  • Static schedules are fragile:

    • Small mismatches compound

    • Hardware itself is dynamic (cache traffic, IRQs, …)

  • Limited upside: dynamic scheduling is cheap!


Discussion (for multi-core CPUs)

  • Adaptive scheduling is the obvious choice.

    • Better load-balance / handling of irregularity

  • Semantic insight (app graph) gives a big advantage in managing footprint.

  • More cores, development maturity → more complex graphs and thus more advantage.


Conclusion


Contributions Revisited

  • GRAMPS programming model design

    • Graph of heterogeneous stages and queues

  • Good results from actual implementation

    • Broad scope: Wide range of applications

    • Multi-platform: Three different architectures

    • Performance: High parallelism, good footprint


Anecdotes and intuitions

  • Structure helps: an explicit graph is handy.

  • Simple (principled) dynamic scheduling works.

  • Queues impedance-match heterogeneity.

  • Graphs with cycles and push both paid off.

  • (Also: Paired instrumentation and visualization help enormously)


Conclusion: Future trends revisited

  • Core counts are increasing

    • Parallel programming models

  • Memory and bandwidth are precious

    • Working set, locality (i.e., footprint) management

  • Power, performance driving heterogeneity

    • All ‘cores’ need to communicate, interoperate

  • GRAMPS fits them well.


Thanks

  • Eric, for agreeing to make this happen.

  • Christos, for throwing helpers at me.

  • Kurt, Mendel, and Pat, for, well, a lot.

  • John Gerth, for tireless computer servitude.

  • Melissa (and Heather and Ada before her)


Thanks

  • My practice audiences

  • My many collaborators

  • Daniel, Kayvon, Mike, Tim

  • Supporters at NVIDIA, ATI/AMD, Intel

  • Supporters at VMware

  • Everyone who entertained, informed, challenged me, and made me think


Thanks

  • My funding agencies:

    • Rambus Stanford Graduate Fellowship

    • Department of the Army Research

    • Stanford Pervasive Parallelism Laboratory


Q&A

  • Thank you for listening!

  • Questions?


Extra Material (Backup)


Data: CPU-Like & GPU-Like


Footprint Data: Native


Tunability

  • Diagnosis:

    • Raw counters, statistics, logs

    • Grampsviz

  • Optimize / Control:

    • Graph topology (e.g., sort-middle vs. sort-last)

    • Queue watermarks (e.g., 10x win for ray tracing)

    • Packet size: Match SIMD widths, share data


Tunability– Grampsviz (1)

  • GPU-Like: Rasterization pipeline


Tunability– Grampsviz (2)

  • CPU-Like: Histogram (MapReduce)

[Figure: Grampsviz view; labeled stages: Reduce, Combine]


Tunability– Knobs

  • Graph topology/design: sort-middle vs. sort-last

  • Sizing critical queues:

[Figure: sort-middle and sort-last graph variants]


Alternatives


A few other tidbits

  • In-place Shader stages / coalescing inputs

  • Instanced Thread stages

  • Queues as barriers / read all-at-once

[Figure: Image Histogram Pipeline]


Performance– Good Scale-out

  • (Footprint: Good; detail a little later)

[Figure: parallel speedup vs. hardware threads]


Seeing is believing (ray tracer)

[Figure: ray tracer execution visualizations under GRAMPS, Task-Stealing, Static (SAS), and Breadth-First scheduling]


Comparison: Execution time

  • Small ‘Sched’ time, even with large graphs

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]

