Programming Many-Core Systems with GRAMPS

Jeremy Sugerman
14 May 2010

Presentation Transcript
The single fast core era is over
  • Trends:
    • Changing metrics: ‘scale out’, not just ‘scale up’
    • Increasing diversity: many different mixes of ‘cores’
  • Today’s (and tomorrow’s) machines: commodity, heterogeneous, many-core

Problem: How does one program all this complexity?!

High-level programming models
  • Two major advantages over threads & locks
    • Constructs to express/expose parallelism
    • Scheduling support to help manage concurrency, communication, and synchronization
  • Widespread in research and industry: OpenGL/Direct3D, SQL, Brook, Cilk, CUDA, OpenCL, StreamIt, TBB, …
My biases: workloads
  • Interesting applications have irregularity
  • Large bundles of coherent work are efficient
  • Producer-consumer idiom is important

Goal: Rebuild coherence dynamically by aggregating related work as it is generated.

My target audience
  • Highly informed, but (good) lazy
    • Understands the hardware and best practices
    • Dislikes rote; prefers power over constraints

Goal: Let systems-savvy developers efficiently develop programs that efficiently map onto their hardware.

Contributions: Design of GRAMPS
  • Programs are graphs of stages and queues
  • Queues:
    • Maximum capacities, Packet sizes
  • Stages:
    • No, limited, or total automatic parallelism
    • Fixed, variable, or reduction (in-place) outputs

[Figure: Simple Graphics Pipeline]

Contributions: Implementation
  • Broad application scope:
    • Rendering, MapReduce, image processing, …
  • Multi-platform applicability:
    • GRAMPS runtimes for three architectures
  • Performance:
    • Scale-out parallelism, controlled data footprint
    • Compares well to schedulers from other models
  • (Also: Tunable)
Outline
  • GRAMPS overview
  • Study 1: Future graphics architectures
  • Study 2: Current multi-core CPUs
  • Comparison with schedulers from other parallel programming models
GRAMPS
  • Programs are graphs of stages and queues
    • Expose the program structure
    • Leave the program internals unconstrained
Writing a GRAMPS program
  • Design the application graph and queues:
  • Design the stages
  • Instantiate and launch.

[Figure: Cookie Dough pipeline] (image credit: http://www.foodnetwork.com/recipes/alton-brown/the-chewy-recipe/index.html)
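The three steps above boil down to a small amount of setup code. The sketch below is purely illustrative: the header, namespace, and function names (gramps.h, gramps::Graph, addQueue, addThreadStage, addShaderStage, launch) are hypothetical stand-ins, not the actual GRAMPS interfaces; the graph mirrors the cookie dough example.

```cpp
// Hypothetical sketch of building and launching a GRAMPS-style graph.
// All gramps::* names are illustrative stand-ins, not the real API.
#include "gramps.h"   // hypothetical header

// Stage bodies (app-specific; declared here, defined elsewhere).
void MixerMain(gramps::ThreadContext& ctx);    // stateful Thread stage
void ShapeKernel(gramps::ShaderContext& ctx);  // data-parallel Shader stage
void OvenMain(gramps::ThreadContext& ctx);     // stateful Thread stage

int main() {
    gramps::Graph graph;

    // 1. Design the application graph and queues: bounded queues that carry
    //    fixed-size packets between stages.
    gramps::Queue* doughQ  = graph.addQueue("dough",   /*packetBytes=*/4096,
                                            /*maxPackets=*/32);
    gramps::Queue* cookieQ = graph.addQueue("cookies", 4096, 32);

    // 2. Design the stages and wire them to their input/output queues.
    graph.addThreadStage("Mixer",  /*inputs=*/{},        /*outputs=*/{doughQ},  MixerMain);
    graph.addShaderStage("Shaper", /*inputs=*/{doughQ},  /*outputs=*/{cookieQ}, ShapeKernel);
    graph.addThreadStage("Oven",   /*inputs=*/{cookieQ}, /*outputs=*/{},        OvenMain);

    // 3. Instantiate and launch; returns once all stages finish and queues drain.
    graph.launch();
    return 0;
}
```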

Queues
  • Bounded size, operate at “packet” granularity
    • “Opaque” and “Collection” packets
  • GRAMPS can optionally preserve ordering
    • Required for some workloads, adds overhead
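To make “packet granularity” concrete, a packet can be pictured as a fixed-size buffer that is either opaque to the runtime or a Collection whose element count the runtime can inspect. The layouts below are an illustration only, not the actual GRAMPS packet format.

```cpp
// Illustrative packet layouts only; sizes and fields are not the real format.
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPacketBytes = 4096;

// Opaque packet: the runtime moves it between queues but never looks inside.
struct OpaquePacket {
    std::uint8_t bytes[kPacketBytes];
};

// Collection packet: a small header the runtime understands (a count of valid
// elements), followed by the elements that Shader instances consume.
struct CollectionPacket {
    std::uint32_t numElements;   // how many element slots are valid
    std::uint32_t elementBytes;  // size of each element
    std::uint8_t  data[kPacketBytes - 2 * sizeof(std::uint32_t)];
};
```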
Thread (and Fixed) stages
  • Preemptible, long-lived, stateful
    • Often merge, compare, or repack inputs
  • Queue operations: Reserve/Commit
  • (Fixed: Thread stages in custom hardware)
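A Thread stage’s Reserve/Commit discipline can be pictured as the loop below. The call names (reserveInput, commitOutput, and so on) are hypothetical, but the structure matches the description above: the stage is stateful and long-lived, and the runtime can only preempt it at queue operations.

```cpp
// Hypothetical Thread stage body illustrating the Reserve/Commit idiom.
// All gramps::* names are illustrative stand-ins for the real API.
#include "gramps.h"   // hypothetical header

void mergeAndRepack(const gramps::Window& in, gramps::Window& out);  // app-specific

void MergeMain(gramps::ThreadContext& ctx) {
    for (;;) {
        // Reserve a window of input packets; the runtime may preempt here.
        gramps::Window in = ctx.reserveInput(/*queue=*/0, /*numPackets=*/2);
        if (in.empty()) break;   // upstream stages have finished

        // Reserve space on the output queue (may also block / preempt).
        gramps::Window out = ctx.reserveOutput(/*queue=*/0, /*numPackets=*/1);

        mergeAndRepack(in, out);  // stateful, application-specific work

        // Commit returns the reserved packets to the runtime.
        ctx.commitInput(in);
        ctx.commitOutput(out);
    }
}
```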
Shader stages
  • Automatically parallelized:
    • Horde of non-preemptible, stateless instances
    • Pre-reserve/post-commit
  • Push: Variable/conditional output support
    • GRAMPS coalesces elements into full packets
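A Shader stage, by contrast, is written as a stateless kernel that the runtime instances once per input packet, with push used for conditional output. Again, every name below (types, context calls, helpers) is a hypothetical stand-in.

```cpp
// Hypothetical Shader stage kernel: one stateless, non-preemptible instance
// runs per input packet. push() emits a variable number of elements that
// the runtime coalesces into full packets. Names are illustrative only.
#include <cstdint>
#include "gramps.h"   // hypothetical header

struct Ray { /* origin, direction, ... */ };
struct Hit { /* position, normal, ... */ };
Ray  loadRay(const gramps::CollectionPacket& p, std::uint32_t i);  // app-specific
bool traceRay(const Ray& r, Hit* h);                               // app-specific
gramps::Element makeShadingWork(const Hit& h);                     // app-specific

void ShadeKernel(gramps::ShaderContext& ctx) {
    const gramps::CollectionPacket& in = ctx.inputPacket();
    for (std::uint32_t i = 0; i < in.numElements; ++i) {
        Ray ray = loadRay(in, i);
        Hit hit;
        if (traceRay(ray, &hit)) {
            // Conditional output: only rays that hit something produce work.
            ctx.push(/*outputQueue=*/0, makeShadingWork(hit));
        }
    }
    // No explicit Reserve/Commit: the runtime pre-reserves the input packet
    // and post-commits any pushed output around each instance.
}
```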
Queue sets: Mutual exclusion
[Figure: Cookie Dough pipeline]
  • Independent exclusive (serial) subqueues
    • Created statically or on first output
    • Densely or sparsely indexed
  • Bonus: Automatically instanced Thread stages
Queue sets: Mutual exclusion (continued)
[Figure: Cookie Dough pipeline, with a queue set]
  • Independent exclusive (serial) subqueues
    • Created statically or on first output
    • Densely or sparsely indexed
  • Bonus: Automatically instanced Thread stages
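As a concrete illustration in the style of the earlier sketches, a binning stage can route each element to an exclusive subqueue by key; each subqueue is drained serially, and the consuming Thread stage can be instanced once per active subqueue. None of the names below are the real GRAMPS API.

```cpp
// Hypothetical queue-set sketch (illustrative API, not the real one): output
// is routed to a subqueue chosen by a key; each subqueue is consumed serially
// (mutual exclusion), and the consuming Thread stage is auto-instanced.
#include <cstdint>
#include "gramps.h"   // hypothetical header

void BucketKernel(gramps::ShaderContext& ctx);   // defined below
void ReduceMain(gramps::ThreadContext& ctx);     // app-specific, one per subqueue

void buildBinningGraph(gramps::Graph& graph, gramps::Queue* samples) {
    // 256 densely indexed exclusive subqueues (e.g., one per bin group).
    gramps::QueueSet* bins =
        graph.addQueueSet("bins", /*packetBytes=*/4096, /*maxPackets=*/64,
                          /*numSubqueues=*/256);
    graph.addShaderStage("Bucket", {samples}, {bins}, BucketKernel);
    graph.addThreadStage("Reduce", {bins}, {}, ReduceMain);  // auto-instanced
}

void BucketKernel(gramps::ShaderContext& ctx) {
    const gramps::CollectionPacket& in = ctx.inputPacket();
    for (std::uint32_t i = 0; i < in.numElements; ++i) {
        std::uint32_t value = ctx.loadElement<std::uint32_t>(in, i);
        // The key selects which exclusive subqueue receives this element.
        ctx.pushToSubqueue(/*outputQueue=*/0, /*key=*/value & 0xFF, value);
    }
}
```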
A few other tidbits
  • In-place Shader stages / coalescing inputs
  • Instanced Thread stages
  • Queues as barriers / read all-at-once
Formative influences
  • The Graphics Pipeline, early GPGPU
  • “Streaming”
  • Work-queues and task-queues

Study 1: Future Graphics Architectures

(with Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, and Pat Hanrahan; appeared in ACM Transactions on Graphics, January 2009)

Graphics is a natural first domain
  • Table stakes for commodity parallelism
  • GPUs are full of heterogeneity
  • Poised to transition from fixed/configurable pipeline to programmable
  • We have a lot of experience in it
The Graphics Pipeline in GRAMPS
  • Graph, setup are (application) software
    • Can be customized or completely replaced
  • Like the transition to programmable shading
    • Not (unthinkably) radical
  • Fits current hw: FIFOs, cores, rasterizer, …
Reminder: Design goals
  • Broad application scope
  • Multi-platform applicability
  • Performance: scale-out, footprint-aware
The Experiment
  • Three renderers:
    • Rasterization, Ray Tracer, Hybrid
  • Two simulated future architectures
    • Simple scheduler for each
Platforms: Two simulated systems

  • CPU-Like: 8 fat cores, plus a rasterizer unit
  • GPU-Like: 1 fat core, 4 micro cores, plus rasterizer and scheduler units

Performance: Metrics

“Maximize machine utilization while keeping working sets small”

  • Priority #1: Scale-out parallelism
    • Parallel utilization
  • Priority #2: ‘Reasonable’ bandwidth / storage
    • Worst case total footprint of all queues
    • Inherently a trade-off versus utilization
Performance: Scheduling

Simple prototype scheduler (both platforms):

  • Static stage priorities:
  • Only preempt on Reserve and Commit
  • No dynamic weighting of current queue sizes

[Figure: static stage priorities, ordered from lowest to highest]
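In other words, a scheduling decision is made only when a stage blocks at Reserve or Commit, and the choice is simply the highest-priority stage with runnable work. A minimal, self-contained sketch of that selection rule (illustrative, not the simulator's code):

```cpp
// Illustrative only: the prototype policy reduces to "run the runnable stage
// with the highest static priority", re-evaluated only at Reserve/Commit.
#include <array>
#include <cstddef>

constexpr std::size_t kNumStages = 8;   // illustrative pipeline depth

// pending[s] = runnable packets for stage s; larger s = higher static priority.
int pickStage(const std::array<std::size_t, kNumStages>& pending) {
    for (int s = static_cast<int>(kNumStages) - 1; s >= 0; --s) {
        if (pending[static_cast<std::size_t>(s)] > 0) return s;
    }
    return -1;   // nothing runnable
}
```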

Performance: Results
  • Utilization: 95+% for all but rasterized fairy (~80%).
  • Footprint: < 600KB CPU-like, < 1.5MB GPU-like
  • Surprised how well the simple scheduler worked
  • Maintaining order costs footprint

Study 2: Current Multi-core CPUs

(with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)

Reminder: Design goals
  • Broad application scope
  • Multi-platform applicability
  • Performance: scale-out, footprint-aware
The Experiment
  • 9 applications, 13 configurations
  • One (more) architecture: multi-core x86
    • It’s real (no simulation here)
    • Built with pthreads, locks, and atomics
      • Per-pthread task-priority-queues with work-stealing
    • More advanced scheduling
Scope: Application bonanza
  • GRAMPS: Ray tracer (0, 1 bounce), Spheres (no rasterization, though)
  • MapReduce: Hist (reduce / combine), LR (reduce / combine), PCA
  • Cilk(-like): Mergesort
  • CUDA: Gaussian, SRAD
  • StreamIt: FM, TDE
Platform: 2xQuad-core Nehalem
  • Queues: copy in/out, global (shared) buffer
  • Threads: user-level scheduled contexts
  • Shaders: create one task per input packet

Native: 8 HyperThreaded Core i7 cores
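“User-level scheduled contexts” means each Thread stage runs on its own lightweight context that yields back to the worker's scheduler when it blocks at a queue. The POSIX ucontext sketch below is only one way to picture this; it is not claimed to be the runtime's actual mechanism.

```cpp
// Sketch of a Thread stage on a user-level context: the stage swaps back to
// the worker's scheduler context when it "blocks" at a queue operation.
// POSIX ucontext is used purely for illustration.
#include <ucontext.h>
#include <vector>

namespace {
ucontext_t schedCtx;                       // the worker's scheduler context
ucontext_t stageCtx;                       // one Thread stage's context
std::vector<char> stageStack(256 * 1024);  // stack for the stage context

void stageBody() {
    // ... run until an input Reserve cannot be satisfied ...
    swapcontext(&stageCtx, &schedCtx);     // yield to the scheduler
    // ... resumed here later, once packets have arrived ...
}
}  // namespace

void runStageOnce() {
    getcontext(&stageCtx);
    stageCtx.uc_stack.ss_sp   = stageStack.data();
    stageCtx.uc_stack.ss_size = stageStack.size();
    stageCtx.uc_link          = &schedCtx;          // return here when done
    makecontext(&stageCtx, stageBody, 0);
    swapcontext(&schedCtx, &stageCtx);              // run stage until it yields
}
```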

Performance: Metrics (reminder)

“Maximize machine utilization while keeping working sets small”

  • Priority #1: Scale-out parallelism
  • Priority #2: ‘Reasonable’ bandwidth / storage
Performance: Scheduling
  • Static per-stage priorities (still)
  • Work-stealing task-priority-queues
  • Eagerly create one task per packet (naïve)
  • Keep running stages until a low watermark
    • (Limited dynamic weighting of queue depths)
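Taken together, this is per-worker priority task queues with stealing plus a low-watermark hysteresis: a worker keeps draining its current stage while the backlog is deep, and otherwise falls back to strict static priorities. A self-contained sketch of that selection rule (illustrative values, not the runtime's code):

```cpp
// Illustrative sketch of "static priorities + low watermark": stay on the
// current stage while its backlog is deep, otherwise pick the highest-priority
// stage with runnable work (and steal from another worker if there is none).
#include <array>
#include <cstddef>

constexpr std::size_t kNumStages    = 8;   // illustrative
constexpr std::size_t kLowWatermark = 4;   // illustrative threshold

// pending[s] = runnable tasks for stage s on this worker;
// larger s = higher static priority.
int pickNextStage(const std::array<std::size_t, kNumStages>& pending,
                  int currentStage) {
    // Hysteresis: keep running the current stage while its backlog is deep.
    if (currentStage >= 0 &&
        pending[static_cast<std::size_t>(currentStage)] > kLowWatermark) {
        return currentStage;
    }
    // Otherwise fall back to strict static priorities.
    for (int s = static_cast<int>(kNumStages) - 1; s >= 0; --s) {
        if (pending[static_cast<std::size_t>(s)] > 0) return s;
    }
    return -1;   // no local work: time to steal from another worker
}
```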
Performance: Good scale-out
  • (Footprint: Good; detail a little later)

[Figure: parallel speedup vs. hardware threads]

Performance: Low overheads

[Figure: execution time breakdown (8 cores / 16 hyperthreads), as a percentage of execution]

  • ‘App’ and ‘Queue’ time are both useful work.


Comparison with Other Schedulers

(with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)

Three archetypes
  • Task-Stealing: (Cilk, TBB)
    • Low overhead with fine granularity tasks
    • No producer-consumer, priorities, or data-parallel
  • Breadth-First: (CUDA, OpenCL)
    • Simple scheduler (one stage at a time)
    • No producer-consumer, no pipeline parallelism
  • Static: (StreamIt / Streaming)
    • No runtime scheduler; complex schedules
    • Cannot adapt to irregular workloads
The Experiment
  • Re-use the exact same application code
  • Modify the scheduler per archetype:
    • Task-Stealing: Unbounded queues, no priority, (amortized) preempt to child tasks
    • Breadth-First: Unbounded queues, stage at a time, top-to-bottom
    • Static: Unbounded queues, offline per-thread schedule using SAS / SGMS
Seeing is believing (ray tracer)

[Figure: ray tracer execution visualizations for GRAMPS, Breadth-First, Task-Stealing, and Static (SAS)]
Comparison: Execution time

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]

  • Mostly similar: good parallelism, load balance
Comparison: Execution time (continued)

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]

  • Breadth-First can exhibit load imbalance
Comparison: Execution time (continued)

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]

  • Task-Stealing can ping-pong and cause contention

Comparison: Footprint

[Figure: relative packet footprint versus GRAMPS (log scale)]

  • Breadth-First is pathological (as expected)
Footprint: GRAMPS & Task-Stealing

[Figure: packet footprint for the ray tracer and MapReduce applications]

GRAMPS gets insight from the graph:

  • (Application-specified) queue bounds
  • Group tasks by stage for priority, preemption
Static scheduling is challenging

[Figure: packet footprint and execution time for static schedules]

  • Generating good Static schedules is *hard*.
  • Static schedules are fragile:
    • Small mismatches compound
    • Hardware itself is dynamic (cache traffic, IRQs, …)
  • Limited upside: dynamic scheduling is cheap!
Discussion (for multi-core CPUs)
  • Adaptive scheduling is the obvious choice.
    • Better load-balance / handling of irregularity
  • Semantic insight (app graph) gives a big advantage in managing footprint.
  • More cores, development maturity → more complex graphs and thus more advantage.
Contributions Revisited
  • GRAMPS programming model design
    • Graph of heterogeneous stages and queues
  • Good results from actual implementation
    • Broad scope: Wide range of applications
    • Multi-platform: Three different architectures
    • Performance: High parallelism, good footprint
Anecdotes and intuitions
  • Structure helps: an explicit graph is handy.
  • Simple (principled) dynamic scheduling works.
  • Queues impedance-match heterogeneity.
  • Graphs with cycles and push both paid off.
  • (Also: Paired instrumentation and visualization help enormously)
Conclusion: Future trends revisited
  • Core counts are increasing
    • Parallel programming models
  • Memory and bandwidth are precious
    • Working set, locality (i.e., footprint) management
  • Power, performance driving heterogeneity
    • All ‘cores’ need to communicate, interoperate
  • GRAMPS fits them well.
Thanks
  • Eric, for agreeing to make this happen.
  • Christos, for throwing helpers at me.
  • Kurt, Mendel, and Pat, for, well, a lot.
  • John Gerth, for tireless computer servitude.
  • Melissa (and Heather and Ada before her)
Thanks
  • My practice audiences
  • My many collaborators
  • Daniel, Kayvon, Mike, Tim
  • Supporters at NVIDIA, ATI/AMD, Intel
  • Supporters at VMware
  • Everyone who entertained, informed, challenged me, and made me think
Thanks
  • My funding agencies:
    • Rambus Stanford Graduate Fellowship
    • Department of the Army Research
    • Stanford Pervasive Parallelism Laboratory
Q&A
  • Thank you for listening!
  • Questions?
Tunability
  • Diagnosis:
    • Raw counters, statistics, logs
    • Grampsviz
  • Optimize / Control:
    • Graph topology (e.g., sort-middle vs. sort-last)
    • Queue watermarks (e.g., 10x win for ray tracing)
    • Packet size: Match SIMD widths, share data
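All of these knobs are ordinary parameters chosen when the graph is built, so tuning is an edit to the setup code rather than to the stages. A tiny illustration, continuing the earlier hypothetical sketches (names and values made up):

```cpp
// Hypothetical illustration: packet size and queue depth are graph-construction
// parameters, so the knobs above are tuned here. Names and values are made up.
#include <cstddef>
#include "gramps.h"   // hypothetical header

void buildTunedQueue(gramps::Graph& graph) {
    constexpr std::size_t kElementBytes = 32;   // e.g., one ray
    constexpr std::size_t kSimdWidth    = 16;   // pack to match SIMD width
    constexpr std::size_t kBaseDepth    = 8;

    graph.addQueue("samples",
                   /*packetBytes=*/ kSimdWidth * kElementBytes,  // packet-size knob
                   /*maxPackets=*/  10 * kBaseDepth);            // queue-depth knob
}
```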
Tunability: Grampsviz (1)
  • GPU-Like: Rasterization pipeline
Tunability: Grampsviz (2)
  • CPU-Like: Histogram (MapReduce)

[Figure: Grampsviz view showing the Reduce and Combine stages]

Tunability: Knobs
  • Graph topology/design: sort-middle vs. sort-last
  • Sizing critical queues

[Figure: sort-middle and sort-last graph variants]
A few other tidbits
  • In-place Shader stages / coalescing inputs
  • Instanced Thread stages
  • Queues as barriers / read all-at-once

[Figure: Image Histogram pipeline]

Performance: Good scale-out

[Figure: parallel speedup vs. hardware threads]

  • (Footprint: Good; detail a little later)
Seeing is believing (ray tracer)

[Figure: ray tracer execution visualizations for GRAMPS, Task-Stealing, Static (SAS), and Breadth-First]
Comparison: Execution time

[Figure: time breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]

  • Small ‘Sched’ time, even with large graphs