Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers Jack Sampson*, Rubén González†, Jean-Francois Collard¤, Norman P. Jouppi¤, Mike Schlansker¤, Brad Calder‡ *UCSD †UPC Barcelona ¤Hewlett-Packard Laboratories ‡UCSD/Microsoft
Motivations • CMPs are not just small multiprocessors • Different computation/communication ratio • Different shared resources • Inter-core fabric offers potential to support optimizations/acceleration • CMPs for vector, streaming workloads
Fine-grained Parallelism • CMPs in role of vector processors • Software synchronization still expensive • Can target inner-loop parallelism • Barriers a straightforward organizing tool • Opportunity for hardware acceleration • Faster barriers allow greater parallelism • 1.2x – 6.4x on 256 element vectors • 3x – 12.2x on 1024 element vectors
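To make the inner-loop pattern concrete, here is a minimal C sketch (not from the talk) of a vector computation split across cores, with a barrier enforcing the cross-slice dependence between the two loops; barrier_wait() is a hypothetical stand-in for any of the barrier implementations compared later.

```c
/* A minimal sketch of barrier-organized inner-loop parallelism: each
 * core works on one slice of the vector, and a barrier separates
 * dependent loops. barrier_wait() is a hypothetical primitive. */
#include <stddef.h>

#define NCORES 16

extern void barrier_wait(void);   /* hypothetical barrier primitive */

void scale_then_smooth(double *x, double *y, double a, size_t n, int tid)
{
    size_t chunk = (n + NCORES - 1) / NCORES;
    size_t lo = (size_t)tid * chunk;
    size_t hi = lo + chunk < n ? lo + chunk : n;

    for (size_t i = lo; i < hi; i++)   /* first inner loop: y += a*x */
        y[i] += a * x[i];

    barrier_wait();  /* neighboring slices of y must be complete below */

    for (size_t i = lo; i < hi; i++)   /* reads a neighbor that another
                                          core may have written */
        x[i] = 0.5 * (y[i] + y[(i + 1) % n]);
}
```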
Accelerating Barriers • Barrier Filters: a new method for barrier synchronization • No dedicated networks • No new instructions • Changes only in shared memory system • CMP-friendly design point • Competitive with dedicated barrier network • Achieves 77%-95% of dedicated network performance
Outline • Introduction • Barrier Filter Overview • Barrier Filter Implementation • Results • Summary
Observation and Intuition • Observations • Barriers need to stall forward progress • There exist events that already stall processors • Co-opt and extend existing stall behavior • Cache misses • Either I-Cache or D-Cache suffices
High Level Barrier Behavior • A thread can be in one of three states • Executing • Perform work • Enforce memory ordering • Signal arrival at barrier • Blocking • Stall at barrier until all arrive • Resuming • Release from barrier
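As a sketch of what these three states look like from the thread's side, assuming hypothetical memory_fence() and cacheline_invalidate() wrappers around existing fence and invalidate instructions (the design adds no new ones):

```c
/* A sketch of the per-thread barrier sequence matching the three
 * states above; helper names are illustrative. */
extern void memory_fence(void);                     /* order prior memory ops */
extern void cacheline_invalidate(volatile void *p); /* existing invalidate op */

void filter_barrier_wait(volatile long *arrival_addr)
{
    /* Executing: make this thread's work visible before signaling. */
    memory_fence();

    /* Signal arrival: invalidate the designated line; the filter in the
     * shared-cache controller snoops the invalidation and counts it. */
    cacheline_invalidate(arrival_addr);

    /* Blocking: re-reading the line issues a fill request, which the
     * filter withholds until all threads have arrived, stalling us. */
    (void)*arrival_addr;

    /* Resuming: the fill was served; execution continues past the barrier. */
}
```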
Barrier Filter Example • CMP augmented with filter • Private L1 • Shared, banked L2
Filter state: # threads = 3, arrived-counter = 0; Thread A: EXECUTING, Thread B: EXECUTING, Thread C: EXECUTING
Example: Memory Ordering • Fence establishes before/after ordering for memory operations • Each thread executes a memory fence
Filter state: # threads = 3, arrived-counter = 0; Thread A: EXECUTING, Thread B: EXECUTING, Thread C: EXECUTING
Example: Signaling Arrival • Communication with filter • Each thread invalidates a designated cache line
Filter state: # threads = 3, arrived-counter = 0; Thread A: EXECUTING, Thread B: EXECUTING, Thread C: EXECUTING
Example: Signaling Arrival • Invalidation propagates to shared L2 cache • Filter snoops the invalidation • Checks address for match • Records arrival
Filter state: # threads = 3, arrived-counter = 1; Thread A: BLOCKING, Thread B: EXECUTING, Thread C: EXECUTING
Example: Signaling Arrival • Invalidation propagates to shared L2 cache • Filter snoops the invalidation • Checks address for match • Records arrival
Filter state: # threads = 3, arrived-counter = 2; Thread A: BLOCKING, Thread B: EXECUTING, Thread C: BLOCKING
Example: Stalling • Thread A attempts to fetch the invalidated data • Fill request not satisfied by the filter • The withheld fill is the thread-stalling mechanism
Filter state: # threads = 3, arrived-counter = 2; Thread A: BLOCKING, Thread B: EXECUTING, Thread C: BLOCKING
Example: Release • Last thread signals arrival • Barrier release • Counter resets • Filter state for all threads switches to RESUMING
Filter state: # threads = 3, arrived-counter = 0; Thread A: RESUMING, Thread B: RESUMING, Thread C: RESUMING
Example: Release • After release • New cache-fill requests served • Filter serves pending cache-fills
Filter state: # threads = 3, arrived-counter = 0; Thread A: RESUMING, Thread B: RESUMING, Thread C: RESUMING
Outline • Introduction • Barrier Filter Overview • Barrier Filter Implementation • Results • Summary
Software Interface • Communication requirements • Let hardware know # of threads • Let threads know signal addresses • Barrier filters as virtualized resource • Library interface • Pure software fallback • User scenario • Application calls OS to create barrier with # threads • OS allocates barrier filter, relays address and # threads • OS returns address to application
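A hedged sketch of what such a library interface could look like; barrier_create, barrier_wait, and the os_alloc_barrier_filter() call are illustrative names, not the paper's actual API:

```c
/* Sketch of a virtualized barrier resource with a software fallback.
 * All names are illustrative. */
typedef struct { int count, sense, nthreads; } sw_barrier_t;  /* details elided */

extern void sw_barrier_init(sw_barrier_t *b, int nthreads);
extern void sw_barrier_wait(sw_barrier_t *b);
extern void filter_barrier_wait(volatile long *arrival_addr);
extern volatile long *os_alloc_barrier_filter(int nthreads);  /* hypothetical OS
                                                                 call: returns the
                                                                 signal address,
                                                                 or NULL */

typedef struct {
    volatile long *arrival_addr;  /* filter's designated line, or NULL */
    sw_barrier_t   fallback;      /* pure-software fallback            */
} barrier_t;

void barrier_create(barrier_t *b, int nthreads)
{
    /* OS allocates a filter, configures # threads, returns the address;
     * if no filter is free, fall back to a software barrier. */
    b->arrival_addr = os_alloc_barrier_filter(nthreads);
    if (b->arrival_addr == NULL)
        sw_barrier_init(&b->fallback, nthreads);
}

void barrier_wait(barrier_t *b)
{
    if (b->arrival_addr)
        filter_barrier_wait(b->arrival_addr);
    else
        sw_barrier_wait(&b->fallback);
}
```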
Barrier Filter Hardware • Additional hardware: “address filter” • In controller for shared memory level • State table, associated FSMs • Snoops invalidations, fill requests for designated addresses • Makes use of existing instructions and existing interconnect network
Barrier Filter Internals • Each barrier filter supports one barrier • Barrier state • Per-thread state, FSMs • Multiple barrier filters • In each controller • In banked caches, at a particular bank
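As an illustration, the state one filter might track can be summarized as below; the field and constant names are assumptions, not the hardware's actual layout:

```c
/* Sketch of per-barrier filter state, following the description above;
 * all names are illustrative. */
#define MAX_THREADS 16

enum thread_state { EXECUTING, BLOCKING, RESUMING };

struct barrier_filter {
    int  nthreads;                       /* configured thread count        */
    int  arrived;                        /* arrival counter                */
    enum thread_state fsm[MAX_THREADS];  /* per-thread FSM state           */
    long arrival_line;                   /* snooped address: invalidations
                                            here are counted as arrivals   */
    long exit_line;                      /* signals barrier re-entry, i.e.
                                            Resuming -> Executing          */
};
```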
Why have an exit address? • Needed for re-entry to barriers • When does Resuming again become Executing? • Additional fill requests may be issued • Delivery is not a guarantee of receipt • Context switches • Migration • Cache eviction
Ping-Pong Optimization • Draws from sense reversal barriers • Entry and exit operations as duals • Two alternating arrival addresses • Each conveys exit to the other’s barrier • Eliminates explicit invalidate of exit address
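A sketch of the resulting wait routine, reusing the hypothetical helpers from the earlier thread-side sketch; line[2] holds the two alternating arrival addresses and sense is a per-thread toggle:

```c
/* Sketch of the ping-pong idea: successive barrier episodes alternate
 * between two arrival lines, like the flipping flag of a sense-reversal
 * barrier, so arriving at one barrier doubles as the exit signal for
 * the other and no explicit exit-line invalidate is needed. */
extern void memory_fence(void);
extern void cacheline_invalidate(volatile void *p);

void pingpong_barrier_wait(volatile long *line[2], int *sense /* per-thread */)
{
    memory_fence();
    cacheline_invalidate(line[*sense]);  /* arrive on the current line      */
    (void)*line[*sense];                 /* stall until the filter releases  */
    *sense ^= 1;                         /* next episode uses the other line */
}
```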
Outline • Introduction • Barrier Filter Overview • Barrier Filter Implementation • Results • Summary
Methodology • Used a modified version of SMTSIM • We performed experiments using 7 different barrier implementations • Software: • Centralized (sketch below), combining tree • Hardware: • Filter barrier (4 variants), dedicated barrier network • We examined performance over a set of parallelizable kernels • Livermore loops 2, 3, 6 • EEMBC kernels autocorrelation, viterbi
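For reference, a minimal sketch of what a centralized software barrier baseline typically looks like, written with C11 atomics; the paper's exact implementation may differ:

```c
/* Centralized barrier: one shared counter plus a sense flag. */
#include <stdatomic.h>

typedef struct {
    atomic_int count;    /* threads arrived so far         */
    atomic_int sense;    /* flips once per barrier episode */
    int        nthreads;
} central_barrier_t;

void central_barrier_wait(central_barrier_t *b, int *local_sense)
{
    *local_sense = !*local_sense;
    if (atomic_fetch_add(&b->count, 1) == b->nthreads - 1) {
        atomic_store(&b->count, 0);             /* last arriver resets,   */
        atomic_store(&b->sense, *local_sense);  /* and releases everyone  */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                   /* spin until released    */
    }
}
```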
Benchmark Selection • Barriers are seen as heavyweight operations • Infrequently executed in most workloads • Example: Ocean from SPLASH-2 • On a simulated 16-core CMP: 4% of time in barriers • Barriers will be used more frequently on CMPs
Latency Micro-benchmark • Average time of barrier execution (in isolation) • #threads = #cores
Latency Micro-benchmark • Barrier filter scales well up to the point of bus saturation • Notable effects once the bus saturates
Latency Micro-benchmark • Filters closer to dedicated network than to software • Significant speedup vs. software still exhibited
Autocorrelation Kernel • On a 16-core CMP • 7.98x speedup for dedicated network • 7.31x speedup for best filter barrier • 3.86x speedup for best software barrier • Significant speedup opportunities with fast barriers
Viterbi Kernel (Viterbi on a 4-core CMP) • Not all applications can scale to an arbitrary number of cores • Viterbi performance higher on 4 or 8 cores than on 16 cores
Livermore Loops (Livermore Loop 3 on a 16-core CMP) • HW barriers reach the serial/parallel crossover at a 4x smaller problem size
Livermore Loops (Livermore Loop 3 on a 16-core CMP) • Parallelism is reduced to avoid false sharing
Result Summary • Fine-grained parallelism on CMPs • Significant speedups possible • 1.2x – 6.4x on 256 element vectors • 3x – 12.2x on 1024 element vectors • False sharing affects problem size/scaling • Faster barriers allow greater parallelism • HW approaches extend worthwhile problem sizes • Barrier filters give competitive performance • 77% - 95% of dedicated network performance
Conclusions • Fast barriers • Can organize fine-grained data parallelism on a CMP • CMPs can act in a vector processor role • Exploit inner-loop parallelism • Barrier filters • CMP-oriented fast barrier
(FIN) • Questions?