Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers Jack Sampson*, Rubén González†, Jean-Francois Collard¤, Norman P. Jouppi¤, Mike Schlansker¤, Brad Calder‡ *UCSD †UPC Barcelona ¤Hewlett-Packard Laboratories ‡UCSD/Microsoft
Motivations • CMPs are not just small multiprocessors • Different computation/communication ratio • Different shared resources • Inter-core fabric offers potential to support optimizations/acceleration • CMPs for vector, streaming workloads
Fine-grained Parallelism • CMPs in role of vector processors • Software synchronization still expensive • Can target inner-loop parallelism • Barriers a straightforward organizing tool • Opportunity for hardware acceleration • Faster barriers allow greater parallelism • 1.2x – 6.4x on 256 element vectors • 3x – 12.2x on 1024 element vectors
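To make the inner-loop pattern concrete, here is a minimal C sketch (not from the talk) of a vector computation split across cores, with a barrier enforcing the cross-slice dependence between the two loops; barrier_wait() is a hypothetical stand-in for any of the barrier implementations compared later.

```c
/* A minimal sketch of barrier-organized inner-loop parallelism: each
 * core works on one slice of the vector, and a barrier separates
 * dependent loops. barrier_wait() is a hypothetical primitive. */
#include <stddef.h>

#define NCORES 16

extern void barrier_wait(void);   /* hypothetical barrier primitive */

void scale_then_smooth(double *x, double *y, double a, size_t n, int tid)
{
    size_t chunk = (n + NCORES - 1) / NCORES;
    size_t lo = (size_t)tid * chunk;
    size_t hi = lo + chunk < n ? lo + chunk : n;

    for (size_t i = lo; i < hi; i++)   /* first inner loop: y += a*x */
        y[i] += a * x[i];

    barrier_wait();  /* neighboring slices of y must be complete below */

    for (size_t i = lo; i < hi; i++)   /* reads a neighbor that another
                                          core may have written */
        x[i] = 0.5 * (y[i] + y[(i + 1) % n]);
}
```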
Accelerating Barriers • Barrier Filters: a new method for barrier synchronization • No dedicated networks • No new instructions • Changes only in shared memory system • CMP-friendly design point • Competitive with dedicated barrier network • Achieves 77%-95% of dedicated network performance
Outline • Introduction • Barrier Filter Overview • Barrier Filter Implementation • Results • Summary
Observation and Intuition • Observations • Barriers need to stall forward progress • There exist events that already stall processors • Co-opt and extend existing stall behavior • Cache misses • Either I-Cache or D-Cache suffices
High Level Barrier Behavior • A thread can be in one of three states • Executing • Perform work • Enforce memory ordering • Signal arrival at barrier • Blocking • Stall at barrier until all arrive • Resuming • Release from barrier
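As a sketch of what these three states look like from the thread's side, assuming hypothetical memory_fence() and cacheline_invalidate() wrappers around existing fence and invalidate instructions (the design adds no new ones):

```c
/* A sketch of the per-thread barrier sequence matching the three
 * states above; helper names are illustrative. */
extern void memory_fence(void);                     /* order prior memory ops */
extern void cacheline_invalidate(volatile void *p); /* existing invalidate op */

void filter_barrier_wait(volatile long *arrival_addr)
{
    /* Executing: make this thread's work visible before signaling. */
    memory_fence();

    /* Signal arrival: invalidate the designated line; the filter in the
     * shared-cache controller snoops the invalidation and counts it. */
    cacheline_invalidate(arrival_addr);

    /* Blocking: re-reading the line issues a fill request, which the
     * filter withholds until all threads have arrived, stalling us. */
    (void)*arrival_addr;

    /* Resuming: the fill was served; execution continues past the barrier. */
}
```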
Barrier Filter Example • CMP augmented with filter • Private L1 • Shared, banked L2
Filter state: # threads = 3, arrived-counter = 0; Thread A: EXECUTING, Thread B: EXECUTING, Thread C: EXECUTING
Example: Memory Ordering • Fence establishes before/after ordering for memory operations • Each thread executes a memory fence
Filter state: # threads = 3, arrived-counter = 0; Thread A: EXECUTING, Thread B: EXECUTING, Thread C: EXECUTING
Example: Signaling Arrival • Communication with filter • Each thread invalidates a designated cache line
Filter state: # threads = 3, arrived-counter = 0; Thread A: EXECUTING, Thread B: EXECUTING, Thread C: EXECUTING
Example: Signaling Arrival • Invalidation propagates to shared L2 cache • Filter snoops the invalidation • Checks address for match • Records arrival
Filter state: # threads = 3, arrived-counter = 1; Thread A: BLOCKING, Thread B: EXECUTING, Thread C: EXECUTING
Example: Signaling Arrival • Invalidation propagates to shared L2 cache • Filter snoops the invalidation • Checks address for match • Records arrival
Filter state: # threads = 3, arrived-counter = 2; Thread A: BLOCKING, Thread B: EXECUTING, Thread C: BLOCKING
Example: Stalling • Thread A attempts to fetch the invalidated data • Fill request not satisfied by the filter • The withheld fill is the thread-stalling mechanism
Filter state: # threads = 3, arrived-counter = 2; Thread A: BLOCKING, Thread B: EXECUTING, Thread C: BLOCKING
Example: Release • Last thread signals arrival • Barrier release • Counter resets • Filter state for all threads switches to RESUMING
Filter state: # threads = 3, arrived-counter = 0; Thread A: RESUMING, Thread B: RESUMING, Thread C: RESUMING
Example: Release • After release • New cache-fill requests served • Filter serves pending cache-fills
Filter state: # threads = 3, arrived-counter = 0; Thread A: RESUMING, Thread B: RESUMING, Thread C: RESUMING
Outline • Introduction • Barrier Filter Overview • Barrier Filter Implementation • Results • Summary
Software Interface • Communication requirements • Let hardware know # of threads • Let threads know signal addresses • Barrier filters as virtualized resource • Library interface • Pure software fallback • User scenario • Application calls OS to create barrier with # threads • OS allocates barrier filter, relays address and # threads • OS returns address to application
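A hedged sketch of what such a library interface could look like; barrier_create, barrier_wait, and the os_alloc_barrier_filter() call are illustrative names, not the paper's actual API:

```c
/* Sketch of a virtualized barrier resource with a software fallback.
 * All names are illustrative. */
typedef struct { int count, sense, nthreads; } sw_barrier_t;  /* details elided */

extern void sw_barrier_init(sw_barrier_t *b, int nthreads);
extern void sw_barrier_wait(sw_barrier_t *b);
extern void filter_barrier_wait(volatile long *arrival_addr);
extern volatile long *os_alloc_barrier_filter(int nthreads);  /* hypothetical OS
                                                                 call: returns the
                                                                 signal address,
                                                                 or NULL */

typedef struct {
    volatile long *arrival_addr;  /* filter's designated line, or NULL */
    sw_barrier_t   fallback;      /* pure-software fallback            */
} barrier_t;

void barrier_create(barrier_t *b, int nthreads)
{
    /* OS allocates a filter, configures # threads, returns the address;
     * if no filter is free, fall back to a software barrier. */
    b->arrival_addr = os_alloc_barrier_filter(nthreads);
    if (b->arrival_addr == NULL)
        sw_barrier_init(&b->fallback, nthreads);
}

void barrier_wait(barrier_t *b)
{
    if (b->arrival_addr)
        filter_barrier_wait(b->arrival_addr);
    else
        sw_barrier_wait(&b->fallback);
}
```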
Barrier Filter Hardware • Additional hardware: “address filter” • In controller for shared memory level • State table, associated FSMs • Snoops invalidations, fill requests for designated addresses • Makes use of existing instructions and existing interconnect network
Barrier Filter Internals • Each barrier filter supports one barrier • Barrier state • Per-thread state, FSMs • Multiple barrier filters • In each controller • In banked caches, at a particular bank
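As an illustration, the state one filter might track can be summarized as below; the field and constant names are assumptions, not the hardware's actual layout:

```c
/* Sketch of per-barrier filter state, following the description above;
 * all names are illustrative. */
#define MAX_THREADS 16

enum thread_state { EXECUTING, BLOCKING, RESUMING };

struct barrier_filter {
    int  nthreads;                       /* configured thread count        */
    int  arrived;                        /* arrival counter                */
    enum thread_state fsm[MAX_THREADS];  /* per-thread FSM state           */
    long arrival_line;                   /* snooped address: invalidations
                                            here are counted as arrivals   */
    long exit_line;                      /* signals barrier re-entry, i.e.
                                            Resuming -> Executing          */
};
```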
Why have an exit address? • Needed for re-entry to barriers • When does Resuming again become Executing? • Additional fill requests may be issued • Delivery is not a guarantee of receipt • Context switches • Migration • Cache eviction
Ping-Pong Optimization • Draws from sense reversal barriers • Entry and exit operations as duals • Two alternating arrival addresses • Each conveys exit to the other’s barrier • Eliminates explicit invalidate of exit address
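A sketch of the resulting wait routine, reusing the hypothetical helpers from the earlier thread-side sketch; line[2] holds the two alternating arrival addresses and sense is a per-thread toggle:

```c
/* Sketch of the ping-pong idea: successive barrier episodes alternate
 * between two arrival lines, like the flipping flag of a sense-reversal
 * barrier, so arriving at one barrier doubles as the exit signal for
 * the other and no explicit exit-line invalidate is needed. */
extern void memory_fence(void);
extern void cacheline_invalidate(volatile void *p);

void pingpong_barrier_wait(volatile long *line[2], int *sense /* per-thread */)
{
    memory_fence();
    cacheline_invalidate(line[*sense]);  /* arrive on the current line      */
    (void)*line[*sense];                 /* stall until the filter releases  */
    *sense ^= 1;                         /* next episode uses the other line */
}
```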
Outline • Introduction • Barrier Filter Overview • Barrier Filter Implementation • Results • Summary
Methodology • Used a modified version of SMTSIM • We performed experiments using 7 different barrier implementations • Software: • Centralized (sketch below), combining tree • Hardware: • Filter barrier (4 variants), dedicated barrier network • We examined performance over a set of parallelizable kernels • Livermore loops 2, 3, 6 • EEMBC kernels autocorrelation, viterbi
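For reference, a minimal sketch of what a centralized software barrier baseline typically looks like, written with C11 atomics; the paper's exact implementation may differ:

```c
/* Centralized barrier: one shared counter plus a sense flag. */
#include <stdatomic.h>

typedef struct {
    atomic_int count;    /* threads arrived so far         */
    atomic_int sense;    /* flips once per barrier episode */
    int        nthreads;
} central_barrier_t;

void central_barrier_wait(central_barrier_t *b, int *local_sense)
{
    *local_sense = !*local_sense;
    if (atomic_fetch_add(&b->count, 1) == b->nthreads - 1) {
        atomic_store(&b->count, 0);             /* last arriver resets,   */
        atomic_store(&b->sense, *local_sense);  /* and releases everyone  */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                   /* spin until released    */
    }
}
```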
Benchmark Selection • Barriers are seen as heavyweight operations • Infrequently executed in most workloads • Example: Ocean from SPLASH-2 • On a simulated 16-core CMP: 4% of time in barriers • Barriers will be used more frequently on CMPs
Latency Micro-benchmark • Average time of barrier execution (in isolation) • #threads = #cores
Latency Micro-benchmark • Barrier filter scales well up to the point of bus saturation • Notable effects once the bus saturates
Latency Micro-benchmark • Filters closer to dedicated network than to software • Significant speedup vs. software still exhibited
Autocorrelation Kernel • On a 16-core CMP • 7.98x speedup for dedicated network • 7.31x speedup for best filter barrier • 3.86x speedup for best software barrier • Significant speedup opportunities with fast barriers
Viterbi Kernel (Viterbi on a 4-core CMP) • Not all applications can scale to an arbitrary number of cores • Viterbi performance higher on 4 or 8 cores than on 16 cores
Livermore Loops (Livermore Loop 3 on a 16-core CMP) • HW barriers reach the serial/parallel crossover at a 4x smaller problem size
Livermore Loops (Livermore Loop 3 on a 16-core CMP) • Parallelism is reduced to avoid false sharing
Result Summary • Fine-grained parallelism on CMPs • Significant speedups possible • 1.2x – 6.4x on 256 element vectors • 3x – 12.2x on 1024 element vectors • False sharing affects problem size/scaling • Faster barriers allow greater parallelism • HW approaches extend worthwhile problem sizes • Barrier filters give competitive performance • 77% - 95% of dedicated network performance
Conclusions • Fast barriers • Can organize fine-grained data parallelism on a CMP • CMPs can act in a vector processor role • Exploit inner-loop parallelism • Barrier filters • CMP-oriented fast barrier
(FIN) • Questions?