
Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers


Presentation Transcript


  1. Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers Jack Sampson*, Rubén González†, Jean-Francois Collard¤, Norman P. Jouppi¤, Mike Schlansker¤, Brad Calder‡ *UCSD †UPC Barcelona ¤Hewlett-Packard Laboratories ‡UCSD/Microsoft

  2. Motivations • CMPs are not just small multiprocessors • Different computation/communication ratio • Different shared resources • Inter-core fabric offers potential to support optimizations/acceleration • CMPs for vector, streaming workloads

  3. Fine-grained Parallelism • CMPs in role of vector processors • Software synchronization still expensive • Can target inner-loop parallelism • Barriers a straightforward organizing tool • Opportunity for hardware acceleration • Faster barriers allow greater parallelism • 1.2x – 6.4x on 256 element vectors • 3x – 12.2x on 1024 element vectors
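
To make the inner-loop use case concrete, here is a minimal C/pthreads sketch (ours, not from the talk) of the pattern these numbers measure: each core takes a slice of a vector loop, and a barrier separates phases with cross-slice dependences. pthread_barrier_wait() stands in for whichever barrier implementation (software, filter, or dedicated network) is in use.

    #include <pthread.h>

    #define NCORES 4
    #define N 1024

    static double a[N], b[N], c[N];
    static pthread_barrier_t bar;   /* stand-in for any barrier implementation */

    /* Each core takes a contiguous slice; the barrier separates the phase
     * that produces b[] from the phase that reads neighboring elements. */
    static void *worker(void *arg)
    {
        int id = (int)(long)arg;
        int lo = id * (N / NCORES), hi = lo + (N / NCORES);

        for (int i = lo; i < hi; i++)
            b[i] = 2.0 * a[i];                  /* phase 1: produce      */

        pthread_barrier_wait(&bar);             /* all of b[] now ready  */

        for (int i = lo; i < hi; i++)
            c[i] = b[i] + b[(i + 1) % N];       /* phase 2: consume      */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NCORES];
        pthread_barrier_init(&bar, NULL, NCORES);
        for (long i = 0; i < NCORES; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NCORES; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

The finer the slices (256 vs. 1024 elements), the more the per-iteration barrier cost dominates, which is why faster barriers widen the range of profitable problem sizes.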

  4. Accelerating Barriers • Barrier Filters: a new method for barrier synchronization • No dedicated networks • No new instructions • Changes only in shared memory system • CMP-friendly design point • Competitive with dedicated barrier network • Achieves 77%-95% of dedicated network performance

  5. Outline • Introduction • Barrier Filter Overview • Barrier Filter Implementation • Results • Summary

  6. Observation and Intuition • Observations • Barriers need to stall forward progress • There exist events that already stall processors • Co-opt and extend existing stall behavior • Cache misses • Either I-Cache or D-Cache suffices

  7. High Level Barrier Behavior • A thread can be in one of three states • Executing • Perform work • Enforce memory ordering • Signal arrival at barrier • Blocking • Stall at barrier until all arrive • Resuming • Release from barrier
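
The three states map onto a short thread-side sequence. Below is a minimal sketch of that sequence; memory_fence() and invalidate_line() are hypothetical stand-ins for the architecture's fence and cache-line-invalidate instructions, and the blocking load is an ordinary load whose fill the filter withholds.

    /* Hypothetical stand-ins; real code would use the ISA's fence and
     * cache-line invalidate instructions. */
    #define memory_fence()      __sync_synchronize()
    #define invalidate_line(p)  /* ISA-specific cache-line invalidate of p */

    /* One barrier crossing, following the three states on this slide. */
    static inline void filter_barrier_wait_hw(volatile long *arrival_addr)
    {
        memory_fence();                  /* EXECUTING: order prior memory ops */
        invalidate_line(arrival_addr);   /* signal arrival to the filter      */
        (void)*arrival_addr;             /* BLOCKING: fill withheld until all */
                                         /* threads arrive; once it is served */
                                         /* the thread is RESUMING            */
    }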

  8. Barrier Filter Example • CMP augmented with filter • Private L1 • Shared, banked L2 (Figure: filter state — # threads: 3, arrived counter: 0; threads A, B, C all EXECUTING)

  9. Example: Memory Ordering • Orders memory operations before vs. after the barrier • Each thread executes a memory fence (Figure: arrived counter: 0; all threads EXECUTING)

  10. Example: Signaling Arrival • Communication with filter • Each thread invalidates a designated cache line (Figure: arrived counter: 0; all threads EXECUTING)

  11. Example: Signaling Arrival • Invalidation propagates to shared L2 cache • Filter snoops the invalidation • Checks address for match • Records arrival (Figure: arrived counter: 0 → 1; thread A now BLOCKING)

  12. Example: Signaling Arrival • Invalidation propagates to shared L2 cache • Filter snoops the invalidation • Checks address for match • Records arrival (Figure: arrived counter: 1 → 2; thread C now BLOCKING, thread A already BLOCKING, thread B EXECUTING)

  13. Example: Stalling • Thread A attempts to fetch the invalidated data • Fill request not satisfied • Thread stalling mechanism (Figure: arrived counter: 2; threads A and C BLOCKING, thread B EXECUTING)

  14. Example: Release • Last thread signals arrival • Barrier release • Counter resets • Filter state for all threads switches to RESUMING (Figure: third arrival triggers release — arrived counter resets 2 → 0; all threads RESUMING)

  15. Example: Release • After release • New cache-fill requests served • Filter serves pending cache-fills (Figure: arrived counter: 0; all threads RESUMING)

  17. Outline • Introduction • Barrier Filter Overview • Barrier Filter Implementation • Results • Summary

  18. Software Interface • Communication requirements • Let hardware know # of threads • Let threads know signal addresses • Barrier filters as virtualized resource • Library interface • Pure software fallback • User scenario • Application calls OS to create barrier with # threads • OS allocates barrier filter, relays address and # threads • OS returns address to application
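
A plausible shape for that library interface, sketched in C; sys_barrier_create() and the field names are illustrative assumptions, not the paper's API. The software fallback is what keeps the filter a virtualizable resource when no hardware barrier is free.

    #include <pthread.h>
    #include <stddef.h>

    typedef struct {
        volatile long    *arrival;   /* signal address returned by the OS */
        int               hw;        /* 0 => pure software fallback       */
        pthread_barrier_t sw;
    } filter_barrier_t;

    /* Illustrative OS call: returns the filter's signal address, or NULL
     * if no barrier filter is available. */
    extern volatile long *sys_barrier_create(int nthreads);

    int filter_barrier_init(filter_barrier_t *b, int nthreads)
    {
        b->arrival = sys_barrier_create(nthreads);
        b->hw = (b->arrival != NULL);
        return b->hw ? 0 : pthread_barrier_init(&b->sw, NULL, nthreads);
    }

    void filter_barrier_wait(filter_barrier_t *b)
    {
        if (b->hw)
            filter_barrier_wait_hw(b->arrival);  /* sequence from slide 7 */
        else
            pthread_barrier_wait(&b->sw);
    }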

  19. Barrier Filter Hardware • Additional hardware: “address filter” • In controller for shared memory level • State table, associated FSMs • Snoops invalidations, fill requests for designated addresses • Makes use of existing instructions and existing interconnect network
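
Behaviorally, the address filter is a small state table driven by the two events it snoops: invalidations (arrivals) and fill requests (threads trying to leave). A C sketch of that behavior, with all structure and field names as illustrative assumptions:

    #define MAX_THREADS 16

    enum tstate { EXECUTING, BLOCKING, RESUMING };

    struct barrier_filter {
        long        arrival_tag;      /* watched cache-line address */
        int         nthreads, arrived;
        enum tstate state[MAX_THREADS];
    };

    /* Controller snoops an invalidation of a watched line: record the
     * arrival, and release everyone when the last thread arrives. */
    void filter_on_invalidate(struct barrier_filter *f, long tag, int tid)
    {
        if (tag != f->arrival_tag)
            return;
        f->state[tid] = BLOCKING;
        if (++f->arrived == f->nthreads) {
            f->arrived = 0;                     /* counter resets       */
            for (int t = 0; t < f->nthreads; t++)
                f->state[t] = RESUMING;         /* pending fills served */
        }
    }

    /* Controller sees a fill request: withhold it while the thread blocks. */
    int filter_allow_fill(const struct barrier_filter *f, long tag, int tid)
    {
        return tag != f->arrival_tag || f->state[tid] != BLOCKING;
    }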

  20. Barrier Filter Internals • Each barrier filter supports one barrier • Barrier state • Per-thread state, FSMs • Multiple barrier filters • In each controller • In banked caches, at a particular bank


  23. Why have an exit address? • Needed for re-entry to barriers • When does Resuming again become Executing? • Additional fill requests may be issued • Delivery is not a guarantee of receipt • Context switches • Migration • Cache eviction
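
One plausible handshake matching this slide (our sketch; pairing a separate exit line with each arrival line, and the names used, are assumptions about the details):

    /* Re-entrant crossing with distinct arrival and exit lines. Reading
     * the exit line confirms receipt of the release before the thread
     * counts as EXECUTING again, so a quick re-entry is not confused
     * with the previous barrier episode even across context switches,
     * migration, or cache eviction (illustrative sketch only). */
    static inline void filter_barrier_wait_reentrant(volatile long *arrival,
                                                     volatile long *exit_addr)
    {
        memory_fence();
        invalidate_line(arrival);
        (void)*arrival;           /* BLOCKING until all threads arrive      */
        (void)*exit_addr;         /* confirm receipt: RESUMING -> EXECUTING */
    }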

  24. Ping-Pong Optimization • Draws from sense reversal barriers • Entry and exit operations as duals • Two alternating arrival addresses • Each conveys exit to the other’s barrier • Eliminates explicit invalidate of exit address
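
A minimal sketch of the ping-pong scheme, reusing the hypothetical primitives from the slide-7 sketch; each thread keeps a private sense bit, as in sense-reversal barriers, and arriving at one address implicitly signals exit from the barrier guarded by the other.

    /* Two alternating arrival lines; arriving at one implicitly signals
     * exit from the other, so no explicit invalidate of an exit address
     * is needed. arrival_pair[] is set up at barrier creation. */
    static volatile long *arrival_pair[2];

    static inline void filter_barrier_wait_pingpong(int *sense)
    {
        volatile long *addr = arrival_pair[*sense];
        memory_fence();
        invalidate_line(addr);
        (void)*addr;              /* blocks until all threads arrive    */
        *sense ^= 1;              /* next crossing uses the other line  */
    }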

  25. Outline • Introduction • Barrier Filter Overview • Barrier Filter Implementation • Results • Summary

  26. Methodology • Used a modified version of SMT-Sim • We performed experiments using 7 different barrier implementations • Software: • Centralized, combining tree • Hardware: • Filter barrier (4 variants), dedicated barrier network • We examined performance over a set of parallelizable kernels • Livermore Loops 2, 3, 6 • EEMBC kernels: autocorrelation, Viterbi

  27. Benchmark Selection • Barriers are seen as heavyweight operations • Infrequently executed in most workloads • Example: Ocean from SPLASH-2 • On a simulated 16-core CMP: 4% of time in barriers • Barriers will be used more frequently on CMPs

  28. Latency Micro-benchmark • Average time of barrier execution (in isolation) • #threads = #cores
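
The talk does not show the harness; a typical way to measure barrier latency in isolation (our sketch, reusing filter_barrier_t from the slide-18 sketch) is to time many back-to-back empty crossings and divide:

    #include <time.h>

    /* With no work between barriers, the mean time per crossing is the
     * barrier cost itself. Run with #threads = #cores, as on this slide. */
    double mean_barrier_ns(filter_barrier_t *b, int niters)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < niters; i++)
            filter_barrier_wait(b);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return ((t1.tv_sec - t0.tv_sec) * 1e9 +
                (t1.tv_nsec - t0.tv_nsec)) / niters;
    }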

  29. Latency Micro-benchmark • Barrier filter scales well up to the point of bus saturation • Notable effects once the bus saturates

  30. Latency Micro-benchmark • Filters closer to dedicated network than software • Significant speedup vs. software still exhibited

  31. Autocorrelation Kernel • On 16-core CMP • 7.98x speedup for dedicated network • 7.31x speedup for best filter barrier • 3.86x speedup for best software barrier • Significant speedup opportunities with fast barriers

  32. Viterbi Kernel • Viterbi on 4-core CMP • Not all applications can scale to an arbitrary number of cores • Viterbi performance higher on 4 or 8 cores than on 16 cores

  33. Livermore Loops • Livermore Loop 3 on 16-core CMP • Serial/parallel crossover: HW achieves it on a 4x smaller problem

  34. Livermore Loops • Livermore Loop 3 on 16-core CMP • Reduction in parallelism to avoid false sharing

  35. Result Summary • Fine-grained parallelism on CMPs • Significant speedups possible • 1.2x – 6.4x on 256 element vectors • 3x – 12.2x on 1024 element vectors • False sharing affects problem size/scaling • Faster barriers allow greater parallelism • HW approaches extend worthwhile problem sizes • Barrier filters give competitive performance • 77%-95% of dedicated network performance

  36. Conclusions • Fast barriers • Can organize fine-grained data parallelism on a CMP • CMPs can act in a vector processor role • Exploit inner-loop parallelism • Barrier filters • CMP-oriented fast barrier

  37. (FIN) • Questions?

  38. Extra Graphs
