Evaluating Synchronization on Shared Address Space Multiprocessors: Methodology & Performance

Sanjeev Kumar

Dongming Jiang

Rohit Chandra

Jaswinder Pal Singh


Classic Study on Synchronization

  • Software Algorithms for Locks and Barriers

    [Mellor-Crummey et al., TOCS’91]

    • Multiprocessor machines

      • BBN Butterfly, Sequent Symmetry

    • Microbenchmarks

  • Little benefit from special hardware support

    • Handle memory/network contention in software


Case for Hardware Support

  • Fetch&Op [Laudon et al., ISCA’97]

    • Origin 2000

    • Microbenchmarks (Counter & Barrier)

  • QOLB [Kagi et al., ISCA’97]

    • Simulations

    • Microbenchmarks & Applications (Locks)

  • Better performance with Hardware Support


Our Study

  • Re-examine synchronization

    • 64-processor Origin 2000

      • New architecture: CC-NUMA

      • New primitives: LL-SC

    • Applications (SPLASH2) and microbenchmarks

  • Applications: little benefit from hardware support

    • Locks: small performance gain sometimes

    • Barriers: load imbalance dominates


Outline

  • Background

  • Performance evaluation: Microbenchmarks

    • Synchronization primitives on Origin 2000

    • Lock and Barrier algorithms and performance

  • Performance evaluation: Applications

  • Is further hardware support valuable ?

  • Conclusions


Synchronization Primitives on Origin 2000

  • LL-SC

    • 2 instructions, cached

    • Flexible

  • Fetch&Op

    • Special locations, uncached

    • Inflexible, e.g. Atomic Swap

  • Performance tradeoffs: atomic update

    • LL-SC: contention → retries

    • Fetch&Op: contention at memory

  • Performance tradeoffs: wait

    • LL-SC: spinning in cache (cache coherence)

    • Fetch&Op: spinning traffic (no cache coherence)
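
The atomic-update tradeoff between the two primitives can be sketched in portable C++. This is only an illustration, not the Origin 2000 interface: `compare_exchange_weak` stands in for an LL-SC pair (it can fail and force a retry), and `fetch_add` stands in for a single Fetch&Op request. The function names `llsc_style_increment`, `fetchop_style_increment` and `count_with` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// LL-SC style: read, compute, conditionally store. If another processor
// changed the location in between, the store fails and we retry.
// Returns the number of failed attempts (retries).
inline int llsc_style_increment(std::atomic<int>& x) {
    int retries = 0;
    int old = x.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak reloads `old`, so old + 1 is recomputed.
    while (!x.compare_exchange_weak(old, old + 1,
                                    std::memory_order_acq_rel,
                                    std::memory_order_relaxed)) {
        ++retries;
    }
    return retries;
}

// Fetch&Op style: one atomic operation, no software retry loop.
// Returns the value held before the increment.
inline int fetchop_style_increment(std::atomic<int>& x) {
    return x.fetch_add(1, std::memory_order_acq_rel);
}

// Hypothetical driver: `threads` workers each perform `iters` increments.
inline int count_with(bool use_llsc_style, int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                if (use_llsc_style) llsc_style_increment(counter);
                else fetchop_style_increment(counter);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}
```

Both variants produce the same final count; what differs under contention is where the cost shows up: repeated attempts in the retry loop versus serialized requests at the single location.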


Lock Algorithms (1)

Simple

  • One location

  • Atomic test-and-set

    • LL-SC

    • Fetch&Op

[Figure: four processors (P) polling a single lock location ("Available?": "No")]

Lock Algorithms (2)

Ticket

  • Like in a bakery

  • Proportional backoff

  • Atomic fetch-and-increment

    • LL-SC

    • Fetch&Op

[Figure: Next-Ticket and Now-Serving counters and four processors (P) holding ticket numbers; values shown: 125, 126, 127, 132]
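
The ticket lock with proportional backoff can be sketched as follows. This is an illustrative version under stated assumptions: `fetch_add` stands in for the atomic fetch-and-increment, the backoff is expressed as a number of `yield()` calls proportional to the distance from the head of the line, and the names `TicketLock`, `kBackoffUnit` and `ticket_lock_count` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class TicketLock {
    std::atomic<unsigned> next_ticket{0};  // "Next-Ticket"
    std::atomic<unsigned> now_serving{0};  // "Now-Serving"
    static constexpr unsigned kBackoffUnit = 1;  // assumed tuning knob
public:
    void lock() {
        // Take a ticket, like in a bakery.
        const unsigned my = next_ticket.fetch_add(1, std::memory_order_relaxed);
        for (;;) {
            const unsigned serving = now_serving.load(std::memory_order_acquire);
            if (serving == my) return;
            // Proportional backoff: wait longer the further back in line we are.
            const unsigned ahead = my - serving;  // unsigned, so wraparound is safe
            for (unsigned i = 0; i < ahead * kBackoffUnit; ++i) {
                std::this_thread::yield();
            }
        }
    }
    void unlock() {
        // Only the holder writes now_serving, so a plain load + store suffices.
        now_serving.store(now_serving.load(std::memory_order_relaxed) + 1,
                          std::memory_order_release);
    }
};

// Hypothetical driver: counts under the lock and returns the total.
inline long ticket_lock_count(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    TicketLock lock;
    long shared = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++shared;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return shared;
}
```

Only ticket acquisition needs an atomic update; waiting is a read of Now-Serving, and the backoff reduces how often that read is repeated.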


Lock Algorithms (3)

MCS

  • Queuing

  • Local spinning

  • Atomic Compare-and-Swap

    • LL-SC

    • Not Fetch&Op

[Figure: MCS queuing lock, a queue of per-processor flags (0) with four processors (P)]
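
A sketch of an MCS-style queuing lock in C++, assuming `std::atomic` exchange and compare-and-swap as stand-ins for the LL-SC sequences. Each waiter spins only on the flag in its own queue node, which is the "local spinning" property. The names `MCSNode`, `MCSLock` and `mcs_lock_count` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct MCSNode {
    std::atomic<MCSNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

class MCSLock {
    std::atomic<MCSNode*> tail{nullptr};
public:
    void lock(MCSNode& me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        me.locked.store(true, std::memory_order_relaxed);
        // Atomically append ourselves to the queue.
        MCSNode* prev = tail.exchange(&me, std::memory_order_acq_rel);
        if (prev != nullptr) {
            prev->next.store(&me, std::memory_order_release);
            // Local spinning: wait on our own node only.
            while (me.locked.load(std::memory_order_acquire)) {
                std::this_thread::yield();
            }
        }
    }
    void unlock(MCSNode& me) {
        MCSNode* succ = me.next.load(std::memory_order_acquire);
        if (succ == nullptr) {
            // No visible successor: try to swing tail back to empty (CAS).
            MCSNode* expected = &me;
            if (tail.compare_exchange_strong(expected, nullptr,
                                             std::memory_order_acq_rel,
                                             std::memory_order_acquire)) {
                return;
            }
            // A successor is enqueueing; wait until it links itself in.
            while ((succ = me.next.load(std::memory_order_acquire)) == nullptr) {
                std::this_thread::yield();
            }
        }
        succ->locked.store(false, std::memory_order_release);  // hand over
    }
};

// Hypothetical driver: each thread uses its own queue node.
inline long mcs_lock_count(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    MCSLock lock;
    long shared = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            MCSNode node;
            for (int i = 0; i < iters; ++i) {
                lock.lock(node);
                ++shared;
                lock.unlock(node);
            }
        });
    }
    for (auto& th : pool) th.join();
    return shared;
}
```

The release path is where compare-and-swap is needed, which is consistent with the slide's note that this lock is built on LL-SC and not on Fetch&Op.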


Lock-Delay Microbenchmark

[Figure: lock-delay microbenchmark results. Series: Simple (LL-SC), TicketProp (LL-SC), MCS (LL-SC), TicketProp (Fetch&Op)]


Barrier Algorithms (1)

Central

  • Increment a counter

  • Wait on a location

  • Atomic fetch-and-increment

    • LL-SC

    • Fetch&Op

[Figure: four processors (P) incrementing an Arrived counter (shown as 5) and waiting on a Go location (shown as No)]
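
A central barrier along these lines can be sketched with one counter and one flag. The sense-reversing trick used here to make the barrier reusable is an assumption of this sketch and is not stated on the slide; the names `CentralBarrier` and `central_phases_ok` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

class CentralBarrier {
    const int n;
    std::atomic<int> arrived{0};   // "Arrived" counter
    std::atomic<bool> go{false};   // "Go" location everyone waits on
public:
    explicit CentralBarrier(int participants) : n(participants) {
        assert(participants > 0);
    }
    // `local_sense` is per-thread state and must start out false.
    void wait(bool& local_sense) {
        local_sense = !local_sense;
        if (arrived.fetch_add(1, std::memory_order_acq_rel) + 1 == n) {
            // Last arrival: reset the counter, then release everyone.
            arrived.store(0, std::memory_order_relaxed);
            go.store(local_sense, std::memory_order_release);
        } else {
            while (go.load(std::memory_order_acquire) != local_sense) {
                std::this_thread::yield();
            }
        }
    }
};

// Hypothetical check: after each barrier, every thread must observe that
// all `threads` participants incremented that phase's counter.
inline bool central_phases_ok(int threads, int phases) {
    CentralBarrier barrier(threads);
    std::vector<std::atomic<int>> seen(static_cast<std::size_t>(phases));
    for (auto& s : seen) s.store(0);
    std::atomic<bool> ok{true};
    std::vector<std::thread> pool;
    for (int id = 0; id < threads; ++id) {
        pool.emplace_back([&] {
            bool sense = false;
            for (int p = 0; p < phases; ++p) {
                seen[p].fetch_add(1);
                barrier.wait(sense);
                if (seen[p].load() != threads) ok.store(false);
            }
        });
    }
    for (auto& th : pool) th.join();
    return ok.load();
}
```

All participants update the same counter and spin on the same flag, so both the atomic update and the wait concentrate on one or two locations.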


Barrier Algorithms (2)

Tournament

  • Tree of locations

  • Spin on different locations

  • Avoid hotspot and contention

  • Atomic fetch-and-increment

    • LL-SC

    • Fetch&Op

[Figure: tree of flag locations (0) above four processors (P)]
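
One way to sketch a tournament-style barrier is shown below. It is a simplified variant under several assumptions that are not from the slides: the participant count is a power of two, arrival flags are plain atomic loads and stores with sense reversal, and wakeup uses a single global flag rather than a wakeup tree. The names `TournamentBarrier` and `tournament_phases_ok` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

class TournamentBarrier {
    const int n;        // participants, assumed to be a power of two
    const int rounds;   // log2(n)
    std::vector<std::atomic<bool>> arrive;  // arrive[r * n + winner_id]
    std::atomic<bool> go{false};            // simplified global wakeup flag

    static int log2_exact(int v) {
        int r = 0;
        while ((1 << r) < v) ++r;
        return r;
    }
public:
    explicit TournamentBarrier(int participants)
        : n(participants), rounds(log2_exact(participants)),
          arrive(static_cast<std::size_t>(rounds) * participants) {
        assert(participants > 0 && (participants & (participants - 1)) == 0);
        for (auto& f : arrive) f.store(false);
    }
    // `sense` is per-thread state and must start out false.
    void wait(int id, bool& sense) {
        sense = !sense;
        for (int r = 0; r < rounds; ++r) {
            const int stride = 1 << r;
            if (id % (2 * stride) == 0) {
                // Winner of this round: spin on its own flag for this round.
                while (arrive[r * n + id].load(std::memory_order_acquire) != sense) {
                    std::this_thread::yield();
                }
            } else {
                // Loser: signal the statically chosen winner, then wait for wakeup.
                arrive[r * n + (id - stride)].store(sense, std::memory_order_release);
                while (go.load(std::memory_order_acquire) != sense) {
                    std::this_thread::yield();
                }
                return;
            }
        }
        go.store(sense, std::memory_order_release);  // champion releases everyone
    }
};

// Hypothetical check, same idea as for any barrier: after the barrier every
// thread must see all arrivals for that phase.
inline bool tournament_phases_ok(int threads, int phases) {
    TournamentBarrier barrier(threads);
    std::vector<std::atomic<int>> seen(static_cast<std::size_t>(phases));
    for (auto& s : seen) s.store(0);
    std::atomic<bool> ok{true};
    std::vector<std::thread> pool;
    for (int id = 0; id < threads; ++id) {
        pool.emplace_back([&, id] {
            bool sense = false;
            for (int p = 0; p < phases; ++p) {
                seen[p].fetch_add(1);
                barrier.wait(id, sense);
                if (seen[p].load() != threads) ok.store(false);
            }
        });
    }
    for (auto& th : pool) th.join();
    return ok.load();
}
```

During arrival each processor spins on a different location, so no single flag becomes a hotspot; the global wakeup flag in this sketch reintroduces one shared location and is kept only for brevity.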


Barrier-Null Microbenchmark

[Figure: barrier-null microbenchmark results. Series: Central (LL-SC), Tournament (LL-SC), Central (Fetch&Op), Hybrid (LL-SC, Fetch&Op)]


Microbenchmarks Summary

  • LL-SC

    • Simplest algorithms perform poorly

      e.g. Simple lock and Central barrier

    • Smarter algorithms perform much better

  • Fetch&Op supports faster synchronization


Outline

  • Background

  • Performance evaluation: Microbenchmarks

  • Performance evaluation: Applications

  • Is further hardware support valuable ?

  • Conclusions


Choosing Applications: Methodology

  • Applications from SPLASH-2

    • Undo optimizations (Added locks and barriers)

  • Problem Size

    • At least 25-fold speedup on 64 processors

  • Base case

    • Best LL-SC lock and barrier


Base Performance


Application Performance using Different Locks

[Figure: application performance with different locks. Base: MCS, LL-SC; one data label reads 1.65]

  • Better algorithm helps

  • Fetch&Op traffic hurts


Application Performance using Different Barriers

[Figure: application performance with different barriers. Base: Tournament, LL-SC; data labels read 2.68 and 1.52]

  • Load-imbalance dominates

  • Fetch&Op traffic hurts


Applications Summary

  • LL-SC

    • Locks : Better algorithm helps

    • Barriers : Load imbalance dominates

  • Fetch&Op

    • Traffic due to spinning hurts performance

  • Different from the microbenchmarks


Outline

  • Background

  • Performance evaluation: Microbenchmarks

  • Performance evaluation: Applications

  • Is further hardware support valuable ?

    • Locks

    • Barriers

  • Conclusions


Sensitivity to Lock Performance

[Figure: sensitivity to lock performance. Panels: Raytrace, Radiosity]

Adding round-trip network delays

Extrapolate: 20-30% improvement from better hardware


When do faster locks help applications?

  • Applications sensitive to Lock performance

    • Raytrace, Radiosity (~20-30%)

  • Substantial time in synchronization

  • Small contended critical sections

    • Critical section size = actual + lock overhead

      • Lock overhead dilates the critical section

      • Effect on performance ∝ size of critical section

    • 2 Apps: ~5 µs (1-2 updates to shared locations)
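
The dilation argument can be written as a rough relation. The symbols are introduced here for illustration and do not appear in the slides: T_cs is the effective critical section, T_actual the time for the protected updates, and T_lock the lock overhead paid while others wait.

```latex
T_{\mathrm{cs}} = T_{\mathrm{actual}} + T_{\mathrm{lock}},
\qquad
\frac{T_{\mathrm{lock}}}{T_{\mathrm{cs}}} \to 1
\quad \text{as } T_{\mathrm{actual}} \to 0
```

Under this reading, a contended critical section that performs only 1-2 shared updates is mostly lock overhead, so a faster lock shortens the serialized path noticeably, while for long critical sections the same saving is a small fraction.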


Can we fix contention problems in these cases in the application?

  • Yes. Fix was fairly easy

    • Raytrace

      • Global counter → Partial reductions

    • Radiosity

      • Single buffer allocation queue → Multiple

      • Tasks added to local queue → Distribute

  • Significant performance improvement

    • Raytrace : 90%, Radiosity: 220%
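
The "global counter → partial reductions" fix can be illustrated with a generic sketch. This is not the Raytrace source: it only shows the pattern of replacing a lock-protected global counter with per-thread partial counts that are combined once at the end. The 64-byte padding stride and the names `PaddedCount` and `count_with_partial_reductions` are assumptions of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// One slot per thread, padded to an assumed 64-byte stride so that
// neighbouring slots rarely share a cache line.
struct PaddedCount {
    long value = 0;
    char pad[64 - sizeof(long)];
};

inline long count_with_partial_reductions(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    std::vector<PaddedCount> partial(static_cast<std::size_t>(threads));
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&partial, t, iters] {
            // Private update: no lock and no contended shared location.
            for (int i = 0; i < iters; ++i) ++partial[t].value;
        });
    }
    for (auto& th : pool) th.join();
    // Single reduction step after all workers have finished.
    long total = 0;
    for (const auto& p : partial) total += p.value;
    return total;
}
```

The small contended critical section disappears from the common path, which is why an application-level change of this kind can matter more than a faster lock.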


Barriers

  • Load-imbalance dominates

  • Other applications

    • Well-balanced with little communication

      • Like the microbenchmarks; Real applications ?

    • Well-balanced computation & communication

      • SOR : nearest neighbor on a grid

      • Barriers: 61% of execution time

      • Load imbalance still dominates: communication imbalance


Summary & Conclusions

  • Fetch&Op does not help applications

    • At least for well-known lock & barrier algorithms

  • Using applications is important

  • Little benefit from hardware support

    • Locks: helps sometimes → Fixable

    • Barriers: load imbalance dominates

  • Sound Methodology


Tournament Barrier with Fetch&Op

  • Worse performance

    • Preliminary measurements indicated worse overhead in addition to traffic

  • Barrier performance did not make a difference in the applications


Small Problem Size

  • Raytrace : Decreases lock time

  • Barnes : Load-imbalance increases

  • Water-Nsq : Load-imbalance and Serialization

  • Ocean & SOR : Barrier time remains same

  • Radiosity & Water-Spatial : Not available


SOR Breakdown

Load imbalance dominates the time spent in barriers

