
Evaluating Synchronization on Shared Address Space Multiprocessors: Methodology & Performance

Sanjeev Kumar

Dongming Jiang

Rohit Chandra

Jaswinder Pal Singh


Classic Study on Synchronization

  • Software Algorithms for Locks and Barriers

    [Mellor-Crummey et al., TOCS’91]

    • Multiprocessor machines

      • BBN Butterfly, Sequent Symmetry

    • Microbenchmarks

  • Little benefit from special hardware support

    • Handle memory/network contention in software


Case for Hardware Support

  • Fetch&Op [Laudon et al., ISCA’97]

    • Origin 2000

    • Microbenchmarks (Counter & Barrier)

  • QOLB [Kagi et al., ISCA’97]

    • Simulations

    • Microbenchmarks & Applications (Locks)

  • Better performance with Hardware Support


Our Study

  • Re-examine synchronization

    • 64-processor Origin 2000

      • New architectures: CC-NUMA

      • New primitives: LL-SC

    • Applications (SPLASH-2) and microbenchmarks

  • Applications: little benefit from H/W support

    • Locks: small performance improvement sometimes

    • Barriers: load imbalance dominates


Outline

  • Background

  • Performance evaluation: Microbenchmarks

    • Synchronization primitives on Origin 2000

    • Lock and Barrier algorithms and performance

  • Performance evaluation: Applications

  • Is further hardware support valuable ?

  • Conclusions


Synchronization Primitives on Origin 2000

  • LL-SC

    • 2 instructions, cached

    • Flexible

  • Fetch&Op

    • Special locations, uncached

    • Inflexible, e.g. Atomic Swap

  • Performance tradeoffs: atomic update

    • LL-SC: contention causes retries

    • Fetch&Op: contention at memory

  • Performance tradeoffs: wait

    • LL-SC: spinning in cache (cache coherence)

    • Fetch&Op: spinning traffic (no cache coherence)
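The retry behavior of LL-SC under contention can be illustrated with a toy model. The sketch below is not Origin 2000 code: `LLSCCell`, `fetch_and_add` and `run_counter` are hypothetical names, and a Python mutex stands in for the hardware's atomicity. It only shows how an atomic update is built from a load-linked / store-conditional retry loop.

```python
import threading


class LLSCCell:
    """Toy model of one memory word with load-linked / store-conditional."""

    def __init__(self, value=0):
        self._value = value
        self._version = 0                 # bumped on every successful store
        self._guard = threading.Lock()    # stands in for hardware atomicity
        self._linked = threading.local()  # per-thread "link" to a version

    def load_linked(self):
        with self._guard:
            self._linked.version = self._version
            return self._value

    def store_conditional(self, new_value):
        with self._guard:
            if self._linked.version != self._version:
                return False              # another thread wrote since our LL
            self._value = new_value
            self._version += 1
            return True


def fetch_and_add(cell, delta=1):
    """Atomic fetch-and-add built from an LL-SC retry loop.

    Returns the old value and the number of failed attempts.
    """
    retries = 0
    while True:
        old = cell.load_linked()
        if cell.store_conditional(old + delta):
            return old, retries
        retries += 1                      # contention shows up as retries


def run_counter(num_threads=4, increments=1000):
    """Increment one shared cell from several threads; return the final value."""
    cell = LLSCCell(0)

    def worker():
        for _ in range(increments):
            fetch_and_add(cell)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.load_linked()
```

In this model a failed store-conditional costs one more loop iteration, which mirrors the "contention causes retries" tradeoff listed for LL-SC.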


Lock Algorithms (1)

Simple

[Figure: four processors (P) polling one lock location ("Available ?" / "No")]

  • One location

  • Atomic test-and-set

    • LL-SC

    • Fetch&Op


Lock Algorithms (2)

Ticket

[Figure: Next-Ticket and Now-Serving counters with four processors (P) holding tickets; values shown: 125, 126, 127, 132]

  • Like in a bakery

  • Proportional backoff

  • Atomic fetch-and-increment

    • LL-SC

    • Fetch&Op
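A ticket lock with proportional backoff can be sketched as follows. This is an illustration only, not the code measured in the study: `AtomicCounter`, `TicketLock`, `run_ticket_demo` and the `backoff_unit` value are assumptions, and a small mutex emulates the atomic fetch-and-increment that LL-SC or Fetch&Op would provide in hardware.

```python
import threading
import time


class AtomicCounter:
    """Emulated atomic fetch-and-increment (a mutex stands in for hardware)."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def fetch_and_increment(self):
        with self._guard:
            old = self._value
            self._value += 1
            return old


class TicketLock:
    def __init__(self, backoff_unit=1e-5):
        self.next_ticket = AtomicCounter(0)
        self.now_serving = 0              # written only by the lock holder
        self.backoff_unit = backoff_unit  # hypothetical delay per waiter ahead

    def acquire(self):
        my_ticket = self.next_ticket.fetch_and_increment()
        while True:
            ahead = my_ticket - self.now_serving
            if ahead == 0:
                return my_ticket
            # Proportional backoff: wait longer the further back we are.
            time.sleep(ahead * self.backoff_unit)

    def release(self):
        self.now_serving += 1             # hand the lock to the next ticket


def run_ticket_demo(num_threads=4, iterations=200):
    """Protect a non-atomic increment with the ticket lock; return the total."""
    lock = TicketLock()
    shared = [0]

    def worker():
        for _ in range(iterations):
            lock.acquire()
            shared[0] = shared[0] + 1     # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared[0]
```

Tickets are granted in arrival order, so the lock is FIFO, and each waiter polls `now_serving` less often the further it is from the head of the line.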


Lock Algorithms (3)

MCS Queuing

[Figure: queue of lock nodes (each showing 0) linked to four processors (P)]

  • Queuing

  • Local spinning

  • Atomic Compare-and-Swap

    • LL-SC

    • Not Fetch&Op
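The queuing and local-spinning idea behind the MCS lock can be sketched as below. The names `QNode`, `MCSLock` and `run_mcs_demo` are hypothetical, and a Python mutex emulates the atomic swap and compare-and-swap on the tail pointer; on the Origin 2000 these would be built from LL-SC, as the slide notes.

```python
import threading
import time


class QNode:
    """Per-thread queue node; each waiter spins only on its own node."""

    def __init__(self):
        self.next = None
        self.locked = False


class MCSLock:
    def __init__(self):
        self._tail = None
        self._guard = threading.Lock()    # emulates atomic swap / CAS

    def _swap_tail(self, node):
        with self._guard:
            old = self._tail
            self._tail = node
            return old

    def _cas_tail(self, expected, new):
        with self._guard:
            if self._tail is expected:
                self._tail = new
                return True
            return False

    def acquire(self, node):
        node.next = None
        node.locked = True
        pred = self._swap_tail(node)      # enqueue ourselves at the tail
        if pred is not None:
            pred.next = node              # link behind the predecessor
            while node.locked:            # local spinning on our own flag
                time.sleep(0)

    def release(self, node):
        if node.next is None:
            # No known successor: try to mark the queue empty.
            if self._cas_tail(node, None):
                return
            # A successor is mid-enqueue; wait for it to link itself.
            while node.next is None:
                time.sleep(0)
        node.next.locked = False          # pass the lock to the successor


def run_mcs_demo(num_threads=4, iterations=200):
    """Protect a non-atomic increment with the MCS lock; return the total."""
    lock = MCSLock()
    shared = [0]

    def worker():
        node = QNode()                    # one reusable node per thread
        for _ in range(iterations):
            lock.acquire(node)
            shared[0] = shared[0] + 1     # critical section
            lock.release(node)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared[0]
```

The release path needs compare-and-swap to detect an empty queue, which is why this algorithm cannot be expressed with a fixed set of Fetch&Op operations.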


Lock-Delay Microbenchmark

[Figure: lock-delay microbenchmark results; curves: Simple (LL-SC), TicketProp (LL-SC), MCS (LL-SC), TicketProp (Fetch&Op)]


Barrier Algorithms (1)

Central

[Figure: four processors (P) sharing an "Arrived" counter (5) and a "Go" flag (No)]

  • Increment a counter

  • Wait on a location

  • Atomic fetch-and-increment

    • LL-SC

    • Fetch&Op
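A central barrier of this shape (one arrival counter, one flag to wait on) can be sketched as follows. The sense-reversal scheme used to make the barrier reusable is an assumption of this sketch, not something the slide states, and `CentralBarrier` and `run_barrier_demo` are hypothetical names. A mutex emulates the atomic fetch-and-increment.

```python
import threading
import time


class CentralBarrier:
    def __init__(self, num_threads):
        self.num_threads = num_threads
        self._arrived = 0                 # the "Arrived" counter
        self._guard = threading.Lock()    # emulates atomic fetch-and-increment
        self.go = False                   # the "Go" flag all waiters spin on
        self._local = threading.local()   # per-thread sense for reuse

    def wait(self):
        # Flip this thread's sense so the barrier can be reused.
        my_sense = not getattr(self._local, "sense", False)
        self._local.sense = my_sense
        with self._guard:
            self._arrived += 1
            last = self._arrived == self.num_threads
            if last:
                self._arrived = 0         # reset before releasing anyone
        if last:
            self.go = my_sense            # release all waiters
        else:
            while self.go != my_sense:    # every waiter polls one location
                time.sleep(0)


def run_barrier_demo(num_threads=4, phases=5):
    """Return True if no thread left a barrier before all threads arrived."""
    barrier = CentralBarrier(num_threads)
    arrivals = [0] * phases
    arrivals_guard = threading.Lock()
    violations = []

    def worker():
        for phase in range(phases):
            with arrivals_guard:
                arrivals[phase] += 1
            barrier.wait()
            if arrivals[phase] != num_threads:
                violations.append(phase)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(violations) == 0
```

All waiters poll the same `go` location, which is the hotspot that tree-based barriers are designed to avoid.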


Barrier Algorithms (2)

Tournament

[Figure: tree of flag locations (each showing 0) above four processors (P)]

  • Tree of locations

  • Spin on different locations

  • Avoid hotspot and contention

  • Atomic fetch-and-increment

    • LL-SC

    • Fetch&Op


Barrier-Null Microbenchmark

[Figure: barrier-null microbenchmark results; curves: Central (LL-SC), Tournament (LL-SC), Central (Fetch&Op), Hybrid (LL-SC, Fetch&Op)]


Microbenchmarks Summary

  • LL-SC

    • Simplest algorithms perform poorly

      e.g. Simple lock and Central barrier

    • Smarter algorithms perform much better

  • Fetch&Op supports faster synchronization


Outline

  • Background

  • Performance evaluation: Microbenchmarks

  • Performance evaluation: Applications

  • Is further hardware support valuable ?

  • Conclusions


Choosing Applications: Methodology

  • Applications from SPLASH-2

    • Undo optimizations (Added locks and barriers)

  • Problem Size

    • At least 25-fold speedup on 64 processors

  • Base case

    • Best LL-SC lock and barrier


Base Performance


Application Performance Using Different Locks

[Figure: application performance with different locks; base: MCS, LL-SC; one value labeled 1.65]

  • Better algorithm helps

  • Fetch&Op traffic hurts


Application Performance Using Different Barriers

[Figure: application performance with different barriers; base: Tournament, LL-SC; values labeled 2.68 and 1.52]

  • Load-imbalance dominates

  • Fetch&Op traffic hurts


Applications Summary

  • LL-SC

    • Locks : Better algorithm helps

    • Barriers : Load imbalance dominates

  • Fetch&Op

    • Traffic due to spinning hurts performance

  • Different from the microbenchmarks


Outline

  • Background

  • Performance evaluation: Microbenchmarks

  • Performance evaluation: Applications

  • Is further hardware support valuable ?

    • Locks

    • Barriers

  • Conclusions


Sensitivity to Lock Performance

[Figure: two panels, Raytrace and Radiosity]

Adding round-trip network delays

Extrapolate: 20-30% improvement from better hardware


When Do Faster Locks Help Applications?

  • Applications sensitive to lock performance

    • Raytrace, Radiosity (~20-30%)

  • Substantial time in synchronization

  • Small contended critical sections

    • Critical section size = actual + lock overhead

      • Lock overhead dilates the critical section

      • Effect on performance depends on size of critical section

    • 2 apps: ~5 µs (1-2 updates to shared locations)
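The dilation argument (critical section size = actual + lock overhead) can be made concrete with a small model. The function names and all numbers below are illustrative assumptions, not measurements from the study; the point is only that a fixed lock overhead inflates a short critical section far more than a long one.

```python
def effective_critical_section(actual_us, lock_overhead_us):
    """Critical section size as seen by waiters: actual work plus lock overhead."""
    return actual_us + lock_overhead_us


def dilation_factor(actual_us, lock_overhead_us):
    """How much the lock overhead stretches the critical section."""
    return effective_critical_section(actual_us, lock_overhead_us) / actual_us


def serialized_time_us(acquisitions, actual_us, lock_overhead_us):
    """Time spent in a fully contended lock under this model.

    Assumes every acquisition is serialized and pays the full overhead.
    """
    return acquisitions * effective_critical_section(actual_us, lock_overhead_us)


# Hypothetical values: the same 4 us overhead on a 1 us and a 100 us section.
short_section = dilation_factor(1.0, 4.0)
long_section = dilation_factor(100.0, 4.0)
```

Under these assumed values the short section is stretched fivefold while the long one grows by only a few percent, which is consistent with faster locks mattering mainly for small, contended critical sections.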


Can We Fix Contention Problems in These Cases in the Application?

  • Yes. Fix was fairly easy

    • Raytrace

      • Global counter → partial reductions

    • Radiosity

      • Single buffer allocation queue → multiple queues

      • Tasks added to local queue → distribute

  • Significant performance improvement

    • Raytrace : 90%, Radiosity: 220%
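The Raytrace-style fix, replacing a lock-protected global counter with partial reductions, can be sketched as below. This is not the SPLASH-2 code: the function names are hypothetical and the "work" is a plain count, chosen only to show how the contended lock disappears from the inner loop.

```python
import threading


def count_with_global_counter(items_per_thread, num_threads):
    """Every update takes the same lock. Returns (total, lock acquisitions)."""
    total = [0]
    acquisitions = [0]
    guard = threading.Lock()

    def worker():
        for _ in range(items_per_thread):
            with guard:                   # small, highly contended section
                total[0] += 1
                acquisitions[0] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0], acquisitions[0]


def count_with_partial_reductions(items_per_thread, num_threads):
    """Each thread accumulates privately; partial sums are combined at the end."""
    partial = [0] * num_threads

    def worker(tid):
        local = 0
        for _ in range(items_per_thread):
            local += 1                    # no shared state touched here
        partial[tid] = local              # one write per thread, own slot

    threads = [threading.Thread(target=worker, args=(tid,))
               for tid in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial), 0                # no lock acquisitions needed
```

Both versions produce the same total, but the second performs no lock acquisitions, so its cost no longer depends on lock performance.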


Barriers

  • Load-imbalance dominates

  • Other applications

    • Well-balanced with little communication

      • Like the microbenchmarks; Real applications ?

    • Well-balanced computation & communication

      • SOR : nearest neighbor on a grid

      • Barriers : 61 % execution time

      • Imbalance still dominates: communication imbalance


Summary & Conclusions

  • Fetch&op does not help Applications

    • At least for well-known lock & barrier algorithms

  • Using applications is important

  • Little benefit from hardware support

    • Locks: helps sometimes, but fixable in the application

    • Barriers: load imbalance dominates

  • Sound Methodology


Tournament Barrier with Fetch&Op

  • Worse performance

    • Preliminary measurements indicated worse overhead in addition to traffic

  • Barrier performance did not make a difference in the applications


Small Problem Size

  • Raytrace : Decreases lock time

  • Barnes : Load-imbalance increases

  • Water-Nsq : Load-imbalance and Serialization

  • Ocean & SOR : Barrier time remains same

  • Radiosity & Water-Spatial : Not available


SOR Breakdown

Load imbalance dominates time spent in barriers