
Dissertation Seminar, 18/11 – 2005

Auditorium Minus, Museum Gustavianum

Software Techniques for Distributed Shared Memory

Zoran Radovic

zoran.radovic@it.uu.se

Dissertation Seminar

Outline
  • NUCA Locks
  • DSZOOM – Software-based Shared Memory
  • TMA – Trap-based Memory Architecture

Vasaloppet: “Contention Problem in Sweden”

  • Traditional cross-country ski race, 90 km …

[Figure: race photo annotated “CS” and “85.6533 km to go…”]

Spin Locks under Contention

  • IF (more contention) THEN less efficient CS …
  • “The more important the slower it runs…”

[Figure: critical section (CS) cost vs. amount of contention; curves: spin locks, spin locks with backoff]
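The slide contrasts a plain spin lock with one that backs off. As a rough illustration only, here is a minimal sketch of a test-and-test-and-set lock with exponential backoff, assuming C11 atomics. The names `spin_lock_t`, `spin_try_lock`, `spin_delay` and the two backoff constants are hypothetical and are not taken from the talk.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names and tuning values; a sketch, not the code measured in the talk. */
typedef struct { atomic_int held; } spin_lock_t;      /* 0 = free, 1 = held */

enum { BACKOFF_MIN = 16, BACKOFF_MAX = 4096 };        /* assumed delay bounds */

/* Busy-wait for roughly n iterations without touching shared memory. */
static void spin_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { }
}

/* One attempt: returns true if the lock was free and now belongs to the caller. */
static bool spin_try_lock(spin_lock_t *l) {
    return atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0;
}

static void spin_lock(spin_lock_t *l) {
    unsigned delay = BACKOFF_MIN;
    while (!spin_try_lock(l)) {
        /* Spin on plain reads, pausing longer each round, so that
         * waiting CPUs generate less traffic on the lock's cache line. */
        while (atomic_load_explicit(&l->held, memory_order_relaxed) != 0) {
            spin_delay(delay);
            if (delay < BACKOFF_MAX) delay *= 2;
        }
    }
}

static void spin_unlock(spin_lock_t *l) {
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

Backoff lowers the traffic per waiter, but every waiter still polls the same location, which is consistent with the slide's point that the critical section gets more expensive as contention grows.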

Queue-based Locks

  • IF (more contention) THEN constant CS cost …

[Figure: CS cost vs. amount of contention; curves: spin locks, spin locks with backoff, queue-based locks]
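The slide does not say which queue-based lock is meant. As a stand-in, the sketch below shows an MCS-style list-based queue lock in C11, where each waiter spins on a flag in its own queue node and the releaser hands the lock to its successor. The type and function names (`qnode_t`, `queue_lock_t`, `queue_lock`, `queue_unlock`) are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* MCS-style queue lock, used here only as an example of a queue-based lock. */
typedef struct qnode {
    _Atomic(struct qnode *) next;   /* successor in the waiting queue */
    atomic_bool locked;             /* true while this waiter must spin */
} qnode_t;

typedef struct { _Atomic(qnode_t *) tail; } queue_lock_t;   /* NULL = free */

static void queue_lock(queue_lock_t *l, qnode_t *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);
    /* Append ourselves to the queue; the previous tail is our predecessor. */
    qnode_t *prev = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
    if (prev != NULL) {
        atomic_store_explicit(&prev->next, me, memory_order_release);
        /* Spin on our own node only, not on the shared lock word. */
        while (atomic_load_explicit(&me->locked, memory_order_acquire)) { }
    }
}

static void queue_unlock(queue_lock_t *l, qnode_t *me) {
    qnode_t *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* No visible successor: try to mark the lock free. */
        qnode_t *expected = me;
        if (atomic_compare_exchange_strong_explicit(&l->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A successor is enqueueing; wait until it links itself in. */
        while ((succ = atomic_load_explicit(&me->next, memory_order_acquire)) == NULL) { }
    }
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

Because each waiter polls only its own node, the handover cost does not depend on how many CPUs are waiting, which matches the "constant CS cost" claim on the slide.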

This Dissertation

  • IF (more contention) THEN more efficient CS …
  • “The more important the faster it runs…”

[Figure: CS cost vs. amount of contention; curves: spin locks, spin locks with backoff, queue-based locks, NUCA locks]

NUCA Locks (Basic Idea)

1) Reduce traffic
   - one CPU per node is testing…
2) Improve lock handover
3) More efficient CS
   - local traffic is cheaper

[Figure: NUCA machine with three memories connected by a switch and nine processors (P), each with a cache ($); processors are annotated “Lock/Unlock” or “Test”]

The HBO Lock (the simplest HBO)

Creates communication affinity

  • What do we need?
    • node_id
    • Compare&swap (CAS) atomic operation

CAS(Lock_address,FREE, node_id)

  • lock-acquire:
    • If the lock-value is in the state FREE:
      • The node_id is CAS-ed into the lock location
    • Else: 2 cases
      • The lock is “local” → Spin with small backoff
      • The lock is “remote” → Spin with large backoff
  • Simple but fairly effective…
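The acquire steps above can be sketched in C11 as follows. This is an illustration under stated assumptions, not the dissertation's code: the encoding of FREE as -1, the two fixed backoff constants, and the names `hbo_lock_t`, `hbo_cas`, `hbo_acquire`, `hbo_release` and `hbo_delay` are all hypothetical.

```c
#include <stdatomic.h>

enum { HBO_FREE = -1 };                               /* assumed encoding of FREE */
enum { BACKOFF_LOCAL = 32, BACKOFF_REMOTE = 1024 };   /* assumed backoff lengths */

/* The lock word holds FREE or the node_id of the current holder. */
typedef struct { atomic_int owner_node; } hbo_lock_t;

/* Busy-wait for roughly n iterations without touching shared memory. */
static void hbo_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { }
}

/* CAS(Lock_address, FREE, node_id): returns the value seen in the lock word.
 * A return value of HBO_FREE means the swap succeeded. */
static int hbo_cas(hbo_lock_t *l, int node_id) {
    int seen = HBO_FREE;
    atomic_compare_exchange_strong_explicit(&l->owner_node, &seen, node_id,
                                            memory_order_acquire,
                                            memory_order_relaxed);
    return seen;
}

static void hbo_acquire(hbo_lock_t *l, int my_node) {
    for (;;) {
        int seen = hbo_cas(l, my_node);
        if (seen == HBO_FREE)
            return;                        /* our node_id is now in the lock */
        /* Held by a CPU in our node: retry soon. Held remotely: retry later. */
        hbo_delay(seen == my_node ? BACKOFF_LOCAL : BACKOFF_REMOTE);
    }
}

static void hbo_release(hbo_lock_t *l) {
    atomic_store_explicit(&l->owner_node, HBO_FREE, memory_order_release);
}
```

Since local waiters retry more often than remote ones, a released lock tends to be picked up inside the same node. The fairness cost of that bias is what the RH and HBO_GT_SD variants listed under "Topics not Presented" address.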

Performance Results: Realistic microbenchmark, 2-node WildFire, 28 CPUs

[Figure: microbenchmark results chart; labels: “14”, “14”, “WF”]

Fairness?

Total Traffic: Raytrace

HBO Locks inside Linux Kernel
  • Patch provided by Silicon Graphics, Inc.
    • Linux-IA64 kernel implementation, May 2005
  • Page-fault handler runs 3x faster
    • 60 processors

Outline
  • NUCA Locks
  • DSZOOM – Software-based Shared Memory
  • TMA – Trap-based Memory Architecture

The DSZOOM Proposal
  • Run the entire protocol in the requesting processor
    • No protocol agent communication!
  • Assumes user-level remote memory access
    • put, get, and atomics [→ InfiniBand]
  • Fine-grain memory protocols (e.g., 64 bytes)
  • Hardware-like memory models [Shasta, Blizzard, Sirocco]

“Squeezing” Protocols into Binaries…

Binary/Assembler level instrumentation

Original Program:

    ...
    cmp %g0, %l5
    bne 0x24431
    nop
    ld [%o1 + 64], %o0
    ldd [%o0 + 16], %f4
    clr %l5
    ...

DSZOOM Program:

    ...
    cmp %g0, %l5
    bne 0x24431
    nop
    ld [%o1 + 64], %o0      ! original load
    mov 255, %g6            ! fast-path protocol code
    and %g6, %o0, %g6
    cmp %g6, 170
    bne 0x24450
    nop
    (slow-path protocol code, C-code)
    ldd [%o0 + 16], %f4
    clr %l5
    ...
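Read literally, the fast-path snippet on this slide masks the low 8 bits of the loaded value (`and` with 255) and compares them with 170 (0xAA), branching past the slow path when they differ. A C rendering of that check might look like the sketch below. It assumes that a match means the data may be invalid and the slow-path protocol code must run. The function names and the stub slow path are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum { MAGIC_BYTE = 170 };   /* 0xAA, the constant compared in the slide's snippet */

/* Fast path: mirrors "mov 255, %g6; and %g6, %o0, %g6; cmp %g6, 170". */
static bool needs_slow_path(uint32_t loaded) {
    return (loaded & 255u) == MAGIC_BYTE;
}

/* Signature of a slow-path handler (hypothetical). */
typedef uint32_t (*slow_path_fn)(const uint32_t *addr);

/* Stub that stands in for the real protocol code, which would fetch a
 * valid copy of the data from its home node. Returns a fixed value. */
static uint32_t demo_slow_path(const uint32_t *addr) {
    (void)addr;
    return 42u;
}

/* Instrumented load: do the original load, then run the check. */
static uint32_t checked_load(const uint32_t *addr, slow_path_fn slow) {
    uint32_t value = *addr;            /* the original "ld [%o1 + 64], %o0" */
    if (needs_slow_path(value))
        value = slow(addr);            /* rare case: enter the protocol */
    return value;
}
```

With this convention the common case costs a mask, a compare and a branch after each load. A valid value whose low byte happens to be 0xAA would also take the slow path, which then has to tell the two cases apart.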

Write Permission Caching
  • Problem: store instrumentation relies on locking
    • More complex instrumentation
  • Solution: write permission cache (WPC)
    • Small and fast software-managed cache
    • Keeps write permissions
  • The WPC idea:
    • Exploit store locality
    • Dynamically reduce the number of memory references in store checking code
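To make the WPC idea concrete, here is a minimal single-entry sketch in C. It assumes a 64-byte coherence unit, as in the fine-grain example earlier in the talk. The slow-path functions are empty stubs, and all names (`wpc_t`, `wpc_store_check`, `acquire_write_permission`, `release_write_permission`) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum { COHERENCE_UNIT = 64 };   /* assumed coherence unit size in bytes */

/* Single-entry write permission cache, private to one thread. */
typedef struct {
    uintptr_t line;     /* coherence unit we may currently write */
    bool      valid;    /* false until the first store */
    unsigned  hits;
    unsigned  misses;
} wpc_t;

/* Stubs for the slow path; the real versions would lock the unit's
 * directory entry and obtain or give back write permission. */
static void acquire_write_permission(uintptr_t line) { (void)line; }
static void release_write_permission(uintptr_t line) { (void)line; }

/* Store check run before each instrumented store.
 * Returns true on a WPC hit, where no protocol state is touched. */
static bool wpc_store_check(wpc_t *c, uintptr_t addr) {
    uintptr_t line = addr / COHERENCE_UNIT;
    if (c->valid && c->line == line) {
        c->hits++;
        return true;                       /* store locality pays off */
    }
    if (c->valid)
        release_write_permission(c->line); /* evict the old entry */
    acquire_write_permission(line);        /* slow path, done once per unit */
    c->line = line;
    c->valid = true;
    c->misses++;
    return false;
}
```

Consecutive stores to the same coherence unit then cost one compare each, and the locking is paid once per unit instead of once per store.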

Other “Features”
  • Two kinds of protocols
    • Invalidate
    • Update
  • Many optimizations
    • Instrumentation scheduling (update and invalidate)
    • Instrumentation batching (invalidate)
    • WPC-based write batching (update)
    • WPC-based dirty-data filtering (update)
    • Private-data filtering (update)
    • # of WPC entries (update and invalidate)
    • Coherence unit size (update and invalidate)

Coherence Flags and Profiling
  • Coherence flags
    • Similar to optimization flags of compilers
    • Possible scenario:

gcc -dszoom-cl 128 -dszoom-inv -O3 my_app.c

  • Execution profiling
    • Similar to profile feedback of compilers
    • Helps find appropriate coherence flag settings
    • Low overhead implementation in DSZOOM
      • Less than 30 percent overhead
    • Works for both small and large input sets

DSZOOM Results: 2-node WildFire, 16 CPUs

[Figure: results chart; annotated values: 1.45x and 1.11x]

Outline
  • NUCA Locks
  • DSZOOM – Software-based Shared Memory
  • TMA – Trap-based Memory Architecture

Instrumentation Drawbacks

[Figure: the original and DSZOOM-instrumented programs from the “Squeezing” Protocols into Binaries… slide, with the fast-path and slow-path protocol code]

  • Binary transparency?
  • Run-time execution overhead

Trap-Based Memory Architectures
  • Basic idea
    • Detect fine-grained coherence violations in hardware
    • Trigger a coherence trap when one occurs
    • Maintain coherence by software protocols
  • No memory system modifications
  • Minimal processor modifications
  • Binary Transparency
    • No need to instrument binaries/applications

TMA Lite: Proof-of-concept Implementation
  • Load permission check
    • Hardware implementation of software check
      • Predefined “magic-value” convention
  • Store permission check
    • Hardware WPC
  • Can be seen as a very small cache
  • Operates on virtual addresses
  • Accessed in parallel with the data TLB

TMA Lite Performance [TMA: simulation study, 4 nodes | DSZOOM: 2-node WildFire]

[Figure: performance chart; annotated values: 1.75x and 1.01x]

Topics not Presented
  • RH lock algorithm
    • Controlled (un)fairness
  • HBO_GT and HBO_GT_SD algorithms
    • Global throttling and starvation detection
  • DSZOOM implementation details
    • Instrumentation challenges; scheduling, batching, etc.
    • Bandwidth filtering techniques; dirty- & private-data
  • Innovative TMA simulation tricks
    • Low-level “good days” hacks
    • Reusing Simics checkpoints
