DeNovoND: Efficient Hardware Support for Disciplined Non-Determinism



Hyojin Sung, Rakesh Komuravelli, and Sarita V. Adve

Department of Computer Science

University of Illinois at Urbana-Champaign

Motivation
  • Shared memory is the de facto model for multicore SW and HW
  • BUT …
    • Complex SW: data races, unstructured parallelism, memory model, …
    • Inefficient HW: complex coherence/consistency, unnecessary traffic, …
  • Recent work on disciplined shared memory
    • SW: Easier programming model
    • HW: If SW is more disciplined, can we build more efficient HW?
      • DeNovo: Holistic rethinking of entire memory hierarchy
Disciplined Shared Memory

Disciplined Shared-Memory = Global address space + explicit, structured side-effects

  • Replaces implicit, anywhere communication and synchronization
Disciplined Shared Memory
  • Deterministic Parallel Java (DPJ): strong safety properties (OOPSLA ’09)
    • Determinism-by-default, simple semantics
  • DeNovo: performance-, complexity-, and power-efficient hardware (PACT ’11)
    • Simplifies coherence and consistency
  • Both build on disciplined shared memory: explicit effects and structured parallel control

Limitation
  • DeNovo for deterministic programs
    • Important assumptions
      • No conflicting concurrent accesses, only barrier synchronization
      • Known side-effects
    • Allowed DeNovo to eliminate design complexity and inefficiency
  • Challenges for nondeterministic programs
    • The assumptions do not hold any more
      • Can have conflicting concurrent accesses, support lock synchronization
      • Side-effects unknown in critical sections
    • Applications with lock-based non-determinism are common
Contribution
  • Deterministic Parallel Java (DPJ): strong safety properties
    • Determinism-by-default, simple semantics
    • Explicit and safe non-determinism (POPL ’11)
  • Disciplined shared memory: explicit effects and structured parallel control

  • DeNovoND: Non-deterministic codes with benefits of DeNovo
  • Minimal additional HW for non-determinism
  • Comparable performance to MESI
  • 30% lower network traffic than MESI
  • PLUS all advantages of DeNovo for deterministic codes
Outline
  • Motivation
  • Background
    • DPJ/DeNovo for deterministic codes
    • DPJ support for disciplined non-determinism
  • DeNovoND Design
  • DeNovoND Implementation
  • Evaluation
  • Conclusion and Future Work
DPJ for Deterministic Codes


  • Structured parallel control
    • Fork-join parallelism
  • Explicit region and effect
    • Regions divide heap
    • Read or write effects on regions
  • Data-race freedom guarantee
    • Simple, modular type checking
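
The region-and-effect idea above can be illustrated with a small runtime sketch. This is only a toy: DPJ performs the check statically in its type system, and the names `interferes`, `phase_is_race_free`, and the region labels below are hypothetical, not DPJ syntax. Each task declares a read or write effect per region, and a parallel phase is accepted only if no two tasks touch the same region with at least one write.

```python
from itertools import combinations

def interferes(effects_a, effects_b):
    """Two effect summaries interfere if they share a region and either access is a write."""
    for region, kind_a in effects_a.items():
        kind_b = effects_b.get(region)
        if kind_b is not None and "write" in (kind_a, kind_b):
            return True
    return False

def phase_is_race_free(task_effects):
    """A fork-join phase is race-free if no pair of its tasks interferes."""
    return not any(interferes(a, b) for a, b in combinations(task_effects, 2))

# Hypothetical phases: each dict maps a region name to the declared effect.
ok_phase = [{"Left": "write", "Shared": "read"},
            {"Right": "write", "Shared": "read"}]
bad_phase = [{"Left": "write"}, {"Left": "read"}]

print(phase_is_race_free(ok_phase), phase_is_race_free(bad_phase))
```

The disjoint-writer phase passes because the only shared region is read by both tasks; the second phase is rejected because one task writes a region that another reads.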

[Figure: parallel tasks issuing ST and LD accesses; write effects on regions of the heap]

DPJ for Deterministic Codes
  • Java-compatible type system
  • Hardware: simplify coherence problems!

DeNovo for Deterministic Codes
  • Coherence Enforcement
    • Invalidate stale copies in private cache
    • Track up-to-date copy
  • Explicit effects
    • Compiler knows all writeable regions in this parallel phase
    • Cache can self-invalidate before next parallel phase
  • Registration
    • Directory keeps track of one up-to-date copy
    • Writer registers itself before next parallel phase
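
A minimal sketch of the two mechanisms above, self-invalidation and registration, assuming word-granularity state and a single shared L2. The classes `SharedL2` and `L1Cache` are hypothetical names for illustration; the real protocol works on regions and hardware messages, and dropping the previous registrant's copy on a new registration is a simplification made here.

```python
class SharedL2:
    """Keeps either the data or the registered core for each address."""
    def __init__(self):
        self.data = {}        # addr -> value when no core is registered
        self.registrant = {}  # addr -> L1Cache holding the up-to-date copy

    def register(self, addr, cache):
        prev = self.registrant.get(addr)
        if prev is not None and prev is not cache:
            prev.lines.pop(addr, None)  # simplification: old owner drops its copy
        self.registrant[addr] = cache

    def read(self, addr):
        owner = self.registrant.get(addr)
        if owner is not None:
            return owner.lines[addr][1]  # forward the request to the registered core
        return self.data.get(addr, 0)

class L1Cache:
    def __init__(self, l2):
        self.l2 = l2
        self.lines = {}  # addr -> (state, value); a missing entry means Invalid

    def store(self, addr, value):
        if self.lines.get(addr, ("Invalid",))[0] != "Registered":
            self.l2.register(addr, self)  # writer registers itself
        self.lines[addr] = ("Registered", value)

    def load(self, addr):
        if addr not in self.lines:  # miss: fetch the up-to-date copy
            self.lines[addr] = ("Valid", self.l2.read(addr))
        return self.lines[addr][1]

    def self_invalidate(self, written_addrs):
        """Drop Valid copies of data written in the finished phase; keep Registered ones."""
        for addr in written_addrs:
            if self.lines.get(addr, ("Invalid",))[0] == "Valid":
                del self.lines[addr]

X = 0xA
l2 = SharedL2()
core1, core2 = L1Cache(l2), L1Cache(l2)
stale = core2.load(X)            # core2 caches the old value
core1.store(X, 42)               # core1 writes X in the parallel phase
for cache in (core1, core2):     # end of phase: everyone self-invalidates
    cache.self_invalidate({X})
fresh = core2.load(X)            # miss, served by the registered core
```

No invalidation message is ever sent from the writer to the reader: the reader discards its own stale copy because the set of written addresses is known from the declared effects.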
DeNovo for Deterministic Codes
  • No space overhead
    • Keep valid data or registered core id
    • LLC data arrays double as directory
  • No transient states
  • No invalidation traffic
  • No false sharing

[Figure: state transition diagram with states Invalid, Valid, and Registered; transitions labeled Read and Write; the L2 acts as the registry]

Example Run

[Figure: example run with the L1s of Core 1 and Core 2 and a shared L2. X and Y are in a DeNovo region. Labeled events: ST, Registration, Ack, self-invalidate( ). Legend: Registered, Valid, Invalid.]

DPJ Support for Safe Non-Determinism


  • Nondeterminism comes from conflicting concurrent accesses
  • Isolate these accesses as “atomic”
    • Enclosed in “atomic” sections
    • “Atomic” regions and effects
  • “Disciplined” non-determinism
    • Race freedom, strong isolation
    • Determinism-by-default semantics


  • DeNovoND converts “atomic” statements into locks
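
As a rough software analogy for the last point, the sketch below wraps every access to "atomic" data in a lock-protected critical section. It uses Python's standard `threading` module; `AtomicRegion` and `run_phase` are hypothetical names, and this is not how DPJ or DeNovoND implement atomic sections. The order in which tasks enter the section may vary from run to run, but the accesses never race.

```python
import threading

class AtomicRegion:
    """Data with 'atomic' effects: touched only inside critical sections."""
    def __init__(self):
        self._lock = threading.Lock()
        self.order = []

    def atomic_append(self, task_id):
        with self._lock:              # stands in for an "atomic" section
            self.order.append(task_id)

def run_phase(num_tasks=4):
    region = AtomicRegion()
    tasks = [threading.Thread(target=region.atomic_append, args=(i,))
             for i in range(num_tasks)]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()                      # the join plays the role of the phase barrier
    return region.order

result = run_phase(4)
```

The contents of `result` are always the four task ids; only their order is non-deterministic, which is the "disciplined" part of the non-determinism.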
Outline
  • Motivation
  • Background
  • DeNovoND Design
    • Memory Consistency Model
    • Distributed Queue-based Lock
  • DeNovoND Implementation
  • Evaluation
  • Conclusion and Future Work
Memory Consistency Model


  • Deterministic accesses: a load returns the value of the last store from
    • Same task in this parallel phase
    • Or before this parallel phase

[Figure: ST 0xa and LD 0xa within a parallel phase, handled by the DeNovo coherence mechanism]

Memory Consistency Model


  • Non-deterministic accesses: a load returns the value of the last store from
    • Same task in this parallel phase
    • Or before this parallel phase
    • Or in preceding critical sections

[Figure: ST 0xa in a parallel phase, ST 0xa in a critical section, then LD 0xa]

Coherence for non-deterministic data
  • Coherence Enforcement
    • Invalidate stale copies in private cache
    • Track up-to-date copy
  • When to invalidate?
    • Between the start of critical section and any read
  • What to invalidate?
    • Entire cache? Regions with “atomic” effects?
    • Track atomic writes in a signature, transfer with lock
  • Registration
    • Writer updates before next critical section
Distributed Queue-based Lock
  • Lock primitive that works on DeNovoND
    • No directory, no write invalidation → no spinning for lock
  • Modeled after QOSB Lock
    • Lock requests form a distributed queue
    • But much simpler
  • Details in the paper
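
The queue idea can be sketched as a sequential toy model. Everything here is an assumption made for illustration (the `LockHome` and `Core` classes, storing only the queue tail at the lock's home, and forwarding the union of received and newly written addresses); the slides defer the actual design to the paper. Each requester is linked behind the previous tail, and the releaser hands the lock, together with its write signature, directly to its successor, so nobody spins on a shared location.

```python
class LockHome:
    """Lock word at the shared L2: remembers the queue tail and the last signature."""
    def __init__(self):
        self.tail = None
        self.signature = set()

    def enqueue(self, core):
        prev, self.tail = self.tail, core
        return prev

class Core:
    def __init__(self, name, home):
        self.name = name
        self.home = home
        self.holds_lock = False
        self.successor = None
        self.signature = set()   # atomic writes learned at lock transfer

    def _grant(self, signature):
        self.signature = set(signature)
        self.holds_lock = True

    def acquire(self):
        prev = self.home.enqueue(self)
        if prev is None:
            self._grant(self.home.signature)  # lock was free
        else:
            prev.successor = self             # wait in the distributed queue

    def release(self, atomic_writes):
        # Assumption: forward everything received plus this section's writes.
        handoff = self.signature | set(atomic_writes)
        self.holds_lock = False
        if self.successor is not None:
            nxt, self.successor = self.successor, None
            nxt._grant(handoff)               # direct lock transfer
        else:
            self.home.tail = None             # write the lock back to its home
            self.home.signature = handoff

home = LockHome()
a, b, c = Core("a", home), Core("b", home), Core("c", home)
for core in (a, b, c):
    core.acquire()
a.release({"Z"})   # b now holds the lock and knows Z was written
b.release({"W"})   # c now holds the lock and knows Z and W were written
```

A real implementation must also handle a release racing with a late enqueue; the sequential model sidesteps that case.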
Outline
  • Motivation
  • Background
  • DeNovoND Design
  • DeNovoND Implementation
  • Evaluation
  • Conclusion and Future Work
Access Signatures
  • Simple and small hardware Bloom filter per core
    • Track accesses with “atomic” effects only
    • 256 bits suffice
  • Operations on Bloom filter
    • On write: insert address
    • On read: query filter for address for self-invalidation
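
The two filter operations can be sketched in software as follows. The 256-bit size comes from the slide; the class name `AccessSignature`, the use of two hash positions, and the SHA-256-based hashing are assumptions chosen for a compact example (hardware would use much cheaper bit-selection hashes).

```python
import hashlib

class AccessSignature:
    """Toy 256-bit Bloom filter over addresses written with 'atomic' effects."""
    BITS = 256

    def __init__(self):
        self.bits = 0  # a Python int used as a 256-bit vector

    def _positions(self, addr, num_hashes=2):
        digest = hashlib.sha256(addr.to_bytes(8, "little")).digest()
        return [digest[i] % self.BITS for i in range(num_hashes)]

    def insert(self, addr):
        """On write: record the address."""
        for pos in self._positions(addr):
            self.bits |= 1 << pos

    def may_contain(self, addr):
        """On read: True means the cached copy may be stale and must be self-invalidated."""
        return all((self.bits >> pos) & 1 for pos in self._positions(addr))

    def reset(self):
        self.bits = 0

sig = AccessSignature()
sig.insert(0x1000)
needs_invalidate = sig.may_contain(0x1000)
```

A Bloom filter can report false positives but never false negatives, so an over-full signature costs extra self-invalidations (performance) but never lets a stale value through (correctness).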
Example Run

[Figure: example run with the L1s of Core 1 and Core 2 and a shared L2. X and Y are in a DeNovo region; Z and W are in an atomic DeNovo region. Labeled events: ST, LD, Registration, Ack, lock transfer (carrying the signature for Z and W), self-invalidate( ), read miss, reset filter.]

Optimization to reduce self-invalidation

[Figure legend: X and Y in a DeNovo region; Z and W in an atomic DeNovo region]

  • Loads to words in Registered state do not self-invalidate
  • “Touched-atomic” bit
    • Set on first atomic load
    • Subsequent loads don’t self-invalidate
  • More in the paper
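
A minimal sketch of the optimization, under stated assumptions: the signature is modeled as a plain Python set instead of a Bloom filter, `AtomicWordCache` and `fetch` are hypothetical names, and clearing the touched bits at the start of each critical section is a guess, since the slides do not say when the bit is reset.

```python
class AtomicWordCache:
    def __init__(self):
        self.words = {}   # addr -> {"state": ..., "value": ..., "touched": bool}
        self.misses = 0

    def begin_critical_section(self):
        # Assumption: touched bits are cleared when a new critical section starts.
        for word in self.words.values():
            word["touched"] = False

    def atomic_load(self, addr, signature, fetch):
        word = self.words.get(addr)
        if word is not None and (word["state"] == "Registered" or word["touched"]):
            return word["value"]          # own write, or already refreshed: plain hit
        if word is None or addr in signature:
            self.misses += 1              # cold miss, or self-invalidate a stale copy
            word = {"state": "Valid", "value": fetch(addr), "touched": False}
            self.words[addr] = word
        word["touched"] = True            # later loads skip the signature check
        return word["value"]

memory = {0x10: 1}
cache = AtomicWordCache()
first = cache.atomic_load(0x10, set(), memory.__getitem__)     # cold miss

memory[0x10] = 2                          # another core wrote 0x10 atomically
cache.begin_critical_section()
signature = {0x10}                        # received with the lock
second = cache.atomic_load(0x10, signature, memory.__getitem__)  # self-invalidates
third = cache.atomic_load(0x10, signature, memory.__getitem__)   # hit via touched bit
```

Without the touched bit, every load of 0x10 inside the critical section would match the signature and miss again.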

[Figure: access sequence (ST, LD, ..., LD, LD) with a self-invalidate( ) call]

Overheads
  • Hardware Bloom filter
    • 256 bits per core
  • Storage overhead
    • One additional state, but no storage overhead (the state still fits in 2 bits)
    • “Touched-atomic” bit per word in L1
  • Communication overhead
    • Bloom filter piggybacked on lock transfer message
    • Writeback messages for locks
      • Lock writebacks carry more info
Evaluation Methodology
  • Simulator: Simics + GEMS + Garnet
  • System Parameters
    • 16 in-order cores
  • Workloads
    • SPLASH-2, PARSEC and STAMP
    • Unchanged except region/effect and self-invalidation
  • Protocols
    • MESI and DeNovoND
    • With idealized locks and realistic locks
MESI vs. DeNovoND: Idealized lock
  • DeNovoND performs comparably to MESI for all apps
    • For both DIL-INF and DIL-256

[Figure: per-application results for barnes, ocean, water, fluidanimate, streamcluster, tsp, kmeans, and ssca2]

MESI vs. DeNovoND: Realistic lock
  • pthread lock vs. distributed queue-based lock
  • DeNovoND performs comparably to or better than MESI

[Figure: per-application results for barnes, ocean, water, fluidanimate, streamcluster, tsp, kmeans, and ssca2]

Network Traffic (Realistic lock)
  • DeNovoND has 33% less network traffic than MESI on average (up to 67%)
    • No invalidation traffic
    • Reduced load misses due to lack of false sharing

[Figure: per-application network traffic for barnes, ocean, water, fluidanimate, streamcluster, tsp, kmeans, and ssca2]

Conclusions and Future Work
  • DeNovoND: Efficient HW support for non-determinism
    • Minimal additional HW for safe non-determinism
    • Comparable performance to MESI
    • 30% lower network traffic than MESI
    • PLUS all advantages of DeNovo for deterministic codes
  • Future work: broaden the application space further
    • Pipeline parallelism, “lock-free” data structures, OS, legacy codes…