
Memory Models: A Case for Rethinking Parallel Languages and Hardware†

Sarita V. Adve

University of Illinois

sadve@illinois.edu

Acks: Mark Hill, Kourosh Gharachorloo, Jeremy Manson, Bill Pugh,

Hans Boehm, Doug Lea, Herb Sutter, Vikram Adve, Rob Bocchino, Marc Snir,

Byn Choi, Rakesh Komuravelli, Hyojin Sung

† Also a paper by S. V. Adve & H.-J. Boehm, To appear in CACM

Memory Consistency Models

Parallelism for the masses!

Shared-memory most common

Memory model = Legal values for reads
20 Years of Memory Models
  • Memory model is at the heart of concurrency semantics
    • A 20-year journey from confusion to convergence at last!
    • Hard lessons learned
    • Implications for future
  • Current way to specify concurrency semantics is too hard
    • Fundamentally broken
  • Must rethink parallel languages and hardware
    • E.g., Illinois Deterministic Parallel Java, DeNovo architecture
What is a Memory Model?
  • Memory model defines what values a read can return

Initially A = B = C = Flag = 0

Thread 1              Thread 2
A = 26                while (Flag != 1) {;}
B = 90                r1 = B
…                     r2 = A
Flag = 1              …

Under SC: once the loop exits, r1 = 90 and r2 = 26; weaker models may also allow r2 = 0
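A minimal Java sketch of this example (field and variable names assumed, not taken from the talk): with plain, non-volatile fields the program has data races, so although SC forces r1 = 90 and r2 = 26 once Thread 2 sees Flag == 1, real compilers and hardware may reorder the accesses and allow r2 = 0, or even keep the spin loop from terminating.

class MemoryModelExample {
    static int A = 0, B = 0, Flag = 0;    // plain fields: the accesses below race

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> {
            A = 26;
            B = 90;
            Flag = 1;                     // "publish" without any synchronization
        });
        Thread t2 = new Thread(() -> {
            while (Flag != 1) { }         // racy spin; the read may be hoisted
            int r1 = B;                   // 90 under SC
            int r2 = A;                   // 26 under SC; weaker models allow 0
            System.out.println("r1=" + r1 + " r2=" + r2);
        });
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}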

Memory Model is Key to Concurrency Semantics
  • Interface between program and transformers of program
    • Defines what values a read can return

C++ program → Compiler → Assembly → Hardware (and dynamic optimizers along the way)

  • Weakest system component exposed to the programmer
    • Language level model has implications for hardware
  • Interface must last beyond trends
Desirable Properties of a Memory Model
  • 3 Ps
    • Programmability
    • Performance
    • Portability
  • Challenge: hard to satisfy all 3 Ps
    • Late 1980s to 90s: Largely driven by hardware
      • Lots of models, little consensus
    • 2000 onwards: Largely driven by languages/compilers
      • Consensus model for Java, C++ (C, others ongoing)
      • Had to deal with mismatches in hardware models

Path to convergence has lessons for future

Programmability – SC [Lamport79]
  • Programmability: Sequential consistency (SC) most intuitive
    • Operations of a single thread in program order
    • All operations appear in a single total order, i.e., atomically
  • But Performance?
    • Recent (complex) hardware techniques boost performance with SC
    • But compiler transformations still inhibited
  • But Portability?
    • Almost all h/w, compilers violate SC today

SC not practical, but…

Next Best Thing – SC Almost Always
  • Parallel programming too hard even with SC
    • Programmers (want to) write well structured code
    • Explicit synchronization, no data races

Thread 1           Thread 2
Lock(L)            Lock(L)
Read Data1         Read Data2
Write Data2        Write Data1
…                  …
Unlock(L)          Unlock(L)

    • SC for such programs much easier: can reorder data accesses
  • Data-race-free model [AdveHill90]
    • SC for data-race-free programs
    • No guarantees for programs with data races
Definition of a Data Race
  • Distinguish between data and non-data (synchronization) accesses
  • Only need to define races for SC executions ⇒ operations form a total order
  • Two memory accesses form a race if
    • From different threads, to same location, at least one is a write
    • Occur one after another (consecutively) in that total order

Thread 1            Thread 2
Write, A, 26
Write, B, 90
                    Read, Flag, 0
Write, Flag, 1
                    Read, Flag, 1
                    Read, B, 90
                    Read, A, 26

  • A race with a data access is a data race
  • Data-race-free program = no data race in any SC execution
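A concrete Java illustration (hedged, not from the slides): the program below has an SC execution in which the write and the read of counter occur one after another with no intervening synchronization, so it is not data-race-free. Declaring counter volatile, or ordering the two accesses with a lock, would turn the race into a synchronization race rather than a data race.

class DataRaceExample {
    static int counter = 0;   // plain field: both accesses below are data accesses

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> counter = 42);                  // write
        Thread reader = new Thread(() -> System.out.println(counter));   // read
        writer.start(); reader.start();   // nothing orders the two accesses
        writer.join(); reader.join();
    }
}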
Data-Race-Free Model

Data-race-free model = SC for data-race-free programs

  • Does not preclude races for wait-free constructs, etc.
    • Requires races be explicitly identified as synchronization
    • E.g., use volatile variables in Java, atomics in C++
  • Dekker’s algorithm

Initially Flag1 = Flag2 = 0
volatile Flag1, Flag2

Thread 1               Thread 2
Flag1 = 1              Flag2 = 1
if Flag2 == 0          if Flag1 == 0
  //critical section     //critical section

SC prohibits both loads returning 0
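A minimal runnable Java sketch of the same idiom (names assumed): because Java volatiles are synchronization accesses and behave SC with respect to each other, the outcome in which both threads read 0 and both enter the critical section is forbidden.

class Dekker {
    static volatile int flag1 = 0, flag2 = 0;   // identified as synchronization

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> {
            flag1 = 1;
            if (flag2 == 0) { System.out.println("T1 in critical section"); }
        });
        Thread t2 = new Thread(() -> {
            flag2 = 1;
            if (flag1 == 0) { System.out.println("T2 in critical section"); }
        });
        t1.start(); t2.start();
        t1.join(); t2.join();   // at most one of the two lines can print
    }
}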

Data-Race-Free Approach
  • Programmer’s model: SC for data-race-free programs
  • Programmability
    • Simplicity of SC, for data-race-free programs
  • Performance
    • Specifies minimal constraints (for SC-centric view)
  • Portability
    • Language must provide way to identify races
    • Hardware must provide way to preserve ordering on races
    • Compiler must translate correctly
1990's in Practice (The Memory Models Mess)
  • Hardware
    • Implementation/performance-centric view
    • Different vendors had different models – most non-SC
      • Alpha, Sun, x86, Itanium, IBM, AMD, HP, Cray, …
    • Various ordering guarantees + fences to impose other orders
    • Many ambiguities - due to complexity, by design(?), …
  • High-level languages
    • Most shared-memory programming with Pthreads, OpenMP
      • Incomplete, ambiguous model specs
      • Memory model is a property of the language, not of a library [Boehm05]
    • Java – commercially successful language with threads
      • Chapter 17 of Java language spec on memory model
      • But hard to interpret, badly broken


2000 – 2004: Java Memory Model
  • ~ 2000: Bill Pugh publicized fatal flaws in Java model
  • Lobbied Sun to form expert group to revise Java model
  • Open process via mailing list
    • Diverse participants
    • Took 5 years of intense, spirited debates
    • Many competing models
    • Final consensus model approved in 2005 for Java 5.0

[MansonPughAdve POPL 2005]

Java Memory Model Highlights
  • Quick agreement that SC for data-race-free was required
  • Missing piece: Semantics for programs with data races
    • Java cannot have undefined semantics for ANY program
    • Must ensure safety/security guarantees
      • Limit damage from data races in untrusted code
  • Goal: Satisfy security/safety, w/ maximum system flexibility
    • Problem: “safety/security, limited damage” w/ threads very vague
Java Memory Model Highlights

Initially X = Y = 0

Thread 1        Thread 2
r1 = X          r2 = Y
Y = r1          X = r2

Is r1 = r2 = 42 allowed?

Data races produce a causality loop!

Definition of a causality loop was surprisingly hard

Common compiler optimizations seem to violate “causality”

Java Memory Model Highlights
  • Final model based on consensus, but complex
    • Programmers can (must) use “SC for data-race-free”
    • But system designers must deal with complexity
    • Correctness tools, racy programs, debuggers, …??
    • Recent discovery of bugs [SevcikAspinall08]
2005 onwards: C++, Microsoft Prism, Multicore
  • ~ 2005: Hans Boehm initiated C++ concurrency model
    • Prior status: no threads in C++, most concurrency w/ Pthreads
  • Microsoft concurrently started its own internal effort
  • C++ easier than Java because it is unsafe
    • Data-race-free is plausible model
  • BUT multicore ⇒ new h/w optimizations, more scrutiny
    • Mismatched h/w and programming views became painfully obvious
    • Debate over whether SC for data-race-free is inefficient with hardware models
Hardware Implications of Data-Race-Free
  • Synchronization (volatiles/atomics) must appear SC
    • Each thread’s synch must appear in program order

synch Flag1, Flag2

Thread 1               Thread 2
Flag1 = 1              Flag2 = 1
Fence                  Fence
if Flag2 == 0          if Flag1 == 0
  critical section       critical section

SC ⇒ both reads cannot return 0

    • Requires efficient fences between synch stores/loads
    • All synchs must appear in a total order (atomic)
Implications of Atomic Synch Writes

Independent reads, independent writes (IRIW):

Initially X = Y = 0

T1          T2          T3             T4
X = 1       Y = 1       … = Y  (1)     … = X  (1)
                        fence          fence
                        … = X  (0)     … = Y  (0)

SC ⇒ no thread sees the new value until old copies are invalidated
(the outcome shown, where T3 and T4 see the two writes in opposite orders, is not SC)

  • Shared caches w/ hyperthreading/multicore make this harder
  • Programmers don’t usually use IRIW
  • Why pay cost for SC in h/w if not useful to s/w?
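A hedged Java sketch of IRIW (variable names assumed): because all volatile accesses must appear in one total order consistent with each thread's program order, T3 and T4 cannot disagree about which write happened first. The outcome r1 = r3 = 1 and r2 = r4 = 0 is therefore forbidden, which is the write atomicity discussed above.

class IRIW {
    static volatile int X = 0, Y = 0;   // synchronization accesses
    static int r1, r2, r3, r4;

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> X = 1);
        Thread t2 = new Thread(() -> Y = 1);
        Thread t3 = new Thread(() -> { r1 = Y; r2 = X; });  // volatile reads act as the fence
        Thread t4 = new Thread(() -> { r3 = X; r4 = Y; });
        Thread[] ts = { t1, t2, t3, t4 };
        for (Thread t : ts) t.start();
        for (Thread t : ts) t.join();
        // Forbidden with volatiles: r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0
        System.out.println(r1 + " " + r2 + " " + r3 + " " + r4);
    }
}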

C++ Challenges

2006: Pressure to change Java/C++ to remove SC baseline

To accommodate some hardware vendors

  • But what is alternative?
    • Must allow some hardware optimizations
    • But must be teachable to undergrads
  • Showed such an alternative (probably) does not exist
C++ Compromise
  • Default C++ model is data-race-free
  • AMD, Intel, … on board
  • But
    • Some systems need expensive fence for SC
    • Some programmers really want more flexibility
      • C++ specifies low-level model only for experts
      • Complicates spec, but only for experts
      • We are not advertising this part
        • [BoehmAdve PLDI 2008]
Summary of Current Status
  • Convergence to “SC for data-race-free” as baseline
  • For programs with data races
    • Minimal but complex semantics for safe languages
    • No semantics for unsafe languages
Lessons Learned

  • SC for data-race-free minimal baseline
  • Specifying semantics for programs with data races is HARD
    • But “no semantics for data races” also has problems
      • Not an option for safe languages; debugging; correctness checking tools
  • Hardware-software mismatch for some code
    • “Simple” optimizations have unintended consequences
  • State-of-the-art is fundamentally broken

Banish shared-memory?

  • We need
    • Higher-level disciplined models that enforce discipline
    • Hardware co-designed with high-level models

Banish wild shared-memory!

Deterministic Parallel Java [V. Adve et al.]

DeNovo hardware

Research Agenda for Languages
  • Disciplined shared-memory models
    • Simple
    • Enforceable
    • Expressive
    • Performance

Key: What discipline?

How to enforce it?

Data-Race-Free
  • A near-term discipline: Data-race-free
  • Enforcement
    • Ideally, language prohibits by design
    • Else, runtime catches as exception
  • But data-race-free still not sufficiently high level
Deterministic-by-Default Parallel Programming
  • Even data-race-free parallel programs are too hard
    • Multiple interleavings due to unordered synchronization (or races)
    • Makes reasoning and testing hard
  • But many algorithms are deterministic
    • Fixed input gives fixed output
    • Standard model for sequential programs
    • Also holds for many transformative parallel programs
      • Parallelism not part of problem specification, only for performance

Why write such an algorithm in non-deterministic style, then struggle to understand and control its behavior?
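For instance, a transformative computation like the one sketched below (standard Java, parallelism via the stream library; the specific computation is only illustrative) is naturally deterministic: a fixed input always produces the same output, and the parallelism exists purely for performance.

import java.util.stream.LongStream;

class DeterministicSum {
    public static void main(String[] args) {
        // Sum of squares of 1..1,000,000: independent work plus an associative
        // combine, so every run and every schedule gives the same answer.
        long sum = LongStream.rangeClosed(1, 1_000_000)
                             .parallel()
                             .map(x -> x * x)
                             .sum();
        System.out.println(sum);
    }
}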

Deterministic-by-Default Model
  • Parallel programs should be deterministic-by-default
    • Sequential semantics (easier than SC!)
  • If non-determinism is needed
    • should be explicitly requested
    • should be isolated from deterministic parts
  • Enforcement:
    • Ideally, language prohibits by design
    • Else, runtime catches violations as exceptions
State-of-the-art
  • Many deterministic languages today
    • Functional, pure data parallel, some domain-specific, …
    • Much recent work on runtime, library-based approaches
  • Our work: Language approach for modern O-O methods
    • Deterministic Parallel Java (DPJ) [V. Adve et al.]
Deterministic Parallel Java (DPJ)
  • Object-oriented type and effect system
    • Use “named” regions to partition the heap
    • Annotate methods with effect summaries: regions read or written
    • If program type-checks, guaranteed deterministic
      • Simple, modular compiler checking
      • No run-time checks today, may add in future
    • Side benefit: regions, effects are valuable documentation
  • Extended sequential subset of Java (DPC++ ongoing)
    • Initial evaluation for expressivity, performance [Oopsla09]
    • Integrating disciplined non-determinism
    • Encapsulating frameworks and unchecked code
    • Semi-automatic tool for effect annotations [ASE09]
Research Agenda for Hardware
  • Current hardware not matched even to current model
  • Near term: ISA changes, speculation
  • Long term: Co-design hardware with new software models
Illinois DeNovo Project
  • Design hardware to exploit disciplined parallelism
    • Simpler hardware
    • Scalable performance
    • Power/energy efficiency
  • Working with DPJ as example disciplined model
    • Exploit data-race-freedom, region/effect information
      • Simpler coherence
      • Efficient communication: point to point, bulk, …
      • Efficient data layout: region vs. cache line centric memory
      • New hardware/software interface
Cache Coherence
  • Commonly accepted definition (software-oblivious)
    • All writes to the same location appear in the same order
    • Source of much complexity
    • Coherence protocols to scale to 1000 cores?
  • What do we really need (software-aware)?
    • Get the right data to the right task at the right time
    • Disciplined models make it easier to determine what is “right”

(Assume only for-each loops)

      • Read must return value of
        • Last write in its own task or
        • Last write in previous for-each loop
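A hedged sketch of that setting in plain Java (parallel for-each phases over disjoint array elements, names assumed): within a phase each task touches only its own element, so every read returns either a value written earlier by the same task or the value written by the previous phase, which is exactly the delivery obligation described above.

import java.util.Arrays;
import java.util.stream.IntStream;

class PhasedForEach {
    public static void main(String[] args) {
        double[] a = new double[8];
        double[] b = new double[8];
        Arrays.fill(a, 1.0);

        // Phase 1: task i writes only b[i], reads only a[i]
        IntStream.range(0, a.length).parallel().forEach(i -> b[i] = 2 * a[i]);

        // Phase 2: task i reads b[i] (written by the previous phase) and writes a[i]
        IntStream.range(0, a.length).parallel().forEach(i -> a[i] = b[i] + 1);

        System.out.println(Arrays.toString(a));   // deterministic: all elements 3.0
    }
}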
Today's Coherence Protocols
  • Snooping
    • Broadcast, ordered networks
  • Directory – avoid broadcast through level of indirection
    • Complexity: Races in protocol
      • Race-free software ⇒ race-free coherence protocol
    • Performance: Level of indirection
      • But definition of coherence no longer requires serialization
    • Overhead: Sharer list
      • Region-effects enable self-invalidations

+ No false sharing, flexible communication granularity, region based data layout

⇒ Simpler, more efficient DeNovo protocol

Conclusions
  • Current way to specify concurrency semantics fundamentally broken
    • Best we can do is SC for data-race-free
      • But cannot hide from programs with data races
    • Mismatched hardware-software
      • Simple optimizations give unintended consequences
  • Need
    • High-level disciplined models that enforce discipline
    • Hardware co-designed with high-level model

DPJ – deterministic-by-default parallel programming

DeNovo – hardware for disciplined parallel programming

  • Previous memory-model convergence came from a similar process
    • But this time, let’s co-design s/w, h/w