Eliminating read barriers through procrastination and cleanliness

KC Sivaramakrishnan

Lukasz Ziarek

Suresh Jagannathan


MultiMLton

  • Deterministic Parallelism

  • Effect Isolation

  • Asynchronous CML (ACML)

  • Parasitic Threads

  • GC?

  • MLton for many-cores

    • Standard ML – functional PL with side-effects

  • Goals – Safe and Scalable programs


MultiMLton - Runtime System

  • User-level threads

  • Preemptive scheduling

  • Work-pushing

[Figure: runtime system stack. Labels: Asynchronous CML, Scheduler Substrate, One-shot continuations, Parasitic Threads, SML, four VProcs, C]


Stop-the-world, Serial GC

  • MLton GC → MultiMLton GC quickly

  • Sansom’s “Dual-mode garbage collection”

    • Dynamically switch between 2-space and 1-space

    • Cheney’s copying ↔ Jonkers’ sliding mark-compact

    • No fragmentation

    • Bump-pointer allocation

    • Appel’s Generational collection

  • Adding multicore support

    • Memory allocator modified for local allocation

    • GC is still stop-the-world serial
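
Both collection modes listed above compact live data, which is what makes bump-pointer allocation possible. The following is a minimal sketch of that allocation scheme, not MLton's allocator: the `Space` type, the function names and the 8-byte alignment are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical contiguous allocation space (not MLton's real layout). */
typedef struct {
    uint8_t *frontier;  /* next free byte */
    uint8_t *limit;     /* one past the last usable byte */
} Space;

/* Set up a space over caller-provided memory. */
static void space_init(Space *s, uint8_t *mem, size_t size) {
    s->frontier = mem;
    s->limit = mem + size;
}

/* Bump-pointer allocation: hand out the bytes at the frontier and advance it.
 * Returns NULL when the space is full; a real runtime would collect instead. */
static void *bump_alloc(Space *s, size_t bytes) {
    assert(s->frontier <= s->limit);
    size_t aligned = (bytes + 7u) & ~(size_t)7u;  /* round up to 8 bytes (assumed) */
    if (aligned < bytes || (size_t)(s->limit - s->frontier) < aligned)
        return NULL;
    void *result = s->frontier;
    s->frontier += aligned;
    return result;
}
```

Allocation costs one bounds check and one pointer increment. This only works while the free region stays contiguous, which is why a copying or sliding-compacting collector pairs naturally with it.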


How did we do?


Many-core architectural trends

  • AMD “Magny-Cours”: 48 cores

  • Tilera Tile64: 64 cores

  • Intel SCC: 48 cores

  • Many-core architectural trends

    • NUMA effects

    • Cache coherence


Local collector

[Figure: one shared heap and four local heaps, one local heap per VProc]


No Synchronization for local allocation/collection!

Local collection is Sansom’s Dual mode

Shared heap is not the nth generation


Thread-local collectors

D. Doligez et al. (POPL ’93) – SML with threads

R. Jones et al. (SCAM ’05) – Java

B. Steensgaard (ISMM ’00) – subset of Java

T. Anderson (ISMM ’10) – A variant of MIT’s pH

S. Marlow et al. (ISMM ’11) – GHC

S. Auhagen et al. (MSPC ’11) – Manticore


Write Barrier

[Figure sequence: an exporting write r := x, where the target r is in the shared heap and the source x is in the local heap. The transitive closure of x is copied to the shared heap, and the local copy of x is left as a forwarding pointer (FWD).]

Mutator needs read barrier

Mutations <<< Reads
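
The reason the mutator needs a read barrier after an exporting write can be sketched in a few lines of C. This is a toy model under stated assumptions, not MultiMLton's object layout: the `Obj` struct, its two-field shape, the `in_shared` flag and the use of `malloc` as a stand-in for the shared heap are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum Header { NORMAL, FORWARDED };

/* Toy object: at most two pointer fields (assumed layout). */
typedef struct Obj {
    enum Header header;
    int in_shared;          /* 1 if the object lives in the shared heap */
    int value;
    struct Obj *fields[2];  /* fields[0] holds the forwarding address once FORWARDED */
} Obj;

/* Conditional read barrier: follow the forwarding pointer if there is one. */
static Obj *read_barrier(Obj *p) {
    if (p != NULL && p->header == FORWARDED)
        return p->fields[0];
    return p;
}

/* Copy the transitive closure of o to the "shared heap" (malloc here),
 * leaving a forwarding pointer in every local object that was copied. */
static Obj *lift(Obj *o) {
    if (o == NULL) return NULL;
    if (o->header == FORWARDED) return o->fields[0];  /* already copied */
    if (o->in_shared) return o;                       /* nothing to do */
    Obj *copy = malloc(sizeof *copy);
    assert(copy != NULL);                             /* sketch: no recovery path */
    *copy = *o;
    copy->in_shared = 1;
    Obj *kids[2] = { o->fields[0], o->fields[1] };
    /* Forward before recursing so shared or cyclic structure is copied once. */
    o->header = FORWARDED;
    o->fields[0] = copy;
    o->fields[1] = NULL;
    for (int i = 0; i < 2; i++)
        copy->fields[i] = lift(kids[i]);
    return copy;
}
```

After `lift`, any stale local reference still points at the forwarded shell, so every read through it has to go via `read_barrier`. That per-read check is the cost the rest of the talk sets out to remove.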


Read Barrier Overheads

[Chart: three read barrier overhead figures, 20.1 %, 15.3 % and 21.3 %; their labels are not preserved in this transcript]


Read Barrier Statistics

pointer readBarrier (pointer p) {
  if (getHeader(p) == FORWARDED)   // object has been moved
    return *(pointer*)p;           // first word holds the forwarding address
  return p;
}

[Chart: read barrier checks performed vs. forwarded objects encountered]


Eliminate read barriers?

  • No need for read barriers if mutator can never witness forwarded objects

    • Do a local GC every time you export

    • Slower than with read barriers

  • Dynamically ensure mutators never get to see forwarded objects.

    • Procrastination: Exploit program concurrency to delay exporting writes

    • Cleanliness: Object closure cleanliness


New idea: Procrastination

[Figure sequence: threads T1 and T2 share one local heap. T1 issues the exporting write r1 := x1 and T2 issues r2 := x2, with r1 and r2 in the shared heap and x1 and x2 in the local heap. Legend: each thread T is shown as running, suspended or blocked.]

  • T1 reaches r1 := x1: the write goes on the delayed write list and control switches to T2

  • T2 reaches r2 := x2: this write also goes on the delayed write list

  • Force local GC: x1 and x2 end up in the shared heap, reachable from r1 and r2
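
The control flow of procrastination can be sketched as a small state machine. This is an illustrative model only: the `Core` and `DelayedWrite` types, the fixed-size list and the policy of forcing a collection as soon as no other thread is runnable are assumptions, and the actual lifting of objects during the local GC is reduced to performing the pending stores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DELAYED 8

/* One pending exporting write (hypothetical representation). */
typedef struct { int thread_id; void **ref; void *value; } DelayedWrite;

typedef struct {
    DelayedWrite delayed[MAX_DELAYED];
    size_t n_delayed;
    size_t runnable;  /* runnable threads on this core, including the current one */
    size_t gcs;       /* number of forced local GCs */
} Core;

/* Stand-in for a local GC: perform every delayed write and unblock its thread.
 * A real collector would store the lifted (shared-heap) copy of each value. */
static void force_local_gc(Core *c) {
    for (size_t i = 0; i < c->n_delayed; i++)
        *c->delayed[i].ref = c->delayed[i].value;
    c->runnable += c->n_delayed;
    c->n_delayed = 0;
    c->gcs++;
}

/* Called at an exporting write. Returns true if the write was delayed and the
 * calling thread must block; false if it was carried out (after a local GC). */
static bool exporting_write(Core *c, int tid, void **ref, void *value) {
    assert(c->runnable >= 1);
    if (c->runnable > 1 && c->n_delayed < MAX_DELAYED) {
        c->delayed[c->n_delayed++] = (DelayedWrite){ tid, ref, value };
        c->runnable--;  /* current thread blocks; the scheduler switches away */
        return true;
    }
    force_local_gc(c);  /* nobody else to run: stop procrastinating */
    *ref = value;
    return false;
}
```

With two runnable threads, the first exporting write is parked and the second one triggers a single collection that completes both, so no thread ever runs while a forwarding pointer is visible to it.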


Is Procrastination alone enough?

Procrastination depends on the availability of runnable threads at the time of an exporting write

Runnable threads << Total threads (Thread Density)

Eager exporting writes that preserve the “mutator never sees forwarding pointers” invariant.


Exporting write characteristics

  • Sources of exporting writes

    • Immutable >> Mutable

    • Tend to be young

    • References rarely from outside the closure (other than stacks)

  • Object closure cleanliness

    • Heap Sessions (Young objects)

    • Reference counts (Safety of eager export)


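Heap Session

[Figure: the local heap divided into Previous Session, Current Session and Free regions; SessionStart marks the beginning of the current session and Frontier marks the boundary with the free region]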

  • Sessions closed/started after a

    • User-level thread switch

    • Exporting write

    • Local GC
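
Because a session is just the address range between SessionStart and Frontier, membership can be tested with two comparisons. The sketch below assumes a hypothetical `LocalHeap` record with those two markers plus a limit; the names are illustrative and not taken from the MultiMLton sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical local-heap bookkeeping. */
typedef struct {
    uintptr_t session_start;  /* SessionStart: first address of the current session */
    uintptr_t frontier;       /* Frontier: next free address */
    uintptr_t limit;          /* end of the local heap */
} LocalHeap;

/* An object is in the current session iff it was allocated since SessionStart. */
static bool in_current_session(const LocalHeap *h, uintptr_t addr) {
    assert(h->session_start <= h->frontier);
    return addr >= h->session_start && addr < h->frontier;
}

/* Close the current session and open a new, empty one. Per the slide this
 * happens after a user-level thread switch, an exporting write, or a local GC. */
static void start_new_session(LocalHeap *h) {
    h->session_start = h->frontier;
}

/* Bump allocation inside the local heap; returns 0 when the heap is full. */
static uintptr_t allocate(LocalHeap *h, uintptr_t bytes) {
    if (h->limit - h->frontier < bytes)
        return 0;
    uintptr_t addr = h->frontier;
    h->frontier += bytes;
    return addr;
}
```

Anything allocated before the last call to `start_new_session` falls into the previous session and fails the test, regardless of how recently it was touched.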


Reference Counting

[Figure legend: local heap allocated object; object in current session]

Count number of references to current session objects

Does not consider references from stacks or registers

Count is one of ZERO, ONE, LOCAL_MANY, GLOBAL
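
The four-valued count can be kept in a couple of header bits and updated with a saturating transition. The slide only names the four values; the transition rule below (local references step ZERO to ONE to LOCAL_MANY, any reference from the shared heap makes the count GLOBAL and it stays there) is an assumption for illustration, as is the function name.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { ZERO, ONE, LOCAL_MANY, GLOBAL } RefCount;

/* Assumed update rule, applied when a new heap reference to the object is
 * created. References from stacks and registers are never counted. */
static RefCount note_reference(RefCount current, bool from_shared_heap) {
    assert(current <= GLOBAL);
    if (current == GLOBAL || from_shared_heap)
        return GLOBAL;         /* sticky: visible outside this local heap */
    if (current == ZERO)
        return ONE;
    return LOCAL_MANY;         /* ONE or LOCAL_MANY saturate here */
}
```

The count never decreases in this sketch, so it is a conservative summary rather than an exact reference count.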


Cleanliness

  • An object closure is said to be clean, if for each object O in the closure

    • O is immutable or is in the shared heap. Or,

    • O is the root, and has ZERO references. Or,

    • O is not the root, and has ONE reference. Or,

    • O is not the root, has LOCAL_MANY references, and is in the current session.


  • Boils down to 2 cases:

    • Tree-structured closure

    • Arbitrary Graph
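
The definition of a clean closure translates almost line for line into a predicate. The sketch below assumes the closure has already been flattened into an array of per-object facts with the root at index 0; the `ObjInfo` record and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { ZERO, ONE, LOCAL_MANY, GLOBAL } RefCount;

/* Per-object facts needed by the check (hypothetical record). */
typedef struct {
    bool is_mutable;
    bool in_shared_heap;
    bool in_current_session;
    RefCount refs;
} ObjInfo;

/* One object satisfies the cleanliness conditions from the slide. */
static bool object_is_clean(const ObjInfo *o, bool is_root) {
    if (!o->is_mutable || o->in_shared_heap)
        return true;                                   /* immutable or already shared */
    if (is_root && o->refs == ZERO)
        return true;                                   /* root with no heap references */
    if (!is_root && o->refs == ONE)
        return true;                                   /* tree-structured closure */
    if (!is_root && o->refs == LOCAL_MANY && o->in_current_session)
        return true;                                   /* graph within the current session */
    return false;
}

/* A closure is clean iff every object in it is; objs[0] is taken as the root. */
static bool closure_is_clean(const ObjInfo *objs, size_t n) {
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        if (!object_is_clean(&objs[i], i == 0))
            return false;
    return true;
}
```

The third condition covers the tree-structured case and the fourth covers the arbitrary-graph case, where incoming references can only come from the current session and can therefore be found by tracing it.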



Graph – Session Based

Trace current session


Write Barrier

Val writeBarrier (Ref r, Val v) {
  if (isInSharedHeap (r) && isInLocalHeap (v)) {   // exporting write
    bool needsFixup = false;
    if (isClean (v, &needsFixup))
      v = lift (v, needsFixup);                    // clean closure: lift eagerly
    else
      v = suspendTillGCAndLift (v);                // not clean: delay the write
  }
  return v;
}


Write Barrier

  • Summary

    • Read barriers are expensive in MultiMLton

    • Eliminate read barriers by preventing the mutator from ever witnessing forwarding pointers


Benchmark Characteristics

Lots of concurrency

Low sharing


Performance on AMD

At 3X:

  • RB+ 32%

  • STW 106%

  • BDW 584%


Performance on AZUL

At 3X:

  • RB+ 30%


MultiMLton - SCC implementation

[Figure: placement of the shared heap and the local heaps on the SCC]

  • No cache coherence

    • Cluster-on-chip Architecture

  • Private off-die DRAM Regions (one per Core)

    • Caches enabled! One Linux instance per Core!

    • Local heaps reside here

  • Shared / Global off-die DRAM Region

    • Caches disabled per default!

    • Shared heap resides here

  • Shared on-die MPB Regions

    • Cached in L1, L2 Bypass / Fast L1 Invalidation for MPB-Data

    • Coordinating VProcs


Performance on SCC

At 3X:

  • RB+ 20%



Cleanliness Impact (2)

Low thread density


Session Impact


Conclusion

  • Local collectors seem to be a good choice for many-core architectures

    • Better Cache Behavior

    • Minimize NUMA effects

    • Overcome cache coherence issues (partially)

  • Read barriers in local collectors can be expensive

  • Eliminate them through procrastination and cleanliness


Backup slides


MLton Heap Layout

[Figure: MLton heap layout. Labels: Heap, From Space (major), To Space (major), Old Gen, To Space (minor), Nursery]


MLton GC – Minor Collection

[Figure: heap during a minor collection. Labels: To Space (major), Old Gen, To Space (minor), Nursery]


MLton GC – Major Copying Collection

[Figure: heap during a major copying collection. Labels: From Space, To Space (major), Old Gen, To Space (minor), Nursery]


MLton GC – Major Mark-Compact

[Figure: heap during a major mark-compact collection. Labels: Old Gen, Free, To Space (minor), Nursery]


Read Barrier

Unconditional (Brooks style): needs extra header word

pointer readBarrier (pointer p) {
  return *(pointer*)(p - IND_OFF);       // always follow the indirection word
}

Conditional (Baker style): has conditional check

pointer readBarrier (pointer p) {
  if (*(Header*)(p - HD_OFF) == F)       // header marks the object as forwarded
    return *(pointer*)p;                 // first word holds the forwarding address
  return p;
}

[Figure: From and To spaces for each style; F marks a forwarded object]


Read Barrier Optimizations

Stacks and registers never point to forwarding pointers

“Eager” read barriers (D. Bacon et al. POPL ’03)

Scan stack after exporting write

Exporting write is a GC safe-point

Reduces RB overhead by ~5%


Adding Multi-core support

[Figure: nursery with reserved regions marked]

Memory manager sees multiple mutators

Avoid synchronization on allocation

Local Allocation Blocks (LAB) - CAS
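
One common way to get synchronization-free allocation with a CAS is to have each mutator reserve a whole block from the shared nursery atomically and then bump-allocate inside it privately. The sketch below uses C11 atomics and is an assumption about how such a scheme can look, not a description of the MultiMLton code; `Nursery`, `Lab`, `lab_reserve` and the 4096-byte block size are all hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define LAB_SIZE ((uintptr_t)4096)  /* assumed block size */

/* Shared nursery: `next` is the only field mutators contend on. */
typedef struct {
    _Atomic uintptr_t next;  /* first address not yet reserved */
    uintptr_t limit;         /* end of the nursery */
} Nursery;

/* A mutator's private Local Allocation Block. */
typedef struct {
    uintptr_t frontier;
    uintptr_t limit;
} Lab;

/* Reserve one LAB with a compare-and-swap loop.
 * Returns false when the nursery has no room for another block. */
static bool lab_reserve(Nursery *n, Lab *lab) {
    uintptr_t old = atomic_load(&n->next);
    do {
        assert(old <= n->limit);
        if (n->limit - old < LAB_SIZE)
            return false;
        /* On failure the CAS reloads `old` with the current value of next. */
    } while (!atomic_compare_exchange_weak(&n->next, &old, old + LAB_SIZE));
    lab->frontier = old;
    lab->limit = old + LAB_SIZE;
    return true;
}
```

The CAS is paid once per block rather than once per object, so allocation inside a reserved block needs no synchronization at all.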

