Eliminating read barriers through procrastination and cleanliness




Eliminating Read Barriers through Procrastination and Cleanliness

KC Sivaramakrishnan

Lukasz Ziarek

Suresh Jagannathan



MultiMLton

  • Deterministic Parallelism

  • Effect Isolation

  • Asynchronous CML (ACML)

  • Parasitic Threads

  • GC?

  • MLton for many-cores

    • Standard ML – functional PL with side-effects

  • Goals – Safe and Scalable programs



MultiMLton - Runtime System

  • User-level threads

  • Preemptive scheduling

  • Work-pushing

Asynchronous CML

Scheduler Substrate

SML

One-shot continuations

Parasitic Threads

VProc

VProc

VProc

VProc

C



Stop-the-world, Serial GC

  • MLton GC → MultiMLton GC quickly

  • Sansom’s “Dual-mode garbage collection”

    • Dynamically switches between 2-space and 1-space collection

    • Cheney’s copying ↔ Jonkers’ sliding mark-compact

    • No fragmentation

    • Bump-pointer allocation

    • Appel’s Generational collection

  • Adding multicore support

    • Memory allocator modified for local allocation

    • GC is still stop-the-world serial



How did we do?



Many-core architectural trends

AMD “Magny-Cours”

48-cores

Tilera Tile64

64-cores

Intel SCC

48-cores

  • Many-core architectural trends

    • NUMA effects

    • Cache coherence


Local collector

(Diagram: one shared heap and four VProcs, each with its own local heap.)


No Synchronization for local allocation/collection!

Local collection is Sansom’s dual-mode collection

Shared heap is not the nth generation



Thread-local collectors

D. Doligez et al. (POPL’93) – SML with threads

R. Jones et al. (SCAM '05) – Java

B. Steensgaard (ISMM ’00) – subset of Java

T. Anderson (ISMM’10) – A variant of MIT’s pH

S. Marlow et al. (ISMM’11) – GHC

S. Auhagen et al. (MSPC’11) – Manticore



Write Barrier

Shared Heap

r := x

r

Target

Exporting writes

Local Heap

Source

x



Write Barrier

Shared Heap

r := x

r

x

Transitive closure of x

Local Heap



Write Barrier

Shared Heap

r := x

r

x

Transitive closure of x

Local Heap

x



Write Barrier

Shared Heap

r := x

r

x

Local Heap

FWD

Mutator needs read barrier

Mutations <<< Reads



Read Barrier Overheads

20.1 %

15.3 %

21.3 %



Read Barrier Statistics

pointer readBarrier (pointer p) {
  /* A forwarded object's first word holds the address of its new copy. */
  if (getHeader(p) == FORWARDED)
    return *(pointer*)p;
  return p;
}

Checks

Forwarded



Eliminate read barriers?

  • No need for read barriers if mutator can never witness forwarded objects

    • Do a local GC every time you export

    • Slower than with read barriers

  • Dynamically ensure mutators never get to see forwarded objects.

    • Procrastination: Exploit program concurrency to delay exporting writes

    • Cleanliness: Object closure cleanliness



New idea: Procrastination

T1

T2

Shared Heap

r1

r2

 r1 := x1

r2 := x2

Local Heap

x1

x2

T

 T is running

T

 T is suspended

T

 T is blocked



Procrastination

T1

T2

Shared Heap

r1

r2

r1 := x1

 r2 := x2

Control switches to T2

Local Heap

x1

x2

T

 T is running

Delayed write list 

T

 T is suspended

T

 T is blocked



Procrastination

T1

T2

Shared Heap

r1

r2

r1 := x1

r2 := x2

Local Heap

x1

x2

T

 T is running

Delayed write list 

T

 T is suspended

T

 T is blocked



Procrastination

T1

T2

Shared Heap

r1

x1

r2

x2

r1 := x1

r2 := x2

Local Heap

x1

T

 T is running

Delayed write list 

T

 T is suspended

T

 T is blocked



Procrastination

T1

T2

Shared Heap

r1

x1

r2

x2

r1 := x1

 …

r2 := x2

Local Heap

Force local GC

T

 T is running

Delayed write list 

T

 T is suspended

T

 T is blocked



Is Procrastination alone enough?

Procrastination depends on the availability of runnable threads at the time of an exporting write

Runnable threads << Total threads (Thread Density)

Perform exporting writes eagerly while preserving the “mutator never sees forwarding pointers” invariant.
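
The procrastination scheme from the preceding slides can be sketched roughly as follows. This is a toy model, not MultiMLton's code: objects are plain ints, the "shared heap" is a small static array, and the names DelayedWriteList, try_delay_write, flush_at_local_gc and lift_toy are all hypothetical. It only illustrates that a write is delayed when another thread is runnable, and that the delayed writes are carried out together at the next local GC.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_DELAYED = 8, SHARED_SLOTS = 8 };

/* One pending exporting write "r := x" issued by a suspended thread. */
typedef struct { int **ref; int *val; int thread_id; } DelayedWrite;
typedef struct { DelayedWrite items[MAX_DELAYED]; size_t count; } DelayedWriteList;

/* Toy shared heap: lifting copies the int into the next free slot. */
static int shared_heap[SHARED_SLOTS];
static size_t shared_used = 0;

static int *lift_toy(const int *local) {
    assert(shared_used < SHARED_SLOTS);
    shared_heap[shared_used] = *local;
    return &shared_heap[shared_used++];
}

/* Returns 1 if the write was delayed (caller then suspends the thread).
   Returns 0 if no other thread is runnable or the list is full, in which
   case the caller must handle the write some other way. */
int try_delay_write(DelayedWriteList *l, int **ref, int *val,
                    int thread_id, size_t other_runnable_threads) {
    if (other_runnable_threads == 0 || l->count == MAX_DELAYED)
        return 0;
    l->items[l->count++] = (DelayedWrite){ ref, val, thread_id };
    return 1;
}

/* At a local GC: lift each delayed value and perform its write.
   Returns the number of writes completed (threads that may resume). */
size_t flush_at_local_gc(DelayedWriteList *l) {
    size_t n = l->count;
    for (size_t i = 0; i < n; i++)
        *l->items[i].ref = lift_toy(l->items[i].val);
    l->count = 0;
    return n;
}
```

Because the lifting happens inside the collection, no forwarding pointer is left behind for a running mutator to observe; the zero return value marks the low-thread-density case the slide points out.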



Exporting write characteristics

  • Sources of exporting writes

    • Immutable >> Mutable

    • Tend to be young

    • References rarely from outside the closure (other than stacks)

  • Object closure cleanliness

    • Heap Sessions (Young objects)

    • Reference counts (Safety of eager export)



Heap Session

Local Heap

Previous Session

Current Session

Free

SessionStart

Frontier

  • Sessions closed/started after a

    • User-level thread switch

    • Exporting write

    • Local GC
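
A minimal sketch of the session bookkeeping, assuming a bump-pointer local heap described only by the SessionStart and Frontier pointers shown on the slide. The struct layout and the function names (LocalHeap, local_alloc, start_new_session, in_current_session) are illustrative assumptions, not the runtime's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    char *session_start; /* first byte of the current session */
    char *frontier;      /* next free byte; allocation bumps this pointer */
    char *limit;         /* one past the end of the local heap */
} LocalHeap;

/* Bump-pointer allocation; returns NULL when the heap is full. */
void *local_alloc(LocalHeap *h, size_t bytes) {
    assert(h->session_start <= h->frontier && h->frontier <= h->limit);
    if ((size_t)(h->limit - h->frontier) < bytes)
        return NULL;
    void *p = h->frontier;
    h->frontier += bytes;
    return p;
}

/* Called after a thread switch, an exporting write, or a local GC:
   everything allocated so far now belongs to a previous session. */
void start_new_session(LocalHeap *h) {
    h->session_start = h->frontier;
}

/* An object is "young" if it lies between SessionStart and Frontier. */
bool in_current_session(const LocalHeap *h, const void *obj) {
    const char *p = obj;
    return p >= h->session_start && p < h->frontier;
}
```

Under these assumptions the membership test costs two pointer comparisons, so it is cheap enough to run inside a write barrier.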



Reference Counting

Local heap allocated object

Object in current session

Count number of references to current session objects

Does not consider references from stacks or registers

Count is one of ZERO, ONE, LOCAL_MANY, GLOBAL
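
One way to picture the four-valued count is as a saturating counter. The sketch below assumes that LOCAL_MANY means "more than one reference, all from objects in the current session" and that GLOBAL means "at least one reference from outside the current session"; the slide does not spell out these meanings, and the function name note_reference is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { ZERO, ONE, LOCAL_MANY, GLOBAL } RefCount;

/* Update the count of a current-session object when a new heap reference
   to it is created. References from stacks and registers are not passed
   to this function at all. */
RefCount note_reference(RefCount c, bool source_in_current_session) {
    assert(c == ZERO || c == ONE || c == LOCAL_MANY || c == GLOBAL);
    if (!source_in_current_session)
        return GLOBAL;            /* assumed meaning: seen from outside */
    switch (c) {
    case ZERO:       return ONE;
    case ONE:        return LOCAL_MANY;
    case LOCAL_MANY: return LOCAL_MANY; /* saturates */
    default:         return GLOBAL;     /* GLOBAL is sticky */
    }
}
```

Since the count never needs more than four states, it would fit in two bits of an object header under this reading.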



Cleanliness

  • An object closure is said to be clean, if for each object O in the closure

    • O is immutable or is in the shared heap. Or,

    • O is the root, and has ZERO references. Or,

    • O is not the root, and has ONE reference. Or,

    • O is not the root, has LOCAL_MANY references, and is in the current session.



  • Boils down to 2 cases:

    • Tree-structured closure

    • Arbitrary Graph
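
The four rules above translate fairly directly into a recursive check. The following is a sketch over a simplified object model: the Obj struct, its fields, and the visited flag used to stop on cycles are assumptions made for illustration, and the traversal is assumed not to descend into objects that already live in the shared heap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { ZERO, ONE, LOCAL_MANY, GLOBAL } RefCount;

typedef struct Obj {
    bool is_mutable;
    bool in_shared_heap;
    bool in_current_session;
    bool visited;          /* scratch flag; caller clears it between checks */
    RefCount rc;
    size_t nfields;
    struct Obj **fields;   /* outgoing pointers, entries may be NULL */
} Obj;

/* The per-object rules from the slide. */
static bool satisfies_rule(const Obj *o, bool is_root) {
    if (!o->is_mutable || o->in_shared_heap) return true;
    if (is_root) return o->rc == ZERO;
    if (o->rc == ONE) return true;
    return o->rc == LOCAL_MANY && o->in_current_session;
}

/* True if every object reachable from o satisfies one of the rules. */
bool is_clean(Obj *o, bool is_root) {
    assert(o != NULL);
    if (o->visited) return true;       /* already checked (graph case) */
    o->visited = true;
    if (!satisfies_rule(o, is_root)) return false;
    if (o->in_shared_heap) return true; /* assumption: do not descend */
    for (size_t i = 0; i < o->nfields; i++)
        if (o->fields[i] != NULL && !is_clean(o->fields[i], false))
            return false;
    return true;
}
```

A closure whose non-root objects all have count ONE is the tree-structured case; objects with LOCAL_MANY correspond to the arbitrary-graph case, which is only accepted while the object is still in the current session.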



Tree-structured closure



Graph – Session Based

Trace current session



Write Barrier

Val writeBarrier (Ref r, Val v) {
  /* Only an exporting write (shared-heap reference, local-heap value) needs work. */
  if (isInSharedHeap (r) && isInLocalHeap (v)) {
    bool needsFixup = false;
    if (isClean (v, &needsFixup))
      v = lift (v, needsFixup);        // lift eagerly
    else
      v = suspendTillGCAndLift (v);    // delay write
  }
  return v;
}



Write Barrier

  • Summary

    • Read barriers are expensive in MultiMLton

    • Eliminate read barriers by preventing the mutator from ever witnessing forwarding pointers




Benchmark Characteristics

Lots of concurrency

Low sharing



Performance on AMD

At 3X:

RB+ 32%

STW 106%

BDW 584%



Performance on AZUL

At 3X:

RB+ 30%



MultiMLton - SCC implementation

Shared heap

Local heap

  • No cache coherence

    • Cluster-on-chip Architecture

  • Private off-die DRAM Regions (one per Core)

    • Caches enabled! One Linux instance per Core!

    • Local heaps reside here

  • Shared / Global off-die DRAM Region

    • Caches disabled per default!

    • Shared heap resides here

  • Shared on-die MPB Regions

    • Cached in L1, L2 Bypass / Fast L1 Invalidation for MPB-Data

    • Coordinating VProcs



Performance on SCC

At 3X:

RB+ 20%



Cleanliness Impact (1)



Cleanliness Impact (2)

Low thread density



Session Impact



Conclusion

  • Local collectors seem to be a good choice for many-core architectures

    • Better Cache Behavior

    • Minimize NUMA effects

    • Overcome cache coherence issues (partially)

  • Read barriers in local collectors can be expensive

  • Eliminate them through procrastination and cleanliness



Backup slides



MLton Heap Layout

From Space (major)

Nursery

Heap

To Space (major)

Old Gen

To Space (minor)

Nursery



MLton GC – Minor Collection

To Space (major)

Old Gen

To Space (minor)

Nursery

To Space (major)

Old Gen

To Space (minor)

Nursery



MLton GC – Major Copying Collection

To Space (major)

Old Gen

Old Gen

To Space (minor)

Nursery

To Space (major)

From Space



MLton GC – Major Mark-Compact

Old Gen

Free

Old Gen

To Space (minor)

Nursery



Read Barrier

Unconditional (Brooks style)

From

From

To

To

Conditional (Baker Style)



Read Barrier

Unconditional (Brooks style): needs extra header word

pointer readBarrier (pointer p) {
  return *(pointer*)(p - IND_OFF);
}

Conditional (Baker style): has conditional check

pointer readBarrier (pointer p) {
  if (*(Header*)(p - HD_OFF) == F)
    return *(pointer*)p;
  return p;
}

(Diagram: From-space and To-space copies of an object; F marks a forwarded header.)



Read Barrier Optimizations

Stacks and registers never point to forwarding pointers

“Eager” read barriers (D. Bacon et al., POPL’03)

Scan stack after exporting write

Exporting write is a GC safe-point

Reduces RB overhead by ~5%



Adding Multi-core support

Nursery

- Reserved

Memory manager sees multiple mutators

Avoid synchronization on allocation

Local Allocation Blocks (LAB) - CAS
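
A minimal sketch of the LAB idea using C11 atomics: each mutator claims a block from the shared nursery with a compare-and-swap, then bump-allocates inside that block with no synchronization. The type and function names (SharedNursery, Lab, lab_refill, lab_alloc) are assumptions for illustration and are not taken from the MultiMLton runtime.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    _Atomic uintptr_t next; /* start of the unclaimed part of the nursery */
    uintptr_t limit;        /* one past the end of the nursery */
} SharedNursery;

typedef struct { uintptr_t frontier, limit; } Lab;

/* Claim lab_size bytes from the nursery with CAS. Returns 1 on success,
   0 if the nursery cannot supply a block of that size. */
int lab_refill(SharedNursery *n, Lab *lab, size_t lab_size) {
    assert(lab_size > 0);
    uintptr_t old = atomic_load(&n->next);
    do {
        if (n->limit - old < lab_size)
            return 0;
        /* On failure, old is reloaded with the current value of next. */
    } while (!atomic_compare_exchange_weak(&n->next, &old, old + lab_size));
    lab->frontier = old;
    lab->limit = old + lab_size;
    return 1;
}

/* Unsynchronized bump allocation inside the thread's own block. */
void *lab_alloc(Lab *lab, size_t bytes) {
    if (lab->limit - lab->frontier < bytes)
        return NULL; /* caller refills the LAB or triggers a GC */
    void *p = (void *)lab->frontier;
    lab->frontier += bytes;
    return p;
}
```

In this arrangement the CAS is paid once per block rather than once per object, which is the point of avoiding synchronization on allocation.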

