Fence Scoping

This presentation introduces scoped fences (S-Fences) for multiprocessors. The programmer restricts a fence to a scope of memory operations, so the hardware enforces fewer ordering constraints than it would for a traditional fence. The talk also describes the compiler and hardware support needed to implement scoped fences.

Presentation Transcript


  1. Fence Scoping Changhui Lin†, Vijay Nagarajan*, Rajiv Gupta† † University of California, Riverside * University of Edinburgh

  2. Reordering in Uniprocessors • Memory operations are reordered to improve performance • Hardware (e.g., store buffer, reorder buffer) • Compiler (e.g., code motion, caching a value in a register) • No harm as long as dependences are respected • Example: a1: St x; a2: Ld y may execute as a2: Ld y; a1: St x

  3. Reordering in Multiprocessors • Counter-intuitive program behavior • Initially x = y = 0 • P1: a1: x = 1; a2: y = 1; • P2: b1: Ry = y; b2: Rx = x; • Intuitively, y = 1 ⇒ x = 1, so Ry = 1 ⇒ Rx = 1 • Interleaving the operations in program order gives (Rx=0, Ry=0), (Rx=1, Ry=0), or (Rx=1, Ry=1) • (Rx=0, Ry=1) appears only when operations are reordered, e.g., a2: y = 1 before a1: x = 1
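The litmus test on this slide is small enough to check exhaustively. The following is a minimal sketch, not taken from the presentation: it enumerates every interleaving of the two threads and collects the observable (Rx, Ry) pairs. The function name `outcomes` and the `reorder_p1_stores` switch are illustrative assumptions; the switch models P1's two stores becoming visible in the opposite order.

```cpp
#include <cassert>
#include <set>
#include <utility>

using Outcome = std::pair<int, int>;  // (Rx, Ry)

// Enumerate all interleavings of P1 {a1, a2} and P2 {b1, b2}.
// reorder_p1_stores == true models P1 performing y = 1 before x = 1.
std::set<Outcome> outcomes(bool reorder_p1_stores) {
  std::set<Outcome> seen;
  for (unsigned mask = 0; mask < 16; ++mask) {
    // Bit i of mask set means slot i runs P1's next operation.
    int bits = 0;
    for (int s = 0; s < 4; ++s) bits += static_cast<int>((mask >> s) & 1u);
    if (bits != 2) continue;  // each thread executes exactly two operations

    int x = 0, y = 0, Rx = 0, Ry = 0;
    int p1_done = 0, p2_done = 0;
    for (int slot = 0; slot < 4; ++slot) {
      if ((mask >> slot) & 1u) {
        // P1: x = 1 then y = 1 in program order, swapped when reordered.
        const bool writes_x = (p1_done == 0) != reorder_p1_stores;
        if (writes_x) x = 1; else y = 1;
        ++p1_done;
      } else {
        // P2: b1 reads y, then b2 reads x.
        if (p2_done == 0) Ry = y; else Rx = x;
        ++p2_done;
      }
    }
    assert(p1_done == 2 && p2_done == 2);
    seen.insert(Outcome(Rx, Ry));
  }
  return seen;
}
```

Calling `outcomes(false)` yields only the three intuitive results, while `outcomes(true)` also contains (Rx=0, Ry=1), the outcome that violates the expectation Ry = 1 ⇒ Rx = 1.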

  4. Reordering in Multiprocessors • Counter-intuitive program behavior • Initially p = NULL, flag = false • P1: p = new A(…); flag = true; • P2: if (flag) a = p->var; • flag is supposed to be set after p is allocated • If P1's two stores are reordered, P2 can observe flag == true while p is still NULL

  5. Fence Instructions • Memory Consistency Models • Specify what reordering is allowed • e.g., SC, TSO (x86, SPARC), RMO (ARM, PowerPC) • Fence Instructions (Fences/Memory barriers) • Selectively override default relaxed memory order • Order memory operations before and after the fence • Example (P1): p = new A(…); FENCE; flag = true;
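In standard C++ the FENCE in this example can be expressed with `std::atomic_thread_fence`. The sketch below is an illustration under assumptions, not code from the talk: the function names `publish` and `try_read` and the struct layout of `A` are hypothetical. The release fence keeps the store to `p` ahead of the store to `flag`, and the matching acquire fence on the reader side keeps the load of `p` behind the load of `flag`.

```cpp
#include <atomic>
#include <cassert>

struct A { int var; };

std::atomic<A*> p{nullptr};
std::atomic<bool> flag{false};

// P1: allocate the object, then raise the flag.
void publish(int value) {
  assert(!flag.load(std::memory_order_relaxed));  // this sketch publishes once
  p.store(new A{value}, std::memory_order_relaxed);  // never freed in this sketch
  std::atomic_thread_fence(std::memory_order_release);  // FENCE: p before flag
  flag.store(true, std::memory_order_relaxed);
}

// P2: dereference p only after the flag has been observed.
// Returns false if the flag is not set yet.
bool try_read(int& out) {
  if (!flag.load(std::memory_order_relaxed)) return false;
  std::atomic_thread_fence(std::memory_order_acquire);  // flag before p
  out = p.load(std::memory_order_relaxed)->var;
  return true;
}
```

In a real program `publish` and `try_read` would run on different threads. Note that these C++ fences are global in the slides' terminology: they order all earlier and later memory operations of the thread, not only the accesses to `p` and `flag`.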

  6. Fence Instructions • Fences are inevitable when building concurrent implementations (e.g., mutual exclusion, queues) [Attiya et al., POPL’11] • Fences are expensive: Cilk-5’s THE protocol spends 50% of its time executing a memory fence [Frigo et al., PLDI’98]

  7. Motivation • Not all memory orderings enforced by fences are necessary • A fence is usually needed to order only a few specific memory operations • Programmers know best how a fence is used, and that knowledge can be conveyed to the hardware • [Figure labels: Concurrent algorithm, Control Data Access, Process Data]

  8. Scoped Fence (S-Fence) • An S-Fence orders only the memory operations within its scope • Scope definition (class scope, set scope) • Bridges the gap between programmers’ intention and hardware execution • Programmers specify the scope • Scope information is conveyed to hardware, imposing fewer ordering constraints • Lightweight hardware and compiler support

  9. Scoped Fence (S-Fence) • Programming support • S-FENCE: global scope • S-FENCE[class]: class scope • S-FENCE[set, {var1, var2, …}]: set scope

  10. Work-Stealing Queue Algorithm • Chase-Lev lock-free concurrent work-stealing queue

      TASK take() {
        tail = TAIL - 1;
        TAIL = tail;
        FENCE                 // store-load
        head = HEAD;
        if (tail < head) {
          TAIL = head;
          return EMPTY;
        }
        … …
        return task;
      }

      void put(TASK task) {
        tail = TAIL;
        wsq[tail] = task;
        FENCE                 // store-store
        TAIL = tail + 1;
      }

      TASK steal() {
        head = HEAD;
        tail = TAIL;
        … …
        return task;
      }
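To make the two fences concrete, here is a simplified C++ sketch of the queue. It is not the presenters' code: the parts the slide elides with "… …" are filled in following the commonly published C11 formulation of the Chase-Lev deque, and the fixed capacity, the `int` task type, and the `EMPTY`/`ABORT` sentinels are assumptions made to keep the example short.

```cpp
#include <atomic>
#include <cassert>

constexpr int EMPTY = -1;  // hypothetical sentinel: queue is empty
constexpr int ABORT = -2;  // hypothetical sentinel: steal lost a race

class WorkStealingQueue {
  static constexpr long CAP = 1024;  // fixed capacity, no resizing
  std::atomic<long> head_{0};        // HEAD: next slot a thief takes
  std::atomic<long> tail_{0};        // TAIL: next free slot for the owner
  std::atomic<int> wsq_[CAP];

 public:
  void put(int task) {  // owner only
    long tail = tail_.load(std::memory_order_relaxed);
    assert(tail - head_.load(std::memory_order_acquire) < CAP);
    wsq_[tail % CAP].store(task, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);  // store-store
    tail_.store(tail + 1, std::memory_order_relaxed);
  }

  int take() {  // owner only
    long tail = tail_.load(std::memory_order_relaxed) - 1;
    tail_.store(tail, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // store-load
    long head = head_.load(std::memory_order_relaxed);
    if (tail < head) {  // queue was empty: restore TAIL
      tail_.store(head, std::memory_order_relaxed);
      return EMPTY;
    }
    int task = wsq_[tail % CAP].load(std::memory_order_relaxed);
    if (tail == head) {  // last element: race with thieves via CAS on HEAD
      if (!head_.compare_exchange_strong(head, head + 1,
                                         std::memory_order_seq_cst,
                                         std::memory_order_relaxed))
        task = EMPTY;
      tail_.store(tail + 1, std::memory_order_relaxed);
    }
    return task;
  }

  int steal() {  // any other thread
    long head = head_.load(std::memory_order_acquire);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    long tail = tail_.load(std::memory_order_acquire);
    if (head >= tail) return EMPTY;
    int task = wsq_[head % CAP].load(std::memory_order_relaxed);
    if (!head_.compare_exchange_strong(head, head + 1,
                                       std::memory_order_seq_cst,
                                       std::memory_order_relaxed))
      return ABORT;
    return task;
  }
};
```

The owner takes from the tail (LIFO) while thieves steal from the head (FIFO). Both fences here are ordinary global fences, so they also order any unrelated stores the caller issued earlier, which is the cost the scoped fence is meant to avoid.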

  11. Parallel Spanning Tree • (a) Main loop of each worker • (b) Memory operations it executes, with regions marked ①, ②, ③ in the original figure

      (a) task = wsq.take();                   // contains a FENCE
          for (each neighbor task’ of task)
            if (task’ is not processed) {
              process(task’);
              wsq.put(task’);                  // contains a FENCE
            }

      (b) tail = TAIL - 1;                     // take()
          TAIL = tail;
          FENCE
          head = HEAD;
          ……
          color[task’] = label;                // process(task’)
          parent[task’] = task;
          tail = TAIL;                         // put(task’)
          wsq[tail] = task’;
          FENCE
          TAIL = tail + 1;

  12. Class Scope • S-FENCE[class]: class scope • Classes in object-oriented languages are used to illustrate the concept • A fence is constrained to the class in which it is used (encapsulation) • Intuition: member functions operate on the data members of their class

  13. Class Scope • S-FENCE[class]: class scope

      class A {
        B b;
        int m1, m2;
        void funcA() {
          m1 = val1;
          b.funcB();
          S-FENCE1[class]
          m2 = val2;
        }
      }

      class B {
        int n1, n2;
        void funcB() {
          n1 = val3;
          S-FENCE2[class]
          n2 = val4;
        }
      }

      • S-FENCE1 orders: m1, m2, n1, n2 • S-FENCE2 orders: n1, n2

  14. Class Scope Semantics • More details in the paper

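  15. Parallel Spanning Tree • Same example as before, with both fences replaced by S-FENCE[class] • Each fence now orders only the queue's own accesses (HEAD, TAIL, wsq), not the stores to color[] and parent[] • Regions are marked ①, ②, ③ in the original figure

      (a) task = wsq.take();                   // contains an S-FENCE[class]
          for (each neighbor task’ of task)
            if (task’ is not processed) {
              process(task’);
              wsq.put(task’);                  // contains an S-FENCE[class]
            }

      (b) tail = TAIL - 1;                     // take()
          TAIL = tail;
          S-FENCE[class]
          head = HEAD;
          ……
          color[task’] = label;                // process(task’)
          parent[task’] = task;
          tail = TAIL;                         // put(task’)
          wsq[tail] = task’;
          S-FENCE[class]
          TAIL = tail + 1;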

  16. Compiler Support • ISA extension • class-fence • fs_start: start of a fence scope • fs_end: end of a fence scope • The compiler brackets each function that contains a fence with fs_start and fs_end • This tells the hardware which memory operations to mark as in scope
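The bracketing described on this slide can be illustrated in software. In the sketch below, which is a hypothetical model rather than real compiler output, an RAII guard named `FenceScope` stands in for the fs_start/fs_end pair the compiler would emit, and a global `trace` records the resulting instruction stream for the two-class example from the Class Scope slide. All names other than fs_start, fs_end, and class-fence are assumptions.

```cpp
#include <cassert>
#include <string>
#include <vector>

std::vector<std::string> trace;  // hypothetical emitted instruction stream
int scope_depth = 0;             // number of open fence scopes

// Stand-in for what the compiler would wrap around a function with a fence.
struct FenceScope {
  FenceScope() { trace.push_back("fs_start"); ++scope_depth; }
  ~FenceScope() { --scope_depth; trace.push_back("fs_end"); }
};

void store(const std::string& var) { trace.push_back("St " + var); }

void class_fence() {
  assert(scope_depth > 0);  // a class fence only makes sense inside a scope
  trace.push_back("class-fence");
}

struct B {
  void funcB() {
    FenceScope scope;
    store("n1");
    class_fence();  // S-FENCE2[class]
    store("n2");
  }
};

struct A {
  B b;
  void funcA() {
    FenceScope scope;
    store("m1");
    b.funcB();
    class_fence();  // S-FENCE1[class]
    store("m2");
  }
};
```

Running `A{}.funcA()` produces a stream in which the stores to n1 and n2 sit inside both scopes, which is why S-FENCE1 covers m1, m2, n1, and n2 while S-FENCE2 covers only n1 and n2.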

  17. Hardware Support • Fence Scope Bits (FSB) • Each entry of the reorder buffer (ROB) and the store buffer is associated with FSB • FSB flag whether a memory operation is in the scope of some fence • Decoding: memory operations in the scope are marked via FSB • Fence issue: check the FSB entry for the current scope

  19. Hardware Support • Setting Fence Bits • FSS: stack to record scope • [Figure: instructions I0 to I7 inside two nested scopes, fs_start a (outer) … fs_start b (inner) … fs_end b … fs_end a, with FSB columns 0 to 3]

  21. Hardware Support • Setting Fence Bits • FSS: stack to record scope • Issue Fence: by checking FSB on the current scope • [Figure: instructions I0 to I7 inside two nested scopes, fs_start a (outer) … fs_start b (inner) … fs_end b … fs_end a, with FSB columns 0 to 3]
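The interplay of FSS and FSB can be approximated with a small software model. This is a sketch under stated assumptions, not the hardware design from the paper: it assigns FSB bit i to the scope at nesting depth i, marks every decoded memory operation with the bits of all scopes currently on the stack, and lets a class fence wait only for in-flight operations whose bit for the innermost scope is set. The class name `FenceScopeModel` and its methods are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct MemOp {
  std::string name;
  std::uint8_t fsb;  // bit i set: op is inside the scope owning FSB bit i
};

class FenceScopeModel {
 public:
  void fs_start() {
    assert(fss_.size() < 8);  // this model has 8 FSB bits
    fss_.push_back(static_cast<int>(fss_.size()));  // bit index = depth
  }
  void fs_end() {
    assert(!fss_.empty());
    fss_.pop_back();
  }
  // Decode: mark the op with the bits of every enclosing scope.
  void decode(const std::string& name) {
    std::uint8_t fsb = 0;
    for (int bit : fss_) fsb |= static_cast<std::uint8_t>(1u << bit);
    inflight_.push_back({name, fsb});
  }
  // Class fence: wait only for in-flight ops marked for the current scope.
  std::vector<std::string> class_fence_waits_for() const {
    assert(!fss_.empty());
    const std::uint8_t mask = static_cast<std::uint8_t>(1u << fss_.back());
    std::vector<std::string> names;
    for (const MemOp& op : inflight_)
      if (op.fsb & mask) names.push_back(op.name);
    return names;
  }
  // Traditional fence: wait for every in-flight op.
  std::vector<std::string> global_fence_waits_for() const {
    std::vector<std::string> names;
    for (const MemOp& op : inflight_) names.push_back(op.name);
    return names;
  }

 private:
  std::vector<int> fss_;         // fence scope stack
  std::vector<MemOp> inflight_;  // ops still in the ROB or store buffer
};
```

With an out-of-scope store `St A` still in flight, a class fence issued in the inner scope waits only for the inner scope's operations, whereas a traditional fence would also wait for `St A`. Reusing one bit per nesting depth means operations from an earlier sibling scope may be ordered unnecessarily; that is conservative but not incorrect in this model.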

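  23. Why Does S-Fence Perform Better? • Example sequence: St A (a cache miss), St X, FENCE, Ld Y, St B • Traditional fence: the fence is issued only after the store buffer has drained, so Ld Y and St B stall behind the St A miss • Scoped fence: the figure shows Ld Y and St B entering the ROB and store buffer while St A is still outstanding • [Figure: timeline, steps 0 to 4, showing ROB and store buffer (SB) contents for both cases]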

  24. Set Scope • Dekker algorithm • Initially flag1 = flag2 = 0 • P1: m1 = …; flag1 = 1; FENCE; if (flag2 == 0) critical section • P2: m2 = …; flag2 = 1; FENCE; if (flag1 == 0) critical section

  25. Set Scope • Dekker algorithm • Initially flag1 = flag2 = 0 • P1: m1 = …; flag1 = 1; S-FENCE …; if (flag2 == 0) critical section • P2: m2 = …; flag2 = 1; S-FENCE[set, {flag1, flag2}]; if (flag1 == 0) critical section

  26. Set Scope • S-FENCE[set, {var1, var2, …}]: set scope • Orders only memory accesses to {var1, var2, …} • Compiler and hardware support • Flag memory accesses to the specified variables • Set fence scope bits in hardware for flagged memory accesses • For simplicity, memory accesses to different sets are not differentiated
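For comparison, this is how the Dekker-style entry protocol from the preceding slides looks with a standard C++ fence. It is an illustrative sketch: the function names and the `p1_leave` helper are assumptions, and C++ has no set-scoped fence, so the `seq_cst` fence below is global and also waits for the unrelated store to `m1` or `m2`. Under the proposal, S-FENCE[set, {flag1, flag2}] would order only the flag accesses.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> flag1{0}, flag2{0};
int m1 = 0, m2 = 0;  // data unrelated to the mutual-exclusion protocol

// P1: announce intent, then check the other side. Returns true if P1 may
// enter the critical section.
bool p1_try_enter(int data) {
  m1 = data;
  flag1.store(1, std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_seq_cst);  // store-load FENCE
  return flag2.load(std::memory_order_relaxed) == 0;
}

// P2: symmetric to P1.
bool p2_try_enter(int data) {
  m2 = data;
  flag2.store(1, std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_seq_cst);  // store-load FENCE
  return flag1.load(std::memory_order_relaxed) == 0;
}

// Hypothetical exit step, not shown on the slide.
void p1_leave() {
  assert(flag1.load(std::memory_order_relaxed) == 1);
  flag1.store(0, std::memory_order_release);
}
```

Because each side's fence sits between its flag store and its flag load, the two threads cannot both read 0, so at most one of them enters the critical section.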

  27. Experimental Evaluation • Cycle-accurate simulation (SESC) • Integrate scoped fence logic • RMO memory model • Benchmarks • pst - parallel spanning tree (work-stealing queue, class scope) • ptc – parallel transitive closure (work-stealing queue, class scope) • barnes – from SPLASH2 (fences inserted for SC, set scope) • radiosity – from SPLASH2 (fences inserted for SC, set scope)

  28. Experimental Evaluation • Traditional fence (T) vs. scoped fence (S), for the class scope and set scope benchmarks • [Figure annotations: ~13%; fence stall reduced ~50%; ~40-50%]

  29. Conclusion • Introduce the concept of fence scope • Propose class scope and set scope • OpenCL 2.0 (sub-group, work-group, device, system) • Lightweight compiler and hardware support • No change in inter-processor communication • Fence scope should be implemented in some form!
