Practical non-blocking data structures Tim Harris tim.harris@cl.cam.ac.uk

##### Presentation Transcript

1. Practical non-blocking data structures • Tim Harris • tim.harris@cl.cam.ac.uk • Computer Laboratory

2. Overview • Introduction • Lock-free data structures • Correctness requirements • Linked lists using CAS • Multi-word CAS • Conclusions

3. Introduction • What can go wrong here? class Counter { int next = 0; int getNumber () { int t; t = next; next = t + 1; return t; } } • Thread1: getNumber() reads t = 0 and returns result=0 • Thread2: getNumber() also reads t = 0 and also returns result=0 • next goes from 0 to 1 only, so both callers receive the same number

4. Introduction (2) • What about now? class Counter { int next = 0; synchronized int getNumber () { int t; t = next; next = t + 1; return t; } } • Thread1: getNumber() acquires the lock, reads t = 0, releases the lock and returns result=0 • Thread2: getNumber() then returns result=1 • next goes from 0 to 1 to 2

5. Introduction (3) • Now the problem is liveness, with Thread1 and Thread2 both calling getNumber() on: class Counter { int next = 0; synchronized int getNumber () { int t; t = next; next = t + 1; return t; } } • Priority inversion: thread 1 (holding the lock) is low priority and thread 2 is high priority, but some other thread 3 (of medium priority) prevents thread 1 making any progress • Sharing: suppose that these operations may be invoked both in ordinary code and in interrupt handlers… • Failure: what if thread 1 fails while holding the lock? The lock is still held and the state may be inconsistent

6. Introduction (4) • In this case a non-blocking design is easy: class Counter { int next = 0; int getNumber () { int t; do { t = next; } while (CAS (&next, t, t + 1) != t); return t; } } • CAS (location, expected value, new value) is an atomic compare and swap; as used here it returns the value it found at the location, so the loop retries until that value was t
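The retry loop on this slide can be sketched in runnable Java. This is a minimal sketch, not the author's code: `java.util.concurrent.atomic.AtomicInteger` stands in for the raw `CAS (&next, t, t + 1)` primitive, and its `compareAndSet` reports success as a boolean rather than returning the old value. The `distinctNumbers` helper is hypothetical and exists only to exercise the counter from several threads.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the slide's non-blocking counter. AtomicInteger is an assumption
// of this example; the slide uses a raw CAS on a memory location.
class CasCounter {
    private final AtomicInteger next = new AtomicInteger(0);

    int getNumber() {
        int t;
        do {
            t = next.get();                       // read the current value
        } while (!next.compareAndSet(t, t + 1));  // retry if another thread got in first
        return t;
    }

    // Hypothetical helper: starts `threads` threads that each take `perThread`
    // numbers, then returns how many distinct numbers were handed out.
    static int distinctNumbers(int threads, int perThread) {
        CasCounter counter = new CasCounter();
        Set<Integer> seen = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) seen.add(counter.getNumber());
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return seen.size();
    }
}
```

If no number is ever handed out twice, `CasCounter.distinctNumbers(4, 1000)` yields 4000. A failed `compareAndSet` means some other thread's call succeeded, which is the system-wide progress guarantee that makes the design non-blocking.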

7. Correctness • Safety: we usually want a ‘linearizable’ implementation (Herlihy 1990) • The data structure is only accessed through a well-defined interface • Operations on the data structure appear to occur atomically at some point between invocation and response • Liveness: usually one of two requirements • A ‘wait free’ implementation guarantees per-thread progress • A ‘non-blocking’ implementation guarantees only system-wide progress

8. Overview • Introduction • Linked lists using CAS • Basic list operations • Alternative implementations • Extensions • Multi-word CAS • Conclusions

9. Lists using CAS • Insert 20: [Diagram: list H → 10 → 30 → T; new node 20 points to 30, and a CAS changes 10's next pointer from 30 to 20]

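10. Lists using CAS (2) • Insert 20: [Diagram: list H → 10 → 30 → T; concurrent inserts of 20 and 25 both try to CAS 10's next pointer, one from 30 to 20 and the other from 30 to 25]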

11. Lists using CAS (3) • Delete 10: [Diagram: list H → 10 → 30 → T; a CAS changes H's next pointer from 10 to 30]

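12. Lists using CAS (4) • Delete 10 & insert 20: [Diagram: list H → 10 → 30 → T; the delete CASes H's next pointer from 10 to 30 while the insert CASes 10's next pointer from 30 to 20. Both CASes succeed, yet 20 ends up unreachable from H]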

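13. Logical vs physical deletion • Use a ‘spare’ bit to indicate logically deleted nodes: [Diagram: list H → 10 → 30 → T; 10's next pointer is first marked (30 → 30X) to delete 10 logically, then H's next pointer is CASed from 10 to 30 to delete it physically; the concurrent insert's CAS of 10's next pointer from 30 to 20 fails against the marked pointer]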
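The two-step deletion can be sketched in Java as follows. This is an illustrative sketch under stated assumptions rather than the code behind the slides: `AtomicMarkableReference` supplies the 'spare' bit alongside each next pointer, keys are plain `int`s strictly between `Integer.MIN_VALUE` and `Integer.MAX_VALUE` (used here as sentinels for H and T), and the names `MarkedList`, `search`, `insert`, `delete` and `contains` are chosen for this example.

```java
import java.util.concurrent.atomic.AtomicMarkableReference;

// Sorted lock-free list of ints. A set mark on node.next means
// "this node is logically deleted".
class MarkedList {
    static final class Node {
        final int key;
        final AtomicMarkableReference<Node> next;
        Node(int key, Node next) {
            this.key = key;
            this.next = new AtomicMarkableReference<>(next, false);
        }
    }

    private final Node tail = new Node(Integer.MAX_VALUE, null);
    private final Node head = new Node(Integer.MIN_VALUE, tail);

    // Returns {pred, curr} with pred.key < key <= curr.key, physically
    // unlinking any logically deleted nodes met along the way.
    private Node[] search(int key) {
        retry:
        while (true) {
            Node pred = head;
            Node curr = pred.next.getReference();
            while (true) {
                boolean[] marked = {false};
                Node succ = curr.next.get(marked);
                while (marked[0]) {
                    // curr is logically deleted: try to swing pred.next past it
                    if (!pred.next.compareAndSet(curr, succ, false, false)) continue retry;
                    curr = succ;
                    succ = curr.next.get(marked);
                }
                if (curr.key >= key) return new Node[] {pred, curr};
                pred = curr;
                curr = succ;
            }
        }
    }

    boolean insert(int key) {
        while (true) {
            Node[] window = search(key);
            Node pred = window[0], curr = window[1];
            if (curr.key == key) return false;          // already present
            Node node = new Node(key, curr);
            // Fails if pred.next changed or pred was marked in the meantime
            if (pred.next.compareAndSet(curr, node, false, false)) return true;
        }
    }

    boolean delete(int key) {
        while (true) {
            Node[] window = search(key);
            Node pred = window[0], curr = window[1];
            if (curr.key != key) return false;          // not present
            Node succ = curr.next.getReference();
            // Logical deletion: set the mark while leaving the pointer alone
            if (!curr.next.compareAndSet(succ, succ, false, true)) continue;
            // Physical deletion is best effort; search() cleans up if it fails
            pred.next.compareAndSet(curr, succ, false, false);
            return true;
        }
    }

    boolean contains(int key) {
        Node curr = head;
        while (curr.key < key) curr = curr.next.getReference();
        return curr.key == key && !curr.next.isMarked();
    }
}
```

Because `insert` expects an unmarked pointer in its CAS, an insert after a node that has just been logically deleted fails and retries from a fresh search, which is what rules out the lost-insert interleaving.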

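14. Implementation problems • Also need to consider visibility of updates [Diagram: inserting 20 into H → 10 → 30 → T; a write barrier is needed between initialising node 20 and the CAS of 10's next pointer from 30 to 20]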

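15. Implementation problems (2) • …and the ordering of reads too: while (val < seek) { p = p->next; val = p->val; } [Diagram: traversal of H → 10 → 30 → T while 20 is being inserted; without ordered reads the value loaded from the new node is undefined (val = ???)]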

16. Overview • Introduction • Linked lists using CAS • Multi-word CAS • Design • Results • Conclusions

17. Multi-word CAS • Atomic read-modify-write to a set of locations • A useful building block: • Many existing designs (queues, stacks, etc) use CAS2 directly (e.g. Detlefs ’00) • More generally it can be used to move a structure between consistent states • We’d like it to be non-blocking, disjoint-access parallel, linearizable, and efficient with natural data
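To pin down what "atomic read-modify-write to a set of locations" means, the following sketch gives reference semantics only. It uses a single lock, so it is deliberately not the non-blocking design developed in these slides; it just states what any CASn implementation must appear to do. The class name `CasnSpec`, the use of array indices as addresses and the `long` word type are assumptions made for this example.

```java
// Reference semantics for multi-word CAS. The global lock (synchronized)
// is a stand-in for atomicity and is NOT how the non-blocking design works.
class CasnSpec {
    private final long[] memory;

    CasnSpec(int words) {
        memory = new long[words];   // all words start at 0
    }

    synchronized long read(int addr) {
        return memory[addr];
    }

    // Atomically: if every memory[addrs[i]] equals expected[i], store
    // updates[i] into each location and return true; otherwise change
    // nothing and return false.
    synchronized boolean casn(int[] addrs, long[] expected, long[] updates) {
        for (int i = 0; i < addrs.length; i++) {
            if (memory[addrs[i]] != expected[i]) return false;
        }
        for (int i = 0; i < addrs.length; i++) {
            memory[addrs[i]] = updates[i];
        }
        return true;
    }
}
```

The all-or-nothing behaviour is the point: a caller never observes some locations updated and others not, which is what lets CASn move a structure between consistent states.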

18. Previous work • Lots of designs… • …none of them practicable [Table comparing previous designs; columns: Parallel, Requires, Reserved bits; parameters: p processors, word size w, max n locations, max a addresses]

19. Design • Build descriptor: status=UNDECIDED, locations=2, a1=0x10C o1=0x110 n1=0x118, a2=0x114 o2=0x118 n2=<null> • Acquire locations: DCSS (&status, UNDECIDED, 0x10C, 0x110, &descriptor); DCSS (&status, UNDECIDED, 0x114, 0x118, &descriptor) • Decide outcome: CAS (&status, UNDECIDED, SUCCEEDED), giving status=SUCCEEDED • Release locations: CAS (0x10C, &descriptor, 0x118); CAS (0x114, &descriptor, null) [Diagram: memory at 0x100 to 0x11C holding H at 0x100, node 10 at 0x108 with its next pointer at 0x10C, node 20 at 0x110 with its next pointer at 0x114, and T at 0x118]

20. Reading • word_t read (addr_t a) { word_t val = *a; if (!isDescriptor(val)) return val; else if (the descriptor's status is SUCCEEDED) return new value; else return old value; } [Diagram: memory at 0x100 to 0x11C holding H at 0x100, node 10 at 0x108, node 20 at 0x110 and T at 0x118; descriptor with status=UNDECIDED, locations=2, a1=0x10C o1=0x110 n1=0x118, a2=0x114 o2=0x118 n2=<null>]

21. Whither DCSS? • Now we need DCSS from CAS: • Easier than full CAS2: the locations used for ‘control’ and ‘update’ addresses must not overlap, only the ‘update’ address may be changed + we don’t need the result • DCSS (&status, UNDECIDED, 0x10C, 0x110, &descriptor): first CAS (0x10C, 0x110, &DCSSDescriptor); then if (*0x200 == 0) CAS (0x10C, &DCSSDescriptor, 0x200) else CAS (0x10C, &DCSSDescriptor, 0x110) • DCSS descriptor: ac=0x200 oc=0 au=0x10C ou=0x110 nu=0x200 [Diagram: node 10 at 0x108 with its next pointer at 0x10C]
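The restricted operation described on this slide (compare a control word and an update word, but write only the update word and return nothing) can be stated as reference semantics. As with any lock-based specification, this sketch is not the CAS-based construction from the slide; it only fixes the behaviour that construction must appear to have. The class name `DcssSpec`, the array-index addressing and the parameter names are hypothetical.

```java
// Reference semantics for DCSS (double-compare single-swap). The lock is a
// stand-in for atomicity; the slide builds the same effect from plain CAS.
class DcssSpec {
    private final long[] memory;

    DcssSpec(long[] initial) {
        memory = initial.clone();
    }

    synchronized long read(int addr) {
        return memory[addr];
    }

    // Atomically: if memory[controlAddr] == controlExpected and
    // memory[updateAddr] == updateOld, store updateNew at updateAddr.
    // The control word is never written and no result is returned.
    synchronized void dcss(int controlAddr, long controlExpected,
                           int updateAddr, long updateOld, long updateNew) {
        if (controlAddr == updateAddr) {
            throw new IllegalArgumentException("control and update addresses must not overlap");
        }
        if (memory[controlAddr] == controlExpected && memory[updateAddr] == updateOld) {
            memory[updateAddr] = updateNew;
        }
    }
}
```

In the CASn design the control word is the descriptor's status, so an acquire step takes effect only while the operation is still UNDECIDED.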

22. Evaluation: method • Attempt to permute elements in a vector. Can control: • Level of concurrency • Length of the vector • Number of elements being permuted • Padding between elements • Management of descriptors [Diagram: example vector 60 54 76 43 6 45 23]

23. Evaluation: small systems • gargantubrain.cl: 4-processor IA-64 (Itanium) • Vector=1024, Width=2-64, No padding [Chart: time per successful update against CASn width (words permuted per update), one series per algorithm used]

24. Evaluation: large systems • hodgkin.hpcf: 64-processor Origin-2000, MIPS R12000 • Vector=1024, Width=2 • One element per cache line [Chart: ms per successful update against number of processors; series MCS, IR, HF-RC]

25. Overview • Introduction • Linked lists using CAS • Multi-word CAS • Conclusions

26. Conclusions • Some general techniques • The descriptor pointers serve two purposes: • They allow ‘helpers’ to find out the information needed to complete their work. • They indicate ownership of locations • Correctness seems clearest when thinking about the state of the shared memory, not the state of individual threads • Unlike previous work we need only a small and constant number of reserved bits (e.g. 2 to identify descriptor pointers if there’s no type information available at run time)

27. Conclusions (2) • Our scheme is the first practical one: • Can operate on general pointer-based data structures • Competitive with lock-based schemes • Can operate on highly parallel systems • Disjoint-access parallel, non-blocking, linearizable http://www.cl.cam.ac.uk/~tlh20/papers/hfp-casn-submitted.pdf