
Multicore programming



  1. Multicore programming Course website: http://tbrown.pro/cs798 Hash Table Expansion, Linked Data Structures Lecture 8 Trevor Brown

  2. Announcements • A4 is still in the works… • A2 and A3 grades soon… • Only small parts left ungraded. I’m the bottleneck…

  3. Last time • Probing vs chaining • Hash function quality • Started hash table expansion • This time: • Finishing hash table expansion • Starting linked data structures

  4. Hash table expansion Clarifying and finishing up after last time

  5. Rough implementation sketch • Atomic pointer to the current table:

     struct hashmap {
         char padding0[64];
         atomic<table*> currentTable;
         char padding1[64];
         /* code for operations ... */
     };

     struct table {
         char padding0[64];
         atomic<uint64_t>* data;
         atomic<uint64_t>* old;      // old stays around so expansion can be done
         int capacity;
         int oldCapacity;
         counter* approxSize;
         atomic<int> chunksClaimed;
         atomic<int> chunksDone;
         char padding1[64];
     };

     Erratum: changed since last lecture!
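The 64-byte padding fields in the sketch exist to keep the hot currentTable pointer on its own cache line, so threads reading it do not suffer false sharing with neighboring data. A minimal compilable sketch of the same layout (the stub table struct is just a placeholder for the slide's full definition):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Stand-in for the slide's full table struct; only the layout matters here.
struct table {
    int capacity;
};

// 64 bytes of padding on each side of the atomic pointer means no other
// object's fields can share a cache line with currentTable (assuming
// 64-byte cache lines, which is typical on x86).
struct hashmap {
    char padding0[64];
    std::atomic<table*> currentTable;
    char padding1[64];
};
```

On platforms with larger cache lines, `std::hardware_destructive_interference_size` (C++17) can replace the hard-coded 64.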

  6. Recall from last time • Check if we need to expand, and start expansion as necessary, or help an ongoing expansion. If we start or help expansion, retry our insert (in the new table).

     bool hashmap::insert(int key) {
         table* t = currentTable;
         int h = hash(key);
         for (int i = 0; i < t->capacity; ++i) {
             // found evidence of expansion: restart to help / get into the new table
             if (expandAsNeeded(t, i)) return insert(key);
             int index = (h + i) % t->capacity;
             int found = t->data[index];
             if (found & MARKED_MASK) return insert(key);
             else if (found == key) return false;
             else if (found == NULL) {
                 if (CAS(&t->data[index], NULL, key)) return true;
                 else {
                     found = t->data[index];
                     if (found & MARKED_MASK) return insert(key);
                     else if (found == key) return false;
                 }
             }
         }
         assert(false);
     }

  7. Clarifying the last lecture • Important! Last time I made a mistake... • We actually cannot let threads insert into the new table until after expansion is done! • What about this? Wait until expansion is finished before returning!

  8. Making migration more efficient • Typical index function to get a bucket index from a key: • index = hash(key) % capacity • If capacity doubles, indexes of keys are scrambled • Hash 17 in array of size 12: bucket 5 → in array of size 24: bucket 17 • Hash 42 in array of size 12: bucket 6 → in array of size 24: bucket 18 • Scaled index function: • index = floor( hash(key) / largestHashPossible * capacity ) • If capacity doubles, indexes of keys are doubled • Hash 17 in array of size 12: bucket 5 → in array of size 24: bucket 10 • Hash 42 in array of size 12: bucket 8 → in array of size 24: bucket 16 • With predictable indexes, we can expand more efficiently!
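The scaled index function can be sketched in integer arithmetic. Here largestHashPossible (H below) is assumed to be a fixed upper bound on hash values, and hash * capacity is assumed to fit in 64 bits:

```cpp
#include <cassert>
#include <cstdint>

// Scaled index: floor(hash / H * capacity), computed as integer math.
// H is any fixed exclusive upper bound on hash values; hash * capacity
// is assumed not to overflow 64 bits in this sketch.
uint64_t scaledIndex(uint64_t hash, uint64_t H, uint64_t capacity) {
    return hash * capacity / H;
}
```

Because doubling capacity maps bucket i to either 2i or 2i+1, a migrating thread knows exactly which two new slots an old slot's key can land in, which is what makes the chunk-by-chunk copy possible.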

  9. Idea • One thread can copy without synchronization • [Figure: elements 7, 6, 3, 4, 2 of the old table copied to their predictable positions in the new table]
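One plausible way to parallelize this copy, using the chunksClaimed / chunksDone counters from the earlier struct sketch: threads claim disjoint chunks with fetch_add, so each chunk is copied by exactly one thread with no per-element synchronization. CHUNK_SIZE and the simple slot-doubling placement are assumptions for this sketch, not the course's reference code:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

static const std::size_t CHUNK_SIZE = 64;

// Each call claims chunks until none remain. fetch_add hands out each
// chunk index to exactly one thread, so chunks are copied without locks.
void migrate(const std::vector<uint64_t>& oldTable,
             std::vector<uint64_t>& newTable,
             std::atomic<std::size_t>& chunksClaimed,
             std::atomic<std::size_t>& chunksDone) {
    std::size_t numChunks = (oldTable.size() + CHUNK_SIZE - 1) / CHUNK_SIZE;
    while (true) {
        std::size_t c = chunksClaimed.fetch_add(1);   // claim one chunk
        if (c >= numChunks) break;                    // nothing left to claim
        std::size_t lo = c * CHUNK_SIZE;
        std::size_t hi = lo + CHUNK_SIZE;
        if (hi > oldTable.size()) hi = oldTable.size();
        for (std::size_t i = lo; i < hi; ++i)
            newTable[2 * i] = oldTable[i];            // predictable placement
        chunksDone.fetch_add(1);                      // signal chunk finished
    }
}

void parallelMigrate(const std::vector<uint64_t>& oldTable,
                     std::vector<uint64_t>& newTable,
                     int numThreads) {
    std::atomic<std::size_t> claimed(0), done(0);
    std::vector<std::thread> ts;
    for (int t = 0; t < numThreads; ++t)
        ts.emplace_back(migrate, std::cref(oldTable), std::ref(newTable),
                        std::ref(claimed), std::ref(done));
    for (auto& t : ts) t.join();  // expansion is complete once all threads return
}
```

A real table would compare chunksDone against the chunk count so late-arriving threads can wait for stragglers instead of joining.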

  10. More complex data structures

  11. What else is worth understanding? • We’ve seen hash tables… • What about node-based data structures? • (That aren’t just a single pointer like stacks, or two pointers like queues) • Singly-linked lists, doubly-linked lists, skip-lists, trees, tries, hash tries, … • New challenges: • Nodes get deleted when threads might be trying to work on them • Operations may require atomic changes to multiple nodes

  12. Lock-based singly-linked lists • Ordered set implemented with a singly-linked list • Hand-over-hand locking discipline: • must lock a node before accessing it • can only acquire a lock on a node if it is the list head, or if you already hold a lock on the previous node • Example operations: Delete(15), Insert(17) • Is this a good approach? Locking causes many cache invalidations, even for searches! Should avoid locking while searching/traversing the list! • [Figure: list head → 7 → 8 → 11 → 15 → 20 → 23, shown before and after the operations]
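A minimal sketch of the hand-over-hand discipline described above, with one std::mutex per node and sentinel head/tail keys. Names and structure are illustrative, not the course's reference code:

```cpp
#include <cassert>
#include <climits>
#include <mutex>

// Sorted singly-linked list with lock coupling: a node's lock is acquired
// only while holding the previous node's lock, then the previous lock is
// released. Sentinel keys INT_MIN / INT_MAX avoid empty-list edge cases.
struct Node {
    int key;
    Node* next;
    std::mutex lock;
    Node(int k, Node* n) : key(k), next(n) {}
};

struct List {
    Node* head = new Node(INT_MIN, new Node(INT_MAX, nullptr));

    bool insert(int k) {
        head->lock.lock();
        Node* pred = head;
        Node* curr = pred->next;
        curr->lock.lock();
        while (curr->key < k) {      // advance, always holding two locks
            pred->lock.unlock();
            pred = curr;
            curr = curr->next;
            curr->lock.lock();
        }
        bool ok = (curr->key != k);  // reject duplicates
        if (ok) pred->next = new Node(k, curr);
        curr->lock.unlock();
        pred->lock.unlock();
        return ok;
    }

    bool contains(int k) {
        head->lock.lock();
        Node* pred = head;
        Node* curr = pred->next;
        curr->lock.lock();
        while (curr->key < k) {
            pred->lock.unlock();
            pred = curr;
            curr = curr->next;
            curr->lock.lock();
        }
        bool found = (curr->key == k);
        curr->lock.unlock();
        pred->lock.unlock();
        return found;
    }
};
```

Note that even contains must lock every node it passes, which is exactly the cache-invalidation cost the slide complains about.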

  13. Lock-free singly-linked lists: attempting to use CAS • One approach is to design a completely lock-free list… • Ordered set implemented with a singly-linked list • Delete(15): traverse list, then CAS 7.next from 15 to 20 • Insert(17): traverse list, create node 17, then CAS 15.next from 20 to 17 • [Figure: list head → 7 → 15 → 20]

  14. The problem • What if the operations are concurrent? • Delete(15): pause just before CAS 7.next from 15 to 20 • Insert(17): traverse list, create node 17, then CAS 15.next from 20 to 17 • Delete(15): resume and CAS 7.next from 15 to 20 • Erroneously deleted 17! • [Figure: 17 was linked after 15, so unlinking 15 also unlinked 17]

  15. Solution: marking [Harris 2001] • Idea: prevent changes to nodes that will be deleted • Before deleting a node, mark its next pointer • How does this fix the Insert(17), Delete(15) example? • Delete(15) marks 15.next before using CAS to delete it • Insert(17) cannot modify 15.next because it is marked • Okay. We can do lists. • Note: you can also do fast lock-based lists that avoid locking while searching…
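Marking is commonly implemented by stealing the low-order bit of the next pointer, so a single CAS can atomically check "still unmarked" and swing the pointer. A small sketch of that bit-stealing idea (the helper names are illustrative, and addresses are assumed to be at least 2-byte aligned so bit 0 is free):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

typedef std::atomic<uintptr_t> MarkablePtr;

const uintptr_t MARK = 1;  // low-order bit = "logically deleted"

uintptr_t getPtr(uintptr_t v)    { return v & ~MARK; }
bool      isMarked(uintptr_t v)  { return (v & MARK) != 0; }

// Delete's first step: mark the node's next pointer so it can no longer
// be swung by a concurrent Insert.
bool mark(MarkablePtr& next) {
    uintptr_t v = next.load();
    if (isMarked(v)) return false;                  // someone already marked it
    return next.compare_exchange_strong(v, v | MARK);
}

// Insert's CAS: expects the *unmarked* value, so it fails automatically
// once the pointer has been marked by a deleter.
bool casUnmarked(MarkablePtr& next, uintptr_t expected, uintptr_t newVal) {
    return next.compare_exchange_strong(expected, newVal);
}
```

This is why the Insert(17) / Delete(15) race disappears: once 15.next is marked, Insert's CAS (which expects the unmarked pointer) must fail and retry.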

  16. What about removing several nodes? • Deleting consecutive nodes in a list… • Delete(15 AND 20) • Mark 15, then mark 20? What can go wrong… • Or performing rotations in trees by replacing nodes… • [Figures: list head → 7 → 15 → 20 → 27; tree rotation replacing nodes A, B, C, D]

  17. Or changing two pointers at once? • Doubly-linked list • Insert(17) • If the two pointer changes are not atomic • Insertions and deletions could happen between them • Example: after 15.next := 17, but before 20.prev := 17, someone inserts between 17 and 20

  18. Easy lock-based doubly-linked list • Doubly-linked list • Insert(17) • Simplest locking discipline: never access anything without locking it first • Correct, but at what cost? • To respect the locking discipline, we have to lock while searching!

  19. Can we search a doubly-linked list without locking nodes? • Insert(k): • Search without locking until we reach nodes pred & succ where pred.key < k <= succ.key • If we found k, return false • Lock pred, lock succ • If pred.next != succ, unlock and retry • Create new node n • pred.next = n • succ.prev = n • Unlock all • Contains(k): • pred = head • succ = head • Loop: • If succ == NULL or succ.key > k then return false • If succ.key == k then return true • succ = succ.next • Where do we linearize Contains? No single line of code works… Must prove a suitable LP exists for every operation in every execution
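The Insert(k) recipe above can be sketched with per-node mutexes: search without locks, lock pred and succ, validate that pred.next is still succ, and retry from scratch if validation fails. A sketch under those assumptions (names are illustrative; the unsynchronized traversal mirrors the slide's optimistic search):

```cpp
#include <cassert>
#include <climits>
#include <mutex>

struct DNode {
    int key;
    DNode *prev, *next;
    std::mutex lock;
    DNode(int k) : key(k), prev(nullptr), next(nullptr) {}
};

struct DList {
    DNode *head, *tail;
    DList() {                       // sentinels avoid empty-list edge cases
        head = new DNode(INT_MIN);
        tail = new DNode(INT_MAX);
        head->next = tail;
        tail->prev = head;
    }

    bool contains(int k) {          // optimistic traversal: no locks taken
        DNode* succ = head->next;
        while (succ->key < k) succ = succ->next;
        return succ->key == k;
    }

    bool insert(int k) {
        while (true) {
            DNode* pred = head;     // unlocked search for pred & succ
            DNode* succ = head->next;
            while (succ->key < k) { pred = succ; succ = succ->next; }
            if (succ->key == k) return false;
            pred->lock.lock();
            succ->lock.lock();
            if (pred->next != succ) {           // validate: links changed?
                succ->lock.unlock();
                pred->lock.unlock();
                continue;                       // retry from scratch
            }
            DNode* n = new DNode(k);
            n->prev = pred; n->next = succ;
            pred->next = n;                     // the two pointer updates
            succ->prev = n;                     //   happen under both locks
            succ->lock.unlock();
            pred->lock.unlock();
            return true;
        }
    }
};
```

The validation step (pred->next != succ) is what makes the unlocked search safe to use: any concurrent change between the search and the locking forces a retry.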

  20. What if we have different types of searches? • Could imagine an application that wants a doubly-linked list so: • Some threads can search left-to-right (containsLR) • Some threads can search right-to-left (containsRL) • Can we linearize insertions in such an algorithm?

  21. Lock-free bi-directional searches complicate linearization • Where should we linearize a successful insert? • Insert(k): • Search without locking until we reach nodes pred & succ where pred.key < k <= succ.key • If we found k, return false • Lock pred, lock succ • If pred.next != succ, unlock and retry • Create new node n • pred.next = n (Case 1: linearize here) • succ.prev = n (Case 2: linearize here) • Unlock all • [Timeline: thread p runs Insert(k), performing pred.next = n and then succ.prev = n; between the two writes, thread q's SearchL(k) does NOT find k, while thread r's SearchR(k) finds k] • If Insert(k) was not linearized yet, a search should NOT find k; if it was linearized already, a search should find k. Neither case is consistent with both searches!

  22. Making two changes appear atomic • Something stronger than CAS? • Double compare-and-swap (DCAS) • Like CAS, but on any two memory locations • DCAS(addr1, addr2, exp1, exp2, new1, new2) • Not implemented in modern hardware • But we can implement it in software, using CAS!

  23. DCAS object: sequential semantics

     DCAS(addr1, addr2, exp1, exp2, new1, new2)
         atomic {
             if (*addr1 == exp1 && *addr2 == exp2) {
                 *addr1 = new1;
                 *addr2 = new2;
                 return true;
             } else {
                 return false;
             }
         }

     DCASRead(addr)
         return the value last stored in *addr by a DCAS

     • Usage - addresses that are modified by DCAS: • must not be modified with writes/CAS • must be read using DCASRead
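For reference, the sequential semantics above can be realized with one global lock. A real lock-free construction builds DCAS from CAS, as the slide notes; the locked version below is just a sketch that makes the atomicity contract explicit:

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// One global lock makes the two-word update and the reads mutually atomic.
static std::mutex dcasLock;

bool DCAS(uint64_t* addr1, uint64_t* addr2,
          uint64_t exp1, uint64_t exp2,
          uint64_t new1, uint64_t new2) {
    std::lock_guard<std::mutex> g(dcasLock);
    if (*addr1 == exp1 && *addr2 == exp2) {
        *addr1 = new1;
        *addr2 = new2;
        return true;
    }
    return false;
}

// Per the usage rule above, DCAS-modified addresses must be read through
// DCASRead, which synchronizes with in-flight DCAS operations.
uint64_t DCASRead(uint64_t* addr) {
    std::lock_guard<std::mutex> g(dcasLock);
    return *addr;
}
```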

  24. DCAS-based doubly-linked list • Add sentinel nodes to avoid edge cases when list is empty • Consequence: never update head or tail pointers • Use DCAS to change pointers (but not keys) • Consequence: must use DCASRead to read pointers (but not keys) • Note: no need to read head or tail with DCASRead! head tail 15 20

  25. First attempt at an implementation

     pair<node,node> InternalSearch(key_t k)
         pred = head
         succ = head
         while (true)
             if (succ == NULL or succ.key >= k)
                 return make_pair(pred, succ);
             pred = succ;
             succ = DCASRead(succ.next);

     bool Contains(key_t k)
         pred, succ = InternalSearch(k);
         return (succ.key == k);

     • InternalSearch postcondition: pred.key < k ≤ succ.key • Example: Contains(23) on list 15 → 20 → 27: InternalSearch returns pointers to 20 and 27; Contains(23) sees succ.key != k, and returns false

  26. First attempt at an implementation

     bool Insert(key_t k)
         while (true)
             pred, succ = InternalSearch(k);
             if (succ.key == k) return false;
             n = new node(k);
             if (DCAS(&pred.next, &succ.prev, succ, pred, n, n))
                 return true;
             else delete n;

     bool Delete(key_t k)
         while (true)
             pred, succ = InternalSearch(k);
             if (succ.key != k) return false;
             after = DCASRead(succ.next);
             if (DCAS(&pred.next, &after.prev, succ, succ, after, pred))
                 return true;   // not covered: how to free succ

  27. Is this algorithm correct? • Recall: main difficulties in node-based data structures • Atomically modifying two or more variables (DCAS helps with this) • Preventing changes to deleted nodes • Can we argue deleted nodes don’t get changed? • Observation: no node points to succ once it is deleted • Invariant: no node points to a deleted node • Plausible lemma: whenever we change a node, another node points to it

  28. Can we prove the plausible lemma? • Plausible lemma: whenever we change a node, another node points to it • The DCAS in Insert, DCAS(&pred.next, &succ.prev, succ, pred, n, n), succeeds and changes pred and succ only if they point to each other! • The DCAS in Delete, DCAS(&pred.next, &after.prev, succ, succ, after, pred), succeeds and changes pred and after if they both point to succ… but it could succeed even if nothing points to pred or after!

  29. A counterexample • Thread p: start Delete(20) on the list 15 → 17 → 20 → 25 → 27, finding pred = 17, succ = 20, after = 25 • Thread p: sleep just before executing DCAS(&pred.next, &after.prev, succ, succ, after, pred) • Thread q: Delete(17) • Thread q: Delete(25) • Thread p: DCAS succeeds, modifying deleted nodes! Delete(20) returns true, but 20 is not deleted!

  30. Overcoming this problem: marking • Recall: marking can prevent changes to deleted nodes • How to atomically change two pointers AND mark other pointers/nodes using DCAS? • Use an even stronger primitive… • k-word compare-and-swap (KCAS) • Like a CAS that atomically operates on k memory addresses • Can be implemented in software from CAS
