
Keir Fraser & Tim Harris

Concurrent Programming Without Locks


Motivation

  • Locking introduces dependencies among threads

  • Non-blocking solutions keep threads independent, but

    • they are complicated to program

    • or depend on unrealistic instructions (CAS2)

  • Need a practical and general non-blocking solution


Solutions?

  • Only use data structures that can be implemented with CAS?

    • Limiting

  • Build MCAS in software using CAS

  • Build Transactional Memory in software using CAS


Goals

  • Concreteness

  • Linearizability

  • Non-blocking progress guarantee

  • Disjoint access parallelism

  • Read parallelism

  • Dynamicity

  • Practicable space costs

  • Composability


Definitions

  • Obstruction-freedom – a thread makes progress as long as it does not contend with other threads for access to any location

  • Lock-freedom – The system as a whole will make progress

  • Wait-freedom – Every thread makes progress

    Focus is on Lock-free design

    Whole transactions are lock-free, not just the sub-components
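As a point of reference for the lock-free definition above, here is a minimal sketch of the familiar single-word case: a counter incremented with a CAS retry loop, written with C11 atomics. The function name is hypothetical and not taken from the slides. A failed CAS here means some other thread's CAS succeeded, so the system as a whole still makes progress even though an individual thread may retry.

```c
#include <stdatomic.h>

/* Lock-free increment: retry a single-word CAS until it succeeds.
 * `lockfree_increment` is an illustrative name, not from the slides. */
static long lockfree_increment(_Atomic long *counter) {
    long seen = atomic_load(counter);
    /* On failure, atomic_compare_exchange_strong stores the value it
     * found into `seen`, so the next attempt uses a fresh snapshot. */
    while (!atomic_compare_exchange_strong(counter, &seen, seen + 1)) {
        /* another thread updated the counter first; retry */
    }
    return seen + 1;
}
```

The rest of the deck is about extending this single-location guarantee to updates that span several locations.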


The Basic Problem

  • How can we conditionally update multiple locations atomically – using “real” instructions that can only update a single location atomically?

  • The trick

    • Introduce a level of indirection

    • Use descriptors to access values indirectly


How do we use indirection?

  • Memory locations involved in a transaction or MCAS have

    • old and new uncommitted values stored in a descriptor

    • a status field determines which value to use

    • we must be careful how status is updated!

  • Memory locations not involved in a transaction can hold their value directly

    • requires tidying up after transactions commit
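The role of the status field can be sketched as follows. This is a single-threaded illustration only, with hypothetical type and field names; it shows how one status word selects between the old and new value for a location that refers to a descriptor, and it leaves out the atomic updates the real design needs.

```c
/* Hypothetical names; a single-threaded illustration of indirection. */
typedef enum { UNDECIDED, FAILED, SUCCESS } tx_status;

typedef struct {
    tx_status status;   /* decides which value is the logical one */
    long old_value;     /* value before the operation             */
    long new_value;     /* uncommitted value                      */
} descriptor;

/* Logical value of a location that currently refers to `d`:
 * the new value only once the operation has succeeded. */
static long logical_value(const descriptor *d) {
    return d->status == SUCCESS ? d->new_value : d->old_value;
}
```

Because every involved location reads through the same status field, changing that one word switches all of them from old to new at once.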


Indirect Memory Access

[Figure: memory locations 100 to 107 alongside a descriptor with a status field and the following entries.]

    Address   Old Value   New Value
    102       100         200
    105       123         123
    106       456         789


Direct or Indirect?

  • How do we know if the value in a location should be used directly or indirectly?

    • we can reserve some low order bits

    • interpret them on each access

    • but this limits the use of the approach to aligned pointers
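A minimal sketch of the low-order-bit test, assuming pointers are at least 4-byte aligned so that two bits are free. The mask, the tag value and the function names are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define TAG_MASK       ((uintptr_t)0x3)  /* two low bits, free on 4-byte-aligned pointers */
#define TAG_DESCRIPTOR ((uintptr_t)0x1)  /* hypothetical tag meaning "descriptor reference" */

/* Mark a descriptor pointer before storing it in a shared location. */
static uintptr_t tag_descriptor(void *desc) {
    return (uintptr_t)desc | TAG_DESCRIPTOR;
}

/* Interpret the reserved bits on each access. */
static bool is_descriptor(uintptr_t word) {
    return (word & TAG_MASK) == TAG_DESCRIPTOR;
}

/* Recover the original pointer by clearing the reserved bits. */
static void *untag(uintptr_t word) {
    return (void *)(word & ~TAG_MASK);
}
```

An ordinary aligned pointer has both low bits clear, so it is never mistaken for a descriptor reference; arbitrary integers offer no such guarantee, which is the limitation noted above.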


Using Descriptors in TM

  • Commit operation atomically updates status field

    • we have to do it with CAS to avoid races

  • Once a descriptor is made visible, only the status field changes

    • Why?

    • How?

  • Once a transaction’s outcome is decided, the status value doesn’t change

    • Retries use a new descriptor … why?

  • Descriptors are managed via garbage collection


Other requirements

  • Descriptors must be able to own locations

    • one transaction must not unlink another

    • why?

    • so what should be done on a conflict, wait?

  • But doesn’t this introduce blocking?

    • not necessarily – contending threads could help the owner complete


Uncontended Commits

  • To be obstruction free, uncontended commits must succeed

  • The phases:

    • Prepare the transaction descriptor (use CCAS for each location accessed) to atomically link locations while outcome is undecided

    • Decide the transaction’s outcome and update the status field (using CAS)

    • Update memory (using CAS) and mark the descriptor for collection


Contended Commits

  • Contended Commits must make progress

    • If status is decided, but not complete

      • Help the other thread complete

    • If status is undecided, either

      • Abort contending transactions

        • needs contention management to prevent live-lock

      • Help contending transactions

        • need some way to ensure success of at least one transaction

    • Read-check, used in WSTM or OSTM to ensure read set is still current:

      • Abort at least one contender

      • Help, and ensure progress by ordering transactions


Three STM Implementations

  • MCAS – Multiple Compare And Swap

  • WSTM – Word Software Transactional Memory

  • OSTM – Object Software Transactional Memory


MCAS

    CAS(word *address,           // actual value
        word expected_value,
        word new_value);

    MCAS(int count,
         word *address[],        // actual values
         word expected_value[],
         word new_value[]);

… but an extra indirection is added because pointers must indirect through the descriptor
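To make the intended semantics concrete, here is a sequential specification of MCAS: what the operation must appear to do in one atomic step. It is deliberately not atomic and uses no descriptors; `mcas_spec` and the `word` typedef are assumptions for illustration, not the implementation presented in the slides.

```c
#include <stdbool.h>
#include <stddef.h>

typedef long word;   /* assumption: a machine word holding plain values */

/* Sequential specification only: NOT atomic and NOT thread-safe.
 * Updates every location if and only if all of them hold their
 * expected values; otherwise changes nothing. */
static bool mcas_spec(size_t count, word *address[],
                      const word expected[], const word new_value[]) {
    for (size_t i = 0; i < count; i++)
        if (*address[i] != expected[i])
            return false;                 /* any mismatch: no update at all */
    for (size_t i = 0; i < count; i++)
        *address[i] = new_value[i];
    return true;
}
```

The lock-free implementation has to make these two loops look like a single indivisible step to every other thread.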


MCAS

  • Operates only on aligned pointers

    • enables use of 2 low order bits to distinguish values from descriptors

  • Descriptors contain

    • status {Success, Failure, Undecided}

    • N

    • address[ ]

    • expected[ ]

    • new_value[ ]


Data Access Examples

[Figure: memory locations holding either a value or a descriptor reference; one location is shown holding the value 300. Two descriptors are shown.]

    Status: SUCCESS
    Address   Old Value   New Value
    101       100         200

    Status: UNDECIDED
    Address   Old Value   New Value
    107       100         200


The Prepare Phase

  • Create MCAS descriptor

  • Insert descriptor address in each location

    • don’t overwrite other concurrent attempts

    • don’t keep working if another thread has already helped you succeed or fail

      • use CAS conditional on undecided status (CCAS)

    • MCAS descriptor must not become visible until it’s fully initialized

      • link CCAS descriptors into each location first, then swap them for the MCAS descriptor using CCAS


CCAS

Conditional CAS built from CAS

  • takes effect only if condition == undecided

  • used to insert descriptor references in two phases

    CCAS(word *address,
         word expected_value,
         word new_value,
         word *condition);

  • returns the original value of *address
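As with MCAS, a sequential specification may help pin down what CCAS is supposed to do. The sketch below is not atomic; the `word` typedef, the status constants and the name `ccas_spec` are assumptions for illustration.

```c
typedef long word;                                /* assumption */
enum { UNDECIDED = 0, FAILED = 1, SUCCESS = 2 };  /* hypothetical encoding */

/* Sequential specification only: NOT atomic.
 * Behaves like CAS, but the swap additionally requires that
 * *condition is still UNDECIDED. Always returns the original value. */
static word ccas_spec(word *address, word expected, word new_value,
                      const word *condition) {
    word original = *address;
    if (original == expected && *condition == UNDECIDED)
        *address = new_value;
    return original;
}
```

The caller detects success by comparing the returned value with the expected one, the same convention the CAS-based code in this deck relies on.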


CCAS

  • Create a new private CCAS descriptor

  • Copy CCAS parameter values into it

  • Try to link it into the target location (using CAS)

  • On failure try to help whoever succeeded by using their CCAS descriptor

    • again using CAS

    • then retry your own


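    word *CCAS(word **a, word *e, word *n, word *cond) {
        ccas_descriptor *d = new ccas_descriptor();
        word *v;
        // copy the parameters into the private descriptor
        d->a = a; d->e = e; d->n = n; d->cond = cond;
        // try to link the descriptor into the target location
        while ( (v = CAS(d->a, d->e, d)) != d->e ) {
            if ( IsCCASDesc(v) )
                CCASHelp( (ccas_descriptor *)v);   // help whoever got there first, then retry
            else
                return v;                          // location held an unexpected value
        }
        CCASHelp(d);
        return v;
    }

    void CCASHelp(ccas_descriptor *d) {
        // install the new value only while the condition is still UNDECIDED
        bool success = (*d->cond == UNDECIDED);
        CAS(d->a, d, success ? d->n : d->e);
    }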


Cost in terms of CAS

  • CCAS takes at least 2 CAS to link the MCAS descriptor into each location

    • 2N CAS for N locations

  • But we still have not committed the MCAS

    • at least 1 CAS required to set MCAS status

    • at least N CAS required to update the memory locations with the new values from the MCAS descriptor


Reading

  • We can’t simply read values anymore!

  • CCASRead must be used for reading

  • It must be able to read values directly and indirectly through CCAS descriptors

    • detect which situation it is in

    • function correctly in the presence of concurrent updates


CCASRead

  • Copy the contents of the address to be read into a local

  • Test to see if it’s a value or a descriptor

  • If it’s a descriptor, help the thread that owns the descriptor to complete

    • requires more CAS


    word *CCASRead(word **a) {
        word *v = *a;
        while ( IsCCASDesc(v) ) {
            CCASHelp( (ccas_descriptor *)v);   // finish the pending CCAS, then re-read
            v = *a;
        }
        return v;
    }

    void CCASHelp(ccas_descriptor *d) {
        bool success = (*d->cond == UNDECIDED);
        CAS(d->a, d, success ? d->n : d->e);
    }


Reading

  • We also need an MCASRead to read locations subject to MCAS

  • MCASRead uses CCASRead to read the contents of the location

    • if it’s an MCAS descriptor it must find the address in the descriptor and determine whether to use the old or new values

    • this requires more CCAS


Putting it all together

  • Example

    MCAS (3, {a,b,c}, {1,2,3}, {4,5,6})


MCAS(3, {a,b,c}, {1,2,3}, {4,5,6})

[Figure sequence, one frame per step; some frames list the arguments in the order {a,c,b}, {1,3,2}, {4,6,5}.]

  • Initially memory holds a = 1, b = 2, c = 3

  • An MCAS descriptor is created with status UNDECIDED and N = 3:

    Address   Old Value   New Value
    a         1           4
    b         2           5
    c         3           6

  • A CCAS descriptor is created for the first location: (a, 1, &MCAS_Descr, &mcas->status)

  • Location a no longer shows a plain value, while b = 2 and c = 3 are still unchanged

  • None of a, b, c shows a plain value; the status is still UNDECIDED

  • The status becomes SUCCESS

  • The final frames show the values 1, 2, 3 and 4, 5, 6 next to a, b and c


    bool MCAS(int N, word **a[], word *e[], word *n[])
    {
        mcas_descriptor *d = new mcas_descriptor();
        d->N = N;
        d->status = UNDECIDED;
        for (int i=0; i<N; i++) {
            d->a[i] = a[i];
            d->e[i] = e[i];
            d->n[i] = n[i];
        }
        address_sort(d);
        return mcas_help(d);
    }
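The slides do not show address_sort itself. One plausible shape, sketched under assumptions, is to keep each location's address, expected value and new value together in one entry and order the entries by address with qsort. The entry layout and the two-argument signature below are hypothetical and differ from the descriptor-taking call in the code above. Acquiring locations in a single global order is a common way to keep operations from helping each other in a cycle, which seems to be what "ensure progress by ordering transactions" refers to.

```c
#include <stdint.h>
#include <stdlib.h>

typedef long word;   /* assumption */

/* Hypothetical entry layout: one row of an MCAS descriptor. */
typedef struct {
    word *addr;
    word expected;
    word new_value;
} mcas_entry;

/* Compare two entries by the numeric value of their addresses. */
static int by_address(const void *x, const void *y) {
    uintptr_t a = (uintptr_t)((const mcas_entry *)x)->addr;
    uintptr_t b = (uintptr_t)((const mcas_entry *)y)->addr;
    return (a > b) - (a < b);
}

/* Sort entries so every MCAS acquires its locations in address order. */
static void address_sort(mcas_entry *entries, size_t n) {
    qsort(entries, n, sizeof entries[0], by_address);
}
```

Keeping the three fields in one struct means a single sort moves them together, which is simpler than permuting three parallel arrays consistently.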


    bool mcas_help(mcas_descriptor *d)
    {
        word *v, desired = FAILED;
        bool success;

        // PHASE 1: acquire
        for (int i=0; i<d->N; i++) {
            while (TRUE) {
                v = CCAS(d->a[i], d->e[i], d, &d->status);
                if (v == d->e[i] || v == d) break;   // acquired, or already linked by a helper
                if (IsMCASDesc(v))
                    mcas_help( (mcas_descriptor *)v );
                else
                    goto decision_point;
            }
        }
        desired = SUCCESS;

        // PHASE 2: read – not used by MCAS

    decision_point:
        CAS(&d->status, UNDECIDED, desired);

        // PHASE 3: clean up
        success = (d->status == SUCCESS);
        for (int i=0; i<d->N; i++) {
            CAS(d->a[i], d, success ? d->n[i] : d->e[i]);
        }
        return success;
    }


    word *MCASRead(word **addr)
    {
        word *v;
    retry_read:
        v = CCASRead(addr);
        if ( !IsMCASDesc(v) ) return v;
        mcas_descriptor *d = (mcas_descriptor *)v;
        for (int i=0; i<d->N; i++) {
            if (d->a[i] == addr) {
                if (d->status == SUCCESS) {
                    if (CCASRead(addr) == v)
                        return d->n[i];
                    else
                        goto retry_read;
                } else { // FAILED or UNDECIDED
                    if (CCASRead(addr) == v)
                        return d->e[i];
                    else
                        goto retry_read;
                }
            }
        }
        return v;
    }


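Conflicts

[Figure: memory locations 102, 104 and 108, with two MCAS descriptors that both name location 108.]

    Descriptor 1, Status: UNDECIDED
    Address   Old Value   New Value
    102       100         200
    104       456         789
    108       999         777

    Descriptor 2, Status: UNDECIDED
    Address   Old Value   New Value
    108       999         200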


Conflicts

[Figure: only the first descriptor is shown; the value 200 appears in memory next to locations 102, 104 and 108.]

    Status: UNDECIDED
    Address   Old Value   New Value
    102       100         200
    104       456         789
    108       999         777


    bool mcas_help(mcas_descriptor *d)
    {
        word *v, desired = FAILED;
        bool success;

        // Phase 1: acquire
        for (int i=0; i<d->N; i++) {
            while (TRUE) {
                v = CCAS(d->a[i], d->e[i], d, &d->status);
                if (v == d->e[i] || v == d) break;
                if (!IsMCASDesc(v)) goto decision_point;
                mcas_help( (mcas_descriptor *)v );
            }
        }
        desired = SUCCESS;

    decision_point:


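Conflicts

[Figure: memory locations 102, 104 and 108, with two descriptors that overlap on locations 104 and 108.]

    Descriptor 1, Status: UNDECIDED
    Address   Old Value   New Value
    102       100         200
    104       456         789
    108       200         777

    Descriptor 2, Status: UNDECIDED
    Address   Old Value   New Value
    104       456         123
    108       999         200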


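Conflicts

[Figure: the same two descriptors; the first now has status SUCCESS. Memory locations 102, 104 and 108 are shown.]

    Descriptor 1, Status: SUCCESS
    Address   Old Value   New Value
    102       100         200
    104       456         789
    108       200         777

    Descriptor 2, Status: UNDECIDED
    Address   Old Value   New Value
    104       456         123
    108       999         200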


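Conflicts

[Figure: after clean up, memory shows the values 200, 789 and 777, the new values from the first descriptor.]

    Descriptor 1, Status: SUCCESS
    Address   Old Value   New Value
    102       100         200
    104       456         789
    108       200         777

    Descriptor 2, Status: UNDECIDED
    Address   Old Value   New Value
    104       456         123
    108       999         200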


Failure Modes

  • Can fail during any of the CAS attempts

    • CCAS

    • CCASHelp


CCAS “failure modes”

  • Someone helped us with the CCAS

    • call CCASHelp with our own descriptor

    • next time around, return MCAS descriptor

    • MCAS continues

  • Someone else beat us to CCAS

    • help them with their CCAS

    • next time around, return their MCAS descriptor

    • Help with their MCAS

    • Our MCAS likely aborts

  • Source value changed

    • return new value

    • MCAS aborts


    word *CCAS(word **a, word *e, word *n, word *cond) {
        ccas_descriptor *d = new ccas_descriptor();
        word *v;
        d->a = a; d->e = e; d->n = n; d->cond = cond;
        while ( (v = CAS(d->a, d->e, d)) != d->e ) {
            if ( !IsCCASDesc(v) ) return v;     // source value changed
            CCASHelp( (ccas_descriptor *)v);    // someone else beat us: help, then retry
        }
        CCASHelp(d);
        return v;
    }

    void CCASHelp(ccas_descriptor *d) {
        bool success = (*d->cond == UNDECIDED);
        CAS(d->a, d, success ? d->n : d->e);
    }


CCASHelp “failure modes”

  • MCAS aborted so status isn’t UNDECIDED

    • old value put back in place

  • MCAS aborted, CCASHelp doesn’t restore value

    • MCAS cleanup will put old value back in place

  • Race: status switches to SUCCESS between check and CAS

    • CAS will fail because CCAS descriptor already removed

    • CCAS return will not cause MCAS failure

  • Race: status switches to FAILURE between check and CAS

    • CAS will always fail because for MCAS to fail, someone must have read beyond us


Cost

  • Minimum of 3N + 1 CAS instructions for N locations

    • many more CAS under heavy contention !

  • With no contention the three batches of N CAS all act on the same N locations

  • “[improvements] may be useful if there are systems in which CAS operates substantially more slowly than an ordinary write.”
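The 3N + 1 figure is the sum of the counts given on the earlier "Cost in terms of CAS" slide: 2N to link the descriptor, 1 to set the status, N to write the new values. A tiny sketch of that arithmetic, with a hypothetical function name:

```c
/* Minimum CAS count for an uncontended MCAS over n locations,
 * following the breakdown given in the slides. */
static unsigned mcas_min_cas(unsigned n) {
    unsigned link   = 2 * n;  /* each CCAS needs at least 2 CAS per location  */
    unsigned decide = 1;      /* one CAS to set the MCAS status               */
    unsigned update = n;      /* one CAS per location to install final values */
    return link + decide + update;   /* = 3n + 1 */
}
```

For the three-location example used earlier this gives 10 CAS instructions, against 3 plain writes for an unsynchronized update.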


Deep Breath


WSTM

  • Remove requirement for space reserved in values being updated

    • hash addr to find ownership record

  • Caller need not keep track of locations

    • read and write sets stored in transaction descriptor

  • Provides read parallelism

  • Obstruction free, not lock free nor wait free
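A minimal sketch of the "hash addr to find ownership record" step, assuming a power-of-two table and 8-byte words. The table size, the shift and the function name are all assumptions; the slides do not give the hash used.

```c
#include <stddef.h>
#include <stdint.h>

#define OREC_COUNT 1024u   /* hypothetical table size; must be a power of two */

/* Map a heap address to the index of its ownership record.
 * The low 3 bits are dropped on the assumption of 8-byte words,
 * so neighbouring words land in neighbouring orecs. */
static size_t orec_index(const void *addr) {
    uintptr_t a = (uintptr_t)addr;
    return (size_t)((a >> 3) & (OREC_COUNT - 1));
}
```

With a fixed-size table, two addresses that are OREC_COUNT words apart map to the same orec, so transactions touching unrelated data can still conflict.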


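Data Structures

[Figure: application memory holding the values 100, 200, 300 and 400; a table of ownership records (orecs), one of which holds version 52; and a transaction descriptor.]

    Status: Undecided
    a1: (100,15) -> (200,16)
    a2: (200,52) -> (100,53)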


Logical contents

  • Orec contains a version number:

    • value comes direct from memory

  • Orec contains a descriptor reference

    • descriptor contains address

      • value comes from descriptor based on status

    • descriptor does not contain address

      • value comes direct from memory
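The three cases above can be sketched as one lookup function. This is a single-threaded illustration with hypothetical struct layouts (a fixed four-entry descriptor and a flag on the orec); it only shows the case analysis, not the concurrent protocol.

```c
#include <stdbool.h>
#include <stddef.h>

typedef long word;   /* assumption */
typedef enum { UNDECIDED, FAILED, SUCCESS } tx_status;

/* Hypothetical layouts, for illustration only. */
typedef struct { word *addr; word old_value; word new_value; } entry;
typedef struct { tx_status status; size_t count; entry entries[4]; } tx_descriptor;
typedef struct { bool owned; unsigned version; const tx_descriptor *owner; } orec;

/* Logical contents of `addr`, given the orec it hashes to. */
static word logical_contents(const orec *o, word *addr) {
    if (o->owned) {
        for (size_t i = 0; i < o->owner->count; i++) {
            const entry *e = &o->owner->entries[i];
            if (e->addr == addr)   /* descriptor contains the address */
                return o->owner->status == SUCCESS ? e->new_value
                                                   : e->old_value;
        }
    }
    /* orec holds a version number, or the owning descriptor does not
     * mention this address: the value comes direct from memory */
    return *addr;
}
```

The fall-through case is what lets an address that merely shares an orec with an owned address keep being read from memory.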


Transaction Process

  • Call WSTMRead/WSTMWrite to gather/change data

    • Builds transaction data structure, but it’s NOT visible

  • WSTMCommitTransaction

    • Get ownership – update ORecs

    • Read-Check – check version numbers

    • Decide

    • Clean up


Data Structures

[Figure: overlaid animation frames. Recoverable labels: memory values 100, 200, 300 and 400; orec versions 15, 16, 52 and 53; descriptor status UNKNOWN, then SUCCESS; and the descriptor entries below.]

    a1: (100,15) -> (200,16)
    a2: (200,52) -> (200,52)
    a2: (200,52) -> (100,53)


Complications

  • Fixed number of Orecs

  • Hash collisions lead to false sharing


Issues

  • Orec ownership acts like a lock, so simple scheme is not even obstruction free

  • Can’t help with “cleanup” because might overwrite newer data

  • Can’t determine value during READCHECK, so we’re forced to shoot down

  • force_decision() might be circular causing live lock

  • helping requires <complicated> stealing of transactions

  • Uncontended cost is N+2 CAS for N locations


OSTM

  • Objects are represented as opaque handles

    • can’t use pointers directly

    • must rewrite data structures

  • Get accessible pointers via OSTMOpenForReading/OSTMOpenForWriting

  • Eliminates need for Orecs/aliasing


Evaluation

  • “We use … reference-counting garbage collection”

  • Evaluated with one thread/CPU

  • “Since we know the number of threads participating in our experiments…”


Evaluation charts (not reproduced in this transcript)

  • Uncontended Performance

  • Contended Locks

  • Data Contention

  • Data/Lock Contention


Spare Slides


    word WSTMRead(wstm_transaction *tx, word *addr) {
        if (entry_exists) return entry->new_value;
        if (orec->type != descriptor)
            create entry [current value, orec version]
        else {
            force_decision(descriptor);   // can’t be ours: not in commit
            if (descriptor contains our address)
                if (status == SUCCESS)
                    create entry [descr.new_val, descr.new_ver]
                else
                    create entry [descr.old_val, descr.old_ver]
            else
                create entry [current value, descr.aliased.new_ver]
        }
        if (aliased) {
            if (entry->old_version != aliased->old_version)
                status = FAILED;
            entry->old_version = aliased->old_version;
            entry->new_version = aliased->new_version;
        }
        return entry->new_value;
    }


    void WSTMWrite(wstm_transaction *tx, word *addr, word new_value)
    {
        get entry using WSTMRead logic
        entry->new_value = new_value;
        for each aliased entry {
            entry->new_version++;
        }
    }


    bool WSTMCommit(wstm_transaction *tx)
    {
        if (tx->status == FAILED) return false;
        sort descriptor entries
        desired_status = FAILED;
        for each update
            if (!acquire_orec) goto decision_point;
        CAS(&tx->status, UNDECIDED, READ_CHECK);
        for each read
            if (!read_check) goto decision_point;
        desired_status = SUCCESS;

    decision_point:
        status = tx->status;
        while (status != FAILED && status != SUCCESS) {
            CAS(&tx->status, status, desired_status);
            status = tx->status;
        }
        if (tx->status == SUCCESS)
            for each update
                *addr = entry->new_value;
        for each update
            release_orec
        return (tx->status == SUCCESS);
    }


    bool read_check(wstm_transaction *tx, wstm_entry *entry)
    {
        if (orec is WSTM_descriptor) {
            force_decision();
            if (SUCCESS)
                version = new_version;
            else
                version = old_version;
        } else {
            version = orec_version;
        }
        return (version == entry->old_version);
    }


Data Structures

[Figure: memory locations a1 = 100, a2 = 200 and a3 = 300, plus a further value 400; orecs, one holding version 52; and a transaction descriptor.]

    Status: Undecided
    a1: (100,15) -> (200,16)
    a2: (200,52) -> (100,53)
    a3: (300,15) -> (300,16)


Caveats

  • “It remains possible for a thread to see a mutually inconsistent view of shared memory if it performs a series of [read] calls.”

  • In other words there is not complete isolation between transactions

    • a thread may crash due to concurrency prior to having its transaction abort and retry

