Sheriff: Precise Detection & Automatic Mitigation of False Sharing

Tongping Liu, Emery Berger

University of Massachusetts, Amherst


Multi-core: expectation is awesome

int count[8]; // Global array

void thread_func(int id) {
    for (int i = 0; i < M; i++)
        count[id]++;
}


Reality is awful

int count[8]; // Global array

void thread_func(int id) {
    for (int i = 0; i < M; i++)
        count[id]++;
}

13X

False sharing kills scaling
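
The effect on this slide can be reproduced with a small pthreads experiment. The sketch below is an illustration under stated assumptions, not code from the talk: `run_counters`, `padded_counter`, and the 64-byte `CACHE_LINE` are hypothetical names and values picked for the example. It runs the same increment loop over a packed array (all counters in one cache line) and over a padded array (one counter per line), so the two layouts can be timed against each other.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 8
#define CACHE_LINE 64 /* assumed line size; check the target hardware */

/* Packed layout: eight 8-byte counters share a single 64-byte line. */
static _Alignas(CACHE_LINE) volatile long packed_count[NTHREADS];

/* Padded layout: each counter occupies a line-sized slot of its own. */
struct padded_counter {
    volatile long value;
    char pad[CACHE_LINE - sizeof(long)];
};
static _Alignas(CACHE_LINE) struct padded_counter padded_count[NTHREADS];

struct worker_arg {
    volatile long *counter;
    long iters;
};

static void *worker(void *p) {
    struct worker_arg *arg = p;
    for (long i = 0; i < arg->iters; i++)
        (*arg->counter)++; /* each thread writes only its own counter */
    return NULL;
}

/* Runs NTHREADS workers on the chosen layout and copies the final
 * counts into out. Returns 0 on success, -1 if a thread failed to start. */
int run_counters(int use_padding, long iters, long out[NTHREADS]) {
    pthread_t tid[NTHREADS];
    struct worker_arg args[NTHREADS];
    int created = 0, rc = 0;

    assert(iters >= 0);
    for (int t = 0; t < NTHREADS; t++) {
        args[t].counter = use_padding ? &padded_count[t].value
                                      : &packed_count[t];
        *args[t].counter = 0;
        args[t].iters = iters;
    }
    for (; created < NTHREADS; created++) {
        if (pthread_create(&tid[created], NULL, worker, &args[created]) != 0) {
            rc = -1;
            break;
        }
    }
    for (int t = 0; t < created; t++)
        pthread_join(tid[t], NULL);
    for (int t = 0; t < NTHREADS; t++)
        out[t] = *args[t].counter;
    return rc;
}
```

Timing `run_counters(0, M, out)` against `run_counters(1, M, out)` for a large `M` on a multi-core machine (compiled with `-pthread`) typically shows the packed layout running noticeably slower; the exact ratio depends on the hardware.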


False sharing = performance problem

[Figure: Core 1 and Core 2 running Thread 1 and Thread 2, each core with its own cache, above main memory; label: Invalidate]


False sharing = performance problem

[Figure: Core 1 and Core 2 running Thread 1 and Thread 2, each core with its own cache, above main memory; labels: Invalidate, 20X slower]

Interleaved writes cause cache invalidations


False sharing is invisible

me = 1;
you = 1;            // globals

me = new Foo;
you = new Bar;      // heap

class X {
    int me;
    int you;
};                  // fields

arr[me] = 12;
arr[you] = 13;      // array indices


False sharing detector: instrument every memory access

Related work:

  • S. M. Gunther et al. [WBIA 2009].

  • C. Liu [Master's thesis, 2009].

  • Q. Zhao et al. [VEE 2011].

  • Shortcomings:

    • Slow

    • No actionable output

    • False positives


+ 850 lines…

False sharing detector: state of the art

  • Shortcomings:

    • Imprecise

    • Too many false positives

PTU



No false positives

Efficient (20% overhead)

Actionable output

Object has 13767 interleaving writes.
The object starts at 0xd5c8e160, length 32.
Allocation call stack:
0: word_count.c: 136
1: word_count.c: 444

Sheriff-Detect



t1 = spawn f(x);
t2 = spawn g(y);
sync;

if (!fork())
    f(x);
if (!fork())
    g(y);

Related work: Grace [OOPSLA 2009], Dthreads [SOSP 2011]
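
The spawn/sync to fork mapping above can be sketched in plain POSIX C. This is a simplified illustration under assumptions, not Sheriff's implementation: `spawn_and_sync`, `task_f`, and `task_g` are hypothetical names, and results are handed back through a small shared anonymous mapping only so the parent can read them.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on strict glibc builds */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void (*task_fn)(long *slot);

/* Hypothetical stand-ins for f(x) and g(y) on the slide. */
static void task_f(long *slot) { *slot = 6 * 7; }
static void task_g(long *slot) { *slot = 10 + 5; }

/* Runs each task in its own forked process ("spawn"), waits for all of
 * them ("sync"), and copies their results out. Returns 0 on success. */
int spawn_and_sync(const task_fn tasks[], size_t n, long results[]) {
    assert(n > 0);
    long *shared = mmap(NULL, n * sizeof *shared, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;

    int rc = 0;
    size_t started = 0;
    while (started < n) {
        pid_t pid = fork();
        if (pid < 0) {
            rc = -1;
            break;
        }
        if (pid == 0) { /* child plays the role of the thread */
            tasks[started](&shared[started]);
            _exit(0);
        }
        started++;
    }
    for (size_t i = 0; i < started; i++) { /* sync: reap every child */
        int status;
        if (wait(&status) < 0 || !WIFEXITED(status) ||
            WEXITSTATUS(status) != 0)
            rc = -1;
    }
    for (size_t i = 0; i < n; i++)
        results[i] = shared[i];
    munmap(shared, n * sizeof *shared);
    return rc;
}
```

Because each child has its own address space, writes it makes outside the shared mapping are invisible to its siblings, which is the property the isolated-execution slides build on.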


Sheriff: isolated execution

[Figure: Core 1 and Core 2, each with its own cache, running Process 1 and Process 2; main memory holds regions labeled Process 1, Process 2, and Global State]


Sheriff: isolated execution

Pthreads            Sheriff

1: Lock();          Begin_isolated_execution
2: XXX;             XXX; // isolated execution
3: Unlock();        Commit_local_changes
                    Begin_isolated_execution
4: YYY;             YYY; // isolated execution
5: Lock();          Commit_local_changes


Snapshot and diffing: local changes
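
A minimal sketch of snapshot-and-diff, assuming byte-granularity comparison against a copy taken at the start of isolated execution. The names `take_snapshot` and `commit_diff` are hypothetical and the real system may work at a different granularity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Snapshot: keep a pristine copy ("twin") of the local page. */
void take_snapshot(unsigned char *twin, const unsigned char *local,
                   size_t len) {
    assert(twin != NULL && local != NULL);
    memcpy(twin, local, len);
}

/* Diff and commit: copy to the shared page only the bytes this process
 * changed since its snapshot. Returns the number of bytes committed. */
size_t commit_diff(unsigned char *shared, const unsigned char *local,
                   const unsigned char *twin, size_t len) {
    size_t changed = 0;
    assert(shared != NULL && local != NULL && twin != NULL);
    for (size_t i = 0; i < len; i++) {
        if (local[i] != twin[i]) {
            shared[i] = local[i];
            changed++;
        }
    }
    return changed;
}
```

Committing only the changed bytes lets two processes that modified different parts of the same page merge their updates without overwriting each other.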


Sheriff-Detect: Find false sharing at commit points

[Figure: Core 1 and Core 2, each with its own cache, running Process 1 and Process 2; main memory holds regions labeled Process 1, Process 2, and Global State; label: Interleaved writes]
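
One way to count interleaved writes at commit points can be sketched as follows, assuming 64-byte cache lines and a per-line record of the last committing process. `line_stats` and `record_commit` are hypothetical names; this illustrates the idea and is not the tool's code.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed line size */
#define NO_WRITER (-1)

struct line_stats {
    int last_writer;             /* id of the last committer, or NO_WRITER */
    unsigned long interleavings; /* commits that followed a different writer */
};

void line_stats_init(struct line_stats *stats, size_t nlines) {
    for (size_t i = 0; i < nlines; i++) {
        stats[i].last_writer = NO_WRITER;
        stats[i].interleavings = 0;
    }
}

/* Called when process `writer` commits: every cache line it modified
 * (local differs from twin) counts as an interleaved write if the
 * previous committer of that line was a different process. */
void record_commit(struct line_stats *stats, const unsigned char *local,
                   const unsigned char *twin, size_t len, int writer) {
    assert(writer >= 0);
    for (size_t off = 0; off < len; off += CACHE_LINE) {
        size_t end = (len - off < CACHE_LINE) ? len : off + CACHE_LINE;
        int dirty = 0;
        for (size_t i = off; i < end; i++) {
            if (local[i] != twin[i]) {
                dirty = 1;
                break;
            }
        }
        if (!dirty)
            continue;
        struct line_stats *s = &stats[off / CACHE_LINE];
        if (s->last_writer != NO_WRITER && s->last_writer != writer)
            s->interleavings++;
        s->last_writer = writer;
    }
}
```

Lines with a high `interleavings` count are the ones worth reporting, since each interleaved write corresponds to a likely cache invalidation.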


Output: PTU vs. Sheriff-Detect

Benchmark        PTU      Sheriff-Detect
kmeans           1916     2
reverse_index    N/A      5
Total            2,664    15



Example case study: linear_regression

Allocation call stack:

0: linear_regression-pthread.c: line number: 136

Step 1: find allocation site

136: tid_args = (lreg_args *)calloc(sizeof(lreg_args), num_procs);

Step 2: find references

152: pthread_create(&tid_args[i].tid, &attr,
         linear_regression_pthread, (void*)&tid_args[i]) != 0);



Example case study: linear_regression

void *linear_regression_pthread(void *args_in)
{
    lreg_args *args = (lreg_args *)args_in;
    ……
    for (i = 0; i < args->num_elems; i++)
    {
        args->SX += args->points[i].x;
        args->SXX += args->points[i].x * args->points[i].x;
        ……

“lreg_args” is not aligned


Example case study: linear_regression

Step 3: fix false sharing using padding

typedef struct {
    …..
    char padding[128]; // Padding to avoid false sharing
} lreg_args;

9.2X
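
Besides manual padding, the same separation can be obtained with C11 alignment. The sketch below is an alternative illustration under assumptions, not the fix used in the case study: `worker_sums` is a hypothetical two-field stand-in for `lreg_args`, and `CACHE_LINE` is assumed to be 64 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_LINE 64 /* assumed line size; check the target hardware */

/* Aligning the first member aligns the whole struct, and the compiler
 * pads sizeof(worker_sums) up to a multiple of CACHE_LINE. */
typedef struct {
    _Alignas(CACHE_LINE) long long sx;
    long long sxx;
} worker_sums;

_Static_assert(sizeof(worker_sums) % CACHE_LINE == 0,
               "each element must fill whole cache lines");

/* Allocates n zeroed, line-aligned elements; release with free().
 * Returns NULL on failure. */
worker_sums *alloc_worker_sums(size_t n) {
    assert(n > 0);
    if (n > SIZE_MAX / sizeof(worker_sums))
        return NULL; /* size computation would overflow */
    worker_sums *p = aligned_alloc(CACHE_LINE, n * sizeof(worker_sums));
    if (p != NULL)
        memset(p, 0, n * sizeof(worker_sums));
    return p;
}
```

Padding alone fixes the size of each element but not the start address of the array; allocating with `aligned_alloc` covers both, so element `i` never shares a line with element `i + 1`.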


Sheriff-Detect performance

[Chart; labeled values: 11.4, 8.2, 20%, ?]



Speedup due to isolation

[Figure: Core 1 and Core 2, each with its own cache, running Process 1 and Process 2; main memory holds regions labeled Process 1, Process 2, and Global State]



Prevents ALL false sharing

Sheriff-Protect


Basis of Sheriff-Protect

[Diagram: Sheriff-Detect - (unlabeled image) = Sheriff-Protect]



[Chart; labeled values: 8.2, 11.4, 13%]



Sheriff libraries: easy to use

Sheriff-Detect

% g++ myprog.cpp -lsheriffdetect -o myprog

Sheriff-Protect

% g++ myprog.cpp -lsheriffprotect -o myprog


Workflow: using Sheriff

[Figure: workflow diagram; labels: original program; Sheriff-Detect; padding, alignment, local variables; modified program; libpthread; Degrade performance; too much memory; No source code; No time; No false sharing; Sheriff-Protect]



[Chart; labeled values: 8.2, 11.4, 13%]


Why no false positives?

  • Reports only actual interleaved writes (a real performance problem)

  • Tracks word status to rule out true sharing

  • Avoids heap reuse problems

  • Experimental results confirm this
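
The word-status idea can be sketched as follows. This is a simplified classification under assumptions (4-byte words, 64-byte lines, non-negative writer ids), with hypothetical names `line_words`, `note_word_write`, and `has_false_sharing`; the bookkeeping in the real tool may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WORDS_PER_LINE 16    /* 64-byte line / 4-byte words, assumed */
#define NO_OWNER (-1)        /* word never written */
#define MULTIPLE_OWNERS (-2) /* word written by several processes */

struct line_words {
    int owner[WORDS_PER_LINE];
};

void line_words_init(struct line_words *lw) {
    for (size_t i = 0; i < WORDS_PER_LINE; i++)
        lw->owner[i] = NO_OWNER;
}

/* Records that process `writer` modified the given word of the line. */
void note_word_write(struct line_words *lw, size_t word, int writer) {
    assert(word < WORDS_PER_LINE && writer >= 0);
    int *o = &lw->owner[word];
    if (*o == NO_OWNER)
        *o = writer;
    else if (*o != writer)
        *o = MULTIPLE_OWNERS; /* same word, several writers: true sharing */
}

/* False sharing: two different processes each own a distinct word in
 * the same line. Words with multiple owners are true sharing and are
 * ignored here. */
bool has_false_sharing(const struct line_words *lw) {
    int first = NO_OWNER;
    for (size_t i = 0; i < WORDS_PER_LINE; i++) {
        int o = lw->owner[i];
        if (o == NO_OWNER || o == MULTIPLE_OWNERS)
            continue;
        if (first == NO_OWNER)
            first = o;
        else if (o != first)
            return true;
    }
    return false;
}
```

A line where every contended write hits the same word is reported as true sharing and left out, which is what keeps such lines from showing up as false positives.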


Key Optimizations

  • Isolate small heap objects and globals

  • Adaptive false sharing prevention

    • Protect long transactions only


Key Optimizations

  • Find shared pages:

    objects with false sharing must lie on a shared page

  • Reduce overhead

    • Using sampling

    • Sampling only for long transactions ( > 5ms)