Sheriff: Precise Detection & Automatic Mitigation of False Sharing


Sheriff: Precise Detection & Automatic Mitigation of False Sharing. Tongping Liu, Emery Berger, University of Massachusetts, Amherst.

Presentation Transcript

Sheriff: Precise Detection & Automatic Mitigation of False Sharing

Tongping Liu, Emery Berger

University of Massachusetts, Amherst

Multi-core: expectation is awesome

int count[8]; // Global array

void thread_func(int id) {
    for (int i = 0; i < M; i++)
        count[id]++;
}

Reality is awful

int count[8]; // Global array

void thread_func(int id) {
    for (int i = 0; i < M; i++)
        count[id]++;
}

[Figure callout: 13X; the highlighted statement is count[id]++]

False sharing kills scaling

False sharing = performance problem

[Diagram: Thread 1 on Core 1 and Thread 2 on Core 2, each core with its own cache above Main Memory; an Invalidate message passes between the two caches]


[Diagram: Thread 1 on Core 1 and Thread 2 on Core 2, each core with its own cache above Main Memory; an Invalidate message passes between the two caches. Callout: 20X slower]

Interleaved writes cause cache invalidations
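To make the cache-line granularity concrete, here is a minimal sketch that checks whether two addresses land on the same cache line. It assumes a 64-byte line (common on x86, but not universal) and 4-byte ints; the helper names are made up for illustration, and the count array is aligned only so that the result is deterministic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size; query the real value on your platform. */
#define CACHE_LINE_SIZE 64

/* Index of the cache line that contains address p (hypothetical helper). */
static uintptr_t cache_line_of(const void *p) {
    return (uintptr_t)p / CACHE_LINE_SIZE;
}

/* Nonzero if both addresses fall on the same cache line. */
static int same_cache_line(const void *a, const void *b) {
    return cache_line_of(a) == cache_line_of(b);
}

/* The counter array from the slides, aligned to a line boundary here
 * so that the outcome does not depend on where the linker places it. */
static _Alignas(CACHE_LINE_SIZE) int count[8];

/* Number of counters that share a cache line with count[0]. */
static int counters_sharing_first_line(void) {
    int n = 0;
    for (size_t i = 0; i < 8; i++)
        n += same_cache_line(&count[0], &count[i]);
    return n;
}
```

With 4-byte ints the whole array occupies 32 bytes, so every thread's counter sits on one line and each increment can invalidate the copies held by the other cores.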

False sharing is invisible

me = 1;
you = 1;            // globals

me = new Foo;
you = new Bar;      // heap

class X {
    int me;
    int you;
};                  // fields

arr[me] = 12;
arr[you] = 13;      // array indices

False sharing detector: instrument every memory access

Related work:

  • S. M. Gunther et al. [WBIA 2009].
  • C. Liu. [Master's thesis, 2009].
  • Q. Zhao et al. [VEE 2011].
  • Shortcomings:
    • Slow
    • No actionable output
    • False positives
850 lines
+ 850 lines…

False sharing detector: state of the art

  • Shortcomings:
    • Imprecise
    • Too many false positives

Intel PTU (Performance Tuning Utility)


Sheriff-Detect

  • No false positives
  • Efficient (20% overhead)
  • Actionable output

Object has 13767 interleaving writes.
The object starts at 0xd5c8e160, length 32.
Allocation call stack:
0: word_count.c: 136
1: word_count.c: 444


t1 = spawn f(x);
t2 = spawn g(y);
sync;

if (!fork())
    f(x);
if (!fork())
    g(y);

Related work: Grace [OOPSLA 2009], Dthreads [SOSP 2011]
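The spawn-to-fork translation above can be sketched in plain POSIX C. This is only an illustration of running tasks as processes, not Sheriff's implementation: the names spawn, f, g, and run_as_processes are hypothetical, and a small MAP_SHARED region is used here purely so the parent can read the children's results.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void (*task_fn)(int *slot);

/* Two stand-in tasks; each writes its result into its own slot. */
static void f(int *slot) { *slot = 1; }
static void g(int *slot) { *slot = 2; }

/* "spawn": run fn in a child process instead of a thread. */
static pid_t spawn(task_fn fn, int *slot) {
    pid_t pid = fork();
    if (pid == 0) {          /* child */
        fn(slot);
        _exit(0);
    }
    return pid;              /* parent: child's pid, or -1 on error */
}

/* Runs f and g as processes and returns the sum of their results,
 * or -1 if the mapping or either fork failed. */
static int run_as_processes(void) {
    int *results = mmap(NULL, 2 * sizeof(int), PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (results == MAP_FAILED)
        return -1;
    pid_t t1 = spawn(f, &results[0]);
    pid_t t2 = spawn(g, &results[1]);
    /* "sync": wait for both children to finish. */
    if (t1 > 0) waitpid(t1, NULL, 0);
    if (t2 > 0) waitpid(t2, NULL, 0);
    int sum = (t1 > 0 && t2 > 0) ? results[0] + results[1] : -1;
    munmap(results, 2 * sizeof(int));
    return sum;
}
```

Because each child has its own address space, writes to ordinary (non-shared) memory in one child are invisible to the other until something explicitly publishes them.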

Sheriff: isolated execution

[Diagram: Process 1 on Core 1 and Process 2 on Core 2, each core with its own cache; in Main Memory, Process 1 and Process 2 each have a private copy alongside the shared Global State]

Sheriff: isolated execution

Pthreads:

Lock();
XXX;
Unlock();
YYY;
Lock();

Sheriff:

Begin_isolated_execution
XXX; //isolated execution
Commit_local_changes
Begin_isolated_execution
YYY; //isolated execution
Commit_local_changes
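One way to picture the begin/commit pair is a snapshot-and-diff scheme: take a snapshot (a "twin") when isolated execution begins, work on a private copy, and at commit write back only the bytes that differ from the twin. The sketch below is an assumption about how such a commit could work, not Sheriff's actual code; the type and function names are hypothetical and the 16-byte "page" is deliberately tiny.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 16   /* tiny "page" for illustration only */

typedef struct {
    unsigned char twin[DEMO_PAGE_SIZE];   /* snapshot taken at begin */
    unsigned char local[DEMO_PAGE_SIZE];  /* private working copy */
} isolated_page;

/* Start isolated execution: snapshot the shared page twice. */
static void begin_isolated_execution(isolated_page *p,
                                     const unsigned char *shared) {
    memcpy(p->twin, shared, DEMO_PAGE_SIZE);
    memcpy(p->local, shared, DEMO_PAGE_SIZE);
}

/* Commit: copy only the bytes this process changed into the shared page.
 * Returns the number of bytes written. */
static size_t commit_local_changes(const isolated_page *p,
                                   unsigned char *shared) {
    size_t changed = 0;
    for (size_t i = 0; i < DEMO_PAGE_SIZE; i++) {
        if (p->local[i] != p->twin[i]) {
            shared[i] = p->local[i];
            changed++;
        }
    }
    return changed;
}
```

Diffing against the twin is what lets two processes that touched different bytes of the same page commit in either order without overwriting each other's updates.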

Sheriff-Detect: Find false sharing at commit points

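[Diagram: Process 1 on Core 1 and Process 2 on Core 2, each core with its own cache and a private copy in Main Memory; interleaved writes are found when the private copies are committed to the shared Global State]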
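A rough sketch of counting interleaved writes at commit points follows. It assumes the detector remembers, per cache line, which process committed a change last, and counts an interleaving whenever a different process commits to that line next. The data structure and names are hypothetical; the slides do not show the real bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_LINES 4        /* number of tracked cache lines (illustrative) */
#define NO_WRITER (-1)

typedef struct {
    int      last_writer[NUM_LINES];    /* last process to commit a write */
    unsigned interleavings[NUM_LINES];  /* writer changes seen per line */
} line_tracker;

static void tracker_init(line_tracker *t) {
    for (size_t i = 0; i < NUM_LINES; i++) {
        t->last_writer[i] = NO_WRITER;
        t->interleavings[i] = 0;
    }
}

/* Call once per cache line that the committing process modified. */
static void record_commit(line_tracker *t, size_t line, int writer) {
    assert(line < NUM_LINES);
    if (t->last_writer[line] != NO_WRITER && t->last_writer[line] != writer)
        t->interleavings[line]++;       /* a different process wrote last */
    t->last_writer[line] = writer;
}
```

Lines whose counter grows large are the candidates worth reporting; a line written repeatedly by a single process never increments it.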

Output: PTU vs. Sheriff-Detect

Benchmark       PTU     Sheriff-Detect
kmeans          1916    2
reverse_index   N/A     5
Total           2,664   15


Example case study: linear_regression

Allocation call stack:

0: linear_regression-pthread.c: line number: 136

Step 1: find allocation site

136: tid_args = (lreg_args *)calloc(sizeof(lreg_args), num_procs);

Step 2: find references

152: pthread_create(&tid_args[i].tid, &attr,
         linear_regression_pthread, (void*)&tid_args[i]) != 0);


Example case study: linear_regression

void *linear_regression_pthread(void *args_in)
{
    lreg_args* args = (lreg_args*)args_in;
    ……
    for (i = 0; i < args->num_elems; i++)
    {
        args->SX += args->points[i].x;
        args->SXX += args->points[i].x*args->points[i].x;
        ……

"lreg_args" is not aligned

Example case study: linear_regression

Step 3: fix false sharing using padding

typedef struct {
    …..
    char padding[128]; // Padding to avoid false sharing
} lreg_args;

9.2X
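The padding fix can be illustrated with a small stand-alone struct. The sketch below is not the real lreg_args definition (its fields are elided in the slides); padded_args and aligned_args are hypothetical stand-ins, and it assumes a 64-byte cache line and 8-byte long long. It shows manual padding next to the C11 _Alignas alternative.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64   /* assumption; the slides pad with 128 bytes */

/* Hypothetical per-thread accumulator, standing in for lreg_args. */
typedef struct {
    long long SX;
    long long SXX;
    /* Manual padding: round the struct size up to one cache line so
     * neighbouring array elements keep their hot fields on separate lines. */
    char padding[CACHE_LINE_SIZE - 2 * sizeof(long long)];
} padded_args;

/* C11 alternative: ask the compiler to align (and therefore pad) the struct. */
typedef struct {
    _Alignas(CACHE_LINE_SIZE) long long SX;
    long long SXX;
} aligned_args;

/* Compile-time checks of the layout assumptions above. */
_Static_assert(sizeof(padded_args) % CACHE_LINE_SIZE == 0,
               "padded_args must fill whole cache lines");
_Static_assert(sizeof(aligned_args) == CACHE_LINE_SIZE,
               "aligned_args must occupy exactly one cache line");
```

Manual padding fixes the element stride but not the start address of the array, so an allocator that returns line-aligned memory (for example aligned_alloc) is still worth using when exact line placement matters.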


Speedup due to isolation

[Diagram: Process 1 on Core 1 and Process 2 on Core 2, each core with its own cache; in Main Memory, Process 1 and Process 2 each have a private copy alongside the shared Global State]

Basis of Sheriff-Protect

[Figure: Sheriff-Protect = Sheriff-Detect - (image not recoverable)]

[Chart: data labels 8.2, 11.4, 13%]


Sheriff libraries: easy to use

Sheriff-Detect:

% g++ myprog.cpp -lsheriffdetect -o myprog

Sheriff-Protect:

% g++ myprog.cpp -lsheriffprotect -o myprog

Workflow: using Sheriff

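[Workflow diagram, as far as it can be recovered from the labels:
  • Original program + Sheriff-Detect (in place of libpthread)
  • False sharing fixable by padding, alignment, or local variables: modified program + libpthread
  • Fix would degrade performance or use too much memory, or no source code, or no time: original program + Sheriff-Protect
  • No false sharing: original program + libpthread]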


Why no false positives?
(1) Reports only actual interleaved writes, the pattern that causes the performance problem

(2) Tracks per-word status, so true sharing is not reported as false sharing

(3) Avoids heap reuse problems

(4) Experimental results bear this out
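The word-status idea can be sketched as follows: record which thread wrote each word of a cache line, then call the line falsely shared only when different threads wrote different words. This is a simplified illustration under stated assumptions (4-byte words, 64-byte lines, non-negative thread ids, writes only); the names are hypothetical, and any word written by more than one thread is treated as true sharing for the whole line.

```c
#include <assert.h>
#include <stddef.h>

#define WORDS_PER_LINE   16    /* 64-byte line / 4-byte words (assumption) */
#define NO_WRITER        (-1)
#define MULTIPLE_WRITERS (-2)

typedef enum { NOT_SHARED, FALSE_SHARING, TRUE_SHARING } sharing_kind;

typedef struct {
    int word_writer[WORDS_PER_LINE];  /* thread id, or a marker above */
} line_status;

static void line_init(line_status *l) {
    for (size_t i = 0; i < WORDS_PER_LINE; i++)
        l->word_writer[i] = NO_WRITER;
}

/* Record that `thread` wrote word index `word` of this line. */
static void note_write(line_status *l, size_t word, int thread) {
    assert(word < WORDS_PER_LINE && thread >= 0);
    if (l->word_writer[word] == NO_WRITER)
        l->word_writer[word] = thread;
    else if (l->word_writer[word] != thread)
        l->word_writer[word] = MULTIPLE_WRITERS;
}

/* Classify the line from its per-word writers. */
static sharing_kind classify(const line_status *l) {
    int first = NO_WRITER;
    int several_owners = 0;
    for (size_t i = 0; i < WORDS_PER_LINE; i++) {
        int w = l->word_writer[i];
        if (w == MULTIPLE_WRITERS)
            return TRUE_SHARING;      /* same word, different threads */
        if (w == NO_WRITER)
            continue;
        if (first == NO_WRITER)
            first = w;
        else if (w != first)
            several_owners = 1;       /* different words, different threads */
    }
    return several_owners ? FALSE_SHARING : NOT_SHARED;
}
```

A detector that only looked at whole lines could not tell the last two cases apart, which is one source of false positives.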

Key Optimizations
  • Isolate small heap objects and globals
  • Adaptive false sharing prevention
    • Protect only on long transactions
  • Find shared pages:
    • false sharing objects → shared pages
  • Reduce overhead
    • Using sampling
    • Sampling only for long transactions (> 5 ms)
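The "long transactions only" rule amounts to a duration check against the 5 ms threshold. A minimal sketch, assuming timestamps are taken with a monotonic clock at the start of a transaction and again at the decision point; the function names are hypothetical.

```c
#include <assert.h>
#include <time.h>

/* Threshold from the slides: sample only transactions longer than 5 ms. */
#define LONG_TRANSACTION_NS (5LL * 1000 * 1000)

/* Nanoseconds elapsed between two timestamps. */
static long long elapsed_ns(const struct timespec *start,
                            const struct timespec *end) {
    return (long long)(end->tv_sec - start->tv_sec) * 1000000000LL
         + (long long)(end->tv_nsec - start->tv_nsec);
}

/* Nonzero if the transaction has run long enough to be worth sampling. */
static int should_sample(const struct timespec *start,
                         const struct timespec *now) {
    return elapsed_ns(start, now) > LONG_TRANSACTION_NS;
}
```

In a real run the two timestamps would come from clock_gettime(CLOCK_MONOTONIC, ...), so that wall-clock adjustments cannot make a transaction appear shorter or longer than it was.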