False sharing is a performance problem that severely limits scalability on multi-core systems. This presentation introduces Sheriff, a system for detecting and automatically mitigating false sharing. Unlike earlier detectors that instrument every memory access, Sheriff runs threads as isolated processes and finds false sharing when their changes are committed. Sheriff-Detect reports false sharing precisely and efficiently, with actionable output and no false positives; Sheriff-Protect prevents false sharing automatically when fixing the source is not an option. In the linear_regression case study, fixing the reported false sharing with padding yields a 9.2X improvement.
Sheriff: Precise Detection & Automatic Mitigation of False Sharing
Tongping Liu, Emery Berger
University of Massachusetts, Amherst
Multi-core: expectation is awesome
int count[8]; // Global array
void thread_func(int id) {
    for (int i = 0; i < M; i++)
        count[id]++; // each thread updates only its own slot
}
Reality is awful
int count[8]; // Global array
thread_func(int id) {
    for(i = 0; i < M; i++)
        count[id]++;
}
13X: count[id]++;
False sharing kills scaling
False sharing = performance problem
[Diagram: Thread 1 on Core 1 and Thread 2 on Core 2, each core with its own cache above main memory; a write by one thread invalidates the line in the other core's cache.]
Interleaved writes cause cache invalidations: 20X slower
False sharing is invisible
me = 1; you = 1; // globals
me = new Foo; you = new Bar; // heap
class X { int me; int you; }; // fields
arr[me] = 12; arr[you] = 13; // array indices
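Nothing in the source above says which variables end up on the same cache line; only the addresses do. As a minimal sketch, assuming a 64-byte line (the helper `same_cache_line` is hypothetical, not part of Sheriff), the check amounts to comparing line indices:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */
static_assert((CACHE_LINE & (CACHE_LINE - 1)) == 0,
              "line size must be a power of two");

/* True when both addresses fall inside the same (assumed) cache line. */
static bool same_cache_line(const void *a, const void *b) {
    return (uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE;
}

/* The "fields" case from the slide: two ints declared side by side. */
struct X {
    int me;
    int you;
};
```

For a `struct X` that starts on a line boundary, `same_cache_line(&x.me, &x.you)` is true, so two threads that each write only their own field still contend for one line.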
False sharing detector: instrument every memory access
Related work:
• S. M. Gunther et al. [WBIA 2009]
• C. Liu [Master's thesis, 2009]
• Q. Zhao et al. [VEE 2011]
Shortcomings:
• Slow
• No actionable output
• False positives
False sharing detector: state of the art (PTU)
[Sample PTU output, + 850 lines…]
Shortcomings:
• Imprecise
• Too many false positives
Sheriff-Detect
• No false positives
• Efficient (20% overhead)
• Actionable output:
Object has 13767 interleaving writes.
The object starts at 0xd5c8e160, length 32.
Allocation call stack:
0: word_count.c: 136
1: word_count.c: 444
t1 = spawn f(x);
t2 = spawn g(y);
sync;

if (!fork()) f(x);
if (!fork()) g(y);

Related work: Grace [OOPSLA 2009], Dthreads [SOSP 2011]
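The point of replacing spawn with fork() is that a forked child gets its own copy of memory. A minimal sketch of that isolation follows; the function `increment_in_child` and the global `counter` are made up for this example and are not Sheriff code.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 0;

/*
 * Forks a child that increments its private copy of `counter`.
 * Returns the child's view of the counter (passed back through the
 * exit status), or -1 on error. The parent's copy is not modified.
 */
static int increment_in_child(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        counter++;      /* touches only the child's copy-on-write page */
        _exit(counter); /* report the child's view to the parent */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

After the call, the child has seen its counter become 1 while the parent's `counter` is still 0. This is why a system built on processes needs an explicit step to publish changes, and why writes by one process cannot invalidate cache lines the other is using.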
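Sheriff: isolated execution
[Diagram: Process 1 on Core 1 and Process 2 on Core 2, each with its own cache; main memory holds a private region for Process 1, a private region for Process 2, and the shared global state.]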
Sheriff: isolated execution
Pthreads:
Lock();
XXX;
Unlock();
YYY;
Lock();
Sheriff:
Begin_isolated_execution
XXX; // isolated execution
Commit_local_changes
Begin_isolated_execution
YYY; // isolated execution
Commit_local_changes
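One way to picture the begin/commit pair is a snapshot-and-diff scheme: take a copy of the page when isolated execution begins, and at commit publish only the bytes that differ from that copy. The sketch below simulates this in plain memory with hypothetical helpers `begin_isolated` and `commit_local_changes`. It is an illustration under that assumption, not Sheriff's actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Start of isolated execution: snapshot the shared page twice, once as
 * the untouched reference ("twin") and once as the working copy. */
static void begin_isolated(const unsigned char *shared,
                           unsigned char *twin, unsigned char *local) {
    assert(shared && twin && local);
    memcpy(twin, shared, PAGE_SIZE);
    memcpy(local, shared, PAGE_SIZE);
}

/* Commit: copy back only the bytes this worker changed, so bytes
 * committed by other workers in the meantime are left alone.
 * Returns the number of bytes published. */
static size_t commit_local_changes(unsigned char *shared,
                                   const unsigned char *twin,
                                   const unsigned char *local) {
    size_t changed = 0;
    for (size_t i = 0; i < PAGE_SIZE; i++) {
        if (local[i] != twin[i]) {
            shared[i] = local[i];
            changed++;
        }
    }
    return changed;
}
```

Two workers that modify different bytes of the same page can each commit without overwriting the other, because each commit touches only its own diff.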
Sheriff-Detect: find false sharing at commit points
[Diagram: the isolated-execution picture, with interleaved writes from Process 1 and Process 2 identified against the global state.]
Output: PTU vs. Sheriff-Detect
Benchmark        PTU      Sheriff-Detect
kmeans           1,916    2
reverse_index    N/A      5
Total            2,664    15
Example case study: linear_regression
Allocation call stack:
0: linear_regression-pthread.c: line number: 136
Step 1: find allocation site
136: tid_args = (lreg_args *)calloc(sizeof(lreg_args), num_procs);
Step 2: find references
152: pthread_create(&tid_args[i].tid, &attr, linear_regression_pthread, (void*)&tid_args[i]) != 0);
Example case study: linear_regression
void *linear_regression_pthread(void *args_in) {
    lreg_args* args = (lreg_args*)args_in;
    ……
    for (i = 0; i < args->num_elems; i++) {
        args->SX += args->points[i].x;
        args->SXX += args->points[i].x * args->points[i].x;
        ……
"lreg_args" is not aligned
Example case study: linear_regression
Step 3: fix false sharing using padding
typedef struct {
    …..
    char padding[128]; // Padding to avoid false sharing
} lreg_args;
9.2X
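Padding fixes the size of each element, but the array itself also has to start on a line boundary for the elements to stay apart. A hedged sketch of both steps in C11 follows, using a hypothetical `worker_args` struct with made-up fields (not the real `lreg_args` layout) and an assumed 64-byte cache line:

```c
#include <assert.h>
#include <stdalign.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* Hypothetical per-thread accumulator. Aligning the first member makes
 * the compiler round the struct size up to a whole number of lines. */
typedef struct {
    alignas(CACHE_LINE) long long sx;
    long long sxx;
} worker_args;

static_assert(sizeof(worker_args) % CACHE_LINE == 0,
              "each element must cover whole cache lines");

/* Allocates a zeroed array of n elements starting on a line boundary.
 * The caller releases it with free(). Returns NULL on failure. */
static worker_args *alloc_worker_args(size_t n) {
    worker_args *a = aligned_alloc(CACHE_LINE, n * sizeof(worker_args));
    if (a != NULL)
        memset(a, 0, n * sizeof(worker_args));
    return a;
}
```

With this layout, element `i` and element `i + 1` never share a line, at the cost of some unused bytes per element. The slide's manual `char padding[128]` is the same idea with a more generous margin.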
Sheriff-Detect performance
[Chart: visible labels 11.4, 8.2, 20%, ?]
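Speedup due to isolation
[Diagram: Process 1 on Core 1 and Process 2 on Core 2, each with its own cache; main memory holds a private region for Process 1, a private region for Process 2, and the shared global state.]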
Sheriff-Protect: prevents ALL false sharing
Basis of Sheriff-Protect
[Diagram: an equation of the form "Sheriff-Detect - … = Sheriff-Protect"; the subtracted term is not recoverable.]
[Chart: visible labels 8.2, 11.4, 13%]
Sheriff libraries: easy to use
Sheriff-Detect:
% g++ myprog.cpp -lsheriffdetect -o myprog
Sheriff-Protect:
% g++ myprog.cpp -lsheriffprotect -o myprog
Workflow: using Sheriff
• Run the original program with Sheriff-Detect.
• No false sharing: keep running the original program with libpthread.
• False sharing found: fix it with padding, alignment, or local variables, then run the modified program with libpthread.
• Fix not practical (degrades performance, too much memory, no source code, or no time): run the original program with Sheriff-Protect.
Why no false positives?
(1) Reports only actual interleaved writes (the performance problem)
(2) Word status: rules out true sharing
(3) Avoids heap re-usage problems
(4) The experimental results bear this out
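Points (1) and (2) can be sketched as bookkeeping per cache line: count how often consecutive writes come from different threads, and remember for each word whether one thread or several wrote it. Everything below (`line_stats`, `record_write`, `is_false_sharing`, the 4-byte word size) is a hypothetical illustration driven by explicit calls; it does not describe how Sheriff itself observes writes.

```c
#include <assert.h>
#include <stdbool.h>

#define WORDS_PER_LINE 16 /* assumes 64-byte lines and 4-byte words */
#define NO_WRITER (-1)
#define MULTI_WRITER (-2)

struct line_stats {
    int last_writer;                /* thread that wrote most recently */
    unsigned interleavings;         /* writer changes seen on this line */
    int word_owner[WORDS_PER_LINE]; /* tid, NO_WRITER or MULTI_WRITER */
};

static void line_init(struct line_stats *ls) {
    ls->last_writer = NO_WRITER;
    ls->interleavings = 0;
    for (int w = 0; w < WORDS_PER_LINE; w++)
        ls->word_owner[w] = NO_WRITER;
}

static void record_write(struct line_stats *ls, int tid, unsigned word) {
    assert(tid >= 0 && word < WORDS_PER_LINE);
    if (ls->last_writer != NO_WRITER && ls->last_writer != tid)
        ls->interleavings++;
    ls->last_writer = tid;
    if (ls->word_owner[word] == NO_WRITER)
        ls->word_owner[word] = tid;
    else if (ls->word_owner[word] != tid)
        ls->word_owner[word] = MULTI_WRITER; /* same word: true sharing */
}

/* False sharing: enough interleaved writes, and at least two words
 * that are each written by a single, different thread. */
static bool is_false_sharing(const struct line_stats *ls, unsigned threshold) {
    if (ls->interleavings < threshold)
        return false;
    int first = NO_WRITER;
    for (int w = 0; w < WORDS_PER_LINE; w++) {
        int owner = ls->word_owner[w];
        if (owner < 0)
            continue; /* unwritten, or truly shared */
        if (first == NO_WRITER)
            first = owner;
        else if (owner != first)
            return true;
    }
    return false;
}
```

Two threads alternating writes to neighbouring words are reported; two threads alternating writes to the same word are not, because that is true sharing and padding would not help.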
Key Optimizations
• Isolate small heap objects and globals
• Adaptive false sharing prevention
• Protect long transactions only
Key Optimizations
• Find shared pages: a falsely shared object must lie on a shared page
• Reduce overhead
• Use sampling
• Sample only long transactions (> 5ms)
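The last bullet amounts to a duration check before sampling is switched on. A minimal sketch, assuming transaction start and end times are taken with `clock_gettime` (the helper names `elapsed_ms` and `should_sample` are invented for this example):

```c
#include <stdbool.h>
#include <time.h>

#define LONG_TRANSACTION_MS 5.0 /* threshold from the slide: > 5ms */

/* Milliseconds elapsed between two timestamps. */
static double elapsed_ms(struct timespec start, struct timespec end) {
    return (double)(end.tv_sec - start.tv_sec) * 1000.0
         + (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}

/* Sample only transactions that ran longer than the threshold, so
 * short transactions pay no sampling cost. */
static bool should_sample(struct timespec start, struct timespec end) {
    return elapsed_ms(start, end) > LONG_TRANSACTION_MS;
}
```

Short transactions are skipped entirely, which keeps the sampling cost proportional to the time spent in long-running code.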