
Effect of Context Aware Scheduler on TLB




Presentation Transcript


  1. Effect of Context Aware Scheduler on TLB Satoshi Yamada and Shigeru Kusakabe Kyushu University

  2. Contents • Introduction • Effect of Sibling Threads on TLB • Context Aware Scheduler (CAS) • Benchmark Applications and Measurement Environment • Result • Related Work • Conclusion

  3. Contents • Introduction • What is Context? • Motivation • Task Switch and Cache • Approach of our Scheduler • Effect of Sibling Threads on TLB • Context Aware Scheduler (CAS) • Benchmark Applications and Measurement Environment • Result • Related Work • Conclusion

  4. What is context? • Definition in this presentation: Context = Memory Address Space • Task switch: switching the running task • Context switch: a task switch that also changes the memory address space

  5. Motivation • More chances of using native threads in OSes today • Java, Perl, Python, Erlang, and Ruby • OpenMP, MPI • As the number of threads increases, the overhead due to task switches tends to get heavier • Agarwal, et al. “Cache performance of operating system and multiprogramming workloads” (1988)

  6. Task Switch and Cache • Overhead due to a task switch • includes that of loading the working set of the next process • is deeply related to the utilization of caches • Mogul, et al. “The effect of context switches on cache performance” (1991) [Figure: the working sets of processes A and B overflow the cache as the scheduler switches between them]

  7. Approach of our Scheduler • Three solutions to reduce the overhead due to task switches • Agarwal, et al. “Cache performance of operating system and multiprogramming workloads” (1988) • 1. Increase the size of caches • 2. Reuse the data shared among threads • 3. Utilize tagged caches and/or restrain cache flushes * We utilize sibling threads to achieve 2. and 3. * We mainly discuss 3.

  8. Contents • Introduction • Effect of Sibling Threads on TLB • Working Set and Task Switch • TLB tag and Task Switch • Advantage of Sibling Threads • Effect of Sibling Threads on Task Switches • Context Aware Scheduler (CAS) • Benchmark Applications and Measurement Environment • Result • Related Work • Conclusion

  9. Working Set and Task Switch • Task switch with small overhead: the working sets of processes A and B fit in the cache together • Task switch with large overhead: the working sets overflow the cache [Figure: cache contents across switches between processes A and B in both cases]

  10. TLB and Task Switch • Tagged TLB: TLB flush is not necessary (ARM, MIPS, etc.) • Non-tagged TLB: TLB flush is necessary (x86, etc.) [Figure: tagged TLB entries carry an address-space tag alongside each virtual-to-physical mapping; non-tagged entries do not]

  11. Advantage of Sibling Threads [Figure: fork() creates a PROCESS and copies the parent's mm_struct into the child's task_struct; clone() creates a THREAD whose task_struct shares the parent's mm_struct] • Advantage on task switches • Higher possibility of sharing data among sibling threads • Context switch does not happen • Restrains TLB flushes on non-tagged TLBs
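The fork()/clone() distinction on this slide can be observed from user space. A minimal sketch (Linux-only, since it uses os.fork(); CPython threads are created via clone() with a shared address space): a sibling thread's write is visible to its parent, while a forked child writes only to its own copy.

```python
import os
import threading

counter = {"value": 0}

def bump():
    counter["value"] += 1

# Sibling thread: shares the parent's address space (one mm_struct),
# so the parent observes the increment.
t = threading.Thread(target=bump)
t.start(); t.join()
assert counter["value"] == 1

# Child process: fork() duplicates the address space (copy-on-write),
# so the child's increment never reaches the parent's copy.
pid = os.fork()
if pid == 0:        # child
    bump()
    os._exit(0)
os.waitpid(pid, 0)
print(counter["value"])   # still 1 in the parent
```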

  12. Effect of Sibling Threads on Task Switches: Measurement • We use the idea of the lat_ctx program in LMbench [Figure: a working set is touched between switches, alternating between sibling threads and between processes]
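The lat_ctx idea can be sketched as follows (an illustrative approximation, not LMbench itself): two tasks bounce a one-byte token through pipes, so each hop forces the kernel to switch tasks, and the round-trip time divided by the hop count approximates the switch overhead plus pipe cost.

```python
import os
import time

def ping_pong(rounds=1000):
    """Approximate per-switch cost by token passing between two processes."""
    p2c_r, p2c_w = os.pipe()   # parent -> child
    c2p_r, c2p_w = os.pipe()   # child -> parent
    pid = os.fork()
    if pid == 0:               # child: echo each token back
        for _ in range(rounds):
            os.read(p2c_r, 1)
            os.write(c2p_w, b"x")
        os._exit(0)
    start = time.perf_counter()
    for _ in range(rounds):
        os.write(p2c_w, b"x")
        os.read(c2p_r, 1)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    return elapsed / (2 * rounds)   # seconds per switch (approximate)

print(f"~{ping_pong() * 1e6:.1f} us per switch (incl. pipe overhead)")
```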

  13. Effect of Sibling Threads on Task Switches: Results (sibling threads / process)

  14. Contents • Introduction • Effect of Sibling Threads on TLB • Context Aware Scheduler (CAS) • O(1) Scheduler in Linux • Context Aware Scheduler (CAS) • Benchmark Applications and Measurement Environment • Result • Related Work • Conclusion

  15. O(1) Scheduler in Linux • Structure • active queue and expired queue • priority bitmap and array of linked lists of threads • Behavior • search the priority bitmap and choose a thread with the highest priority • Scheduling overhead • independent of the number of threads [Figure: per-processor active and expired queues, each a priority bitmap over linked lists of threads]
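The structure above can be sketched in a few lines (a hypothetical simplification: 8 priority levels instead of the kernel's 140, and the class and method names are ours). Picking scans the bitmap for its first set bit, so the cost does not depend on how many threads are runnable.

```python
from collections import deque

NR_PRIO = 8   # the real O(1) scheduler uses 140 priority levels

class Runqueue:
    def __init__(self):
        self.queues = [deque() for _ in range(NR_PRIO)]  # one list per priority
        self.bitmap = 0                                   # bit i set <=> queue i non-empty

    def enqueue(self, thread, prio):
        self.queues[prio].append(thread)
        self.bitmap |= 1 << prio

    def pick_next(self):
        if not self.bitmap:
            return None
        # lowest set bit = highest priority (smaller number = higher priority)
        prio = (self.bitmap & -self.bitmap).bit_length() - 1
        thread = self.queues[prio].popleft()
        if not self.queues[prio]:
            self.bitmap &= ~(1 << prio)
        return thread

rq = Runqueue()
rq.enqueue("A", 1); rq.enqueue("B", 3); rq.enqueue("C", 1)
print(rq.pick_next())  # A: priority 1 beats priority 3
```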

  16. Context Aware Scheduler (CAS) (1/2) • CAS creates auxiliary runqueues per context, alongside the regular O(1) scheduler runqueue • CAS compares Preg and Paux • Preg: the highest priority in the regular O(1) scheduler runqueue • Paux: the highest priority in the auxiliary runqueue • if Preg - Paux ≤ threshold, then CAS chooses the thread at Paux
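The selection rule can be sketched as below. This is a simplification: `cas_pick` is our name, and the comparison assumes the kernel convention that a smaller number means a higher priority, so the slide's Preg - Paux ≤ threshold becomes Paux - Preg ≤ threshold here.

```python
def cas_pick(p_reg, p_aux, threshold):
    """Pick from the current context's auxiliary runqueue when its best
    thread is within `threshold` priority levels of the globally best
    thread; otherwise fall back to the regular O(1) choice."""
    if p_aux is not None and p_aux - p_reg <= threshold:
        return "aux"   # stay in the same context, avoiding a context switch
    return "reg"       # priority gap too large: honor the regular runqueue

print(cas_pick(p_reg=100, p_aux=101, threshold=1))  # aux: within threshold
print(cas_pick(p_reg=100, p_aux=110, threshold=1))  # reg: gap too large
```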

  17. Context Aware Scheduler (CAS) (2/2) [Figure: threads A, C, E of one context and B, D of another, in the regular O(1) scheduler runqueue and in per-context auxiliary runqueues] • CAS with threshold 2 schedules A C E B D (context switches: 1) • O(1) scheduler schedules A B C D E (context switches: 4)
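The two schedules on this slide can be checked by counting address-space switches, taking threads A, C, E to belong to one context and B, D to another, as in the slide's figure:

```python
# Map each thread to its context (memory address space).
context_of = {"A": 0, "C": 0, "E": 0, "B": 1, "D": 1}

def context_switches(schedule):
    """Count adjacent pairs in the schedule that cross address spaces."""
    return sum(1 for prev, cur in zip(schedule, schedule[1:])
               if context_of[prev] != context_of[cur])

print(context_switches(["A", "B", "C", "D", "E"]))  # 4 with the O(1) scheduler
print(context_switches(["A", "C", "E", "B", "D"]))  # 1 with CAS (threshold 2)
```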

  18. Contents • Introduction • Effect of Sibling Threads on TLB • Context Aware Scheduler (CAS) • Benchmark Applications and Measurement Environment • Measurement Environment • Benchmarks • Measurements • Scheduler • Result • Related Work • Conclusion

  19. Measurement Environment • Intel Core 2 Duo 1.86 GHz [Table: spec of each memory hierarchy level]

  20. Benchmarks

  21. Measurements • DTLB and ITLB misses (user/kernel spaces) • Process time of each application, summed over its threads (e.g., process time of chat = chat 0 + chat 1 + … + chat M) • Elapsed time of executing the 4 applications (Chat, SysBench, Volano, DaCapo)

  22. Scheduler • O(1) scheduler in Linux 2.6.21 • CAS • threshold 1 • threshold 10

  23. Contents • Introduction • Effect of Sibling Threads on TLB • Context Aware Scheduler (CAS) • Benchmark Applications and Measurement Environment • Result • TLB misses • Process Time • Elapsed Time • Comparison with the Completely Fair Scheduler • Related Work • Conclusion

  24. TLB misses (millions)

  25. Why is a larger threshold better? • A larger threshold can aggregate more threads of the same context • Dynamic priority works against a small threshold [Figure: schedules of threads A to I under small and large thresholds]

  26. Process Time (seconds)

  27. Elapsed Time (seconds)

  28. Comparison with the Completely Fair Scheduler (CFS) • What is CFS? • Introduced in Linux 2.6.23 • Drops the heuristic calculation of dynamic priority • Does not consider the address space in scheduling • Why compare? • To investigate whether applying CAS to CFS is worthwhile • Can the CAS idea reduce TLB misses and process time in CFS?

  29. TLB misses

  30. Process Time and Total Elapsed Time (seconds)

  31. Contents • Introduction • Effect of Sibling Threads on TLB • Context Aware Scheduler (CAS) • Benchmark Applications and Measurement Environment • Result • Related Work • Conclusion

  32. Sujay Parekh, et al., “Thread Sensitive Scheduling for SMT Processors” (2000) • Parekh’s scheduler • runs groups of threads in parallel and samples information about • IPC • TLB misses • L2 cache misses, etc. • schedules based on the sampled information [Figure: alternating sampling and scheduling phases]

  33. Contents • Introduction • Effect of Sibling Threads on TLB • Context Aware Scheduler (CAS) • Benchmark Applications and Measurement Environment • Result • Related Work • Conclusion

  34. Conclusion • Conclusion • CAS is effective in reducing TLB misses • CAS enhances the throughput of every application • Future Work • Evaluation on other architectures • Applying CAS to the CFS scheduler • Extension to SMP platforms

  35. Additional Slides

  36. Effect of sibling threads on context switches (counts)

  37. Result of Cache Misses (thousand times)

  38. Result of Cache Misses (thousand times)

  39. Memory Consumption of CAS • Additional memory consumption of CAS • About 40 bytes per thread • About 150 K bytes per thread group • With 6 thread groups and 1700 threads: 6 * 150 K + 1700 * 40 ≈ 970 K bytes
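The slide's total can be reproduced from its per-thread and per-group figures (assuming 6 thread groups and 1700 threads, as implied by the arithmetic, and K = 1000 bytes):

```python
PER_THREAD_BYTES = 40          # CAS overhead per thread (from the slide)
PER_GROUP_BYTES = 150 * 1000   # ~150 K bytes per thread group (from the slide)
GROUPS, THREADS = 6, 1700      # implied by the slide's arithmetic

total = GROUPS * PER_GROUP_BYTES + THREADS * PER_THREAD_BYTES
print(total)  # 968000 bytes, i.e. roughly the 970K on the slide
```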

  40. Effective and Ineffective Cases of CAS • Effective case • Consecutive threads share a certain amount of data • Ineffective case • Consecutive threads do not share data [Figure: cache contents with overlapping vs. disjoint working sets of A and B]

  41. Pranay Koka, et al., “Opportunities for Cache Friendly Process” (2005) • Koka’s scheduler • traces the execution of each thread • focuses on the memory space shared between threads [Figure: alternating tracing and scheduling phases]

  42. Extension to SMP • Aggregation onto a limited set of processors [Figure: CPU 0 and CPU 1]

  43. Extension to SMP • Execute threads with the same address space in parallel [Figure: CPU 0 and CPU 1]

  44. TLB misses and Total Elapsed Time

  45. Widely Spread Multithreading • Multithreading hides the latency of disk I/O and network access • Threads in many languages, such as Java, Perl, and Python, correspond to OS threads [Figure: Thread B waits on the disk while Thread A runs] * More context switches happen today * The process scheduler in the OS is more responsible for system performance

  46. Context Aware (CA) Scheduler • Our CA scheduler aggregates sibling threads • Linux O(1) scheduler: A B C D E (context switches between processes: 3) • CA scheduler: A C D B E (context switches between processes: 1)

  47. Results of Context Switch (microseconds) [Figure: working sets of processes A (2 MB), B (1 MB), and C relative to the 2 MB L2 cache]

  48. Overhead due to a context switch by lat_ctx in LMbench

  49. Fairness • The O(1) scheduler keeps fairness by epochs • cycles of the active queue and expired queue • The CA scheduler also follows epochs • guarantees the same level of fairness as the O(1) scheduler [Figure: per-processor active and expired queues with priority bitmaps]
