
Effect of Context Aware Scheduler on TLB




Presentation Transcript


  1. Effect of Context Aware Scheduler on TLB Satoshi Yamada and Shigeru Kusakabe Kyushu University

  2. Contents • Introduction • Effect of Sibling Threads on TLB • Context Aware Scheduler (CAS) • Benchmark Applications and Measurement Environment • Result • Related Work • Conclusion

  3. Contents • Introduction • What is Context? • Motivation • Task Switch and Cache • Approach of our Scheduler • Effect of Sibling Threads on TLB • Context Aware Scheduler (CAS) • Benchmark Applications and Measurement Environment • Result • Related Work • Conclusion

  4. What is context? • Definition in this presentation: Context = Memory Address Space • Task switch: switching the running task • Context switch: a task switch that also changes the memory address space

  5. Motivation • More chances of using native threads in OSes today • Java, Perl, Python, Erlang, and Ruby • OpenMP, MPI • As the number of threads increases, the overhead due to task switches tends to get heavier • Agarwal, et al. “Cache performance of operating system and multiprogramming workloads” (1988)

  6. Task Switch and Cache • Overhead due to a task switch • includes that of loading the working set of the next process • is deeply related to the utilization of caches • Mogul, et al. “The effect of context switches on cache performance” (1991) [Figure: the working sets of processes A and B overflow the cache as the scheduler switches between them]

  7. Approach of our Scheduler • Three solutions to reduce the overhead due to task switches • Agarwal, et al. “Cache performance of operating system and multiprogramming workloads” (1988) • 1. Increase the size of caches • 2. Reuse the data shared among threads • 3. Utilize tagged caches and/or restrain cache flushes * We utilize sibling threads to achieve 2. and 3. * We mainly discuss 3.

  8. Contents • Introduction • Effect of Sibling Threads on TLB • Working Set and Task Switch • TLB tag and Task Switch • Advantage of Sibling Threads • Effect of Sibling Threads on Task Switches • Context Aware Scheduler (CAS) • Benchmark Applications and Measurement Environment • Result • Related Work • Conclusion

  9. Working Set and Task Switch • Task switch with small overhead: the working sets of processes A and B fit in the cache together • Task switch with large overhead: the working sets overflow the cache [Figure: cache contents across switches between processes A and B in both cases]

  10. TLB and Task Switch • Tagged TLB: TLB flush is not necessary (ARM, MIPS, etc.) • Non-tagged TLB: TLB flush is necessary (x86, etc.) [Figure: tagged TLB entries carry an address-space tag alongside each virtual-to-physical mapping; non-tagged entries do not]

  11. Advantage of Sibling Threads [Figure: fork() creates a PROCESS and copies the parent's mm_struct into the child's task_struct; clone() creates a THREAD whose task_struct shares the parent's mm_struct] • Advantage on task switches • Higher possibility of sharing data among sibling threads • Context switch does not happen • Restrains TLB flushes on non-tagged TLBs
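The fork()/clone() distinction on this slide can be observed from user space. A minimal sketch (Linux-only, since it uses os.fork(); CPython threads are created via clone() with a shared address space): a sibling thread's write is visible to its parent, while a forked child writes only to its own copy.

```python
import os
import threading

counter = {"value": 0}

def bump():
    counter["value"] += 1

# Sibling thread: shares the parent's address space (one mm_struct),
# so the parent observes the increment.
t = threading.Thread(target=bump)
t.start(); t.join()
assert counter["value"] == 1

# Child process: fork() duplicates the address space (copy-on-write),
# so the child's increment never reaches the parent's copy.
pid = os.fork()
if pid == 0:        # child
    bump()
    os._exit(0)
os.waitpid(pid, 0)
print(counter["value"])   # still 1 in the parent
```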

  12. Effect of Sibling Threads on Task Switches: Measurement • We use the idea of the lat_ctx program in LMbench [Figure: a working set is touched between switches, alternating between sibling threads and between processes]
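The lat_ctx idea can be sketched as follows (an illustrative approximation, not LMbench itself): two tasks bounce a one-byte token through pipes, so each hop forces the kernel to switch tasks, and the round-trip time divided by the hop count approximates the switch overhead plus pipe cost.

```python
import os
import time

def ping_pong(rounds=1000):
    """Approximate per-switch cost by token passing between two processes."""
    p2c_r, p2c_w = os.pipe()   # parent -> child
    c2p_r, c2p_w = os.pipe()   # child -> parent
    pid = os.fork()
    if pid == 0:               # child: echo each token back
        for _ in range(rounds):
            os.read(p2c_r, 1)
            os.write(c2p_w, b"x")
        os._exit(0)
    start = time.perf_counter()
    for _ in range(rounds):
        os.write(p2c_w, b"x")
        os.read(c2p_r, 1)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    return elapsed / (2 * rounds)   # seconds per switch (approximate)

print(f"~{ping_pong() * 1e6:.1f} us per switch (incl. pipe overhead)")
```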

  13. Effect of Sibling Threads on Task Switches: Results (sibling threads / process)

  14. Contents • Introduction • Effect of Sibling Threads on TLB • Context Aware Scheduler (CAS) • O(1) Scheduler in Linux • Context Aware Scheduler (CAS) • Benchmark Applications and Measurement Environment • Result • Related Work • Conclusion

  15. O(1) Scheduler in Linux • Structure • active queue and expired queue • priority bitmap and array of linked lists of threads • Behavior • search the priority bitmap and choose a thread with the highest priority • Scheduling overhead • independent of the number of threads [Figure: per-processor active and expired queues, each a priority bitmap over linked lists of threads]
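The structure above can be sketched in a few lines (a hypothetical simplification: 8 priority levels instead of the kernel's 140, and the class and method names are ours). Picking scans the bitmap for its first set bit, so the cost does not depend on how many threads are runnable.

```python
from collections import deque

NR_PRIO = 8   # the real O(1) scheduler uses 140 priority levels

class Runqueue:
    def __init__(self):
        self.queues = [deque() for _ in range(NR_PRIO)]  # one list per priority
        self.bitmap = 0                                   # bit i set <=> queue i non-empty

    def enqueue(self, thread, prio):
        self.queues[prio].append(thread)
        self.bitmap |= 1 << prio

    def pick_next(self):
        if not self.bitmap:
            return None
        # lowest set bit = highest priority (smaller number = higher priority)
        prio = (self.bitmap & -self.bitmap).bit_length() - 1
        thread = self.queues[prio].popleft()
        if not self.queues[prio]:
            self.bitmap &= ~(1 << prio)
        return thread

rq = Runqueue()
rq.enqueue("A", 1); rq.enqueue("B", 3); rq.enqueue("C", 1)
print(rq.pick_next())  # A: priority 1 beats priority 3
```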

  16. Context Aware Scheduler (CAS) (1/2) • CAS creates auxiliary runqueues per context, alongside the regular O(1) scheduler runqueue • CAS compares Preg and Paux • Preg: the highest priority in the regular O(1) scheduler runqueue • Paux: the highest priority in the auxiliary runqueue • if Preg - Paux ≤ threshold, then CAS chooses the thread at Paux
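The selection rule can be sketched as below. This is a simplification: `cas_pick` is our name, and the comparison assumes the kernel convention that a smaller number means a higher priority, so the slide's Preg - Paux ≤ threshold becomes Paux - Preg ≤ threshold here.

```python
def cas_pick(p_reg, p_aux, threshold):
    """Pick from the current context's auxiliary runqueue when its best
    thread is within `threshold` priority levels of the globally best
    thread; otherwise fall back to the regular O(1) choice."""
    if p_aux is not None and p_aux - p_reg <= threshold:
        return "aux"   # stay in the same context, avoiding a context switch
    return "reg"       # priority gap too large: honor the regular runqueue

print(cas_pick(p_reg=100, p_aux=101, threshold=1))  # aux: within threshold
print(cas_pick(p_reg=100, p_aux=110, threshold=1))  # reg: gap too large
```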

  17. Context Aware Scheduler (CAS) (2/2) [Figure: threads A, C, E of one context and B, D of another, in the regular O(1) scheduler runqueue and in per-context auxiliary runqueues] • CAS with threshold 2 schedules A C E B D (context switches: 1) • O(1) scheduler schedules A B C D E (context switches: 4)
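The two schedules on this slide can be checked by counting address-space switches, taking threads A, C, E to belong to one context and B, D to another, as in the slide's figure:

```python
# Map each thread to its context (memory address space).
context_of = {"A": 0, "C": 0, "E": 0, "B": 1, "D": 1}

def context_switches(schedule):
    """Count adjacent pairs in the schedule that cross address spaces."""
    return sum(1 for prev, cur in zip(schedule, schedule[1:])
               if context_of[prev] != context_of[cur])

print(context_switches(["A", "B", "C", "D", "E"]))  # 4 with the O(1) scheduler
print(context_switches(["A", "C", "E", "B", "D"]))  # 1 with CAS (threshold 2)
```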

  18. Contents • Introduction • Effect of Sibling Threads on TLB • Context Aware Scheduler (CAS) • Benchmark Applications and Measurement Environment • Measurement Environment • Benchmarks • Measurements • Scheduler • Result • Related Work • Conclusion

  19. Measurement Environment • Intel Core 2 Duo 1.86 GHz [Table: spec of each memory hierarchy level]

  20. Benchmarks

  21. Measurements • DTLB and ITLB misses (user/kernel spaces) • Process time of each application, summed over its threads (e.g., process time of chat = chat 0 + chat 1 + … + chat M) • Elapsed time of executing the 4 applications (Chat, SysBench, Volano, DaCapo)

  22. Scheduler • O(1) scheduler in Linux 2.6.21 • CAS • threshold 1 • threshold 10

  23. Contents • Introduction • Effect of Sibling Threads on TLB • Context Aware Scheduler (CAS) • Benchmark Applications and Measurement Environment • Result • TLB misses • Process Time • Elapsed Time • Comparison with the Completely Fair Scheduler • Related Work • Conclusion

  24. TLB misses (millions)

  25. Why is a larger threshold better? • A larger threshold can aggregate more threads of the same context • Dynamic priority works against a small threshold [Figure: schedules of threads A to I under small and large thresholds]

  26. Process Time (seconds)

  27. Elapsed Time (seconds)

  28. Comparison with the Completely Fair Scheduler (CFS) • What is CFS? • Introduced in Linux 2.6.23 • Drops the heuristic calculation of dynamic priority • Does not consider the address space in scheduling • Why compare? • To investigate whether applying CAS to CFS is worthwhile • Can the CAS idea reduce TLB misses and process time in CFS?

  29. TLB misses

  30. Process Time and Total Elapsed Time (seconds)

  31. Contents • Introduction • Effect of Sibling Threads on TLB • Context Aware Scheduler (CAS) • Benchmark Applications and Measurement Environment • Result • Related Work • Conclusion

  32. Sujay Parekh, et al., “Thread Sensitive Scheduling for SMT Processors” (2000) • Parekh’s scheduler • runs groups of threads in parallel and samples information about • IPC • TLB misses • L2 cache misses, etc. • schedules based on the sampled information [Figure: alternating sampling and scheduling phases]

  33. Contents • Introduction • Effect of Sibling Threads on TLB • Context Aware Scheduler (CAS) • Benchmark Applications and Measurement Environment • Result • Related Work • Conclusion

  34. Conclusion • Conclusion • CAS is effective in reducing TLB misses • CAS enhances the throughput of every application • Future Work • Evaluation on other architectures • Applying CAS to the CFS scheduler • Extension to SMP platforms

  35. Additional Slides

  36. Effect of sibling threads on context switches (counts)

  37. Result of Cache Misses (thousand times)

  38. Result of Cache Misses (thousand times)

  39. Memory Consumption of CAS • Additional memory consumption of CAS • About 40 bytes per thread • About 150 K bytes per thread group • With 6 thread groups and 1700 threads: 6 * 150 K + 1700 * 40 ≈ 970 K bytes
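The slide's total can be reproduced from its per-thread and per-group figures (assuming 6 thread groups and 1700 threads, as implied by the arithmetic, and K = 1000 bytes):

```python
PER_THREAD_BYTES = 40          # CAS overhead per thread (from the slide)
PER_GROUP_BYTES = 150 * 1000   # ~150 K bytes per thread group (from the slide)
GROUPS, THREADS = 6, 1700      # implied by the slide's arithmetic

total = GROUPS * PER_GROUP_BYTES + THREADS * PER_THREAD_BYTES
print(total)  # 968000 bytes, i.e. roughly the 970K on the slide
```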

  40. Effective and Ineffective Cases of CAS • Effective case • Consecutive threads share a certain amount of data • Ineffective case • Consecutive threads do not share data [Figure: cache contents with overlapping vs. disjoint working sets of A and B]

  41. Pranay Koka, et al., “Opportunities for Cache Friendly Process” (2005) • Koka’s scheduler • traces the execution of each thread • focuses on the memory space shared between threads [Figure: alternating tracing and scheduling phases]

  42. Extension to SMP • Aggregation onto a limited set of processors [Figure: CPU 0 and CPU 1]

  43. Extension to SMP • Execute threads with the same address space in parallel [Figure: CPU 0 and CPU 1]

  44. TLB misses and Total Elapsed Time

  45. Widely Spread Multithreading • Multithreading hides the latency of disk I/O and network access • Threads in many languages, such as Java, Perl, and Python, correspond to OS threads [Figure: Thread B waits on the disk while Thread A runs] * More context switches happen today * The process scheduler in the OS is more responsible for system performance

  46. Context Aware (CA) Scheduler • Our CA scheduler aggregates sibling threads • Linux O(1) scheduler: A B C D E (context switches between processes: 3) • CA scheduler: A C D B E (context switches between processes: 1)

  47. Results of Context Switch (microseconds) [Figure: working sets of processes A (2 MB), B (1 MB), and C relative to the 2 MB L2 cache]

  48. Overhead due to a context switch by lat_ctx in LMbench

  49. Fairness • The O(1) scheduler keeps fairness by epochs • cycles of the active queue and expired queue • The CA scheduler also follows epochs • guarantees the same level of fairness as the O(1) scheduler [Figure: per-processor active and expired queues with priority bitmaps]
