Corey: An Operating System for Many Cores

Corey: An Operating System for Many Cores. Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, Frans Kaashoek, Robert Morris, Aleksey Pesterev, Lex Stein, Ming Wu, Yuehua Dai, Yang Zhang, Zheng Zhang. MIT, Fudan University, Microsoft Research Asia, Xi’an Jiaotong University. OSDI 2008.


Presentation Transcript


  1. Corey: An Operating System for Many Cores Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, Frans Kaashoek, Robert Morris, Aleksey Pesterev, Lex Stein, Ming Wu, Yuehua Dai, Yang Zhang, Zheng Zhang MIT, Fudan University, Microsoft Research Asia, Xi’an Jiaotong University OSDI 2008 Presented by 강동우 (redcdang@gmail.com)

  2. Agenda • Introduction • Motivation • Design • Evaluation • Conclusion

  3. Introduction • Most PCs have, or will soon have, multicore chips • Cache-coherent shared-memory hardware is the new standard • The performance of some OS services scales very poorly with the number of cores/processors

  4. Motivation • Scalability problem • Benchmark: a varying number of threads within one process • Each thread creates a file descriptor, then repeatedly duplicates (dup) and closes it • 4 quad-core AMD Opteron chips (16 cores), Linux 2.6.25
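A minimal sketch of this microbenchmark (not the paper's exact harness; thread pinning and timing are omitted). Every thread works on its own descriptor, yet dup() and close() still update the process-wide file descriptor table, so all cores contend on that one shared kernel structure:

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS 16
#define ITERS    1000000L

static void *worker(void *arg)
{
    (void)arg;
    int fd = dup(STDOUT_FILENO);      /* each thread creates its own descriptor */
    for (long i = 0; i < ITERS; i++) {
        int d = dup(fd);              /* allocate a slot in the shared FD table */
        if (d < 0)
            exit(1);
        close(d);                     /* and free it again */
    }
    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```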

  5. Motivation • AMD 16-core system topology • Accessing another core's L1 or L2 cache takes longer than accessing the shared L3 cache • Accessing caches on another chip takes longer than accessing local caches • Kernels should therefore mostly access data in the local core's cache

  6. MapReduce • Map phase: • Processors read parts of the application's input • Generate intermediate results and store them locally • Reduce phase: • Processors collate the results produced by multiple map instances • Produce the output
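A minimal sketch of this structure (not the MapReduce library used in the paper's evaluation): each map thread reads its own slice of the input and stores a partial result locally, and a single reduce step then collates the per-thread results. The "job" here is just a byte sum, chosen only to keep the example short:

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NTHREADS  4
#define INPUT_LEN (1 << 20)

static unsigned char input[INPUT_LEN];
static unsigned long partial[NTHREADS];   /* per-thread intermediate results */

static void *map_worker(void *arg)
{
    long id = (long)arg;
    size_t chunk = INPUT_LEN / NTHREADS;
    size_t off = id * chunk;
    unsigned long sum = 0;
    for (size_t i = 0; i < chunk; i++)    /* map: read one slice of the input */
        sum += input[off + i];
    partial[id] = sum;                    /* store the result locally */
    return NULL;
}

int main(void)
{
    memset(input, 1, sizeof input);
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, map_worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    unsigned long total = 0;              /* reduce: collate the map outputs */
    for (int i = 0; i < NTHREADS; i++)
        total += partial[i];
    printf("total = %lu\n", total);
    return 0;
}
```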

  7. Design • Goal • Allow applications to control sharing of resources • 3 abstractions • Address range abstraction • Controls page tables and the kernel data used to manage them • Kernel core abstraction • Allows applications to dedicate cores to running particular kernel functions • Shares abstraction • Controls the kernel data used to resolve application references (lookups)

  8. Design - Address Range Abstraction • Two conventional kinds of address space today • Single address space (multiple threads): contention on the shared address space, so the map phase is bad but the reduce phase is good • Separate address spaces (multiple processes that share memory with mmap): extra soft page faults, so the map phase is good but the reduce phase is bad
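For reference, a minimal POSIX sketch of the second model (separate address spaces sharing memory through mmap): the physical pages are shared, but each process still builds its own virtual-to-physical mappings and takes its own soft page faults when it first touches them.

```c
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    size_t len = 1 << 20;
    /* MAP_SHARED|MAP_ANONYMOUS: the physical pages are shared, but every
     * process populates its own page-table mappings for them. */
    int *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    if (fork() == 0) {
        region[0] = 42;                        /* child writes into the shared region */
        _exit(0);
    }
    wait(NULL);
    printf("parent sees %d\n", region[0]);     /* parent reads the same page */
    return 0;
}
```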

  9. Design - Address Range Abstraction • Address ranges • Give applications high performance for both private and shared memory • A kernel-provided abstraction that corresponds to a range of virtual-to-physical mappings (roughly a slice of Linux's mm_struct) • No contention in the map phase, and no extra soft page faults in the reduce phase, because cores share the hardware page tables for the shared range
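The slide does not give Corey's actual address-range system calls, so the ar_alloc/ar_map names and signatures below are purely hypothetical stand-ins. They are sketched only to illustrate the idea: an application composes its address space out of ranges, so the mappings for a shared range are built once and reused by every core, while each core manipulates its private ranges without cross-core contention.

```c
#include <stddef.h>

/* Hypothetical handles and calls, NOT Corey's real API. */
typedef int ar_t;                      /* handle to a kernel address-range object */

ar_t ar_alloc(size_t len);             /* create a range of virtual-to-physical mappings */
void ar_map(void *va, ar_t ar);        /* splice a range into this core's address space  */

void setup_core(ar_t shared_results, size_t private_len)
{
    /* Private range: only this core ever touches these page-table entries,
     * so map-phase faults never contend with other cores. */
    ar_t priv = ar_alloc(private_len);
    ar_map((void *)0x100000000UL, priv);

    /* Shared range: one set of hardware page tables reused by all cores,
     * so the reduce phase reads remote results without extra soft faults. */
    ar_map((void *)0x200000000UL, shared_results);
}
```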

  10. Design – Kernel Core Abstraction • System calls are executed on the core of the invoking process • If a system call needs to access large shared data structures, this is bad (many cache misses) • Applications can instead dedicate cores to kernel functions and data • A kernel core can manage hardware devices and execute system calls sent to it from other cores
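A minimal sketch of the mechanism this implies (this is not Corey's actual IPC format or API): cores post requests into per-core mailboxes in shared memory, and one dedicated core loops over the mailboxes and executes the requests, so the kernel data those requests touch stays hot in the dedicated core's cache.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct request {
    int op;                    /* which kernel function to run */
    long arg;
    long result;
};

struct mailbox {
    _Atomic(bool) full;        /* set by the sender, cleared by the kernel core */
    struct request req;
};

/* Sender side: post a request and spin until the kernel core answers. */
long kernel_call(struct mailbox *mb, int op, long arg)
{
    mb->req.op = op;
    mb->req.arg = arg;
    atomic_store_explicit(&mb->full, true, memory_order_release);
    while (atomic_load_explicit(&mb->full, memory_order_acquire))
        ;                      /* spin locally; real code would also yield */
    return mb->req.result;
}

/* Kernel-core side: poll every sender's mailbox and execute its request. */
void kernel_core_loop(struct mailbox *boxes, int ncores,
                      long (*execute)(int op, long arg))
{
    for (;;) {
        for (int i = 0; i < ncores; i++) {
            struct mailbox *mb = &boxes[i];
            if (atomic_load_explicit(&mb->full, memory_order_acquire)) {
                mb->req.result = execute(mb->req.op, mb->req.arg);
                atomic_store_explicit(&mb->full, false, memory_order_release);
            }
        }
    }
}
```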

  11. Design – Shares Abstraction • File descriptors, process IDs • Many kernel operations look up an identifier in a table to get a pointer to the relevant kernel data structure • A shared FD table is a bottleneck • Shares • Applications can control how cores share the kernel lookup tables used to resolve such identifiers

  12. Implementation • Low-level kernel • Architecture-specific functions, device drivers • 11,000 lines of C, 150 lines of assembly • Unix-like environment • 11,000 lines of C/C++ • Buffer cache, cfork, TCP/IP stack interface (lwIP)

  13. Performance Evaluation • AMD 16-Core System • 8GB Memory • Linux kernel 2.6.25 • Pin one thread to each core • Intel Pro/1000 Ethernet Device

  14. Evaluation - Address Range Abstraction • memclone • Private memory • Each core allocates its own 100 MB array and modifies each page of the array • mempass • Shared memory • Allocates a single 100 MB array on one of the cores, touches each buffer page, and passes it to the next core, which repeats the process
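A simplified sketch of memclone (core pinning and timing omitted): each thread maps its own 100 MB region and writes one byte per page, so ideally every core works entirely on private memory and nothing is shared between cores.

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

#define NTHREADS 16
#define REGION   (100UL * 1024 * 1024)
#define PAGE     4096UL

static void *worker(void *arg)
{
    (void)arg;
    /* MAP_PRIVATE|MAP_ANONYMOUS: this core's own 100 MB array. */
    char *buf = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;
    for (unsigned long off = 0; off < REGION; off += PAGE)
        buf[off] = 1;          /* touch (and fault in) every page */
    munmap(buf, REGION);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```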

  15. Evaluation – Kernel Cores Abstraction • Simple TCP Service • Dedicated • use a kernel core for all network processing • Polling • use a kernel core only to poll for packet notifications and transmit completions

  16. Evaluation – Shares Abstraction • 2 microbenchmarks • Each core calls share_addobj() to add a per-core segment to a global share, then calls share_delobj() to delete that segment • The same, but the per-core segment is added to a local (per-core) share
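A sketch of the inner loop of both microbenchmarks. share_addobj() and share_delobj() are named on the slide, but their exact signatures are not, so the declarations below are assumed for illustration: passing a global share on every core reproduces the first benchmark, passing a core-local share the second.

```c
/* Assumed, illustrative signatures: the real Corey calls exist, but their
 * argument and return types are not given on the slide. */
void share_addobj(int shareid, int objid);
void share_delobj(int objid);

/* The same loop serves both benchmarks: contention depends entirely on
 * whether 'share' is the global share (all cores hit one lookup table) or
 * this core's local share (no cross-core sharing at all). */
void bench(int share, int my_segment)
{
    for (int i = 0; i < 1000000; i++) {
        share_addobj(share, my_segment);   /* add this core's segment to the share */
        share_delobj(my_segment);          /* then delete it again                 */
    }
}
```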

  17. Evaluation - Applications • MapReduce • The wri MapReduce application (likely a word-count-style job) • 1 GB input file

  18. Evaluation - Applications • Increase performance by dedicating application data to cores • A webd application called filesum • Returns the sum of the bytes in a requested file • Random mode vs. locality mode

  19. Conclusions • Applications should be scalable on multicore architectures • Corey is a new kernel with three abstractions: address ranges, kernel cores, and shares • They are shown to avoid scalability bottlenecks in MapReduce and a web application

  20. Backup Slides

  21. cfork • cfork(core_id) is an extension of UNIX fork() that creates a new process (pcore) on core core_id • Applications can specify multiple levels of sharing between parent and child • The default is copy-on-write
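A usage sketch based on the slide's description: cfork(core_id) behaves like fork() but places the child pcore on the given core. The exact prototype and the do_work() helper are assumptions for illustration; the flags that select finer-grained parent/child sharing are mentioned on the slide but not shown here.

```c
#include <unistd.h>

int cfork(int core_id);       /* Corey's fork() extension, per the slide (signature assumed) */
void do_work(int core);       /* hypothetical per-core worker */

void start_workers(int ncores)
{
    for (int core = 1; core < ncores; core++) {
        if (cfork(core) == 0) {
            /* The child pcore runs on 'core'; by default its address space
             * is copy-on-write from the parent. */
            do_work(core);
            _exit(0);
        }
    }
}
```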

  22. Network • Applications can decide to run • Multiple network stacks • A single shared network stack

  23. Buffer Cache • A shared buffer cache, like the regular UNIX buffer cache • Three modifications • A lock-free tree allows multiple cores to locate cached blocks without contention • A write scheme that tries to minimize contention • A scalable read/write lock
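The slide does not say how the scalable read/write lock is built; the sketch below shows one conventional design with per-core reader counters (sometimes called a big-reader lock), purely to illustrate why such a lock scales: readers touch only a core-local cache line, while the rare writer must sweep every per-core counter.

```c
#include <stdatomic.h>

#define NCORES    16
#define CACHELINE 64

struct brlock {
    struct {
        _Atomic int readers;
        char pad[CACHELINE - sizeof(_Atomic int)];   /* one cache line per core */
    } percore[NCORES];
    _Atomic int writer;                              /* 1 while a writer holds the lock */
};

void read_lock(struct brlock *l, int core)
{
    for (;;) {
        atomic_fetch_add(&l->percore[core].readers, 1);   /* core-local line only */
        if (!atomic_load(&l->writer))
            return;
        /* A writer is active: back out and wait for it to finish. */
        atomic_fetch_sub(&l->percore[core].readers, 1);
        while (atomic_load(&l->writer))
            ;
    }
}

void read_unlock(struct brlock *l, int core)
{
    atomic_fetch_sub(&l->percore[core].readers, 1);
}

void write_lock(struct brlock *l)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak(&l->writer, &expected, 1))
        expected = 0;                      /* exclude other writers */
    for (int i = 0; i < NCORES; i++)       /* wait for in-flight readers to drain */
        while (atomic_load(&l->percore[i].readers))
            ;
}

void write_unlock(struct brlock *l)
{
    atomic_store(&l->writer, 0);
}
```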

  24. Spin Locks, MCS Locks

  25. MCS Lock • FIFO ordering of lock acquisitions • A task that wants to enter the critical section appends itself to the lock's waiter queue • When releasing the lock, the task designates the next holder • It picks the task that entered the queue immediately after it • Advantages • Local spinning (little bus traffic) • No contention among tasks; the lock is acquired in a fixed order • Waiting time is bounded
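A minimal MCS lock sketch in C11 atomics, matching the properties on the slide: waiters form a FIFO queue, each waiter spins only on a flag in its own queue node (local spinning, little bus traffic), and the releasing task hands the lock directly to the task that queued right after it.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;
    _Atomic bool locked;
};

struct mcs_lock {
    _Atomic(struct mcs_node *) tail;       /* last node in the waiter queue */
};

void mcs_acquire(struct mcs_lock *lock, struct mcs_node *me)
{
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);

    /* Append ourselves to the queue; the previous tail is our predecessor. */
    struct mcs_node *prev = atomic_exchange(&lock->tail, me);
    if (prev == NULL)
        return;                            /* queue was empty: lock acquired */

    atomic_store(&prev->next, me);
    while (atomic_load(&me->locked))       /* spin only on our own node */
        ;
}

void mcs_release(struct mcs_lock *lock, struct mcs_node *me)
{
    struct mcs_node *next = atomic_load(&me->next);
    if (next == NULL) {
        /* No known successor: try to swing the tail back to empty. */
        struct mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
            return;
        /* A successor is enqueueing; wait until it links itself in. */
        while ((next = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&next->locked, false);    /* hand the lock to the next waiter */
}
```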

  26. [Diagram: Thread A on Core 0 and Thread B on Core 1, each with a root share and a per-process share; the file descriptors for paper.pdf and text.txt sit in private shares, while the descriptor for shared_avi.avi sits in a share shared by both threads]
