The Barrelfish operating system for CMPs: research issues


    1. The Barrelfish operating system for CMPs: research issues Tim Harris

    2. The Barrelfish project
        Collaboration between ETH Zurich and MSRC
        Andrew Baumann, Paul Barham, Richard Black, Tim Harris, Orion Hodson, Rebecca Isaacs, Simon Peter, Jan Rellermeyer, Timothy Roscoe, Adrian Schüpbach, Akhilesh Singhania, Pierre-Evariste Dagand, Ankush Gupta, Raffaele Sandrini, Dario Simone, Animesh Trivedi

    3. Introduction, Hardware and workloads, Multikernel design principles, Communication costs, Starting a domain

    4. Do we need a new OS?

    5. Do we need a new OS?

    6. Do we need a new OS? How might the design of a CMP differ from these existing systems? How might the workloads for a CMP differ from those of existing multi-processor machines?

    7. The clichéd single-threaded perf graph

    8. Interactive perf

    9. CC-NUMA architecture

    10. Machine architecture

    11. Machine diversity: AMD 4-core

    12. ...Sun Niagara-2

    13. ...Sun Rock

    15. Introduction, Hardware and workloads, Multikernel design principles, Communication costs, Starting a domain

    16. The multikernel model
        Apps can still use shared memory!
        Explicit communication via messages:
        - Means no shared memory (except for the endpoints of communication channels)
        - Knowledge of what parts of shared state are accessed, when, and by whom is exposed: can analyse, can optimise, can modify
        - Supports split-phase operations: do useful work, or sleep, while waiting for the reply
        Hardware neutral:
        - Architecture-specific parts are confined to the messaging transport and the interface to the actual hardware
        - Easy to plug in different messaging algorithms
        OS state is replicated
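
    To make the split-phase point concrete, here is a small C sketch (not Barrelfish code): a request is issued on one single-slot mailbox and the reply is polled for on another, with useful work overlapped in between. The mailbox type and the mb_send / mb_try_recv names are invented for illustration.

    /* Split-phase request/reply over a minimal polled mailbox.  The mailbox
     * type and the mb_* names are invented for illustration; this is not the
     * Barrelfish messaging API.  Each mailbox has a single writer and a
     * single reader, and is reused in request/reply lockstep. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    struct mailbox {
        _Alignas(64) _Atomic bool full;   /* set last, so payload is visible */
        uint64_t payload;
    };

    static void mb_send(struct mailbox *mb, uint64_t v)
    {
        mb->payload = v;
        atomic_store_explicit(&mb->full, true, memory_order_release);
    }

    static bool mb_try_recv(struct mailbox *mb, uint64_t *v)
    {
        if (!atomic_load_explicit(&mb->full, memory_order_acquire))
            return false;
        *v = mb->payload;
        atomic_store_explicit(&mb->full, false, memory_order_relaxed);
        return true;
    }

    static void do_local_work(void) { /* anything useful to overlap */ }

    /* Phase 1: issue the request.  Phase 2: poll for the reply, doing local
     * work (or yielding to the scheduler) instead of blocking in between. */
    uint64_t split_phase_call(struct mailbox *req, struct mailbox *reply,
                              uint64_t msg)
    {
        uint64_t r;
        mb_send(req, msg);
        while (!mb_try_recv(reply, &r))
            do_local_work();
        return r;
    }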

    17. Barrelfish: a multikernel OS
        A new OS architecture for scalable multicore systems
        Approach: structure the OS as a distributed system
        Design principles:
        - Make inter-core communication explicit
        - Make OS structure hardware-neutral
        - View state as replicated

    18. #1 Explicit inter-core communication
        All communication with messages
        Decouples system structure from inter-core communication mechanism
        Communication patterns explicitly expressed
        Better match for future hardware: naturally supports heterogeneous cores, non-coherent interconnects (PCIe), cheap explicit message passing, and hardware without cache-coherence (e.g. the Intel 80-core)
        Allows split-phase operations

    19. Communication latency

    20. Communication latency

    21. Message passing vs shared memory
        Shared memory (move the data to the operation):
        - Each core updates the same memory locations
        - Cache-coherence migrates modified cache lines
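
    A minimal pthreads illustration of this case, assuming a simple shared counter rather than the actual benchmark behind these slides: every thread updates the same atomic word, so the coherence protocol must migrate the modified cache line to each updating core in turn.

    /* Shared memory: move the data to the operation.  All threads update the
     * same atomic word, so the modified cache line migrates from core to
     * core on every update.  Illustrative only. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NUPDATES 1000000L

    static _Atomic long shared_counter;           /* the contended shared state */

    static void *worker(void *arg)
    {
        (void)arg;
        for (long i = 0; i < NUPDATES; i++)
            atomic_fetch_add(&shared_counter, 1); /* line bounces to this core */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", atomic_load(&shared_counter));
        return 0;
    }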

    22. Shared memory scaling & latency

    23. Message passing
        Message passing (move operation to the data):
        - A single server core updates the memory locations
        - Each client core sends RPCs to the server
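
    The message-passing counterpart, again only an illustrative pthreads sketch rather than the measured code: a single server thread owns the counter and applies all updates, while client threads issue RPCs through per-client, cache-line-sized slots. The contended state never leaves the server's cache; only one small slot line moves between exactly two cores per RPC.

    /* Message passing: move the operation to the data.  A single server
     * thread owns the counter and applies every update; each client issues
     * RPCs through its own cache-line-sized slot.  Illustrative only; the
     * real system uses URPC channels between cores, not pthreads. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NCLIENTS 3
    #define NUPDATES 100000

    struct slot {                                 /* one cache line per client   */
        _Alignas(64) _Atomic int state;           /* 0 empty, 1 request, 2 reply */
        long value;                               /* reply payload               */
    };

    static struct slot slots[NCLIENTS];
    static long counter;                          /* touched only by the server  */
    static _Atomic bool done;

    static void *server(void *arg)
    {
        (void)arg;
        while (!atomic_load(&done))
            for (int c = 0; c < NCLIENTS; c++)
                if (atomic_load_explicit(&slots[c].state, memory_order_acquire) == 1) {
                    slots[c].value = ++counter;   /* apply the update locally    */
                    atomic_store_explicit(&slots[c].state, 2, memory_order_release);
                }
        return NULL;
    }

    static void *client(void *arg)
    {
        struct slot *s = &slots[(intptr_t)arg];
        for (int i = 0; i < NUPDATES; i++) {
            atomic_store_explicit(&s->state, 1, memory_order_release);   /* RPC */
            while (atomic_load_explicit(&s->state, memory_order_acquire) != 2)
                ;                                                 /* await reply */
            atomic_store_explicit(&s->state, 0, memory_order_relaxed);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t srv, cl[NCLIENTS];
        pthread_create(&srv, NULL, server, NULL);
        for (intptr_t c = 0; c < NCLIENTS; c++)
            pthread_create(&cl[c], NULL, client, (void *)c);
        for (int c = 0; c < NCLIENTS; c++)
            pthread_join(cl[c], NULL);
        atomic_store(&done, true);
        pthread_join(srv, NULL);
        printf("counter = %ld\n", counter);       /* NCLIENTS * NUPDATES         */
        return 0;
    }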

    24. Message passing

    25. Message passing

    26. #2 Hardware-neutral structure
        Separate OS structure from hardware
        Only hardware-specific parts: message transports (highly optimised / specialised), CPU / device drivers
        Adaptability to changing performance characteristics
        Late-bind protocol and message transport implementations

    27. #3 Replicate common state
        Potentially-shared state accessed as if it were a local replica: scheduler queues, process control blocks, etc.
        Required by message-passing model
        Naturally supports domains that do not share memory
        Naturally supports changes to the set of running cores: hotplug, power management

    28. Replication vs sharing as the default
        Replicas used as an optimisation in other systems
        In a multikernel, sharing is a local optimisation: a shared (locked) replica on closely-coupled cores, only when faster, as decided at runtime
        Basic model remains split-phase

    29. Introduction, Hardware and workloads, Multikernel design principles, Communication costs, Starting a domain

    30. Applications running on Barrelfish
        Slide viewer (but not today...)
        Webserver (www.barrelfish.org)
        Virtual machine monitor (runs unmodified Linux)
        Parallel benchmarks: SPLASH-2, OpenMP
        SQLite
        ECLiPSe (constraint engine)
        more...

    31. 1-way URPC message costs: two HyperTransport requests on AMD

    32. Local vs remote messaging
        URPC to a remote core compares favourably with IPC:
        - No context switch: TLB unaffected
        - Lower cache impact
        - Higher throughput for pipelined messages
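
    URPC here is user-level RPC over shared memory: cache-line-sized messages written by one core and polled by the other, with no kernel transition. The sketch below shows one plausible channel layout; the 7-words-plus-sequence format and the function names are assumptions for illustration, not the Barrelfish wire format. Sending is then just writing one cache line, which is what shows up as the two HyperTransport requests per one-way message on the AMD system above.

    /* Simplified URPC-style channel: cache-line-sized messages in a shared
     * ring, written by one core and polled by the other, with no kernel
     * involvement.  The payload/sequence layout and all names are
     * illustrative assumptions, and there is no flow control (the sender
     * must not overrun the reader). */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define RING_SLOTS 64

    struct urpc_msg {
        uint64_t payload[7];
        _Atomic uint64_t seq;               /* written last; 0 means "empty"  */
    };

    struct urpc_chan {
        _Alignas(64) struct urpc_msg ring[RING_SLOTS];
        uint64_t send_next;                 /* private to the sending core    */
        uint64_t recv_next;                 /* private to the receiving core  */
    };

    void urpc_send(struct urpc_chan *c, const uint64_t payload[7])
    {
        struct urpc_msg *m = &c->ring[c->send_next % RING_SLOTS];
        for (int i = 0; i < 7; i++)
            m->payload[i] = payload[i];
        /* Publish the sequence number after the payload, so a reader that
         * sees the new sequence also sees a complete message. */
        atomic_store_explicit(&m->seq, ++c->send_next, memory_order_release);
    }

    bool urpc_try_recv(struct urpc_chan *c, uint64_t payload[7])
    {
        struct urpc_msg *m = &c->ring[c->recv_next % RING_SLOTS];
        if (atomic_load_explicit(&m->seq, memory_order_acquire) != c->recv_next + 1)
            return false;                   /* next message not written yet   */
        for (int i = 0; i < 7; i++)
            payload[i] = m->payload[i];
        c->recv_next++;
        return true;
    }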

    33. Communication perf: IP loopback
        2*2-core AMD system, 1000-byte packets
        Linux: copy in / out of shared kernel buffers
        Barrelfish: point-to-point URPC channel

    34. Case study: TLB shoot-down
        Send a message to every core with a mapping, wait for acks
        Linux/Windows: send IPI, spin on shared ack count
        Barrelfish: request to local monitor domain, 1-phase commit to remote cores
        Plug in different communication mechanisms
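
    A sketch of the n*unicast protocol from the next slides, using one polled request/ack slot per core as the transport. In Barrelfish the request really goes to the local monitor domain and out over URPC channels; the slot layout, the MAX_CORES bound, and the function names below are illustrative assumptions, and the local TLB invalidation itself is elided.

    /* n*unicast shoot-down: post a request to every core that has the
     * mapping, then wait for each acknowledgement.  One polled slot per core
     * stands in for the message channel. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_CORES 32

    static struct shootdown_slot {
        _Alignas(64) _Atomic uint64_t req_vaddr;  /* 0 = empty, else page     */
        _Atomic bool ack;
    } slots[MAX_CORES];

    /* Initiator: send a request to each core with the mapping, wait for acks. */
    void shootdown_unicast(uint64_t vaddr, const bool has_mapping[MAX_CORES])
    {
        for (int c = 0; c < MAX_CORES; c++)
            if (has_mapping[c]) {
                atomic_store(&slots[c].ack, false);
                atomic_store(&slots[c].req_vaddr, vaddr);   /* "send" message */
            }
        for (int c = 0; c < MAX_CORES; c++)
            if (has_mapping[c])
                while (!atomic_load(&slots[c].ack))         /* wait for ack   */
                    ;
    }

    /* Per-core handler, called from that core's message-polling loop. */
    void shootdown_poll(int my_core)
    {
        uint64_t vaddr = atomic_exchange(&slots[my_core].req_vaddr, 0);
        if (vaddr != 0) {
            /* invalidate the local TLB entry for vaddr here (e.g. invlpg) */
            atomic_store(&slots[my_core].ack, true);
        }
    }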

    35. TLB shoot-down: n*unicast

    36. TLB shoot-down: 1*broadcast

    37. Messaging costs

    38. TLB shoot-down: multicast

    39. TLB shoot-down: NUMA-aware mcast

    40. Messaging costs

    41. End-to-end comparative latency

    42. 2-PC pipelining

    43. Introduction, Hardware and workloads, Multikernel design principles, Communication costs, Starting a domain

    44. Terminology
        Domain: protection domain/address space (process)
        Dispatcher: one per domain per core; scheduled by the local CPU driver; invokes an upcall, which then typically runs a core-local user-level thread scheduler
        Domain spanning: start instances of a domain on multiple cores (cf. starting affinitized threads)

    45. Programming example: domain spanning
        for i = 1..num_cores-1:
            create a new dispatcher on core i
        while (num_dispatchers < num_cores-1):
            wait for the next message and handle it
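
    Rendered as a self-contained C program, with pthreads standing in for per-core dispatchers and an atomic counter standing in for the "dispatcher is up" messages; the real code goes through Barrelfish's spawn and monitor interfaces, which are not shown here.

    /* The spanning loop above, with pthreads as a stand-in for per-core
     * dispatchers.  Illustrative only; not the Barrelfish spawn API. */
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MAX_CORES 256

    static _Atomic long num_dispatchers;          /* "up" messages received    */

    static void *dispatcher(void *arg)
    {
        long core = (intptr_t)arg;
        /* ... core-local initialisation would happen here ... */
        (void)core;
        atomic_fetch_add(&num_dispatchers, 1);    /* tell the originating core */
        return NULL;
    }

    int main(void)
    {
        long num_cores = sysconf(_SC_NPROCESSORS_ONLN);
        if (num_cores < 1) num_cores = 1;
        if (num_cores > MAX_CORES) num_cores = MAX_CORES;

        pthread_t t[MAX_CORES];
        for (long i = 1; i <= num_cores - 1; i++)         /* for i = 1..num_cores-1 */
            pthread_create(&t[i], NULL, dispatcher, (void *)(intptr_t)i);

        while (atomic_load(&num_dispatchers) < num_cores - 1)
            sched_yield();                                /* handle "messages"      */

        printf("domain now spans %ld cores\n", num_cores);
        for (long i = 1; i <= num_cores - 1; i++)
            pthread_join(t[i], NULL);
        return 0;
    }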

    46. Domain spanning: baseline
        Centralized: poor scalability, but correct
        1021 messages, 487 alloc. RPCs
        Messages here = locks on conventional OSes
        Conventional debugging: sampling profiler (everybody sitting in spinlock acquire), aggregate stats on cache misses, etc. Specialized tools help, but messages make this a lot easier.

    47. Domain spanning: v2
        Memory allocation isn't usually thought of as a potential bottleneck on the critical path...
        We don't have a partitioned memory server (yet), because it's quite complicated:
        - Should each core-local memory server receive a statically partitioned memory range? Or should we have a NUMA-aware hierarchy of memory servers?
        - How and when do we adjust the amount of memory that each server has?
        - What's the overhead in the long term of a partitioned memory server?

    48. Domain spanning: v3

    49. Domain spanning: v4

    50. Introduction, Hardware and workloads, Multikernel design principles, Communication costs, Starting a domain

    51. Current activity
        Ports to other platforms: ARM (32 bit), ongoing; Bee3 FPGA platform
        Better tracing infrastructure
        Parallel file system
        Exploration of 1-machine distributed algorithms
        Programming model
        Papers and source code: http://www.barrelfish.org
