1. The Barrelfish operating system for CMPs: research issues Tim Harris
2. The Barrelfish project Collaboration between ETH Zurich and MSRC
Andrew Baumann, Paul Barham, Richard Black, Tim Harris, Orion Hodson, Rebecca Isaacs, Simon Peter, Jan Rellermeyer, Timothy Roscoe, Adrian Schüpbach, Akhilesh Singhania, Pierre-Evariste Dagand, Ankush Gupta, Raffaele Sandrini, Dario Simone, Animesh Trivedi
3. Introduction
Hardware and workloads
Multikernel design principles
Communication costs
Starting a domain
4. Do we need a new OS?
6. Do we need a new OS? How might the design of a CMP differ from these existing systems?
How might the workloads for a CMP differ from those of existing multi-processor machines?
7. The clichéd single-threaded perf graph
8. Interactive perf
9. CC-NUMA architecture
10. Machine architecture
11. Machine diversity: AMD 4-core
12. ...Sun Niagara-2
13. ...Sun Rock
15. Introduction
Hardware and workloads
Multikernel design principles
Communication costs
Starting a domain
16. The multikernel model Apps can still use shared memory!
Explicit communication via messages:
- Means no shared memory (except for the endpoints of communication channels)
- Exposes which parts of shared state are accessed, when, and by whom
Can analyse
Can optimise
Can modify
- Supports split-phase operations (see the sketch after this slide)
Do useful work, or sleep, while waiting for the reply
Hardware neutral:
- Architecture-specific parts are confined to
The messaging transport
The interface to the actual hardware
- Easy to plug in different messaging algorithms
OS state is replicated
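As a concrete, purely illustrative sketch of the split-phase pattern above: chan_send, chan_try_recv, and struct msg are hypothetical stand-ins, not Barrelfish's API. The request goes out as a message, and the sender polls for the reply while it is free to do other work.

#include <stdbool.h>
#include <stdio.h>

struct msg { int op; int arg; int result; };

/* Stand-in for a one-slot message channel; a real transport would be a
   shared-memory ring or an interconnect primitive. */
static struct msg mailbox;
static bool mailbox_full = false;

static void chan_send(struct msg *m) { mailbox = *m; mailbox_full = true; }

static bool chan_try_recv(struct msg *m)
{
    if (!mailbox_full) return false;
    mailbox.result = mailbox.arg * 2;   /* pretend the server replied */
    *m = mailbox;
    mailbox_full = false;
    return true;
}

int main(void)
{
    struct msg req = { .op = 1, .arg = 21, .result = 0 };
    chan_send(&req);                    /* phase 1: issue the request */

    struct msg reply;
    while (!chan_try_recv(&reply)) {
        /* phase 2: do useful work, or sleep, instead of blocking */
    }
    printf("result = %d\n", reply.result);
    return 0;
}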
17. Barrelfish: a multikernel OS A new OS architecture for scalable multicore systems
Approach: structure the OS as a distributed system
Design principles:
Make inter-core communication explicit
Make OS structure hardware-neutral
View state as replicated
18. #1 Explicit inter-core communication All communication with messages
Decouples system structure from inter-core communication mechanism
Communication patterns explicitly expressed
Better match for future hardware
Naturally supports heterogeneous cores, non-coherent interconnects (PCIe)
with cheap explicit message passing
without cache-coherence (e.g. Intel 80-core)
Allows split-phase operations
19. Communication latency
20. Communication latency
21. Message passing vs shared memory Shared memory (move the data to the operation):
Each core updates the same memory locations
Cache-coherence migrates modified cache lines
22. Shared memory scaling & latency
23. Message passing Message passing (move operation to the data):
A single server core updates the memory locations
Each client core sends RPCs to the server
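The contrast as a toy C program, with pthreads standing in for cores and every name invented for the example. In the first half each thread mutates one shared counter under a lock, so the coherence protocol bounces the line between cores; in the second, a single server thread owns the counter and clients send one-word RPCs through single-writer mailboxes.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define CLIENTS 4
#define OPS     100000

/* Shared memory: every "core" updates the same location under a lock. */
static long shared_counter;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *shared_client(void *arg)
{
    (void)arg;
    for (int i = 0; i < OPS; i++) {
        pthread_mutex_lock(&lock);
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Message passing: one server owns the counter; each client has a
   single-writer mailbox it uses to send "increment" requests. */
static long server_counter;
static _Atomic int request[CLIENTS];

static void *mp_client(void *arg)
{
    int id = (int)(long)arg;
    for (int i = 0; i < OPS; i++) {
        atomic_store(&request[id], 1);       /* send the RPC */
        while (atomic_load(&request[id])) ;  /* wait for the ack */
    }
    return NULL;
}

static void *mp_server(void *arg)
{
    (void)arg;
    for (long done = 0; done < (long)CLIENTS * OPS; ) {
        for (int c = 0; c < CLIENTS; c++) {
            if (atomic_load(&request[c])) {
                server_counter++;            /* only the server touches it */
                atomic_store(&request[c], 0);/* ack the client */
                done++;
            }
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t t[CLIENTS + 1];
    for (long i = 0; i < CLIENTS; i++) pthread_create(&t[i], NULL, shared_client, NULL);
    for (int i = 0; i < CLIENTS; i++) pthread_join(t[i], NULL);

    pthread_create(&t[CLIENTS], NULL, mp_server, NULL);
    for (long i = 0; i < CLIENTS; i++) pthread_create(&t[i], NULL, mp_client, (void *)i);
    for (int i = 0; i <= CLIENTS; i++) pthread_join(t[i], NULL);

    printf("shared: %ld  message passing: %ld\n", shared_counter, server_counter);
    return 0;
}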
24. Message passing
25. Message passing
26. #2 Hardware-neutral structure Separate OS structure from hardware
Only hardware-specific parts:
Message transports (highly optimised / specialised)
CPU / device drivers
Adaptability to changing performance characteristics
Late-bind protocol and message transport implementations
27. #3 Replicate common state Potentially-shared state accessed as if it were a local replica
Scheduler queues, process control blocks, etc.
Required by message-passing model
Naturally supports domains that do not share memory
Naturally supports changes to the set of running cores
Hotplug, power management
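A minimal sketch of the replication idea, under the simplifying assumption that "sending an update" can be modelled as a write into each peer's copy. The types and names are illustrative, not Barrelfish's: the point is that reads never leave the local core, while writes become explicit messages.

#include <stdio.h>

#define NCORES 4

/* Each core's private copy of some OS state, e.g. a count of runnable
   tasks; real Barrelfish replicates scheduler state and similar data. */
struct replica { int runnable; };

static struct replica replicas[NCORES];

/* Reads are purely local: no locks, no cache-line bouncing. */
static int read_runnable(int core) { return replicas[core].runnable; }

/* Updates are explicit messages to every peer; here "send" is modelled
   as a direct write into the peer's replica. */
static void broadcast_update(int delta)
{
    for (int c = 0; c < NCORES; c++)
        replicas[c].runnable += delta;
}

int main(void)
{
    broadcast_update(+3);   /* e.g. three tasks became runnable */
    broadcast_update(-1);   /* one completed */
    for (int c = 0; c < NCORES; c++)
        printf("core %d sees %d runnable\n", c, read_runnable(c));
    return 0;
}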
28. Replication vs sharing as the default Replicas used as an optimisation in other systems
In a multikernel, sharing is a local optimisation
Shared (locked) replica on closely-coupled cores
Only when faster, as decided at runtime
Basic model remains split-phase
29. Introduction
Hardware and workloads
Multikernel design principles
Communication costs
Starting a domain
30. Applications running on Barrelfish Slide viewer (but not today...)
Webserver (www.barrelfish.org)
Virtual machine monitor (runs unmodified Linux)
Parallel benchmarks:
SPLASH-2
OpenMP
SQLite
ECLiPSe (constraint engine)
more...
31. 1-way URPC message costs: two HyperTransport requests on AMD
32. Local vs remote messaging URPC to a remote core compares favourably with IPC
No context switch: TLB unaffected
Lower cache impact
Higher throughput for pipelined messages
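For flavour, a sketch of a one-way URPC-style channel: cache-line-sized messages in shared memory, where the receiver polls an epoch word in the last position of the line, so no trap or context switch is involved. The struct layout and helper names are assumptions for illustration; the real transport is more careful about batching and per-architecture memory ordering.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define SLOTS 8

/* One 64-byte message (one cache line): 56 bytes of payload plus an
   epoch word that the sender writes last and the receiver polls. */
struct urpc_msg {
    uint64_t payload[7];
    _Atomic uint64_t epoch;
};

struct urpc_chan {
    struct urpc_msg ring[SLOTS];  /* would live in memory shared by two cores */
    uint64_t send_pos;            /* private to the sender */
    uint64_t recv_pos;            /* private to the receiver */
};

static void urpc_send(struct urpc_chan *c, const uint64_t body[7])
{
    struct urpc_msg *m = &c->ring[c->send_pos % SLOTS];
    for (int i = 0; i < 7; i++) m->payload[i] = body[i];
    /* Release store: the payload becomes visible before the epoch flips. */
    atomic_store_explicit(&m->epoch, c->send_pos / SLOTS + 1, memory_order_release);
    c->send_pos++;
}

static int urpc_try_recv(struct urpc_chan *c, uint64_t body[7])
{
    struct urpc_msg *m = &c->ring[c->recv_pos % SLOTS];
    if (atomic_load_explicit(&m->epoch, memory_order_acquire) != c->recv_pos / SLOTS + 1)
        return 0;                 /* slot not yet rewritten this round */
    for (int i = 0; i < 7; i++) body[i] = m->payload[i];
    c->recv_pos++;
    return 1;
}

int main(void)
{
    static struct urpc_chan chan;
    uint64_t out[7] = { 42 }, in[7];
    urpc_send(&chan, out);
    while (!urpc_try_recv(&chan, in))
        ;                         /* poll: no interrupt, no kernel entry */
    printf("received %llu\n", (unsigned long long)in[0]);
    return 0;
}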
33. Communication perf: IP loopback 2*2-core AMD system, 1000-byte packets
Linux: copy in / out of shared kernel buffers
Barrelfish: point-to-point URPC channel
34. Case study: TLB shoot-down Send a message to every core with a mapping
Wait for acks
Linux/Windows:
Send IPI
Spin on shared ack count
Barrelfish:
Request to local monitor domain
1-phase commit to remote cores
Plug in different communication mechanisms
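A sketch of that protocol under strong simplifying assumptions: a single-threaded simulation in which send_invalidate and poll_ack are invented helpers, whereas the real path goes through each core's monitor over real channels. The shape is the one-phase commit: unicast to every core holding the mapping, then collect all acks.

#include <stdbool.h>
#include <stdio.h>

#define NCORES 4

/* Stand-ins for per-core message channels (invented for this sketch). */
static bool inval_pending[NCORES];
static bool acked[NCORES];

static void send_invalidate(int core, unsigned long vaddr)
{
    (void)vaddr;
    inval_pending[core] = true;      /* enqueue the request to that core */
}

static bool poll_ack(int core)
{
    if (inval_pending[core]) {       /* pretend the remote core ran: */
        inval_pending[core] = false; /* it flushed the TLB entry...  */
        acked[core] = true;          /* ...and sent back an ack      */
    }
    return acked[core];
}

static void tlb_shootdown(unsigned long vaddr, const bool has_mapping[NCORES])
{
    /* The only phase: message every core that has the mapping. */
    for (int c = 0; c < NCORES; c++)
        if (has_mapping[c])
            send_invalidate(c, vaddr);

    /* Collect acknowledgements. */
    for (int c = 0; c < NCORES; c++)
        while (has_mapping[c] && !poll_ack(c))
            ;                        /* could do useful work here instead */
}

int main(void)
{
    bool mapped[NCORES] = { true, false, true, true };
    tlb_shootdown(0xdead000UL, mapped);
    printf("shoot-down complete\n");
    return 0;
}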
35. TLB shoot-down: n*unicast
36. TLB shoot-down: 1*broadcast
37. Messaging costs
38. TLB shoot-down: multicast
39. TLB shoot-down: NUMA-aware mcast
40. Messaging costs
41. End-to-end comparative latency
42. 2-PC pipelining
43. Introduction
Hardware and workloads
Multikernel design principles
Communication costs
Starting a domain
44. Terminology Domain
Protection domain/address space (process)
Dispatcher
One per domain per core
Scheduled by local CPU driver
Invokes upcall, which then typically runs a core-local user-level thread scheduler
Domain spanning
Start instances of a domain on multiple cores
cf. starting affinitized threads
45. Programming example: domain spanning
for i = 1..num_cores-1:
    create a new dispatcher on core i
while num_dispatchers < num_cores-1:
    wait for the next message and handle it
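The same logic rendered as a C sketch; create_dispatcher_on() and handle_next_message() are hypothetical stand-ins for the real spanning and message-dispatch calls. The point is the split-phase shape: all requests are issued first, then the replies are drained.

#include <stdio.h>

#define NUM_CORES 4

static int num_dispatchers = 0;   /* dispatchers created so far */

/* Hypothetical stand-ins for the real calls. */
static void create_dispatcher_on(int core) { printf("spanning to core %d\n", core); }
static void handle_next_message(void)
{
    /* In the real system this waits on the domain's channels; here we
       just pretend each message is a "dispatcher up" reply. */
    num_dispatchers++;
}

int main(void)
{
    for (int i = 1; i <= NUM_CORES - 1; i++)  /* issue all requests */
        create_dispatcher_on(i);

    while (num_dispatchers < NUM_CORES - 1)   /* then drain the replies */
        handle_next_message();

    printf("domain spans %d cores\n", NUM_CORES);
    return 0;
}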
46. Domain spanning: baseline Centralized:
Poor scalability, but correct
1021 messages, 487 alloc. RPCs
Messages here = locks on conventional OSes
Conventional debugging: sampling profiler (everybody sitting in spinlock acquire), aggregate stats on cache misses, etc. Specialized tools help, but messages make this a lot easier.
47. Domain spanning: v2 Memory allocation isn't usually thought of as a potential bottleneck on the critical path...
We don't have a partitioned memory server (yet), because it's quite complicated. Should each core-local memory server receive a statically partitioned memory range? Or should we have a NUMA-aware hierarchy of memory servers? How and when do we adjust the amount of memory that each server has? What's the long-term overhead of a partitioned memory server?
48. Domain spanning: v3
49. Domain spanning: v4
50. Introduction
Hardware and workloads
Multikernel design principles
Communication costs
Starting a domain
51. Current activity Ports to other platforms
ARM (32 bit), ongoing
Bee3 FPGA platform
Better tracing infrastructure
Parallel file system
Exploration of 1-machine distributed algorithms
Programming model
Papers and source code
http://www.barrelfish.org