1. The Barrelfish operating system for CMPs: research issues Tim Harris
2. The Barrelfish project Collaboration between ETH Zurich and MSRC
Andrew Baumann, Paul Barham, Richard Black, Tim Harris, Orion Hodson, Rebecca Isaacs, Simon Peter, Jan Rellermeyer, Timothy Roscoe, Adrian Schüpbach, Akhilesh Singhania, Pierre-Evariste Dagand, Ankush Gupta, Raffaele Sandrini, Dario Simone, Animesh Trivedi
3. Introduction
Hardware and workloads
Multikernel design principles
Communication costs
Starting a domain
4. Do we need a new OS?
6. Do we need a new OS? How might the design of a CMP differ from these existing systems?
How might the workloads for a CMP differ from those of existing multi-processor machines?
7. The clichéd single-threaded perf graph
8. Interactive perf
9. CC-NUMA architecture
10. Machine architecture
11. Machine diversity: AMD 4-core
12. ...Sun Niagara-2
13. ...Sun Rock
15. Introduction
Hardware and workloads
Multikernel design principles
Communication costs
Starting a domain
16. The multikernel model Apps can still use shared memory!
Explicit communication via messages:
- Means no shared memory (except for the endpoints of communication channels)
- Exposes which parts of shared state are accessed, when, and by whom
Can analyse
Can optimise
Can modify
- Supports split-phase operations (see the sketch after this slide)
Do useful work, or sleep, while waiting for the reply
Hardware neutral:
- Architecture-specific parts are confined to
The messaging transport
The interface to the actual hardware
- Easy to plug in different messaging algorithms
OS state is replicated
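As a concrete, purely illustrative sketch of the split-phase pattern above: chan_send, chan_try_recv, and struct msg are hypothetical stand-ins, not Barrelfish's API. The request goes out as a message, and the sender polls for the reply while it is free to do other work.

#include <stdbool.h>
#include <stdio.h>

struct msg { int op; int arg; int result; };

/* Stand-in for a one-slot message channel; a real transport would be a
   shared-memory ring or an interconnect primitive. */
static struct msg mailbox;
static bool mailbox_full = false;

static void chan_send(struct msg *m) { mailbox = *m; mailbox_full = true; }

static bool chan_try_recv(struct msg *m)
{
    if (!mailbox_full) return false;
    mailbox.result = mailbox.arg * 2;   /* pretend the server replied */
    *m = mailbox;
    mailbox_full = false;
    return true;
}

int main(void)
{
    struct msg req = { .op = 1, .arg = 21, .result = 0 };
    chan_send(&req);                    /* phase 1: issue the request */

    struct msg reply;
    while (!chan_try_recv(&reply)) {
        /* phase 2: do useful work, or sleep, instead of blocking */
    }
    printf("result = %d\n", reply.result);
    return 0;
}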
17. Barrelfish: a multikernel OS A new OS architecture for scalable multicore systems
Approach: structure the OS as a distributed system
Design principles:
Make inter-core communication explicit
Make OS structure hardware-neutral
View state as replicated
18. #1 Explicit inter-core communication All communication with messages
Decouples system structure from inter-core communication mechanism
Communication patterns explicitly expressed
Better match for future hardware
Naturally supports heterogeneous cores, non-coherent interconnects (PCIe)
with cheap explicit message passing
without cache-coherence (e.g. Intel 80-core)
Allows split-phase operations
19. Communication latency
20. Communication latency
21. Message passing vs shared memory Shared memory (move the data to the operation):
Each core updates the same memory locations
Cache-coherence migrates modified cache lines
22. Shared memory scaling & latency
23. Message passing Message passing (move operation to the data):
A single server core updates the memory locations
Each client core sends RPCs to the server
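The contrast as a toy C program, with pthreads standing in for cores and every name invented for the example. In the first half each thread mutates one shared counter under a lock, so the coherence protocol bounces the line between cores; in the second, a single server thread owns the counter and clients send one-word RPCs through single-writer mailboxes.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define CLIENTS 4
#define OPS     100000

/* Shared memory: every "core" updates the same location under a lock. */
static long shared_counter;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *shared_client(void *arg)
{
    (void)arg;
    for (int i = 0; i < OPS; i++) {
        pthread_mutex_lock(&lock);
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Message passing: one server owns the counter; each client has a
   single-writer mailbox it uses to send "increment" requests. */
static long server_counter;
static _Atomic int request[CLIENTS];

static void *mp_client(void *arg)
{
    int id = (int)(long)arg;
    for (int i = 0; i < OPS; i++) {
        atomic_store(&request[id], 1);       /* send the RPC */
        while (atomic_load(&request[id])) ;  /* wait for the ack */
    }
    return NULL;
}

static void *mp_server(void *arg)
{
    (void)arg;
    for (long done = 0; done < (long)CLIENTS * OPS; ) {
        for (int c = 0; c < CLIENTS; c++) {
            if (atomic_load(&request[c])) {
                server_counter++;            /* only the server touches it */
                atomic_store(&request[c], 0);/* ack the client */
                done++;
            }
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t t[CLIENTS + 1];
    for (long i = 0; i < CLIENTS; i++) pthread_create(&t[i], NULL, shared_client, NULL);
    for (int i = 0; i < CLIENTS; i++) pthread_join(t[i], NULL);

    pthread_create(&t[CLIENTS], NULL, mp_server, NULL);
    for (long i = 0; i < CLIENTS; i++) pthread_create(&t[i], NULL, mp_client, (void *)i);
    for (int i = 0; i <= CLIENTS; i++) pthread_join(t[i], NULL);

    printf("shared: %ld  message passing: %ld\n", shared_counter, server_counter);
    return 0;
}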
24. Message passing
25. Message passing
26. #2 Hardware-neutral structure Separate OS structure from hardware
Only hardware-specific parts:
Message transports (highly optimised / specialised)
CPU / device drivers
Adaptability to changing performance characteristics
Late-bind protocol and message transport implementations
27. #3 Replicate common state Potentially-shared state accessed as if it were a local replica
Scheduler queues, process control blocks, etc.
Required by message-passing model
Naturally supports domains that do not share memory
Naturally supports changes to the set of running cores
Hotplug, power management
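A minimal sketch of the replication idea, under the simplifying assumption that "sending an update" can be modelled as a write into each peer's copy. The types and names are illustrative, not Barrelfish's: the point is that reads never leave the local core, while writes become explicit messages.

#include <stdio.h>

#define NCORES 4

/* Each core's private copy of some OS state, e.g. a count of runnable
   tasks; real Barrelfish replicates scheduler state and similar data. */
struct replica { int runnable; };

static struct replica replicas[NCORES];

/* Reads are purely local: no locks, no cache-line bouncing. */
static int read_runnable(int core) { return replicas[core].runnable; }

/* Updates are explicit messages to every peer; here "send" is modelled
   as a direct write into the peer's replica. */
static void broadcast_update(int delta)
{
    for (int c = 0; c < NCORES; c++)
        replicas[c].runnable += delta;
}

int main(void)
{
    broadcast_update(+3);   /* e.g. three tasks became runnable */
    broadcast_update(-1);   /* one completed */
    for (int c = 0; c < NCORES; c++)
        printf("core %d sees %d runnable\n", c, read_runnable(c));
    return 0;
}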
28. Replication vs sharing as the default Replicas used as an optimisation in other systems
In a multikernel, sharing is a local optimisation
Shared (locked) replica on closely-coupled cores
Only when faster, as decided at runtime
Basic model remains split-phase
29. Introduction
Hardware and workloads
Multikernel design principles
Communication costs
Starting a domain
30. Applications running on Barrelfish Slide viewer (but not today...)
Webserver (www.barrelfish.org)
Virtual machine monitor (runs unmodified Linux)
Parallel benchmarks:
SPLASH-2
OpenMP
SQLite
ECLiPSe (constraint engine)
more...
31. 1-way URPC message costs: two HyperTransport requests on AMD
32. Local vs remote messaging URPC to a remote core compares favourably with IPC
No context switch: TLB unaffected
Lower cache impact
Higher throughput for pipelined messages
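For flavour, a sketch of a one-way URPC-style channel: cache-line-sized messages in shared memory, where the receiver polls an epoch word in the last position of the line, so no trap or context switch is involved. The struct layout and helper names are assumptions for illustration; the real transport is more careful about batching and per-architecture memory ordering.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define SLOTS 8

/* One 64-byte message (one cache line): 56 bytes of payload plus an
   epoch word that the sender writes last and the receiver polls. */
struct urpc_msg {
    uint64_t payload[7];
    _Atomic uint64_t epoch;
};

struct urpc_chan {
    struct urpc_msg ring[SLOTS];  /* would live in memory shared by two cores */
    uint64_t send_pos;            /* private to the sender */
    uint64_t recv_pos;            /* private to the receiver */
};

static void urpc_send(struct urpc_chan *c, const uint64_t body[7])
{
    struct urpc_msg *m = &c->ring[c->send_pos % SLOTS];
    for (int i = 0; i < 7; i++) m->payload[i] = body[i];
    /* Release store: the payload becomes visible before the epoch flips. */
    atomic_store_explicit(&m->epoch, c->send_pos / SLOTS + 1, memory_order_release);
    c->send_pos++;
}

static int urpc_try_recv(struct urpc_chan *c, uint64_t body[7])
{
    struct urpc_msg *m = &c->ring[c->recv_pos % SLOTS];
    if (atomic_load_explicit(&m->epoch, memory_order_acquire) != c->recv_pos / SLOTS + 1)
        return 0;                 /* slot not yet rewritten this round */
    for (int i = 0; i < 7; i++) body[i] = m->payload[i];
    c->recv_pos++;
    return 1;
}

int main(void)
{
    static struct urpc_chan chan;
    uint64_t out[7] = { 42 }, in[7];
    urpc_send(&chan, out);
    while (!urpc_try_recv(&chan, in))
        ;                         /* poll: no interrupt, no kernel entry */
    printf("received %llu\n", (unsigned long long)in[0]);
    return 0;
}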
33. Communication perf: IP loopback 2*2-core AMD system, 1000-byte packets
Linux: copy in / out of shared kernel buffers
Barrelfish: point-to-point URPC channel
34. Case study: TLB shoot-down Send a message to every core with a mapping
Wait for acks
Linux/Windows:
Send IPI
Spin on shared ack count
Barrelfish:
Request to local monitor domain
1-phase commit to remote cores
Plug in different communication mechanisms
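A sketch of that protocol under strong simplifying assumptions: a single-threaded simulation in which send_invalidate and poll_ack are invented helpers, whereas the real path goes through each core's monitor over real channels. The shape is the one-phase commit: unicast to every core holding the mapping, then collect all acks.

#include <stdbool.h>
#include <stdio.h>

#define NCORES 4

/* Stand-ins for per-core message channels (invented for this sketch). */
static bool inval_pending[NCORES];
static bool acked[NCORES];

static void send_invalidate(int core, unsigned long vaddr)
{
    (void)vaddr;
    inval_pending[core] = true;      /* enqueue the request to that core */
}

static bool poll_ack(int core)
{
    if (inval_pending[core]) {       /* pretend the remote core ran: */
        inval_pending[core] = false; /* it flushed the TLB entry...  */
        acked[core] = true;          /* ...and sent back an ack      */
    }
    return acked[core];
}

static void tlb_shootdown(unsigned long vaddr, const bool has_mapping[NCORES])
{
    /* The only phase: message every core that has the mapping. */
    for (int c = 0; c < NCORES; c++)
        if (has_mapping[c])
            send_invalidate(c, vaddr);

    /* Collect acknowledgements. */
    for (int c = 0; c < NCORES; c++)
        while (has_mapping[c] && !poll_ack(c))
            ;                        /* could do useful work here instead */
}

int main(void)
{
    bool mapped[NCORES] = { true, false, true, true };
    tlb_shootdown(0xdead000UL, mapped);
    printf("shoot-down complete\n");
    return 0;
}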
35. TLB shoot-down: n*unicast
36. TLB shoot-down: 1*broadcast
37. Messaging costs
38. TLB shoot-down: multicast
39. TLB shoot-down: NUMA-aware mcast
40. Messaging costs
41. End-to-end comparative latency
42. 2-PC pipelining
43. Introduction
Hardware and workloads
Multikernel design principles
Communication costs
Starting a domain
44. Terminology Domain
Protection domain/address space (process)
Dispatcher
One per domain per core
Scheduled by local CPU driver
Invokes upcall, which then typically runs a core-local user-level thread scheduler
Domain spanning
Start instances of a domain on multiple cores
cf. starting affinitized threads
45. Programming example: domain spanning
for i = 1..num_cores-1:
    create a new dispatcher on core i
while num_dispatchers < num_cores-1:
    wait for the next message and handle it
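The same logic rendered as a C sketch; create_dispatcher_on() and handle_next_message() are hypothetical stand-ins for the real spanning and message-dispatch calls. The point is the split-phase shape: all requests are issued first, then the replies are drained.

#include <stdio.h>

#define NUM_CORES 4

static int num_dispatchers = 0;   /* dispatchers created so far */

/* Hypothetical stand-ins for the real calls. */
static void create_dispatcher_on(int core) { printf("spanning to core %d\n", core); }
static void handle_next_message(void)
{
    /* In the real system this waits on the domain's channels; here we
       just pretend each message is a "dispatcher up" reply. */
    num_dispatchers++;
}

int main(void)
{
    for (int i = 1; i <= NUM_CORES - 1; i++)  /* issue all requests */
        create_dispatcher_on(i);

    while (num_dispatchers < NUM_CORES - 1)   /* then drain the replies */
        handle_next_message();

    printf("domain spans %d cores\n", NUM_CORES);
    return 0;
}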
46. Domain spanning: baseline Centralized:
Poor scalability, but correct
1021 messages, 487 alloc. RPCs
Messages here = locks on conventional OSes
Conventional debugging: sampling profiler (everybody sitting in spinlock acquire), aggregate stats on cache misses, etc. Specialized tools help, but messages make this a lot easier.
47. Domain spanning: v2 Memory allocation isn't usually thought of as a potential bottleneck on the critical path...
We don't have a partitioned memory server (yet), because it's quite complicated. Should each core-local memory server receive a statically partitioned memory range? Or should we have a NUMA-aware hierarchy of memory servers? How and when do we adjust the amount of memory that each server has? What's the long-term overhead of a partitioned memory server?
48. Domain spanning: v3
49. Domain spanning: v4
50. Introduction
Hardware and workloads
Multikernel design principles
Communication costs
Starting a domain
51. Current activity Ports to other platforms
ARM (32 bit), ongoing
Bee3 FPGA platform
Better tracing infrastructure
Parallel file system
Exploration of 1-machine distributed algorithms
Programming model
Papers and source code
http://www.barrelfish.org