
Cores, cores, everywhere

Presentation Transcript


  1. Cores, cores, everywhere Based on joint work with Martín Abadi, Andrew Baumann, Paul Barham, Richard Black, Vladimir Gajinov, Orion Hodson, Rebecca Isaacs, Ross McIlroy, Simon Peter, Vijayan Prabhakaran, Timothy Roscoe, Adrian Schüpbach, Akhilesh Singhania

  2. Two hardware trends • Barrelfish operating system • Message-passing software • Managing parallel work

  3. Amdahl’s law “Sorting takes 70% of the execution time of a sequential program. You replace the sorting algorithm with one that scales perfectly on multi-core hardware. On a machine with 128 cores, how many cores do you need to use to get a 4x speed-up on the overall program?”

  4. Amdahl’s law, f=70%: with perfect scaling of the 70% fraction, the speedup achieved approaches the limit 1/(1-f) = 3.33x as c→∞, so the desired 4x overall speedup is unreachable on any number of cores.
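
A quick way to check this: Amdahl’s law gives a speedup of 1/((1-f) + f/c) on c cores. A minimal sketch in C (the function name is mine, not from the talk), assuming the 70%/30% split from the quiz:

    /* Amdahl's law: fraction f of the program scales perfectly across c
     * cores; the remaining (1-f) stays sequential. */
    #include <stdio.h>

    static double amdahl_speedup(double f, double c)
    {
        return 1.0 / ((1.0 - f) + f / c);
    }

    int main(void)
    {
        /* f = 0.7: sorting is 70% of the sequential run time */
        for (int c = 1; c <= 128; c *= 2)
            printf("cores=%3d  speedup=%.2f\n", c, amdahl_speedup(0.7, c));
        /* The limit as c grows is 1/(1-0.7) = 3.33, so a 4x overall
         * speedup is out of reach no matter how many cores are used. */
        return 0;
    }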

  5. Amdahl’s law, f=10%: the speedup achieved with perfect scaling is capped by the Amdahl’s law limit of just 1/(1-f) = 1.11x.

  6. Amdahl’s law, f=98%

  7. Amdahl’s law & multi-core. Suppose that the same h/w budget (space or power) can give us either 1 big core, 4 medium cores, or 16 small cores on a chip (analysis from Hill & Marty, “Amdahl’s law in the multicore era”).

  8. Perf of big & small cores. Assumption: perf = α√resource. The 1 big core uses the whole budget: total perf = 1 × 1 = 1. Each of the 16 small cores gets 1/16 of the resources, so perf √(1/16) = 1/4 each: total perf = 16 × 1/4 = 4 (analysis from Hill & Marty, “Amdahl’s law in the multicore era”).

  9. Amdahl’s law, f=98%: speedup curves for the 16 small, 4 medium, and 1 big core configurations (analysis from Hill & Marty, “Amdahl’s law in the multicore era”).

  10. Amdahl’s law, f=75%: speedup curves for the 1 big, 4 medium, and 16 small core configurations (analysis from Hill & Marty, “Amdahl’s law in the multicore era”).

  11. Asymmetric chips: the same budget can also build a single chip with 1 big core plus 12 small cores.

  12. Amdahl’s law, f=75%: adding the asymmetric 1+12 configuration to the comparison with 1 big, 4 medium, and 16 small cores (analysis from Hill & Marty, “Amdahl’s law in the multicore era”).
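
The curves on slides 9–12 follow directly from Hill & Marty’s model. A minimal sketch of how they can be computed, assuming perf(r) = √r, a 16-BCE budget, and that the asymmetric chip runs sequential code on its big core and parallel code on every core (the function names here are mine, not from the paper):

    #include <math.h>
    #include <stdio.h>

    /* Hill & Marty: a chip budget of n base core equivalents (BCEs); a core
     * built from r BCEs has sequential performance sqrt(r). */
    static double perf(double r) { return sqrt(r); }

    /* Symmetric chip: n/r identical cores of r BCEs each. */
    static double speedup_sym(double f, double n, double r)
    {
        return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n));
    }

    /* Asymmetric chip: one big core of r BCEs plus (n - r) single-BCE cores. */
    static double speedup_asym(double f, double n, double r)
    {
        return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r));
    }

    int main(void)
    {
        double f = 0.75, n = 16;
        printf("16 small : %.2fx\n", speedup_sym(f, n, 1));   /* 16 x 1-BCE cores */
        printf("4 medium : %.2fx\n", speedup_sym(f, n, 4));   /*  4 x 4-BCE cores */
        printf("1 big    : %.2fx\n", speedup_sym(f, n, 16));  /*  1 x 16-BCE core */
        printf("1 + 12   : %.2fx\n", speedup_asym(f, n, 4));  /* 4-BCE big core + 12 small */
        return 0;
    }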

  13. Two hardware trends (first of two): asymmetric performance and/or instruction sets, in contrast to traditional multi-processor machines.

  14. Cache-coherent multicore (diagram of a multi-socket system with per-core L2 caches, a shared L3 and local RAM per package). Example: AMD Istanbul – 6 cores, per-core L2, per-package L3.

  15. Single-chip cloud computer (SCC): 24 × 2-core tiles connected by an on-chip mesh network, each tile with a hardware message-passing buffer (MPB); non-coherent caches, hardware-supported messaging, off-chip RAM reached through multiple memory controllers.

  16. MSR Beehive: many RISC cores joined by a ring interconnect carrying messages, locks and memory traffic to a shared DDR controller; message passing in h/w, no cache coherence, split-phase memory access.

  17. Two hardware trends. Trend 1: asymmetric performance and/or instruction sets. Trend 2: non-cache-coherent access to memory. Both in contrast to traditional multi-processor machines.

  18. Two hardware trends • Barrelfish operating system • Message-passing software • Managing parallel work

  19. Messaging vs shared data as the default • Fundamental model is message based • “It’s better to have shared memory and not need it than to need shared memory and not have it” • Spectrum from traditional operating systems to the Barrelfish multikernel: shared state with one big lock → fine-grained locking → clustered objects / partitioning → distributed state with replica maintenance.

  20. The Barrelfish multi-kernel OS (diagram: apps running above per-core OS nodes, one per x64, ARM or accelerator core; each OS node holds a state replica and the nodes communicate by message passing over the hardware interconnect).

  21. The Barrelfish multi-kernel OS: the system runs on heterogeneous hardware, currently supporting ARM, Beehive, SCC, x86 & x64.

  22. The Barrelfish multi-kernel OS: system components are each local to a specific core and use message passing.

  23. The Barrelfish multi-kernel OS: user-mode programs – several models supported, including conventional shared-memory OpenMP & pthreads.

  24. Two hardware trends • Barrelfish operating system • Message-passing software • Managing parallel work

  25. Shared Resource Database Consensus – synchronous version (two-phase commit):

      bool updatePermissions(page_t page, flags_t flags) {
          bool ok = true;
          /* Voting phase: each blocking RPC completes (~400 cycles, assuming
             the process is scheduled on the other core!) before the request
             is sent to the next core */
          for (core in cores)
              ok &= permUpdateRequest_rpc(core, page, flags);
          if (ok) {
              /* Commit phase */
              localUpdatePermissions(page, flags);
              for (core in cores)
                  permUpdateCommit_send(core, page, flags);
          } else {
              for (core in cores)
                  permUpdateAbort_send(core, page, flags);
          }
          return ok;
      }

  26. Shared Resource Database Consensus – stack-ripped, event-driven version:

      void updatePermissions(page_t page, flags_t flags) {
          state_t *st = malloc(sizeof(state_t));
          st->ok = true; st->page = page; st->flags = flags; st->count = 0;
          for (core in cores) {
              /* Can fail to send immediately (e.g., due to a full channel),
                 so this would need to be stack-ripped too */
              permUpdateRequest_send(core, page, flags, st);
              st->count++;
          }
      }

      /* Stack-ripped continuation, run once per reply */
      void recvReply(state_t *st, bool ok) {
          st->ok &= ok;
          if (--st->count == 0) {
              if (st->ok) {
                  localUpdatePermissions(st->page, st->flags);
                  for (core in cores)
                      permUpdateCommit_send(core, st->page, st->flags);  /* ...and here */
              } else {
                  for (core in cores)
                      permUpdateAbort_send(core, st->page, st->flags);   /* ...and here */
              }
              free(st);
          }
      }

  27. AC: Asynchronous C
      • Synchronous code: easy to program, but poor performance.
      • Event-driven code: good performance, but difficult to program.
      • AC: a similar programming model to synchronous code, with performance similar to event-driven code.

  28. Shared Resource Database Consensus – AC version:

      bool updatePermissions(page_t page, flags_t flags) {
          bool ok = true;
          do {
              for (core in cores)
                  /* async identifies code that can block – execution can
                     continue after it; permUpdateRequest_AC is the AC
                     version of the message RPC */
                  async { ok &= permUpdateRequest_AC(core, page, flags); }
          } finish;  /* don’t pass finish until all async work created in the
                        do {} finish block has completed */
          if (ok) {
              localUpdatePermissions(page, flags);
              for (core in cores)
                  permUpdateCommit_send(core, page, flags);
          } else {
              for (core in cores)
                  permUpdateAbort_send(core, page, flags);
          }
          return ok;
      }

  29. Shared Resource Database Consensus: performance chart comparing the AC, event-driven, and synchronous implementations.

  30. Performance: ping-pong test with minimum-sized messages • AMD 4 × 4-core machine • using cores that share an L3 cache

  31. Performance • “Do not fear async” • Think about correctness: if the callee doesn’t block then perf is basically unchanged

  32. Two hardware trends • Barrelfish operating system • Message-passing software • Managing parallel work

  33. Adding Parallelism

      do {
          async msg_send(core_1, “Computing Forces”);
          par fluidAnimate(computeForces, cells, range);  /* spawn a bunch of parallel
                                                             tasks that can be run across
                                                             multiple cores */
      } finish;  /* wait for parallel and async tasks to complete before continuing */

  34. FluidAnimate • for each frame • move particles to correct cell • calculate cell density • calculate particle forces • calculate particle positions • render frame

  35. Static Partitioning • for each frame • move particles to correct cell • calculate cell density • calculate particle forces • calculate particle positions • render frame

  37. Static Partitioning. Problem: Uneven workload • for each frame • move particles to correct cell • calculate cell density • calculate particle forces • calculate particle positions • render frame

  38. Static Partitioning. Problem: Barrier Synchronization • for each frame • move particles to correct cell • calculate cell density • calculate particle forces • calculate particle positions • render frame

  39. Static Partitioning. Problem: Thread Preemption • for each frame • move particles to correct cell • calculate cell density • calculate particle forces • calculate particle positions • render frame. This is the approach taken by (e.g.) OpenMP and Intel Parallel Building Blocks; they assume you own the machine and know your workload.
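
Slide 39's point can be made concrete in OpenMP, which it names. In a statically scheduled loop the cells are divided evenly up front, so an uneven workload or a preempted thread leaves the other cores idling at the implicit barrier. A minimal sketch, with hypothetical stand-ins for the FluidAnimate data structures:

    #include <omp.h>

    #define NCELLS 4096
    static int    particles_in_cell[NCELLS];  /* hypothetical per-cell particle counts */
    static double density[NCELLS];

    static void compute_cell_density(int cell)
    {
        /* stand-in for the real per-cell density computation */
        density[cell] = (double)particles_in_cell[cell];
    }

    void density_pass(void)
    {
        /* schedule(static): every thread is handed a fixed chunk of cells up
           front.  If some cells hold far more particles than others, or a
           thread is preempted, the remaining cores sit idle at the loop's
           implicit barrier. */
        #pragma omp parallel for schedule(static)
        for (int cell = 0; cell < NCELLS; cell++)
            compute_cell_density(cell);

        /* schedule(dynamic, 16) would hand out chunks of cells on demand
           instead, trading scheduling overhead for better load balance. */
    }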

  40. Dynamic Partitioning (Work-Stealing) • for each frame • move particles to correct cell • calculate cell density • calculate particle forces • calculate particle positions • render frame

  46. Dynamic Partitioning (Work-Stealing). Problem: Spawn / Sync Overhead • for each frame • move particles to correct cell • calculate cell density • calculate particle forces • calculate particle positions • render frame. Task overheads: Cilk-5, 218 cycles per task; Wool (old version), 97 cycles per task; the density calculation does only ~10 cycles of work per particle.
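
Those numbers show why task granularity matters: at roughly 100–220 cycles of spawn/sync overhead per task and ~10 cycles of work per particle, spawning one task per particle spends most of the machine on overhead. The usual remedy is to chunk the work so each task amortizes its spawn cost. A hedged sketch using OpenMP tasks rather than the Cilk or Wool runtimes from the slide (the names and sizes are illustrative):

    #include <omp.h>

    #define NPARTICLES (1 << 20)
    #define CHUNK      4096        /* thousands of particles per task, so the
                                      ~100-200 cycle spawn cost is amortized */
    static double force[NPARTICLES];

    static void particle_force(int i)
    {
        force[i] += 1.0;           /* stand-in for the ~10-cycle per-particle work */
    }

    void force_pass(void)
    {
        #pragma omp parallel
        #pragma omp single
        for (int base = 0; base < NPARTICLES; base += CHUNK) {
            /* one task per CHUNK particles, not one task per particle */
            #pragma omp task firstprivate(base)
            for (int i = base; i < base + CHUNK && i < NPARTICLES; i++)
                particle_force(i);
        }
        /* all tasks are guaranteed complete at the barrier that ends the
           parallel region */
    }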

  47. Dynamic Partitioning (Work-Stealing). Problem: Cache Locality • for each frame • move particles to correct cell • calculate cell density • calculate particle forces • calculate particle positions • render frame

