McRT: Many-Core Runtime

McRT: Many-Core Runtime. Ali Adl-Tabatabai, Anwar Ghuloum, Dong Yuan Chen, Rick Hudson, Vijay Menon, Brian Murphy, Tatiana Shpeisman, Bratin Saha. Programming Systems Lab, MTL/CTG. What is McRT: a scalable many-core runtime that supports multiple programming models (pthread, OpenMP, …)


Presentation Transcript


  1. McRT: Many-Core Runtime
     Ali Adl-Tabatabai, Anwar Ghuloum, Dong Yuan Chen, Rick Hudson, Vijay Menon, Brian Murphy, Tatiana Shpeisman, Bratin Saha
     Programming Systems Lab, MTL/CTG

  2. What is McRT
     • Scalable many-core runtime
     • Supports multiple programming models (pthread, OpenMP, …)
     • Supports multiple platforms
       • Simulator, SMP and sequestered systems

  3. McRT: Architecture
     [Diagram: layered software stack, top to bottom]
     • Applications & Libraries: Media Workloads, Parallel Primitives Library, RMS Workloads, Network Processing Workloads, …
     • Adapters for Programming Models: Java Virtual Machine, CILK, Brook, Pthread, OpenMP, …
     • Scalable Core Services: Profiling, Thread Scheduler, Thread Synchronization, Memory Management, …
     • Multiple Execution Platforms: CPU Simulator (Skeleton), Sequestered Core System, Windows/Linux IA-32 SMP, SMAC Simulator (TA), Many Core Cache Simulator (McPLS)

  4. McRT Scheduler Details (diagram: three cores, each with 2 HTs)
     • Distributed run queues to reduce contention

  5. McRT Scheduler Details (diagram: three cores, each with 2 HTs)
     • Distributed run queues to reduce contention
     • Program “main” goes into a queue

  6. McRT Scheduler Details (diagram: three cores, each with 2 HTs)
     • Program “main” gets picked by a processor

  7. McRT Scheduler Details (diagram: three cores, each with 2 HTs)
     • New work gets added to run queues

  8. McRT Scheduler Details (diagram: three cores, each with 2 HTs)
     • Knob controls work sharing

  9. McRT Scheduler Details (diagram: three cores, each with 2 HTs)
     • Work sharing: keeping all cores busy
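
Work sharing, as slides 8 and 9 describe it, can be sketched as follows. This is a minimal illustration, not McRT's implementation: the function name `share_spawn` and the `share_threshold` parameter (standing in for the "knob") are assumptions made for this example, and the policy of handing excess work to the shortest queue is one plausible choice among several.

```python
from collections import deque

def share_spawn(queues, worker, task, share_threshold):
    """Enqueue a new task, sharing it with another queue when the local one is full enough.

    share_threshold is a hypothetical stand-in for the work-sharing knob:
    once the spawning worker's own queue holds that many tasks, the new task
    is pushed to the currently shortest queue instead.
    """
    target = worker
    if len(queues[worker]) >= share_threshold:
        # min() returns the first shortest queue, so ties go to the lowest index.
        target = min(range(len(queues)), key=lambda i: len(queues[i]))
    queues[target].append(task)
    return target

# Worker 0 spawns six tasks; with a threshold of 2 the surplus spreads out.
queues = [deque(), deque(), deque()]
targets = [share_spawn(queues, 0, t, share_threshold=2) for t in range(6)]
```

A low threshold pushes work to other cores eagerly and keeps them busy; a high threshold keeps work local until the spawning queue is long.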

  10. McRT Scheduler Details (diagram: three cores, each with 2 HTs)
      • Work stealing
        • Idle processors look for work in other cores
        • Knob controls degree of stealing

  11. McRT Scheduler Details (diagram: three cores, each with 2 HTs)
      • Work stealing: reducing periods of idleness
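
The stealing behaviour in slides 10 and 11 can be sketched with one run queue per hardware thread. This is a toy model under stated assumptions: the class name `Scheduler`, the `steal_attempts` parameter (standing in for the "knob" that controls the degree of stealing), and the choice to take the newest task locally but steal the oldest from a victim are all illustrative, not details taken from the slides.

```python
from collections import deque

class Scheduler:
    """Toy model: one run queue per hardware thread, with optional stealing."""

    def __init__(self, num_workers, steal_attempts=1):
        self.queues = [deque() for _ in range(num_workers)]
        # Hypothetical knob: how many other queues an idle worker may probe.
        self.steal_attempts = steal_attempts

    def spawn(self, worker, task):
        """Add new work to the spawning worker's own queue."""
        self.queues[worker].append(task)

    def next_task(self, worker):
        """Return the next task for this worker, or None if it stays idle."""
        own = self.queues[worker]
        if own:
            return own.pop()  # newest local task first
        n = len(self.queues)
        for i in range(1, min(self.steal_attempts, n - 1) + 1):
            victim = self.queues[(worker + i) % n]
            if victim:
                return victim.popleft()  # steal the oldest task from the victim
        return None

sched = Scheduler(num_workers=3, steal_attempts=2)
for name in ("main", "a", "b"):
    sched.spawn(0, name)
```

Setting `steal_attempts=0` turns the model into plain distributed queues with no stealing, which makes the effect of the knob easy to compare.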

  12. McRT On Sequestered Cores
      [Diagram: two partitions connected by IPI / memory-mapped PCI register based signaling]
      • Windows host partition, on the main core(s): Threaded Application, McRT, Windows + Driver
      • Sequestered Cores partition, on the sequestered core(s): Threaded Application, McRT, Light Weight Executive
      • McRT: scheduling, synchronization, memory management, …
      • Additional diagram label: Windows thread partition

  13. McRT-Sequestered Overview
      • OS services (e.g. I/O) available only on the main cores
      • Sequestered cores used as compute device
        • Graphics, games, network processing, etc.
      • McRT manages threads on sequestered cores
      • LWE (Light Weight Executive) provides boot services & exception handling
      • McRT partitions HW threads & allows migration between partitions
      • Threads migrate from sequestered to main core for OS services
      • Thread migration is transparent to the programmer
      Note: "sequestered" here means set apart (isolated).

  14. McRT-Sequestered Model (diagram: a Windows core and two groups of sequestered cores)
      • McRT divides the processors into separate partitions
      • Program “main” added to sequestered queue

  15. McRT-Sequestered Model (diagram: a Windows core and two groups of sequestered cores)
      • McRT divides the processors into separate partitions
      • Program “main” picked by sequestered processor

  16. McRT-Sequestered Model (diagram: a Windows core and two groups of sequestered cores)
      • Every partition is a separate entity
      • New work added to sequestered queues

  17. McRT-Sequestered Model (diagram: a Windows core and two groups of sequestered cores)
      • Every partition is a separate entity
      • Work sharing & stealing only within a partition

  18. McRT-Sequestered Model (diagram: a Windows core and two groups of sequestered cores)
      • A task can ask McRT to change partitions, e.g., migrate to the OS partition, execute an OS call & migrate back
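
The migrate, call, migrate-back pattern from slide 18 can be sketched with a Python generator standing in for a task whose continuation is moved between partition queues. Everything named here (`MIGRATE`, `worker_task`, `run`, and the partition names `"os"` and `"sequestered"`) is hypothetical and chosen only for illustration; the slides do not show McRT's actual interface.

```python
from collections import deque

MIGRATE = "migrate"  # hypothetical request tag understood by the toy runtime

def worker_task(trace):
    """A task that computes on a sequestered core and hops to the OS partition for one call."""
    # Each yield asks the runtime to move the task; the runtime resumes it
    # with the name of the partition it is now running on.
    where = yield (MIGRATE, "sequestered")
    trace.append(("compute", where))
    where = yield (MIGRATE, "os")
    trace.append(("os_call", where))
    where = yield (MIGRATE, "sequestered")
    trace.append(("compute", where))

def run(task, partitions=("os", "sequestered")):
    """Drive one task, moving its continuation between per-partition queues."""
    queues = {name: deque() for name in partitions}
    request = next(task)  # run until the first migration request
    while True:
        _, dest = request
        queues[dest].append(task)          # enqueue the suspended task on the target partition
        runnable = queues[dest].popleft()  # a processor of that partition picks it up
        try:
            request = runnable.send(dest)
        except StopIteration:
            return

trace = []
run(worker_task(trace))
```

Because the task only sees the value it is resumed with, the migration itself stays out of the task's logic, which loosely mirrors the claim that thread migration is transparent to the programmer.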

  19. Backup

  20. McRT: Research Agenda
      • Common scalable many-core runtime
      • Support multiple programming models
      • Scalable runtime across multiple platforms
        • Simulator, SMP and sequestered systems
      • Reliability and programmability features
      • Threading platform for domain specific & general-purpose languages
      • Runtime support for message passing systems
      McRT: A scalable and reliable software environment for the many-core platform

  21. Outline
      • McRT overview
      • McRT many-core simulation
        • Results and key runtime scalability features
      • McRT on SMP systems
      • McRT on sequestered core system
      • Conclusions

  22. McRT Scalability: MPEG4
      [Figure: OMP-Xvid speedup on McRT-TA]
      • Nearly linear scaling up to 64 HW threads on the XviD MPEG4 encoder

  23. McRT Scalability: RMS Kernels
      • All speedups are relative to execution time on a single core (4 threads)
      • Good scalability up to 64 HW threads

  24. McRT: Key Scalability Features
      • User-level synchronization primitives
        • Multiple locking algorithms & barrier implementations
        • User-level monitor & mwait for efficient HW spin waiting
      • User-level thread scheduler
        • Supports 128+ HW threads
        • Continuation-based threading / task-based model
        • Distributed work queues with support for work stealing and sharing
        • Supports partitioning (used in sequestered platform)
      • User-level memory manager
        • Size segregated thread local allocation pools
        • Completely non-blocking implementation
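
To make "size segregated thread local allocation pools" concrete, here is a minimal sketch. The size classes, the class name `ThreadLocalPools`, and the use of `bytearray` blocks are assumptions for illustration only. The sketch also does not attempt to show the non-blocking property claimed on the slide; it only shows why per-thread pools avoid contention on the common path.

```python
import threading

SIZE_CLASSES = (16, 32, 64, 128)  # hypothetical size classes, in bytes

def size_class(nbytes):
    """Round a request up to the smallest size class that fits it."""
    for c in SIZE_CLASSES:
        if nbytes <= c:
            return c
    raise ValueError("request exceeds the largest size class")

class ThreadLocalPools:
    """Each thread keeps its own free list per size class, so alloc/free touch no shared state."""

    def __init__(self):
        self._local = threading.local()

    def _pools(self):
        pools = getattr(self._local, "pools", None)
        if pools is None:
            # First use on this thread: create its private free lists.
            pools = self._local.pools = {c: [] for c in SIZE_CLASSES}
        return pools

    def alloc(self, nbytes):
        c = size_class(nbytes)
        free_list = self._pools()[c]
        # Reuse a freed block of the same class if one exists, else make a new one.
        return free_list.pop() if free_list else bytearray(c)

    def free(self, block):
        # Blocks are always exactly one size class long, so len() recovers the class.
        self._pools()[len(block)].append(block)

pools = ThreadLocalPools()
block = pools.alloc(20)   # served from the 32-byte class
pools.free(block)
reused = pools.alloc(30)  # same class, so the freed block comes back
```

In this sketch a block freed by a different thread simply lands in that thread's pool; how a real allocator handles such remote frees is outside what the slides describe.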

  25. McRT Core Services: Scalability Improvements
      • Single queue gives best load balancing but suffers from contention
      • Queued locks deal better with contention at large # of HW threads
      • Distributed queues eliminate contention but don’t balance load
      • Stealing gives best of all worlds: load balancing + no contention
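
The load-balancing half of this trade-off can be illustrated with a small deterministic simulation. It is a sketch under simplifying assumptions: task costs are known up front, no new tasks are spawned, and neither queue contention nor the cost of a steal is modelled. The function name `makespan` and the steal-from-the-longest-queue policy are choices made for this example.

```python
import heapq
from collections import deque

def makespan(queues, steal):
    """Simulate workers draining per-worker queues of task costs; return the finish time."""
    queues = [deque(q) for q in queues]
    # Heap of (time at which the worker becomes free, worker id).
    ready = [(0, w) for w in range(len(queues))]
    heapq.heapify(ready)
    finish = 0
    while ready:
        now, w = heapq.heappop(ready)
        if queues[w]:
            cost = queues[w].popleft()
        elif steal and any(queues):
            victim = max(queues, key=len)  # steal from the longest queue
            cost = victim.pop()
        else:
            continue  # no work this worker may take: it retires
        finish = max(finish, now + cost)
        heapq.heappush(ready, (now + cost, w))
    return finish

# Eight unit-cost tasks all start on worker 0 of a 4-worker system.
skewed = [[1] * 8, [], [], []]
without_stealing = makespan(skewed, steal=False)
with_stealing = makespan(skewed, steal=True)
```

With the skewed input, the distributed queues alone leave three workers idle, while stealing spreads the eight tasks over all four workers.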

  26. Need For Custom Scheduling
      [Figure: instructions executed by different worker threads (32 HW thread config), XviD vs. Equake; y-axis: instructions, 0.0E+00 to 2.5E+07]
      • Equake tasks have good load balance → stealing adds overhead
      • XviD has load imbalance among tasks → stealing helps

  27. Outline
      • McRT overview
      • McRT many-core simulation
      • McRT on SMP systems
        • Key challenges and results
      • McRT on sequestered core system
      • Conclusions

  28. McRT On SMP Systems
      • Key challenge:
        • Efficient coupling between user-level runtime & OS
      • Key McRT features:
        • Novel synchronization library
          • Queue based synchronization supporting cancellation and timeout
          • User-level spin waiting + scheduler-level blocking
          • Linux & Windows kernel-level blocking for efficient 1:1 scheduling
          • Predicated continuations for efficient M:N scheduling
        • Non-blocking data structures
          • Provides preemption safety and greater resilience to thread delays
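
The combination of user-level spin waiting, blocking, and timeout can be sketched with Python's standard `threading.Lock`. This is only an analogy: the helper name `acquire_spin_then_block` and its parameters are assumptions, the blocking step here goes through the Python runtime rather than a user-level scheduler, and the sketch is not queue based.

```python
import threading
import time

def acquire_spin_then_block(lock, spin_iters=100, timeout=1.0):
    """Spin briefly without blocking, then fall back to a blocking wait bounded by a timeout.

    Returns True if the lock was acquired, False if the timeout expired.
    """
    deadline = time.monotonic() + timeout
    for _ in range(spin_iters):
        # Cheap path: a short wait is often resolved without blocking the thread.
        if lock.acquire(blocking=False):
            return True
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        return False
    # Slow path: block, but give up once the overall deadline has passed.
    return lock.acquire(timeout=remaining)

lock = threading.Lock()
first = acquire_spin_then_block(lock)                               # lock is free
second = acquire_spin_then_block(lock, spin_iters=10, timeout=0.05)  # lock is held
```

Spinning suits short critical sections, while the blocking fallback avoids burning a hardware thread when the holder is delayed; the timeout lets the caller abandon the wait.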

  29. McRT On SMP: Results
      • Application uses standard OpenMP
      • McRT and the native (OpenMP) runtime run on the same 16-way IBM SMP Linux system
      • Both McRT & native speedups are relative to the execution time for 1P on the native (OpenMP) runtime

  30. SEMPHY Speedup: Details
      • McRT scheduler can provide the advantages of a task queue → better programmability

  31. Outline
      • McRT overview
      • McRT many-core simulation
      • McRT on SMP systems
      • McRT on sequestered core system
        • Architecture, challenges, and results
      • Conclusions

  32. Sequestered core Stuff • See main part of presentation

  33. McRT-Sequestered Results
      • All speedups are relative to the execution time for 1P on the native (OpenMP) runtime
      • Native: OpenMP on 8P SMP (all processors running Win 2003)
      • McRT-OS: McRT on the same 8P SMP (all processors running Win 2003)
      • McRT-BareM (BareMetal): McRT on the same 8P SMP (1P running Win 2003, 7P sequestered)
      • A K-processor McRT-BareMetal configuration has K-1 sequestered processors and 1 Win 2003 processor
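
Written as a formula, the baseline stated on this slide reads as follows. The symbols are introduced here for illustration and do not appear in the slides: T_c(K) is assumed to denote the execution time of configuration c on K processors.

```latex
S_{c}(K) = \frac{T_{\text{Native}}(1)}{T_{c}(K)},
\qquad c \in \{\text{Native},\ \text{McRT-OS},\ \text{McRT-BareM}\}
```

For McRT-BareM, the K processors counted in T_c(K) include the one running Win 2003, so only K-1 of them are sequestered.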

  34. Conclusions
      • Provide a scalable many-core software environment
        • Support multiple parallel programming models
        • Abstract away the execution platform
        • Good performance on SMP, sequestered system and simulation
      • Enhance many-core reliability and programmability
        • Transactional memory
        • Software virtualized transactional memory
        • Transactional data structures and algorithms
        • Speculative and implicit parallelism

  35. Collaborators
      • Platform Architecture Research (PAR/MTL): McPLS simulator
      • Architecture Research Lab (ARL/MTL): RMS workloads
      • PDSD (SSG): OpenMP library
      • Doug Carmean, Eric Sprangle, Anwar Rohillah: TA simulator
      • Streaming Media Lab (SMAL/MTL): Sequestered core system
      • Network Architecture Lab (NAL/CTL): Packet processing applications

  36. Backup

  37. Nehalem Bonnell Comparison
      • Nehalem simulated with Skeleton
      • Bonnell simulated with TA
      • Instruction counts & execution phases line up nicely
