McRT: Many-Core Runtime

Ali Adl-Tabatabai, Anwar Ghuloum, Dong Yuan Chen, Rick Hudson, Vijay Menon, Brian Murphy, Tatiana Shpeisman, Bratin Saha
Programming Systems Lab, MTL/CTG

What is McRT
• Scalable many-core runtime
• Supports multiple programming models (pthread, OpenMP, …)
• Supports multiple platforms
  • Simulator, SMP and sequestered systems

McRT: Architecture
[Diagram: layered architecture, top to bottom]
• Applications & Libraries: Media Workloads, Parallel Primitives Library, RMS Workloads, Network Processing Workloads, …
• Adapters for Programming Models: Java Virtual Machine, CILK, Brook, Pthread, OpenMP, …
• Scalable Core Services: Profiling, Thread Scheduler, Thread Synchronization, Memory Management, …
• Multiple Execution Platforms: CPU Simulator (Skeleton), Sequestered Core System, Windows/Linux IA-32 SMP, SMAC Simulator (TA), Many Core Cache Simulator (McPLS)

McRT Scheduler Details
[Diagram, animated over several slides: three cores, each with 2 HTs]
• Distributed run queues to reduce contention
• Program “main” goes into a queue
• Program “main” gets picked by a processor
• New work gets added to run queues
• Work sharing
  • Keeps all cores busy
  • Knob controls work sharing
• Work stealing
  • Idle processors look for work in other cores
  • Knob controls degree of stealing
  • Reduces periods of idleness
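
The scheduling steps on these slides (distributed run queues, work sharing, work stealing, each governed by a knob) can be illustrated with a small single-threaded simulation. This is only a sketch under stated assumptions, not McRT's implementation: the names `Scheduler`, `share_threshold`, `steal_attempts` and `make_main` are hypothetical, and the two parameters merely stand in for the "knobs" the slides mention.

```python
from collections import deque
import random


class Scheduler:
    """Toy model: one run queue per hardware thread (all names hypothetical)."""

    def __init__(self, num_workers, share_threshold=4, steal_attempts=2, seed=0):
        self.queues = [deque() for _ in range(num_workers)]
        self.share_threshold = share_threshold  # assumed "work sharing" knob
        self.steal_attempts = steal_attempts    # assumed "work stealing" knob
        self.rng = random.Random(seed)
        self.executed = [0] * num_workers       # tasks run by each worker

    def spawn(self, worker, task):
        """Add new work; share it with the shortest queue if ours is long."""
        target = worker
        if len(self.queues[worker]) >= self.share_threshold:
            target = min(range(len(self.queues)),
                         key=lambda w: len(self.queues[w]))
        self.queues[target].append(task)

    def next_task(self, worker):
        """Take local work first; an idle worker probes other queues."""
        if self.queues[worker]:
            return self.queues[worker].pop()          # newest local task
        others = [w for w in range(len(self.queues)) if w != worker]
        probes = min(self.steal_attempts, len(others))
        for victim in self.rng.sample(others, probes):
            if self.queues[victim]:
                return self.queues[victim].popleft()  # oldest task of the victim
        return None

    def run(self):
        """Visit workers round-robin until every queue is empty."""
        while any(self.queues):
            for worker in range(len(self.queues)):
                task = self.next_task(worker)
                if task is not None:
                    self.executed[worker] += 1
                    task(self, worker)
        return self.executed


def make_main(num_children):
    """Program "main": spawns leaf tasks from whichever worker runs it."""
    def main(sched, worker):
        for _ in range(num_children):
            sched.spawn(worker, lambda s, w: None)    # leaf tasks do nothing
    return main


sched = Scheduler(num_workers=6, share_threshold=10**9, steal_attempts=5)
sched.spawn(0, make_main(60))   # "main" goes into a queue
counts = sched.run()            # per-worker task counts
```

With sharing effectively disabled and stealing disabled (`steal_attempts=0`), all work stays on the worker that ran "main"; enabling either knob spreads the tasks across workers.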

McRT On Sequestered Cores
[Diagram: two partitions side by side]
• Windows host partition, on the main core(s): Threaded Application, McRT, Windows thread partition, Windows + Driver
• Sequestered cores partition, on the sequestered core(s): Threaded Application, McRT, Light Weight Executive
• McRT provides scheduling, synchronization, memory management, …
• IPI / memory-mapped PCI register based signaling between the partitions

McRT-Sequestered Overview
• OS services (e.g. I/O) available only on the main cores
• Sequestered cores used as compute device
  • Graphics, games, network processing, etc.
• McRT manages threads on sequestered cores
  • LWE (Light Weight Executive) provides boot services & exception handling
  • McRT partitions HW threads & allows migration between partitions
• Threads migrate from sequestered to main core for OS services
  • Thread migration is transparent to the programmer
(sequestered = set apart, isolated)

McRT-Sequestered Model
[Diagram, animated over several slides: a Windows core and two groups of sequestered cores]
• McRT divides the processors into separate partitions
  • Program “main” added to sequestered queue
  • Program “main” picked by sequestered processor
• Every partition is a separate entity
  • New work added to sequestered queues
  • Work sharing & stealing only within a partition
• A task can ask McRT to change partitions
  • e.g., migrate to OS partition, execute OS call & migrate back
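
The partition model above (separate queues per partition, with a task explicitly asking to change partitions for an OS call) can be sketched as follows. This is a toy illustration, not McRT's API: `PartitionedRuntime`, the partition names `OS` and `SEQ`, and the use of Python generators to mark migration points are all assumptions made for the example.

```python
from collections import deque

OS, SEQ = "os", "sequestered"   # hypothetical partition names


class PartitionedRuntime:
    """Toy model: each partition owns its run queue; tasks are generators."""

    def __init__(self):
        self.queues = {OS: deque(), SEQ: deque()}
        self.log = []  # (task name, partition) for every step executed

    def spawn(self, name, task, partition=SEQ):
        self.queues[partition].append((name, task))

    def run(self):
        while any(self.queues.values()):
            for partition, queue in self.queues.items():
                if not queue:
                    continue                 # no cross-partition stealing
                name, task = queue.popleft()
                self.log.append((name, partition))
                try:
                    target = next(task)      # run until next migration request
                except StopIteration:
                    continue                 # task finished
                self.queues[target].append((name, task))


def worker(results):
    results.append("compute")       # runs on a sequestered core
    yield OS                        # ask the runtime to migrate to the OS partition
    results.append("os call")       # e.g. I/O, available only on the main core
    yield SEQ                       # migrate back
    results.append("compute more")


results = []
runtime = PartitionedRuntime()
runtime.spawn("main", worker(results))   # "main" added to the sequestered queue
runtime.run()
```

After the run, `runtime.log` records the task executing in the sequestered partition, then the OS partition, then the sequestered partition again.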

Backup

McRT: Research Agenda
• Common scalable many-core runtime
• Support multiple programming models
• Scalable runtime across multiple platforms
  • Simulator, SMP and sequestered systems
• Reliability and programmability features
• Threading platform for domain-specific & general-purpose languages
• Runtime support for message passing systems
• McRT: A scalable and reliable software environment for the many-core platform

Outline
• McRT overview
• McRT many-core simulation
  • Results and key runtime scalability features
• McRT on SMP systems
• McRT on sequestered core system
• Conclusions

McRT Scalability: MPEG4
[Chart: OMP-Xvid speedup on McRT-TA]
• Nearly linear scaling till 64 HW threads on XviD MPEG4 encoder

McRT Scalability: RMS Kernels
• All speedups are relative to execution time on a single core (4 threads)
• Good scalability till 64 HW threads

McRT: Key Scalability Features
• User-level synchronization primitives
  • Multiple locking algorithms & barrier implementations
  • User-level monitor & mwait for efficient HW spin waiting
• User-level thread scheduler
  • Supports 128+ HW threads
  • Continuation-based threading / task-based model
  • Distributed work queues with support for work stealing and sharing
  • Supports partitioning (used in sequestered platform)
• User-level memory manager
  • Size-segregated thread-local allocation pools
  • Completely non-blocking implementation
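
The idea of size-segregated, thread-local allocation pools can be illustrated with a short sketch. It is an assumption-laden toy, not McRT's memory manager: `SIZE_CLASSES`, `size_class` and `ThreadLocalPools` are hypothetical names, the size classes are invented, and the non-blocking aspect is not modelled. The sketch only shows that requests are rounded up to a size class and served from free lists private to the calling thread, so the fast path touches no shared state.

```python
import threading

SIZE_CLASSES = (16, 32, 64, 128, 256)   # assumed size classes, in bytes


def size_class(nbytes):
    """Return the smallest size class that fits the request."""
    for cls in SIZE_CLASSES:
        if nbytes <= cls:
            return cls
    raise ValueError("large objects are outside this sketch")


class ThreadLocalPools:
    """Per-thread free lists, one list per size class (hypothetical design)."""

    def __init__(self):
        self._local = threading.local()

    def _pools(self):
        # Each thread lazily creates its own free lists.
        if not hasattr(self._local, "free"):
            self._local.free = {cls: [] for cls in SIZE_CLASSES}
        return self._local.free

    def alloc(self, nbytes):
        cls = size_class(nbytes)
        free_list = self._pools()[cls]
        # Reuse a freed block of this class if one exists, else make a new one.
        return free_list.pop() if free_list else bytearray(cls)

    def free(self, block):
        # The block's length identifies its size class.
        self._pools()[len(block)].append(block)


pools = ThreadLocalPools()
a = pools.alloc(20)     # rounded up to the 32-byte class
pools.free(a)
b = pools.alloc(30)     # same class, same thread: the freed block is reused
```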

McRT Core Services: Scalability Improvements
• Single queue gives best load balancing but suffers from contention
• Queued locks deal better with contention at large # of HW threads
• Distributed queues eliminate contention but don’t balance load
• Stealing gives best of all worlds: load balancing + no contention

Need For Custom Scheduling
[Chart: instructions executed by different worker threads (32 HW thread config), XviD vs. Equake; y-axis: instructions, 0.0E+00 to 2.5E+07]
• Equake tasks have good load balance → stealing adds overhead
• XviD has load imbalance among tasks → stealing helps

Outline
• McRT overview
• McRT many-core simulation
• McRT on SMP systems
  • Key challenges and results
• McRT on sequestered core system
• Conclusions

McRT On SMP Systems
• Key challenge:
  • Efficient coupling between user-level runtime & OS
• Key McRT features:
  • Novel synchronization library
    • Queue-based synchronization supporting cancellation and timeout
    • User-level spin waiting + scheduler-level blocking
    • Linux & Windows kernel-level blocking for efficient 1:1 scheduling
    • Predicated continuations for efficient M:N scheduling
  • Non-blocking data structures
    • Provides preemption safety and greater resilience to thread delays
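
To make "queue based synchronization supporting cancellation and timeout" concrete, here is a minimal sketch of a FIFO lock whose waiters can abandon the queue when a timeout expires. It is not McRT's synchronization library: `QueueLock` is a hypothetical name, and the sketch uses a small internal `threading.Lock` and `threading.Event` objects instead of user-level spinning or non-blocking queues.

```python
import threading
from collections import deque


class QueueLock:
    """FIFO lock whose waiters can give up after a timeout (illustrative)."""

    def __init__(self):
        self._guard = threading.Lock()   # protects the two fields below
        self._held = False
        self._waiters = deque()          # one Event per waiting thread

    def acquire(self, timeout=None):
        with self._guard:
            if not self._held:
                self._held = True
                return True
            ticket = threading.Event()
            self._waiters.append(ticket)     # join the queue
        if ticket.wait(timeout):
            return True                      # release() handed the lock to us
        with self._guard:
            if ticket.is_set():
                return True                  # handed over just as we timed out
            self._waiters.remove(ticket)     # cancel: leave the queue
            return False

    def release(self):
        with self._guard:
            if self._waiters:
                # Hand the lock directly to the oldest waiter; it stays held.
                self._waiters.popleft().set()
            else:
                self._held = False


lock = QueueLock()
first = lock.acquire()                 # uncontended acquire
second = lock.acquire(timeout=0.05)    # lock is held, so this waiter times out
lock.release()
```

Handing the lock directly to the oldest waiter keeps acquisition order FIFO, and the re-check of the ticket under the guard covers the race between a timeout and a concurrent release.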

McRT On SMP: Results
• Application uses standard OpenMP
• McRT and the native (OpenMP) runtime running on the same 16-way IBM SMP Linux system
• Both McRT & native speedups are relative to the execution time for 1P on the native (OpenMP) runtime

SEMPHY Speedup: Details
• McRT scheduler can provide the advantages of a task queue → better programmability

Outline
• McRT overview
• McRT many-core simulation
• McRT on SMP systems
• McRT on sequestered core system
  • Architecture, challenges, and results
• Conclusions

Sequestered core Stuff • See main part of presentation

McRT-Sequestered Results
• All speedups are relative to the execution time for 1P on the native (OpenMP) runtime
• Native: OpenMP on 8P SMP (all processors running Win 2003)
• McRT-OS: McRT on the same 8P SMP (all processors running Win 2003)
• McRT-BareM: McRT on the same 8P SMP (1P running Win 2003, 7P sequestered)
• K-processor McRT-BareMetal mode has K-1 sequestered and 1 Win 2003 processor
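
The baseline convention stated on this slide can be written out explicitly. The symbols are notation introduced here for illustration only, not taken from the slides: $T_c(K)$ denotes the execution time of configuration $c$ on $K$ processors, and $S_c(K)$ the reported speedup.

```latex
S_c(K) = \frac{T_{\text{native}}(1)}{T_c(K)},
\qquad c \in \{\text{Native},\ \text{McRT-OS},\ \text{McRT-BareM}\}
```

Because every configuration is normalized to the same native 1P run, the curves are directly comparable. For McRT-BareM, $K$ counts the 1 Win 2003 processor plus the $K-1$ sequestered processors.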

Conclusions
• Provide a scalable many-core software environment
  • Support multiple parallel programming models
  • Abstract away the execution platform
  • Good performance on SMP, sequestered system and simulation
• Enhance many-core reliability and programmability
  • Transactional memory
  • Software virtualized transactional memory
  • Transactional data structures and algorithms
  • Speculative and implicit parallelism

Collaborators
• Platform Architecture Research (PAR/MTL): McPLS simulator
• Architecture Research Lab (ARL/MTL): RMS workloads
• PDSD (SSG): OpenMP library
• Doug Carmean, Eric Sprangle, Anwar Rohillah: TA simulator
• Streaming Media Lab (SMAL/MTL): Sequestered core system
• Network Architecture Lab (NAL/CTL): Packet processing applications

Backup

Nehalem Bonnell Comparison
• Nehalem simulated with Skeleton
• Bonnell simulated with TA
• Instruction counts & execution phases line up nicely
