Multi-Level Adaptive Loop Scheduler for Efficient Power5 Architecture Performance

A Multi-Level Adaptive Loop Scheduler for Power 5 Architecture
Yun Zhang, Michael J. Voss (University of Toronto)
Guansong Zhang, Raul Silvera (IBM Toronto Lab)
6-Nov-14

Agenda • Background • Motivation • Previous Work • Adaptive Schedulers • IBM Power 5 Architecture • A Multi-Level Hierarchical Scheduler • Evaluation • Future Work

Simultaneous Multi-Threading • Architecture • Several threads per physical processor • Threads share • Caches • Registers • Functional Units

Power 5 SMT

OpenMP • OpenMP • A standard API for shared memory programming • Add directives for parallel regions • Standard Loop Schedulers • Static • Dynamic • Guided • Runtime
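To make the difference between the standard schedulers more concrete, here is a minimal sketch of how a guided-style scheduler might size its chunks. It assumes the common rule "next chunk = remaining iterations divided by the number of threads, rounded up"; actual OpenMP runtimes are free to use a different formula, and the function names are hypothetical.

```c
#include <assert.h>

/* Size of the next chunk handed out by a guided-style scheduler.
 * Assumption: chunk = ceil(remaining / nthreads), but never smaller than
 * min_chunk unless fewer iterations remain. Real runtimes may differ. */
long guided_next_chunk(long remaining, int nthreads, long min_chunk)
{
    assert(remaining >= 0 && nthreads > 0 && min_chunk > 0);
    if (remaining == 0)
        return 0;
    long chunk = (remaining + nthreads - 1) / nthreads;
    if (chunk < min_chunk)
        chunk = min_chunk;
    if (chunk > remaining)
        chunk = remaining;
    return chunk;
}

/* Number of chunks a guided-style scheduler hands out for n iterations. */
int guided_chunk_count(long n, int nthreads, long min_chunk)
{
    int count = 0;
    while (n > 0) {
        n -= guided_next_chunk(n, nthreads, min_chunk);
        count++;
    }
    return count;
}
```

Under this rule the chunks shrink as the loop drains, so the scheduler makes far fewer scheduling decisions than a dynamic scheduler with chunk size 1 while still leaving small chunks at the end to even out the load.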

An example of a parallel loop in C code (similar in Fortran):

    #pragma omp parallel for shared(a, b) private(i, j) schedule(runtime)
    for (i = 0; i < 100; i++) {
        for (j = 0; j < 100; j++) {
            a[i][j] = a[i][j] + b[i][j];
        }
    }

[Figure: the OpenMP API distributes the i-j iteration space among threads T0 ... Tn.]

Motivation • OpenMP Applications • Designed for SMP systems • Not aware of HT technology • Understanding and controlling the performance of OpenMP applications on SMT processors is not trivial • Important performance issues on SMP systems with SMT nodes • Inter-thread data locality • Instruction mix • SMT-related load balance

Scaling (SPEC & NAS)
[Figure: scaling on 4 Intel Xeon processors with Hyperthreading; series: 1 thread per processor, 1-2 threads per processor.]

Why do they scale poorly? • Inter-thread data locality • cache misses • Instruction Mix • functional units sharing • benefit gained this way may outweigh cache misses • SMT-related Load Balance We should balance work loads well among: • processors • threads running on the same physical processor.

Previous Work: Runtime Adaptive Scheduler • Hierarchical Scheduling • Upper level scheduler • Lower level scheduler • Select scheduler and the number of threads to run at runtime • One thread per physical processor • Two threads per physical processor

Two-Level Hierarchical Scheduler

Traditional Scheduling
[Figure: static scheduling and dynamic scheduling each divide the i-j iteration space directly among threads T0 ... Tn.]

Hierarchical Scheduling
[Figure: dynamic scheduling divides the iteration space among processors P0 ... Pi; static scheduling then divides each processor's share between its threads (T00, T01, ..., Ti0, Ti1).]

Why can we benefit from runtime scheduler selection?
• Many parallel loops in OpenMP applications are executed again and again.
• Example:

    for (k = 1; k < 100; k++) {
        ………….
        calculate();
        ………….
    }

    void calculate() {
        #pragma omp parallel for schedule(runtime)
        for (i = 1; i < 100; i++) {
            ……………;  // calculation
        }
    }

Adaptive Schedulers • Region Based Scheduler • Select loop schedulers at runtime • Parallel loops in one parallel region have to use the same scheduler which may not be the best • Loop Based Scheduler • Higher runtime overhead • More accurate loop scheduler for each parallel loop

Sample from NAS2004

    !$omp parallel default(shared) private(i,j,k)
    !$omp do schedule(runtime)
          do j=1,lastrow-firstrow+1
             do k=rowstr(j),rowstr(j+1)-1
                colidx(k) = colidx(k) - firstcol + 1
             enddo
          enddo
    !$omp end do nowait
    !$omp do schedule(runtime)
          do i = 1, na+1
             x(i) = 1.0D0
          enddo
    !$omp end do nowait
    !$omp do schedule(runtime)
          do j=1, lastcol-firstcol+1
             q(j) = 0.0d0
             z(j) = 0.0d0
             r(j) = 0.0d0
             p(j) = 0.0d0
          enddo
    !$omp end do nowait
    !$omp end parallel

• Region based scheduler: picks one scheduler that applies to all three loops
• Loop based scheduler: picks a scheduler for each of the three loops separately

Runtime Loop Scheduler Selection • Phase 1: try upper level scheduler, run with 4 threads • Static Scheduler
[Figure: machine M1 with processors P0-P3, one thread per processor (T0-T3).]

Runtime Loop Scheduler Selection • Phase 1: try upper level scheduler, run with 4 threads • Dynamic Scheduler
[Figure: machine M1 with processors P0-P3, one thread per processor (T0-T3).]

Runtime Loop Scheduler Selection • Phase 1: try upper level scheduler, run with 4 threads • Affinity Scheduler
[Figure: machine M1 with processors P0-P3, one thread per processor (T0-T3).]

Runtime Loop Scheduler Selection • Phase 1: made a decision on upper level scheduler (Affinity Scheduler); try lower level scheduler (Static), run with 8 threads
[Figure: machine M1 with threads T0-T7, two threads per physical processor.]
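The trial-and-pick procedure on these slides can be sketched as a small state machine attached to each repeatedly executed loop (or region). This is only an illustration under stated assumptions: the type and function names are hypothetical, the candidate list is fixed at three schedulers, and the caller is assumed to measure the elapsed time of each execution itself.

```c
#include <assert.h>

#define NUM_CANDIDATES 3  /* hypothetical: 0 = static, 1 = dynamic, 2 = affinity */

typedef struct {
    double elapsed[NUM_CANDIDATES]; /* measured time of each trial run */
    int    trials_done;             /* how many candidates have been timed */
    int    best;                    /* index of the winner, -1 while still trying */
} selector_t;

void selector_init(selector_t *s)
{
    s->trials_done = 0;
    s->best = -1;
}

/* Which candidate should the next execution of the loop use? */
int selector_choose(const selector_t *s)
{
    return (s->best >= 0) ? s->best : s->trials_done;
}

/* Report the measured time of the execution that just finished. */
void selector_record(selector_t *s, double seconds)
{
    assert(seconds >= 0.0);
    if (s->best >= 0)
        return;                     /* decision already made; nothing to learn */
    s->elapsed[s->trials_done++] = seconds;
    if (s->trials_done == NUM_CANDIDATES) {
        /* Every candidate has been tried once: keep the fastest. */
        s->best = 0;
        for (int c = 1; c < NUM_CANDIDATES; c++)
            if (s->elapsed[c] < s->elapsed[s->best])
                s->best = c;
    }
}
```

A second selector of the same shape could then be run over the lower-level candidates once the upper-level choice is fixed. The trial executions are the runtime overhead that the hardware counter scheduler tries to avoid.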

Sample from NAS2004

    !$omp parallel default(shared) private(i,j,k)
    !$omp do schedule(runtime)
          do j=1,lastrow-firstrow+1
             do k=rowstr(j),rowstr(j+1)-1
                colidx(k) = colidx(k) - firstcol + 1
             enddo
          enddo
    !$omp end do nowait
    !$omp do schedule(runtime)
          do i = 1, na+1
             x(i) = 1.0D0
          enddo
    !$omp end do nowait
    !$omp do schedule(runtime)
          do j=1, lastcol-firstcol+1
             q(j) = 0.0d0
             z(j) = 0.0d0
             r(j) = 0.0d0
             p(j) = 0.0d0
          enddo
    !$omp end do nowait
    !$omp end parallel

Schedulers selected for the three loops: Static-Static, 8 threads • TSS, 4 threads • TSS, 4 threads

Hardware Counter Scheduler • Motivation • The region based scheduler (RBS) and loop based scheduler (LBS) have runtime overhead; they would work even better if that overhead were reduced as much as possible • Algorithm • Try different schedulers on parallel loops in a subset of the benchmarks using training data • Use each loop's characteristics (cache misses, number of floating point operations, number of micro-ops, load imbalance) and the best scheduler for that loop as input • Feed the above data to classification software (we use C4.5) to build a decision tree • Apply this decision tree to a loop at runtime: feed the hardware counter data collected at runtime as input, and get the scheduler as output.

4 Intel Xeon Processors with Hyperthreading

IBM Power 5 • Technology: 130nm • Dual processor core • 8-way superscalar • Simultaneous Multi-Threaded (SMT) core • Up to 2 virtual processors • 24% area growth per core for SMT • Natural extension to Power 4 design

Single Thread • A single thread has an advantage when executing execution-unit-limited applications • Floating or fixed point intensive workloads • Extra resources necessary for SMT provide a higher performance benefit when dedicated to a single thread • Data locality on one SMT core is better with a single thread for some applications

Power 5 Multi-Chip Module (MCM) • Or Multi-Chipped Monster • 4 processor chips • 2 processors per chip • 4 L3 cache chips

Power5 64-way Plane Topology • Each MCM has 4 inter-connected processor chips • Each processor chip has two processors on chip • Each processor has SMT technology therefore two threads can be executed on it simultaneously

Multi-Level Scheduler
• Loop iterations → 1st level scheduler → iterations for Module 1 ... Module i ... Module n
• Iterations for each module → 2nd level scheduler → iterations for Processor 1 ... Processor m
• Iterations for each processor → 3rd level scheduler → iterations for Thread 1 ... Thread k

OpenMP Implementation • Outlining Technique • New subroutines are created with the body of each parallel construct • Runtime routines receive the address of the outlined procedure as a parameter

Source Code:

    #pragma omp parallel for shared(a,b) private(i)
    for (i = 0; i < 100; i++) {
        a[i] = a[i] + b[i];
    }

Main program after outlining:

    long main() {
        _xlsmpParallelDoSetup_TPO(…)
    }

Runtime Library:
1. Initialize work items and work shares
2. Call _xlsmp_DynamicChunkCall(…)

    while (iterations are left; get some iterations for this thread) {
        …………
        call main@OL@1(...);
        ………….
    }

Outlined Functions:

    void main@OL@1( … ) {
        do {
            loop body;
        } while (loop end condition not met);
        return;
    }

Source Code:

    #pragma omp parallel for shared(a,b) private(i)
    for (i = 0; i < 100; i++) {
        a[i] = a[i] + b[i];
    }

Main program after outlining:

    long main() {
        _xlsmpParallelDoSetup_TPO(…)
    }

Runtime Library:
1. Initialize work items and work shares
2. Call _xlsmp_DynamicChunkCall(…)

    while (hier_sched(…)) {
        …………
        call main@OL@1(...);
        ………….
    }

Outlined Functions:

    void main@OL@1( … ) {
        do {
            loop body;
        } while (loop end condition not met);
        return;
    }
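As a rough illustration of the outlining technique, the sketch below shows a loop body moved into its own function and a stand-in runtime routine that receives the function's address and calls it once per chunk. All names here (`outlined_fn`, `run_chunks`, `vec_args`, `loop_outlined`) are hypothetical and are not the XL runtime's interfaces; the stand-in also runs sequentially, whereas a real runtime would hand chunks to several threads.

```c
#include <assert.h>

/* Signature of an outlined loop body: runs iterations [lo, hi). */
typedef void (*outlined_fn)(long lo, long hi, void *shared);

/* Hypothetical stand-in for the runtime entry point: hands out fixed-size
 * chunks and calls the outlined procedure on each one. */
void run_chunks(outlined_fn body, long n, long chunk, void *shared)
{
    assert(body != 0 && chunk > 0);
    for (long lo = 0; lo < n; lo += chunk) {
        long hi = (lo + chunk < n) ? lo + chunk : n;
        body(lo, hi, shared);
    }
}

/* Shared variables of the parallel loop, passed by address. */
typedef struct {
    double       *a;
    const double *b;
} vec_args;

/* What a compiler might generate from:
 *     for (i = 0; i < n; i++) a[i] = a[i] + b[i];            */
void loop_outlined(long lo, long hi, void *shared)
{
    vec_args *v = (vec_args *)shared;
    for (long i = lo; i < hi; i++)
        v->a[i] = v->a[i] + v->b[i];
}
```

Because the runtime only sees a function pointer and an iteration range, swapping the chunking policy (for example, replacing the dynamic chunk call with a hierarchical scheduler) does not require changing the outlined body.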

[Figure: scheduler tree. Root → modules M0, M1 → processors P0, P1 → threads T0-T7; guided at the upper level, static cyclic at the lower level.]
To obtain iterations, a thread:
• Looks up its parent's iteration list to see whether any iterations are available; if so, it gets some iterations from the 2nd level scheduler and returns
• Otherwise, looks one level up, grabs the lock for its group, and seeks more iterations from the upper level using the upper level loop scheduler (a recursive function call) until it gets some iterations or the whole loop ends
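The recursive lookup described above can be sketched as follows. This is a sequential simulation under simplifying assumptions, not the authors' implementation: locking is reduced to a comment, every level uses the same guided-style split (the slides use static cyclic at the 2nd level), and `node_t` and `take_chunk` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
    struct node *parent;    /* NULL for the root */
    long         lo, hi;    /* iterations [lo, hi) currently held by this node */
    int          nchildren; /* children that share this node's iterations */
} node_t;

/* Take a chunk from node n on behalf of one of its children.
 * Returns the chunk length (0 when the whole loop is finished) and
 * stores the first iteration of the chunk in *start. */
long take_chunk(node_t *n, long *start)
{
    assert(n != NULL && start != NULL && n->nchildren > 0);
    if (n->lo == n->hi) {               /* local iteration list is empty */
        if (n->parent == NULL)
            return 0;                   /* root is empty: loop is done */
        /* A threaded runtime would hold this group's lock here. */
        long from;
        long got = take_chunk(n->parent, &from);  /* ask one level up */
        if (got == 0)
            return 0;
        n->lo = from;
        n->hi = from + got;
    }
    /* Guided-style share of what this node currently holds. */
    long left  = n->hi - n->lo;
    long chunk = (left + n->nchildren - 1) / n->nchildren;
    *start = n->lo;
    n->lo += chunk;
    return chunk;
}
```

A thread would call `take_chunk` on its parent group; most calls are served from the local list, and only an empty list triggers the more expensive trip to the level above.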

Hierarchical Scheduler • Guided as the 1st level scheduler • Balance work loads among processors • Reduce runtime overhead • Static Cyclic as the 2nd level scheduler • Improve cache locality • Reduce runtime overhead
[Figure: iteration space divided between threads T0 and T1 using standard static scheduling versus static cyclic scheduling.]
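The two ways of dividing the iteration space can be written as simple ownership functions. This is a sketch with hypothetical function names; it assumes that standard static scheduling hands each thread one contiguous block of ceil(n / nthreads) iterations, which is a common choice but not the only one a runtime may make.

```c
#include <assert.h>

/* Standard static scheduling (assumed form): thread t owns one contiguous
 * block of ceil(n / nthreads) iterations. */
int owner_static(long i, long n, int nthreads)
{
    assert(nthreads > 0 && i >= 0 && i < n);
    long block = (n + nthreads - 1) / nthreads;
    return (int)(i / block);
}

/* Static cyclic scheduling: iterations are dealt out round-robin,
 * so neighbouring iterations go to different threads. */
int owner_cyclic(long i, int nthreads)
{
    assert(nthreads > 0 && i >= 0);
    return (int)(i % nthreads);
}
```

With two threads on one SMT processor, the cyclic form has both threads working on adjacent iterations at the same time, which is the cache locality benefit the slide refers to when neighbouring iterations touch neighbouring data.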

Evaluation • IBM Power 5 System • 4 Power 5 1904 MHz SMT processors • 31872 MB memory • Operating System • AIX 5.3 • Compiler • IBM XL C/C++, XL Fortran compiler • Benchmark • SPEC OMP2001

Scalability of IBM Power 5 SMT Processors 1 through 8 threads

Evaluation on Power 5: Execution Time Normalized to Default (Static) Scheduler

Conclusion • Standard schedulers are not aware of SMT technology • Adaptive hierarchical schedulers take SMT-specific characteristics into account, which could make the OpenMP API (software) and SMT technology (hardware) work better together • OpenMP parallel applications running on the Power 5 architecture with SMT have the same problem • The multi-level hierarchical scheduler designed for IBM Power 5 achieves an average improvement of 3% over the default loop scheduler on SPEC OMP2001 • Large improvements of 7% and 11% on some benchmarks • Improves on average over all other standard OpenMP loop schedulers by at least 2%

Future Work • Evaluate multi-level hierarchical scheduler on a larger system with 32 SMT processors (with MCM) • Explore performance on auto-parallelized benchmarks (SPEC CPU FP) • Examine mechanisms for determining best scheduler configuration at compile-time • Explore the use of helper threads on Power 5 • Cache prefetching

Thank You~

(A cache miss comparison chart will be shown here) • If a general way to calculate the overall L2 load/store misses is found, show that comparison. • If not, show the overhead of this optimization from the tprof data.

Schedulers’ Speedup on 4 threads

Schedulers’ Speedup on 8 threads

Decision Tree
• Only one decision tree is built offline, before executing the program
• Apply that decision tree to loops at runtime without changing the tree
• Make a decision on which scheduler to use with only one run of each loop, which greatly reduces runtime scheduling overhead

uops <= 3.62885e+08 :
|   cachemiss <= 111979 :
|   |   uops > 748339 : static-4
|   |   uops <= 748339 :
|   |   |   l/s <= 167693 : static-4
|   |   |   l/s > 167693 : static-static
|   cachemiss > 111979 :
|   |   floatpoint <= 1.52397e+07 :
|   |   |   cachemiss <= 384690 :
|   |   |   |   uops <= 2.06431e+07 : static-static
|   |   |   |   uops > 2.06431e+07 :
|   |   |   |   |   imbalance <= 1330 : afs-static
|   |   |   |   |   imbalance > 1330 :
|   |   |   |   |   |   cachemiss <= 301582 : afs-4
|   |   |   |   |   |   cachemiss > 301582 : guided-static
…………………………….
uops > 3.62885e+08 :
|   l/s > 7.22489e+08 : static-4
|   l/s <= 7.22489e+08 :
|   |   imbalance <= 32236 : static-4
|   |   imbalance > 32236 :
|   |   |   floatpoint <= 5.34465e+07 : static-4
|   |   |   floatpoint > 5.34465e+07 :
|   |   |   |   floatpoint <= 1.20539e+08 : tss-4
|   |   |   |   floatpoint > 1.20539e+08 :
|   |   |   |   |   floatpoint <= 1.45588e+08 : static-4
|   |   |   |   |   floatpoint > 1.45588e+08 : tss-4
END hardware-counter scheduling
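Applying the tree at runtime amounts to a chain of comparisons on the collected counters. The sketch below encodes only the `uops > 3.62885e+08` subtree, since the other subtree is partly elided on the slide. The struct, enum, and function names are hypothetical, and reading `l/s` as a load/store count is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hardware counter values collected for one run of a loop. */
typedef struct {
    double uops;         /* micro-ops */
    double loads_stores; /* "l/s" on the slide; assumed to be loads/stores */
    double imbalance;    /* load imbalance */
    double floatpoint;   /* floating point operations */
} counters_t;

typedef enum {
    SCHED_UNKNOWN,       /* falls into the subtree not encoded here */
    SCHED_STATIC_4,      /* "static-4" */
    SCHED_TSS_4          /* "tss-4" */
} sched_t;

/* Walk the "uops > 3.62885e+08" subtree from the slide. */
sched_t pick_scheduler(const counters_t *c)
{
    assert(c != NULL);
    if (c->uops <= 3.62885e+08)
        return SCHED_UNKNOWN;   /* other subtree is partly elided in the source */
    if (c->loads_stores > 7.22489e+08)
        return SCHED_STATIC_4;
    if (c->imbalance <= 32236)
        return SCHED_STATIC_4;
    if (c->floatpoint <= 5.34465e+07)
        return SCHED_STATIC_4;
    if (c->floatpoint <= 1.20539e+08)
        return SCHED_TSS_4;
    if (c->floatpoint <= 1.45588e+08)
        return SCHED_STATIC_4;
    return SCHED_TSS_4;
}
```

Since the lookup is a handful of comparisons, its cost is negligible next to the trial executions that the region based and loop based schedulers need.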

(Load imbalance comparison chart will be shown here) • Generating……..
