What simplifications could a compiler, or you, make without sacrificing fast execution?

William Sandqvist william@kth.se

5-7 Code optimization
Two functions f and g

    #define MAX 10
    int a[MAX], b[MAX], c[MAX], x[MAX], y[MAX];
    int i, j, r, s;
    . . .
    int f(int a, int b)
    {
        int z;
        z = 2 * a - b;
        return z;
    }
    int g(int a, int b, int c)
    {
        int z;
        z = a * c - c * b;
        return z;
    }

What code optimization can the compiler do? -O, -O0, -O1, -O2, -O3, -Os?
With -O or -O0 you have to do all optimizations yourself.

Optimization flags
-O, -O0  No optimization
-O1      Optimize for size
-O2      Optimize for speed and enable some optimization
-O3      Enable all optimizations as O2, and intensive loop optimizations
-Os      Optimize for size
Default setting: in GCC the default is -O0. Note that GCC itself treats plain -O as equivalent to -O1.

Two for loops

    . . .
    for (i = 0; i <= MAX - 1; i++) {
        x[i] = f(a[i], b[i]);
    }
    s = 2 * r;
    for (j = 0; j <= MAX - 1; j++) {
        y[j] = s * g(a[j], b[j], c[j]);
    }

What can be done? We want shorter execution time without increasing the code size!

Loop integration
The two loops have the same range (0 to MAX - 1) and no data dependency (x is written only in loop 1, y only in loop 2). The loops can be integrated, which saves loop overhead (only i is needed as loop counter)!

    s = 2 * r;
    for (i = 0; i <= MAX - 1; i++) {
        x[i] = f(a[i], b[i]);
        y[i] = s * g(a[i], b[i], c[i]);
    }

Precalculation at compile time
The defined constant MAX is used as MAX - 1 in the loop. MAX - 1 can be precalculated as 10 - 1 = 9 at compile time!

    s = 2 * r;
    for (i = 0; i <= 9; i++) {
        x[i] = f(a[i], b[i]);
        y[i] = s * g(a[i], b[i], c[i]);
    }

Algebraic simplification
Rewriting function g can save one multiplication operation:
Before: z = a * c - c * b needs mul, mul, sub
After: z = c * (a - b) needs sub, mul

    int g(int a, int b, int c)
    {
        int z;
        z = c * (a - b);
        return z;
    }

Inlining of functions
Both functions f and g are "short" and their code could be inserted directly in the loop.

    int a[10], b[10], c[10], x[10], y[10];
    int i, r, s;
    s = 2 * r;
    for (i = 0; i <= 9; i++) {
        x[i] = 2 * a[i] - b[i];
        y[i] = s * ((a[i] - b[i]) * c[i]);
    }

Loop unrolling would give shorter execution time, but it would also increase the code size, so it can't be used in this case.
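The four steps above (loop integration, precalculation, algebraic simplification, inlining) should not change the computed values. A minimal sketch for checking that by hand, assuming the arrays are passed as parameters instead of being globals as on the slides; the names compute_original and compute_optimized are hypothetical:

```c
#define MAX 10

/* Helper functions as defined on the slides. */
static int f(int a, int b) { return 2 * a - b; }
static int g(int a, int b, int c) { return a * c - c * b; }

/* Unoptimized version: two loops and two function calls per element. */
void compute_original(const int a[MAX], const int b[MAX], const int c[MAX],
                      int r, int x[MAX], int y[MAX])
{
    int i, j, s;
    for (i = 0; i <= MAX - 1; i++) {
        x[i] = f(a[i], b[i]);
    }
    s = 2 * r;
    for (j = 0; j <= MAX - 1; j++) {
        y[j] = s * g(a[j], b[j], c[j]);
    }
}

/* Hand-optimized version: one fused loop, constant bound 9,
   g rewritten as c * (a - b), and both functions inlined. */
void compute_optimized(const int a[MAX], const int b[MAX], const int c[MAX],
                       int r, int x[MAX], int y[MAX])
{
    int i;
    int s = 2 * r;
    for (i = 0; i <= 9; i++) {
        x[i] = 2 * a[i] - b[i];
        y[i] = s * ((a[i] - b[i]) * c[i]);
    }
}
```

Calling both functions with the same inputs and comparing x and y element by element is a quick sanity check that the hand optimization preserved the results.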

5-2 Register lifetime
A processor has this instruction type: op R1, R2, R3, where all three registers must be different.
Code to run:

    u = c + d;   (1)
    v = a - b;   (2)
    w = a - u;   (3)
    x = v + e;   (4)

How many registers are needed?

Register Life Time Graph

    u = c + d;   (1)
    v = a - b;   (2)
    w = a - u;   (3)
    x = v + e;   (4)

Four registers are needed!

Data Flow Graph
A Data Flow Graph can detect data dependencies.

    u = c + d;   (1)
    v = a - b;   (2)
    w = a - u;   (3)
    x = v + e;   (4)

• (1) must be before (3), because (3) uses u
• (2) must be before (4), because (4) uses v
(2) and (3) do not depend on each other, so they can change execution order!

New Register Life Time Graph
New instruction order:

    u = c + d;   (1)
    w = a - u;   (2')
    v = a - b;   (3')
    x = v + e;   (4)

Now only 3 registers are needed, a saving of 25%.
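The register counts for the two instruction orders can be reproduced with a small lifetime calculation. This is only a sketch under an assumed model that is not stated on the slides: every variable occupies a register from the first instruction that mentions it to the last one that mentions it (operands are loaded just before first use, results are stored right after the last use). The type instr and the function peak_registers are hypothetical names.

```c
/* One three-address instruction: dest = src1 op src2.
   Variables are identified by a single lowercase letter. */
struct instr { char dest, src1, src2; };

/* The two instruction orders from the slides. */
const struct instr original_order[4] = {
    {'u', 'c', 'd'}, {'v', 'a', 'b'}, {'w', 'a', 'u'}, {'x', 'v', 'e'}
};
const struct instr new_order[4] = {
    {'u', 'c', 'd'}, {'w', 'a', 'u'}, {'v', 'a', 'b'}, {'x', 'v', 'e'}
};

static int mentions(const struct instr *in, char v)
{
    return in->dest == v || in->src1 == v || in->src2 == v;
}

/* Largest number of variables alive at the same instruction, where a
   variable is alive from its first mention to its last mention. */
int peak_registers(const struct instr prog[], int n)
{
    int peak = 0;
    for (int k = 0; k < n; k++) {
        int live = 0;
        for (char v = 'a'; v <= 'z'; v++) {
            int first = -1, last = -1;
            for (int m = 0; m < n; m++) {
                if (mentions(&prog[m], v)) {
                    if (first < 0) first = m;
                    last = m;
                }
            }
            if (first >= 0 && first <= k && k <= last) live++;
        }
        if (live > peak) peak = live;
    }
    return peak;
}
```

Under this model the original order peaks at instruction (2), where u, v, a and b are alive at once, while in the new order u is already dead when v is computed.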

5-8 CDFG
• Control and Data Flow Graph (CDFG)
• Multiplication takes 3 cycles, all other instructions take 1 cycle.

    y = 0;
    if (mode == 1) {
        for (i = 0; i < 5; i++) {
            y += a[i] * b[i];
        }
    }

Best/Worst execution time?
mode = 0: TBest = 1 + 1 = 2
mode = 1: TWorst = 1 + 1 + 1 + (5 + 1) + 5*4 + 5 = 34
Loop body y += a[i] * b[i]: T = 3 + 1 = 4

Multiply-Accumulate operation
c) MAC unit! R1 = R1 + R2 * R3 in one cycle!

    y += a[i] * b[i];   /* one cycle */

Loop body: T = 1
TWorst = 1 + 1 + 1 + (5 + 1) + 5*1 + 5 = 19
19/34 = 0.56. With a MAC unit the execution time is 56% of that of the ordinary processor.
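The cycle counts can be written out term by term. A minimal sketch, assuming the cost breakdown implied by the sums on the slides (one cycle each for y = 0, the mode test, i = 0, every loop test and every i++); the function names are hypothetical:

```c
/* Best case (mode != 1): only y = 0 and the mode test execute. */
int best_case_cycles(void)
{
    return 1 + 1;
}

/* Worst case (mode == 1) for n loop iterations, given the cost in
   cycles of the loop body y += a[i] * b[i]. */
int worst_case_cycles(int n, int body_cycles)
{
    int init_y     = 1;      /* y = 0                               */
    int test_mode  = 1;      /* if (mode == 1)                      */
    int init_i     = 1;      /* i = 0                               */
    int loop_tests = n + 1;  /* i < n: n times true, once false     */
    int increments = n;      /* i++                                 */
    return init_y + test_mode + init_i + loop_tests
         + n * body_cycles + increments;
}
```

With n = 5, a body cost of 4 cycles (3 for the multiplication plus 1 for the addition) gives 34, and a body cost of 1 cycle for the MAC unit gives 19.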

Processes on a CPU

Scheduling states of a process

Priority Driven Scheduling
• Each process has a fixed priority
• The ready process with the highest priority executes
• A process executes until completion or preemption by a higher-priority process

Examples of sampling frequencies and execution periods
Sampling frequencies (RTOS):
Actuator servo 2000 Hz
Speed sensor 1 kHz
Joystick 500 Hz
GPS sensor 20 Hz
Process periods:
GPS = 1/20 = 50 ms
Speed = 1/1000 = 1 ms
Joystick = 1/500 = 2 ms
Servo = 1/2000 = 0.5 ms
Tasks will often run periodically with different process periods.

Task Triplet
P(max execution time, period, deadline)
deadline <= period
RMS: deadline = period (simplification)

6-2 Processor utilization and feasible scheduling
Task Triplet: P(execution time, period, deadline), deadline = period
P1(3, 9, 9) P2(1, 2, 2) P3(1, 6, 6)
Timeline = least common multiple of the process periods 9, 2, 6:
$9 = 3 \cdot 3$, $2 = 2$, $6 = 2 \cdot 3$, so $\mathrm{lcm}(9, 2, 6) = 3 \cdot 3 \cdot 2 = 18$
CPU utilization: 100%? $U = 3/9 + 1/2 + 1/6 = 1$
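The timeline length and the utilization can be checked with integer arithmetic. A minimal sketch, with hypothetical names (struct task, hyperperiod, busy_time): the utilization is 100% exactly when the total execution time requested during one timeline equals the timeline length.

```c
/* Task triplet P(execution time, period, deadline). */
struct task { int exec, period, deadline; };

static int gcd(int a, int b)
{
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Least common multiple of all task periods (the timeline length). */
int hyperperiod(const struct task t[], int n)
{
    int h = 1;
    for (int i = 0; i < n; i++) {
        h = h / gcd(h, t[i].period) * t[i].period;
    }
    return h;
}

/* Total execution time requested by all tasks during one timeline.
   Utilization U = busy_time / hyperperiod. */
int busy_time(const struct task t[], int n)
{
    int h = hyperperiod(t, n);
    int busy = 0;
    for (int i = 0; i < n; i++) {
        busy += t[i].exec * (h / t[i].period);
    }
    return busy;
}
```

For P1(3, 9, 9), P2(1, 2, 2), P3(1, 6, 6) the timeline is 18 time units and the tasks request 3*2 + 1*9 + 1*3 = 18 time units, so U = 18/18 = 1.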

Rate Monotonic Scheduling
RMS: the process with the shortest period is assigned the highest priority, and so on.
RMS guarantee, a feasible schedule exists if: $U = \sum_{i=1}^{n} C_i / T_i \le n(2^{1/n} - 1)$, where $C_i$ is the execution time and $T_i$ the period of process i.
n = 3: U < 0.78 (Limit: $n \to \infty$: U < 69%)
In this case U = 1, so there is no guarantee!

RMS figure
Priorities: P2 > P3 > P1 (2 < 6 < 9)
P1 misses the deadline! No feasible schedule with RMS!
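The missed deadline can be reproduced by stepping through the timeline one time unit at a time. This is a sketch of a preemptive fixed-priority simulation under assumptions not spelled out on the slides: all tasks are released at time 0, deadline = period, and time advances in whole units. The names struct task and rms_first_miss are hypothetical.

```c
#define MAX_TASKS 8

/* Task with deadline = period. */
struct task { int exec, period; };

/* Simulates rate monotonic scheduling for `horizon` time units.
   Returns the index of the first task that misses a deadline,
   -1 if every deadline is met, or -2 if n is too large. */
int rms_first_miss(const struct task t[], int n, int horizon)
{
    int remaining[MAX_TASKS] = {0};
    if (n > MAX_TASKS) return -2;

    for (int now = 0; now < horizon; now++) {
        /* A new job is released every period. If the previous job is
           unfinished at that point, its deadline has been missed. */
        for (int i = 0; i < n; i++) {
            if (now % t[i].period == 0) {
                if (remaining[i] > 0) return i;
                remaining[i] = t[i].exec;
            }
        }
        /* Run the ready task with the shortest period for one unit. */
        int run = -1;
        for (int i = 0; i < n; i++) {
            if (remaining[i] > 0 &&
                (run < 0 || t[i].period < t[run].period)) {
                run = i;
            }
        }
        if (run >= 0) remaining[run]--;
    }
    /* Check deadlines that fall exactly at the end of the horizon. */
    for (int i = 0; i < n; i++) {
        if (horizon % t[i].period == 0 && remaining[i] > 0) return i;
    }
    return -1;
}
```

With the tasks in the order P1(3, 9), P2(1, 2), P3(1, 6) and a horizon of 18, P1 only gets time units 3 and 5 before its deadline at time 9, so the function reports index 0 (P1).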

Earliest Deadline First Scheduling
EDF guarantee, a feasible schedule exists if: $U \le 1$
In this case U = 1, so EDF will produce a feasible schedule.

6.3 Scheduling and semaphores
P(execution time, period, deadline)
P1(1, 3, 3) P2(1, 4, 4) P3(2, 6, 6)
Timeline: $3 = 3$, $4 = 2 \cdot 2$, $6 = 2 \cdot 3$, so $\mathrm{lcm}(3, 4, 6) = 3 \cdot 2 \cdot 2 = 12$
RMS priorities: P1 > P2 > P3 (3 < 4 < 6)
Sem1 is a binary semaphore. accessSem1() and releaseSem1() take 0 time.

RMS with no critical sections

RMS with critical sections
