
Ch. 7. Optimization


Presentation Transcript


1. Ch. 7. Optimization
7.1 Task level concurrency management
7.2 High-level optimizations
7.3 Compilers for embedded systems
7.4 Power management and thermal management

2. 7.1 Task level concurrency management
• The granularity of task graphs is one of their most important properties. Even for hierarchical task graphs, it may be useful to change the granularity of the nodes.
• The partitioning of specifications into tasks or processes does not necessarily aim at maximum implementation efficiency. Rather, during the specification phase, a clear separation of concerns and a clean software model are more important than caring too much about the implementation.
• Merging of task graphs can be performed whenever some task Ti is the immediate predecessor of some other task Tj and Tj does not have any other immediate predecessor.
[Figure: task graph with tasks T1 ... T5 before merging, and after merging T3 and T4 into the single task T3*]

3. This transformation can lead to a reduced context-switching overhead if the node is implemented in software, and it can lead to a larger potential for optimization in general (see the sketch below).
• Tasks may hold resources (like large amounts of memory) while they are waiting for some input. In order to maximize the use of these resources, it may be best to constrain their use to the time intervals during which they are actually needed.
• The most appropriate task graph granularity depends upon the context, so merging and splitting may both be required.
• Merging and splitting of tasks should be done automatically, depending upon the context.
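The effect of merging can be pictured in plain C. Below is a minimal sketch, assuming POSIX threads stand in for the tasks; the function names t3_body, t4_body, and t3_star are invented here to mirror T3, T4, and T3* from the figure:

  #include <pthread.h>
  #include <stdio.h>

  static void *t3_body(void *arg) { puts("T3: read input"); return NULL; }
  static void *t4_body(void *arg) { puts("T4: process");    return NULL; }

  /* Merged task T3*: since T3 is the only immediate predecessor of T4,
     their bodies can run back-to-back in one task, and the scheduler
     no longer pays for a context switch between them. */
  static void *t3_star(void *arg) {
      t3_body(arg);
      t4_body(arg);
      return NULL;
  }

  int main(void) {
      pthread_t t;
      pthread_create(&t, NULL, t3_star, NULL);   /* one task instead of two */
      pthread_join(t, NULL);
      return 0;
  }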

4. 7.2 High-level optimizations
7.2.1 Floating-point to fixed-point conversion
• Fixed-point representation, example bit pattern 10011.101:

  Old position    7    6    5    4    3     2    1    0
  New position    4    3    2    1    0    -1   -2   -3
  Bit pattern     1    0    0    1    1  .  1    0    1
  Contribution   2^4            2^1  2^0  2^-1      2^-3   = 19.625

• Floating-point format: x = (-1)^S × M × B^E, with sign S, mantissa (also called significand) M, base B, and exponent E; a floating-point word stores the fields Sign | Exponent | Mantissa.
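The 19.625 example can be reproduced in C with a small Q5.3 fixed-point type (8 bits, 3 of them fractional). This is a minimal sketch; the helper names to_fixed, to_double, and q_mul are illustrative, not part of any standard library, and real DSP code usually uses wider formats such as Q15 or Q31:

  #include <stdio.h>
  #include <stdint.h>

  /* Q5.3: 8-bit value with 3 fractional bits, matching the table above. */
  #define FRAC_BITS 3

  static uint8_t to_fixed(double x)   { return (uint8_t)(x * (1 << FRAC_BITS) + 0.5); }
  static double  to_double(uint8_t x) { return (double)x / (1 << FRAC_BITS); }

  /* Fixed-point multiply: widen, multiply, then shift back. */
  static uint8_t q_mul(uint8_t a, uint8_t b) {
      return (uint8_t)(((uint16_t)a * b) >> FRAC_BITS);
  }

  int main(void) {
      uint8_t x = to_fixed(19.625);            /* 10011101 binary = 157 */
      printf("raw = %u, value = %g\n", x, to_double(x));  /* 157, 19.625 */
      printf("2.5 * 3.0 = %g\n",                           /* prints 7.5 */
             to_double(q_mul(to_fixed(2.5), to_fixed(3.0))));
      return 0;
  }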

5. For many signal processing applications, it is possible to replace floating-point numbers with fixed-point numbers. The benefits may be significant; for example, a reduction of the cycle count by 73% and of the energy consumption by 76% has been reported for an MPEG-2 video compression algorithm.
7.2.2 Simple loop transformations
• The following is a list of standard loop transformations:
• Loop permutation: Consider a two-dimensional array. According to the C standard, two-dimensional arrays are laid out in memory as shown in Fig. 7.6.
[Figure 7.6: memory layout of array p[j][k] in row-major order (C), with rows j = 0, 1, 2 stored one after the other]

6. Loop permutation, assuming row-major order (C): the nest

  for (k=0; k<=m; k++)
    for (j=0; j<=n; j++)
      p[j][k] = ...

becomes

  for (j=0; j<=n; j++)
    for (k=0; k<=m; k++)
      p[j][k] = ...

so that the inner loop traverses each row of p with unit stride.
• Loop fusion, loop fission: There may be cases in which two separate loops can be merged, and there may be cases in which a single loop is split into two. The two small loops

  for (j=0; j<=n; j++)
    p[j] = ... ;
  for (j=0; j<=n; j++)
    p[j] = p[j] + ... ;

can be fused into a single loop with improved cache behavior:

  for (j=0; j<=n; j++)
    { p[j] = ... ;
      p[j] = p[j] + ... ; }

• Loop unrolling: Loop unrolling is a standard transformation creating several instances of the loop body. In general,

  for i = exp1 to exp2
    A(i)

becomes, with unrolling factor 2,

  for i = exp1/2 to exp2/2
    { A(2i); A(2i+1) }

for example:

  for (j=0; j<=n; j++)      becomes      for (j=0; j<=n; j+=2)
    p[j] = ... ;                           { p[j] = ... ; p[j+1] = ... ; }

• Unrolling reduces the loop overhead and therefore typically improves the speed.
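The benefit of loop permutation can be measured directly. The following self-contained sketch times both traversal orders; the array size N = 1024 and the use of clock() are choices made here for illustration, not from the slide:

  #include <stdio.h>
  #include <time.h>

  #define N 1024
  static double p[N][N];

  int main(void) {
      clock_t t0 = clock();
      for (int k = 0; k < N; k++)        /* original order: column-wise,   */
          for (int j = 0; j < N; j++)    /* stride of N doubles per access */
              p[j][k] = 1.0;
      clock_t t1 = clock();
      for (int j = 0; j < N; j++)        /* permuted order: row-wise,      */
          for (int k = 0; k < N; k++)    /* unit stride, cache-friendly    */
              p[j][k] = 2.0;
      clock_t t2 = clock();
      printf("original %.3f s, permuted %.3f s\n",
             (double)(t1 - t0) / CLOCKS_PER_SEC,
             (double)(t2 - t1) / CLOCKS_PER_SEC);
      return 0;
  }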

7. 7.2.3 Loop tiling/blocking
• Loop tiling (blocking) splits a loop into an outer loop over tiles and an inner loop within each tile. With a tile size of 4:

  for (i=0; i<9; i++)
    A[i] = ...;

becomes

  for (j=0; j<3; j++)
    for (i=4*j; i<4*j+4; i++)
      if (i<9) A[i] = ...;

[Figure: iteration space over i before and after tiling along j]

8. Matrix multiplication before tiling:

  for (i=1; i<=N; i++)
    for (k=1; k<=N; k++) {
      r = X[i][k];   /* to be allocated to a register */
      for (j=1; j<=N; j++)
        Z[i][j] += r * Y[k][j];
    }

Information in the cache is never reused for Y and Z if N is large or the cache is small (2 N^3 references for Z).
[Figure: access patterns of X, Y, and Z as i, k, and j advance]

9. The same computation after tiling with block size B:

  for (kk=1; kk<=N; kk+=B)
    for (jj=1; jj<=N; jj+=B)
      for (i=1; i<=N; i++)
        for (k=kk; k<=min(kk+B-1, N); k++) {
          r = X[i][k];   /* to be allocated to a register */
          for (j=jj; j<=min(jj+B-1, N); j++)
            Z[i][j] += r * Y[k][j];
        }

The same elements are used again in the next iteration of i.
[Figure: tiled access patterns of X, Y, and Z]
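A compilable version of the tiled loop nest might look as follows. This is a sketch: the 1-based indexing from the slide is kept by oversizing the arrays by one, and N = 256, B = 32 are assumed values:

  #include <stdio.h>

  #define N 256
  #define B 32                     /* tile size: a B x B block of Y fits in cache */
  #define MIN(a, b) ((a) < (b) ? (a) : (b))

  static double X[N + 1][N + 1], Y[N + 1][N + 1], Z[N + 1][N + 1];

  int main(void) {
      for (int i = 1; i <= N; i++)          /* simple test data */
          for (int j = 1; j <= N; j++) {
              X[i][j] = 1.0;
              Y[i][j] = 1.0;
          }

      for (int kk = 1; kk <= N; kk += B)
          for (int jj = 1; jj <= N; jj += B)
              for (int i = 1; i <= N; i++)
                  for (int k = kk; k <= MIN(kk + B - 1, N); k++) {
                      double r = X[i][k];   /* to be allocated to a register */
                      for (int j = jj; j <= MIN(jj + B - 1, N); j++)
                          Z[i][j] += r * Y[k][j];
                  }

      printf("Z[1][1] = %g (expected %d)\n", Z[1][1], N);
      return 0;
  }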

10. 7.2.4 Loop splitting
• Loop splitting breaks a loop into multiple loops that have the same body but iterate over different contiguous portions of the index range (a sketch follows below).
7.2.5 Array folding
[Figure: array folding, i.e., arrays with non-overlapping lifetimes sharing one memory region]
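As a minimal sketch of loop splitting, the loop below is split at N/2 into two loops with identical bodies over contiguous sub-ranges; the array and bounds are invented for illustration:

  #include <stdio.h>

  #define N 8

  int main(void) {
      int a[N];

      /* Original loop over the whole index range */
      for (int i = 0; i < N; i++)
          a[i] = i * i;

      /* After splitting at N/2: same body, two contiguous sub-ranges.
         This pays off e.g. when the two halves can be handed to
         different processors, or when a test on i can be resolved
         once per sub-range instead of once per iteration. */
      for (int i = 0; i < N / 2; i++)
          a[i] = i * i;
      for (int i = N / 2; i < N; i++)
          a[i] = i * i;

      for (int i = 0; i < N; i++)
          printf("%d ", a[i]);
      printf("\n");
      return 0;
  }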

11. [Figure: array folding, continued]

12. 7.3 Compilers for embedded systems
7.3.1 Introduction
• There are several reasons for designing special optimizations and compilers for embedded systems:
• Processor architectures in embedded systems exhibit special features.
• A high efficiency of the code is more important than a high compilation speed.
• Compilers could potentially help to meet and prove real-time constraints.
• Compilers may help to reduce the energy consumption of embedded systems.
• For embedded systems, there is a larger variety of instruction sets; hence, there are more processors for which compilers should be available.
7.3.2 Energy-aware compilation
• The following compiler optimizations have been used for reducing the energy consumption:
• Energy-aware scheduling
• Energy-aware instruction selection
• Replacing the cost function
• Exploitation of the memory hierarchy

13. 7.4 Power management and thermal management
7.4.1 Dynamic voltage scaling (DVS)
• The dynamic power consumption P of CMOS circuits is given by P = α × CL × Vdd^2 × f, where
• α: switching activity
• CL: the load capacitance
• Vdd: the supply voltage
• f: the clock frequency
• Example: consider a task that needs to execute 10^9 cycles within 25 seconds. Running at the maximum supply voltage (case a), with 40 nJ consumed per cycle:
Ea = 10^9 × 40 × 10^-9 [J] = 40 [J]

14. Lowering the voltage for part of the execution (case b: 750 × 10^6 cycles at 40 nJ/cycle, then 250 × 10^6 cycles at 10 nJ/cycle) or running all cycles at the ideal voltage (case c: 25 nJ/cycle) reduces the energy:
Eb = 750 × 10^6 × 40 × 10^-9 [J] + 250 × 10^6 × 10 × 10^-9 [J] = 32.5 [J]
Ec = 10^9 × 25 × 10^-9 [J] = 25 [J]
• A minimum energy consumption is achieved for the ideal supply voltage of 4 Volts, as the sketch below recomputes.
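The three energies can be recomputed with a few lines of C. This sketch uses the per-cycle energies from the example (40 nJ, 10 nJ, and 25 nJ); labeling them as belonging to the maximum, a reduced, and the ideal 4 V supply voltage follows the slide's scenario:

  #include <stdio.h>

  int main(void) {
      double cycles  = 1e9;     /* the task executes 10^9 cycles         */
      double e_max   = 40e-9;   /* J/cycle at the maximum supply voltage */
      double e_low   = 10e-9;   /* J/cycle at the reduced voltage        */
      double e_ideal = 25e-9;   /* J/cycle at the ideal 4 V              */

      double Ea = cycles * e_max;                 /* (a) run fast, then idle  */
      double Eb = 750e6 * e_max + 250e6 * e_low;  /* (b) two-voltage schedule */
      double Ec = cycles * e_ideal;               /* (c) single ideal voltage */

      printf("Ea = %.1f J, Eb = %.1f J, Ec = %.1f J\n", Ea, Eb, Ec);
      /* prints: Ea = 40.0 J, Eb = 32.5 J, Ec = 25.0 J */
      return 0;
  }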

