OPTIMIZING C CODE FOR THE ARM PROCESSOR

• Optimizing code takes time and reduces source code readability
• It is usually reserved for functions that are critical for performance or power consumption and are executed frequently
• It is usually done in combination with profiling

LOCAL VARIABLES
• ARM registers are 32 bits wide, so 32-bit data types are the most efficient choice for local variables
• Use int and unsigned int, and avoid char and short, which make the compiler insert extra instructions to narrow results to 8 or 16 bits
• The only exception is when you want 8-bit or 16-bit wraparound to occur
• unsigned int is more efficient than int for division
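The advice above can be illustrated with a minimal sketch. The function names below are hypothetical and not part of the original slides. The two checksum routines differ only in the type of the loop counter, and the last function shows the one case where a narrow type is the right choice.

```c
#include <stdint.h>

/* Hypothetical example: 8-bit loop counter. To honor the 8-bit type the
   compiler may have to narrow i after each increment, which typically
   costs an extra instruction per iteration on ARM. */
int checksum_narrow(const int *data)
{
    uint8_t i;
    int sum = 0;

    for (i = 0; i < 64; i++) {
        sum += data[i];
    }
    return sum;
}

/* Same routine with a counter that matches the 32-bit register width. */
int checksum_wide(const int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++) {
        sum += data[i];
    }
    return sum;
}

/* The exception: an 8-bit type is the natural choice when modulo-256
   wraparound is the behavior you want. */
uint8_t next_sequence_number(uint8_t seq)
{
    return (uint8_t)(seq + 1u);   /* 255 wraps around to 0 */
}
```

Both checksum routines return the same value. Any difference shows up only in the generated code, so it is worth inspecting the compiler's assembly output before relying on it.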

LOOP STRUCTURES (incrementing for loop)

checksum_v5
        MOV     r2,r0           ; r2 = data
        MOV     r0,#0           ; sum = 0
        MOV     r1,#0           ; i = 0
checksum_v5_loop
        LDR     r3,[r2],#4      ; r3 = *(data++)
        ADD     r1,r1,#1        ; i++
        CMP     r1,#0x40        ; compare i, 64
        ADD     r0,r3,r0        ; sum += r3
        BCC     checksum_v5_loop ; if (i<64) goto loop
        MOV     pc,r14          ; return sum

int checksum_v5(int *data)
{
    unsigned int i;
    int sum=0;

    for (i=0; i<64; i++)
    {
        sum += *(data++);
    }
    return sum;
}

LOOP STRUCTURES (decrementing for loop)
• Counting down to zero lets SUBS update the counter and set the flags in one instruction, so the loop needs no separate CMP

checksum_v6
        MOV     r2,r0           ; r2 = data
        MOV     r0,#0           ; sum = 0
        MOV     r1,#0x40        ; i = 64
checksum_v6_loop
        LDR     r3,[r2],#4      ; r3 = *(data++)
        SUBS    r1,r1,#1        ; i-- and set flags
        ADD     r0,r3,r0        ; sum += r3
        BNE     checksum_v6_loop ; if (i!=0) goto loop
        MOV     pc,r14          ; return sum

int checksum_v6(int *data)
{
    unsigned int i;
    int sum=0;

    for (i=64; i!=0; i--)
    {
        sum += *(data++);
    }
    return sum;
}
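The count-down idea extends to a variable trip count. The sketch below uses hypothetical function names (not from the slides) to contrast a count-down for loop with a count-down do-while loop. The do-while form skips the initial test, so it is only valid if the caller guarantees at least one element.

```c
/* Count-down for loop, as in checksum_v6 but for N elements.
   Safe for N == 0 because the condition is tested before the first pass. */
int checksum_for_down(const int *data, unsigned int N)
{
    int sum = 0;

    for (; N != 0; N--) {
        sum += *(data++);
    }
    return sum;
}

/* Count-down do-while loop. ASSUMPTION: the caller guarantees N >= 1,
   which lets the compiler omit the test before the first iteration. */
int checksum_do_while(const int *data, unsigned int N)
{
    int sum = 0;

    do {
        sum += *(data++);
    } while (--N != 0);
    return sum;
}
```

Calling checksum_do_while with N equal to zero would make the counter wrap around and read far past the end of the array, so the precondition belongs in the function's documentation.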

LOOP UNROLLING
• Assumes N is a nonzero multiple of four

checksum_v7
        MOV     r2,#0           ; sum = 0
checksum_v7_loop
        LDR     r3,[r0],#4      ; r3 = *(data++)
        SUBS    r1,r1,#4        ; N -= 4 and set flags
        ADD     r2,r3,r2        ; sum += r3
        LDR     r3,[r0],#4      ; r3 = *(data++)
        ADD     r2,r3,r2        ; sum += r3
        LDR     r3,[r0],#4      ; r3 = *(data++)
        ADD     r2,r3,r2        ; sum += r3
        LDR     r3,[r0],#4      ; r3 = *(data++)
        ADD     r2,r3,r2        ; sum += r3
        BNE     checksum_v7_loop ; if (N!=0) goto loop
        MOV     r0,r2           ; r0 = sum
        MOV     pc,r14          ; return r0

int checksum_v7(int *data, unsigned int N)
{
    int sum=0;

    do
    {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        N -= 4;
    } while (N!=0);
    return sum;
}
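The unrolled checksum only works when N is a nonzero multiple of four. One common way to lift that restriction, sketched here under a hypothetical name (checksum_unrolled is not from the slides), is to process the complete groups of four in the unrolled loop and the zero to three leftover elements in a short cleanup loop.

```c
/* Sketch: unrolled by four, with a cleanup loop for the remainder. */
int checksum_unrolled(const int *data, unsigned int N)
{
    int sum = 0;
    unsigned int blocks = N >> 2;   /* number of complete groups of four */
    unsigned int rest = N & 3;      /* 0 to 3 leftover elements */

    for (; blocks != 0; blocks--) {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
    }
    for (; rest != 0; rest--) {
        sum += *(data++);
    }
    return sum;
}
```

The cleanup loop costs a little code size and one extra test, which is usually a reasonable price for accepting any N, including zero.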

Loop Unrolling example
• Unroll the following loop by a factor of 2, 4, and 8

for (i=0; i<64; i++)
{
    a[i] = b[i] + c[i+1];
}

Factor of 2

for (i=0; i<32; i++)
{
    a[2*i]   = b[2*i]   + c[2*i+1];
    a[2*i+1] = b[2*i+1] + c[2*i+1+1];
}

Factor of 4

for (i=0; i<16; i++)
{
    a[4*i]   = b[4*i]   + c[4*i+1];
    a[4*i+1] = b[4*i+1] + c[4*i+1+1];
    a[4*i+2] = b[4*i+2] + c[4*i+2+1];
    a[4*i+3] = b[4*i+3] + c[4*i+3+1];
}

Factor of 8

for (i=0; i<8; i++)
{
    a[8*i]   = b[8*i]   + c[8*i+1];
    a[8*i+1] = b[8*i+1] + c[8*i+1+1];
    a[8*i+2] = b[8*i+2] + c[8*i+2+1];
    a[8*i+3] = b[8*i+3] + c[8*i+3+1];
    a[8*i+4] = b[8*i+4] + c[8*i+4+1];
    a[8*i+5] = b[8*i+5] + c[8*i+5+1];
    a[8*i+6] = b[8*i+6] + c[8*i+6+1];
    a[8*i+7] = b[8*i+7] + c[8*i+7+1];
}
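An equivalent way to write the unrolled loops is to step the index by the unroll factor instead of multiplying it, which avoids the index multiplications in the source. The sketch below shows this for a factor of four. The function names and the LEN macro are illustrative assumptions, and the two functions are meant to be compared against each other to check that unrolling did not change the result.

```c
#define LEN 64   /* trip count of the original loop */

/* Original loop. Note that c must hold LEN + 1 elements,
   because c[i+1] reaches c[LEN] on the last iteration. */
void add_rolled(int *a, const int *b, const int *c)
{
    unsigned int i;

    for (i = 0; i < LEN; i++) {
        a[i] = b[i] + c[i + 1];
    }
}

/* Unrolled by four, stepping i by 4. LEN must be a multiple of 4. */
void add_unrolled4(int *a, const int *b, const int *c)
{
    unsigned int i;

    for (i = 0; i < LEN; i += 4) {
        a[i]     = b[i]     + c[i + 1];
        a[i + 1] = b[i + 1] + c[i + 2];
        a[i + 2] = b[i + 2] + c[i + 3];
        a[i + 3] = b[i + 3] + c[i + 4];
    }
}
```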

REGISTER ALLOCATION
• Limit the number of local variables in the inner loop of a function to 12
• Use the most important variables in the innermost loop, to help the compiler decide which ones to keep in registers

CALLING FUNCTIONS
• Try to restrict functions to four arguments, which can be passed in registers. Group related arguments in a structure and pass a pointer to the structure instead
• Define small functions in the same source file as, and before, the functions that call them, so that the compiler can inline them
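As a minimal sketch of the first point, the two functions below compute the same result. The structure FilterArgs and both function names are hypothetical, chosen only for illustration. Under the ARM procedure call standard the first four integer arguments travel in r0 to r3 and further arguments go on the stack, so the five-argument version needs a stack access while the structure version passes a single pointer.

```c
/* Hypothetical parameter block grouping related arguments. */
typedef struct {
    const int *data;     /* input samples */
    unsigned int count;  /* number of samples */
    int scale;           /* multiplier applied to each sample */
    int offset;          /* constant added to each sample */
    int limit;           /* upper bound on the returned sum */
} FilterArgs;

/* Five scalar arguments: the fifth (limit) is passed on the stack. */
int scaled_sum_many(const int *data, unsigned int count,
                    int scale, int offset, int limit)
{
    int sum = 0;

    for (; count != 0; count--) {
        sum += *(data++) * scale + offset;
    }
    return sum > limit ? limit : sum;
}

/* One pointer argument: everything else is read through the pointer. */
int scaled_sum_struct(const FilterArgs *args)
{
    const int *data = args->data;
    unsigned int count = args->count;
    int sum = 0;

    for (; count != 0; count--) {
        sum += *(data++) * args->scale + args->offset;
    }
    return sum > args->limit ? args->limit : sum;
}
```

The structure version pays for loads through the pointer, so it helps most when the same argument block is passed along through several calls.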

REGISTER ALLOCATION
• Limit the number of inner-loop variables to 12 so that they can all be held in registers

SUMMARY
• Use signed int and unsigned int types for local variables, function arguments and return values
• The most efficient form of loop is the do-while loop that counts down to zero
• Unroll important loops
• Try to limit functions to four arguments
• Avoid divisions; use multiplication by the reciprocal instead
• Use the inline assembler
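The reciprocal trick in the summary can be sketched for one fixed divisor. The function name div10 is a hypothetical example, not something from the slides. It multiplies by 0xCCCCCCCD, which is 2^35 / 10 rounded up, and then shifts right by 35 bits. For every 32-bit unsigned input this gives the same result as x / 10, and on ARM the 32x32 to 64-bit product maps onto a long multiply instead of a software division routine.

```c
#include <stdint.h>

/* Sketch: unsigned division by the constant 10 without a divide.
   0xCCCCCCCD == ceil(2^35 / 10); the rounding error is small enough
   that the result is exact for all 32-bit unsigned x. */
uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

Many compilers already apply this transformation when the divisor is a compile-time constant, so it pays to check the generated assembly before hand-coding it. The larger win tends to come when the same run-time divisor is reused many times and its reciprocal can be computed once.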

ARM INLINE ASSEMBLY

#include <stdio.h>

int main()
{
    int n1,n2,m;

    n1=5;
    n2=3;
    __asm                   //inline assembly code
    {
        MUL m,n1,n2
    }
    printf("The result is %d\n",m);
    return(0);
}

USING INLINE ASSEMBLY
• Used for ARM instructions that the C compiler does not generate, such as coprocessor instructions and instruction set extensions
• Creates portability issues
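One way to contain the portability cost is to hide each assembly instruction behind a small inline function with a plain C fallback. The sketch below makes several assumptions: the name mul32 is hypothetical, and the assembly branch uses GNU-style extended asm rather than the ARM compiler's __asm { } block form, so it applies only to GCC-compatible toolchains targeting ARM state.

```c
/* Sketch: wrap one instruction in a small inline function. */
static inline int mul32(int a, int b)
{
#if defined(__GNUC__) && defined(__arm__) && !defined(__thumb__)
    int result;
    /* ASSUMPTION: GNU extended asm syntax. The "=&r" constraint keeps
       the destination register distinct from both input registers. */
    __asm__("mul %0, %1, %2" : "=&r"(result) : "r"(a), "r"(b));
    return result;
#else
    /* Portable fallback for every other compiler or target. */
    return a * b;
#endif
}
```

Callers use mul32(5, 3) like any other function, and only this one definition has to change when the code moves to another compiler or processor.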

ALTERNATIVE: CALLING ASSEMBLY FUNCTION FROM C

#include <stdio.h>

int n1, n2, m;              /* globals, so the assembly file can IMPORT them */
extern void multip(void);   /* implemented in assembly */

int main()
{
    n1=5;                   //Assigning numbers
    n2=3;
    multip();               //calling function: m = n1 * n2
    printf("The result is %d\n",m);
    return(0);
}

Assembly function

        AREA    example, CODE, READONLY
        EXPORT  multip          ;external function name
        IMPORT  n1              ;input
        IMPORT  n2
        IMPORT  m               ;return variable
multip                          ;function begins
        LDR     r3,=n1          ;load data from memory to registers
        LDR     r1,[r3]
        LDR     r3,=n2
        LDR     r2,[r3]
        MUL     r0,r1,r2        ;r0 = n1 * n2
        LDR     r3,=m
        STR     r0,[r3]         ;store result to m memory location
        MOV     pc,lr           ;return from call
        END

PORTABILITY ISSUES
• char type: unsigned by default on ARM, signed on many other processors
• Alignment: ARM load and store instructions (LDR, STR) assume the address is a multiple of the size of the type being loaded or stored
• Endianness: little endian by default; can be configured as big endian
• Inline assembly: separate inline assembly into small inlined functions
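The first three issues can each be sidestepped in plain C. The helpers below are a hedged sketch with hypothetical names, not part of the original slides. One reports the signedness of plain char, one uses signed char explicitly, and one reads a 32-bit little-endian value byte by byte so that it depends on neither alignment nor the processor's endianness.

```c
#include <limits.h>
#include <stdint.h>

/* Returns 1 if plain char is unsigned with this compiler (the ARM
   default), 0 if it is signed. */
int plain_char_is_unsigned(void)
{
    return CHAR_MIN == 0;
}

/* Code that needs negative 8-bit values should say "signed char"
   explicitly instead of relying on plain char. */
int byte_to_int(signed char c)
{
    return c;
}

/* Reads a 32-bit little-endian value from any address. Byte accesses
   have no alignment requirement, and the shifts fix the byte order
   regardless of how the processor is configured. */
uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

The byte-by-byte load is slower than a single LDR, so it is best kept for data whose layout is fixed externally, such as file formats and network packets.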

EXAMPLE
• Write a program that reads an 8-element row vector and an 8-element column vector from memory and:
  • Multiplies both by a scalar also found in memory
  • Calculates the scalar product of the two vectors
• Assume no partial product may exceed 32 bits
• Use v1 = [1 2 3 4 5 6 7 8], v2 = [0 1 2 3 4 5 6 7]T, s = 5 as test inputs
• Unroll the loop by two and four
• Repeat using inline assembly for the multiplications
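As a possible starting point for the exercise, the sketch below covers only the unroll-by-two case in plain C. The function name, the VEC_LEN macro and the choice to scale the vectors in place are all assumptions. The factor-of-four and inline-assembly variants are left to the reader.

```c
#define VEC_LEN 8   /* number of elements in each vector */

/* Scales both vectors in place by s, then returns their scalar product.
   The loop is unrolled by two and counts down to zero. */
int scaled_dot_unroll2(int *v1, int *v2, int s)
{
    int sum = 0;
    unsigned int i = VEC_LEN / 2;   /* two elements per iteration */

    do {
        v1[0] *= s;
        v2[0] *= s;
        v1[1] *= s;
        v2[1] *= s;
        sum += v1[0] * v2[0];
        sum += v1[1] * v2[1];
        v1 += 2;
        v2 += 2;
    } while (--i != 0);
    return sum;
}
```

With the test inputs from the slide, each product picks up a factor of s squared, so the result is 25 × (1·0 + 2·1 + ... + 8·7) = 25 × 168 = 4200.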
