Program Optimization - Machine Independent
CENG331: Introduction to Computer Systems, 11th Lecture
Instructor: Erol Sahin
• Acknowledgement: Most of the slides are adapted from the ones prepared by R.E. Bryant, D.R. O'Hallaron of Carnegie-Mellon Univ.
Harsh Reality
• There's more to runtime performance than asymptotic complexity
• One can easily lose 10x, 100x in runtime or even more
• What matters:
• Constants (100n and 5n are both O(n), but …)
• Coding style (unnecessary procedure calls, unrolling, reordering, …)
• Algorithm structure (locality, instruction level parallelism, …)
• Data representation (complicated structs or simple arrays)
Harsh Reality • Must optimize at multiple levels: • Algorithm • Data representations • Procedures • Loops • Must understand system to optimize performance • How programs are compiled and executed • Execution units, memory hierarchy • How to measure program performance and identify bottlenecks • How to improve performance without destroying code modularity and generality
Example: Matrix Multiplication
• Standard desktop computer, compiler, using optimization flags
• Both implementations have exactly the same operation count (2n^3)
• The triple-loop code is not obviously stupid, yet the best code runs 160x faster
• What is going on?
MMM Plot: Analysis
• The gap decomposes roughly as: multiple threads, 4x (covered in CENG334, not in this course); vector instructions, 4x; memory hierarchy and other optimizations, 20x
• Reason for the 20x: blocking or tiling, loop unrolling, array scalarization, instruction scheduling, search to find the best choice
• Effect: more instruction level parallelism, better register use, fewer L1/L2 cache misses, fewer TLB misses
Optimizing Compilers -O • Use optimization flags, default is no optimization (-O0)! • Good choices for gcc: -O2, -O3, -march=xxx, -m64 • Try different flags and maybe different compilers
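For instance, a typical invocation combining these flags (a sketch; -march=native asks gcc to tune for the machine you compile on, and exact flag availability depends on your gcc version) might be:

  gcc -O3 -m64 -march=native prog.c -o prog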
Example

  double a[4][4];
  double b[4][4];
  double c[4][4];   /* set to zero */

  /* Multiply 4 x 4 matrices a and b */
  void mmm(double *a, double *b, double *c, int n)
  {
      int i, j, k;
      for (i = 0; i < 4; i++)
          for (j = 0; j < 4; j++)
              for (k = 0; k < 4; k++)
                  c[i*4 + j] += a[i*4 + k] * b[k*4 + j];
  }

• Compiled without flags: ~1300 cycles
• Compiled with -O3 -m64 -march=… -fno-tree-vectorize: ~150 cycles
• Core 2 Duo, 2.66 GHz
Optimizing Compilers • Compilers are good at: mapping program to machine • register allocation • code selection and ordering (scheduling) • dead code elimination • eliminating minor inefficiencies • Compilers are not good at: improving asymptotic efficiency • up to programmer to select best overall algorithm • big-O savings are (often) more important than constant factors • but constant factors also matter • Compilers are not good at: overcoming “optimization blockers” • potential memory aliasing • potential procedure side-effects
Limitations of Optimizing Compilers
• If in doubt, the compiler is conservative
• Operates under fundamental constraints
• Must not change program behavior under any possible condition
• This often prevents optimizations that would only affect behavior under pathological conditions
• Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
• e.g., data ranges may be more limited than variable types suggest
• Most analysis is performed only within procedures
• Whole-program analysis is too expensive in most cases
• Most analysis is based only on static information
• Compiler has difficulty anticipating run-time inputs
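One small illustration of the data-range point (an example of ours, not from the slides): for a signed int, the compiler cannot replace division by 2 with a shift unless it can prove the value is non-negative, because the two round differently for negative values.

  /* On typical two's-complement machines, x/2 rounds toward zero
     while x>>1 rounds toward minus infinity:
     -3/2 == -1, but -3>>1 == -2. Only a programmer who knows
     x >= 0 can safely substitute the cheaper shift. */
  int half(int x) { return x / 2; }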
Machine-Independent Optimizations
• Optimizations you should do regardless of processor / compiler

Code Motion
• Reduce frequency with which computation is performed
• If it will always produce same result
• Especially moving code out of loop:

  for (i = 0; i < n; i++)
      for (j = 0; j < n; j++)
          a[n*i + j] = b[j];

becomes

  for (i = 0; i < n; i++) {
      int ni = n*i;
      for (j = 0; j < n; j++)
          a[ni + j] = b[j];
  }
Compiler-Generated Code Motion
• Most compilers do a good job with array code + simple loop structures

Code generated by GCC for

  for (i = 0; i < n; i++)
      for (j = 0; j < n; j++)
          a[n*i + j] = b[j];

is equivalent to

  for (i = 0; i < n; i++) {
      int ni = n*i;
      int *p = a+ni;
      for (j = 0; j < n; j++)
          *p++ = b[j];
  }

  imull %ebx,%eax          # i*n
  movl 8(%ebp),%edi        # a
  leal (%edi,%eax,4),%edx  # p = a+i*n (scaled by 4)
  # Inner Loop
  .L40:
  movl 12(%ebp),%edi       # b
  movl (%edi,%ecx,4),%eax  # b+j (scaled by 4)
  movl %eax,(%edx)         # *p = b[j]
  addl $4,%edx             # p++ (scaled by 4)
  incl %ecx                # j++
  cmpl %ebx,%ecx           # j:n
  jl .L40                  # loop if j<n
Reduction in Strength
• Replace costly operation with simpler one
• Shift, add instead of multiply or divide: 16*x --> x << 4
• Utility is machine dependent: depends on cost of multiply or divide instruction
• On Pentium II or III, integer multiply only requires 4 CPU cycles
• Recognize sequence of products:

  for (i = 0; i < n; i++)
      for (j = 0; j < n; j++)
          a[n*i + j] = b[j];

becomes

  int ni = 0;
  for (i = 0; i < n; i++) {
      for (j = 0; j < n; j++)
          a[ni + j] = b[j];
      ni += n;
  }
Make Use of Registers
• Reading and writing registers is much faster than reading/writing memory

Limitation
• Compiler not always able to determine whether a variable can be held in a register
• Possibility of aliasing
• See example later
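A minimal sketch of the idea (our example; the combine functions later in the lecture make the same point on real code):

  /* Accumulating through *dest forces a memory read and write each
     iteration -- the compiler may not keep *dest in a register,
     since dest could alias an element of a[]. */
  void sum_mem(int *a, int n, int *dest) {
      *dest = 0;
      for (int i = 0; i < n; i++)
          *dest += a[i];
  }

  /* A local accumulator can live in a register for the whole loop. */
  void sum_reg(int *a, int n, int *dest) {
      int sum = 0;
      for (int i = 0; i < n; i++)
          sum += a[i];
      *dest = sum;
  }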
Machine-Independent Opts. (Cont.)
Share Common Subexpressions
• Reuse portions of expressions
• Compilers often not very sophisticated in exploiting arithmetic properties

3 multiplications (i*n, (i-1)*n, (i+1)*n):

  /* Sum neighbors of i,j */
  up    = val[(i-1)*n + j];
  down  = val[(i+1)*n + j];
  left  = val[i*n + j-1];
  right = val[i*n + j+1];
  sum = up + down + left + right;

1 multiplication (i*n):

  int inj = i*n + j;
  up    = val[inj - n];
  down  = val[inj + n];
  left  = val[inj - 1];
  right = val[inj + 1];
  sum = up + down + left + right;

Assembly for the three multiplications:

  leal -1(%edx),%ecx   # i-1
  imull %ebx,%ecx      # (i-1)*n
  leal 1(%edx),%eax    # i+1
  imull %ebx,%eax      # (i+1)*n
  imull %ebx,%edx      # i*n
Vector ADT
Procedures:
• vec_ptr new_vec(int len): create vector of specified length
• int get_vec_element(vec_ptr v, int index, int *dest): retrieve vector element, store at *dest; return 0 if out of bounds, 1 if successful
• int *get_vec_start(vec_ptr v): return pointer to start of vector data
• Similar to array implementations in Pascal, ML, Java
• E.g., always do bounds checking
• (Diagram: a length field len followed by data elements 0 … len-1)
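The slides do not show the representation behind vec_ptr; a minimal sketch consistent with the interface (the struct layout, field names, and zero-initialized allocation are assumptions of ours):

  #include <stdlib.h>

  typedef struct {
      int len;
      int *data;
  } vec_rec, *vec_ptr;

  /* Create vector of specified length, elements zero-initialized */
  vec_ptr new_vec(int len) {
      vec_ptr v = malloc(sizeof(vec_rec));
      v->len = len;
      v->data = calloc(len, sizeof(int));
      return v;                /* allocation-failure handling omitted */
  }

  /* Retrieve element, store at *dest; 0 if out of bounds, 1 if OK */
  int get_vec_element(vec_ptr v, int index, int *dest) {
      if (index < 0 || index >= v->len)
          return 0;
      *dest = v->data[index];
      return 1;
  }

  int *get_vec_start(vec_ptr v) { return v->data; }
  int vec_length(vec_ptr v)     { return v->len; }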
Optimization Example

  void combine1(vec_ptr v, int *dest)
  {
      int i;
      *dest = 0;
      for (i = 0; i < vec_length(v); i++) {
          int val;
          get_vec_element(v, i, &val);
          *dest += val;
      }
  }

• Procedure
• Compute sum of all elements of vector
• Store result at destination location
Time Scales
Absolute Time
• Typically use nanoseconds: 10^-9 seconds
• Time scale of computer instructions
Clock Cycles
• Most computers controlled by high frequency clock signal
• Typical range:
• 100 MHz = 10^8 cycles per second, clock period = 10 ns
• 2 GHz = 2 × 10^9 cycles per second, clock period = 0.5 ns
Cycles Per Element
• Convenient way to express performance of a program that operates on vectors or lists
• Length = n
• T = CPE*n + Overhead
• (Plot: cycles vs. n; vsum1 slope = 4.0, vsum2 slope = 3.5)
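As a worked example: with vsum1's slope of 4.0 CPE, summing n = 1000 elements costs roughly T = 4.0 × 1000 + Overhead ≈ 4000 cycles plus a fixed startup cost (the plot does not give the overhead's exact value).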
Optimization Example

  void combine1(vec_ptr v, int *dest)
  {
      int i;
      *dest = 0;
      for (i = 0; i < vec_length(v); i++) {
          int val;
          get_vec_element(v, i, &val);
          *dest += val;
      }
  }

• Procedure
• Compute sum of all elements of integer vector
• Store result at destination location
• Vector data structure and operations defined via abstract data type
• Pentium II/III performance: clock cycles / element
• 42.06 (compiled -g), 31.25 (compiled -O2)
Understanding Loop

  void combine1_goto(vec_ptr v, int *dest)
  {
      int i = 0;
      int val;
      *dest = 0;
      if (i >= vec_length(v))
          goto done;
  loop:                              /* one iteration */
      get_vec_element(v, i, &val);
      *dest += val;
      i++;
      if (i < vec_length(v))
          goto loop;
  done:
      ;
  }

• Inefficiency
• Procedure vec_length called every iteration
• Even though result always the same
Move vec_length Call Out of Loop

  void combine2(vec_ptr v, int *dest)
  {
      int i;
      int length = vec_length(v);
      *dest = 0;
      for (i = 0; i < length; i++) {
          int val;
          get_vec_element(v, i, &val);
          *dest += val;
      }
  }

• Optimization
• Move call to vec_length out of inner loop
• Value does not change from one iteration to next
• Code motion
• CPE: 20.66 (compiled -O2)
• vec_length requires only constant time, but significant overhead
Code Motion Example #2
Procedure to convert string to lower case:

  void lower(char *s)
  {
      int i;
      for (i = 0; i < strlen(s); i++)
          if (s[i] >= 'A' && s[i] <= 'Z')
              s[i] -= ('A' - 'a');
  }
Lower Case Conversion Performance
• Time quadruples when string length doubles
• Quadratic performance
Convert Loop To Goto Form

  void lower(char *s)
  {
      int i = 0;
      if (i >= strlen(s))
          goto done;
  loop:
      if (s[i] >= 'A' && s[i] <= 'Z')
          s[i] -= ('A' - 'a');
      i++;
      if (i < strlen(s))
          goto loop;
  done:
      ;
  }

• strlen executed every iteration
• strlen is linear in the length of the string: must scan string until it finds '\0'
• Overall performance is quadratic
Improving Performance

  void lower(char *s)
  {
      int i;
      int len = strlen(s);
      for (i = 0; i < len; i++)
          if (s[i] >= 'A' && s[i] <= 'Z')
              s[i] -= ('A' - 'a');
  }

• Move call to strlen outside of loop
• Since result does not change from one iteration to another
• Form of code motion
Lower Case Conversion Performance
• Time doubles when string length doubles
• Linear performance
Optimization Blocker: Procedure Calls
Why couldn't the compiler move vec_length or strlen out of the inner loop?
• Procedure may have side effects: alters global state each time called (see the sketch below)
• Function may not return same value for given arguments: depends on other parts of global state
• Procedure lower could interact with strlen
Why doesn't the compiler look at the code for vec_length or strlen?
• Linker may overload with a different version, unless declared static
• Interprocedural optimization is not used extensively due to cost
Warning:
• Compiler treats procedure call as a black box
• Weak optimizations in and around them
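A hypothetical illustration of the side-effect problem (not the course's actual vec_length; reuses the vec_ptr sketch from earlier):

  int call_count = 0;              /* global state */

  int vec_length_logged(vec_ptr v) {
      call_count++;                /* side effect on every call */
      return v->len;
  }

  /* If the compiler hoisted this call out of a loop that runs n
     times, call_count would end up n-1 smaller -- observably
     different behavior, so the call must stay inside the loop. */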
Reduction in Strength

  void combine3(vec_ptr v, int *dest)
  {
      int i;
      int length = vec_length(v);
      int *data = get_vec_start(v);
      *dest = 0;
      for (i = 0; i < length; i++)
          *dest += data[i];
  }

• Optimization
• Avoid procedure call to retrieve each vector element
• Get pointer to start of array before loop
• Within loop just do pointer reference
• Not as clean in terms of data abstraction
• CPE: 6.00 (compiled -O2)
• Procedure calls are expensive!
• Bounds checking is expensive
Eliminate Unneeded Memory Refs

  void combine4(vec_ptr v, int *dest)
  {
      int i;
      int length = vec_length(v);
      int *data = get_vec_start(v);
      int sum = 0;
      for (i = 0; i < length; i++)
          sum += data[i];
      *dest = sum;
  }

• Optimization
• Don't need to store in destination until end
• Local variable sum held in register
• Avoids 1 memory read, 1 memory write per iteration
• CPE: 2.00 (compiled -O2)
• Memory references are expensive!
Detecting Unneeded Memory Refs.

Combine3:

  .L18:
  movl (%ecx,%edx,4),%eax
  addl %eax,(%edi)
  incl %edx
  cmpl %esi,%edx
  jl .L18

Combine4:

  .L24:
  addl (%eax,%edx,4),%ecx
  incl %edx
  cmpl %esi,%edx
  jl .L24

• Performance
• Combine3: 5 instructions in 6 clock cycles; the addl must read and write memory
• Combine4: 4 instructions in 2 clock cycles
Optimization Blocker: Memory Aliasing
• Aliasing: two different memory references specify a single location
• Example
• v: [3, 2, 17]
• combine3(v, get_vec_start(v)+2) --> ?
• combine4(v, get_vec_start(v)+2) --> ?
• Here dest aliases v[2]: combine3 first zeroes v[2] and then rereads it on each +=, yielding [3, 2, 10]; combine4 accumulates in sum and stores once at the end, yielding [3, 2, 22]
• Observations
• Easy to have happen in C
• Since allowed to do address arithmetic
• Direct access to storage structures
• Get in habit of introducing local variables
• Accumulating within loops
• Your way of telling compiler not to check for aliasing
Machine-Independent Opt. Summary Code Motion • Compilers are good at this for simple loop/array structures • Don’t do well in presence of procedure calls and memory aliasing Reduction in Strength • Shift, add instead of multiply or divide • compilers are (generally) good at this • Exact trade-offs machine-dependent • Keep data in registers rather than memory • compilers are not good at this, since concerned with aliasing Share Common Subexpressions • compilers have limited algebraic reasoning capabilities
Important Tools
• Measurement
• Accurately compute time taken by code
• Most modern machines have built-in cycle counters
• Using them to get reliable measurements is tricky
• Profile procedure calling frequencies
• Unix tool gprof
• Observation
• Generating assembly code
• Lets you see what optimizations the compiler can make
• Understand capabilities/limitations of a particular compiler
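On x86, one common way to read the built-in cycle counter is the rdtsc instruction, which gcc exposes as __rdtsc(). A sketch (run_benchmark stands in for hypothetical code under test; reliable numbers also require dealing with frequency scaling, core migration, and serialization):

  #include <x86intrin.h>           /* __rdtsc() */

  void run_benchmark(void) { /* hypothetical code under test */ }

  unsigned long long time_cycles(void) {
      unsigned long long t0 = __rdtsc();
      run_benchmark();
      return __rdtsc() - t0;       /* approximate elapsed cycles */
  }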
Code Profiling Example
Task
• Count word frequencies in text document
• Produce sorted list of words from most frequent to least
Steps
• Convert strings to lowercase
• Apply hash function
• Read words and insert into hash table: mostly list operations; maintain counter for each unique word
• Sort results
Data Set
• Collected works of Shakespeare
• 946,596 total words, 26,596 unique
• Initial implementation: 9.2 seconds
• (Table: Shakespeare's most frequent words)
Code Profiling
Augment executable program with timing functions
• Computes (approximate) amount of time spent in each function
• Time computation method
• Periodically (~every 10 ms) interrupt program
• Determine what function is currently executing
• Increment its timer by interval (e.g., 10 ms)
• Also maintains counter for each function indicating number of times called
Using:

  gcc -O2 -pg prog.c -o prog
  ./prog

• Executes in normal fashion, but also generates file gmon.out

  gprof prog

• Generates profile information based on gmon.out
Profiling Results

  %      cumulative   self               self      total
  time     seconds    seconds   calls    ms/call   ms/call   name
  86.60       8.21       8.21        1   8210.00   8210.00   sort_words
   5.80       8.76       0.55   946596      0.00      0.00   lower1
   4.75       9.21       0.45   946596      0.00      0.00   find_ele_rec
   1.27       9.33       0.12   946596      0.00      0.00   h_add

Call Statistics
• Number of calls and cumulative time for each function
Performance Limiter
• Using inefficient sorting algorithm
• Single call uses 87% of CPU time
Code Optimizations • First step: Use more efficient sorting function • Library function qsort
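A sketch of what that change might look like (the word_rec struct and its field names are assumptions of ours, not the actual course code):

  #include <stdlib.h>

  typedef struct {
      char *word;
      int count;
  } word_rec;

  /* qsort callback: order by count, most frequent first */
  static int compare_counts(const void *a, const void *b) {
      const word_rec *wa = a, *wb = b;
      return wb->count - wa->count;   /* safe while counts stay small */
  }

  void sort_words(word_rec *words, int nwords) {
      qsort(words, nwords, sizeof(word_rec), compare_counts);
  }

qsort runs in O(n log n) on average, replacing the inefficient sort that dominated the profile.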
Further Optimizations • Iter first: Use iterative function to insert elements into linked list • Causes code to slow down • Iter last: Iterative function, places new entry at end of list • Tend to place most common words at front of list • Big table: Increase number of hash buckets • Better hash: Use more sophisticated hash function • Linear lower: Move strlen out of loop
Profiling Observations
Benefits
• Helps identify performance bottlenecks
• Especially useful when you have a complex system with many components
Limitations
• Only shows performance for the data tested
• E.g., linear lower did not show a big gain, since words are short
• Quadratic inefficiency could remain lurking in code
• Timing mechanism fairly crude
• Only works for programs that run for > 3 seconds
Program Optimization - Machine Dependent
CENG331: Introduction to Computer Systems, 12th Lecture
Instructor: Erol Sahin
• Acknowledgement: Most of the slides are adapted from the ones prepared by R.E. Bryant, D.R. O'Hallaron of Carnegie-Mellon Univ.
Overview • Topics • Machine-Dependent Optimizations • Pointer code • Unrolling • Enabling instruction level parallelism • Understanding Processor Operation • Translation of instructions into operations • Out-of-order execution of operations • Branches and Branch Prediction
Previous Best Combining Code

  void combine4(vec_ptr v, int *dest)
  {
      int i;
      int length = vec_length(v);
      int *data = get_vec_start(v);
      int sum = 0;
      for (i = 0; i < length; i++)
          sum += data[i];
      *dest = sum;
  }

• Task
• Compute sum of all elements in vector
• Vector represented by C-style abstract data type
• Achieved CPE of 2.00 (cycles per element)
General Forms of Combining
• Data types: use different declarations for data_t (int, float, double)
• Operations: use different definitions of OP and IDENT (+ / 0, * / 1)

  void abstract_combine4(vec_ptr v, data_t *dest)
  {
      int i;
      int length = vec_length(v);
      data_t *data = get_vec_start(v);
      data_t t = IDENT;
      for (i = 0; i < length; i++)
          t = t OP data[i];
      *dest = t;
  }
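One conventional way to supply these definitions is with a typedef and macros chosen at compile time; a sketch for the integer-sum instance (the product version would instead use IDENT 1 and OP *):

  /* Integer sum instance */
  typedef int data_t;
  #define IDENT 0
  #define OP +

  /* Inside abstract_combine4, 't = t OP data[i]' then expands
     to 't = t + data[i]'. */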
Machine Independent Opt. Results
• Optimizations: reduce function calls and memory references within loop
• Performance anomaly
• Computing FP product of all elements exceptionally slow
• Very large speedup when accumulating in a temporary
• Caused by quirk of IA32 floating point: memory uses 64-bit format, registers use 80 bits
• Benchmark data caused overflow of 64 bits, but not 80
Pointer Code

  void combine4p(vec_ptr v, int *dest)
  {
      int length = vec_length(v);
      int *data = get_vec_start(v);
      int *dend = data+length;
      int sum = 0;
      while (data < dend) {
          sum += *data;
          data++;
      }
      *dest = sum;
  }

• Optimization
• Use pointers rather than array references
• CPE: 3.00 (compiled -O2)
• Oops! We're not making progress here!
• Warning: some compilers do a better job optimizing array code
Pointer vs. Array Code Inner Loops

Array code:

  .L24:                      # Loop:
  addl (%eax,%edx,4),%ecx    # sum += data[i]
  incl %edx                  # i++
  cmpl %esi,%edx             # i:length
  jl .L24                    # if < goto Loop

Pointer code:

  .L30:                      # Loop:
  addl (%eax),%ecx           # sum += *data
  addl $4,%eax               # data++
  cmpl %edx,%eax             # data:dend
  jb .L30                    # if < goto Loop

• Performance
• Array code: 4 instructions in 2 clock cycles
• Pointer code: almost the same 4 instructions in 3 clock cycles
Modern CPU Design
• (Block diagram: an Instruction Control unit — fetch control, instruction cache, instruction decode, retirement unit, register file — issues decoded operations to the Execution unit's functional units: integer/branch, general integer, FP add, FP mult/div, load, and store. Operation results and register updates flow back; the load/store units exchange addresses and data with the data cache; branch prediction is checked ("Prediction OK?") at retirement.)
Superscalar Processor
• Definition: a superscalar processor can issue and execute multiple instructions in one cycle. The instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically.
• Benefit: without programming effort, a superscalar processor can take advantage of the instruction level parallelism that most programs have
• Most CPUs since about 1998 are superscalar
• Intel: since Pentium Pro
CPU Capabilities of Pentium III
• Multiple instructions can execute in parallel:
• 1 load, 1 store, 2 integer (one may be branch), 1 FP addition, 1 FP multiplication or division
• Some instructions take > 1 cycle, but can be pipelined:

  Instruction                 Latency   Cycles/Issue
  Load / Store                   3           1
  Integer Multiply               4           1
  Integer Divide                36          36
  Double/Single FP Multiply      5           2
  Double/Single FP Add           3           1
  Double/Single FP Divide       38          38

• Latency: # of cycles the functional unit requires to complete an operation
• Issue time: # of cycles between successive independent operations
Latency versus Throughput
• Consider a pipelined multiplier with latency 10 and issue time 1:

  Integer Multiply    latency 10    cycles/issue 1
  Step 1:  1 cycle
  Step 2:  1 cycle
  ...
  Step 10: 1 cycle

• Consequence:
• How fast can 10 independent int mults be executed? t1 = t2*t3; t4 = t5*t6; …
• How fast can 10 sequentially dependent int mults be executed? t1 = t2*t3; t4 = t5*t1; t6 = t7*t4; …
• Major problem for fast execution: keep pipelines filled
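As a sketch of the two cases (our example): with latency 10 and issue time 1, ten independent multiplies can enter the pipeline on consecutive cycles and finish in roughly 10 + 9 = 19 cycles, while ten sequentially dependent multiplies serialize into about 10 × 10 = 100 cycles.

  /* Independent: products can issue back-to-back. */
  int prod_independent(const int *a, const int *b) {
      int p0 = a[0]*b[0], p1 = a[1]*b[1];
      int p2 = a[2]*b[2], p3 = a[3]*b[3];
      return p0 + p1 + p2 + p3;
  }

  /* Dependent: each multiply must wait for the previous result. */
  int prod_dependent(const int *a) {
      int p = a[0];
      p = p * a[1];
      p = p * a[2];
      p = p * a[3];
      return p;
  }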