1 / 22

Code Optimization II: Machine Independent Optimizations

Code Optimization II: Machine Independent Optimizations. Systems I. Topics Machine-Independent Optimizations Code motion Reduction in strength Common subexpression sharing Tuning Identifying performance bottlenecks. length. 0. 1. 2. length–1. data.   . Vector ADT. Procedures

jude
Download Presentation

Code Optimization II: Machine Independent Optimizations

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Code Optimization II:Machine Independent Optimizations Systems I • Topics • Machine-Independent Optimizations • Code motion • Reduction in strength • Common subexpression sharing • Tuning • Identifying performance bottlenecks

  2. length 0 1 2 length–1 data    Vector ADT • Procedures vec_ptr new_vec(int len) • Create vector of specified length int get_vec_element(vec_ptr v, int index, int *dest) • Retrieve vector element, store at *dest • Return 0 if out of bounds, 1 if successful int *get_vec_start(vec_ptr v) • Return pointer to start of vector data • Similar to array implementations in Pascal, ML, Java • E.g., always do bounds checking

  3. Optimization Example void combine1(vec_ptr v, int *dest) { int i; *dest = 0; for (i = 0; i < vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; } } • Procedure • Compute sum of all elements of integer vector • Store result at destination location • Vector data structure and operations defined via abstract data type • Pentium II/III Performance: Clock Cycles / Element • 42.06 (Compiled -g) 31.25 (Compiled -O2)

  4. Reduction in Strength void combine2(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); *dest = 0; for (i = 0; i < length; i++) { *dest += data[i]; } • Optimization • Avoid procedure call to retrieve each vector element • Get pointer to start of array before loop • Within loop just do pointer reference • Not as clean in terms of data abstraction • CPE: 6.00 (Compiled -O2) • Procedure calls are expensive! • Bounds checking is expensive

  5. Eliminate Unneeded Memory Refs void combine3(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int sum = 0; for (i = 0; i < length; i++) sum += data[i]; *dest = sum; } • Optimization • Don’t need to store in destination until end • Local variable sum held in register • Avoids 1 memory read, 1 memory write per cycle • CPE: 2.00 (Compiled -O2) • Memory references are expensive!

  6. Detecting Unneeded Memory Refs. Combine2 Combine3 • Performance • Combine2 • 5 instructions in 6 clock cycles • addl must read and write memory • Combine3 • 4 instructions in 2 clock cycles .L18: movl (%ecx,%edx,4),%eax addl %eax,(%edi) incl %edx cmpl %esi,%edx jl .L18 .L24: addl (%eax,%edx,4),%ecx incl %edx cmpl %esi,%edx jl .L24

  7. Optimization Blocker: Memory Aliasing • Aliasing • Two different memory references specify single location • Example • v: [3, 2, 17] • combine2(v, get_vec_start(v)+2) --> ? • combine3(v, get_vec_start(v)+2) --> ? • Observations • Easy to have happen in C • Since allowed to do address arithmetic • Direct access to storage structures • Get in habit of introducing local variables • Accumulating within loops • Your way of telling compiler not to check for aliasing

  8. Previous Best Combining Code void combine4(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int sum = 0; for (i = 0; i < length; i++) sum += data[i]; *dest = sum; } • Task • Compute sum of all elements in vector • Vector represented by C-style abstract data type • Achieved CPE of 2.00 • Cycles per element

  9. Data Types Use different declarations for data_t int float double Operations Use different definitions of OP and IDENT + / 0 * / 1 General Forms of Combining void abstract_combine4(vec_ptr v, data_t *dest) { int i; int length = vec_length(v); data_t *data = get_vec_start(v); data_t t = IDENT; for (i = 0; i < length; i++) t = t OP data[i]; *dest = t; }

  10. Machine Independent Opt. Results • Optimizations • Reduce function calls and memory references within loop • Performance Anomaly • Computing FP product of all elements exceptionally slow. • Very large speedup when accumulate in temporary • Caused by quirk of IA32 floating point • Memory uses 64-bit format, register use 80 • Benchmark data caused overflow of 64 bits, but not 80

  11. Pointer Code void combine4p(vec_ptr v, int *dest) { int length = vec_length(v); int *data = get_vec_start(v); int *dend = data+length; int sum = 0; while (data < dend) { sum += *data; data++; } *dest = sum; } • Optimization • Use pointers rather than array references • CPE: 3.00 (Compiled -O2) • Oops! We’re not making progress here! Warning: Some compilers do better job optimizing array code

  12. Pointer vs. Array Code Inner Loops • Array Code • Pointer Code • Performance • Array Code: 4 instructions in 2 clock cycles • Pointer Code: Almost same 4 instructions in 3 clock cycles .L24: # Loop: addl (%eax,%edx,4),%ecx # sum += data[i] incl %edx # i++ cmpl %esi,%edx # i:length jl .L24 # if < goto Loop .L30: # Loop: addl (%eax),%ecx # sum += *data addl $4,%eax # data ++ cmpl %edx,%eax # data:dend jb .L30 # if < goto Loop

  13. Machine-Independent Opt. Summary • Code Motion • Compilers are good at this for simple loop/array structures • Don’t do well in presence of procedure calls and memory aliasing • Reduction in Strength • Shift, add instead of multiply or divide • compilers are (generally) good at this • Exact trade-offs machine-dependent • Keep data in registers rather than memory • compilers are not good at this, since concerned with aliasing • Share Common Subexpressions • compilers have limited algebraic reasoning capabilities

  14. Important Tools • Measurement • Accurately compute time taken by code • Most modern machines have built in cycle counters • Using them to get reliable measurements is tricky • Profile procedure calling frequencies • Unix tool gprof • Observation • Generating assembly code • Lets you see what optimizations compiler can make • Understand capabilities/limitations of particular compiler

  15. Code Profiling Example • Task • Count word frequencies in text document • Produce sorted list of words from most frequent to least • Steps • Convert strings to lowercase • Apply hash function • Read words and insert into hash table • Mostly list operations • Maintain counter for each unique word • Sort results • Data Set • Collected works of Shakespeare • 946,596 total words, 26,596 unique • Initial implementation: 9.2 seconds • Shakespeare’s • most frequent words

  16. Code Profiling • Augment Executable Program with Timing Functions • Computes (approximate) amount of time spent in each function • Time computation method • Periodically (~ every 10ms) interrupt program • Determine what function is currently executing • Increment its timer by interval (e.g., 10ms) • Also maintains counter for each function indicating number of times called • Using gcc –O2 –pg prog.c –o prog ./prog • Executes in normal fashion, but also generates file gmon.out gprof prog • Generates profile information based on gmon.out

  17. Profiling Results % cumulative self self total time seconds seconds calls ms/call ms/call name 86.60 8.21 8.21 1 8210.00 8210.00 sort_words 5.80 8.76 0.55 946596 0.00 0.00 lower1 4.75 9.21 0.45 946596 0.00 0.00 find_ele_rec 1.27 9.33 0.12 946596 0.00 0.00 h_add • Call Statistics • Number of calls and cumulative time for each function • Performance Limiter • Using inefficient sorting algorithm • Single call uses 87% of CPU time

  18. Code Optimizations • First step: Use more efficient sorting function • Library function qsort

  19. Further Optimizations • Iter first: Use iterative function to insert elements into linked list • Causes code to slow down • Iter last: Iterative function, places new entry at end of list • Tend to place most common words at front of list • Big table: Increase number of hash buckets • Better hash: Use more sophisticated hash function • Linear lower: Move strlen out of loop

  20. Profiling Observations • Benefits • Helps identify performance bottlenecks • Especially useful when have complex system with many components • Limitations • Only shows performance for data tested • E.g., linear lower did not show big gain, since words are short • Quadratic inefficiency could remain lurking in code • Timing mechanism fairly crude • Only works for programs that run for > 3 seconds

  21. Role of Programmer How should I write my programs, given that I have a good, optimizing compiler? • Don’t: Smash Code into Oblivion • Hard to read, maintain, & assure correctness • Do: • Select best algorithm • Write code that’s readable & maintainable • Procedures, recursion, without built-in constant limits • Even though these factors can slow down code • Eliminate optimization blockers • Allows compiler to do its job • Focus on Inner Loops • Do detailed optimizations where code will be executed repeatedly • Will get most performance gain here

  22. Summary • Today • Optimization blocker: procedure calls • Optimization blocker: memory aliasing • Tools (profiling) for understanding performance • Next time • Memory system optimization

More Related