CS 3214 Computer Systems

Presentation Transcript


  1. CS 3214 Computer Systems
     Godmar Back
     Lecture 9

  2. Announcements
     • Stay tuned for Exercise 5
     • Project 2 due Sep 30
     • Auto-fail rule 2:
       • Need at least Firecracker to blow up to pass class.
     CS 3214 Fall 2010

  3. Some of the following slides are taken with permission from Complete PowerPoint Lecture Notes for Computer Systems: A Programmer's Perspective (CS:APP), Randal E. Bryant and David R. O'Hallaron, http://csapp.cs.cmu.edu/public/lectures.html
     Part 2: Code Optimization

  4. Roles of Programmer vs Compiler
     • Programmer:
       • Choice of algorithm, Big-O
       • Manual application of some optimizations
       • Choice of program structure that’s amenable to optimization
       • Avoidance of “optimization blockers”
     [Diagram: spectrum from High-Level (Programmer) to Low-Level (Compiler)]

  5. Roles of Programmer vs Compiler
     • Optimizing Compiler
       • Applies transformations that preserve semantics, but reduce the amount of, or time spent in, computations
       • Provides efficient mapping of code to machine:
         • Selects and orders code
         • Performs register allocation
       • Usually consists of multiple stages
     [Diagram: spectrum from High-Level (Programmer) to Low-Level (Compiler)]

  6. Eliminating Memory Accesses, Take 1
     • Registers are faster than memory

         double sp1(double *x, double *y)
         {
             double sum = *x * *x + *y * *y;
             double diff = *x * *x - *y * *y;
             return sum * diff;
         }

     How many memory accesses?

         sp1:
             movsd   (%rdi), %xmm1
             movsd   (%rsi), %xmm2
             mulsd   %xmm1, %xmm1
             mulsd   %xmm2, %xmm2
             movapd  %xmm1, %xmm0
             subsd   %xmm2, %xmm1
             addsd   %xmm2, %xmm0
             mulsd   %xmm1, %xmm0
             ret

     • The number of memory accesses is not related to how often pointer dereferences occur in the source code

  7. Eliminating Memory Accesses, Take 2
     • Order of accesses matters

         void sp1(double *x, double *y, double *sum, double *prod)
         {
             *sum = *x + *y;
             *prod = *x * *y;
         }

     How many memory accesses?

         sp1:
             movsd   (%rdi), %xmm0
             addsd   (%rsi), %xmm0
             movsd   %xmm0, (%rdx)
             movsd   (%rdi), %xmm0
             mulsd   (%rsi), %xmm0
             movsd   %xmm0, (%rcx)
             ret

  8. Eliminating Memory Accesses, Take 3
     • Compiler doesn’t know that sum or prod will never point to the same location as x or y!

         void sp2(double *x, double *y, double *sum, double *prod)
         {
             double xlocal = *x;
             double ylocal = *y;
             *sum = xlocal + ylocal;
             *prod = xlocal * ylocal;
         }

     How many memory accesses?

         sp2:
             movsd   (%rdi), %xmm0
             movsd   (%rsi), %xmm2
             movapd  %xmm0, %xmm1
             mulsd   %xmm2, %xmm0
             addsd   %xmm2, %xmm1
             movsd   %xmm1, (%rdx)
             movsd   %xmm0, (%rcx)
             ret
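The aliasing concern on slides 7 and 8 can be made concrete with a small experiment. The sketch below repeats the two functions from the slides and adds a hypothetical helper, `prod_when_sum_aliases_x` (not part of the lecture code), that calls either version with `sum` pointing at the same double as `x`. Under that assumption the two versions compute different products, which is why the compiler may not turn the first into the second on its own.

```c
/* Version from slide 7: re-reads *x and *y after the store to *sum. */
void sp1(double *x, double *y, double *sum, double *prod)
{
    *sum = *x + *y;
    *prod = *x * *y;
}

/* Version from slide 8: reads *x and *y exactly once into locals. */
void sp2(double *x, double *y, double *sum, double *prod)
{
    double xlocal = *x;
    double ylocal = *y;
    *sum = xlocal + ylocal;
    *prod = xlocal * ylocal;
}

/* Hypothetical helper for illustration: calls f with sum aliased to x
 * (x = 2, y = 3) and returns the product that f stored. */
double prod_when_sum_aliases_x(void (*f)(double *, double *, double *, double *))
{
    double x = 2.0, y = 3.0, prod = 0.0;
    f(&x, &y, &x, &prod);   /* third argument aliases the first */
    return prod;
}
```

With `sp1`, the store to `*sum` overwrites `x` with 5 before the multiplication, so the product is 15; with `sp2` the product is computed from the saved copies and is 6.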

  9. Inlining
     • Substitute body of called function into the caller
       • *before subsequent optimizations are applied*
     • Current compilers do this aggressively
     • Almost never a need for doing this manually (e.g., via #define)

  10. Inlining Example

         void sp1(double *x, double *y, double *sum, double *prod)
         {
             *sum = *x + *y;
             *prod = *x * *y;
         }

         double outersp1(double *x, double *y)
         {
             double sum, prod;
             sp1(x, y, &sum, &prod);
             return sum > prod ? sum : prod;
         }

         outersp1:
             movsd   (%rdi), %xmm1
             movsd   (%rsi), %xmm2
             movapd  %xmm1, %xmm0
             mulsd   %xmm2, %xmm1
             addsd   %xmm2, %xmm0
             maxsd   %xmm1, %xmm0
             ret

  11. Case Study: Vector ADT
     [Diagram: vector record with a length field and a data pointer to elements 0, 1, 2, ..., length-1]
     • Procedures
       • vec_ptr new_vec(int len)
         • Create vector of specified length
       • int get_vec_element(vec_ptr v, int index, int *dest)
         • Retrieve vector element, store at *dest
         • Return 0 if out of bounds, 1 if successful
       • int *get_vec_start(vec_ptr v)
         • Return pointer to start of vector data
     • Similar to array implementations in Pascal, ML, Java
       • E.g., always do bounds checking
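The slide lists only the interface of the vector ADT. The following is a minimal sketch of one possible implementation, assuming `int` elements and the record layout implied by the later inlining slide (fields `len` and `data`). The struct name `vec_rec`, the function `free_vec`, and the choice to return NULL on allocation failure are assumptions added for illustration, not the lecture's code.

```c
#include <stdlib.h>

/* Assumed record layout: a length plus a pointer to the element array. */
typedef struct {
    int len;
    int *data;
} vec_rec, *vec_ptr;

/* Create a zero-initialized vector of the given length; NULL on failure. */
vec_ptr new_vec(int len)
{
    if (len < 0)
        return NULL;
    vec_ptr v = malloc(sizeof *v);
    if (v == NULL)
        return NULL;
    v->len = len;
    /* Allocate at least one element so a zero-length vector is still valid. */
    v->data = calloc(len > 0 ? (size_t)len : 1, sizeof *v->data);
    if (v->data == NULL) {
        free(v);
        return NULL;
    }
    return v;
}

/* Return the number of elements. */
int vec_length(vec_ptr v)
{
    return v->len;
}

/* Bounds-checked read: returns 0 if out of bounds, 1 if successful. */
int get_vec_element(vec_ptr v, int index, int *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Return a pointer to the start of the vector data (no bounds checking). */
int *get_vec_start(vec_ptr v)
{
    return v->data;
}

/* Hypothetical cleanup helper, not listed on the slide. */
void free_vec(vec_ptr v)
{
    free(v->data);
    free(v);
}
```

Every read through `get_vec_element` pays for a call and a bounds check, which is the cost the later `combine` variants work to remove.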

  12. Optimization Example

         void combine1(vec_ptr v, int *dest)
         {
             int i;
             *dest = 0;
             for (i = 0; i < vec_length(v); i++) {
                 int val;
                 get_vec_element(v, i, &val);
                 *dest += val;
             }
         }

     • Procedure
       • Compute sum of all elements of vector
       • Store result at destination location

  13. Time Scales
     • Absolute Time
       • Typically use nanoseconds: 10^-9 seconds
       • Time scale of computer instructions
     • Clock Cycles
       • Example: rlogin cluster machines: 2 GHz
         • 2 × 10^9 cycles per second
         • Clock period = 0.5 ns
       • Most modern architectures provide a way to directly read a cycle counter: “TSC” (“time stamp counter”)
         • But: can be tricky because it captures OS interaction as well
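The relation between clock rate, clock period, and elapsed time on this slide is simple arithmetic: a clock running at f GHz has a period of 1/f nanoseconds. As a small sketch (the function names `clock_period_ns` and `cycles_to_ns` are hypothetical, chosen here for illustration):

```c
/* Clock period in nanoseconds for a clock rate given in GHz.
 * At 2 GHz this is 0.5 ns, as on the slide. */
double clock_period_ns(double ghz)
{
    return 1.0 / ghz;
}

/* Convert a cycle count to nanoseconds at the given clock rate. */
double cycles_to_ns(double cycles, double ghz)
{
    return cycles * clock_period_ns(ghz);
}
```

For example, 42 cycles on a 2 GHz machine correspond to 21 ns.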

  14. Cycles Per Element
     • Convenient way to express performance of a program that operates on vectors or lists
     • For length n: T = CPE*n + Overhead
     [Figure: T versus length n; vsum1 slope = 4.0, vsum2 slope = 3.5]
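Given cycle counts measured at several vector lengths, the CPE is the slope of the line T = CPE*n + Overhead. One way to estimate it is an ordinary least-squares line fit, sketched below. The type `cpe_fit` and the function `fit_cpe` are hypothetical names, not part of the lecture code, and the sketch assumes at least two distinct lengths are supplied.

```c
#include <stddef.h>

/* Result of fitting t ≈ cpe * n + overhead. */
typedef struct {
    double cpe;       /* slope: cycles per element */
    double overhead;  /* intercept: fixed cost in cycles */
} cpe_fit;

/* Least-squares fit over count (n[i], t[i]) pairs.
 * Assumes count >= 2 and that the n[i] are not all equal. */
cpe_fit fit_cpe(const double *n, const double *t, size_t count)
{
    double sn = 0.0, st = 0.0, snn = 0.0, snt = 0.0;
    for (size_t i = 0; i < count; i++) {
        sn  += n[i];
        st  += t[i];
        snn += n[i] * n[i];
        snt += n[i] * t[i];
    }
    double c = (double)count;
    cpe_fit r;
    r.cpe = (c * snt - sn * st) / (c * snn - sn * sn);
    r.overhead = (st - r.cpe * sn) / c;
    return r;
}
```

Feeding it points that lie exactly on a line with slope 4.0 recovers a CPE of 4.0, matching the vsum1 slope quoted on the slide; real measurements are noisy, so the fit gives an estimate only.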

  15. Optimization Example

         void combine1(vec_ptr v, int *dest)
         {
             int i;
             *dest = 0;
             for (i = 0; i < vec_length(v); i++) {
                 int val;
                 get_vec_element(v, i, &val);
                 *dest += val;
             }
         }

     • Procedure
       • Compute sum of all elements of integer vector
       • Store result at destination location
       • Vector data structure and operations defined via abstract data type
     • Pentium II/III Performance: Clock Cycles / Element
       • 42.06 (Compiled -g)
       • 31.25 (Compiled -O2)

  16. Understanding Loop

         void combine1_goto(vec_ptr v, int *dest)
         {
             int i = 0;
             int val;
             *dest = 0;
             if (i >= vec_length(v))
                 goto done;
         loop:
             get_vec_element(v, i, &val);
             *dest += val;
             i++;
             if (i < vec_length(v))
                 goto loop;
         done:
             return;
         }

     [Annotation: the statements from loop: through the final test make up 1 iteration]
     • Inefficiency
       • Procedure vec_length called every iteration
       • Even though result always the same

  17. Move vec_length Call Out of Loop

         void combine2(vec_ptr v, int *dest)
         {
             int i;
             int length = vec_length(v);
             *dest = 0;
             for (i = 0; i < length; i++) {
                 int val;
                 get_vec_element(v, i, &val);
                 *dest += val;
             }
         }

     • Optimization
       • Move call to vec_length out of inner loop
         • Value does not change from one iteration to next
         • Code motion
       • CPE: 20.66 (Compiled -O2)
         • vec_length requires only constant time, but significant overhead

  18. Code Motion Example #2

         void lower(char *s)
         {
             int i;
             for (i = 0; i < strlen(s); i++)
                 if (s[i] >= 'A' && s[i] <= 'Z')
                     s[i] -= ('A' - 'a');
         }

     • Convert string from upper to lower case
     • Here: asymptotic complexity becomes O(n^2)! (strlen is O(n) and is called on every iteration)

  19. Lower Case Conversion Performance
     • Time quadruples when string length doubles
     • Quadratic performance

  20. Performance after Code Motion
     • Time doubles when string length doubles
     • Linear performance
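The transcript does not include the code behind the linear result. A plausible version, assuming the same code motion that was applied to `vec_length`, hoists the `strlen` call out of the loop. The name `lower2` is chosen here for illustration.

```c
#include <string.h>

/* Convert upper-case ASCII letters in s to lower case, in place.
 * strlen is called once, so the whole function is O(n). */
void lower2(char *s)
{
    size_t len = strlen(s);   /* hoisted: computed once before the loop */
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```

The hoisting is safe here because the loop only changes the case of letters and never writes a terminating '\0', so the string length cannot change while the loop runs. The programmer can see that; the compiler generally has to prove it.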

  21. Optimization Blocker: Procedure Calls
     • Why couldn’t the compiler move vec_length or strlen out of the inner loop?
       • Procedure may have side effects
         • Alters global state each time called
       • Function may not return same value for given arguments
         • Depends on other parts of global state
         • Procedure lower could interact with strlen
     • What if compiler looks at code? Or inlines them?
       • Even then, compiler may not be able to prove that the same result is obtained, or the possibility of aliasing may require repeating the operation; and compiler must preserve any side effects
       • Interprocedural optimization is expensive, but compilers are continuously getting better at it
         • For instance, taking into account whether a function reads or writes global memory
       • Today’s compilers are different from the compilers of 5 years ago and will be different from those 5 years from now
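To see why a side effect blocks code motion, consider a length function that also counts how often it is called. The sketch below is purely illustrative; `call_count`, `counted_length`, `calls_in_loop`, and `calls_hoisted` are hypothetical names. Hoisting the call changes the observable value of the counter, so a compiler that must preserve semantics cannot hoist it automatically.

```c
/* Global state modified by every call: an observable side effect. */
static int call_count = 0;

/* Returns the length of s and, as a side effect, counts the call. */
int counted_length(const char *s)
{
    int n = 0;
    call_count++;
    while (s[n] != '\0')
        n++;
    return n;
}

/* Calls counted_length in the loop condition; returns the call count. */
int calls_in_loop(const char *s)
{
    call_count = 0;
    for (int i = 0; i < counted_length(s); i++) {
        /* loop body intentionally empty */
    }
    return call_count;
}

/* Calls counted_length once before the loop; returns the call count. */
int calls_hoisted(const char *s)
{
    call_count = 0;
    int len = counted_length(s);
    for (int i = 0; i < len; i++) {
        /* loop body intentionally empty */
    }
    return call_count;
}
```

For a 4-character string the in-loop version evaluates the condition five times (four successful tests plus the final failing one), while the hoisted version makes a single call.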

  22. Remove Bounds Checking

         void combine3(vec_ptr v, int *dest)
         {
             int i;
             int length = vec_length(v);
             int *data = get_vec_start(v);
             *dest = 0;
             for (i = 0; i < length; i++) {
                 *dest += data[i];
             }
         }

     • Optimization
       • Avoid procedure call to retrieve each vector element
         • Get pointer to start of array before loop
         • Within loop just do pointer reference
         • Not as clean in terms of data abstraction
       • CPE: 6.00 (Compiled -O2)
         • Procedure calls are expensive!
         • Bounds checking is expensive

  23. Eliminate Unneeded Memory Refs

         void combine4(vec_ptr v, int *dest)
         {
             int i;
             int length = vec_length(v);
             int *data = get_vec_start(v);
             int sum = 0;
             for (i = 0; i < length; i++)
                 sum += data[i];
             *dest = sum;
         }

     • Optimization
       • Don’t need to store in destination until end
       • Local variable sum held in register
       • Avoids 1 memory read, 1 memory write per iteration
       • CPE: 2.00 (Compiled -O2)
         • Memory references are expensive!
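The compiler cannot rewrite `combine3` into `combine4` by itself because `dest` might point into the vector data. The sketch below illustrates this with simplified versions that take a raw array instead of a `vec_ptr`; the names `sum3` and `sum4` and their signatures are assumptions made to keep the example self-contained.

```c
/* combine3-style: accumulates directly in *dest on every iteration. */
void sum3(int *data, int length, int *dest)
{
    *dest = 0;
    for (int i = 0; i < length; i++)
        *dest += data[i];
}

/* combine4-style: accumulates in a local and stores once at the end. */
void sum4(int *data, int length, int *dest)
{
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}
```

When `dest` does not alias the data, both store the same sum. When `dest` points at `data[0]` of the array {1, 2, 3}, `sum3` first zeroes that element and ends with 5, while `sum4` reads the original values and stores 6. Introducing the local temporary is therefore a decision the programmer has to make.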

  24. Detecting Unneeded Memory Refs.
     Combine3

         .L18:
             movl (%ecx,%edx,4),%eax
             addl %eax,(%edi)
             incl %edx
             cmpl %esi,%edx
             jl .L18

     Combine4

         .L24:
             addl (%eax,%edx,4),%ecx
             incl %edx
             cmpl %esi,%edx
             jl .L24

     • Performance
       • Combine3
         • 5 instructions in 6 clock cycles
         • addl must read and write memory
       • Combine4
         • 4 instructions in 2 clock cycles

  25. Pointer Code

         void combine4p(vec_ptr v, int *dest)
         {
             int length = vec_length(v);
             int *data = get_vec_start(v);
             int *dend = data+length;
             int sum = 0;
             while (data < dend) {
                 sum += *data;
                 data++;
             }
             *dest = sum;
         }

     Big question: Should you rewrite your array code as pointer code to “help” the compiler?
     • Optimization
       • Use pointers rather than array references
       • CPE: 3.00 (Compiled -O2)
         • Oops! Worse than the best array version
     Warning: Some compilers do a better job optimizing array code

  26. Pointer vs. Array Code Inner Loops
     • Array Code

         .L24:                        # Loop:
             addl (%eax,%edx,4),%ecx  # sum += data[i]
             incl %edx                # i++
             cmpl %esi,%edx           # i:length
             jl .L24                  # if < goto Loop

     • Pointer Code

         .L30:                        # Loop:
             addl (%eax),%ecx         # sum += *data
             addl $4,%eax             # data++
             cmpl %edx,%eax           # data:dend
             jb .L30                  # if < goto Loop

     • Performance
       • Array Code: 4 instructions in 2 clock cycles
       • Pointer Code: Almost same 4 instructions in 3 clock cycles

  27. Pointer vs. Array Code
     • Difficult to predict which would be faster
     • Compiler may transform array code to pointer form if it deems it useful
     • As a rule, the compiler optimizes array code as well as or better than pointer code
       • Writing as array code allows use of index variable in index-based address modes
     • Should prefer array form for readability

  28. Lessons so far (1)
     • Does not matter how many local variables or temporaries you introduce
     • Does not matter if you use constants, expressions, or const local variables, or write-once local variables
       • So optimize for readability, not the compiler
     • Does not matter how many pointer derefs you have in your code (*, [ ], ->) as long as there’s no intervening write/store to memory
       • If there is, compiler must repeat the ‘load’
     • Avoid introducing ‘stores’ by introducing local temporaries that defer the write to memory whenever possible
     • Don’t rewrite array code into pointer form

  29. Lessons so far (2)
     • Inlining changes the game substantially
     • Compiler will aggressively inline functions whose definitions occur in the same compilation unit
       • Does not matter if declared ‘static’ or not; but must be static if included in multiple files to avoid multiple strong symbols
     • Can remove abstraction penalty entirely in many cases
       • No need for manual inlining, using macros
     • Inlining can generate better code because it enables optimizations not possible without knowing the caller:
       • Potential for aliasing of pointer arguments may be reduced, allowing for more precise and less conservative points-to analysis
       • May even be able to remove bounds checks (next slide)
     • Caveat: inlining is not possible if the target of the call is not known to the compiler
       • E.g., non-final, non-private methods in Java, or “virtual” methods in C++; so declare your methods final or private in Java whenever possible

  30. combine1 Example under inlining

         /*
          * Retrieve vector element and store at dest.
          * Return 0 (out of bounds) or 1 (successful)
          */
         int get_vec_element(vec_ptr v, int index, data_t *dest)
         {
             if (index < 0 || index >= v->len)
                 return 0;
             *dest = v->data[index];
             return 1;
         }

         /* Return length of vector */
         int vec_length(vec_ptr v)
         {
             return v->len;
         }

         void combine1(vec_ptr v, int *dest)
         {
             int i;
             *dest = 0;
             for (i = 0; i < vec_length(v); i++) {
                 int val;
                 get_vec_element(v, i, &val);
                 *dest += val;
             }
         }

     • Procedure
       • Compute sum of all elements of vector
       • Store result at destination location

  31. Form after inlining

         void combine1(vec_ptr v, int *dest)
         {
             int i;
             *dest = 0;
             for (i = 0; i < v->len; i++) {
                 int val;
                 int ret;                      /* return value of the inlined call */
                 if (i < 0 || i >= v->len) {   /* becomes redundant! */
                     ret = 0;
                     goto skip;
                 }
                 val = v->data[i];
                 ret = 1;
             skip:
                 /* caller ignored return value */
                 *dest += val;
             }
         }

     Generated code:

         combine1:
             pushl   %ebp
             movl    %esp, %ebp
             movl    12(%ebp), %ecx
             pushl   %esi
             movl    8(%ebp), %esi
             pushl   %ebx
             movl    $0, (%ecx)
             movl    (%esi), %eax
             testl   %eax, %eax
             jle     .L375
             movl    4(%esi), %ebx
             xorl    %edx, %edx
             .p2align 4,,7
         .L374:
             movl    (%ebx,%edx,4), %eax
             addl    $1, %edx
             addl    %eax, (%ecx)
             cmpl    %edx, (%esi)
             jg      .L374
         .L375:
             popl    %ebx
             popl    %esi
             popl    %ebp
             ret