Code Optimization

Presentation Transcript


  1. Code Optimization

  2. Outline • Machine-Independent Optimization • Code motion • Memory optimization • Suggested reading • 5.2 ~ 5.6

  3. Motivation • Constant factors matter too! • easily see 10:1 performance range depending on how code is written • must optimize at multiple levels • algorithm, data representations, procedures, and loops

  4. Motivation • Must understand system to optimize performance • how programs are compiled and executed • how to measure program performance and identify bottlenecks • how to improve performance without destroying code modularity and generality

  5. 5.2 Expressing Program Performance

  6. Time Scales P382 • Absolute Time • Typically use nanoseconds • 10^-9 seconds • Time scale of computer instructions

  7. Time Scales • Clock Cycles • Most computers controlled by high frequency clock signal • Typical Range • 100 MHz • 10^8 cycles per second • Clock period = 10 ns • 2 GHz • 2 × 10^9 cycles per second • Clock period = 0.5 ns
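
The relation between clock frequency and clock period on the slide above can be checked with a short calculation. This is a minimal sketch; the helper names `clock_period_ns` and `cycles_to_ns` are hypothetical and chosen only for illustration.

```c
/* Hypothetical helper: clock period in nanoseconds for a frequency in Hz. */
double clock_period_ns(double freq_hz)
{
    return 1e9 / freq_hz;
}

/* Hypothetical helper: absolute time in nanoseconds for a cycle count. */
double cycles_to_ns(double cycles, double freq_hz)
{
    return cycles * clock_period_ns(freq_hz);
}
```

Calling `clock_period_ns(100e6)` gives the 10 ns period listed for 100 MHz, and `clock_period_ns(2e9)` gives 0.5 ns for 2 GHz.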

  8. CPE P383
      void vsum1(int n)
      {
          int i;
          for (i = 0; i < n; i++)
              c[i] = a[i] + b[i];
      }

  9. CPE P383
      /* Sum vector of n elements (n must be even) */
      void vsum2(int n)
      {
          int i;
          for (i = 0; i < n; i += 2) {
              /* Compute two elements per iteration */
              c[i] = a[i] + b[i];
              c[i+1] = a[i+1] + b[i+1];
          }
      }
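
vsum2 is only correct when n is even. One common way to lift that restriction is a short cleanup loop after the unrolled loop. The sketch below assumes the arrays are passed as parameters rather than being the globals a, b, c used on the slides; the name `vsum2_any` is hypothetical.

```c
/* Hypothetical variant of vsum2 that accepts any n >= 0. */
void vsum2_any(const int *a, const int *b, int *c, int n)
{
    int i;
    int limit = n - 1;   /* last index at which a full pair still fits */

    /* Unrolled loop: two elements per iteration */
    for (i = 0; i < limit; i += 2) {
        c[i] = a[i] + b[i];
        c[i+1] = a[i+1] + b[i+1];
    }

    /* Cleanup loop: handles the final element when n is odd */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```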

  10. Cycles Per Element • Convenient way to express performance of a program that operates on vectors or lists • Length = n • T = CPE*n + Overhead
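
Given T = CPE*n + Overhead, the CPE is the slope of the line relating cycle count to vector length. A minimal sketch of recovering slope and overhead from two measurements follows; the function names are hypothetical, and any cycle counts passed in are assumed to come from a timer that is not shown here. A least-squares fit over many lengths would be more robust than two points.

```c
/* Hypothetical helper: CPE as the slope between two (n, cycles) points. */
double estimate_cpe(long n1, double cycles1, long n2, double cycles2)
{
    return (cycles2 - cycles1) / (double)(n2 - n1);
}

/* Hypothetical helper: overhead is the intercept of T = CPE*n + Overhead. */
double estimate_overhead(long n1, double cycles1, double cpe)
{
    return cycles1 - cpe * (double)n1;
}
```

For example, with made-up measurements of 480 cycles at n = 100 and 880 cycles at n = 200, the estimate is a CPE of 4.0 and an overhead of 80 cycles.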

  11. Cycles Per Element (Figure 5.2, P383) • vsum1: slope = 4.0 • vsum2: slope = 3.5

  12. 5.3 Program Example • 5.4 Eliminating Loop Inefficiencies • 5.5 Reducing Procedure Calls • 5.6 Eliminating Unneeded Memory References

  13. 5.3 Program Example

  14. Vector ADT P384
      typedef int data_t;
      typedef struct {
          int len;
          data_t *data;
      } vec_rec, *vec_ptr;
      [Figure 5.3, P385: vector header with a length field and a data pointer to elements 0, 1, 2, ..., length-1]

  15. Procedures P386 • vec_ptr new_vec(int len) (P386) • Create vector of specified length • data_t *get_vec_start(vec_ptr v) (P392) • Return pointer to start of vector data

  16. Procedures P386 • int get_vec_element(vec_ptr v, int index, data_t *dest) • Retrieve vector element, store at *dest • Return 0 if out of bounds, 1 if successful • Similar to array implementations in Pascal, Java • E.g., always do bounds checking

  17. Vector ADT
      vec_ptr new_vec(int len)
      {
          /* allocate header structure */
          vec_ptr result = (vec_ptr) malloc(sizeof(vec_rec));
          if (!result)
              return NULL;
          result->len = len;  /* record the length read by vec_length */

  18. Vector ADT
          /* allocate array */
          if (len > 0) {
              data_t *data = (data_t *) calloc(len, sizeof(data_t));
              if (!data) {
                  free((void *) result);
                  return NULL;  /* couldn't allocate storage */
              }
              result->data = data;
          } else
              result->data = NULL;
          return result;
      }

  19. Vector ADT
      /*
       * Retrieve vector element and store at dest.
       * Return 0 (out of bounds) or 1 (successful)
       */
      int get_vec_element(vec_ptr v, int index, data_t *dest)
      {
          if (index < 0 || index >= v->len)
              return 0;
          *dest = v->data[index];
          return 1;
      }

  20. Vector ADT
      /* Return length of vector */
      int vec_length(vec_ptr v)
      {
          return v->len;
      }
      /* Return pointer to start of vector data (P392) */
      data_t *get_vec_start(vec_ptr v)
      {
          return v->data;
      }
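
The vector ADT pieces from the preceding slides can be put together and exercised as follows. This is a compact sketch that restates the slide code in one place; `free_vec` is a hypothetical helper that does not appear on the slides.

```c
#include <stdlib.h>

typedef int data_t;

typedef struct {
    int len;
    data_t *data;
} vec_rec, *vec_ptr;

/* Create a zero-filled vector of the given length; NULL on failure. */
vec_ptr new_vec(int len)
{
    vec_ptr result = (vec_ptr) malloc(sizeof(vec_rec));
    if (!result)
        return NULL;
    result->len = len;
    result->data = NULL;
    if (len > 0) {
        result->data = (data_t *) calloc((size_t) len, sizeof(data_t));
        if (!result->data) {
            free(result);
            return NULL;
        }
    }
    return result;
}

/* Bounds-checked read: returns 0 if index is out of range, 1 on success. */
int get_vec_element(vec_ptr v, int index, data_t *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Hypothetical helper (not on the slides): release a vector. */
void free_vec(vec_ptr v)
{
    if (v) {
        free(v->data);
        free(v);
    }
}
```

With a vector created by `new_vec(3)`, reading index 0 succeeds and yields 0 (calloc zero-fills the array), while reading index 3 or -1 returns 0 and leaves `*dest` untouched.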

  21. Optimization Example P385
      #ifdef ADD
      #define IDENT 0
      #define OPER +
      #else
      #define IDENT 1
      #define OPER *
      #endif

  22. Optimization Example P387
      void combine1(vec_ptr v, data_t *dest)
      {
          int i;
          *dest = IDENT;
          for (i = 0; i < vec_length(v); i++) {
              data_t val;
              get_vec_element(v, i, &val);
              *dest = *dest OPER val;
          }
      }

  23. Optimization Example • Procedure • Compute sum (product) of all elements of vector • Store result at destination location
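
The IDENT and OPER macros let one source file produce either a sum or a product. A minimal sketch of the mechanism on a plain array follows; `combine_array` is a hypothetical name, and ADD is defined in the source here on the assumption that it would otherwise be supplied on the compiler command line (for example with -DADD).

```c
#define ADD   /* assumption: normally supplied by the build, e.g. -DADD */

#ifdef ADD
#define IDENT 0
#define OPER +
#else
#define IDENT 1
#define OPER *
#endif

typedef int data_t;

/* Hypothetical helper: combine len elements starting from the identity. */
data_t combine_array(const data_t *data, int len)
{
    data_t acc = IDENT;
    int i;
    for (i = 0; i < len; i++)
        acc = acc OPER data[i];
    return acc;
}
```

With ADD defined, an empty array combines to 0 and {1, 2, 3, 4} combines to 10; without it, the same code would start from 1 and multiply.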

  24. 5.3 Program Example P387
      void combine1(vec_ptr v, int *dest)
      {
          int i;
          *dest = 0;
          for (i = 0; i < vec_length(v); i++) {
              int val;
              get_vec_element(v, i, &val);
              *dest += val;
          }
      }

  25. Optimization Example P385 • Procedure • Compute sum of all elements of integer vector • Store result at destination location • Vector data structure and operations defined via abstract data type • Pentium II/III Performance: CPE • 42.06 (Compiled -g) • 31.25 (Compiled -O2)

  26. Understanding Loop
      void combine1_goto(vec_ptr v, int *dest)
      {
          int i = 0;
          int val;
          *dest = 0;
          if (i >= vec_length(v))
              goto done;
        loop:                          /* from here to "goto loop" is 1 iteration */
          get_vec_element(v, i, &val);
          *dest += val;
          i++;
          if (i < vec_length(v))
              goto loop;
        done:
          return;
      }

  27. 5.4 Eliminating Loop Inefficiencies • Inefficiency • Procedure vec_length called every iteration • Even though result always the same

  28. Code Motion P388
      void combine2(vec_ptr v, int *dest)
      {
          int i;
          int length = vec_length(v);
          *dest = 0;
          for (i = 0; i < length; i++) {
              int val;
              get_vec_element(v, i, &val);
              *dest += val;
          }
      }

  29. Code Motion P388 • Optimization • Move call to vec_length out of inner loop • Value does not change from one iteration to next • Code motion • CPE: 22.61 (Compiled -O2) • vec_length requires only constant time, but significant overhead
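
Code motion matters even more when the hoisted call is not constant time. The sketch below uses string lowercasing as an illustration; it is not taken from these slides, and the names `lower1` and `lower2` are chosen for this example. Because strlen scans the whole string, calling it in the loop test makes the loop quadratic in the string length, while hoisting it makes the loop linear.

```c
#include <string.h>

/* strlen is re-evaluated on every iteration: quadratic in the length. */
void lower1(char *s)
{
    size_t i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] = (char)(s[i] - 'A' + 'a');
}

/* Code motion: the length is computed once, before the loop. */
void lower2(char *s)
{
    size_t i;
    size_t len = strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] = (char)(s[i] - 'A' + 'a');
}
```

Both functions produce the same result. A compiler usually cannot make this transformation on its own, because the loop writes to the string and it is hard to prove that the call returns the same value on every iteration.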

  30. 5.5 Reducing Procedure Calls: Reduction in Strength P392
      void combine3(vec_ptr v, int *dest)
      {
          int i;
          int length = vec_length(v);
          int *data = get_vec_start(v);
          *dest = 0;
          for (i = 0; i < length; i++) {
              *dest += data[i];
          }
      }

  31. Reduction in Strength • Optimization • Avoid procedure call to retrieve each vector element • Get pointer to start of array before loop • Within loop just do pointer reference • Not as clean in terms of data abstraction • CPE: 6.00 (Compiled -O2) • Procedure calls are expensive! • Bounds checking is expensive

  32. 5.6 Eliminating Unneeded Memory References P394
      void combine4(vec_ptr v, int *dest)
      {
          int i;
          int length = vec_length(v);
          int *data = get_vec_start(v);
          int sum = 0;
          for (i = 0; i < length; i++)
              sum += data[i];
          *dest = sum;
      }

  33. Eliminate Unneeded Memory References • Optimization • Don’t need to store in destination until end • Local variable sum held in register • Avoids 1 memory read, 1 memory write per iteration • CPE: 2.00 (Compiled -O2) • Memory references are expensive!
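
Accumulating in a local variable is not always equivalent to accumulating through the pointer, which is one reason a compiler may not apply this change by itself. The sketch below restates the combine3 and combine4 loops on plain arrays, under the assumption that dest is allowed to point into the data being summed; the function names are hypothetical.

```c
/* combine3-style loop: reads and writes *dest on every iteration. */
void sum_to_memory(int *data, int length, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < length; i++)
        *dest += data[i];
}

/* combine4-style loop: accumulates locally and stores once at the end. */
void sum_in_local(int *data, int length, int *dest)
{
    int i;
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}
```

For data = {2, 3, 5} with dest pointing at a separate variable, both store 10. With dest pointing at data[0], the first version zeroes that element before reading it and stores 8, while the second still stores 10.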

  34. Detecting Unneeded Memory References
      Combine3
      .L18:
          movl (%ecx,%edx,4),%eax
          addl %eax,(%edi)
          incl %edx
          cmpl %esi,%edx
          jl .L18
      Combine4
      .L24:
          addl (%eax,%edx,4),%ecx
          incl %edx
          cmpl %esi,%edx
          jl .L24

  35. Detecting Unneeded Memory References • Performance • Combine3 • 5 instructions in 6 clock cycles • addl must read and write memory • Combine4 • 4 instructions in 2 clock cycles
