
More Code Optimization


Presentation Transcript


  1. More Code Optimization

  2. Outline
  • Memory Performance
  • Tuning Performance
  • Suggested reading: 5.12 ~ 5.14

  3. Load Performance
  • The load unit can only initiate one load operation every clock cycle (Issue = 1.0), but list_len is bound by the 4-cycle load latency instead: each movq (%rdi),%rdi must wait for the previous load to complete before it can start.

  typedef struct ELE {
      struct ELE *next;
      int data;
  } list_ele, *list_ptr;

  int list_len(list_ptr ls)
  {
      int len = 0;
      while (ls) {
          len++;
          ls = ls->next;
      }
      return len;
  }

  len in %eax, ls in %rdi
  .L11:
      addl  $1, %eax
      movq  (%rdi), %rdi
      testq %rdi, %rdi
      jne   .L11

  Function       CPE
  list_len       4.0
  load latency   4.0

  4. Store Performance
  • The store unit can only initiate one store operation every clock cycle (Issue = 1.0).

  void array_clear(int *dest, int n)
  {
      int i;
      for (i = 0; i < n; i++)
          dest[i] = 0;
  }

  Function      CPE
  array_clear   2.0

  5. Store Performance
  • The store unit can only initiate one store operation every clock cycle (Issue = 1.0); unrolling by 4 amortizes the loop overhead over four stores, so the loop reaches that throughput bound.

  void array_clear_4(int *dest, int n)
  {
      int i;
      int limit = n-3;
      for (i = 0; i < limit; i += 4) {
          dest[i]   = 0;
          dest[i+1] = 0;
          dest[i+2] = 0;
          dest[i+3] = 0;
      }
      for ( ; i < n; i++)
          dest[i] = 0;
  }

  Function        CPE
  array_clear_4   1.0

  6. Store Performance

  void write_read(int *src, int *dest, int n)
  {
      int cnt = n;
      int val = 0;
      while (cnt--) {
          *dest = val;
          val = (*src)+1;
      }
  }

  Example A: write_read(&a[0], &a[1], 3)
            initial    iter1     iter2      iter3
  cnt       3          2         1          0
  a         [-10, 17]  [-10, 0]  [-10, -9]  [-10, -9]
  val       0          -9        -9         -9

  Example B: write_read(&a[0], &a[0], 3)
            initial    iter1     iter2      iter3
  cnt       3          2         1          0
  a         [-10, 17]  [0, 17]   [1, 17]    [2, 17]
  val       0          1         2          3

  Function    CPE
  Example A   2.0
  Example B   6.0
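  The two tables above can be reproduced directly. Below is a minimal test driver; the initial array contents {-10, 17} follow the tables, while the printing is an illustrative addition.

  #include <stdio.h>

  void write_read(int *src, int *dest, int n)
  {
      int cnt = n;
      int val = 0;
      while (cnt--) {
          *dest = val;
          val = (*src)+1;
      }
  }

  int main(void)
  {
      /* Example A: src and dest are different locations */
      int a[2] = {-10, 17};
      write_read(&a[0], &a[1], 3);
      printf("Example A: a = {%d, %d}\n", a[0], a[1]);   /* expected {-10, -9} */

      /* Example B: src and dest alias, so every load must read
         back the value just stored by the previous iteration */
      int b[2] = {-10, 17};
      write_read(&b[0], &b[0], 3);
      printf("Example B: b = {%d, %d}\n", b[0], b[1]);   /* expected {2, 17} */
      return 0;
  }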

  7. Load and Store Units
  [Figure: the store unit keeps pending stores in a store buffer, each entry holding an address and its data; the load unit compares load addresses against the store-buffer addresses (matching addresses); both units are connected to the data cache.]

  8. Graphical Representation

  // inner loop
  while (cnt--) {
      *dest = val;
      val = (*src)+1;
  }

  movl %eax,(%ecx)    s_addr, s_data
  movl (%ebx),%eax    load
  addl $1,%eax        add
  subl $1,%edx        sub
  jne  loop           jne

  [Figure: data-flow graph of the inner loop over registers %eax, %ebx, %ecx, %edx, with the store decomposed into s_addr and s_data operations]

  9. Graphical Representation
  [Figure: the data-flow graph abstracted to the operations that carry values between iterations — s_data, load, add, and sub — showing how %eax and %edx are updated from one iteration to the next]

  10. Graphical Representation
  [Figure: critical paths for Example A and Example B. When the source and destination addresses differ (Example A), the load does not depend on s_data; when they match (Example B), each load must wait for the preceding store's data, so the s_data → load → add chain becomes the critical path.]

  Function    CPE
  Example A   2.0
  Example B   6.0

  11. Getting High Performance
  • High-level design
    • Choose appropriate algorithms and data structures for the problem at hand
    • Be especially vigilant to avoid algorithms or coding techniques that yield asymptotically poor performance

  12. Getting High Performance
  • Basic coding principles: avoid optimization blockers so that a compiler can generate efficient code
    • Eliminate excessive function calls
      • Move computations out of loops when possible
      • Consider selective compromises of program modularity to gain greater efficiency
    • Eliminate unnecessary memory references (see the sketch below)
      • Introduce temporary variables to hold intermediate results
      • Store a result in an array or global variable only when the final value has been computed
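  As an illustration of the memory-reference point, here is a minimal sketch; the function names sum_to_dest and sum_in_temp are assumptions for illustration, not code from the slides.

  /* Accumulates directly into *dest on every iteration: the compiler must
     load and store *dest each time, since dest might alias a[]. */
  void sum_to_dest(const int *a, int n, int *dest)
  {
      int i;
      *dest = 0;
      for (i = 0; i < n; i++)
          *dest += a[i];
  }

  /* Accumulates in a temporary and writes *dest once the final value is
     known: the loop body needs only one load per element and no store. */
  void sum_in_temp(const int *a, int n, int *dest)
  {
      int i;
      int acc = 0;
      for (i = 0; i < n; i++)
          acc += a[i];
      *dest = acc;
  }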

  13. Getting High Performance
  • Low-level optimizations
    • Unroll loops to reduce overhead and to enable further optimizations
    • Find ways to increase instruction-level parallelism with techniques such as multiple accumulators and reassociation (see the sketch below)
    • Rewrite conditional operations in a functional style to enable compilation via conditional data transfers
    • Write cache-friendly code
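  A minimal sketch of loop unrolling combined with multiple accumulators, applied to summing an array; the 2 x 2 structure and the name sum_2x2 are assumptions for illustration.

  /* 2 x 2 unrolling: two elements per iteration, two independent
     accumulators, so additions from consecutive elements can overlap. */
  int sum_2x2(const int *a, int n)
  {
      int acc0 = 0, acc1 = 0;
      int i;
      int limit = n - 1;

      for (i = 0; i < limit; i += 2) {
          acc0 += a[i];
          acc1 += a[i+1];
      }
      for ( ; i < n; i++)    /* pick up a leftover element when n is odd */
          acc0 += a[i];
      return acc0 + acc1;
  }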

  14. Performance Tuning
  • Identify the hottest part of the program
  • Use a very useful method: profiling
    • Instrument the program
    • Run it with typical input data
    • Collect information from the result
    • Analyze the result

  15. Examples
  unix> gcc -O1 -pg prog.c -o prog
  unix> ./prog file.txt
  unix> gprof prog

    %   cumulative     self                 self    total
   time    seconds  seconds     calls     s/call   s/call  name
  97.58     173.05   173.05         1     173.05   173.05  sort_words
   2.36     177.24     4.19    965027       0.00     0.00  find_ele_rec
   0.12     177.46     0.22  12511031       0.00     0.00  Strlen

  16. Principle
  • Interval counting (see the sketch below)
    • Maintain a counter for each function, recording the time spent executing that function
    • The program is interrupted at regular intervals (e.g., every 1 ms)
      • Check which function is executing when the interrupt occurs
      • Increment the counter for that function
  • The calling information is quite reliable
  • By default, the timings for library functions are not shown
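  A minimal sketch of the interval-counting idea, assuming a POSIX system with setitimer/SIGPROF; a real profiler such as gprof does this inside the runtime and maps the interrupted program counter to a function, which is only mimicked here with a global index.

  #include <signal.h>
  #include <stdio.h>
  #include <sys/time.h>

  #define NFUNCS 2

  static volatile sig_atomic_t ticks[NFUNCS];  /* one tick counter per function   */
  static volatile sig_atomic_t current;        /* index of the "running" function */

  static void on_tick(int sig)
  {
      (void)sig;
      ticks[current]++;    /* charge this timer tick to whatever is running */
  }

  static void work(int which, long iters)
  {
      volatile long x = 0;
      long i;
      current = which;     /* toy stand-in for mapping the interrupted PC to a function */
      for (i = 0; i < iters; i++)
          x += i;
  }

  int main(void)
  {
      /* Deliver SIGPROF every 1 ms of CPU time consumed by the process. */
      struct itimerval tv = { {0, 1000}, {0, 1000} };
      signal(SIGPROF, on_tick);
      setitimer(ITIMER_PROF, &tv, NULL);

      work(0, 300000000L);   /* the "hot" function */
      work(1, 100000000L);   /* a cooler function  */

      printf("func0: %d ticks, func1: %d ticks\n", (int)ticks[0], (int)ticks[1]);
      return 0;
  }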

  17. Program Example
  • Task: analyze the n-gram statistics of a text document
    • An n-gram is a sequence of n words occurring in a document
  • The program:
    • reads a text file
    • creates a table of unique n-grams, recording how many times each one occurs
    • sorts the n-grams in descending order of occurrence

  18. Program Example
  • Steps
    • Convert strings to lowercase
    • Apply hash function
    • Read n-grams and insert into hash table (see the sketch below)
      • Mostly list operations
      • Maintain a counter for each unique n-gram
    • Sort results
  • Data set
    • Collected works of Shakespeare
    • 965,028 total words, 23,706 unique words
    • n = 2 ("bigrams"): 363,039 unique bigrams
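  A minimal sketch of the kind of bucket-list table the program maintains; the struct layout, hash function, and names (hash_string, insert_string, count_of) are assumptions for illustration, not the program's actual code.

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  struct ele {
      struct ele *next;   /* next entry in this bucket's chain  */
      char *str;          /* the n-gram stored as one string    */
      int count;          /* number of occurrences seen so far  */
  };

  #define NBUCKETS 1021
  static struct ele *table[NBUCKETS];

  static unsigned hash_string(const char *s)
  {
      unsigned h = 0;
      while (*s)
          h = h * 31 + (unsigned char)*s++;
      return h % NBUCKETS;
  }

  /* Look the n-gram up in its bucket and bump its counter,
     inserting a new entry at the head of the list if it is absent. */
  static void insert_string(const char *s)
  {
      unsigned b = hash_string(s);
      struct ele *p;
      for (p = table[b]; p; p = p->next) {
          if (strcmp(p->str, s) == 0) {
              p->count++;
              return;
          }
      }
      p = malloc(sizeof *p);
      p->str = malloc(strlen(s) + 1);
      strcpy(p->str, s);
      p->count = 1;
      p->next = table[b];
      table[b] = p;
  }

  static int count_of(const char *s)
  {
      struct ele *p;
      for (p = table[hash_string(s)]; p; p = p->next)
          if (strcmp(p->str, s) == 0)
              return p->count;
      return 0;
  }

  int main(void)
  {
      insert_string("to be");
      insert_string("be or");
      insert_string("to be");
      printf("count(\"to be\") = %d\n", count_of("to be"));  /* prints 2 */
      return 0;
  }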

  19. Example

  index  % time    self  children    called               name
                                     158655725            find_ele_rec [5]
                   4.19      0.02    965027/965027        insert_string [4]
  [5]       2.4    4.19      0.02    965027+158655725     find_ele_rec [5]
                   0.01      0.01    363039/363039        new_ele [10]
                   0.00      0.01    363039/363039        save_string [13]
                                     158655725            find_ele_rec [5]

  • Ratio: 158655725 / 965027 = 164.4
  • The average length of a list in one hash bucket is about 164

  20. Code Optimizations
  • First step: use a more efficient sorting function
    • Library function qsort (see the sketch below)
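  A minimal sketch of handing the collected n-grams to the library qsort with a descending-by-count comparison; the element type, sample entries, and counts are made up for illustration.

  #include <stdio.h>
  #include <stdlib.h>

  struct ngram {
      const char *str;
      int count;
  };

  /* qsort comparison: descending order of occurrence count */
  static int cmp_count_desc(const void *a, const void *b)
  {
      const struct ngram *x = a, *y = b;
      return (y->count > x->count) - (y->count < x->count);
  }

  int main(void)
  {
      struct ngram tab[] = {
          { "my lord",  208 },   /* illustrative, not real Shakespeare counts */
          { "thou art", 104 },
          { "i am",     243 },
      };
      size_t n = sizeof tab / sizeof tab[0];

      qsort(tab, n, sizeof tab[0], cmp_count_desc);

      for (size_t i = 0; i < n; i++)
          printf("%6d  %s\n", tab[i].count, tab[i].str);
      return 0;
  }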

  21. Further Optimizations

  22. Optimizations
  • Iter first: use an iterative function to insert elements into the linked list
    • Causes the code to slow down
  • Iter last: iterative function that places each new entry at the end of the list (see the sketch below)
    • Tends to keep the most common words at the front of the list
  • Big table: increase the number of hash buckets
  • Better hash: use a more sophisticated hash function
  • Linear lower: move strlen out of the loop
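  A minimal sketch of the "iter last" variant: an iterative bucket search that appends new entries at the tail, so entries seen early, typically the most common n-grams, stay near the front of each list. It reuses the hypothetical struct ele, table, and hash_string from the sketch after slide 18.

  /* Iterative bucket search; new entries go at the END of the list. */
  static void insert_string_iter_last(const char *s)
  {
      struct ele **pp = &table[hash_string(s)];

      while (*pp) {
          if (strcmp((*pp)->str, s) == 0) {
              (*pp)->count++;
              return;
          }
          pp = &(*pp)->next;
      }
      struct ele *p = malloc(sizeof *p);    /* not found: append at the tail */
      p->str = malloc(strlen(s) + 1);
      strcpy(p->str, s);
      p->count = 1;
      p->next = NULL;
      *pp = p;
  }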

  23. Code Motion

  /* Convert string to lowercase: slow */
  void lower1(char *s)
  {
      int i;

      for (i = 0; i < strlen(s); i++)
          if (s[i] >= 'A' && s[i] <= 'Z')
              s[i] -= ('A' - 'a');
  }

  24. Code Motion

  /* Convert string to lowercase: faster */
  void lower2(char *s)
  {
      int i;
      int len = strlen(s);

      for (i = 0; i < len; i++)
          if (s[i] >= 'A' && s[i] <= 'Z')
              s[i] -= ('A' - 'a');
  }

  25. Code Motion

  /* Sample implementation of library function strlen */
  /* Compute length of string */
  size_t strlen(const char *s)
  {
      int length = 0;
      while (*s != '\0') {
          s++;
          length++;
      }
      return length;
  }
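  Because strlen itself walks the whole string, calling it in lower1's loop test makes the conversion quadratic in the string length, while lower2 stays linear. A minimal timing check is sketched below, assuming lower1 and lower2 as defined on the slides above; the string length and the use of clock() are arbitrary illustrative choices.

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  void lower1(char *s);   /* slow version from the slide above        */
  void lower2(char *s);   /* code-motion version from the slide above */

  int main(void)
  {
      size_t n = 50000;    /* large enough to expose lower1's quadratic cost */
      char *s1 = malloc(n + 1), *s2 = malloc(n + 1);
      memset(s1, 'A', n);  s1[n] = '\0';
      memcpy(s2, s1, n + 1);

      clock_t t0 = clock();
      lower1(s1);
      clock_t t1 = clock();
      lower2(s2);
      clock_t t2 = clock();

      printf("lower1: %.3f s   lower2: %.3f s\n",
             (double)(t1 - t0) / CLOCKS_PER_SEC,
             (double)(t2 - t1) / CLOCKS_PER_SEC);
      free(s1);
      free(s2);
      return 0;
  }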

  26. Code Motion

  27. Performance Tuning
  • Benefits
    • Helps identify performance bottlenecks
    • Especially useful for complex systems with many components
  • Limitations
    • Only shows performance for the data tested
      • E.g., "linear lower" did not show a big gain, since the words are short
      • A quadratic inefficiency could remain lurking in the code
    • The timing mechanism is fairly crude
      • Only works for programs that run for more than about 3 seconds

  28. Amdahl's Law
  Tnew = (1-α)Told + (αTold)/k = Told[(1-α) + α/k]
  S = Told / Tnew = 1 / [(1-α) + α/k]
  S∞ = 1 / (1-α)
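  A worked example with illustrative numbers: if the optimized part accounts for α = 0.6 of the original run time and is sped up by k = 3, then

  S = 1 / [(1 - 0.6) + 0.6/3] = 1 / 0.6 ≈ 1.67

  and even as k → ∞ the overall speedup cannot exceed

  S∞ = 1 / (1 - 0.6) = 2.5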
