
SYSC 5704 Elements of Computer Systems

SYSC 5704: Elements of Computer Systems. Optimization to take advantage of hardware. Fall 2011. Objectives: simple things to speed up your program. Optimize; watch procedure calls; code motion; strength reduction; common expression removal.

feryal

Presentation Transcript


  1. SYSC 5704: Elements of Computer Systems. Optimization to take advantage of hardware. Fall 2011.

  2. Objectives: Simple things to speed up your program.
      • Optimize
      • Watch procedure calls
      • Code motion
      • Strength reduction
      • Common expression removal

  3. Example: Matrix Multiplication. The best code is 160x faster than a plain triple loop, and the triple-loop code is not obviously stupid. Standard desktop computer and compiler, using optimization flags. Both implementations have exactly the same operation count (2n^3). What is going on?

  4. How did they do it?
      • Multiple threads (4x)
      • Vector instructions (4x)
      • Memory hierarchy and other optimizations (20x)
      • Blocking or tiling, loop unrolling, array scalarization, instruction scheduling.
      • More instruction-level parallelism, better register usage, fewer L1/L2 cache misses, fewer TLB misses.
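The blocking (tiling) idea on this slide can be sketched as follows. This is only an illustration, not the "best code" the slides refer to: the function names and the tile size BLOCK = 32 are assumptions chosen for the example, and a real tile size would be tuned to the cache. Both functions perform the same 2n^3 operations; the blocked one visits the data tile by tile so that each tile can be reused while it is still in cache.

```c
#include <stddef.h>

/* Hypothetical tile size; a real value would be tuned to the cache. */
#define BLOCK 32

/* Reference triple loop: c += a * b, all n x n, row-major. */
void mmm_naive(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Blocked version: same arithmetic, performed one tile at a time. */
void mmm_blocked(const double *a, const double *b, double *c, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                size_t i_end = min_sz(ii + BLOCK, n);
                size_t k_end = min_sz(kk + BLOCK, n);
                size_t j_end = min_sz(jj + BLOCK, n);
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        /* Array scalarization: keep a[i][k] in a local. */
                        double aik = a[i*n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i*n + j] += aik * b[k*n + j];
                    }
            }
}
```

Any speedup depends on the machine, the compiler, and the matrix size; for matrices that already fit in cache, blocking may make no difference.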

  5. The effect of naive coding. One can lose 10-100x in performance, or more!
      • Algorithm (O(n^2) vs O(log n))
      • Coding style (too many procedure calls, reordering, unrolling)
      • Algorithm structure (locality, instruction-level parallelism)
      • Data representation
      This is why we need to understand computer architecture!

  6. Hint 1: Use the optimizer!

          double a[4][4];
          double b[4][4];
          double c[4][4];   /* set to zero */

          /* Multiply 4 x 4 matrices a and b */
          void mmm(double *a, double *b, double *c, int n)
          {
              int i, j, k;
              for (i = 0; i < 4; i++)
                  for (j = 0; j < 4; j++)
                      for (k = 0; k < 4; k++)
                          c[i*4+j] += a[i*4 + k]*b[k*4 + j];
          }

      Compiled without flags: ~1300 cycles.
      Compiled with -O3 -m64 -march=… -fno-tree-vectorize: ~150 cycles.
      Core 2 Duo, 2.66 GHz.

  7. Roadblocks. The compiler is conservative. Aliasing (pointers) causes trouble. Whole-program optimization is too expensive.
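The aliasing roadblock can be illustrated with a small sketch. The function names are hypothetical. In the first version the compiler must assume that b[i] could be an element of a, so it has to keep b[i] up to date in memory on every inner iteration. The second version accumulates in a local variable and uses the C99 restrict qualifier to promise that a and b do not overlap.

```c
/* b[i] = sum of row i of the n x n row-major matrix a.
 * a and b might overlap, so b[i] is read and written in memory
 * on every inner iteration. */
void sum_rows_mem(const double *a, double *b, long n)
{
    for (long i = 0; i < n; i++) {
        b[i] = 0.0;
        for (long j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

/* Same result when a and b do not overlap. The local accumulator can
 * live in a register, and restrict states the no-overlap assumption. */
void sum_rows_local(const double *restrict a, double *restrict b, long n)
{
    for (long i = 0; i < n; i++) {
        double val = 0.0;
        for (long j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;
    }
}
```

The two functions are interchangeable only under the no-overlap assumption; calling the restrict version with overlapping arguments is undefined behavior.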

  8. Hint 2: Procedure calls. Small procedures are better for software engineering, but can be costly.
      • Costs go way up if the procedure checks its arguments.
      • Check bounds outside of the loop, and design by contract.
      In-line!
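A minimal sketch of "check bounds outside of the loop" follows. All names here are hypothetical. The first summation validates its arguments inside a small accessor on every element; the second checks the whole range once, before the loop, and then uses an unchecked accessor that the compiler is free to inline.

```c
#include <assert.h>
#include <stddef.h>

/* Checked accessor: validates its arguments on every call. */
double get_checked(const double *v, size_t len, size_t i)
{
    assert(v != NULL && i < len);
    return v[i];
}

/* Sum v[lo..hi-1], paying for a check on every element. */
double sum_checked(const double *v, size_t len, size_t lo, size_t hi)
{
    double s = 0.0;
    for (size_t i = lo; i < hi; i++)
        s += get_checked(v, len, i);
    return s;
}

/* Contract: the caller guarantees that i is in range. */
static inline double get_unchecked(const double *v, size_t i)
{
    return v[i];
}

/* Sum v[lo..hi-1], checking the whole range once, outside the loop. */
double sum_hoisted(const double *v, size_t len, size_t lo, size_t hi)
{
    assert(v != NULL && lo <= hi && hi <= len);
    double s = 0.0;
    for (size_t i = lo; i < hi; i++)
        s += get_unchecked(v, i);
    return s;
}
```

Whether this matters in practice depends on the compiler; an optimizer may already inline the checked accessor, but it cannot always prove that the per-element check is redundant.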

  9. Hint 3: Code Motion. Reduce the frequency with which a computation is performed, if it will always produce the same result. This especially means moving code out of a loop. Sometimes also called precomputation or hoisting.

      Before:

          void set_row(double *a, double *b, long i, long n)
          {
              long j;
              for (j = 0; j < n; j++)
                  a[n*i+j] = b[j];
          }

      After (n*i is computed once, outside the loop):

          void set_row(double *a, double *b, long i, long n)
          {
              long j;
              long ni = n*i;   /* long, to match n and i */
              for (j = 0; j < n; j++)
                  a[ni+j] = b[j];
          }

  10. Even worse example! strlen loops until it finds a null, so we call this over and over again! Move the call to strlen outside of the loop, since the result does not change from one iteration to another. What about 'A' - 'a'?

      Before:

          void lower(char *s)
          {
              int i;
              for (i = 0; i < strlen(s); i++)
                  if (s[i] >= 'A' && s[i] <= 'Z')
                      s[i] -= ('A' - 'a');
          }

      After:

          void lower(char *s)
          {
              int i;
              int len = strlen(s);
              for (i = 0; i < len; i++)
                  if (s[i] >= 'A' && s[i] <= 'Z')
                      s[i] -= ('A' - 'a');
          }

  11. Performance. Time quadruples when the string length doubles: quadratic performance. Hoisting results in linear performance. [Figure: CPU seconds (log scale) versus string length (log scale).]
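The quadratic behavior can be made visible without a timer by counting how many characters strlen has to scan. The sketch below is an illustration only: counted_strlen, chars_scanned, and scan_cost are hypothetical instrumentation added for this example, wrapped around the two versions of lower from the previous slide.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical instrumentation: characters scanned by strlen calls. */
static size_t chars_scanned;

static size_t counted_strlen(const char *s)
{
    size_t n = strlen(s);
    chars_scanned += n + 1;   /* n characters plus the terminating null */
    return n;
}

/* strlen is re-evaluated on every loop test: n+1 calls of n+1 chars. */
void lower_quadratic(char *s)
{
    for (size_t i = 0; i < counted_strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* strlen is hoisted out of the loop: one call of n+1 chars. */
void lower_linear(char *s)
{
    size_t len = counted_strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* Run f on s and report how many characters strlen scanned. */
size_t scan_cost(void (*f)(char *), char *s)
{
    chars_scanned = 0;
    f(s);
    return chars_scanned;
}
```

For a string of length n, the first version scans (n+1)^2 characters and the second scans n+1. For a 5-character string that is 36 against 6, and doubling n roughly quadruples the first count while only doubling the second.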

  12. Hint 4: Strength Reduction. Replace a costly operation with a simpler one.
      • Example: shift/add instead of multiply or divide: 16*x → x << 4
      • Utility is machine dependent: it depends on the cost of the multiply or divide instruction. On Pentium IV, integer multiply requires 10 CPU cycles.
      • Example: recognize a sequence of products.

      Before:

          for (i = 0; i < n; i++)
              for (j = 0; j < n; j++)
                  a[n*i + j] = b[j];

      After (the product n*i becomes a running sum):

          int ni = 0;
          for (i = 0; i < n; i++) {
              for (j = 0; j < n; j++)
                  a[ni + j] = b[j];
              ni += n;
          }
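The shift/add example can be written out as a small sketch. The function names are hypothetical, and unsigned operands are an assumption made here so that the shifts are well defined for every input.

```c
#include <stdint.h>

/* Multiply by 16 using a multiply instruction. */
uint32_t times16_mul(uint32_t x)
{
    return 16u * x;
}

/* Same result using a shift: 16 = 2^4. */
uint32_t times16_shift(uint32_t x)
{
    return x << 4;
}

/* Shift/add for a constant that is not a power of two:
 * 10*x = 8*x + 2*x. */
uint32_t times10_shift_add(uint32_t x)
{
    return (x << 3) + (x << 1);
}
```

Optimizing compilers usually perform this rewrite themselves for constant multipliers, so writing the shift by hand mostly matters when the optimizer is off or cannot see the constant.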

  13. Hint 5: Share Common Subexpressions. Reuse portions of expressions (factoring!). Compilers are often not very sophisticated in exploiting arithmetic properties.

      3 mults: i*n, (i-1)*n, (i+1)*n

          /* Sum neighbors of i,j */
          up    = val[(i-1)*n + j  ];
          down  = val[(i+1)*n + j  ];
          left  = val[i*n     + j-1];
          right = val[i*n     + j+1];
          sum = up + down + left + right;

      1 mult: i*n

          int inj = i*n + j;
          up    = val[inj - n];
          down  = val[inj + n];
          left  = val[inj - 1];
          right = val[inj + 1];
          sum = up + down + left + right;
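The two fragments above can be wrapped into functions to check that they agree. This is a sketch under assumptions: the function names, the double element type, and the long index type are choices made for the example, and the caller is assumed to pass an interior cell (0 < i < n-1, 0 < j < n-1) so that all four neighbors exist.

```c
/* Sum the four neighbors of (i, j) in an n x n row-major grid.
 * Uses three multiplications: (i-1)*n, (i+1)*n, i*n. */
double sum_neighbors_3mul(const double *val, long n, long i, long j)
{
    double up    = val[(i-1)*n + j];
    double down  = val[(i+1)*n + j];
    double left  = val[i*n + j-1];
    double right = val[i*n + j+1];
    return up + down + left + right;
}

/* Same sum with one multiplication: i*n is computed once and the
 * other indices are derived from it by adding or subtracting. */
double sum_neighbors_1mul(const double *val, long n, long i, long j)
{
    long inj = i*n + j;
    double up    = val[inj - n];
    double down  = val[inj + n];
    double left  = val[inj - 1];
    double right = val[inj + 1];
    return up + down + left + right;
}
```

On a 3 x 3 grid holding 1 through 9, the neighbors of the center cell are 2, 8, 4, and 6, so both functions return 20.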
