Code Optimization Roadmap for Orion Sky: Understanding and Improving Performance

Code Optimization Orion Sky Lawlor olawlor@uiuc.edu 2003/9/17

Roadmap • Introduction • gprof • Timer calls • Understanding Performance

Introduction • Scientific Performance Method: • Measure (don’t assume!) • Find the bottlenecks in the code • They aren’t where you expect! • Fix the worst problems first • Consider stopping-- is it good enough? • Fix • Improve algorithms first • Improve implementations second • Repeat (indefinitely)

gprof– UNIX performance tool • Compile and link with “-pg” flag • Adds instrumentation to code • Run serial program normally • Instrumentation runs automatically • Run gprof to analyze trace • `gprof pgm gmon.out` • Shows a function-level view of execution time • Heisenberg measurement error!

Timer Calls • CPU time (virtual time) • Time spent running your code • CmiCpuTimer()– 10ms resolution • Wall clock time (real time) • Includes OS interference, network delays, context-switching overhead • CmiWallTimer()– to 1ns resolution • This is what really counts

How to call the timer: one time • What’s wrong with this? • double s=CmiWallTimer(); • foo(); • double e=CmiWallTimer()-s; • CkPrintf(“foo took %fs\n”,e); • If CmiWallTimer takes 100ns, and foo takes 50ns, this may print 150ns! • Only a problem for very fast functions (or slow timers!)

How to call the timer: repeat • Repetition can decrease apparent timer overhead and increase resolution: • const int n=1000; • double s=CmiWallTimer(); • for (i=0;i<n;i++) foo(); • double e=(CmiWallTimer()-s)/n; • CkPrintf(“foo took %fs\n”,e); • Problem: what if foo’s performance is cache-sensitive?

Understanding Performance: CPU GHz 1ns Integer Arithmetic Floating Point Branches Cache 10ns (int)x Subroutine / Memory / or % 100ns “inline” MHz 1us Bitshifts and masks *(int *)&x 10us 100us Cache-friendly algorithms KHz 1ms

Understanding Performance: OS GHz 1ns Timer 10ns sin, tan,... 100ns Malloc Syscall MHz 1us 10us Tables, identities Reuse buffers 100us Avoid KHz 1ms Disk Timeslice

Understanding Performance: Net GHz 1ns 10ns Message Combining Rethink 100ns Probe MHz 1us RDMA Barrier Message 10us 100us KHz 1ms

Conclusions • Performance is the whole point of parallel programming • painful but necessary • Asymptotics matter– find O(n) • Constants matter– count mallocs • Scientific Performance Method: • Measure • Fix • Repeat

Code Optimization Roadmap for Orion Sky: Understanding and Improving Performance

Code Optimization Roadmap for Orion Sky: Understanding and Improving Performance

Presentation Transcript

Code Optimization

Code Optimization and Performance

Code Optimization

Generic Code Optimization

C66x Code Optimization

Keystone Architecture Code Optimization

More Code Optimization

Code Optimization

More Code Optimization

Code Optimization

9. Quadruple Code optimization

Sparse code optimization

More Code Optimization

Final Code Generation and Code Optimization

Code Optimization

More Code Optimization

Code Optimization

Lecture 34: Code Optimization

Code Optimization II: Machine Dependent Optimization

Generic Code Optimization

Code Optimization