Code Optimization


  1. Code Optimization
     Orion Sky Lawlor, olawlor@uiuc.edu, 2003/9/17

  2. Roadmap
     • Introduction
     • gprof
     • Timer calls
     • Understanding Performance

  3. Introduction
     • Scientific Performance Method:
       • Measure (don’t assume!)
         • Find the bottlenecks in the code
         • They aren’t where you expect!
         • Fix the worst problems first
         • Consider stopping: is it good enough?
       • Fix
         • Improve algorithms first
         • Improve implementations second
       • Repeat (indefinitely)

  4. gprof: UNIX performance tool
     • Compile and link with the “-pg” flag
       • Adds instrumentation to the code
     • Run the serial program normally
       • Instrumentation runs automatically
     • Run gprof to analyze the trace
       • `gprof pgm gmon.out`
       • Shows a function-level view of execution time
     • Heisenberg measurement error!
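
For concreteness, here is a minimal sketch of the gprof workflow on a toy C program; the file name, function, and problem size are illustrative and not from the slides.

    /* hotspot.c: toy program for a gprof walkthrough.
       Compile and link with instrumentation:  gcc -pg hotspot.c -o hotspot
       Run normally (writes gmon.out):         ./hotspot
       Analyze the trace:                      gprof hotspot gmon.out        */
    #include <stdio.h>

    static double slow_sum(int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)          /* deliberately expensive O(n^2) loop */
            for (int j = 0; j < n; j++)
                s += 1.0 / (1.0 + i + j);
        return s;
    }

    int main(void) {
        printf("sum = %f\n", slow_sum(3000));
        return 0;
    }

gprof’s flat profile would then show slow_sum at the top of its function-level view. Remember that the instrumentation itself perturbs the timings: that is the Heisenberg measurement error the slide warns about.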

  5. Timer Calls
     • CPU time (virtual time)
       • Time spent running your code
       • CmiCpuTimer(): 10ms resolution
     • Wall clock time (real time)
       • Includes OS interference, network delays, context-switching overhead
       • CmiWallTimer(): resolution to 1ns
       • This is what really counts
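
The difference between the two kinds of time can be seen with standard C/POSIX calls. The sketch below uses clock() and gettimeofday() as stand-ins for CmiCpuTimer() and CmiWallTimer(), since it assumes no Charm++ environment.

    /* Sketch: CPU (virtual) time vs. wall-clock (real) time. */
    #include <stdio.h>
    #include <time.h>
    #include <sys/time.h>
    #include <unistd.h>

    static double cpu_seconds(void) {        /* advances only while we run */
        return (double)clock() / CLOCKS_PER_SEC;
    }
    static double wall_seconds(void) {       /* includes OS, I/O, waiting */
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1.0e-6 * tv.tv_usec;
    }

    int main(void) {
        double c0 = cpu_seconds(), w0 = wall_seconds();
        sleep(1);                            /* burns wall time, almost no CPU time */
        printf("cpu: %.3fs  wall: %.3fs\n",
               cpu_seconds() - c0, wall_seconds() - w0);
        return 0;
    }

Sleeping (or waiting on the network or disk) advances wall-clock time but almost no CPU time, which is why wall-clock time is what really counts for users.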

  6. How to call the timer: one time
     • What’s wrong with this?

         double s=CmiWallTimer();
         foo();
         double e=CmiWallTimer()-s;
         CkPrintf("foo took %fs\n",e);

     • If CmiWallTimer takes 100ns, and foo takes 50ns, this may print 150ns!
     • Only a problem for very fast functions (or slow timers!)
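
One common mitigation, not shown on the slide, is to estimate the timer’s own cost from back-to-back calls and subtract it. A self-contained sketch, using gettimeofday() in place of CmiWallTimer() and a hypothetical foo():

    #include <stdio.h>
    #include <sys/time.h>

    static double wall_seconds(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1.0e-6 * tv.tv_usec;
    }

    static void foo(void) {                  /* hypothetical function under test */
        volatile double x = 0.0;
        for (int i = 0; i < 1000; i++) x += i;
    }

    int main(void) {
        double t0 = wall_seconds();          /* back-to-back calls measure */
        double t1 = wall_seconds();          /* roughly one timer call's cost */
        double timerCost = t1 - t0;

        double s = wall_seconds();
        foo();
        double e = (wall_seconds() - s) - timerCost;
        printf("foo took about %gs (timer overhead subtracted)\n", e);
        return 0;
    }

For very fast functions the result is still noise-dominated, which is where the repetition technique on the next slide helps.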

  7. How to call the timer: repeat
     • Repetition can decrease apparent timer overhead and increase resolution:

         const int n=1000;
         double s=CmiWallTimer();
         for (i=0;i<n;i++) foo();
         double e=(CmiWallTimer()-s)/n;
         CkPrintf("foo took %fs\n",e);

     • Problem: what if foo’s performance is cache-sensitive?
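
One way to handle the cache-sensitivity problem is to run an untimed warm-up pass first, so the loop explicitly measures warm-cache performance and reports it as such. The array size and foo() below are illustrative, not from the slides.

    #include <stdio.h>
    #include <sys/time.h>

    #define N 100000                          /* ~800 KB of doubles: cache-sensitive */
    static double a[N];

    static double wall_seconds(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1.0e-6 * tv.tv_usec;
    }

    static void foo(void) {                   /* touches a large array */
        for (int i = 0; i < N; i++) a[i] += 1.0;
    }

    int main(void) {
        const int n = 1000;
        foo();                                /* untimed warm-up: loads a[] into cache */
        double s = wall_seconds();
        for (int i = 0; i < n; i++) foo();
        double e = (wall_seconds() - s) / n;
        printf("foo took %fs per call (warm cache)\n", e);
        return 0;
    }

If cold-cache behavior is what matters in the real code, the opposite is needed: flush or touch other data between repetitions, which is harder to do portably.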

  8. Understanding Performance: CPU
     [Chart: CPU operation costs on a log time axis from 1ns (GHz) to 1ms (KHz),
      pairing costs with remedies: integer arithmetic, floating point, and
      branches near 1ns; cache around 10ns and memory around 100ns (remedy:
      cache-friendly algorithms); subroutine calls (remedy: “inline”);
      (int)x casts (remedy: *(int *)&x); / or % (remedy: bitshifts and masks).]
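
As a concrete instance of the “bitshifts and masks” remedy for integer / and %, here is a small sketch; the values are illustrative.

    #include <stdio.h>

    int main(void) {
        unsigned x = 12345;

        unsigned q1 = x / 16, r1 = x % 16;    /* division and modulus */
        unsigned q2 = x >> 4, r2 = x & 15;    /* shift and mask: same results */

        printf("%u %u  vs  %u %u\n", q1, r1, q2, r2);
        return 0;
    }

Modern compilers usually perform this rewrite automatically when the divisor is a compile-time power of two, so measure before hand-optimizing.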

  9. Understanding Performance: OS
     [Chart: OS and library costs on the same log time axis: the timer around
      10ns; sin, tan, ... around 100ns (remedy: tables and identities); malloc
      and system calls around 1us (remedy: reuse buffers); disk access and the
      OS timeslice around 1ms (remedy: avoid).]
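
A sketch of the “reuse buffers” remedy for malloc cost: allocate once outside the loop instead of once per iteration. The process() work and the sizes are hypothetical.

    #include <stdlib.h>
    #include <string.h>

    static void process(double *buf, int n) {       /* hypothetical per-step work */
        memset(buf, 0, n * sizeof(double));
    }

    /* Pays the malloc/free cost (roughly a microsecond) every iteration. */
    static void per_step_alloc(int steps, int n) {
        for (int i = 0; i < steps; i++) {
            double *buf = malloc(n * sizeof(double));
            process(buf, n);
            free(buf);
        }
    }

    /* Reuses one buffer: the allocation cost is paid once. */
    static void reuse_buffer(int steps, int n) {
        double *buf = malloc(n * sizeof(double));
        for (int i = 0; i < steps; i++)
            process(buf, n);
        free(buf);
    }

    int main(void) {
        per_step_alloc(10000, 256);
        reuse_buffer(10000, 256);
        return 0;
    }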

  10. Understanding Performance: Net
     [Chart: network costs on the same log time axis: a probe around 1us; RDMA,
      barriers, and full messages in the 10us to 1ms range; the remedies are
      message combining and rethinking the communication pattern.]
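
A sketch of message combining: pack many small updates into one buffer and send it once, rather than paying per-message overhead for each. Here send_message() is a hypothetical stand-in for a real network call, not a Charm++ or MPI API.

    #include <stdio.h>
    #include <string.h>

    typedef struct { int index; double value; } Update;

    /* Hypothetical network send; this stub only reports the message size. */
    static void send_message(int dest, const void *data, int bytes) {
        (void)data;
        printf("to %d: %d bytes\n", dest, bytes);
    }

    /* One combined message carrying n small updates (assumes they fit in buf). */
    static void send_combined(int dest, const Update *u, int n) {
        char buf[4096];
        memcpy(buf, &n, sizeof n);
        memcpy(buf + sizeof n, u, n * sizeof *u);
        send_message(dest, buf, (int)(sizeof n + n * sizeof *u));
    }

    int main(void) {
        Update u[8];
        for (int i = 0; i < 8; i++) { u[i].index = i; u[i].value = 0.5 * i; }
        send_combined(1, u, 8);              /* 1 message instead of 8 tiny ones */
        return 0;
    }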

  11. Conclusions
      • Performance is the whole point of parallel programming
        • Painful, but necessary
      • Asymptotics matter: find the O(n)
      • Constants matter: count mallocs
      • Scientific Performance Method:
        • Measure
        • Fix
        • Repeat
