
Loop-Based Automated Performance Analysis


Presentation Transcript


  1. Loop-Based Automated Performance Analysis Eli Collins eli@cs.wisc.edu Computer Sciences Department University of Wisconsin-Madison Madison, WI 53706 USA

  2. Motivation • Automated performance analysis • Ongoing work (e.g., APART) • Previous work: Callgraph, Deepstart • Faster, more efficient searching • This work: better localization of performance problems • Report performance data at a finer granularity

  3. Motivation (Cont.) • Function granularity works well • Don’t overload user w/ fine-grain data • Why is a function a bottleneck? • Large function w/ multiple bottlenecks • Small function called repeatedly in a loop • Idea: search inside bottleneck functions

  4. Performance Consultant (PC) • For code, PC searches the callgraph • Breadth First Search • Prune non-bottleneck functions • Introduce a new callgraph level that • Is a logical unit of computation • Improves granularity • Partitions functions for searching • Keeps search space manageable (scalability)
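
  A rough sketch of the pruned breadth-first search described above (the node structure and the is_bottleneck test are illustrative stand-ins, not the Performance Consultant's actual code):

     #include <stdio.h>
     #include <stdbool.h>

     /* Illustrative callgraph node: a function, or (with this work) a loop. */
     typedef struct Node {
         const char   *name;
         struct Node **callees;
         int           ncallees;
     } Node;

     /* Stand-in for running an experiment against the node's inclusive metric. */
     static bool is_bottleneck(const Node *n) { (void)n; return true; }

     /* Breadth-first search that prunes the subtrees of non-bottlenecks. */
     static void search(Node *root) {
         Node *queue[256];
         int head = 0, tail = 0;
         queue[tail++] = root;
         while (head < tail) {
             Node *n = queue[head++];
             if (!is_bottleneck(n))
                 continue;                      /* prune: do not descend further */
             printf("refining %s\n", n->name);
             for (int i = 0; i < n->ncallees && tail < 256; i++)
                 queue[tail++] = n->callees[i];
         }
     }

     int main(void) {
         Node f1 = { "f1", NULL, 0 }, f2 = { "f2", NULL, 0 };
         Node *kids[] = { &f1, &f2 };
         Node top = { "main", kids, 2 };
         search(&top);
         return 0;
     }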

  5. Loops in the Callgraph [Diagram: a callgraph over main, f1, and f2, shown next to the same callgraph with a loop node (loop 1) inserted as an extra level between a function and the calls made from inside that loop]

  6. Why Loops? • Loops may be bottlenecks themselves • Especially in scientific and long-running applications • Loops are natural sources of parallelism • Compilers and hardware exploit this • OpenMP PARALLEL DO, loop unrolling/fusion (example below) • Provide feedback on the effectiveness of these optimizations
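
  For illustration, the C analogue of an OpenMP PARALLEL DO: a loop whose independent iterations the compiler and runtime can spread across threads. This is a generic sketch, not an example from the talk:

     #include <stdio.h>

     int main(void) {
         double sum = 0.0;
         /* Iterations are independent apart from the reduction, so OpenMP can
            split the iteration space across threads (compile with -fopenmp). */
         #pragma omp parallel for reduction(+:sum)
         for (int i = 1; i <= 1000000; i++)
             sum += 1.0 / i;
         printf("sum = %f\n", sum);
         return 0;
     }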

  7. Why Loops? • Loops logically decompose functions • Natural hierarchy (loops named by nesting) • We instrument loops in the binary • The binary is what actually executes • Loop-level PC results can typically be correlated with the original source • That is difficult at basic-block or instruction granularity

  8. What’s new? • Loop-level performance data is not new • Existing tools: DPOMP, HPCView, SvPablo • Edge instrumentation in EEL and OM • Integrate loops into automated search • Techniques to instrument loops on-the-fly • Technical challenges doing this efficiently • Especially on IA32 (AMD64/EM64T) • Results for some MPI/OpenMP applications

  9. Binary Loop Instrumentation • Four instrumentation points per loop: (1) loop entry, (2) begin iteration, (3) end iteration, (4) loop exit
     Source loop:                        Corresponding x86 assembly:
       do {                              LP:    inc %edx
         ...                                    inc %eax
         if (x > 100) break;                    cmp $0x64,%eax
         ...                                    jg  DONE
       } while (x < y);                         inc %edx
                                                inc %eax
                                                cmp %edx,%eax
                                                jl  LP
                                         DONE:
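
  A source-level sketch of what probes at those four points can measure (the counters and the run function are hypothetical; Paradyn patches the equivalent probes into the running binary, not into the source):

     #include <stdio.h>

     static long loop_entries = 0;   /* bumped at point (1), loop entry      */
     static long loop_iters   = 0;   /* bumped at point (2), begin iteration */
     static long loop_exits   = 0;   /* bumped at point (4), loop exit       */

     static void run(int y) {
         int x = 0;
         loop_entries++;                 /* (1) entry: the loop is reached    */
         do {
             loop_iters++;               /* (2) begin iteration               */
             x++;
             if (x > 100) break;         /* an early exit still passes (4)    */
             x++;
             /* (3) end iteration: back edge taken while x < y                */
         } while (x < y);
         loop_exits++;                   /* (4) exit: any way out of the loop */
     }

     int main(void) {
         run(50);
         printf("entries=%ld iters=%ld exits=%ld\n",
                loop_entries, loop_iters, loop_exits);
         return 0;
     }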

  10. New Instrumentation Techniques • Traditional function and edge instrumentation • Previously, function relocation covered function entry, exit, and call sites • For loops, the function may be relocated again • Ensure enough padding around the basic blocks that need to be instrumented • Avoid trap-based instrumentation

  11. Loop-based Search Strategy • PC uses loops as steps in its refinement

  12. Loop-based Search Strategy • Inclusive metric: instrument loop entry/exit (timing sketch below) • If a node is a bottleneck, instrument its children • Function: outermost loops and call sites • Loop: nested loops and call sites • Number of PC experiments • More total experiments are possible with loops • But loops can help prune the search • E.g., loops that contain multiple call sites
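
  A minimal sketch of an inclusive metric gathered by instrumenting loop entry and exit: the timer spans the loop body and everything it calls. The timing code here (plain C with clock()) is illustrative, not Paradyn's instrumentation:

     #include <stdio.h>
     #include <time.h>

     static double loop_inclusive_seconds = 0.0;

     static void work(void) { /* loop body, including any call sites */ }

     static void run(void) {
         clock_t start = clock();        /* probe at loop entry */
         for (int i = 0; i < 1000000; i++)
             work();
         clock_t stop = clock();         /* probe at loop exit  */
         loop_inclusive_seconds += (double)(stop - start) / CLOCKS_PER_SEC;
     }

     int main(void) {
         run();
         printf("inclusive loop time: %.6f s\n", loop_inclusive_seconds);
         return 0;
     }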

  13. Test Applications

  14. Results • Loops were frequently bottlenecks • 10 total leaf-level function bottlenecks • 7 of these contained loop bottlenecks • Bottleneck functions had many loops • Especially true for the Fortran applications • OM3: 1 function, 83% CPU, 90 loops • Good results even when the code is not modular • Correlate loops with source using call sites

  15. Bottlenecks (ALARA)

  16. Bottlenecks (SPhot)

  17. Summary • Not much overhead • Avoid trap-based instrumentation • Only instrument the loops of bottleneck functions • Bottlenecks are found at a similar rate • The loop-aware search finds more, since there is more in total to find • More precise results • Little change in search time • Similar rates of experimentation

  18. Loop-Based Automated Performance Analysis eli@cs.wisc.edu http://www.paradyn.org http://www.dyninst.org
