Program Development Tools

Program Development Tools SGI Performance Tools

Outline • Performance tools help to analyze program performance: • monitoring the processor event counters (perfex) • interpretation of the event counter data • SpeedShop and WorkShop • interpretation of the performance data

The event counters • The MIPS R1x000 cpu has on-chip event counters that are used to monitor processor activity • there are 2 counters in R10000 processor • each counter has a selector for the events to be counted: • events 0 - 15 • events 16 - 31 • seeman r10k_counters(5)for a complete description • the R1x000 software counters are available through: • the IRIX perfex command • the profiling environment • a procedural interface (Fortran and C)

The perfex command • perfex [-e e0][-e e1][-a][-mp][-y][-s][-x] cmd [args] • where • e0 and e1 are the event identifiers; should belong to non-overlapping groups (i.e. 0-15 one and 16-31 another) • -a multiplex all events • -mp report counts for each thread, as well as totals • -y give timing estimates for each count and general statistics • -s start/stop counting on the USR1/USR2 signal • -x count events on exception as well • examples: • perfex -e 2 -e 25 ./swp • ………………. output from swp execution • Issued loads ……………………………………………………. 13847619 • Primary data cache misses …………………. 2310349 • perfex -e 3 -e 26 ./swp • ………………. output from swp execution • Issued stores ……………………………………………………. 9176152 • Secondary data cache misses ………………. 9029 • Interpretation of event counters requires a model!

perfex -t: the model cost table • Costs for IP27 processor MIPS R12000 CPU • Cost Typical Minimum Maximum • Event Counter Name ============================================================================================ • 0 Cycles............................................. 1.00 clks 1.00 clks 1.00 clks • 1 Decoded instructions............................... 0.00 clks 0.00 clks 1.00 clks • 2 Decoded loads...................................... 1.00 clks 1.00 clks 1.00 clks • 3 Decoded stores..................................... 1.00 clks 1.00 clks 1.00 clks • 4 Miss handling table occupancy...................... 0.00 clks 0.00 clks 0.00 clks • 5 Failed store conditionals.......................... 1.00 clks 1.00 clks 1.00 clks • 6 Resolved conditional branches...................... 1.00 clks 1.00 clks 1.00 clks • 7 Quadwords written back from scache........... 8.49 clks 5.90 clks 8.77 clks • 8 Correctable scache data array ECC errors........... 0.00 clks 0.00 clks 1.00 clks • 9 Primary instruction cache misses..............17.01 clks 4.34 clks 17.01 clks • 10 Secondary instruction cache misses............99.89 clks 63.03 clks 99.89 clks • 11 Instr misprediction from scache way prediction.. .. 0.00 clks 0.00 clks 1.00 clks • 12 External interventions............................. 0.00 clks 0.00 clks 0.00 clks • 13 External invalidations............................. 0.00 clks 0.00 clks 0.00 clks • 14 ALU/FPU progress cycles............................ 1.00 clks 1.00 clks 1.00 clks • 15 Graduated instructions............................. 0.00 clks 0.00 clks 1.00 clks • 16 Executed prefetch instructions..................... 0.00 clks 0.00 clks 0.00 clks • 17 Prefetch primary data cache misses................. 0.00 clks 0.00 clks 1.00 clks • 18 Graduated loads.................................... 1.00 clks 1.00 clks 1.00 clks • 19 Graduated stores................................... 1.00 clks 1.00 clks 1.00 clks • 20 Graduated store conditionals....................... 1.00 clks 1.00 clks 1.00 clks • 21 Graduated floating point instructions........ 1.00 clks 0.50 clks 52.00 clks • 22 Quadwords written back from primary data cache..... 3.98 clks 3.14 clks 3.98 clks • 23 TLB misses........................................77.78 clks 77.78 clks 77.78 clks • 24 Mispredicted branches.............................. 7.28 clks 6.00 clks 8.81 clks • 25 Primary data cache misses...................... 8.50 clks 2.17 clks 8.50 clks • 26 Secondary data cache misses....................99.89 clks 63.03 clks 99.89 clks • 27 Data misprediction from scache way prediction table 0.00 clks 0.00 clks 1.00 clks • 28 State of intervention hits in scache.........……………. 0.00 clks 0.00 clks 0.00 clks • 29 State of invalidation hits in scache............... 0.00 clks 0.00 clks 0.00 clks • 30 Store/prefetch exclusive to clean block in scache.. 1.00 clks 1.00 clks 1.00 clks • 31 Store/prefetch exclusive to shared block in scache. 1.00 clks 1.00 clks 1.00 clks

Perfex -a -y: multiplexing events • Based on 300 MHz IP27 MIPS R12000 CPU • Typical Minimum Maximum • Event Counter Name Counter Value Time (sec) Time (sec) Time (sec) • ========================================================================================================= • 0 Cycles........................................... 5019528970688 16731.763236 16731.763236 16731.763236 • 16 Executed prefetch instructions................………... 8606259264 0.000000 0.000000 0.000000 • 4 Miss handling table occupancy.................….. 5443332412048 18144.441373 18144.441373 18144.441373 • 18 Graduated loads...............................…... 938555860624 3128.519535 3128.519535 3128.519535 • 2 Decoded loads..................................... 935411271824 3118.037573 3118.037573 3118.037573 • 25 Primary data cache misses......................... 83185778832 2356.930400 601.710467 2356.930400 • 21 Graduated floating point instructions............. 519085469744 1730.284899 865.142450 89974.814756 • 3 Decoded stores.................................... 413277338336 1377.591128 1377.591128 1377.591128 • 19 Graduated stores.................................. 411214746144 1370.715820 1370.715820 1370.715820 • 22 Quadwords written back from primary data cache.... 63439898464 841.635986 664.004271 841.635986 • 6 Resolved conditional branches..................... 231879145424 772.930485 772.930485 772.930485 • 26 Secondary data cache misses....................... 1951064336 649.639388 409.918617 649.639388 • 7 Quadwords written back from scache................ 13451750864 380.684549 264.551100 393.239517 • 23 TLB misses........................................ 1238242032 321.034884 321.034884 321.034884 • 9 Primary instruction cache misses.................. 1719354640 97.487408 24.873330 97.487408 • 24 Mispredicted branches............................. 3744555216 90.867873 74.891104 109.965105 • 10 Secondary instruction cache misses................ 26286992 8.752692 5.522897 8.752692 • 31 Store/prefetch exclusive to shared block in scache 1435428608 4.784762 4.784762 4.784762 • 20 Graduated store conditionals...................... 639716576 2.132389 2.132389 2.132389 • 5 Failed store conditionals......................... 54079168 0.180264 0.180264 0.180264 • 30 Store/prefetch exclusive to clean block in scache. 20579264 0.068598 0.068598 0.068598 • 1 Decoded instructions............................. 3312748123920 0.000000 0.000000 11042.493746 • 8 Correctable scache data array ECC errors......… 0 0.000000 0.000000 0.000000 • 11 Instruction mispred from scache way prediction table..166448576 0.000000 0.000000 0.554829 • 12 External interventions............................ 1584014720 0.000000 0.000000 0.000000 • 13 External invalidations............................ 3304023728 0.000000 0.000000 0.000000 • 14 ALU/FPU progress cycles........................... 0 0.000000 0.000000 0.000000 • 15 Graduated instructions........................... 3268746182224 0.000000 0.000000 10895.820607 • …

Perfex -y statistics (R12000) • Statistics • ========================================================================================= • Graduated instructions/cycle................................................ 0.651206 • Graduated floating point instructions/cycle................................. 0.103413 • Graduated loads & stores/cycle.............................................. 0.268904 • Graduated loads & stores/floating point instruction......................... 2.600286 • Mispredicted branches/Resolved conditional branches......................... 0.016149 • Graduated loads /Decoded loads ( and prefetches )........................... 0.994214 • Graduated stores/Decoded stores............................................. 0.995009 • Data mispredict/Data scache hits............................................ 0.036721 • Instruction mispredict/Instruction scache hits.............................. 0.098312 • L1 Cache Line Reuse......................................................... 15.225978 • L2 Cache Line Reuse......................................................... 41.636102 • L1 Data Cache Hit Rate...................................................... 0.938370 • L2 Data Cache Hit Rate...................................................... 0.976546 • Time accessing memory/Total time............................................ 0.467783 • L1--L2 bandwidth used (MB/s, average per process)........................... 219.760658 • Memory bandwidth used (MB/s, average per process)........................... 27.789316 • MFLOPS (average per process)................................................ 31.023955 • Cache misses in flight per cycle (average).................................. 1.084431 • Prefetch cache miss rate.................................................... 0.606133

Explanation statistics/1 loads + stores - misses misses L1 cache line reuse = L1 cache misses - L2 cache misses L2 cache misses L2 cache line reuse = L1 cache misses loads + stores 1- L1 data cache hit rate = L2 cache misses L1 cache misses 1- L2 data cache hit rate = • L1 cache line reuse: should be large >>1 • it is the number of times, on average that a primary data cache line is used after it has been moved into the cache • L2 cache line reuse: should be large >>1 • it is the average number of times the cache line was used after it moved into the cache • L1 data cache hit rate: should be ~1 • the fraction of data accesses that finds data already resident in L1 • L2 data cache hit rate: should be ~1 • the fraction of data accesses that finds data already resident in L2

Explanation statistics/2 T(loads+stores+L1 miss+L2 miss+TLB miss) total run time • Instruction or data mispredictions/hits: should be small (~0) • L1-L2 bandwidth used ([MB/s] average per process) • total number of bytes moved between L1 and L2 divided by total run time • Memory bandwidth used ([MB/s] average per process) • total number of bytes moved between L2 an memory divided by total time • Performance Mflop/s: larger is better • it is computed from the #fp instructions divided by total run time • the madd instruction counts as 1, so this statistic usually greatly underestimates the performance in Mflop/s • Time accessing memory/Total time: should be small (~0) #bytes moved = (L1 cache miss * L1 cache line size + Quadwords written back *16 bytes) #bytes moved = (L2 cache miss * L2 cache line size + Quadwords written back *16 bytes)

Explanation statistics/3 = 0 .. 5 MHT cycles Mh= L1 miss prefetch(17) #prefetch instructions(16) Rp= = 0 .. 1 • The R12000 processor gives additional information: • Miss handling table occupancy: • average number of entries in MHT. A higher value indicates more outstanding cache misses. An interesting observation is to get • it should be >1, since if it is close to 1 there might be dependency in the data that prevents the processor to issue multiple outstanding cache misses • it could be <1 for code that fits into the cache • high values for Mh while the memory bandwidth is low would indicate a lot of L1 misses into the same L2 cache line • Prefetch miss rate: • rate of prefetch instructions missing in the first level cache • should be close to 1: Every prefetch to be effective should miss in the cache.

R1x000 Important events

Cache Coherency counters • The directory cache coherency protocol sends signals to nodes that requested data from memory in the past • Invalidations are transactions that convert cache lines from the “shared” to invalid state • Interventions are transactions that convert cache line from the exclusive to “shared” or “invalid” (depending on whether the requestor asked for “shared” or “exclusive” access) • Counters 12 and 13 tell how many interventions and invalidations have been received by the processor; counters 28 and 29 tell how many of them actually hit in to the cache • Invalidations and/or interventions are send to a node, because the cache line owner thinks that node has the cache line. However, if that processor has dropped the cache line earlier, the signal will not find the data in the cache and counters 12 or 13 will be updated, while 28 or 29 will not.

Explanation: useful counters • Cycles are proportional to time spend on the cpu • spin times are added, but wait times are not • Graduated instructions divided by cycles will give the IPC measure for the program. The R1x000 processor can process up to 4 IPC, a value of IPC >1.5 should be considered good. • ALU/FPU progress cycles are the actual computation on the processor • TLB misses is counting how many times the virtual translation entry has not been found in the Translation Look-aside Table. If that is a large contribution, larger size pages might be used in the application

Event countersprocedural interface • C synopsis • Fortran synopsis • The event specifiers e0 and e1 can also be set (or overruled) at run time by the environment variables • T5_EVENT0 and T5_EVENT1 • link with -lperfex • the counter values c0 and c1 are 64 bit integers int start_counters(int e0, int e1); int read_counters( int e0, long long *c0, int e1, long long *c1); int print_counters(int e0, long long c0, int e1, long long c1); INTEGER*8 C0, C1 INTEGER*4 E0, E1 INTEGER*4 FUNCTION START_COUNTERS(E0, E1) INTEGER*4 FUNCTION READ_COUNTERS(E0, C0, E1, C1) INTEGER*4 FUNCTION PRINT_COUNTERS(E0, C0, E1, C1)

The SpeedShop profiler • No recompilation or relinking is necessary • Multiple performance experiments and views • The following performance experiments are supported: • Program Counter (PC) sampling -pcsamp • to get exclusive time per subroutine and per source line • no call stack information • User Time -usertime • to get exclusive & inclusive times per subroutine (call stack) • no source line information • Hardware counter profiling -”event”_hwc • Communication overhead of MPI -mpi • Floating point exception tracing -fpe • Tracing of I/O system call -io • malloc/free tracing -heap • basic-block counting experiment -ideal • seeman speedshopfor explanation of the experiments • usage: • ssrun -exp [options] program [args] • prof “ssrun output file”

SpeedShop Experiments 10 ms normal experiments (e.g. -pcsamp) 1 ms “fast” experiments (e.g. -fpcsamp) Time interrupt main sub1 Program Counter (PC) sub2 sub3 subn 16 bit number or 32 bit number Event dependent thresholds “fast” experiments exist Event interrupt Usage counts address 0x000 0xf30 0x000 PC Sampling Experiment: Record current instruction address in a histogram (Overhead <1%) 16 bit wide bins (e.g. -pcsamp) 32 bit wide bins (e.g. -pcsampx) 0x100 0x1f0 instructions 0xcf0 0xdf0 Usertime Sampling Experiment: Record all addresses on call stack (Overhead <5%) 0xe00 0xf30 histogram

ssrun: experiment recording • ssrun writes out performance measurements into a binary file which is named after the program: • ssrun -expnameoptionsprogname prog-arg • here the output will be binary files with the nameprogname.expname.ID#pid • where ID is a letter code as follows: • m the master process created by ssrun • p slave processes in parallel execution (call to sproc()) • f slave processes created by a call to fork() • e process created by a call to exec() • s a system call process • examples: • > mpirun -np 2 ssrun -mpi cgmd • -rw-r--r-- 1 zacharov user 237672 Sep 10 13:50 cgmd.mpi.f318839 • -rw-r--r-- 1 zacharov user 4733496 Sep 10 13:50 cgmd.mpi.f318854 • -rw-r--r-- 1 zacharov user 11896 Sep 10 13:50 cgmd.mpi.m318787 • > ssrun -usertime cgmd • -rw-r--r-- 1 zacharov user 1613562 Sep 10 14:28 cgmd.usertime.m320140 • -rw-r--r-- 1 zacharov user 1601776 Sep 10 14:28 cgmd.usertime.p320140

prof: Performance interpretation • prof is the command-line interpreter for ssrun experiments • Feedback file generation for compiler and linker • Use cvperf(1)(not prof) for heap and MPI trace analysis • prof options: • -heavy report the performance per source code line; order in descending performance measure • -lines report the performance per source code line; order in program sequence • -usage summary of the system resources used by the program • -gprof show inclusive timing of callers and callees of each function • -q n truncate after the first n routines or lines • -q n% truncate after the routine that takes >n% of total time • -q ncum% truncate after n% of cumulative time • warning: executable should not contain strip-ed parts

PC sampling experiment/1 • % ssrun -fpcsampx ./program.exe • % prof -lines program.exe.fpcsampx.m1329 • ------------------------------------------------------------------------- • SpeedShop profile listing generated Tue Feb 10 17:53:44 1998 • prof -lines program.exe.fpcsampx.m1329 • program.exe (n32): Target program • fpcsampx: Experiment name • pc,4,1000,0:cu: Marching orders • R10000 / R10010: CPU / FPU • 4: Number of CPUs • 195: Clock frequency (MHz.) • Experiment notes-- • From file program.exe.fpcsampx.m1329: • Caliper point 0 at target begin, PID 1329 • /usr/users/Examples/speedshop/./program.exe • Caliper point 1 at exit(0) • ------------------------------------------------------------------------- • Summary of statistical PC sampling data (fpcsampx)-- • 26758: Total samples • 26.758: Accumulated time (secs.) • 1.0: Time per sample (msecs.) • 4: Sample bin width (bytes) • ------------------------------------------------------------------------- • Function list, in descending order by time • ------------------------------------------------------------------------- • [index] secs % cum.% samples function (dso: file, line) • [1] 9.180 34.3% 34.3% 9180 sub1 (prog.exe: program.f, 39) • [2] 7.070 26.4% 60.7% 7070 update_sum (prog.exe: program.f,62) • [3] 6.182 23.1% 83.8% 6182 sub3 (prog.exe: program.f, 112) • [4] 1.968 7.4% 91.2% 1968 irand_ (libftn.so: rand_.c, 62) • [5] 1.317 4.9% 96.1% 1317 rand_ (libftn.so: rand_.c, 67) • [6] 1.040 3.9% 100.0% 1040 ssdemo (prog.exe: program.f, 1) • 0.001 0.0% 100.0% 1 **OTHER** (excluded DSOs, rld,etc.) • 26.758 100.0% 100.0% 26758 TOTAL

PC sampling experiment/2 • ------------------------------------------------------------------------- • Line list, in descending order by function-time and then line number • ------------------------------------------------------------------------- • secs % cum.% samples function (dso: file, line) • 0.002 0.0 0.0 2 sub1 (program.exe: program.f, 49) • 0.007 0.0 0.0 7 sub1 (program.exe: program.f, 50) • 0.930 3.5 3.5 930 sub1 (program.exe: program.f, 54) • 8.241 30.8 34.3 8241 sub1 (program.exe: program.f, 55) • 0.858 3.2 37.5 858 update_sum (program.exe: program.f, 72) • 1.263 4.7 42.2 1263 update_sum (program.exe: program.f, 73) • 4.949 18.5 60.7 4949 update_sum (program.exe: program.f, 74) • 0.001 0.0 60.7 1 sub3 (program.exe: program.f, 122) • 1.029 3.8 64.6 1029 sub3 (program.exe: program.f, 123) • 5.152 19.3 83.8 5152 sub3 (program.exe: program.f, 124) • 0.785 2.9 86.8 785 irand_ (libftn.so: rand_.c, 62) • 1.183 4.4 91.2 1183 irand_ (libftn.so: rand_.c, 63) • 0.076 0.3 91.5 76 rand_ (libftn.so: rand_.c, 67) • 1.241 4.6 96.1 1241 rand_ (libftn.so: rand_.c, 69) • 0.174 0.7 96.8 174 ssdemo (program.exe: program.f, 20) • 0.866 3.2 100.0 866 ssdemo (program.exe: program.f, 21)

Hardware counters experiment/1 • % ssrun -ftlb_hwc program.exe ; prof program.exe.ftlb_hwc.m1335 • ------------------------------------------------------------------------- • SpeedShop profile listing generated Tue Feb 10 17:54:23 1998 • prof -lines program.exe.ftlb_hwc.m1335 • program.exe (n32): Target program • ftlb_hwc: Experiment name • hwc,23,53:cu: Marching orders • R10000 / R10010: CPU / FPU • 4: Number of CPUs • 195: Clock frequency (MHz.) • Experiment notes-- • From file program.exe.ftlb_hwc.m1335: • Caliper point 0 at target begin, PID 1335 • /usr/users/Examples/speedshop/./program.exe • Caliper point 1 at exit(0) • ------------------------------------------------------------------------- • Summary of R10K perf. counter overflow PC sampling data (ftlb_hwc)-- • 412712: Total samples • TLB misses (23): Counter name (number) • 53: Counter overflow value • 21873736: Total counts • ------------------------------------------------------------------------- • Function list, in descending order by counts • ------------------------------------------------------------------------- • [index] counts % cum.% samples function (dso: file, line) • [1] 10224919 46.7% 46.7% 192923 sub1 (program.exe: program.f, 39) • [2] 5252035 24.0% 70.8% 99095 sub3 (program.exe: program.f, 112) • [3] 5111691 23.4% 94.1% 96447 update_sum (program.exe: program.f, 62) • [4] 1043729 4.8% 98.9% 19693 ssdemo (program.exe: program.f, 1) • [5] 241097 1.1% 100.0% 4549 rand_ (libftn.so: rand_.c, 67) • [6] 53 0.0% 100.0% 1 __checktraps (libc.so.1: stubfpeexit.c, 3) • [7] 53 0.0% 100.0% 1 exceed_length (libftn.so: wrtfmt.c, 1421) • [8] 53 0.0% 100.0% 1 wrt_I (libftn.so: wrtfmt.c, 102) • [9] 53 0.0% 100.0% 1 do_Lio_com (libftn.so: lio.c, 803) • 53 0.0% 100.0% 1 **OTHER** (includes excluded DSOs, rld, etc.) • 21873736 100.0% 100.0% 412712 TOTAL

Hardware counters experiment/2 • Use the environment variables (see man speedshop) : • % setenv _SPEEDSHOP_HWC_COUNTER_NUMBER 6 • % setenv _SPEEDSHOP_HWC_COUNTER_OVERFLOW 100 • % ssrun -prof_hwc ./program.exe • % prof -lines program.exe.prof_hwc.m1336 • ------------------------------------------------------------------------- • SpeedShop profile listing generated Tue Feb 10 17:54:56 1998 • prof -lines program.exe.prof_hwc.m1336 • program.exe (n32): Target program • prof_hwc: Experiment name • hwc:cu: Marching orders • R10000 / R10010: CPU / FPU • 4: Number of CPUs • 195: Clock frequency (MHz.) • Experiment notes-- • From file program.exe.prof_hwc.m1336: • Caliper point 0 at target begin, PID 1336 • /usr/users/Examples/speedshop/./program.exe • Caliper point 1 at exit(0) • ------------------------------------------------------------------------- • Summary of R10K perf. counter overflow PC sampling data (prof_hwc)-- • 2153328: Total samples • Decoded branches (6): Counter name (number) • 100: Counter overflow value • 215332800: Total counts etc…

SpeedShop Performance Views • Function list, program graph, butterfly • source code with performance annotation • disassembly and basic blocks • architectural information • usage data (resource consumption on the machine) • Heap analysis and memory leaks • working sets • MP overhead view (Pthreads, MPI, OpenMP)

SpeedShop basic tools • ssrun • basic performance data generation tool • outputs binary file with the performance experiment results per program or execution thread • prof • human readable interpretation of the ssrun binary files • sscord/ssorder and sswsextr • used to extract the working set information from a program to re-order the program subroutines to pack frequently executed instructions on same memory pages (reduces TLB miss for instr.) • ssusage:ssusage program prog-args • similar to time, but gives resident size and I/O characteristics • squeeze and thrash • change amount of memory available to the user program • ssdump • dump the binary performance data files in ASCII format • fbdump • writes compiler feedback files from prof • ssaggregate • combines multiple SpeedShop binary performance data files into one

Ssrun: performance experiment • ssrun -expnameoptionsprogram prog-arg • ssrun options: • -workshop files will be used by the GUI tools (WorkShop) • Performance experiments: -expname | -exp expname • -usertime(to get exclusive & inclusive times per subroutine) • Measures CPU time, including • the time program is actually running • the time the system is providing services for your program (system calls) • excludes the time the program is waiting to execute • -[f]pcsamp[x] (to get exclusive time per subroutine and source line) • Measures the time program is actually running • excludes the time the program is waiting for system services • the f is for 1 ms interrupt interval, x is for 32 bit bins • -ideal • counts Basic Blocks in the program (needed for compiler feedback) • -mpi • traces various Message Passing Interface (MPI) routines • -usertime kernprof -p cpu-id • kernel profiling for cpu(s) cpu-id (see man kernprof(1)) (no source line info) (no call stack info)

Ssrun: performance experiments (hardware counter) • The R1x000 event counters can be used for performance analysis. All experiments are described inman ssrun(1) • Useful experiments: • -[f]dsc_hwc • Counts overflows of the 2nd level cache miss counter (counter 26) • the address is recorded if counter reaches 131 counts (29 for “fast”) • -[f]gfp_hwc • overflows of the graduated floating point instruction counter (21) • the address is recorded if counter reaches 32771 counts (6553 for “fast” experiments) • -prof_hwctime • profiles the counter specified by the environment variable _SPEEDSHOP_HWC_COUNTER_PROF_NUMBER using call-stack sampling • addresses are recorded based on the overflow of the env variable _SPEEDSHOP_HWC_COUNTER_OVERFLOW • -prof_hwc • same as prof_hwctime for PC sampling with event counter overflow

User time experiment • SpeedShop profile listing generated Tue Feb 10 17:51:59 1998 • prof -gprof program.exe.usertime.m1332 • program.exe (n32): Target program • usertime: Experiment name • ut:cu: Marching orders • R10000 / R10010: CPU / FPU • 4: Number of CPUs • 195: Clock frequency (MHz.) • Experiment notes-- • From file program.exe.usertime.m1332: • Caliper point 0 at target begin, PID 1332 • /usr/users/ruud/Examples/speedshop/./program.exe • Caliper point 1 at exit(0) • ------------------------------------------------------------------------- • Summary of statistical callstack sampling data (usertime)-- • 301: Total Samples • 0: Samples with incomplete traceback • 9.030: Accumulated Time (secs.) • 30.0: Sample interval (msecs.) • ------------------------------------------------------------------------- • Function list, in descending order by exclusive time • ------------------------------------------------------------------------- • [index] excl.secs excl.% cum.% incl.secs incl.% samples procedure (dso: file, line) • [4] 2.850 31.6% 31.6% 5.370 59.5% 179 sub1 (program.exe: program.f, 39) • [6] 2.520 27.9% 59.5% 2.520 27.9% 84 update_sum (program.exe: program.f, 62) • [5] 1.980 21.9% 81.4% 3.000 33.2% 100 sub3 (program.exe: program.f, 112) • [8] 0.660 7.3% 88.7% 0.660 7.3% 22 irand_ (libftn.so: rand_.c, 62) • [7] 0.480 5.3% 94.0% 1.140 12.6% 38 rand_ (libftn.so: rand_.c, 67) • [1] 0.390 4.3% 98.3% 9.030 100.0% 301 ssdemo (program.exe: program.f, 1) • [10] 0.090 1.0% 99.3% 0.090 1.0% 3 _write (libc.so.1: write.s, 15) • [16] 0.030 0.3% 99.7% 0.030 0.3% 1 s_wsle64 (libftn.so: lio.c, 132) • [17] 0.030 0.3% 100.0% 0.030 0.3% 1 s_stop (libftn.so: s_stop.c, 54) • [2] 0.000 0.0% 100.0% 9.030 100.0% 301 __start (program.exe: crt1text.s, 103) • [3] 0.000 0.0% 100.0% 9.030 100.0% 301 main (libftn.so: main.c, 76) • [9] 0.000 0.0% 100.0% 0.120 1.3% 4 sub2 (program.exe: program.f, 81) • [11] 0.000 0.0% 100.0% 0.090 1.0% 3 e_wsle (libftn.so: lio.c, 205)

-usertime experiment -gprof • ------------------------------------------------------------------------- • Butterfly function list, in descending order by inclusive time • ------------------------------------------------------------------------- • attrib.% attrib.time incl.time caller (callsite) [index] • [index] incl.% incl.time self% self-time procedure [index] • attrib.% attrib.time incl.time callee (callsite) [index] • ------------------------------------------------------------------------- • 100.0% 9.030 9.030 main (@0x0b35c004; libftn.so: main.c, 97) [3] • [1] 100.0% 9.030 4.3% 0.390 ssdemo [1] • 59.5% 5.370 5.370 sub1 (@0x10000ef0; prog.exe: program.f, 27) [4] • 33.2% 3.000 3.000 sub3 (@0x10000f30; program.exe: program.f,33) [5] • 1.3% 0.120 1.140 rand_ (@0x10000e20; program.exe: program.f, 21) [7] • 1.3% 0.120 0.120 sub2 (@0x10000f14; program.exe: program.f, 31) [9] • 0.3% 0.030 0.030 sstop (@0x10000f68; program.exe: program.f, 36)[17] • ------------------------------------------------------------------------- • [2] 100.0% 9.030 0.0% 0.000 __start [2] • 100.0% 9.030 9.030 main (@0x10000d68; program.exe: crt1text.s, 166)[3] • ------------------------------------------------------------------------- • 100.0% 9.030 9.030 start (@0x10000d68; program.exe: crt1text.s,166)[2] • [3] 100.0% 9.030 0.0% 0.000 main [3] • 100.0% 9.030 9.030 ssdemo (@0x0b35c004; libftn.so: main.c, 97) [1] • ------------------------------------------------------------------------- • 59.5% 5.370 9.030 ssdemo (@0x10000ef0; program.exe: program.f, 27)[1] • [4] 59.5% 5.370 31.6% 2.850 sub1 [4] • 27.9% 2.520 2.520 update_sum(@0x10001024;program.exe:program.f,51)[6] • ------------------------------------------------------------------------- • 33.2% 3.000 9.030 ssdemo (@0x10000f30; program.exe: program.f, 33)[1] • [5] 33.2% 3.000 21.9% 1.980 sub3 [5] • 11.3% 1.020 1.140 rand_ (@0x10001470; program.exe: program.f, 124)[7] • ------------------------------------------------------------------------- • 27.9% 2.520 5.370 sub1 (@0x10001024; program.exe: program.f, 51) [4] • [6] 27.9% 2.520 27.9% 2.520 update_sum [6] • ------------------------------------------------------------------------- • 1.3% 0.120 9.030 ssdemo (@0x10000e20; program.exe: program.f, 21)[1] • 11.3% 1.020 3.000 sub3 (@0x10001470; program.exe: program.f, 124) [5] • [7] 12.6% 1.140 5.3% 0.480 rand_ [7] • 7.3% 0.660 0.660 irand_ (@0x0b340f6c; libftn.so: rand_.c, 69) [8] • ------------------------------------------------------------------------- • 7.3% 0.660 1.140 rand_ (@0x0b340f6c; libftn.so: rand_.c, 69) [7] • [8] 7.3% 0.660 7.3% 0.660 irand_ [8] • -------------------------------------------------------------------------

Summary • There is a large set of performance measurement tools: • to collect the running status of the machine • to collect the resource usage of individual programs • There is a large set of performance analysis tools: • perfex, based on the processor event counters, valid for evaluation of complete programs or parts of programs • SpeedShop tools for command-line evaluation of the performance data

Program Development Tools