slide1

PerfLib Overview

Jeff Brown

LANL/CCS-7

July 28, 2010

slide2

PerfLib is ...

a performance measurement and analysis tool that consists of two components:

- a run-time library (libperfrt), and

- a post-processing library, scripts, and utility programs

slide3

How to use PerfLib

1. instrument the code

leverage existing timing infrastructure (rage, flag, partisn, ...)

instrumentation moves with the code to new systems

calipers must be “well formed” (beware alternate returns; see the caliper sketch after this list)

2. build the code linking the run-time library

minor modifications to the build system

3. run code to collect data

trigger data collection via:

- environment variable settings or

- library calls

4. post-process to analyze results (runpp, perfpp, etc.)
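
A minimal caliper sketch for step 1, assuming a C code. The slides name libperfrt but not its API, so perf_begin/perf_end below are illustrative placeholders, not the real entry points. The point being illustrated is the "well formed" requirement: every exit path from an instrumented routine must close the caliper it opened.

```c
#include <stdio.h>

/* Hypothetical caliper API: stand-ins for whatever libperfrt
   actually exports (not shown in the slides). */
static void perf_begin(const char *name) { printf("enter %s\n", name); }
static void perf_end(const char *name)   { printf("exit  %s\n", name); }

/* A "well formed" caliper: every exit path closes what it opened. */
static double hydro_step(double dt) {
    perf_begin("hydro");
    if (dt <= 0.0) {          /* early return: still close the caliper */
        perf_end("hydro");
        return 0.0;
    }
    /* ... real hydro work would go here ... */
    perf_end("hydro");
    return dt;
}

int main(void) {
    hydro_step(0.5);
    hydro_step(-1.0);
    return 0;
}
```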

slide4

Performance Data Collected

Profiling:

- timing

- MPI (via the MPI profiling interface; no additional instrumentation required, see the PMPI sketch below)

- hardware counters (e.g. flops, cache, tlb, etc.)

- memory allocation (RSS footprint)

- IO

Tracing:

- timing/MPI

- memory allocation
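
Why MPI data needs no source changes: the MPI standard's profiling interface lets a tool library define its own MPI_* entry points that time the call and forward to the real PMPI_* implementation, so linking the library is enough. A minimal sketch of that mechanism (illustrative, not PerfLib's actual wrapper; record_mpi_time is a hypothetical helper):

```c
#include <mpi.h>
#include <stdio.h>

/* Stand-in for the tool's bookkeeping; a real tool would credit the
   time to the current call-tree node. */
static void record_mpi_time(const char *name, double seconds) {
    fprintf(stderr, "%s: %.6f s\n", name, seconds);
}

/* PMPI wrapper: intercept MPI_Allreduce, time it, forward to the
   real implementation via the PMPI_ name. */
int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm) {
    double t0 = MPI_Wtime();
    int rc = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    record_mpi_time("MPI_Allreduce", MPI_Wtime() - t0);
    return rc;
}
```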

slide5

Post Processing

perfpp: a do-it-all script that discovers what data was collected and generates all possible reports/plots

runpp: generates only the reports

some sample reports and plots ...

slide6

PerfLib Overhead

tracked: ~2% for timing/MPI

12% for hardware counters, 21% for memory allocation

slide7

PerfLib Header Report for 64 PE Comet Impact Run on tu

PERFlib version 3.0

Data path: /scratch2/jeffb/tu/PerfTrack/3.0/crestone/timing/user_problems/comet/2.56km/48/64/20100324

Data directory: comet-20100324

Performance data header:

run date: 20100324

OS version: Linux

code version: xrage.1003.00

compiled with: unknown

MPI version: /usr/projects/packages/openmpi/tu133/openmpi-intel-1.3.3

problem name: comet.input

hosts: 64 processes running on (all processors with 2.30GHz cpu, 512 KB level 2 cache, 1024 TLB entries, 32189.25 MB physical memory)

tua043:tua043.localdomain ×16, tua043:tua044.localdomain ×16, tua043:tua046.localdomain ×16, tua043:tua051.localdomain ×16

Profile metrics:

Time (wall clock) enabled (PAPI timer)

Counters (hardware performance data) not enabled (to enable, setenv PERF_PROFILE_COUNTERS)

Memory not enabled (to enable, setenv PERF_PROFILE_MEMORY)

MPI enabled

IO enabled

Trace metrics:

Memory not enabled (to enable, setenv PERF_TRACE_MEMORY)

Performance data dump frequency - every 10 cycles
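
The report's "to enable, setenv PERF_PROFILE_COUNTERS" wording suggests an option is on whenever its variable is defined at all (csh setenv with no value still defines it). A sketch of how a run-time library can key collection off such variables; the variable names come from the report above, but the check itself is an illustrative assumption, not PerfLib source:

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Enabled iff the environment variable is defined, regardless of
   its value (matches the "setenv NAME" usage in the report). */
static bool env_enabled(const char *name) {
    return getenv(name) != NULL;
}

int main(void) {
    printf("counters: %s\n",
           env_enabled("PERF_PROFILE_COUNTERS") ? "enabled" : "not enabled");
    printf("memory:   %s\n",
           env_enabled("PERF_PROFILE_MEMORY") ? "enabled" : "not enabled");
    return 0;
}
```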

slide8

Tuesday night comet impact run on tu

timing report filtered at 3%

rank 0, last cycle cumulative

Cycle: 7234, MPI rank: 0, all instrumentation levels

dumping elapsed time performance data since start of run (inclusive nested calling tree)

routines with average time/call > 10 us after 1000 calls

skipping routines with < 3% wall clock time and

< 1e+08 bytes sent/rcvd and

< 1e+06 bytes written

controller 100.00%; 177.89 s(177.89 s); 1 call

+ controller_0 99.92%; 177.74 s(177.74 s); 1 call

+ . controller_3 96.71%; 172.04 s(172.04 s); 96 calls (1.792 s/call avg, 1.324 s min, 3.130 s max)

+ . | cycle 94.46%; 168.04 s(168.04 s); 95 calls (1.769 s/call avg, 1.578 s min, 2.005 s max)

+ . | . hydro 65.71%; 116.89 s(116.89 s); 95 calls (1.230 s/call avg, 1.210 s min, 1.363 s max)

+ . | . + cdt 10.13%; 18.025 s(18.025 s); 285 calls (0.063 s/call avg, 0.056 s min, 0.072 s max)

+ . | . + . xmeos 6.10%; 10.859 s(10.859 s); 285 calls (0.038 s/call avg, 0.035 s min, 0.046 s max)

+ . | . + . token_allreduce 3.29%; 5.847 s(5.847 s); 285 calls (0.021 s/call avg, 0.015 s min, 0.028 s max)

+ . | . + . | MPI_Allreduce 3.28%; 5.842 s(5.842 s); 285 calls (0.020 s/call avg, 0.015 s min, 0.027 s max)

2.23 KB sent (avg: 8 B, BW: 0.000372168 MB/s);

2.23 KB rcvd (avg: 8 B, BW: 0.000372168 MB/s)

+ . | . + hydro_lanl_1 55.53%; 98.774 s(98.774 s); 190 calls (0.520 s/call avg, 0.507 s min, 0.586 s max)

+ . | . + . d_common 19.79%; 35.206 s(35.206 s); 3420 calls (0.010 s/call avg, 8739 us min, 0.043 s max)

+ . | . + . | . MPI_Irecv 0.02%; 0.038 s(0.038 s); 17064 calls (0.000002 s/call avg, 1 us min, 21 us max)

99046.69 KB rcvd (avg: 5943.73 B, BW: 2516 MB/s)

+ . | . + . | (d_common exclusive) 16.12%; 28.673 s(28.673 s);

+ . | . + . h_1_advect_vol 5.53%; 9.829 s(9.829 s); 190 calls (0.052 s/call avg, 0.050 s min, 0.060 s max)

+ . | . + . | d_common_vec 4.48%; 7.965 s(7.965 s); 190 calls (0.042 s/call avg, 0.041 s min, 0.047 s max)

+ . | . + . | . (d_common_vec exclusive) 4.29%; 7.639 s(7.639 s);

+ . | . + . d_fvol 8.18%; 14.558 s(14.558 s); 1140 calls (0.013 s/call avg, 0.011 s min, 0.028 s max)

+ . | . + . | (d_fvol exclusive) 7.26%; 12.912 s(12.912 s);

+ . | . + . seteng 5.77%; 10.272 s(10.272 s); 190 calls (0.054 s/call avg, 0.054 s min, 0.061 s max)

+ . | . + . (hydro_lanl_1 exclusive) 11.26%; 20.025 s(20.025 s);

+ . | . calscr 3.32%; 5.906 s(5.906 s); 95 calls (0.062 s/call avg, 0.061 s min, 0.075 s max)

+ . | . freeze_restore 3.97%; 7.066 s(7.066 s); 95 calls (0.074 s/call avg, 0.071 s min, 0.086 s max)

+ . | . recon 17.63%; 31.362 s(31.362 s); 95 calls (0.330 s/call avg, 0.154 s min, 0.470 s max)

+ . | . + cdt 3.04%; 5.414 s(5.414 s); 95 calls (0.057 s/call avg, 0.056 s min, 0.060 s max)

+ . | . + cell_get 3.18%; 5.650 s(5.650 s); 31480 calls (0.000179 s/call avg, 51 us min, 2269 us max)

+ . | . + . token_get 3.04%; 5.400 s(5.400 s); 31480 calls (0.000172 s/call avg, 47 us min, 2263 us max)

+ . | . + . | MPI_Issend 0.08%; 0.150 s(0.150 s); 47160 calls (0.000003 s/call avg, 1 us min, 69 us max)

144710.39 KB sent (avg: 3142.14 B, BW: 943.477 MB/s)

+ . | . cdt 3.24%; 5.763 s(5.763 s); 95 calls (0.061 s/call avg, 0.060 s min, 0.063 s max)

+ . | . + . | pwrite 0.04%; 0.067 s(0.067 s); 13 calls (0.005147 s/call avg, 4546 us min, 5696 us max)

65.00 MB written @ 971.3671 MB/s

call tree stats:

depth: 11

nodes: 794
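
How to read the "(... exclusive)" lines above: exclusive time is a routine's inclusive time minus the inclusive time of its instrumented children, i.e. time spent in the routine's own body. A small sketch of that bookkeeping, with a hypothetical node layout and made-up numbers (not PerfLib's actual data structure):

```c
#include <stdio.h>

/* Hypothetical call-tree node for illustration only. */
typedef struct tree_node {
    const char *name;
    double incl_s;                  /* inclusive wall-clock seconds */
    const struct tree_node **kids;  /* instrumented callees */
    int nkids;
} tree_node;

/* exclusive = inclusive minus the children's inclusive time */
static double exclusive_s(const tree_node *n) {
    double excl = n->incl_s;
    for (int i = 0; i < n->nkids; i++)
        excl -= n->kids[i]->incl_s;
    return excl;
}

int main(void) {
    tree_node leaf = {"child", 6.5, NULL, 0};       /* made-up numbers */
    const tree_node *kids[] = {&leaf};
    tree_node parent = {"parent", 10.0, kids, 1};
    printf("(%s exclusive) %.3f s\n", parent.name, exclusive_s(&parent));
    return 0;
}
```

Note that children filtered out of the printed report (below the 3% threshold) still reduce the parent's exclusive time, so the children shown need not sum to inclusive minus exclusive.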

slide9

memory allocation report

Cycle: 7234, MPI rank: 0, all instrumentation levels

dumping memory allocation performance data since start of run (inclusive nested calling tree)

physical memory on node: 32189.25 MBytes (hostname: tua023:tua023.localdomain)

memory allocated (rss): 102.95 MBytes (44.8359 MBytes allocated prior to 1st instrumentation point)

minimum free memory: 26665.55 MBytes (82.84%)

total page faults: 71 (22 page faults prior to 1st instrumentation point)

dumping memory allocation performance data (rss growth) - inclusive nested calling tree

skipping routines with < 3% allocated memory (rss growth) and

< 3% page faults

controller 100.00%, 58.1094 Mbytes, 49 page faults

+ controller_0 97.22%, 56.4922 Mbytes, 47 page faults

+ . tread 34.43%, 20.0078 Mbytes, 2 page faults

+ . | pio_open 16.61%, 9.65234 Mbytes

+ . | . universal_file_read_common 15.45%, 8.97656 Mbytes

+ . | . + bulkio_read_s 8.69%, 5.05078 Mbytes

+ . | . + . bulkio_read_d 8.69%, 5.05078 Mbytes

+ . | . + . | pread 8.60%, 5 Mbytes

+ . | . + token_bcast 6.72%, 3.90234 Mbytes

+ . | resize 15.51%, 9.01172 Mbytes, 1 page faults

+ . | . (resize exclusive) 15.49%, 9.00391 Mbytes, 1 page faults

+ . restart 30.14%, 17.5156 Mbytes, 10 page faults

+ . | bldint 10.43%, 6.05859 Mbytes

+ . | . resize 6.60%, 3.83594 Mbytes

+ . | . + (resize exclusive) 6.60%, 3.83594 Mbytes

+ . | seteng 5.17%, 3.00391 Mbytes

+ . | (restart exclusive) 10.78%, 6.26562 Mbytes, 8 page faults

+ . controller_3 29.69%, 17.25 Mbytes, 21 page faults

+ . | cycle 27.42%, 15.9336 Mbytes

+ . | . hydro 25.16%, 14.6211 Mbytes

+ . | . + hydro_lanl_1 22.63%, 13.1484 Mbytes

+ . | . + . d_common 4.45%, 2.58594 Mbytes

+ . | . + . | (d_common exclusive) 4.30%, 2.5 Mbytes

+ . | . + . h_1_advect_vol 7.32%, 4.25391 Mbytes

+ . | . + . | d_common_vec 7.31%, 4.24609 Mbytes

+ . | . + . | . (d_common_vec exclusive) 7.10%, 4.125 Mbytes

+ . | . + . seteng 4.43%, 2.57422 Mbytes

+ . | . + . | d_common_vec 4.43%, 2.57422 Mbytes

+ . | . + . | . (d_common_vec exclusive) 4.43%, 2.57422 Mbytes

+ . | . + . (hydro_lanl_1 exclusive) 6.06%, 3.52344 Mbytes

call tree stats:

depth: 10

nodes: 438
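
The rss numbers above track resident-set growth between instrumentation points. One plausible way to sample RSS on Linux (the slides do not show PerfLib's actual mechanism) is the second field of /proc/self/statm, which reports resident pages:

```c
#include <stdio.h>
#include <unistd.h>

/* Read resident-set size from /proc/self/statm (fields are in
   pages: total size, then resident). Returns MBytes, or -1.0 on
   error. Growth attribution would difference samples taken at
   caliper entry and exit. */
static double rss_mbytes(void) {
    long size_pages = 0, rss_pages = 0;
    FILE *f = fopen("/proc/self/statm", "r");
    if (!f || fscanf(f, "%ld %ld", &size_pages, &rss_pages) != 2) {
        if (f) fclose(f);
        return -1.0;
    }
    fclose(f);
    return rss_pages * (double)sysconf(_SC_PAGESIZE) / (1024.0 * 1024.0);
}

int main(void) {
    printf("rss: %.2f MBytes\n", rss_mbytes());
    return 0;
}
```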

slide10

flop data integrated into call tree

Cycle: 7234, MPI rank: 0, all instrumentation levels

dumping elapsed time performance data since start of run (inclusive nested calling tree)

routines with average time/call > 10 us after 1000 calls

skipping routines with < 3% wall clock time and

< 1e+08 bytes sent/rcvd and

< 1e+06 bytes written

Peak Mflops: 4600.00

Mflops: 0.82 ( 0.02% of peak)

controller 100.00%; 201.30 s(201.30 s); 1 call

164972896 FP_INS, 0.820 Mf/s, 0.072 s ( 0.04%)

141760805912 L1_DCA, 704 L1_DCA/us, 0.00116374 f/L1_DCA

2686875312 L1_DCM, 13 L1_DCM/us, 0.0613995 f/L1_DCM, 1.90% L1_DCM/L1_DCA

623965879 L2_DCM, 3 L2_DCM/us, 0.264394 f/L2_DCM, 23.22% L2_DCM/L2_DCA

+ controller_0 99.46%; 200.20 s(200.20 s); 1 call

164972808 FP_INS, 0.824 Mf/s, 0.072 s ( 0.04%)

140023030844 L1_DCA, 699 L1_DCA/us, 0.00117818 f/L1_DCA

2657992526 L1_DCM, 13 L1_DCM/us, 0.0620667 f/L1_DCM, 1.90% L1_DCM/L1_DCA

623906059 L2_DCM, 3 L2_DCM/us, 0.264419 f/L2_DCM, 23.47% L2_DCM/L2_DCA

+ . controller_3 96.56%; 194.38 s(194.38 s); 96 calls (2.025 s/call avg, 1.745 s min, 3.419 s max)

164674809 FP_INS, 0.847 Mf/s, 0.072 s ( 0.04%)

133280135424 L1_DCA, 686 L1_DCA/us, 0.00123555 f/L1_DCA

2527669073 L1_DCM, 13 L1_DCM/us, 0.0651489 f/L1_DCM, 1.90% L1_DCM/L1_DCA

621185981 L2_DCM, 3 L2_DCM/us, 0.265097 f/L2_DCM, 24.58% L2_DCM/L2_DCA

+ . | cycle 93.81%; 188.84 s(188.84 s); 95 calls (1.988 s/call avg, 1.738 s min, 2.199 s max)

132328667 FP_INS, 0.701 Mf/s, 0.058 s ( 0.03%)

129778334303 L1_DCA, 687 L1_DCA/us, 0.00101965 f/L1_DCA

2468965744 L1_DCM, 13 L1_DCM/us, 0.0535968 f/L1_DCM, 1.90% L1_DCM/L1_DCA

617737747 L2_DCM, 3 L2_DCM/us, 0.214215 f/L2_DCM, 25.02% L2_DCM/L2_DCA

+ . | . hydro 59.07%; 118.91 s(118.91 s); 95 calls (1.252 s/call avg, 1.228 s min, 1.379 s max)

78559433 FP_INS, 0.661 Mf/s, 0.034 s ( 0.03%)

71964911416 L1_DCA, 605 L1_DCA/us, 0.00109164 f/L1_DCA

1414105604 L1_DCM, 12 L1_DCM/us, 0.0555541 f/L1_DCM, 1.96% L1_DCM/L1_DCA

487050183 L2_DCM, 4 L2_DCM/us, 0.161296 f/L2_DCM, 34.44% L2_DCM/L2_DCA
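
The derived metrics can be checked from the raw counts on the controller line: Mflops = FP_INS / elapsed = 164972896 / 201.30 s ≈ 0.82 Mf/s; %peak = 0.82 / 4600.00 ≈ 0.02%; and f/L1_DCA = 164972896 / 141760805912 ≈ 0.00116, matching the report.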

slide12

Memory Allocation by Routine by Rank

shows the effect of I/O processors (bulkio)

slide17

%peak by flops/memory reference

(idea for this from Mack Kenamond's Shavano L2 MS talk)

slide20

Documentation, support, etc.

deployed on

LANL systems: lobo, tu, yr, hu, rt, rr

LLNL systems: purple, bgl/dawn, linux clusters

SNL systems: redstorm

Integrated into major ASC codes at LANL and LLNL

Basis for potential JOWOG performance code comparisons

/usr/projects/codeopt/PERF/4.0/doc/

ReadMe

GettingStarted

My contact info: (505) 665-4655, jeffb@lanl.gov