slide1

PerfLib Overview

Jeff Brown

LANL/CCS-7

July 28, 2010

slide2

PerfLib is ...

a performance measurement and analysis tool that consists of two components:

- a run-time library (libperfrt), and

- a post-processing library, scripts, and utility programs

slide3

How to use PerfLib

1. instrument the code

leverage existing timing infrastructure (rage, flag, partisn, ...)

instrumentation moves with the code to new systems

calipers must be “well formed” (beware alternate returns; see the caliper sketch after this list)

2. build the code linking the run-time library

minor modifications to the build system

3. run code to collect data

trigger data collection via:

- environment variable settings or

- library calls

4. post-process to analyze results (runpp, perfpp, etc.)
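
A minimal caliper sketch for step 1, assuming a C code. The slides name libperfrt but not its API, so perf_begin/perf_end below are illustrative placeholders, not the real entry points. The point being illustrated is the "well formed" requirement: every exit path from an instrumented routine must close the caliper it opened.

```c
#include <stdio.h>

/* Hypothetical caliper API: stand-ins for whatever libperfrt
   actually exports (not shown in the slides). */
static void perf_begin(const char *name) { printf("enter %s\n", name); }
static void perf_end(const char *name)   { printf("exit  %s\n", name); }

/* A "well formed" caliper: every exit path closes what it opened. */
static double hydro_step(double dt) {
    perf_begin("hydro");
    if (dt <= 0.0) {          /* early return: still close the caliper */
        perf_end("hydro");
        return 0.0;
    }
    /* ... real hydro work would go here ... */
    perf_end("hydro");
    return dt;
}

int main(void) {
    hydro_step(0.5);
    hydro_step(-1.0);
    return 0;
}
```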

slide4

Performance Data Collected

Profiling:

- timing

- MPI (via the MPI profiling interface; no additional instrumentation required, see the PMPI sketch below)

- hardware counters (e.g. flops, cache, tlb, etc.)

- memory allocation (RSS footprint)

- IO

Tracing:

- timing/MPI

- memory allocation
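
Why MPI data needs no source changes: the MPI standard's profiling interface lets a tool library define its own MPI_* entry points that time the call and forward to the real PMPI_* implementation, so linking the library is enough. A minimal sketch of that mechanism (illustrative, not PerfLib's actual wrapper; record_mpi_time is a hypothetical helper):

```c
#include <mpi.h>
#include <stdio.h>

/* Stand-in for the tool's bookkeeping; a real tool would credit the
   time to the current call-tree node. */
static void record_mpi_time(const char *name, double seconds) {
    fprintf(stderr, "%s: %.6f s\n", name, seconds);
}

/* PMPI wrapper: intercept MPI_Allreduce, time it, forward to the
   real implementation via the PMPI_ name. */
int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm) {
    double t0 = MPI_Wtime();
    int rc = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    record_mpi_time("MPI_Allreduce", MPI_Wtime() - t0);
    return rc;
}
```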

slide5

Post Processing

perfpp: a do-it-all script that discovers what data was collected and generates all possible reports/plots

runpp: generates only the reports

some sample reports and plots ...

slide6

PerfLib Overhead

tracked: ~2% for timing/MPI

12% for hardware counters, 21% for memory allocation

slide7

PerfLib Header Report for 64 PE Comet Impact Run on tu

PERFlib version 3.0

Data path: /scratch2/jeffb/tu/PerfTrack/3.0/crestone/timing/user_problems/comet/2.56km/48/64/20100324

Data directory: comet-20100324

Performance data header:

run date: 20100324

OS version: Linux

code version: xrage.1003.00

compiled with: unknown

MPI version: /usr/projects/packages/openmpi/tu133/openmpi-intel-1.3.3

problem name: comet.input

hosts: 64 processes running on (all processors with 2.30GHz cpu, 512 KB level 2 cache, 1024 TLB entries, 32189.25 MB physical memory)

tua043:tua043.localdomain ×16, tua043:tua044.localdomain ×16, tua043:tua046.localdomain ×16, tua043:tua051.localdomain ×16

Profile metrics:

Time (wall clock) enabled (PAPI timer)

Counters (hardware performance data) not enabled (to enable, setenv PERF_PROFILE_COUNTERS)

Memory not enabled (to enable, setenv PERF_PROFILE_MEMORY)

MPI enabled

IO enabled

Trace metrics:

Memory not enabled (to enable, setenv PERF_TRACE_MEMORY)

Performance data dump frequency - every 10 cycles
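
The report's "to enable, setenv PERF_PROFILE_COUNTERS" wording suggests an option is on whenever its variable is defined at all (csh setenv with no value still defines it). A sketch of how a run-time library can key collection off such variables; the variable names come from the report above, but the check itself is an illustrative assumption, not PerfLib source:

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Enabled iff the environment variable is defined, regardless of
   its value (matches the "setenv NAME" usage in the report). */
static bool env_enabled(const char *name) {
    return getenv(name) != NULL;
}

int main(void) {
    printf("counters: %s\n",
           env_enabled("PERF_PROFILE_COUNTERS") ? "enabled" : "not enabled");
    printf("memory:   %s\n",
           env_enabled("PERF_PROFILE_MEMORY") ? "enabled" : "not enabled");
    return 0;
}
```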

slide8

Tuesday night comet impact run on tu

timing report filtered at 3%

rank 0, last cycle cumulative

Cycle: 7234, MPI rank: 0, all instrumentation levels

dumping elapsed time performance data since start of run (inclusive nested calling tree)

routines with average time/call > 10 us after 1000 calls

skipping routines with < 3% wall clock time and

< 1e+08 bytes sent/rcvd and

< 1e+06 bytes written

controller 100.00%; 177.89 s(177.89 s); 1 call

+ controller_0 99.92%; 177.74 s(177.74 s); 1 call

+ . controller_3 96.71%; 172.04 s(172.04 s); 96 calls (1.792 s/call avg, 1.324 s min, 3.130 s max)

+ . | cycle 94.46%; 168.04 s(168.04 s); 95 calls (1.769 s/call avg, 1.578 s min, 2.005 s max)

+ . | . hydro 65.71%; 116.89 s(116.89 s); 95 calls (1.230 s/call avg, 1.210 s min, 1.363 s max)

+ . | . + cdt 10.13%; 18.025 s(18.025 s); 285 calls (0.063 s/call avg, 0.056 s min, 0.072 s max)

+ . | . + . xmeos 6.10%; 10.859 s(10.859 s); 285 calls (0.038 s/call avg, 0.035 s min, 0.046 s max)

+ . | . + . token_allreduce 3.29%; 5.847 s(5.847 s); 285 calls (0.021 s/call avg, 0.015 s min, 0.028 s max)

+ . | . + . | MPI_Allreduce 3.28%; 5.842 s(5.842 s); 285 calls (0.020 s/call avg, 0.015 s min, 0.027 s max)

2.23 KB sent (avg: 8 B, BW: 0.000372168 MB/s);

2.23 KB rcvd (avg: 8 B, BW: 0.000372168 MB/s)

+ . | . + hydro_lanl_1 55.53%; 98.774 s(98.774 s); 190 calls (0.520 s/call avg, 0.507 s min, 0.586 s max)

+ . | . + . d_common 19.79%; 35.206 s(35.206 s); 3420 calls (0.010 s/call avg, 8739 us min, 0.043 s max)

+ . | . + . | . MPI_Irecv 0.02%; 0.038 s(0.038 s); 17064 calls (0.000002 s/call avg, 1 us min, 21 us max)

99046.69 KB rcvd (avg: 5943.73 B, BW: 2516 MB/s)

+ . | . + . | (d_common exclusive) 16.12%; 28.673 s(28.673 s);

+ . | . + . h_1_advect_vol 5.53%; 9.829 s(9.829 s); 190 calls (0.052 s/call avg, 0.050 s min, 0.060 s max)

+ . | . + . | d_common_vec 4.48%; 7.965 s(7.965 s); 190 calls (0.042 s/call avg, 0.041 s min, 0.047 s max)

+ . | . + . | . (d_common_vec exclusive) 4.29%; 7.639 s(7.639 s);

+ . | . + . d_fvol 8.18%; 14.558 s(14.558 s); 1140 calls (0.013 s/call avg, 0.011 s min, 0.028 s max)

+ . | . + . | (d_fvol exclusive) 7.26%; 12.912 s(12.912 s);

+ . | . + . seteng 5.77%; 10.272 s(10.272 s); 190 calls (0.054 s/call avg, 0.054 s min, 0.061 s max)

+ . | . + . (hydro_lanl_1 exclusive) 11.26%; 20.025 s(20.025 s);

+ . | . calscr 3.32%; 5.906 s(5.906 s); 95 calls (0.062 s/call avg, 0.061 s min, 0.075 s max)

+ . | . freeze_restore 3.97%; 7.066 s(7.066 s); 95 calls (0.074 s/call avg, 0.071 s min, 0.086 s max)

+ . | . recon 17.63%; 31.362 s(31.362 s); 95 calls (0.330 s/call avg, 0.154 s min, 0.470 s max)

+ . | . + cdt 3.04%; 5.414 s(5.414 s); 95 calls (0.057 s/call avg, 0.056 s min, 0.060 s max)

+ . | . + cell_get 3.18%; 5.650 s(5.650 s); 31480 calls (0.000179 s/call avg, 51 us min, 2269 us max)

+ . | . + . token_get 3.04%; 5.400 s(5.400 s); 31480 calls (0.000172 s/call avg, 47 us min, 2263 us max)

+ . | . + . | MPI_Issend 0.08%; 0.150 s(0.150 s); 47160 calls (0.000003 s/call avg, 1 us min, 69 us max)

144710.39 KB sent (avg: 3142.14 B, BW: 943.477 MB/s)

+ . | . cdt 3.24%; 5.763 s(5.763 s); 95 calls (0.061 s/call avg, 0.060 s min, 0.063 s max)

+ . | . + . | pwrite 0.04%; 0.067 s(0.067 s); 13 calls (0.005147 s/call avg, 4546 us min, 5696 us max)

65.00 MB written @ 971.3671 MB/s

call tree stats:

depth: 11

nodes: 794
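
How to read the "(... exclusive)" lines above: exclusive time is a routine's inclusive time minus the inclusive time of its instrumented children, i.e. time spent in the routine's own body. A small sketch of that bookkeeping, with a hypothetical node layout and made-up numbers (not PerfLib's actual data structure):

```c
#include <stdio.h>

/* Hypothetical call-tree node for illustration only. */
typedef struct tree_node {
    const char *name;
    double incl_s;                  /* inclusive wall-clock seconds */
    const struct tree_node **kids;  /* instrumented callees */
    int nkids;
} tree_node;

/* exclusive = inclusive minus the children's inclusive time */
static double exclusive_s(const tree_node *n) {
    double excl = n->incl_s;
    for (int i = 0; i < n->nkids; i++)
        excl -= n->kids[i]->incl_s;
    return excl;
}

int main(void) {
    tree_node leaf = {"child", 6.5, NULL, 0};       /* made-up numbers */
    const tree_node *kids[] = {&leaf};
    tree_node parent = {"parent", 10.0, kids, 1};
    printf("(%s exclusive) %.3f s\n", parent.name, exclusive_s(&parent));
    return 0;
}
```

Note that children filtered out of the printed report (below the 3% threshold) still reduce the parent's exclusive time, so the children shown need not sum to inclusive minus exclusive.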

slide9

memory allocation report

Cycle: 7234, MPI rank: 0, all instrumentation levels

dumping memory allocation performance data since start of run (inclusive nested calling tree)

physical memory on node: 32189.25 MBytes (hostname: tua023:tua023.localdomain)

memory allocated (rss): 102.95 MBytes (44.8359 MBytes allocated prior to 1st instrumentation point)

minimum free memory: 26665.55 MBytes (82.84%)

total page faults: 71 (22 page faults prior to 1st instrumentation point)

dumping memory allocation performance data (rss growth) - inclusive nested calling tree

skipping routines with < 3% allocated memory (rss growth) and

< 3% page faults

controller 100.00%, 58.1094 Mbytes, 49 page faults

+ controller_0 97.22%, 56.4922 Mbytes, 47 page faults

+ . tread 34.43%, 20.0078 Mbytes, 2 page faults

+ . | pio_open 16.61%, 9.65234 Mbytes

+ . | . universal_file_read_common 15.45%, 8.97656 Mbytes

+ . | . + bulkio_read_s 8.69%, 5.05078 Mbytes

+ . | . + . bulkio_read_d 8.69%, 5.05078 Mbytes

+ . | . + . | pread 8.60%, 5 Mbytes

+ . | . + token_bcast 6.72%, 3.90234 Mbytes

+ . | resize 15.51%, 9.01172 Mbytes, 1 page faults

+ . | . (resize exclusive) 15.49%, 9.00391 Mbytes, 1 page faults

+ . restart 30.14%, 17.5156 Mbytes, 10 page faults

+ . | bldint 10.43%, 6.05859 Mbytes

+ . | . resize 6.60%, 3.83594 Mbytes

+ . | . + (resize exclusive) 6.60%, 3.83594 Mbytes

+ . | seteng 5.17%, 3.00391 Mbytes

+ . | (restart exclusive) 10.78%, 6.26562 Mbytes, 8 page faults

+ . controller_3 29.69%, 17.25 Mbytes, 21 page faults

+ . | cycle 27.42%, 15.9336 Mbytes

+ . | . hydro 25.16%, 14.6211 Mbytes

+ . | . + hydro_lanl_1 22.63%, 13.1484 Mbytes

+ . | . + . d_common 4.45%, 2.58594 Mbytes

+ . | . + . | (d_common exclusive) 4.30%, 2.5 Mbytes

+ . | . + . h_1_advect_vol 7.32%, 4.25391 Mbytes

+ . | . + . | d_common_vec 7.31%, 4.24609 Mbytes

+ . | . + . | . (d_common_vec exclusive) 7.10%, 4.125 Mbytes

+ . | . + . seteng 4.43%, 2.57422 Mbytes

+ . | . + . | d_common_vec 4.43%, 2.57422 Mbytes

+ . | . + . | . (d_common_vec exclusive) 4.43%, 2.57422 Mbytes

+ . | . + . (hydro_lanl_1 exclusive) 6.06%, 3.52344 Mbytes

call tree stats:

depth: 10

nodes: 438
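
The rss numbers above track resident-set growth between instrumentation points. One plausible way to sample RSS on Linux (the slides do not show PerfLib's actual mechanism) is the second field of /proc/self/statm, which reports resident pages:

```c
#include <stdio.h>
#include <unistd.h>

/* Read resident-set size from /proc/self/statm (fields are in
   pages: total size, then resident). Returns MBytes, or -1.0 on
   error. Growth attribution would difference samples taken at
   caliper entry and exit. */
static double rss_mbytes(void) {
    long size_pages = 0, rss_pages = 0;
    FILE *f = fopen("/proc/self/statm", "r");
    if (!f || fscanf(f, "%ld %ld", &size_pages, &rss_pages) != 2) {
        if (f) fclose(f);
        return -1.0;
    }
    fclose(f);
    return rss_pages * (double)sysconf(_SC_PAGESIZE) / (1024.0 * 1024.0);
}

int main(void) {
    printf("rss: %.2f MBytes\n", rss_mbytes());
    return 0;
}
```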

slide10

flop data integrated into call tree

Cycle: 7234, MPI rank: 0, all instrumentation levels

dumping elapsed time performance data since start of run (inclusive nested calling tree)

routines with average time/call > 10 us after 1000 calls

skipping routines with < 3% wall clock time and

< 1e+08 bytes sent/rcvd and

< 1e+06 bytes written

Peak Mflops: 4600.00

Mflops: 0.82 ( 0.02% of peak)

controller 100.00%; 201.30 s(201.30 s); 1 call

164972896 FP_INS, 0.820 Mf/s, 0.072 s ( 0.04%)

141760805912 L1_DCA, 704 L1_DCA/us, 0.00116374 f/L1_DCA

2686875312 L1_DCM, 13 L1_DCM/us, 0.0613995 f/L1_DCM, 1.90% L1_DCM/L1_DCA

623965879 L2_DCM, 3 L2_DCM/us, 0.264394 f/L2_DCM, 23.22% L2_DCM/L2_DCA

+ controller_0 99.46%; 200.20 s(200.20 s); 1 call

164972808 FP_INS, 0.824 Mf/s, 0.072 s ( 0.04%)

140023030844 L1_DCA, 699 L1_DCA/us, 0.00117818 f/L1_DCA

2657992526 L1_DCM, 13 L1_DCM/us, 0.0620667 f/L1_DCM, 1.90% L1_DCM/L1_DCA

623906059 L2_DCM, 3 L2_DCM/us, 0.264419 f/L2_DCM, 23.47% L2_DCM/L2_DCA

+ . controller_3 96.56%; 194.38 s(194.38 s); 96 calls (2.025 s/call avg, 1.745 s min, 3.419 s max)

164674809 FP_INS, 0.847 Mf/s, 0.072 s ( 0.04%)

133280135424 L1_DCA, 686 L1_DCA/us, 0.00123555 f/L1_DCA

2527669073 L1_DCM, 13 L1_DCM/us, 0.0651489 f/L1_DCM, 1.90% L1_DCM/L1_DCA

621185981 L2_DCM, 3 L2_DCM/us, 0.265097 f/L2_DCM, 24.58% L2_DCM/L2_DCA

+ . | cycle 93.81%; 188.84 s(188.84 s); 95 calls (1.988 s/call avg, 1.738 s min, 2.199 s max)

132328667 FP_INS, 0.701 Mf/s, 0.058 s ( 0.03%)

129778334303 L1_DCA, 687 L1_DCA/us, 0.00101965 f/L1_DCA

2468965744 L1_DCM, 13 L1_DCM/us, 0.0535968 f/L1_DCM, 1.90% L1_DCM/L1_DCA

617737747 L2_DCM, 3 L2_DCM/us, 0.214215 f/L2_DCM, 25.02% L2_DCM/L2_DCA

+ . | . hydro 59.07%; 118.91 s(118.91 s); 95 calls (1.252 s/call avg, 1.228 s min, 1.379 s max)

78559433 FP_INS, 0.661 Mf/s, 0.034 s ( 0.03%)

71964911416 L1_DCA, 605 L1_DCA/us, 0.00109164 f/L1_DCA

1414105604 L1_DCM, 12 L1_DCM/us, 0.0555541 f/L1_DCM, 1.96% L1_DCM/L1_DCA

487050183 L2_DCM, 4 L2_DCM/us, 0.161296 f/L2_DCM, 34.44% L2_DCM/L2_DCA
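
The derived metrics can be checked from the raw counts on the controller line: Mflops = FP_INS / elapsed = 164972896 / 201.30 s ≈ 0.82 Mf/s; %peak = 0.82 / 4600.00 ≈ 0.02%; and f/L1_DCA = 164972896 / 141760805912 ≈ 0.00116, matching the report.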

slide12

Memory Allocation by Routine by Rank

shows the effect of I/O processors (bulkio)

slide17

%peak by flops/memory reference

(idea for this from Mack Kenamond's Shavano L2 MS talk)

slide20

Documentation, support, etc.

deployed on

LANL systems: lobo, tu, yr, hu, rt, rr

LLNL systems: purple, bgl/dawn, linux clusters

SNL systems: redstorm

Integrated into major ASC codes at LANL and LLNL

Basis for potential JOWOG performance code comparisons

/usr/projects/codeopt/PERF/4.0/doc/

ReadMe

GettingStarted

My contact info: (505) 665-4655, jeffb@lanl.gov