HPCPI/Xtools Performance Analysis Toolset

David LaFrance-Linden
High Performance Computing Division

Overview
• HPCPI
  • Statistical sampling profiler
  • From DCPI (Digital Continuous Profiling Infrastructure)
  • Compare (vaguely) to:
    • OProfile (open source): conceptually based on DCPI
    • Caliper (from HP for Itanium): has many other modes/features
    • VTune (from Intel): with GUI
    • CodeAnalyst (from AMD): with GUI
• Xtools
  • Performance visualization tools
  • xclus: cluster-wide visualization tool
  • xperf: node-specific visualization tool

HPCPI – Standard sampling
• Set default database location:
    % setenv HPCPIDB ~/hpcpidb
• Start daemon:
    % hpcpid
    Using info for 'AMD64 (family 0Fh)' PMU
    1 tags, user definition:
    pretty  formal            interval  duty    randomize
    ------  ----------------  --------  ------  ---------
    Cycles  CPU_CLK_UNHALTED     60000  always  no
    maintainVCT = false
    1 groups; user definition:
    #  0                 1        2        3
    1  CPU_CLK_UNHALTED  <empty>  <empty>  <empty>
    ---- multiplexing interval = 1000000 ----
    Logging to /tmp/david_ll/hpcpid-hpc6.log
    Daemon is running on pid 6310

HPCPI – Standard sampling
• Run programs:
    % time ./mb_pi.O.exe -iters 100
    3.1415926535897932384626433832795028
    3.995u 0.000s 0:04.00 99.7% 0+0k 0+0io 0pf+0w
    % time ./mb_pi.g.exe -iters 100
    3.1415926535897932384626433832795028
    8.439u 0.000s 0:08.44 99.8% 0+0k 0+0io 0pf+0w
• Flush database to disk:
    % hpcpictl flush
    hpcpictl flush successful
• Analyze:
    % hpcpiprof
    % hpcpiprof ./mb_pi.g.exe ./mb_pi.O.exe
    % hpcpilist mandel_val ./mb_pi.g.exe

hpcpiprof (by image)
    % hpcpiprof
    Event Name              Events  Period  Samples
    ----------------  ------------  ------  -------
    CPU_CLK_UNHALTED  163217580000   60000  2720293

    CPU_CLK_
    UNHALTED      %    cum%  image
    --------  -----  ------  ----------------------------------
    136108e6  83.4%   83.4%  vmlinux-2.6.9-34.7hp.XCsmp.o.hpcpi
     18077e6  11.1%   94.5%  mb_pi.g.exe
      8618e6   5.3%   99.7%  mb_pi.O.exe
       345e6   0.2%  100.0%  emacs
        19e6   0.0%  100.0%  libc-2.3.4.so
         7e6   0.0%  100.0%  Xorg
         6e6   0.0%  100.0%  tg3.ko
         5e6   0.0%  100.0%  libgobject-2.0.so.0.400.7
         4e6   0.0%  100.0%  libX11.so.6.2
         4e6   0.0%  100.0%  libgdk-x11-2.0.so.0.400.13
         4e6   0.0%  100.0%  libglib-2.0.so.0.400.7
         2e6   0.0%  100.0%  hald
         2e6   0.0%  100.0%  ld-2.3.4.so
         2e6   0.0%  100.0%  ohci_hcd.ko
    ...
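The Events column in the summary is the sample count multiplied by the sampling period, and each image's percentage is its share of that total. A minimal Python sketch that re-derives those figures from the numbers printed on this slide (the constant and function names are illustrative, not part of HPCPI):

```python
# Values copied from the hpcpiprof summary on this slide.
PERIOD = 60_000        # events between samples (the "Period" column)
SAMPLES = 2_720_293    # CPU_CLK_UNHALTED samples collected


def estimated_events(samples: int, period: int) -> int:
    """Each sample stands for `period` occurrences of the event."""
    return samples * period


def percent_of_total(events: float, total: float) -> float:
    """Share of the total event count, in percent."""
    return 100.0 * events / total


TOTAL_CYCLES = estimated_events(SAMPLES, PERIOD)

# Top three rows of the by-image table, cycles as printed (e.g. 136108e6).
IMAGE_CYCLES = {
    "vmlinux-2.6.9-34.7hp.XCsmp.o.hpcpi": 136108e6,
    "mb_pi.g.exe": 18077e6,
    "mb_pi.O.exe": 8618e6,
}

IMAGE_PERCENT = {
    image: round(percent_of_total(cycles, TOTAL_CYCLES), 1)
    for image, cycles in IMAGE_CYCLES.items()
}

if __name__ == "__main__":
    print(TOTAL_CYCLES)
    for image, pct in IMAGE_PERCENT.items():
        print(f"{pct:5.1f}%  {image}")
```

Running it reproduces the 163217580000 total and the first three rows of the % column.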

hpcpiprof (by procedure)
    % hpcpiprof ./mb_pi.O.exe ./mb_pi.g.exe
    Event Name             Events  Period  Samples
    ----------------  -----------  ------  -------
    CPU_CLK_UNHALTED  26694720000   60000   444912

    CPU_CLK_
    UNHALTED      %    cum%  procedure        image
    --------  -----  ------  ---------------  -----------
    175927e5  65.9%   65.9%  mandel_val       mb_pi.g.exe
     84917e5  31.8%   97.7%  mandel_val       mb_pi.O.exe
      4840e5   1.8%   99.5%  mb_fill_in_data  mb_pi.g.exe
      1264e5   0.5%  100.0%  mb_fill_in_data  mb_pi.O.exe
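As a cross-check, the per-procedure cycles can be summed per image and the debug-to-optimized ratio compared with the ratio of the user times printed by `time` on the earlier slide. The sketch below uses only numbers from the slides; the variable names are illustrative:

```python
# Rows of the by-procedure table: (procedure, image, cycles as printed).
ROWS = [
    ("mandel_val", "mb_pi.g.exe", 175927e5),
    ("mandel_val", "mb_pi.O.exe", 84917e5),
    ("mb_fill_in_data", "mb_pi.g.exe", 4840e5),
    ("mb_fill_in_data", "mb_pi.O.exe", 1264e5),
]

# User CPU seconds reported by `time` for the two runs.
USER_SECONDS = {"mb_pi.g.exe": 8.439, "mb_pi.O.exe": 3.995}


def cycles_per_image(rows):
    """Sum the sampled cycles of all procedures belonging to each image."""
    totals = {}
    for _procedure, image, cycles in rows:
        totals[image] = totals.get(image, 0.0) + cycles
    return totals


CYCLES = cycles_per_image(ROWS)
CYCLE_RATIO = CYCLES["mb_pi.g.exe"] / CYCLES["mb_pi.O.exe"]
TIME_RATIO = USER_SECONDS["mb_pi.g.exe"] / USER_SECONDS["mb_pi.O.exe"]

if __name__ == "__main__":
    print(f"cycles (-g / -O): {CYCLE_RATIO:.3f}")
    print(f"time   (-g / -O): {TIME_RATIO:.3f}")
```

The two ratios agree to within about one percent, which is what one would expect if the sampled cycles track user CPU time.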

hpcpilist (by source/assembly)
    % hpcpilist mandel_val mb_pi.g.exe
    Event Name             Events  Period
    ----------------  -----------  ------
    CPU_CLK_UNHALTED  17592720000   60000

    CPU_CLK_UNHALTED  Source
    ----------------  -----------------------------------------------------
            34560e03  159: {
                      160:     register NUMTYPE zr = cr, zi = ci, zr2, zi2;
                      161:     register int n = 0, delta = nmax - nmin;
           302220e03  162:     register NUMTYPE rad2, four = CONSTANT4(4.0);
                      163:     register int keepgoing;
           365100e03  164:     while (((n += 1),
                      165:             (zr2 = MULT(zr,zr)),
                      166:             (zi2 = MULT(zi,zi)),
                      167:             (zi = 2*MULT(zr,zi) + ci),
                      168:             (keepgoing = (n < nmax)),
                      169:             (rad2 = zr2 + zi2),
                      170:             (zr = zr2 - zi2 + cr),
                      171:             (rad2 <= four))
                      172:            && keepgoing)
                      173:         ;
            16696e06  174:     if (cause_segv && cr > CONSTANT(0.0) ...) {
                      175:         if (1) cause_segv = 0;
                      176:         *(char*)(long)cause_segv = 1;
                      177:     }
           103620e03  178:     return (n >= nmax ? delta :
                      179:             n < delta ? n :
                      180:             n == delta ? 0 :
                      181:             n%delta);
                      182: }

HPCPI’s differentiators
• Sample rate
• Overhead
• Features
  • Can sample more than 1 event
  • Can sample arbitrary number of events
  • The ‘label’ feature
• Attention to accuracy

Sample rate and overhead
• Default sample rate higher (interval lower), minimum sample rate high in comparison:
• Low overhead: (Itanium; can’t on x86_64)

Feature: Can sample more than one event
• Useful for deriving metrics at image, routine or loop level
• So can OProfile and VTune and CodeAnalyst, but not yet Caliper
• IPC (instructions per cycle) is RETIRED_INSTRS divided by CPU_CLK_UNHALTED:
  • CPU_CLK_UNHALTED
  • RETIRED_INSTRS

Example: IPC for ‘mb_pi’
• Restart:
    % hpcpictl quit
    hpcpictl quit successful
    % hpcpid -events IPCEvents
    Using info for 'AMD64 (family 0Fh)' PMU
    2 tags, user definition:
    pretty   formal            interval  duty    randomize
    -------  ----------------  --------  ------  ---------
    Cycles   CPU_CLK_UNHALTED     60000  always  no
    Retired  RETIRED_INSTRS       60000  1       no
    maintainVCT = false
    1 groups; user definition:
    #  0                 1               2        3
    1  CPU_CLK_UNHALTED  RETIRED_INSTRS  <empty>  <empty>
    ---- multiplexing interval = 1000000 ----
    Logging to /tmp/david_ll/hpcpid-hpc6.log
    Daemon is running on pid 6365

Example: IPC for ‘mb_pi’
• Collect and report:
    % ./mb_pi.O.exe -iters 100
    3.1415926535897932384626433832795028
    % ./mb_pi.g.exe -iters 100
    3.1415926535897932384626433832795028
    % hpcpictl flush
    hpcpictl flush successful
    % hpcpiprof mb_pi.g.exe mb_pi.O.exe
    Event Name             Events  Period  Samples
    ----------------  -----------  ------  -------
    CPU_CLK_UNHALTED  27032280000   60000   450538
    RETIRED_INSTRS    25146600000   60000   419110

    CPU_CLK_                 RETIRED_
    UNHALTED      %    cum%    INSTRS  procedure        image
    --------  -----  ------  --------  ---------------  -----------
    177810e5  65.8%   65.8%  145161e5  mandel_val       mb_pi.g.exe
     86298e5  31.9%   97.7%  100236e5  mandel_val       mb_pi.O.exe
      5000e5   1.8%   99.6%    4334e5  mb_fill_in_data  mb_pi.g.exe
      1213e5   0.4%  100.0%    1735e5  mb_fill_in_data  mb_pi.O.exe
         1e5   0.0%  100.0%         0  main             mb_pi.g.exe
         1e5   0.0%  100.0%         0  main             mb_pi.O.exe
    % tclsh
    % expr {145161e5 / 177810e5}
    0.816382655644
    % expr {100236e5 / 86298e5}
    1.16151011611
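The two `expr` lines compute IPC for `mandel_val` by hand. The same calculation for every row of the table can be sketched in Python as follows (the data is copied from the slide; the names `ROWS` and `ipc` are illustrative):

```python
# (procedure, image, CPU_CLK_UNHALTED, RETIRED_INSTRS) as printed by hpcpiprof.
ROWS = [
    ("mandel_val", "mb_pi.g.exe", 177810e5, 145161e5),
    ("mandel_val", "mb_pi.O.exe", 86298e5, 100236e5),
    ("mb_fill_in_data", "mb_pi.g.exe", 5000e5, 4334e5),
    ("mb_fill_in_data", "mb_pi.O.exe", 1213e5, 1735e5),
    ("main", "mb_pi.g.exe", 1e5, 0.0),
    ("main", "mb_pi.O.exe", 1e5, 0.0),
]


def ipc(cycles: float, instructions: float) -> float:
    """Instructions per cycle; 0.0 when no cycles were sampled."""
    return instructions / cycles if cycles else 0.0


IPC = {(proc, image): ipc(cycles, instrs) for proc, image, cycles, instrs in ROWS}

if __name__ == "__main__":
    for (proc, image), value in IPC.items():
        print(f"{value:6.3f}  {proc:<16} {image}")
```

With only a sample or two in `main`, its IPC is not meaningful; the ratio is only as good as the sample counts behind it.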

Multiplex arbitrary events
• DCacheEvents
  • CPU_CLK_UNHALTED
  • RETIRED_INSTRS
  • DISPATCH_STALLS
  • DATA_CACHE_ACCESSES
  • DATA_CACHE_MISSES
  • DATA_CACHE_REFILLS_FROM_L2.ALL
  • DATA_CACHE_REFILLS_FROM_SYSTEM.ALL
  • DATA_CACHE_LINES_EVICTED.ALL
  • L1DTLB_MISS_L2DTLB_HIT
  • L1DTLB_AND_L2DTLB_MISS
  • L2_REQUESTS.ALL
  • L2_MISSES.ALL
  • L2_FILLS.ALL
• Why not just do them all?
• Or more!
• Unique to HPCPI

DCacheEvents event set on dear_rate
• Setup:
    % hpcpid -events DCacheEvents
    Using info for 'AMD64 (family 0Fh)' PMU
    13 tags, user definition:
    pretty     formal                              interval  duty    randomize
    ---------  ----------------------------------  --------  ------  ---------
    Cycles     CPU_CLK_UNHALTED                       60000  always  no
    Retired    RETIRED_INSTRS                         60000  1       no
    StallsAll  DISPATCH_STALLS                        60000  1       no
               DATA_CACHE_ACCESSES                    12000  1       no
               DATA_CACHE_MISSES                      12000  1       no
               DATA_CACHE_REFILLS_FROM_L2.ALL         12000  1       no
               DATA_CACHE_REFILLS_FROM_SYSTEM.ALL     12000  1       no
               DATA_CACHE_LINES_EVICTED.ALL           12000  1       no
               L1DTLB_MISS_L2DTLB_HIT                 12000  1       no
               L1DTLB_AND_L2DTLB_MISS                 12000  1       no
               L2_REQUESTS.ALL                        12000  1       no
               L2_MISSES.ALL                          12000  1       no
               L2_FILLS.ALL                           12000  1       no
    maintainVCT = false
    4 groups; user definition:
    #  0                 1                             2                               3
    1  CPU_CLK_UNHALTED  RETIRED_INSTRS                DISPATCH_STALLS                 DATA_CACHE_ACCESSES
    2  CPU_CLK_UNHALTED  DATA_CACHE_MISSES             DATA_CACHE_REFILLS_FROM_L2.ALL  DATA_CACHE_REFILLS_FROM_SYSTEM.ALL
    3  CPU_CLK_UNHALTED  DATA_CACHE_LINES_EVICTED.ALL  L1DTLB_MISS_L2DTLB_HIT          L1DTLB_AND_L2DTLB_MISS
    4  CPU_CLK_UNHALTED  L2_REQUESTS.ALL               L2_MISSES.ALL                   L2_FILLS.ALL
    ---- multiplexing interval = 1000000 ----
    Logging to /tmp/david_ll/hpcpid-hpc6.log
    Daemon is running on pid 6877

DCacheEvents event set on dear_rate
• Run:
    % (limit cpu 20sec; ./dear_rate.x86_64-Linux.exe 8 4 20 91)
    8 reads in blocks of 4, increment 91
      20000 iters in   19366069 cycles = 968.30 cycles/iter, 145.36MB/sec
      40000 iters in   37588341 cycles = 939.71 cycles/iter, 149.80MB/sec
     160000 iters in  144011554 cycles = 900.07 cycles/iter, 156.42MB/sec
    1111987 iters in  999881564 cycles = 899.18 cycles/iter, 156.58MB/sec
    1111987 iters in 1000096567 cycles = 899.38 cycles/iter, 156.54MB/sec
    1111987 iters in  999305772 cycles = 898.67 cycles/iter, 156.67MB/sec
    Cputime limit exceeded
• Flush:
    % hpcpictl flush
    hpcpictl flush successful

DCacheEvents event set on dear_rate
• Observe:
    % hpcpiprof ./dear_rate.x86_64-Linux.exe
    Event Name                               Events  Period  Samples  Active Fraction
    ----------------------------------  -----------  ------  -------  ---------------
    CPU_CLK_UNHALTED                    42096660000   60000   701611          100.00%
    RETIRED_INSTRS                       2004960000   60000     8354           25.00%
    DISPATCH_STALLS                     41377200000   60000   172405           25.00%
    DATA_CACHE_ACCESSES                   729216000   12000    15192           25.00%
    DATA_CACHE_MISSES                     395664000   12000     8243           25.00%
    DATA_CACHE_REFILLS_FROM_L2.ALL        394944000   12000     8228           25.00%
    DATA_CACHE_REFILLS_FROM_SYSTEM.ALL    389664000   12000     8118           25.00%
    DATA_CACHE_LINES_EVICTED.ALL          784080000   12000    16335           25.00%
    L1DTLB_MISS_L2DTLB_HIT                  4560000   12000       95           25.00%
    L1DTLB_AND_L2DTLB_MISS                 71712000   12000     1494           25.00%
    L2_REQUESTS.ALL                       490704000   12000    10223           25.00%
    L2_MISSES.ALL                         390336000   12000     8132           25.00%
    L2_FILLS.ALL                          395904000   12000     8248           25.00%

                                                                      DATA_CACHE_ DATA_CACHE_ DATA_CACHE_
                                                                         REFILLS_    REFILLS_      LINES_    L1DTLB_
    CPU_CLK_               RETIRED_ DISPATCH_ DATA_CACHE_ DATA_CACHE_     FROM_L2 FROM_SYSTEM     EVICTED      MISS_ L1DTLB_AND_ L2_REQUESTS L2_MISSES L2_FILLS
    UNHALTED      %   cum%   INSTRS    STALLS    ACCESSES      MISSES        .ALL        .ALL        .ALL L2DTLB_HIT L2DTLB_MISS        .ALL      .ALL     .ALL procedure     image
    -------- ------ ------ -------- --------- ----------- ----------- ----------- ----------- ----------- ---------- ----------- ----------- --------- -------- ------------- --------------------------
    420965e5 100.0% 100.0%  20050e5  413772e5      7292e5      3957e5      3949e5      3897e5      7841e5       46e5       717e5      4907e5    3903e5   3959e5 read_memory8d dear_rate.x86_64-Linux.exe
         1e5   0.0% 100.0%        0         0         0e5           0           0           0           0          0           0           0         0        0 main          dear_rate.x86_64-Linux.exe
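With four multiplexed groups, every event except CPU_CLK_UNHALTED is counted only a quarter of the time, so the Events column appears to be the sample count times the period, scaled up by the Active Fraction. The following Python sketch checks that relationship against the printed numbers and derives a few whole-program ratios. The scaling formula is inferred from the table rather than taken from HPCPI documentation, and the metric names are illustrative.

```python
# event: (samples, period, active_fraction), copied from the summary table.
SUMMARY = {
    "CPU_CLK_UNHALTED": (701611, 60000, 1.00),
    "RETIRED_INSTRS": (8354, 60000, 0.25),
    "DISPATCH_STALLS": (172405, 60000, 0.25),
    "DATA_CACHE_ACCESSES": (15192, 12000, 0.25),
    "DATA_CACHE_MISSES": (8243, 12000, 0.25),
}


def scaled_events(samples: int, period: int, active_fraction: float) -> float:
    """Extrapolate a multiplexed event count to the whole run.

    Assumption: the event behaves the same while its group is not active.
    """
    return samples * period / active_fraction


EVENTS = {name: scaled_events(*row) for name, row in SUMMARY.items()}

# Illustrative derived metrics for the whole program.
IPC = EVENTS["RETIRED_INSTRS"] / EVENTS["CPU_CLK_UNHALTED"]
L1_MISS_RATIO = EVENTS["DATA_CACHE_MISSES"] / EVENTS["DATA_CACHE_ACCESSES"]
STALL_FRACTION = EVENTS["DISPATCH_STALLS"] / EVENTS["CPU_CLK_UNHALTED"]

if __name__ == "__main__":
    for name, value in EVENTS.items():
        print(f"{value:14.0f}  {name}")
    print(f"IPC            {IPC:.4f}")
    print(f"L1 miss ratio  {L1_MISS_RATIO:.4f}")
    print(f"stall fraction {STALL_FRACTION:.4f}")
```

The scaled counts match the Events column exactly, and the very low IPC with stalls on nearly every cycle is consistent with a memory-bound benchmark.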

The ‘label’ feature
• Partitions samples, usually based on process(es)
• See the man page for hpcpilabel
• DCPI classic label:
    % hpcpictl label run1 a.out one 1 uno
    % hpcpictl label run2 a.out two 2 dos
• Restrict to a script and its children:
    % hpcpictl label specs -pgid this runSpec
• Snapshot a system-wide interval:
    % hpcpictl label oneMinute -pid -1 -not sleep 60
• “Attach” to a process:
    % hpcpictl label attached -pid desiredPID sleep 99999
• Monitor the idle process on CPU 0 for 5 minutes:
    % hpcpictl label pid0cpu0 -pid 0 -cpu 0 -and sleep 300
• Can be initiated and managed by programs
  • Use popen() of hpcpictl with ‘-pgid this’ or ‘-pidparent’
  • Don’t forget to hpcpictl flush
• Use ‘-label labelName’ with the analysis tools
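A program that manages labels itself might wrap the command lines shown on this slide. The Python sketch below uses `subprocess` as an analogue of the popen() approach. The helper names are hypothetical, the argument order (label name, selectors, command) simply mirrors the slide's examples with flags written using ASCII hyphens, and the meaning of hpcpictl's exit status is not confirmed by the source.

```python
import subprocess
from typing import List, Sequence


def label_command(label: str, selectors: Sequence[str],
                  command: Sequence[str]) -> List[str]:
    """Build an `hpcpictl label` argument list in the order the slide shows:
    label name, then selector flags, then the command to run."""
    return ["hpcpictl", "label", label, *selectors, *command]


def run_labeled(label: str, selectors: Sequence[str],
                command: Sequence[str]) -> int:
    """Run a command under a label, then flush samples to the database.

    Returns whatever exit status hpcpictl reports for the label step
    (assumption: its relation to the command's own status is unverified).
    """
    status = subprocess.run(label_command(label, selectors, command)).returncode
    subprocess.run(["hpcpictl", "flush"], check=True)
    return status


if __name__ == "__main__":
    # Only prints the command lines; call run_labeled() on a system with HPCPI.
    print(" ".join(label_command("specs", ["-pgid", "this"], ["runSpec"])))
    print(" ".join(label_command("oneMinute", ["-pid", "-1", "-not"],
                                 ["sleep", "60"])))
```

Passing the arguments as a list avoids shell quoting issues when the labeled command itself takes arguments.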

Attention to accuracy (Itanium)
• Wrote micro-benchmarks with known behavior
• Eliminated post-unfreeze-pre-RFI event leaks
  • Micro-benchmark has no NOPs nor any predicate-squashed instructions
• Determined event-based multiplexing better than time-based
  • Micro-benchmark has known (high) IPC

Xtools
• Pair of visualization tools
• Separable and cooperative with HPCPI
• xclus
  • Cluster-wide monitoring
  • Utilizations: CPU, DRAM, HyperTransports
• xperf
  • Single-node monitoring
  • Graphs of derived events based on hardware counters
  • CPU utilization, IPC, cycle accounting, cache penalties, I/O activity, etc.

Basic structure of a system
[Diagram: processors, each with local DRAM and two CPUs/cores; MPI ranks 0, 1, 2, …, N-3, N-2, N-1, with 1/11 and 10/11 transfers between the outer ranks, ending in MPI_Finalize()]
For icon-design of xclus:
• Processors with CPUs/cores
• Local DRAM
• HyperTransports
mpi_two_way 1/11:
• Outer processes exchange data
• Rank 0 sends 1/11 to rank n-1; receives 10/11
• Rank n-1 sends 10/11 to rank 0; receives 1/11
• 1:10 ratio; easy to see

xclus, cluster running mpi_two_way; default: Utilization

xclus, cluster running mpi_two_way; show bandwidth (control and data)

xclus, cluster running mpi_two_way; show bandwidth, data only

xclus, cluster running mpi_two_way; wave-over pop-up

xclus node grouping
[Diagram: four jobs, each with 12 ranks labeled 0 to 9, A, B; 1/11 and 10/11 transfers between the outer ranks of each job, each ending in MPI_Finalize()]
Run mpi_two_way on 12 processes (3 nodes), 4 times:
• bsub -n 12 [ xterm -e ] mpirun -srun mpi_two_way 1/11
• bsub -n 12 [ xterm -e ] mpirun -srun mpi_two_way 1/11
• bsub -n 12 [ xterm -e ] mpirun -srun mpi_two_way 1/11
• bsub -n 12 [ xterm -e ] mpirun -srun mpi_two_way 1/11
Without node grouping:
• xclus [ -no-group-nodes ]
Force node grouping:
• xclus -group-nodes

xclus, (4x3)x4 mpi_two_way, not grouped

xclus, (4x3)x4 mpi_two_way, grouped

xclus, (4x3)x4 mpi_two_way, grouped, node waveover

xclus, (4x3)x4 mpi_two_way, grouped, HyperTransport waveover

xperf: node-specific time-graphs of counter-based metrics
• xperf can be started by itself
  • xperf -node nodename
• Or by clicking a node in xclus
• Demonstration program: memr_rate
  • Run in such a way that
    • On CPU 0: hits in L1 cache, gets 12+ GB/sec
    • On CPU 1: misses L1, hits L2, gets 2 GB/sec
    • On CPU 2: starts missing L2, gets 700 MB/sec
    • On CPU 3: misses L2, gets 400 MB/sec

xperf, initially

xperf, hide several graphs

xperf, with tear-off color keys

xperf: HPCPI’s initial presentation

xperf/HPCPI, top image

xperf/HPCPI, top procedure

xperf/HPCPI, top procedure, scrolled

Recap
• HPCPI
  • Sampling profiler
  • High frequency
  • Low overhead (Itanium)
  • Arbitrary events (auto-placement, auto-multiplexing)
  • Attention to accuracy
  • ‘label’ feature
• Xtools
  • xclus: cluster-wide utilization visualizer
  • xperf: node-specific time-graphs of counter-based metrics
  • Integrated with HPCPI

[End]
• Questions?
• Discussion?
• Break
• Next: HPCPI documentation review
