
HPCPI/Xtools Performance Analysis Toolset

David LaFrance-Linden

High Performance Computing Division

Overview
  • HPCPI
    • Statistical sampling profiler
    • From DCPI (Digital Continuous Profiling Infrastructure)
    • Compare (vaguely) to:
      • OProfile (open source): conceptually based on DCPI
      • Caliper (from HP for Itanium): has many other modes/features
      • VTune (from Intel): with GUI
      • CodeAnalyst (from AMD): with GUI
  • Xtools
    • Performance visualization tools
    • xclus: cluster-wide visualization tool
    • xperf: node-specific visualization tool

HPCPI – Standard sampling
  • Set default database location

% setenv HPCPIDB ~/hpcpidb

  • Start daemon:

% hpcpid
Using info for 'AMD64 (family 0Fh)' PMU
1 tags, user definition:
pretty formal           interval duty   randomize
------ ---------------- -------- ------ ---------
Cycles CPU_CLK_UNHALTED    60000 always no
maintainVCT = false
1 groups; user definition:
# 0                1 2 3
1 CPU_CLK_UNHALTED
----
multiplexing interval = 1000000
----
Logging to /tmp/david_ll/hpcpid-hpc6.log
Daemon is running on pid 6310

HPCPI – Standard sampling
  • Run programs:

% time ./mb_pi.O.exe -iters 100
3.1415926535897932384626433832795028
3.995u 0.000s 0:04.00 99.7% 0+0k 0+0io 0pf+0w
% time ./mb_pi.g.exe -iters 100
3.1415926535897932384626433832795028
8.439u 0.000s 0:08.44 99.8% 0+0k 0+0io 0pf+0w

  • Flush database to disk

% hpcpictl flush
hpcpictl flush successful

  • Analyze

% hpcpiprof
% hpcpiprof ./mb_pi.g.exe ./mb_pi.O.exe
% hpcpilist mandel_val ./mb_pi.g.exe

hpcpiprof (by image)

% hpcpiprof
Event Name       Events       Period Samples
---------------- ------------ ------ -------
CPU_CLK_UNHALTED 163217580000  60000 2720293

CPU_CLK_
UNHALTED     %   cum% image
-------- ----- ------ ----------------------------------
136108e6 83.4%  83.4% vmlinux-2.6.9-34.7hp.XCsmp.o.hpcpi
 18077e6 11.1%  94.5% mb_pi.g.exe
  8618e6  5.3%  99.7% mb_pi.O.exe
   345e6  0.2% 100.0% emacs
    19e6  0.0% 100.0% libc-2.3.4.so
     7e6  0.0% 100.0% Xorg
     6e6  0.0% 100.0% tg3.ko
     5e6  0.0% 100.0% libgobject-2.0.so.0.400.7
     4e6  0.0% 100.0% libX11.so.6.2
     4e6  0.0% 100.0% libgdk-x11-2.0.so.0.400.13
     4e6  0.0% 100.0% libglib-2.0.so.0.400.7
     2e6  0.0% 100.0% hald
     2e6  0.0% 100.0% ld-2.3.4.so
     2e6  0.0% 100.0% ohci_hcd.ko
...
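
The header and the per-image rows of this output are related by simple arithmetic: each sample stands for one sampling period's worth of events, and each row's percentage is its share of the total. The following is a minimal sketch of that arithmetic using numbers copied from the listing; the function and variable names are illustrative and not part of HPCPI.

```python
# Numbers copied from the hpcpiprof listing; names are illustrative only.
PERIOD = 60_000            # events (cycles) between samples
TOTAL_SAMPLES = 2_720_293  # samples reported in the header

# Per-image CPU_CLK_UNHALTED counts, in units of 1e6 cycles as printed.
IMAGE_MCYCLES = {
    "vmlinux-2.6.9-34.7hp.XCsmp.o.hpcpi": 136108,
    "mb_pi.g.exe": 18077,
    "mb_pi.O.exe": 8618,
    "emacs": 345,
}


def total_events(samples: int, period: int) -> int:
    """Each sample represents `period` events."""
    return samples * period


def share_percent(mcycles: int, total: int) -> float:
    """Percentage of all sampled events attributed to one image."""
    return 100.0 * (mcycles * 1_000_000) / total


TOTAL = total_events(TOTAL_SAMPLES, PERIOD)

for image, mcycles in IMAGE_MCYCLES.items():
    print(f"{share_percent(mcycles, TOTAL):5.1f}%  {image}")
```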

hpcpiprof (by procedure)

% hpcpiprof ./mb_pi.O.exe ./mb_pi.g.exe
Event Name       Events      Period Samples
---------------- ----------- ------ -------
CPU_CLK_UNHALTED 26694720000  60000  444912

CPU_CLK_
UNHALTED     %   cum% procedure       image
-------- ----- ------ --------------- -----------
175927e5 65.9%  65.9% mandel_val      mb_pi.g.exe
 84917e5 31.8%  97.7% mandel_val      mb_pi.O.exe
  4840e5  1.8%  99.5% mb_fill_in_data mb_pi.g.exe
  1264e5  0.5% 100.0% mb_fill_in_data mb_pi.O.exe
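
As a rough consistency check, the cycle counts for mandel_val in the two builds can be compared with the user times reported by `time` for the same runs. This is only a sketch: the variable names are illustrative, and it assumes the user CPU time of each run is dominated by the sampled process.

```python
# Cycle counts for mandel_val, copied from the per-procedure listing.
CYCLES_G = 175927e5   # mb_pi.g.exe (debug build)
CYCLES_O = 84917e5    # mb_pi.O.exe (optimized build)

# User times in seconds, copied from the `time` output of the two runs.
USER_SECONDS_G = 8.439
USER_SECONDS_O = 3.995

# Both ratios should tell roughly the same story if sampling is representative.
cycle_ratio = CYCLES_G / CYCLES_O
time_ratio = USER_SECONDS_G / USER_SECONDS_O

print(f"mandel_val cycles, -g vs -O: {cycle_ratio:.2f}x")
print(f"user time, -g vs -O:         {time_ratio:.2f}x")
```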

hpcpilist (by source/assembly)

% hpcpilist mandel_val mb_pi.g.exe
Event Name       Events      Period
---------------- ----------- ------
CPU_CLK_UNHALTED 17592720000  60000

CPU_CLK_UNHALTED Source
---------------- -----------------------------------------------------
        34560e03 159: {
                 160:   register NUMTYPE zr = cr, zi = ci, zr2, zi2;
                 161:   register int n = 0, delta = nmax - nmin;
       302220e03 162:   register NUMTYPE rad2, four = CONSTANT4(4.0);
                 163:   register int keepgoing;
       365100e03 164:   while (((n += 1),
                 165:           (zr2 = MULT(zr,zr)),
                 166:           (zi2 = MULT(zi,zi)),
                 167:           (zi = 2*MULT(zr,zi) + ci),
                 168:           (keepgoing = (n < nmax)),
                 169:           (rad2 = zr2 + zi2),
                 170:           (zr = zr2 - zi2 + cr),
                 171:           (rad2 <= four))
                 172:          && keepgoing)
                 173:     ;
        16696e06 174:   if (cause_segv && cr > CONSTANT(0.0) ...) {
                 175:     if (1) cause_segv = 0;
                 176:     *(char*)(long)cause_segv = 1;
                 177:   }
       103620e03 178:   return (n >= nmax ? delta :
                 179:           n < delta ? n :
                 180:           n == delta ? 0 :
                 181:           n%delta);
                 182: }

HPCPI’s differentiators
  • Sample rate
  • Overhead
  • Features
    • Can sample more than one event
    • Can sample an arbitrary number of events
    • The ‘label’ feature
  • Attention to accuracy

Sample rate and overhead
  • Default sample rate higher (interval lower), minimum sample rate high in comparison:
  • Low overhead: (Itanium; can’t on x86_64)
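
To make the link between interval and sample rate concrete, here is a minimal sketch that estimates samples per second for the default CPU_CLK_UNHALTED interval of 60000 cycles. The clock rate is not stated on the slides; it is inferred here from the mb_pi.O.exe run (8618e6 cycles over 3.995 user seconds), which is an assumption of this sketch, as are the function names.

```python
INTERVAL = 60_000      # cycles between samples (default shown by hpcpid)

# Figures from the mb_pi.O.exe run: cycles attributed to the image and the
# user time reported by `time`. Treating their quotient as the clock rate
# is an assumption of this sketch.
CYCLES = 8618e6
USER_SECONDS = 3.995


def implied_clock_hz(cycles: float, seconds: float) -> float:
    """Unhalted cycles per second of CPU time."""
    return cycles / seconds


def samples_per_second(clock_hz: float, interval: int) -> float:
    """A lower interval means proportionally more samples per second."""
    return clock_hz / interval


clock_hz = implied_clock_hz(CYCLES, USER_SECONDS)
rate = samples_per_second(clock_hz, INTERVAL)

print(f"implied clock: {clock_hz / 1e9:.2f} GHz")
print(f"sample rate:   {rate:,.0f} samples/s per busy CPU")
```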

Feature: Can sample more than one event
  • Useful for deriving metrics at image, routine or loop level
  • So can OProfile, VTune, and CodeAnalyst, but not yet Caliper
  • IPC
    • CPU_CLK_UNHALTED
    • RETIRED_INSTRS

Example: IPC for ‘mb_pi’
  • Restart:

% hpcpictl quit
hpcpictl quit successful
% hpcpid -events IPCEvents
Using info for 'AMD64 (family 0Fh)' PMU
2 tags, user definition:
pretty  formal           interval duty   randomize
------- ---------------- -------- ------ ---------
Cycles  CPU_CLK_UNHALTED    60000 always no
Retired RETIRED_INSTRS      60000 1      no
maintainVCT = false
1 groups; user definition:
# 0                1              2 3
1 CPU_CLK_UNHALTED RETIRED_INSTRS
----
multiplexing interval = 1000000
----
Logging to /tmp/david_ll/hpcpid-hpc6.log
Daemon is running on pid 6365

Example: IPC for ‘mb_pi’
  • Collect and report:

% ./mb_pi.O.exe -iters 100
3.1415926535897932384626433832795028
% ./mb_pi.g.exe -iters 100
3.1415926535897932384626433832795028
% hpcpictl flush
hpcpictl flush successful
% hpcpiprof mb_pi.g.exe mb_pi.O.exe
Event Name       Events      Period Samples
---------------- ----------- ------ -------
CPU_CLK_UNHALTED 27032280000  60000  450538
RETIRED_INSTRS   25146600000  60000  419110

CPU_CLK_              RETIRED_
UNHALTED     %   cum% INSTRS   procedure       image
-------- ----- ------ -------- --------------- -----------
177810e5 65.8%  65.8% 145161e5 mandel_val      mb_pi.g.exe
 86298e5 31.9%  97.7% 100236e5 mandel_val      mb_pi.O.exe
  5000e5  1.8%  99.6%   4334e5 mb_fill_in_data mb_pi.g.exe
  1213e5  0.4% 100.0%   1735e5 mb_fill_in_data mb_pi.O.exe
     1e5  0.0% 100.0%        0 main            mb_pi.g.exe
     1e5  0.0% 100.0%        0 main            mb_pi.O.exe

% tclsh
% expr {145161e5 / 177810e5}
0.816382655644
% expr {100236e5 / 86298e5}
1.16151011611
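
The two tclsh divisions compute IPC (retired instructions per unhalted cycle) for mandel_val in each build. The same calculation can be applied to every row of the report; the sketch below does so in Python, with the row data copied from the listing and the helper name `ipc` chosen for illustration.

```python
# (procedure, image, CPU_CLK_UNHALTED, RETIRED_INSTRS), copied from hpcpiprof.
ROWS = [
    ("mandel_val",      "mb_pi.g.exe", 177810e5, 145161e5),
    ("mandel_val",      "mb_pi.O.exe",  86298e5, 100236e5),
    ("mb_fill_in_data", "mb_pi.g.exe",   5000e5,   4334e5),
    ("mb_fill_in_data", "mb_pi.O.exe",   1213e5,   1735e5),
]


def ipc(cycles: float, instructions: float) -> float:
    """Instructions per cycle; 0.0 when no cycles were sampled."""
    return instructions / cycles if cycles else 0.0


for procedure, image, cycles, instructions in ROWS:
    print(f"{ipc(cycles, instructions):6.3f}  {procedure:<16} {image}")
```

The rows for main are omitted because a single sample is too little data for a meaningful ratio.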

Multiplex arbitrary events
  • DCacheEvents
    • CPU_CLK_UNHALTED
    • RETIRED_INSTRS
    • DISPATCH_STALLS
    • DATA_CACHE_ACCESSES
    • DATA_CACHE_MISSES
    • DATA_CACHE_REFILLS_FROM_L2.ALL
    • DATA_CACHE_REFILLS_FROM_SYSTEM.ALL
    • DATA_CACHE_LINES_EVICTED.ALL
    • L1DTLB_MISS_L2DTLB_HIT
    • L1DTLB_AND_L2DTLB_MISS
    • L2_REQUESTS.ALL
    • L2_MISSES.ALL
    • L2_FILLS.ALL
  • Why not just do them all?
    • Or more!
  • Unique to HPCPI

DCacheEvents event set on dear_rate
  • Setup:

% hpcpid -events DCacheEvents
Using info for 'AMD64 (family 0Fh)' PMU
13 tags, user definition:
pretty    formal                             interval duty   randomize
--------- ---------------------------------- -------- ------ ---------
Cycles    CPU_CLK_UNHALTED                      60000 always no
Retired   RETIRED_INSTRS                        60000 1      no
StallsAll DISPATCH_STALLS                       60000 1      no
          DATA_CACHE_ACCESSES                   12000 1      no
          DATA_CACHE_MISSES                     12000 1      no
          DATA_CACHE_REFILLS_FROM_L2.ALL        12000 1      no
          DATA_CACHE_REFILLS_FROM_SYSTEM.ALL    12000 1      no
          DATA_CACHE_LINES_EVICTED.ALL          12000 1      no
          L1DTLB_MISS_L2DTLB_HIT                12000 1      no
          L1DTLB_AND_L2DTLB_MISS                12000 1      no
          L2_REQUESTS.ALL                       12000 1      no
          L2_MISSES.ALL                         12000 1      no
          L2_FILLS.ALL                          12000 1      no
maintainVCT = false
4 groups; user definition:
# 0                1                            2                              3
1 CPU_CLK_UNHALTED RETIRED_INSTRS               DISPATCH_STALLS                DATA_CACHE_ACCESSES
2 CPU_CLK_UNHALTED DATA_CACHE_MISSES            DATA_CACHE_REFILLS_FROM_L2.ALL DATA_CACHE_REFILLS_FROM_SYSTEM.ALL
3 CPU_CLK_UNHALTED DATA_CACHE_LINES_EVICTED.ALL L1DTLB_MISS_L2DTLB_HIT         L1DTLB_AND_L2DTLB_MISS
4 CPU_CLK_UNHALTED L2_REQUESTS.ALL              L2_MISSES.ALL                  L2_FILLS.ALL
----
multiplexing interval = 1000000
----
Logging to /tmp/david_ll/hpcpid-hpc6.log
Daemon is running on pid 6877
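
The group table shows how 13 events are packed onto four counters: CPU_CLK_UNHALTED occupies counter 0 in every group, and the other twelve events each appear in exactly one group. Assuming the daemon gives every group an equal share of time (an assumption of this sketch, not something the output states), the fraction of time each event is actually counted follows directly from the table.

```python
# Group definitions copied from the hpcpid output (one list per group).
GROUPS = [
    ["CPU_CLK_UNHALTED", "RETIRED_INSTRS", "DISPATCH_STALLS",
     "DATA_CACHE_ACCESSES"],
    ["CPU_CLK_UNHALTED", "DATA_CACHE_MISSES",
     "DATA_CACHE_REFILLS_FROM_L2.ALL", "DATA_CACHE_REFILLS_FROM_SYSTEM.ALL"],
    ["CPU_CLK_UNHALTED", "DATA_CACHE_LINES_EVICTED.ALL",
     "L1DTLB_MISS_L2DTLB_HIT", "L1DTLB_AND_L2DTLB_MISS"],
    ["CPU_CLK_UNHALTED", "L2_REQUESTS.ALL", "L2_MISSES.ALL", "L2_FILLS.ALL"],
]


def active_fraction(event: str, groups=GROUPS) -> float:
    """Share of time an event is counted, assuming equal time per group."""
    hits = sum(1 for group in groups if event in group)
    return hits / len(groups)


ALL_EVENTS = sorted({event for group in GROUPS for event in group})

for event in ALL_EVENTS:
    print(f"{active_fraction(event):7.2%}  {event}")
```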

DCacheEvents event set on dear_rate
  • Run:

% (limit cpu 20sec; ./dear_rate.x86_64-Linux.exe 8 4 20 91)
8 reads in blocks of 4, increment 91
20000 iters in 19366069 cycles = 968.30 cycles/iter, 145.36MB/sec
40000 iters in 37588341 cycles = 939.71 cycles/iter, 149.80MB/sec
160000 iters in 144011554 cycles = 900.07 cycles/iter, 156.42MB/sec
1111987 iters in 999881564 cycles = 899.18 cycles/iter, 156.58MB/sec
1111987 iters in 1000096567 cycles = 899.38 cycles/iter, 156.54MB/sec
1111987 iters in 999305772 cycles = 898.67 cycles/iter, 156.67MB/sec
Cputime limit exceeded

  • Flush:

% hpcpictl flush
hpcpictl flush successful

DCacheEvents event set on dear_rate
  • Observe:

% hpcpiprof ./dear_rate.x86_64-Linux.exe
Event Name                         Events      Period Samples Active Fraction
---------------------------------- ----------- ------ ------- ---------------
CPU_CLK_UNHALTED                   42096660000  60000  701611         100.00%
RETIRED_INSTRS                      2004960000  60000    8354          25.00%
DISPATCH_STALLS                    41377200000  60000  172405          25.00%
DATA_CACHE_ACCESSES                  729216000  12000   15192          25.00%
DATA_CACHE_MISSES                    395664000  12000    8243          25.00%
DATA_CACHE_REFILLS_FROM_L2.ALL       394944000  12000    8228          25.00%
DATA_CACHE_REFILLS_FROM_SYSTEM.ALL   389664000  12000    8118          25.00%
DATA_CACHE_LINES_EVICTED.ALL         784080000  12000   16335          25.00%
L1DTLB_MISS_L2DTLB_HIT                 4560000  12000      95          25.00%
L1DTLB_AND_L2DTLB_MISS                71712000  12000    1494          25.00%
L2_REQUESTS.ALL                      490704000  12000   10223          25.00%
L2_MISSES.ALL                        390336000  12000    8132          25.00%
L2_FILLS.ALL                         395904000  12000    8248          25.00%

                                                                  DATA_CACHE_ DATA_CACHE_ DATA_CACHE_
                                                                  REFILLS_    REFILLS_    LINES_      L1DTLB_
CPU_CLK_               RETIRED_ DISPATCH_ DATA_CACHE_ DATA_CACHE_ FROM_L2     FROM_SYSTEM EVICTED     MISS_      L1DTLB_AND_ L2_REQUESTS L2_MISSES L2_FILLS
UNHALTED %      cum%   INSTRS   STALLS    ACCESSES    MISSES      .ALL        .ALL        .ALL        L2DTLB_HIT L2DTLB_MISS .ALL        .ALL      .ALL     procedure     image
-------- ------ ------ -------- --------- ----------- ----------- ----------- ----------- ----------- ---------- ----------- ----------- --------- -------- ------------- --------------------------
420965e5 100.0% 100.0%  20050e5  413772e5      7292e5      3957e5      3949e5      3897e5      7841e5       46e5       717e5      4907e5    3903e5   3959e5 read_memory8d dear_rate.x86_64-Linux.exe
     1e5   0.0% 100.0%        0         0         0e5           0           0           0           0          0           0           0         0        0 main          dear_rate.x86_64-Linux.exe
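
Two pieces of arithmetic are implicit in this report. First, the Events column matches samples × period ÷ active fraction, which scales a multiplexed event up to an estimate for the whole run. Second, ratios of the per-procedure columns give derived metrics. The sketch below reproduces both for read_memory8d; the metric names are illustrative choices, not definitions taken from HPCPI.

```python
def estimated_events(samples: int, period: int, active_fraction: float) -> int:
    """Scale sampled counts up to the full run for a multiplexed event."""
    return round(samples * period / active_fraction)


# Per-procedure counts for read_memory8d, copied from the report.
COUNTS = {
    "CPU_CLK_UNHALTED":    420965e5,
    "RETIRED_INSTRS":       20050e5,
    "DISPATCH_STALLS":     413772e5,
    "DATA_CACHE_ACCESSES":   7292e5,
    "DATA_CACHE_MISSES":     3957e5,
    "L2_REQUESTS.ALL":       4907e5,
    "L2_MISSES.ALL":         3903e5,
}

# Illustrative derived metrics (the names are this sketch's own).
METRICS = {
    "ipc": COUNTS["RETIRED_INSTRS"] / COUNTS["CPU_CLK_UNHALTED"],
    "stall_fraction": COUNTS["DISPATCH_STALLS"] / COUNTS["CPU_CLK_UNHALTED"],
    "l1d_miss_ratio": COUNTS["DATA_CACHE_MISSES"] / COUNTS["DATA_CACHE_ACCESSES"],
    "l2_miss_ratio": COUNTS["L2_MISSES.ALL"] / COUNTS["L2_REQUESTS.ALL"],
}

for name, value in METRICS.items():
    print(f"{name:<15} {value:.3f}")
```

On these numbers the procedure retires few instructions per cycle and spends nearly all of its cycles in dispatch stalls, which is consistent with a benchmark designed to miss in the caches.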

The ‘label’ feature
  • Partitions samples, usually based on process(es)
  • See the man page for hpcpilabel
  • DCPI classic label:

% hpcpictl label run1 a.out one 1 uno

% hpcpictl label run2 a.out two 2 dos

  • Restrict to a script and its children:

% hpcpictl label specs -pgid this runSpec

  • Snapshot a system-wide interval:

% hpcpictl label oneMinute -pid -1 -not sleep 60

  • “Attach” to a process:

% hpcpictl label attached -pid desiredPID sleep 99999

  • Monitor the idle process on CPU 0 for 5 minutes:

% hpcpictl label pid0cpu0 -pid 0 -cpu 0 -and sleep 300

  • Can be initiated and managed by programs
    • Use popen() of hpcpictl with ‘-pgid this’ or ‘-pidparent’
  • Don’t forget to hpcpictl flush
  • Use ‘-label labelName’ with the analysis tools
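
As a rough illustration of starting a label from a program, the sketch below assembles an `hpcpictl label` command line in the shape used by the examples on this slide (label name, selector options, then the command to run) and launches it as a child process. The helper names are hypothetical, the options are written with ASCII hyphens on the assumption that the dashes on the slide stand for them, and the exact semantics of each selector should be taken from the hpcpilabel man page rather than from this sketch.

```python
import shutil
import subprocess
from typing import List, Optional


def label_command(label: str, selectors: List[str], command: List[str]) -> List[str]:
    """Build argv in the shape shown on the slide:
    hpcpictl label <name> <selector options...> <command...>"""
    return ["hpcpictl", "label", label, *selectors, *command]


def run_labeled(label: str, selectors: List[str], command: List[str]) -> Optional[int]:
    """Run the labeled command; return its exit status, or None when
    hpcpictl is not on PATH (so the sketch is harmless elsewhere)."""
    argv = label_command(label, selectors, command)
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, check=False).returncode


# Mirrors "hpcpictl label specs -pgid this runSpec" from the slide.
print(" ".join(label_command("specs", ["-pgid", "this"], ["runSpec"])))
```

After the labeled work finishes, the slide's own advice still applies: flush with hpcpictl, then pass ‘-label labelName’ to the analysis tools.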

Attention to accuracy (Itanium)
  • Wrote micro-benchmarks with known behavior
  • Eliminated post-unfreeze-pre-RFI event leaks
      • Micro-benchmark has no NOPS nor any predicate-squashed instructions
  • Determined event-based multiplexing better than time-based
      • Micro-benchmark has known (high) IPC

Xtools
  • Pair of visualization tools
  • Separable and cooperative with HPCPI
  • xclus
    • Cluster-wide monitoring
    • Utilizations: CPU, DRAM, HyperTransports
  • xperf
    • Single-node monitoring
    • Graphs of derived events based on hardware counters
      • CPU utilization, IPC, cycle accounting, cache penalties, I/O activity, etc

Basic structure of a system

[Diagram: two processors, each with local DRAM and two CPUs/cores; MPI ranks 0, 1, 2, ..., N-3, N-2, N-1, with 1/11 and 10/11 transfers, ending in MPI_Finalize()]

For icon-design of xclus:

  • Processors with CPUs/cores
  • Local DRAM
  • HyperTransports

mpi_two_way 1/11:

  • Outer processes exchange data
  • Rank 0 sends 1/11 to rank n-1; receives 10/11
  • Rank n-1 sends 10/11 to rank 0; receives 1/11
  • 1:10 ratio; easy to see
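
The traffic split can be written out explicitly. This sketch treats the `1/11` argument as the fraction of the exchanged data that rank 0 sends, as the bullets describe; how the real mpi_two_way program parses its argument is not shown on the slides, so the function below is purely illustrative.

```python
from fractions import Fraction
from typing import Tuple


def two_way_split(argument: str) -> Tuple[Fraction, Fraction]:
    """Return (share sent by rank 0, share sent by rank n-1)."""
    low = Fraction(argument)   # Fraction accepts strings such as "1/11"
    return low, 1 - low


rank0_share, last_rank_share = two_way_split("1/11")
ratio = rank0_share / last_rank_share

print(f"rank 0 sends {rank0_share}, rank n-1 sends {last_rank_share}")
print(f"ratio {ratio.numerator}:{ratio.denominator}")
```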

xclus, cluster running mpi_two_way; default: Utilization

xclus, cluster running mpi_two_way; show bandwidth (control and data)

xclus, cluster running mpi_two_way; show bandwidth, data only

xclus, cluster running mpi_two_way; wave-over pop-up

xclus node grouping

[Diagram: four jobs of 12 ranks each (labeled 0-9, A, B), each with 1/11 and 10/11 transfers and ending in MPI_Finalize()]

Run mpi_two_way on 12 processes (3 nodes), 4 times

  • bsub -n 12 [ xterm -e ] mpirun -srun mpi_two_way 1/11
  • bsub -n 12 [ xterm -e ] mpirun -srun mpi_two_way 1/11
  • bsub -n 12 [ xterm -e ] mpirun -srun mpi_two_way 1/11
  • bsub -n 12 [ xterm -e ] mpirun -srun mpi_two_way 1/11

Without node grouping:

  • xclus [ –no-group-nodes ]

Force node grouping:

  • xclus –group-nodes

xclus, (4x3)x4 mpi_two_way, not grouped

xclus, (4x3)x4 mpi_two_way, grouped

xclus, (4x3)x4 mpi_two_way, grouped, node waveover

xclus, (4x3)x4 mpi_two_way, grouped, HyperTransport waveover

xperf: node-specific time-graphs of counter-based metrics
  • xperf can be started by itself
    • xperf –node nodename
  • Or by clicking a node in xclus
  • Demonstration program: memr_rate
    • Run in such a way that
      • On CPU 0: hits in L1 cache, gets 12+ GB/sec
      • On CPU 1: misses L1, hits L2, gets 2 GB/sec
      • On CPU 2: starts missing L2, gets 700 MB/sec
      • On CPU 3: misses L2, gets 400 MB/sec

xperf, initially

xperf, hide several graphs

xperf, with tear-off color keys

xperf: HPCPI’s initial presentation

xperf/HPCPI, top image

xperf/HPCPI, top procedure

xperf/HPCPI, top procedure, scrolled

Recap
  • HPCPI
    • Sampling profiler
    • High frequency
    • Low overhead (Itanium)
    • Arbitrary events (auto-placement, auto-multiplexing)
    • Attention to accuracy
    • ‘label’ feature
  • Xtools
    • xclus: cluster-wide utilization visualizer
    • xperf: node-specific time-graphs of counter-based metrics
      • Integrated with HPCPI

[End]
  • Questions?
  • Discussion?
  • Break
  • Next: HPCPI documentation review
