
HPCPI/Xtools Performance Analysis Toolset

David LaFrance-Linden

High Performance Computing Division


Overview

  • HPCPI

    • Statistical sampling profiler

    • From DCPI (Digital Continuous Profiling Infrastructure)

    • Compare (vaguely) to:

      • OProfile (open source): conceptually based on DCPI

      • Caliper (from HP for Itanium): has many other modes/features

      • VTune (from Intel): with GUI

      • CodeAnalyst (from AMD): with GUI

  • Xtools

    • Performance visualization tools

    • xclus: cluster-wide visualization tool

    • xperf: node-specific visualization tool


HPCPI – Standard sampling

  • Set default database location

    % setenv HPCPIDB ~/hpcpidb

  • Start daemon:

    % hpcpid
    Using info for 'AMD64 (family 0Fh)' PMU
    1 tags, user definition:
    pretty formal           interval duty   randomize
    ------ ---------------- -------- ------ ---------
    Cycles CPU_CLK_UNHALTED 60000    always no
    maintainVCT = false
    1 groups; user definition:
    # 0                1       2       3
    1 CPU_CLK_UNHALTED <empty> <empty> <empty>
    ----
    multiplexing interval = 1000000
    ----
    Logging to /tmp/david_ll/hpcpid-hpc6.log
    Daemon is running on pid 6310


HPCPI – Standard sampling

  • Run programs:

    % time ./mb_pi.O.exe -iters 100
    3.1415926535897932384626433832795028
    3.995u 0.000s 0:04.00 99.7% 0+0k 0+0io 0pf+0w
    % time ./mb_pi.g.exe -iters 100
    3.1415926535897932384626433832795028
    8.439u 0.000s 0:08.44 99.8% 0+0k 0+0io 0pf+0w

  • Flush database to disk

    % hpcpictl flush

    hpcpictl flush successful

  • Analyze

    % hpcpiprof

    % hpcpiprof ./mb_pi.g.exe ./mb_pi.O.exe

    % hpcpilist mandel_val ./mb_pi.g.exe
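
The steps on these two slides can be scripted. The sketch below is a hypothetical Python wrapper: the names `session_commands` and `run_session` are illustrative and not part of HPCPI. It uses only the commands shown above and assumes that `hpcpid` returns once the daemon is running, as the startup transcript suggests.

```python
import os
import subprocess


def session_commands(programs, analysis=("hpcpiprof",)):
    """Return the argv lists for one sampling session, in slide order."""
    steps = [["hpcpid"]]                          # start the daemon
    steps += [list(argv) for argv in programs]    # run the programs to profile
    steps.append(["hpcpictl", "flush"])           # flush the database to disk
    steps.append(list(analysis))                  # analyze
    return steps


def run_session(programs, db_dir="~/hpcpidb"):
    """Run the session with HPCPIDB set (assumes the HPCPI tools are on PATH)."""
    env = dict(os.environ, HPCPIDB=os.path.expanduser(db_dir))
    for argv in session_commands(programs):
        subprocess.run(argv, env=env, check=True)


if __name__ == "__main__":
    # Print the plan only; call run_session() on a machine with HPCPI installed.
    for argv in session_commands([["./mb_pi.O.exe", "-iters", "100"]]):
        print(" ".join(argv))
```

Keeping the command list separate from its execution makes it possible to inspect the plan on a machine without HPCPI.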


hpcpiprof (by image)

% hpcpiprof
Event Name       Events       Period Samples
---------------- ------------ ------ -------
CPU_CLK_UNHALTED 163217580000 60000  2720293

CPU_CLK_
UNHALTED %     cum%   image
-------- ----- ------ ----------------------------------
136108e6 83.4% 83.4%  vmlinux-2.6.9-34.7hp.XCsmp.o.hpcpi
18077e6  11.1% 94.5%  mb_pi.g.exe
8618e6   5.3%  99.7%  mb_pi.O.exe
345e6    0.2%  100.0% emacs
19e6     0.0%  100.0% libc-2.3.4.so
7e6      0.0%  100.0% Xorg
6e6      0.0%  100.0% tg3.ko
5e6      0.0%  100.0% libgobject-2.0.so.0.400.7
4e6      0.0%  100.0% libX11.so.6.2
4e6      0.0%  100.0% libgdk-x11-2.0.so.0.400.13
4e6      0.0%  100.0% libglib-2.0.so.0.400.7
2e6      0.0%  100.0% hald
2e6      0.0%  100.0% ld-2.3.4.so
2e6      0.0%  100.0% ohci_hcd.ko
...
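
The Events column in this report is consistent with Samples multiplied by Period, and the per-image percentages follow from dividing each image's count by that total. A minimal Python check of the arithmetic, using figures copied from the report (the function names are illustrative, not part of HPCPI):

```python
def estimated_events(samples, period):
    """Each sample stands for `period` occurrences of the event."""
    return samples * period


def percent_of_total(events, total):
    """Share of the total event count, in percent."""
    return 100.0 * events / total


total = estimated_events(samples=2720293, period=60000)
print(total)                                         # 163217580000
print(round(percent_of_total(136108e6, total), 1))   # 83.4 (vmlinux)
print(round(percent_of_total(18077e6, total), 1))    # 11.1 (mb_pi.g.exe)
```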


hpcpiprof (by procedure)

% hpcpiprof ./mb_pi.O.exe ./mb_pi.g.exe
Event Name       Events      Period Samples
---------------- ----------- ------ -------
CPU_CLK_UNHALTED 26694720000 60000  444912

CPU_CLK_
UNHALTED %     cum%   procedure       image
-------- ----- ------ --------------- -----------
175927e5 65.9% 65.9%  mandel_val      mb_pi.g.exe
84917e5  31.8% 97.7%  mandel_val      mb_pi.O.exe
4840e5   1.8%  99.5%  mb_fill_in_data mb_pi.g.exe
1264e5   0.5%  100.0% mb_fill_in_data mb_pi.O.exe


hpcpilist (by source/assembly)

% hpcpilist mandel_val mb_pi.g.exe
Event Name       Events      Period
---------------- ----------- ------
CPU_CLK_UNHALTED 17592720000 60000

CPU_CLK_UNHALTED Source
---------------- -----------------------------------------------------
34560e03         159: {
                 160: register NUMTYPE zr = cr, zi = ci, zr2, zi2;
                 161: register int n = 0, delta = nmax - nmin;
302220e03        162: register NUMTYPE rad2, four = CONSTANT4(4.0);
                 163: register int keepgoing;
365100e03        164: while (((n += 1),
                 165: (zr2 = MULT(zr,zr)),
                 166: (zi2 = MULT(zi,zi)),
                 167: (zi = 2*MULT(zr,zi) + ci),
                 168: (keepgoing = (n < nmax)),
                 169: (rad2 = zr2 + zi2),
                 170: (zr = zr2 - zi2 + cr),
                 171: (rad2 <= four))
                 172: && keepgoing)
                 173: ;
16696e06         174: if (cause_segv && cr > CONSTANT(0.0) ...) {
                 175: if (1) cause_segv = 0;
                 176: *(char*)(long)cause_segv = 1;
                 177: }
103620e03        178: return (n >= nmax ? delta :
                 179: n < delta ? n :
                 180: n == delta ? 0 :
                 181: n%delta);
                 182: }
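
To see how the cycles are distributed across source lines, each per-line count can be divided by the Events total in the header. The Python sketch below copies the counts from the listing; the variable and function names are illustrative. It reports only what the listing attributes to each line and does not explain why the tool charges a given line.

```python
TOTAL_CYCLES = 17592720000  # Events value from the listing header

# Source line -> CPU_CLK_UNHALTED count, copied from the listing.
line_cycles = {159: 34560e3, 162: 302220e3, 164: 365100e3,
               174: 16696e6, 178: 103620e3}


def line_shares(cycles_by_line, total):
    """Percentage of the procedure's cycles charged to each source line."""
    return {line: 100.0 * c / total for line, c in cycles_by_line.items()}


shares = line_shares(line_cycles, TOTAL_CYCLES)
hottest = max(shares, key=shares.get)
print(hottest, round(shares[hottest], 1))  # 174 94.9
```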


HPCPI’s differentiators

  • Sample rate

  • Overhead

  • Features

    • Can sample more than 1 event

    • Can sample an arbitrary number of events

    • The ‘label’ feature

  • Attention to accuracy


Sample rate and overhead

  • Default sample rate higher (interval lower), minimum sample rate high in comparison:

  • Low overhead: (Itanium; can’t on x86_64)


Feature: Can sample more than one event

  • Useful for deriving metrics at image, routine or loop level

  • So can OProfile and VTune and CodeAnalyst, but not yet Caliper

  • IPC

    • CPU_CLK_UNHALTED

    • RETIRED_INSTRS


Example: IPC for ‘mb_pi’

  • Restart:

    % hpcpictl quit
    hpcpictl quit successful
    % hpcpid -events IPCEvents
    Using info for 'AMD64 (family 0Fh)' PMU
    2 tags, user definition:
    pretty  formal           interval duty   randomize
    ------- ---------------- -------- ------ ---------
    Cycles  CPU_CLK_UNHALTED 60000    always no
    Retired RETIRED_INSTRS   60000    1      no
    maintainVCT = false
    1 groups; user definition:
    # 0                1              2       3
    1 CPU_CLK_UNHALTED RETIRED_INSTRS <empty> <empty>
    ----
    multiplexing interval = 1000000
    ----
    Logging to /tmp/david_ll/hpcpid-hpc6.log
    Daemon is running on pid 6365


Example: IPC for ‘mb_pi’

  • Collect and report:

    % ./mb_pi.O.exe -iters 100
    3.1415926535897932384626433832795028
    % ./mb_pi.g.exe -iters 100
    3.1415926535897932384626433832795028
    % hpcpictl flush
    hpcpictl flush successful
    % hpcpiprof mb_pi.g.exe mb_pi.O.exe
    Event Name       Events      Period Samples
    ---------------- ----------- ------ -------
    CPU_CLK_UNHALTED 27032280000 60000  450538
    RETIRED_INSTRS   25146600000 60000  419110

    CPU_CLK_               RETIRED_
    UNHALTED %     cum%   INSTRS   procedure       image
    -------- ----- ------ -------- --------------- -----------
    177810e5 65.8% 65.8%  145161e5 mandel_val      mb_pi.g.exe
    86298e5  31.9% 97.7%  100236e5 mandel_val      mb_pi.O.exe
    5000e5   1.8%  99.6%  4334e5   mb_fill_in_data mb_pi.g.exe
    1213e5   0.4%  100.0% 1735e5   mb_fill_in_data mb_pi.O.exe
    1e5      0.0%  100.0% 0        main            mb_pi.g.exe
    1e5      0.0%  100.0% 0        main            mb_pi.O.exe
    % tclsh
    % expr {145161e5 / 177810e5}
    0.816382655644
    % expr {100236e5 / 86298e5}
    1.16151011611
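
The same IPC arithmetic can be applied to every row of the report at once. The Python sketch below copies the counts from the output above (the dictionary layout and the `ipc` helper are illustrative). It shows that in `mandel_val` the optimized build retires fewer instructions (100236e5 versus 145161e5) and does so at a higher IPC.

```python
# (procedure, image) -> (CPU_CLK_UNHALTED, RETIRED_INSTRS), from the report
counts = {
    ("mandel_val", "mb_pi.g.exe"): (177810e5, 145161e5),
    ("mandel_val", "mb_pi.O.exe"): (86298e5, 100236e5),
    ("mb_fill_in_data", "mb_pi.g.exe"): (5000e5, 4334e5),
    ("mb_fill_in_data", "mb_pi.O.exe"): (1213e5, 1735e5),
}


def ipc(cycles, instrs):
    """Instructions per cycle; 0.0 when no cycles were sampled."""
    return instrs / cycles if cycles else 0.0


for (procedure, image), (cycles, instrs) in counts.items():
    print(f"{procedure:16s} {image:12s} IPC = {ipc(cycles, instrs):.3f}")
```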


Multiplex arbitrary events

  • DCacheEvents

    • CPU_CLK_UNHALTED

    • RETIRED_INSTRS

    • DISPATCH_STALLS

    • DATA_CACHE_ACCESSES

    • DATA_CACHE_MISSES

    • DATA_CACHE_REFILLS_FROM_L2.ALL

    • DATA_CACHE_REFILLS_FROM_SYSTEM.ALL

    • DATA_CACHE_LINES_EVICTED.ALL

    • L1DTLB_MISS_L2DTLB_HIT

    • L1DTLB_AND_L2DTLB_MISS

    • L2_REQUESTS.ALL

    • L2_MISSES.ALL

    • L2_FILLS.ALL

  • Why not just do them all?

    • Or more!

  • Unique to HPCPI


DCacheEvents event set on dear_rate

  • Setup:

    % hpcpid -events DCacheEvents
    Using info for 'AMD64 (family 0Fh)' PMU
    13 tags, user definition:
    pretty    formal                             interval duty   randomize
    --------- ---------------------------------- -------- ------ ---------
    Cycles    CPU_CLK_UNHALTED                   60000    always no
    Retired   RETIRED_INSTRS                     60000    1      no
    StallsAll DISPATCH_STALLS                    60000    1      no
              DATA_CACHE_ACCESSES                12000    1      no
              DATA_CACHE_MISSES                  12000    1      no
              DATA_CACHE_REFILLS_FROM_L2.ALL     12000    1      no
              DATA_CACHE_REFILLS_FROM_SYSTEM.ALL 12000    1      no
              DATA_CACHE_LINES_EVICTED.ALL       12000    1      no
              L1DTLB_MISS_L2DTLB_HIT             12000    1      no
              L1DTLB_AND_L2DTLB_MISS             12000    1      no
              L2_REQUESTS.ALL                    12000    1      no
              L2_MISSES.ALL                      12000    1      no
              L2_FILLS.ALL                       12000    1      no
    maintainVCT = false
    4 groups; user definition:
    # 0                1                            2                              3
    1 CPU_CLK_UNHALTED RETIRED_INSTRS               DISPATCH_STALLS                DATA_CACHE_ACCESSES
    2 CPU_CLK_UNHALTED DATA_CACHE_MISSES            DATA_CACHE_REFILLS_FROM_L2.ALL DATA_CACHE_REFILLS_FROM_SYSTEM.ALL
    3 CPU_CLK_UNHALTED DATA_CACHE_LINES_EVICTED.ALL L1DTLB_MISS_L2DTLB_HIT         L1DTLB_AND_L2DTLB_MISS
    4 CPU_CLK_UNHALTED L2_REQUESTS.ALL              L2_MISSES.ALL                  L2_FILLS.ALL
    ----
    multiplexing interval = 1000000
    ----
    Logging to /tmp/david_ll/hpcpid-hpc6.log
    Daemon is running on pid 6877
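
The group table above keeps CPU_CLK_UNHALTED in every group and spreads the other twelve events over the three remaining counters. The Python sketch below reproduces that layout with a simple fill-in-order rule. It is an illustration only, not HPCPI's placement algorithm: a real placer also has to respect which hardware counter can count which event, and the function name and parameters are hypothetical.

```python
EVENTS = [
    "CPU_CLK_UNHALTED", "RETIRED_INSTRS", "DISPATCH_STALLS",
    "DATA_CACHE_ACCESSES", "DATA_CACHE_MISSES",
    "DATA_CACHE_REFILLS_FROM_L2.ALL", "DATA_CACHE_REFILLS_FROM_SYSTEM.ALL",
    "DATA_CACHE_LINES_EVICTED.ALL", "L1DTLB_MISS_L2DTLB_HIT",
    "L1DTLB_AND_L2DTLB_MISS", "L2_REQUESTS.ALL", "L2_MISSES.ALL",
    "L2_FILLS.ALL",
]


def place_events(events, counters=4, pinned="CPU_CLK_UNHALTED"):
    """Split events into groups of `counters`, repeating `pinned` in each."""
    others = [e for e in events if e != pinned]
    per_group = counters - 1  # one counter per group is reserved for `pinned`
    return [[pinned] + others[i:i + per_group]
            for i in range(0, len(others), per_group)]


groups = place_events(EVENTS)
for number, group in enumerate(groups, start=1):
    print(number, *group)
# With equal time slices, each non-pinned event is live 1/len(groups) of the time.
print(f"active fraction of non-pinned events: {100 / len(groups):.2f}%")
```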


DCacheEvents event set on dear_rate

  • Run:

    % (limit cpu 20sec; ./dear_rate.x86_64-Linux.exe 8 4 20 91)
    8 reads in blocks of 4, increment 91
    20000 iters in 19366069 cycles = 968.30 cycles/iter, 145.36MB/sec
    40000 iters in 37588341 cycles = 939.71 cycles/iter, 149.80MB/sec
    160000 iters in 144011554 cycles = 900.07 cycles/iter, 156.42MB/sec
    1111987 iters in 999881564 cycles = 899.18 cycles/iter, 156.58MB/sec
    1111987 iters in 1000096567 cycles = 899.38 cycles/iter, 156.54MB/sec
    1111987 iters in 999305772 cycles = 898.67 cycles/iter, 156.67MB/sec
    Cputime limit exceeded

  • Flush:

    % hpcpictl flush

    hpcpictl flush successful


DCacheEvents event set on dear_rate

  • Observe:

    % hpcpiprof ./dear_rate.x86_64-Linux.exe
    Event Name                         Events      Period Samples Active Fraction
    ---------------------------------- ----------- ------ ------- ---------------
    CPU_CLK_UNHALTED                   42096660000 60000  701611  100.00%
    RETIRED_INSTRS                     2004960000  60000  8354    25.00%
    DISPATCH_STALLS                    41377200000 60000  172405  25.00%
    DATA_CACHE_ACCESSES                729216000   12000  15192   25.00%
    DATA_CACHE_MISSES                  395664000   12000  8243    25.00%
    DATA_CACHE_REFILLS_FROM_L2.ALL     394944000   12000  8228    25.00%
    DATA_CACHE_REFILLS_FROM_SYSTEM.ALL 389664000   12000  8118    25.00%
    DATA_CACHE_LINES_EVICTED.ALL       784080000   12000  16335   25.00%
    L1DTLB_MISS_L2DTLB_HIT             4560000     12000  95      25.00%
    L1DTLB_AND_L2DTLB_MISS             71712000    12000  1494    25.00%
    L2_REQUESTS.ALL                    490704000   12000  10223   25.00%
    L2_MISSES.ALL                      390336000   12000  8132    25.00%
    L2_FILLS.ALL                       395904000   12000  8248    25.00%

                                                                      DATA_CACHE_ DATA_CACHE_ DATA_CACHE_
                                                                      REFILLS_    REFILLS_    LINES_      L1DTLB_
    CPU_CLK_               RETIRED_ DISPATCH_ DATA_CACHE_ DATA_CACHE_ FROM_L2     FROM_SYSTEM EVICTED     MISS_      L1DTLB_AND_ L2_REQUESTS L2_MISSES L2_FILLS
    UNHALTED %      cum%   INSTRS   STALLS    ACCESSES    MISSES      .ALL        .ALL        .ALL        L2DTLB_HIT L2DTLB_MISS .ALL        .ALL      .ALL     procedure     image
    -------- ------ ------ -------- --------- ----------- ----------- ----------- ----------- ----------- ---------- ----------- ----------- --------- -------- ------------- --------------------------
    420965e5 100.0% 100.0% 20050e5  413772e5  7292e5      3957e5      3949e5      3897e5      7841e5      46e5       717e5       4907e5      3903e5    3959e5   read_memory8d dear_rate.x86_64-Linux.exe
    1e5      0.0%   100.0% 0        0         0e5         0           0           0           0           0          0           0           0         0        main          dear_rate.x86_64-Linux.exe
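
The Events column in this report is consistent with Samples × Period scaled up by the Active Fraction, which compensates for the time a multiplexed counter was not live. The Python sketch below checks that relationship against three rows and then derives two example metrics. The function name and the choice of metrics are illustrative and do not come from the slides.

```python
def scaled_events(samples, period, active_fraction):
    """Estimate total events for a counter that was live only part of the time."""
    return round(samples * period / active_fraction)


cycles = scaled_events(701611, 60000, 1.00)       # CPU_CLK_UNHALTED
instrs = scaled_events(8354, 60000, 0.25)         # RETIRED_INSTRS
dc_accesses = scaled_events(15192, 12000, 0.25)   # DATA_CACHE_ACCESSES
dc_misses = scaled_events(8243, 12000, 0.25)      # DATA_CACHE_MISSES

print(cycles, instrs, dc_accesses, dc_misses)
print(f"IPC           = {instrs / cycles:.4f}")
print(f"L1 miss ratio = {dc_misses / dc_accesses:.4f}")
```

Because the scaling assumes the program behaves the same while a group is inactive, ratios between events from different groups are estimates rather than exact counts.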


The ‘label’ feature

  • Partitions samples, usually based on process(es)

  • See the man page for hpcpilabel

  • DCPI classic label:

    % hpcpictl label run1 a.out one 1 uno

    % hpcpictl label run2 a.out two 2 dos

  • Restrict to a script and its children:

    % hpcpictl label specs -pgid this runSpec

  • Snapshot a system-wide interval:

    % hpcpictl label oneMinute -pid -1 -not sleep 60

  • “Attach” to a process:

    % hpcpictl label attached -pid desiredPID sleep 99999

  • Monitor the idle process on CPU 0 for 5 minutes:

    % hpcpictl label pid0cpu0 -pid 0 -cpu 0 -and sleep 300

  • Can be initiated and managed by programs

    • Use popen() of hpcpictl with ‘-pgid this’ or ‘-pidparent’

  • Don’t forget to hpcpictl flush

  • Use ‘-label labelName’ with the analysis tools
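
A program that manages its own labels has to assemble an `hpcpictl label` command line and launch it. The Python sketch below follows the argument order used in the examples above (label name, optional selectors, then the command to run). The helper names are hypothetical, and the authoritative syntax is in the hpcpilabel man page.

```python
import subprocess


def label_command(label, selectors, command):
    """Build `hpcpictl label <label> [selectors...] <command...>` as argv."""
    return ["hpcpictl", "label", label, *selectors, *command]


def start_labeled(label, selectors, command):
    """Launch the labeled command without waiting (needs hpcpictl on PATH)."""
    return subprocess.Popen(label_command(label, selectors, command))


# Mirrors "hpcpictl label specs -pgid this runSpec" from the slide.
print(" ".join(label_command("specs", ["-pgid", "this"], ["runSpec"])))
```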


Attention to accuracy (Itanium)

  • Wrote micro-benchmarks with known behavior

  • Eliminated post-unfreeze-pre-RFI event leaks

    • Micro-benchmark has no NOPS nor any predicate-squashed instructions

  • Determined event-based multiplexing better than time-based

    • Micro-benchmark has known (high) IPC


Xtools

  • Pair of visualization tools

  • Separable and cooperative with HPCPI

  • xclus

    • Cluster-wide monitoring

    • Utilizations: CPU, DRAM, HyperTransports

  • xperf

    • Single-node monitoring

    • Graphs of derived events based on hardware counters

      • CPU utilization, IPC, cycle accounting, cache penalties, I/O activity, etc.

Basic structure of a system

[Figure: system diagram with CPU/Core blocks and their local DRAM; MPI ranks 0, 1, 2, ..., N-3, N-2, N-1 running until MPI_Finalize(), with 1/11 and 10/11 marking the data exchanged between the outer ranks]

For the icon design of xclus:

  • Processors with CPUs/cores

  • Local DRAM

  • HyperTransports

mpi_two_way 1/11:

  • Outer processes exchange data

  • Rank 0 sends 1/11 to rank n-1; receives 10/11

  • Rank n-1 sends 10/11 to rank 0; receives 1/11

  • 1:10 ratio; easy to see


xclus, cluster running mpi_two_way; default: Utilization


xclus, cluster running mpi_two_way; show bandwidth (control and data)


xclus, cluster running mpi_two_way; show bandwidth, data only


xclus, cluster running mpi_two_way; wave-over pop-up

xclus node grouping

[Figure: four copies of the mpi_two_way diagram, one per job; each shows ranks 0 through B (12 processes) running until MPI_Finalize(), with 1/11 and 10/11 marking the data exchanged between the outer ranks]

Run mpi_two_way on 12 processes (3 nodes), 4 times:

  • bsub -n 12 [ xterm -e ] mpirun -srun mpi_two_way 1/11

  • bsub -n 12 [ xterm -e ] mpirun -srun mpi_two_way 1/11

  • bsub -n 12 [ xterm -e ] mpirun -srun mpi_two_way 1/11

  • bsub -n 12 [ xterm -e ] mpirun -srun mpi_two_way 1/11

Without node grouping:

  • xclus [ -no-group-nodes ]

Force node grouping:

  • xclus -group-nodes


xclus, (4x3)x4 mpi_two_way, not grouped


xclus, (4x3)x4 mpi_two_way, grouped


xclus, (4x3)x4 mpi_two_way, grouped, node waveover


xclus, (4x3)x4 mpi_two_way, grouped, HyperTransport waveover


xperf: node-specific time-graphs of counter-based metrics

  • xperf can be started by itself

    • xperf -node nodename

  • Or by clicking a node in xclus

  • Demonstration program: memr_rate

    • Run in such a way that

      • On CPU 0: hits in L1 cache, gets 12+ GB/sec

      • On CPU 1: misses L1, hits L2, gets 2 GB/sec

      • On CPU 2: starts missing L2, gets 700 MB/sec

      • On CPU 3: misses L2, gets 400 MB/sec


xperf, initially


xperf, hide several graphs


xperf, with tear-off color keys


xperf: HPCPI’s initial presentation


xperf/HPCPI, top image


xperf/HPCPI, top procedure


xperf/HPCPI, top procedure, scrolled


Recap

  • HPCPI

    • Sampling profiler

    • High frequency

    • Low overhead (Itanium)

    • Arbitrary events (auto-placement, auto-multiplexing)

    • Attention to accuracy

    • ‘label’ feature

  • Xtools

    • xclus: cluster-wide utilization visualizer

    • xperf: node-specific time-graphs of counter-based metrics

      • Integrated with HPCPI


[End]

  • Questions?

  • Discussion?

  • Break

  • Next: HPCPI documentation review
