Performance Profiling Using hpmcount, poe+ & libhpm

Richard Gerber

NERSC User Services

[email protected]

510-486-6820


Introduction

  • How to obtain performance numbers

  • Tools based on IBM’s PMAPI

  • Relevant for FY2003 ERCAP


Agenda

  • Low Level PAPI Interface

  • HPM Toolkit

    • hpmcount

    • poe+

  • libhpm : hardware performance library


Overview

  • Tools for measuring application performance

  • All can be used to tune applications

  • Performance numbers are needed for FY 2003 ERCAP applications


Vocabulary

  • PMAPI – IBM’s low-level interface

  • PAPI – Performance API (portable)

  • hpmcount, poe+ report overall code performance

  • libhpm can be used to instrument portions of code


PAPI

  • Standard application programming interface

  • Portable; don’t confuse it with IBM’s low-level PMAPI interface

  • Can access hardware counter info

  • V2.1 at NERSC

  • See

    • http://hpcf.nersc.gov/software/papi.html

    • http://icl.cs.utk.edu/projects/papi/


Using PAPI

  • PAPI is available through a module

    • module load papi

  • You place calls in your source code and compile with $PAPI

    • xlf -O3 source.F $PAPI

  #include "fpapi.h"

  integer*8 values(2)
  integer counters(2), ncounters, irc

  irc = PAPI_VER_CURRENT
  CALL papif_library_init(irc)

  counters(1) = PAPI_FMA_INS
  counters(2) = PAPI_FP_INS
  ncounters = 2

  CALL papif_start_counters(counters, ncounters, irc)
  ! ... code to be measured ...
  CALL papif_stop_counters(values, ncounters, irc)

  write(6,*) 'Total FMA ', values(1), ' Total FP ', values(2)


hpmcount

  • Easy to use

  • Does not affect code performance

  • Profiles entire code

  • Uses hardware counters

  • Reports flip (floating point instruction) rate and many other quantities


hpmcount usage

  • Serial

    • %hpmcount executable

  • Parallel

    • %poe hpmcount executable -nodes n -procs np

  • Gives performance numbers for each task

  • Prints output to STDOUT (or use -o filename)

  • Beware! These forms profile the poe command itself, not your application:

    • hpmcount poe executable

    • hpmcount executable (if compiled with mp* compilers)


hpmcount example

ex1.f - Unoptimized matrix-matrix multiply

% xlf90 -o ex1 -O3 -qstrict ex1.f

% hpmcount ./ex1

hpmcount (V 2.3.1) summary

Total execution time (wall clock time): 17.258385 seconds

######## Resource Usage Statistics ########

Total amount of time in user mode : 17.220000 seconds

Total amount of time in system mode : 0.040000 seconds

Maximum resident set size : 3116 Kbytes

Average shared memory use in text segment : 6900 Kbytes*sec

Average unshared memory use in data segment : 5344036 Kbytes*sec

Number of page faults without I/O activity : 785

Number of page faults with I/O activity : 1

Number of times process was swapped out : 0

Number of times file system performed INPUT : 0

Number of times file system performed OUTPUT : 0

Number of IPC messages sent : 0

Number of IPC messages received : 0

Number of signals delivered : 0

Number of voluntary context switches : 1

Number of involuntary context switches : 1727

####### End of Resource Statistics ########


hpmcount output

ex1.f - Unoptimized matrix-matrix multiply

% xlf90 -o ex1 -O3 -qstrict ex1.f

% hpmcount ./ex1

PM_CYC (Cycles) : 6428126205

PM_INST_CMPL (Instructions completed) : 693651174

PM_TLB_MISS (TLB misses) : 122468941

PM_ST_CMPL (Stores completed) : 125758955

PM_LD_CMPL (Loads completed) : 250513627

PM_FPU0_CMPL (FPU 0 instructions) : 249691884

PM_FPU1_CMPL (FPU 1 instructions) : 3134223

PM_EXEC_FMA (FMAs executed) : 126535192

Utilization rate : 99.308 %

Avg number of loads per TLB miss : 2.046

Load and store operations : 376.273 M

Instructions per load/store : 1.843

MIPS : 40.192

Instructions per cycle : 0.108

HW Float points instructions per Cycle : 0.039

Floating point instructions + FMAs : 379.361 M

Float point instructions + FMA rate : 21.981 Mflip/s

FMA percentage : 66.710 %

Computation intensity : 1.008


Floating point measures

  • PM_FPU0_CMPL (FPU 0 instructions)

  • PM_FPU1_CMPL (FPU 1 instructions)

    • The POWER3 processor has two Floating Point Units (FPU) which operate in parallel.

    • Each FPU can start a new instruction at every cycle.

    • This is the number of floating point instructions (add, multiply, subtract, divide, multiply+add) that have been executed by each FPU.

  • PM_EXEC_FMA (FMAs executed)

    • The POWER3 can execute a computation of the form x = s*a + b with one instruction. This is known as a Floating-point Multiply-Add (FMA).


Total flop rate

    • Float point instructions + FMA rate

      • Floating-point instructions + FMAs gives the total number of floating-point operations; the two are added because an FMA instruction yields 2 floating-point operations.

      • Dividing by the wall-clock time gives the code’s Mflops.

      • The POWER3 has a peak rate of 1500 Mflops (375 MHz clock x 2 FPUs x 2 flops per FMA instruction).

      • Our example: 22 Mflops.
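The peak-rate arithmetic in the last bullets can be written out explicitly (a trivial sketch, not toolkit code):

```c
#include <assert.h>

/* Peak flop rate of a 375 MHz POWER3: each of its two FPUs can
   retire one FMA (2 flops) per cycle. */
double power3_peak_mflops(void) {
    const double clock_mhz     = 375.0;
    const double fpus          = 2.0;
    const double flops_per_fma = 2.0;
    return clock_mhz * fpus * flops_per_fma;  /* = 1500 Mflops */
}
```

At the measured 22 Mflops, the example runs at under 1.5 % of peak.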


Memory access

  • Average number of loads per TLB miss

    • Memory addresses that are in the Translation Lookaside Buffer (TLB) can be accessed quickly.

    • Each time a TLB miss occurs, a new page (4 KB, or 512 8-byte elements) is brought into the buffer.

    • A value of ~500 means each element is accessed about once while the page is in the buffer.

    • A small value indicates that needed data is stored in widely separated places in memory; a redesign of the data structures may help performance significantly.

    • Our example: 2.0
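The 2.0 figure follows directly from the raw counters in the earlier report. A small sketch (plain C, helper names are illustrative only):

```c
#include <assert.h>

/* Loads per TLB miss, from the unoptimized example's counters */
double loads_per_tlb_miss(void) {
    const double loads      = 250513627.0;  /* PM_LD_CMPL  */
    const double tlb_misses = 122468941.0;  /* PM_TLB_MISS */
    return loads / tlb_misses;              /* ~2.05: poor locality */
}

/* The ideal case if every element of a page were touched once while
   resident: 512 8-byte elements per 4 KB page */
double elements_per_page(void) {
    return 4096.0 / 8.0;
}
```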


Cache hits

    • The -sN option to hpmcount selects a different statistics set

    • -s2 includes the L1 data cache hit rate

      • 33.4% for our example

      • See http://hpcf.nersc.gov/software/ibm/hpmcount/HPM_README.html for more options and descriptions.


Optimizing the code

    • Original code fragment

    DO I=1,N
      DO K=1,N
        DO J=1,N
          Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
        END DO
      END DO
    END DO


Optimizing the code

    • "Optimized" code: move I to the inner loop, so the innermost loop has unit stride through Fortran’s column-major arrays

    DO J=1,N
      DO K=1,N
        DO I=1,N
          Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
        END DO
      END DO
    END DO
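The same loop interchange can be illustrated in C, with one caveat: the stride logic flips, because Fortran arrays are column-major (innermost I is unit stride) while C arrays are row-major (the last index should be innermost). A sketch, not from the slides, verifying that both orders compute the same product:

```c
#include <assert.h>
#include <math.h>

#define N 32
typedef double Mat[N][N];

/* i innermost: the fast order in column-major Fortran, but in
   row-major C the inner loop jumps N doubles between accesses
   to Z and X. */
static void mm_strided(Mat Z, const Mat X, const Mat Y) {
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                Z[i][j] += X[i][k] * Y[k][j];
}

/* j innermost: the unit-stride order for C; the inner loop walks
   contiguous rows of Z and Y. */
static void mm_unit(Mat Z, const Mat X, const Mat Y) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                Z[i][j] += X[i][k] * Y[k][j];
}

/* Returns 1 if both orderings give the same result */
int orders_agree(void) {
    static Mat X, Y, Z1, Z2;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            X[i][j] = (double)(i + 1);
            Y[i][j] = (double)(j + 1);
            Z1[i][j] = Z2[i][j] = 0.0;
        }
    mm_strided(Z1, X, Y);
    mm_unit(Z2, X, Y);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (fabs(Z1[i][j] - Z2[i][j]) > 1e-9)
                return 0;
    return 1;
}
```

Loop interchange changes only the memory access pattern, not the arithmetic, which is why the results match while the performance differs dramatically.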


Optimized results

    • Float point instructions + FMA rate

      • 461 vs. 22 Mflip/s (ESSL: 933)

    • Avg number of loads per TLB miss

      • 20,877 vs. 2.0 (ESSL: 162)

    • L1 cache hit rate

      • 98.9% vs. 33.4%


Using libhpm

    • libhpm can instrument code sections

    • Embed calls into source code

      • Fortran, C, C++

    • Contained in hpmtoolkit module

      • module load hpmtoolkit

    • Compile with $HPMTOOLKIT

      • xlf -O3 source.F $HPMTOOLKIT

    • Execute program normally


libhpm example

    #include "f_hpm.h"

    CALL f_hpminit(0, "someid")

    CALL f_hpmstart(1, "matrix-matrix multiply")

    DO J=1,N
      DO K=1,N
        DO I=1,N
          Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
        END DO
      END DO
    END DO

    CALL f_hpmstop(1)

    CALL f_hpmterminate(0)


Parallel programs

    • poe hpmcount executable -nodes n -procs np

      • Will print output to STDOUT separately for each task

    • poe+ executable -nodes n -procs np

      • Will print aggregate number to STDOUT

    • libhpm

      • Writes output to a separate file for each task

    • Do not do these (they profile the poe command, not your application):

      • hpmcount poe executable …

      • hpmcount executable (if compiled with mp* compiler)


Summary

    • Utilities to measure performance

      • PAPI

      • hpmcount

      • poe+

      • libhpm

    • You need to quote performance data in your ERCAP application


Where to Get More Information

    • NERSC Website: hpcf.nersc.gov

    • PAPI

      • http://hpcf.nersc.gov/software/tools/papi.html

    • hpmcount, poe+

      • http://hpcf.nersc.gov/software/ibm/hpmcount/

      • http://hpcf.nersc.gov/software/ibm/hpmcount/counter.html

    • libhpm

      • http://hpcf.nersc.gov/software/ibm/hpmcount/HPM_README.html

