Profiling Techniques on SDSC Systems
Dmitry Pekurovsky
SDSC Summer Institute, July 16, 2007

Overview of Talk
  • Standard profiling using prof, gprof and gmon (DataStar, Blue Gene)
  • IBM High Performance Computing Toolkit (IHPCT) on DataStar and Blue Gene
    • Hardware Performance Monitoring – HPM
    • MPI Tracer/Profiler
    • Xprofiler - CPU profiling tool
    • PeekPerf – Visualization of performance trace information
  • Integrated Performance Monitoring (IPM) on DataStar and Blue Gene
  • Suggested standard tuning procedure
Standard Profiling using prof, gprof
  • Standard profiling (prof, gprof) is available on both DataStar and Blue Gene.
  • Three levels of profiling are available with gmon, depending on the -pg and -g options on the compile and link commands
    • Timer tick profiling information: Add -pg to the link options
    • Procedure level profiling with timer tick info: Add -pg to the compile and link options
    • Full profiling (call graph info, statement level profiling, basic block profiling, and machine instruction profiling): Add -pg -g to the compile and link options
  • Each task generates a gmon.out.x file where x corresponds to the rank of the task.
  • Output can be read using the gprof command (and Xprofiler as detailed later in the talk)
Example of profiling using gmon
  • Step 1: Compile using the -pg and -g options:

mpxlf -pg -g pois-imp.f -o example1

  • Step 2: Run the code to produce the gmon.out.x files:

ds100 % ls gmon.out.*

gmon.out.0 gmon.out.1 gmon.out.2 gmon.out.3

  • Step 3: Use gprof to analyze the output:

gprof -s example1 gmon.out.0 gmon.out.1 gmon.out.2 gmon.out.3

gprof example1 gmon.sum > summary.dat

(The first command produces gmon.sum, which the second command analyzes; the output is redirected to summary.dat.)
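The two gprof invocations in Step 3 can be generated for any number of tasks. The following is a minimal sketch, assuming one gmon.out.x file per rank starting at rank 0 (as in the listing above); the helper name `gprof_commands` is hypothetical, and the commands are only built here, not executed.

```python
from typing import List, Tuple


def gprof_commands(executable: str, nranks: int) -> Tuple[List[str], List[str]]:
    """Build the merge and analyze command lines for per-task gmon files.

    Assumes the run produced gmon.out.0 ... gmon.out.<nranks-1>.
    """
    per_task = [f"gmon.out.{rank}" for rank in range(nranks)]
    # "gprof -s" sums the per-task profiles into gmon.sum.
    merge = ["gprof", "-s", executable] + per_task
    # The second call analyzes gmon.sum; redirect its stdout to keep the report.
    analyze = ["gprof", executable, "gmon.sum"]
    return merge, analyze


if __name__ == "__main__":
    merge, analyze = gprof_commands("example1", 4)
    print(" ".join(merge))
    print(" ".join(analyze) + " > summary.dat")
```

On a system with gprof installed, each list could be passed to `subprocess.run`, with the second call's standard output written to summary.dat.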

Example of profiling using gmon (cont.)
  • gprof output has the call graph info, flat profile and function index
  • Section of sample call graph:

                                  called/total      parents
index  %time    self  descendents  called+self   name            index
                                  called/total      children

                0.00         4.13      4/4           .__start [3]
[1]     82.8    0.00         4.13      4         .poisimp [1]
                4.10         0.03      4/4           .poisson [2]

                4.10         0.03      4/4           .poisimp [1]
[2]     82.8    4.10         0.03      4         .poisson [2]
                0.02         0.00  32000/32000       .solve [12]
                0.01         0.00  84032/84032       ._sin [26]


Example of profiling using gmon (cont.)
  • Section of sample flat profile & function index

    %  cumulative     self               self     total
 time     seconds  seconds    calls   ms/call   ms/call  name
 82.2        4.10     4.10        4   1025.00   1032.50  .poisson [2]
  1.2        4.16     0.06                               uitrunc_const [4]
  1.0        4.21     0.05                               .lapi_recv_vec [5]
  1.0        4.26     0.05                               .shm_submit_slot [6]
  0.8        4.30     0.04                               ._Vector_dgsp_xfer [7]
  0.8        4.34     0.04                               .mpci_send [8]
  0.6        4.37     0.03                               ._lapi_shm_get [9]
  0.6        4.40     0.03                               ._mpi_allgather [10]
  0.6        4.43     0.03                               .allgather_tree_b [11]
  0.4        4.45     0.02    32000      0.00      0.00  .solve [12]



Index by function name

[27] .LAPI_Util.GL         [9] ._lapi_shm_get       [47] .atoi
[28] .MPI(int)            [40] ._lapi_shm_setup     [48] .fast_free
[29] .MPID_Welcome_EA     [41] ._mem_alloc          [49] .fclose_unlocked
[13] .MPID_msg_arrived    [10] ._mpi_allgather      [50] .fflush_unlocked
[30] .MPI__Allgather      [17] ._null_hndlr         [51] .free.GL
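Flat profile rows like those in the sample can be scanned programmatically to list the routines that dominate run time. This is a rough sketch, assuming each row ends with a routine name followed by a bracketed index and starts with the three numeric columns (% time, cumulative seconds, self seconds); `parse_flat_line`, `hot_spots`, and the 1% threshold are hypothetical choices for illustration, not part of gprof.

```python
import re
from typing import List, NamedTuple, Optional


class FlatEntry(NamedTuple):
    pct_time: float
    cumulative_s: float
    self_s: float
    calls: Optional[int]  # None when gprof reports no call count
    name: str


# Routine name followed by its bracketed index at end of line, e.g. ".poisson [2]".
_TAIL = re.compile(r"(\S+)\s+\[(\d+)\]\s*$")


def parse_flat_line(line: str) -> Optional[FlatEntry]:
    """Parse one flat-profile row; return None for headers and other lines."""
    m = _TAIL.search(line)
    if m is None:
        return None
    fields = line[:m.start()].split()
    if len(fields) < 3:
        return None
    try:
        pct, cum, self_s = (float(x) for x in fields[:3])
        calls = int(fields[3]) if len(fields) > 3 else None
    except ValueError:
        return None
    return FlatEntry(pct, cum, self_s, calls, m.group(1))


def hot_spots(text: str, threshold_pct: float = 1.0) -> List[str]:
    """Names of routines whose share of run time is at least threshold_pct."""
    entries = [e for e in map(parse_flat_line, text.splitlines()) if e]
    return [e.name for e in entries if e.pct_time >= threshold_pct]


# Three rows copied from the sample flat profile.
SAMPLE = """\
time seconds seconds calls ms/call ms/call name
82.2 4.10 4.10 4 1025.00 1032.50 .poisson [2]
1.2 4.16 0.06 uitrunc_const [4]
0.4 4.45 0.02 32000 0.00 0.00 .solve [12]
"""

if __name__ == "__main__":
    print(hot_spots(SAMPLE))
```

For the sample rows, only .poisson and uitrunc_const reach the 1% threshold, which matches the observation that .poisson accounts for most of the run time.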

Xprofiler
  • CPU profiling tool similar to gprof
  • Can be used to profile both serial and parallel applications
  • Uses procedure-profiling information to construct a graphical display of the functions within an application
  • Provides quick access to the profiled data and helps users identify the functions that are the most CPU-intensive
  • Based on sampling (support from both compiler and kernel)
  • Charges execution time to source lines and shows disassembly code
  • Xprofiler is in your default path on DataStar
Running Xprofiler
  • Compile the program with -g -pg
  • Run the program
  • A gmon.out file is generated (MPI applications generate gmon.out.1, …, gmon.out.n)
  • Run Xprofiler
    • On DataStar: xprofiler a.out gmon.*
Xprofiler: Main Display
  • Width of a bar: time including called routines
  • Height of a bar: time excluding called routines
  • Call arrows are labeled with the number of calls
  • Overview window for easy navigation (View → Overview)
HPMCOUNT
  • hpmcount runs an application and then reports execution wall-clock time, hardware performance counter information, derived hardware metrics, and resource utilization statistics (such as memory usage).
  • Usage
    • Serial jobs: hpmcount executable_name
    • Parallel jobs: poe hpmcount executable_name <poe options>
  • For parallel jobs you can use the above poe line in a batch script.
  • Note that there is hpm output for each task when you run this in parallel. Set MP_LABELIO=yes to label the output with the task ID.
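Once the output is labeled, the per-task reports can be separated again in post-processing. The sketch below assumes that each labeled line starts with the task ID followed by a colon (for example `0:`); that prefix format is an assumption about how MP_LABELIO=yes marks output, and the sample lines are placeholders, not real hpmcount output. The function name `split_by_task` is hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# ASSUMPTION: labeled lines look like "<task id>:<text>", optionally indented.
_LABEL = re.compile(r"^\s*(\d+):\s?(.*)$")


def split_by_task(lines: Iterable[str]) -> Dict[int, List[str]]:
    """Group labeled output lines by task ID; unlabeled lines are skipped."""
    per_task: Dict[int, List[str]] = defaultdict(list)
    for line in lines:
        m = _LABEL.match(line)
        if m:
            per_task[int(m.group(1))].append(m.group(2))
    return dict(per_task)


if __name__ == "__main__":
    # Placeholder lines for illustration only.
    sample = [
        "0:first line from task 0",
        " 1: first line from task 1",
        "0:second line from task 0",
        "an unlabeled line",
    ]
    for task, text in sorted(split_by_task(sample).items()):
        print(task, text)
```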
HPMCOUNT (cont.)
  • Hardware counters can measure number of cycles used, instructions completed, floating point instructions, cache misses etc.
  • The choice of which measurements are performed is determined by the hpm event group selected
  • No need to recompile code
  • Minimal effect on code performance
  • Profiles entire code
HPM event groups
  • To use a non-default group, add the -g X flag to hpmcount, where X is the group number. There are 60 groups in total; 5 are useful for performance analysis:
    • 60, for counts of cycles, instructions, and FP operations (including divides, FMA, loads, and stores).
    • 56, for counts of cycles, instructions, TLB misses, loads, stores, and L1 misses
    • 5, for counts of loads from L2, L3, and memory.
    • 58, for counts of cycles, instructions, loads from L3, and loads from memory.
    • 53, for counts of cycles, instructions, fixed-point operations, and FP operations (includes divides, SQRT, FMA, and FMOV or FEST).
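The five groups listed above can be kept in a small lookup table and used to assemble the hpmcount command line. This is a minimal sketch: the group numbers and descriptions are copied from the list, the -g flag and the poe prefix follow the usage shown earlier, and the helper names (`describe_group`, `hpmcount_command`) are hypothetical. The commands are only built, not run.

```python
from typing import Dict, List, Optional, Sequence

# Group numbers and descriptions as listed in the talk.
EVENT_GROUPS: Dict[int, str] = {
    60: "cycles, instructions, and FP operations (including divides, FMA, loads, and stores)",
    56: "cycles, instructions, TLB misses, loads, stores, and L1 misses",
    5: "loads from L2, L3, and memory",
    58: "cycles, instructions, loads from L3, and loads from memory",
    53: "cycles, instructions, fixed-point operations, and FP operations",
}


def describe_group(group: int) -> str:
    """Return the description for a listed group, or a neutral fallback."""
    return EVENT_GROUPS.get(group, "group not described in this talk")


def hpmcount_command(executable: str, group: Optional[int] = None,
                     parallel: bool = False,
                     poe_options: Sequence[str] = ()) -> List[str]:
    """Build 'hpmcount [-g X] executable', prefixed with poe for parallel jobs."""
    cmd = ["hpmcount"]
    if group is not None:
        cmd += ["-g", str(group)]
    cmd.append(executable)
    if parallel:
        cmd = ["poe"] + cmd + list(poe_options)
    return cmd


if __name__ == "__main__":
    for g in (60, 56, 5):
        print(" ".join(hpmcount_command("./a.out", group=g)), "#", describe_group(g))
```

Running the same executable once per group of interest gives a fuller picture, since each run collects only the counters of the selected group.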
HPM Instrumentation Library
  • Instrumentation library
  • Provides performance information for instrumented program sections
  • Supports multiple (nested) instrumentation sections
  • Multiple sections may have the same ID
  • Run-time performance information collection
  • Available on DataStar and Blue Gene
  • hpmInit( taskID, progName ) / f_hpminit( taskID, progName )
    • taskID is an integer value indicating the node ID.
    • progName is a string with the program name.
  • hpmStart( instID, label ) / f_hpmstart( instID, label )
    • instID is the instrumented section ID. It should be > 0 and <= 100 (can be overridden)
    • label is a string containing a label, which is displayed by PeekPerf.
  • hpmStop( instID ) / f_hpmstop( instID )
    • For each call to hpmStart, there should be a corresponding call to hpmStop with matching instID
  • hpmTerminate( taskID ) / f_hpmterminate( taskID )
    • This function will generate the output. If the program exits without calling hpmTerminate, no performance information will be generated.
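The start/stop pairing rules above can be illustrated with a small stand-in. The sketch below is an analogy only: it is not the HPM library and it measures wall-clock time with `time.perf_counter` instead of hardware counters. The class `SectionTimer` and its behavior on errors (raising when a stop has no matching start, or when sections are still open at termination) are assumptions made for illustration.

```python
import time
from typing import Dict, Tuple


class SectionTimer:
    """Wall-clock analogy of start/stop/terminate instrumentation calls."""

    def __init__(self) -> None:
        self._open: Dict[int, Tuple[str, float]] = {}  # inst_id -> (label, start time)
        self.totals: Dict[int, float] = {}             # inst_id -> accumulated seconds
        self.labels: Dict[int, str] = {}

    def start(self, inst_id: int, label: str) -> None:
        if inst_id in self._open:
            raise RuntimeError(f"section {inst_id} is already open")
        self._open[inst_id] = (label, time.perf_counter())

    def stop(self, inst_id: int) -> None:
        if inst_id not in self._open:
            raise RuntimeError(f"stop({inst_id}) has no matching start")
        label, t0 = self._open.pop(inst_id)
        # The same ID may be reused; time accumulates across uses.
        self.totals[inst_id] = self.totals.get(inst_id, 0.0) + time.perf_counter() - t0
        self.labels[inst_id] = label

    def terminate(self) -> Dict[int, float]:
        """Return accumulated times; nothing is reported unless this is called."""
        if self._open:
            raise RuntimeError(f"unclosed sections: {sorted(self._open)}")
        return dict(self.totals)


if __name__ == "__main__":
    timer = SectionTimer()
    timer.start(1, "outer loop")
    timer.start(2, "inner kernel")   # nested section
    sum(i * i for i in range(100000))
    timer.stop(2)
    timer.stop(1)
    for inst_id, seconds in timer.terminate().items():
        print(inst_id, timer.labels[inst_id], f"{seconds:.6f} s")
```

In a real C or Fortran code the same structure applies: one initialization call, matched start/stop pairs around each region of interest, and a final terminate call before the program exits.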
Message-Passing Performance:
  • MP_Profiler Library
    • Captures “summary” data for MPI calls
    • Source code traceback
    • User MUST call MPI_Finalize() in order to get output files.
    • No changes to source code
      • MUST compile with -g to obtain source line number information
  • MP_Tracer Library
    • Captures “timestamped” data for MPI calls
    • Source traceback
  • Available on DataStar and Blue Gene
Trace flags
  • DataStar:

MPITRACE = -L$(TRACELIB) -lmpitrace

MPIPROF = -L$(TRACELIB) -lmpiprof

HPMLIB = -L$(IHPCT_BASE)/lib/pwr4 -lhpm_r -lpmapi -lm

  • Blue Gene (new, untested):

IHPCT_BASE = /usr/local/apps/hpc_toolkit

MPITRACE = -L$(TRACELIB) -lmpitrace_f (or -lmpitrace_c)

HPMLIB = -L$(IHPCT_BASE)/lib -lhpm.rts -lpmapi -lm

Environment Flags
    • Level of traceback to the caller in the stack
    • Used to skip wrappers
    • Default: 0
    • If set to “1”, plain text output is generated
    • Otherwise, a viz file is generated
    • If set to “1”, the output is shown for each source file
    • Otherwise, output is a summary of all source files
    • If set to “1”, the statistics for a function are shown for every message size
    • Otherwise, a summary for all message sizes is given
  • PeekPerf is a viewer for data generated by HPM, Tracer and Profiling libraries, and DPOMP.
Integrated Performance Monitoring (IPM)
  • Allows users to obtain a concise summary of the performance and communication characteristics of their codes.
  • Information on use available at
  • On Blue Gene you need to recompile your code, linking to the IPM library by adding

-L/usr/local/apps/ipm/lib/ -lipm

to the link stage. For example:

    • C: mpcc main.c -L/usr/local/apps/ipm/lib/ -lipm
    • Fortran: mpxlf90 main.f -L/usr/local/apps/ipm/lib/ -lipm
  • Run your job using poe-ipm on DataStar and mpirun-ipm on Blue Gene.
  • DO NOT use together with HPMCOUNT!
IPM Output
  • In addition to summary, an in-depth analysis is available, including:
    • Load balancing
    • Communication pattern topology
    • Message size distribution
  • A file will be produced with a name combining your username and a number generated by IPM (for example mahidhar.1160615104.920400.0)
  • To generate a Web page showing detailed analysis of your code, run the ipm_parse_sdsc command followed by the filename.

bg-login1 0512/RUN1> /usr/local/apps/ipm/bin/ipm_parse_sdsc mahidhar.1160615104.920400.0

IPM at SDSC - Webpage creation in progress

Please wait - this may take several minutes.


IPM: Data processing finished - Creating HTML output - please wait.

The web page will be visible at:

Note the webpage will stay online for 30 days

It can be regenerated at any time,

or a local copy can be saved using your web browser

Standard Tuning Procedure
  • Pick a suitable dataset (a good representation of your production runs) and an optimal processor set
  • Get a rough estimate of FLOPS. Running with hpmcount or IPM (on DataStar) is the quickest way to do this.
    • 5-15% of peak is the normal range
  • Understand scaling problems by running at different processor counts
  • Single-processor performance profiling:
    • gprof, Xprofiler, HPM – identify routines or regions that dominate execution time
    • Consider creating a simple kernel that manifests the same behavior – ease of testing
    • HPM – study CPU performance in detail (cache use, etc.)
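The first two checks in this procedure reduce to simple arithmetic on numbers reported by hpmcount or IPM. The following sketch shows one way to compute percent of peak and strong-scaling efficiency; all input values in the example (operation count, run times, processor counts, and the per-processor peak) are hypothetical placeholders, and the function names are not part of any tool. The 5-15% range is the one quoted above.

```python
def percent_of_peak(flop_count: float, wall_seconds: float,
                    n_procs: int, peak_flops_per_proc: float) -> float:
    """Achieved floating-point rate as a percentage of aggregate peak."""
    achieved = flop_count / wall_seconds
    return 100.0 * achieved / (n_procs * peak_flops_per_proc)


def in_normal_range(pct: float, low: float = 5.0, high: float = 15.0) -> bool:
    """True if the percentage falls in the 5-15% range quoted in the talk."""
    return low <= pct <= high


def parallel_efficiency(t_base: float, p_base: int, t: float, p: int) -> float:
    """Strong-scaling efficiency of a run on p processors relative to a baseline.

    1.0 means perfect scaling; values well below 1.0 point to a scaling problem.
    """
    return (t_base * p_base) / (t * p)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    pct = percent_of_peak(flop_count=4.8e11, wall_seconds=100.0,
                          n_procs=8, peak_flops_per_proc=6.0e9)
    print(f"{pct:.1f}% of peak, normal range: {in_normal_range(pct)}")
    eff = parallel_efficiency(t_base=100.0, p_base=8, t=60.0, p=16)
    print(f"efficiency going from 8 to 16 processors: {eff:.2f}")
```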
MPI profiling
  • Run using IPM and/or MP_profiler to check
    • Communication/Computation ratio
    • Any anomalies, too many messages, too many collectives
    • Large differences between profiles of different tasks etc.
    • Many small messages – combine into larger messages
    • Communication pattern not suited for given network topology
  • Understand load imbalances if any
    • Ex: task 0 is spending too much time in I/O
    • Task n has very small communication time compared to others etc.
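The checks listed above can be turned into a few numbers once per-task timings have been extracted from the IPM or MP_Profiler reports. This is a rough sketch under stated assumptions: the per-task times are hypothetical, the metric definitions (maximum over mean for imbalance, a relative tolerance of 50% for flagging outliers) are common conventions chosen for illustration rather than anything prescribed by the tools, and the function names are made up.

```python
from typing import List, Sequence


def comm_fraction(comm_seconds: float, wall_seconds: float) -> float:
    """Fraction of wall-clock time a task spends in communication."""
    return comm_seconds / wall_seconds


def load_imbalance(times: Sequence[float]) -> float:
    """Ratio of the slowest task's time to the mean; 1.0 means perfectly balanced."""
    return max(times) / (sum(times) / len(times))


def outlier_tasks(times: Sequence[float], rel_tol: float = 0.5) -> List[int]:
    """Ranks whose time differs from the mean by more than rel_tol * mean."""
    mean = sum(times) / len(times)
    return [rank for rank, t in enumerate(times) if abs(t - mean) > rel_tol * mean]


if __name__ == "__main__":
    # Hypothetical per-task communication times in seconds.
    comm = [2.0, 2.1, 1.9, 0.2]
    wall = 10.0
    print([round(comm_fraction(c, wall), 2) for c in comm])
    print("imbalance:", round(load_imbalance(comm), 2))
    print("tasks to inspect:", outlier_tasks(comm))
```

A task flagged this way, such as one with far less communication time than the others, is a natural starting point when looking for the cause of a load imbalance.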
  • DataStar user guide

  • Blue Gene user guide

  • IBM HPC Toolkit Link

  • IPM