Profiling techniques on SDSC systems Dmitry Pekurovsky SDSC Summer Institute July 16, 2007

Overview of Talk
• Standard profiling using prof, gprof and gmon (DataStar, Blue Gene)
• IBM High Performance Computing Toolkit (IHPCT) on DataStar and Blue Gene
  • Hardware Performance Monitoring (HPM)
  • MPI Tracer/Profiler
  • Xprofiler: CPU profiling tool
  • PeekPerf: visualization of performance trace information
• Integrated Performance Monitoring (IPM) on DataStar and Blue Gene
• Suggested standard tuning procedure

Standard Profiling using prof, gprof
• Standard profiling (prof, gprof) is available on both DataStar and Blue Gene.
• Three levels of profiling are available with gmon, depending on the -pg and -g options on the compile and link commands:
  • Timer tick profiling information: add -pg to the link options
  • Procedure level profiling with timer tick info: add -pg to the compile and link options
  • Full profiling (call graph info, statement level profiling, basic block profiling, and machine instruction profiling): add -pg -g to the compile and link options
• Each task generates a gmon.out.x file, where x is the rank of the task.
• Output can be read using the gprof command (and Xprofiler, as detailed later in the talk)

Available tools

Example of profiling using gmon
• Step 1: Compile using the -pg and -g options:
    mpxlf -pg -g pois-imp.f -o example1
• Step 2: Run the code to produce the gmon.out.x files:
    ds100 % ls gmon.out.*
    gmon.out.0 gmon.out.1 gmon.out.2 gmon.out.3
• Step 3: Use gprof to analyze the output:
    gprof -s example1 gmon.out.0 gmon.out.1 gmon.out.2 gmon.out.3
    gprof example1 gmon.sum > summary.dat
  The first command produces gmon.sum, which the second command analyzes, with the output redirected to summary.dat.

Example of profiling using gmon
• gprof output has the call graph info, flat profile and function index
• Section of sample call graph:

                                    called/total       parents
  index  %time    self descendents  called+self    name           index
                                    called/total       children

                  0.00        4.13        4/4          .__start [3]
  [1]     82.8    0.00        4.13        4        .poisimp [1]
                  4.10        0.03        4/4          .poisson [2]
  -------------------------------------------------------------------
                  4.10        0.03        4/4          .poisimp [1]
  [2]     82.8    4.10        0.03        4        .poisson [2]
                  0.02        0.00    32000/32000      .solve [12]
                  0.01        0.00    84032/84032      ._sin [26]
  -------------------------------------------------------------------

Example of profiling using gmon
• Section of sample flat profile & function index

     %   cumulative     self               self     total
   time     seconds  seconds    calls   ms/call   ms/call  name
   82.2        4.10     4.10        4   1025.00   1032.50  .poisson [2]
    1.2        4.16     0.06                               uitrunc_const [4]
    1.0        4.21     0.05                               .lapi_recv_vec [5]
    1.0        4.26     0.05                               .shm_submit_slot [6]
    0.8        4.30     0.04                               ._Vector_dgsp_xfer [7]
    0.8        4.34     0.04                               .mpci_send [8]
    0.6        4.37     0.03                               ._lapi_shm_get [9]
    0.6        4.40     0.03                               ._mpi_allgather [10]
    0.6        4.43     0.03                               .allgather_tree_b [11]
    0.4        4.45     0.02    32000      0.00      0.00  .solve [12]
    ...
    ...

  Index by function name
  [27] .LAPI_Util.GL        [9]  ._lapi_shm_get     [47] .atoi
  [28] .MPI(int)            [40] ._lapi_shm_setup   [48] .fast_free
  [29] .MPID_Welcome_EA     [41] ._mem_alloc        [49] .fclose_unlocked
  [13] .MPID_msg_arrived    [10] ._mpi_allgather    [50] .fflush_unlocked
  [30] .MPI__Allgather      [17] ._null_hndlr       [51] .free.GL
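As a rough illustration of reading the flat profile programmatically, the sketch below parses rows of the form shown above and lists the routines that account for at least a chosen share of the run time. The function names (parse_flat_profile, hotspots) and the 1% threshold are hypothetical choices for this example, not part of gprof; the sample rows are copied from the slide.

```python
import re

# Three rows copied from the sample flat profile on the slide.
SAMPLE = """\
 82.2       4.10     4.10        4  1025.00  1032.50  .poisson [2]
  1.2       4.16     0.06                             uitrunc_const [4]
  0.4       4.45     0.02    32000     0.00     0.00  .solve [12]
"""

# %time, cumulative seconds, self seconds, then an optional
# (calls, self ms/call, total ms/call) triple, then "name [index]".
ROW = re.compile(
    r"^\s*(?P<pct>\d+\.\d+)\s+(?P<cum>\d+\.\d+)\s+(?P<self>\d+\.\d+)"
    r"(?:\s+(?P<calls>\d+)\s+(?P<self_ms>\d+\.\d+)\s+(?P<total_ms>\d+\.\d+))?"
    r"\s+(?P<name>\S.*?)\s+\[(?P<index>\d+)\]\s*$"
)


def parse_flat_profile(text):
    """Return one dict per profile row; header and '...' lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue
        has_calls = m.group("calls") is not None
        rows.append({
            "name": m.group("name"),
            "pct_time": float(m.group("pct")),
            "self_seconds": float(m.group("self")),
            "calls": int(m.group("calls")) if has_calls else None,
            "self_ms_per_call": float(m.group("self_ms")) if has_calls else None,
        })
    return rows


def hotspots(rows, min_pct=1.0):
    """Names of routines with at least min_pct of total time, largest first."""
    ranked = sorted(rows, key=lambda r: r["pct_time"], reverse=True)
    return [r["name"] for r in ranked if r["pct_time"] >= min_pct]


rows = parse_flat_profile(SAMPLE)
print(hotspots(rows))
```

The columns are also internally consistent: for .poisson, 4 calls at 1025.00 ms/call gives the 4.10 self seconds reported in the same row.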

Xprofiler
• CPU profiling tool similar to gprof
• Can be used to profile both serial and parallel applications
• Uses procedure-profiling information to construct a graphical display of the functions within an application
• Provides quick access to the profiled data and helps users identify the functions that are the most CPU-intensive
• Based on sampling (support from both compiler and kernel)
• Charges execution time to source lines and shows disassembly code
• Xprofiler is in your default path on DataStar

Running Xprofiler
• Compile the program with -g -pg
• Run the program
• A gmon.out file is generated (MPI applications generate one gmon.out.x file per task, where x is the rank of the task)
• Run Xprofiler
  • On DataStar: xprofiler a.out gmon.*

Xprofiler: Main Display
• Width of a bar: time including called routines
• Height of a bar: time excluding called routines
• Call arrows labeled with number of calls
• Overview window for easy navigation (View -> Overview)

Xprofiler - Disassembler Code

HPMCOUNT
• hpmcount runs an application and then reports execution wall clock time, hardware performance counter information, derived hardware metrics, and resource utilization statistics (such as memory usage!).
• Usage
  • Serial job: hpmcount executable_name
  • Parallel job: poe hpmcount executable_name <poe options>
• For parallel jobs you can use the poe line above in a batch script.
• Note that there is hpm output for each task when you run in parallel. Set MP_LABELIO=yes to identify the task ID of the output.

HPMCOUNT (cont.)
• Hardware counters can measure number of cycles used, instructions completed, floating point instructions, cache misses, etc.
• The choice of which measurements are performed is determined by the hpm event group selected
• No need to recompile code
• Minimal effect on code performance
• Profiles entire code

HPM event groups
• To use a non-default group, add the -g X flag to hpmcount, where X is the number of the group. There are 60 in total; 5 are useful for performance:
  • 60, for counts of cycles, instructions, and FP operations (including divides, FMA, loads, and stores).
  • 56, for counts of cycles, instructions, TLB misses, loads, stores, and L1 misses.
  • 5, for counts of loads from L2, L3, and memory.
  • 58, for counts of cycles, instructions, loads from L3, and loads from memory.
  • 53, for counts of cycles, instructions, fixed-point operations, and FP operations (includes divides, SQRT, FMA, and FMOV or FEST).
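Since each hpmcount run collects one event group, covering several groups means rerunning the job once per group. The sketch below only builds the command lines for such a sweep; the helper names are hypothetical, and placing -g directly after hpmcount and before the executable is an assumption about argument order that should be checked against the installed hpmcount help output.

```python
# Event groups named on the slide (group number -> what it counts).
EVENT_GROUPS = {
    60: "cycles, instructions, FP operations",
    56: "cycles, instructions, TLB misses, loads, stores, L1 misses",
    5: "loads from L2, L3, and memory",
    58: "cycles, instructions, loads from L3 and from memory",
    53: "cycles, instructions, fixed-point and FP operations",
}


def hpmcount_command(executable, group=None, parallel=False):
    """Build an hpmcount command line as a list of arguments.

    Assumption: the -g option goes between 'hpmcount' and the executable.
    """
    cmd = ["poe"] if parallel else []
    cmd.append("hpmcount")
    if group is not None:
        cmd += ["-g", str(group)]
    cmd.append(executable)
    return cmd


def group_sweep(executable, parallel=False):
    """One command string per event group listed in EVENT_GROUPS."""
    return [" ".join(hpmcount_command(executable, g, parallel))
            for g in EVENT_GROUPS]


for line in group_sweep("./example1", parallel=True):
    print(line)
```

Each printed line could be pasted into a batch script as a separate run; keeping the input data identical across the runs makes the counter values comparable.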

HPMCOUNT output example

LIBHPM
• Instrumentation library
• Provides performance information for instrumented program sections
• Supports multiple (nested) instrumentation sections
• Multiple sections may have the same ID
• Run-time performance information collection
• Available on DataStar and Blue Gene

Functions
• hpmInit( taskID, progName ) / f_hpminit( taskID, progName )
  • taskID is an integer value indicating the node ID.
  • progName is a string with the program name.
• hpmStart( instID, label ) / f_hpmstart( instID, label )
  • instID is the instrumented section ID. It should be > 0 and <= 100 (can be overridden).
  • label is a string containing a label, which is displayed by PeekPerf.
• hpmStop( instID ) / f_hpmstop( instID )
  • For each call to hpmStart, there should be a corresponding call to hpmStop with matching instID.
• hpmTerminate( taskID ) / f_hpmterminate( taskID )
  • This function generates the output. If the program exits without calling hpmTerminate, no performance information will be generated.

Message-Passing Performance:
• MP_Profiler Library
  • Captures “summary” data for MPI calls
  • Source code traceback
  • User MUST call MPI_Finalize() in order to get output files.
  • No changes to source code
  • MUST compile with -g to obtain source line number information
• MP_Tracer Library
  • Captures “timestamped” data for MPI calls
  • Source traceback
• Available on DataStar and Blue Gene

Trace flags
• DataStar:
    IHPCT_BASE = /usr/local/apps/ihpct
    TRACELIB = $(IHPCT_BASE)/lib
    MPITRACE = -L$(TRACELIB) -lmpitrace
    MPIPROF = -L$(TRACELIB) -lmpiprof
    HPMINC = $(IHPCT_BASE)/include
    HPMLIB = -L$(IHPCT_BASE)/lib/pwr4 -lhpm_r -lpmapi -lm
• Blue Gene (new - untested):
    IHPCT_BASE = /usr/local/apps/hpc_toolkit
    TRACELIB = $(IHPCT_BASE)/lib
    MPITRACE = -L$(TRACELIB) -lmpitrace_f (OR -lmpitrace_c)
    HPMINC = $(IHPCT_BASE)/include
    HPMLIB = -L$(IHPCT_BASE)/lib -lhpm.rts -lpmapi -lm

MP_Profiler Summary Output

MP_Profiler Sample Call Graph Output

MP_Profiler Message Size Distribution

Environment Flags
• TRACELEVEL
  • Number of levels to trace back up the call stack to the caller
  • Used to skip wrappers
  • Default: 0
• TRACE_TEXTONLY
  • If set to “1”, plain text output is generated
  • Otherwise, a viz file is generated
• TRACE_PERFILE
  • If set to “1”, the output is shown for each source file
  • Otherwise, output is a summary of all source files
• TRACE_PERSIZE
  • If set to “1”, the statistics for a function are shown for every message size
  • Otherwise, a summary for all message sizes is given

PeekPerf • PeekPerf is a viewer for data generated by HPM, Tracer and Profiling libraries, and DPOMP.

Integrated Performance Monitoring (IPM)
• Allows users to obtain a concise summary of the performance and communication characteristics of their codes.
• Information on use is available at http://www.sdsc.edu/us/tools/top/ipm
• On Blue Gene you need to recompile your code, linking to the IPM library by adding -L/usr/local/apps/ipm/lib/ -lipm to the link stage. For example:
  • C: mpcc main.c -L/usr/local/apps/ipm/lib/ -lipm
  • Fortran: mpxlf90 main.f -L/usr/local/apps/ipm/lib/ -lipm
• Run your job using poe-ipm on DataStar and mpirun-ipm on Blue Gene.
• DO NOT use together with HPMCOUNT!

IPM Output
• In addition to the summary, an in-depth analysis is available, including:
  • Load balancing
  • Communication pattern topology
  • Message size distribution
• A file will be produced with a name combining your username and a number generated by IPM (for example mahidhar.1160615104.920400.0)
• To generate a Web page showing detailed analysis of your code, run the ipm_parse_sdsc command followed by the filename.

  bg-login1 0512/RUN1> /usr/local/apps/ipm/bin/ipm_parse_sdsc mahidhar.1160615104.920400.0
  IPM at SDSC - Webpage creation in progress
  Please wait - this may take several minutes.
  100..200..300..400..500..
  IPM: Data processing finished - Creating HTML output - please wait.
  The web page will be visible at: http://www.sdsc.edu/us/tools/top/ipm/output/bgsn.14860.0
  Note the webpage will stay online for 30 days
  It can be regenerated at any time, or a local copy can be saved using your web browser

IPM results: Webpage snapshot

Standard Tuning Procedure
• Pick a suitable dataset (a good representation of your production runs) and an optimal processor set
• Get a rough estimate of FLOPS. Running with hpmcount or IPM (on DataStar) is the quickest way to do this.
  • 5-15% of peak is the normal range
• Understand scaling problems by running at different processor counts
• Single processor performance profiling:
  • gprof, Xprofiler, HPM: identify routines or regions that dominate execution time
  • Consider creating a simple kernel that manifests the same behavior, for ease of testing
  • HPM: study CPU performance in detail (cache use, etc.)
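The "percent of peak" check is simple arithmetic: divide the achieved floating point rate by the aggregate peak rate of the processors used. A minimal sketch follows; the function names are hypothetical, and the peak of 6.0 GFLOPS per processor and the run figures are placeholder assumptions for illustration only, so substitute the values from your system's user guide and your own hpmcount or IPM output.

```python
def percent_of_peak(flop_count, wall_seconds, n_procs, peak_gflops_per_proc):
    """Achieved floating point rate as a percentage of aggregate peak.

    flop_count is the total number of floating point operations summed
    over all tasks; peak_gflops_per_proc is supplied by the caller.
    """
    achieved_gflops = flop_count / wall_seconds / 1e9
    peak_gflops = n_procs * peak_gflops_per_proc
    return 100.0 * achieved_gflops / peak_gflops


def in_normal_range(pct, low=5.0, high=15.0):
    """True if pct falls in the 5-15% band quoted on the slide."""
    return low <= pct <= high


# Hypothetical run: 4.8e11 operations in 100 s on 8 processors,
# with an assumed peak of 6.0 GFLOPS per processor.
pct = percent_of_peak(4.8e11, 100.0, 8, 6.0)
print(round(pct, 1), in_normal_range(pct))
```

With these placeholder numbers the job sustains 4.8 GFLOPS against a 48 GFLOPS aggregate peak, which is 10% and inside the quoted range.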

MPI profiling
• Run using IPM and/or MP_Profiler to check:
  • Communication/computation ratio
  • Any anomalies: too many messages, too many collectives
  • Large differences between profiles of different tasks, etc.
• Many small messages: combine into larger messages
• Communication pattern not suited for the given network topology
• Understand load imbalances, if any
  • Example: task 0 is spending too much time in I/O
  • Task n has very small communication time compared to others, etc.
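Two of the checks above reduce to small calculations once per-task timings are in hand. The sketch below computes a communication/computation ratio and flags tasks whose communication time is far from the median. The function names, the factor-of-two threshold, and the timings are all hypothetical; real values would come from the IPM or MP_Profiler output.

```python
from statistics import median


def comm_to_comp_ratio(comm_seconds, wall_seconds):
    """Ratio of time in MPI calls to the remaining (computation) time."""
    comp_seconds = wall_seconds - comm_seconds
    return comm_seconds / comp_seconds


def outlier_tasks(per_task_comm, factor=2.0):
    """Ranks whose communication time differs from the median by more
    than the given factor, in either direction."""
    med = median(per_task_comm)
    return [rank for rank, t in enumerate(per_task_comm)
            if t > factor * med or t < med / factor]


# Hypothetical per-task communication times (seconds) for a 4-task run
# with a 10 s wall clock time per task.
comm = [1.0, 4.2, 4.0, 4.1]
print(outlier_tasks(comm))
print(comm_to_comp_ratio(comm[2], 10.0))
```

Here task 0 is flagged because it communicates far less than the others, the pattern described on the slide for a task that spends its time elsewhere, for example in I/O.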

References
• DataStar user guide: http://www.sdsc.edu/us/resources/datastar/
• Blue Gene user guide: http://www.sdsc.edu/us/resources/bluegene
• IBM HPC Toolkit: https://domino.research.ibm.com/comm/research_projects.nsf/pages/actc.index.html
• IPM: http://www.sdsc.edu/us/tools/top/ipm
