Debugging Applications. Mahidhar Tatineni, 13th Annual SDSC Summer Institute, June 17, 2007.
Debugging Tools
• Some of the debugging tools available on the SDSC machines:
  • dbx (on DataStar)
  • pdbx (on DataStar)
  • Totalview
  • gdb (on TG IA-64 cluster)
• Totalview is available on DataStar and the TG IA-64 cluster.
• Totalview supports the following languages:
  • C
  • C++
  • Fortran77
  • Fortran90
  • Assembler
Tips to minimize problems (avoid debugging)!
• Use IMPLICIT NONE (in Fortran).
• Comment your code, and use spacing and indentation.
• Take care in choosing variable names. Avoid characters that are easily confused with others (l and 1, for example).
• Use make if possible. This lets you define compilers, compiler flags, and libraries consistently.
• Use version control software (RCS/CVS/SVN), particularly if there are multiple developers.
  • You can trace when a new bug was introduced.
  • It makes it easy to document changes.
Common MPI errors
• FORTRAN vs. C calls: The FORTRAN call to an MPI routine usually has one more argument (ierr) than the C call. Users sometimes forget the ierr argument, which causes errors.
• Incorrectly dimensioning the status variable used in some MPI routines: This variable should be declared as an integer array of length MPI_STATUS_SIZE, i.e. status(MPI_STATUS_SIZE).
• Non-blocking I/O: Some users mistakenly use MPI_WAIT with non-blocking I/O. The proper routine to call is MPIO_WAIT.
• Exceeding the limit on MPI tags: If you are using automatic tag generation in your MPI program, make sure that the tag is less than the tag limit set by the particular MPI implementation (2**32-1 on DataStar).
• Reduction operations: When using a reduction operation such as MPI_REDUCE, where a variable from every processor is combined in some way into a variable on one processor, make sure that the source and target variables are different.
Debug Procedures
• Turn off all optimizations and turn on debugging flags (-O0 -g)
• Check memory references and array bounds: -qcheck (or -C)
• Check subroutine call sequences and mismatched common blocks: -qextchk
• Check for floating point exceptions: -qflttrap
• Check for uninitialized variables: -qinitauto
• For more information on these flags, see the IBM compiler manuals online.
Simplify your problem!
• How small can you make the problem while the bug still occurs?
  • Minimize the number of MPI tasks.
  • Minimize the number of input parameters affecting the issue.
  • If you suspect an MPI problem, reduce the computation part as much as possible (this helps focus on the communication issues).
• It is OK to use PRINT statements to quickly localize the problem! If you do so in an MPI code, make sure to label the I/O so that you know which process has the problem:
  • bash: export MP_LABELIO=yes
  • csh: setenv MP_LABELIO yes
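With MP_LABELIO set, each output line carries the number of the task that wrote it, so ordinary text tools can isolate the output of one task. The sketch below assumes the label is the task number followed by a colon at the start of the line. The helper name lines_for_task is hypothetical and is not part of POE; the two input lines are taken from the sample output shown later in this talk.

```shell
#!/bin/bash
# Hypothetical helper (not part of POE): keep only the lines written by one
# MPI task, assuming MP_LABELIO prefixes each line with "<task>:".
lines_for_task() {
    local task="$1"
    # Allow optional leading blanks before the task number.
    grep -E "^ *${task}:"
}

# Illustration on two labeled lines; in practice, pipe the job output in.
printf '%s\n' \
    '0: Proc 0 sending -0.9999996424E-01 to proc 1' \
    '1: Proc 1 received value 0.0000000000E+00' |
    lines_for_task 1
```

Run as written, this prints only the line labeled with task 1.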
Sample problem to illustrate debugging process
• The sample code (sample.f) is located in /gpfs/projects/workshop/debug
• Copy the file into your directory. You can use the following job submission script and modify it to run your codes:
    /gpfs/projects/workshop/running_jobs/LLscript_mpi_p655
• Compiling the code is simple:
    mpxlf sample.f
Sample problem to illustrate debugging process
• The sample code initializes variables on all processors, does a simple computation, and then sends data from proc 0 to other processors.
• The code has three bugs:
  • an uninitialized variable
  • an array bounds violation
  • an incorrect MPI call
• We will use bounds checking compiler flags and Totalview to debug this code.
• Interesting note: With a simple compile on DataStar the bounds problem does not cause the code to crash. Just because a code runs without error does not mean there are no bugs!
Sample problem to illustrate debugging process
• We first compile the code with no debug/check flags:
    mpxlf sample.f
• One of the array values is printed on both the send and receive processors.
• Sample output:
    0: Proc 0 sending -0.9999996424E-01 to proc 1
    1: Proc 1 received value 0.0000000000E+00
• Something is wrong!
Sample problem to illustrate debugging process
• Now compile the code with the -qflttrap and -g flags:
    mpxlf -g -qflttrap=nanq:enable sample.f
• The code now dumps core and we see the following error:
    ERROR: 0031-250 task 1: Trace/BPT trap
• We have a problem on task 1. We can use dbx to locate it:
    ds001 % dbx a.out coredir.1/core
    Type 'help' for help.
    warning: The core file is truncated. You may need to increase the ulimit for file and coredump, or free some space on the filesystem.
    [using memory image in coredir.1/core]
    reading symbolic information ...
    Trace/BPT trap in sample at line 29 in file "sample.f"
    29 u(i) = dsqrt(delx1-1.0d0)
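The dbx warning in this example says that the core file was truncated. One possible remedy, sketched here for a bash shell, is to raise the soft limits on core and file size up to the hard limits before rerunning the job. This is an assumption about the interactive shell environment; a batch system may set these limits separately in the job script.

```shell
#!/bin/bash
# Sketch: raise the soft core-dump and file-size limits to their hard limits
# so that a core file is less likely to be truncated. Raising a soft limit up
# to the hard limit does not require special privileges.
ulimit -S -c "$(ulimit -H -c)"   # core file size
ulimit -S -f "$(ulimit -H -f)"   # maximum size of files written by the shell

# Show the limits now in effect.
echo "core file size limit: $(ulimit -S -c)"
echo "file size limit:      $(ulimit -S -f)"
```

If the hard limits themselves are too small, the remaining option named in the warning is to free space on the filesystem or ask the system administrators about the limits.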
Sample problem to illustrate debugging process
• Now compile the code with the -qcheck and -g flags:
    mpxlf -qcheck -g sample.f
• The code now dumps core and we see the following error:
    ERROR: 0031-250 task 0: Trace/BPT trap
• We have an array bounds problem on task 0. We can use dbx to locate it:
    ds100 % dbx a.out coredir.0/core
    Type 'help' for help.
    warning: The core file is truncated. You may need to increase the ulimit for file and coredump, or free some space on the filesystem.
    [using memory image in coredir.0/core]
    reading symbolic information ...
    Trace/BPT trap in sample at line 38 in file "sample.f"
    38 r(i) = u(i-2)-2*u(i)+u(i+1)
    (dbx)
Sample problem to illustrate debugging process
• We make the array bounds correction and recompile the code with the -qcheck and -g flags:
    mpxlf -qcheck -g sample-array-correct.f
• Sample output:
    0: Proc 0 sending 0.2980232239E-07 to proc 1
    1: Proc 1 received value 0.0000000000E+00
• We fixed the array bounds problem, but something is still wrong: the send and receive data do not match. Use Totalview to debug this further.
Using Totalview
• On DataStar, on the dspoe interactive node:
  • Compile and link your code with -g and turn off any optimizations:
      mpxlf_r -g program.f
      mpxlf90_r -g program.f
      mpcc_r -g program.c
  • Use the tvpoe wrapper to run your job. For example, for a 4 processor job (always use the full path to the executable):
      tvpoe /gpfs/projects/workshop/debug/a.out -nodes 1 -tasks_per_node 4
• On IA64:
  • The process is a little more involved. See: http://www.sdsc.edu/user_services/ia64/runjobs.html#interactive
Using Totalview: Root window
• The interface lists the processes.
• The status of each process is also displayed (running, breakpoint, hold, etc.).
Using Totalview: Process window
• Shows process-specific information.
• The variables are listed in the stack frame.
• The source code is displayed here. Breakpoints can be placed from this window.
• “Dive” to check values.
Using Totalview: Stepping through program
• Go – Start/resume execution
• Halt – Stop execution
• Kill – Terminate execution
• Next – Run to next line or instruction (function stepped over)
• Step – Run to next line or instruction (function stepped into, execution stops within function)
• Out – Execute to completion of function and return to instruction after function call
• Run To – Allows you to click on any source line and run to that point
Using Totalview: Stepping through program
• Click on a line of source code to set a breakpoint.
• Click again to clear it.
• To follow function calls, double click on the function name.
• You can also set watchpoints and actionpoints.
Using Totalview: Sample program
• [Screenshots omitted: a Totalview prompt, and the window shown after saying yes]
Sample problem to illustrate debugging process
• We make the array bounds correction and the MPI call correction. Recompile:
    mpxlf -qcheck -g sample-correct.f
• Sample output:
    0: Proc 0 sending 0.2980232239E-07 to proc 1
    1: Proc 1 received value 0.2980232239E-07
• We now get the expected output!
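The comparison between the sent and received values can also be done mechanically, which helps when the output is long. The sketch below assumes the two labeled line formats shown in the sample output above; the function name check_send_recv is hypothetical.

```shell
#!/bin/bash
# Hypothetical check: compare the value proc 0 reports sending with the value
# proc 1 reports receiving, assuming the labeled line formats of sample.f.
check_send_recv() {
    awk '
        / sending /  { sent = $5 }   # 0: Proc 0 sending <value> to proc 1
        / received / { recv = $6 }   # 1: Proc 1 received value <value>
        END {
            if (sent != "" && sent == recv) { print "match"; exit 0 }
            print "MISMATCH: sent=" sent " received=" recv
            exit 1
        }'
}

# Output of the corrected code: the two values agree.
printf '%s\n' \
    '0: Proc 0 sending 0.2980232239E-07 to proc 1' \
    '1: Proc 1 received value 0.2980232239E-07' |
    check_send_recv
```

The function exits with status 1 on a mismatch, so it can also be used as the condition of an if statement in a test script.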
References
• LLNL Totalview tutorial: http://www.llnl.gov/computing/tutorials/totalview/index.html
• Etnus tutorial and Totalview guide: http://www.sdsc.edu/user_services/datastar/docs/totalview/wwhelp/wwhimpl/java/html/wwhelp.htm and http://www.etnus.com/
• NERSC debugging tutorial: http://www.nersc.gov/nusers/help/tutorials/debug/
• Talk on debugging using Totalview by Nick Wright (SDSC): http://www.sdsc.edu/us/training/workshops/2006summerinstitute/docs/SI2006_wright_debug.pdf