Debugging MPI programs
Debugging MPI programs is similar to debugging any other program and
generally involves a long iterative process of editing, compiling and
running.
Adding print statements to your code should not be underestimated as a
debugging technique. They can be used to monitor variables or simply to
flag how far execution has got. A bisection method can be employed,
using print statements to gradually narrow down exactly where the
program is crashing. For debugging it is generally easiest to run with
the minimum number of processes that you can. This will cut down on the
amount of output you get, which can otherwise prove confusing. It may
also be easier if you only print from one of the processes. By printing
out the value of MPI_Wtime you can get an idea of where the time is
being spent in your program. Print statements can be used in
conjunction with any of the other techniques I describe later.
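As a rough sketch of the print-statement approach, the C helper below
prints a labelled checkpoint from one process only and includes a time
stamp. The name debug_checkpoint and its arguments are made up for this
illustration. It takes the rank and the time as plain arguments so that
it compiles without mpi.h; in a real MPI program you would pass the rank
obtained from MPI_Comm_rank and the value returned by MPI_Wtime.

```c
#include <stdio.h>

/* Hypothetical helper, not part of MPI: print a checkpoint line from
 * rank 0 only. In an MPI program, rank would come from MPI_Comm_rank
 * and now from MPI_Wtime(); here both are ordinary arguments.
 * Returns 1 if a line was printed and 0 otherwise. */
int debug_checkpoint(FILE *out, int rank, double now, const char *label)
{
    if (rank != 0)
        return 0;   /* every other rank stays silent */

    fprintf(out, "[rank %d] t=%.6f s: reached %s\n", rank, now, label);
    fflush(out);    /* flush so the line is not lost if the program crashes */
    return 1;
}
```

A call such as debug_checkpoint(stderr, rank, MPI_Wtime(), "before solver")
placed at a few points in the program shows how far execution got, and
the difference between successive time stamps shows where the time
goes. Moving the calls closer together on each run gives the bisection
described above.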
It depends what sort of bug you think you are looking for, but in
general it speeds up compiling and makes the hunt easier if you turn
off optimisation as a first step. You can do this with -O0 (that is a
capital letter O followed by a number zero). Obviously if it is a bug
that only shows up with optimisation on you will have to investigate at
what level of optimisation your bug shows up. Try compiling with the -C
option, which will check for most array subscripts going beyond their
declared size, and with -trapuv, which helps to find uninitialised
variables.
We also have the NAG Fortran 95 compiler, which is able to perform far
more checking of your code. The NAG compiler should not be used for
production work but is a very good tool to use during development,
testing and debugging. Use nagf95 for serial programs and mpnagf95 for
MPI programs. (It can also compile code written in C but it uses gcc to
do that.)
The NAG compiler has been wrapped so that if HPCF_MODE is set (as it is
by default) you get
/opt/NAG/bin/f95 -C=all -abi=64 -gline -I/opt/acml2.5.1/include <arguments> -L/opt/NAG/fll6a21d9l/acml -lacml
These options perform a large number of compile-time checks, enable
traceback for run-time errors and link in the maths libraries for you.
Further information about these options and all the others can be
found on the nagf95 man page.
Debuggers
If any of these techniques have shown up problems (or even if
they haven't) we can compile with the -g
option and use a debugger to get further information.
The PathScale debugger pathdb has finally reached a stage where I would
recommend it over gdb. It behaves in a very similar way but has a
better understanding of Fortran structures and arrays. A control file
allows one to submit the job to a queue. A control file might look like
$ cat db.in
cont
where

quit
The extra carriage return between where and quit is there because where
pages its output and waits for a carriage return before displaying the
second page. If the problem is in a subroutine more than a few layers
down you would otherwise lose the output.
This can then be run with
mpirun -np ntasks -dbg=pathdb a.out < db.in
If your program requires extra command line arguments these can still
be included as they were before.
If you have problems with pathdb (for instance if you need to debug
more than 4 processes at once) then gdb can be used instead.
mpirun -np ntasks -dbg=gdb a.out < db.in
Both of these runs will treat all processes equally and give you all of
the output in one file.
Floating point traps
On Linux floating point errors are not normally trapped, so your
program could have a division by zero, a floating point underflow or
an overflow and it would still run. Generally this sort of behaviour
suggests a problem with the code and you are probably not getting
correct results. I could find no easy way to turn these traps on in
Fortran but the following section of C does.
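#define _GNU_SOURCE 1   /* needed for the glibc extension feenableexcept */
#include <fenv.h>

/* The constructor attribute makes this run before main() starts. */
static void __attribute__ ((constructor))
trapfpe (void)
{
  /* Unmask every floating point exception so that each raises SIGFPE.
     Note that FE_ALL_EXCEPT includes FE_INEXACT. */
  feenableexcept (FE_ALL_EXCEPT);
}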
For convenience I have compiled this as a library so linking with
-ltrapfpe enables all floating point traps.
Alternatively PathScale have now added options to control these masks:
-TENV:simd_imask=OFF unmasks SIMD floating-point invalid-operation exceptions
-TENV:simd_dmask=OFF unmasks SIMD floating-point denormalized-operand exceptions
-TENV:simd_zmask=OFF unmasks SIMD floating-point zero-divide exceptions
-TENV:simd_omask=OFF unmasks SIMD floating-point overflow exceptions
-TENV:simd_umask=OFF unmasks SIMD floating-point underflow exceptions
-TENV:simd_pmask=OFF unmasks SIMD floating-point precision exceptions
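As an illustration of why these traps are worth enabling, the C sketch
below shows what happens when they are left masked. The function names
unchecked_divide and report_value are made up for this example, and the
code assumes the usual IEEE 754 arithmetic found on Linux x86 systems.
A division by zero does not stop the program; the only sign of trouble
is a result that is no longer a finite number.

```c
#include <math.h>
#include <stdio.h>

/* Illustrative helper (not from any library): divide two doubles.
 * The volatile copies stop the compiler from folding the division
 * away at compile time, so it really happens at run time. */
double unchecked_divide(double num, double den)
{
    volatile double n = num;
    volatile double d = den;
    return n / d;
}

/* Report whether a value is still usable. With traps masked (the
 * default) the division above never interrupts the program, so a
 * non-finite result is the only evidence that something went wrong.
 * Returns 1 if x is finite and 0 otherwise. */
int report_value(const char *name, double x)
{
    if (isnan(x) || isinf(x)) {
        printf("%s is not finite: %f\n", name, x);
        return 0;
    }
    printf("%s = %f\n", name, x);
    return 1;
}
```

Calling report_value("x", unchecked_divide(1.0, 0.0)) carries on as if
nothing had happened and merely reports a non-finite value. With the
traps enabled, the same division should instead stop the program with a
floating point exception at the point where it occurs, which is much
easier to locate in a debugger.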