Debugging MPI programs
Debugging MPI programs is similar to debugging any other program and
generally involves a long iterative process of
editing, compiling and running.
Adding print statements to your code should not be underestimated as a
debugging technique. They can be used to monitor variables or simply to
flag how far execution of your program has got. A bisection method can be
employed, where print statements are used to gradually narrow down exactly
where the program is crashing. For debugging it is generally easiest to
run with the minimum number of processes that you can. This will cut down
on the amount of output you get, which can prove confusing. It may also be
easier if you only print from one of the processes. By printing out the
value of MPI_Wtime at several points you can get an idea of where the time
is being spent in your program. Print statements can be used in
conjunction with any of the other techniques I describe later.
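As an illustration of the MPI_Wtime idea, suppose each checkpoint in the program prints a line of the form "CHECKPOINT label time", where time is the value returned by MPI_Wtime. That line format and the function name checkpoint_deltas are assumptions made for this sketch; nothing in MPI requires them. A small shell filter can then turn the output of one process into the time spent between consecutive checkpoints:

```shell
# checkpoint_deltas: read program output on stdin and report the elapsed
# time between consecutive checkpoints.
# Assumes (hypothetical format) lines such as "CHECKPOINT solve 12.500",
# where the last field is an MPI_Wtime value printed by a single process.
# All other lines of output are ignored.
checkpoint_deltas() {
    awk '$1 == "CHECKPOINT" {
        if (seen) printf "%s -> %s %.3f\n", prev_label, $2, $3 - prev_time
        prev_label = $2; prev_time = $3; seen = 1
    }'
}
```

It could be used as checkpoint_deltas < run.log, where run.log is a placeholder for wherever you saved the output of the process you are interested in.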
It depends on what sort of bug you think you are looking for, but in
general it speeds up compiling and makes the hunt easier if you turn
off optimisation as a first step. You can do this with -O0 (that is a
capital letter O followed by the number zero). Obviously, if the bug only
shows up with optimisation on, you will have to investigate at what level
of optimisation it appears. If your bug still shows up with no
optimisation, then try compiling with the environment variable
HPCF_VERBOSE=all; this sets -xcheck=stkovf -fpover -u -ansi. Having fixed
any problems this shows up, compile with the -C option, which checks for
most array subscripts going beyond their declared size. If this shows up
no problems, then compile with the -g option and use a debugger.
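If the bug only appears with optimisation on, the search for the level at which it first shows up can be scripted. The following is a minimal sketch: the function name first_failing_opt is made up for illustration, and it relies on a command of your own (here called CHECK) that rebuilds the program with the flag it is given, runs it, and returns non-zero when the bug appears.

```shell
# first_failing_opt CHECK FLAG...
# Call the user-supplied command CHECK once per optimisation flag, in the
# order given, and print the first flag for which CHECK fails.
# CHECK is hypothetical: it should rebuild the program with the flag it is
# passed, run it, and return non-zero if the bug shows up.
first_failing_opt() {
    check=$1
    shift
    for level in "$@"; do
        if ! "$check" "$level"; then
            echo "$level"
            return 0
        fi
    done
    # CHECK succeeded with every flag that was tried.
    return 1
}
```

A call might look like first_failing_opt build_and_test -O0 -O1 -O2 -O3, where build_and_test is your own script and the list of flags should be replaced by the levels your compiler actually accepts.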
dbx is the debugger most users are familiar with. It can be used
interactively
mprun -np ntasks -o dbx a.out
but a control file might make things easier and allows you to submit the
job to a queue. This may be needed if the time and memory limits on the
login machines make it difficult to debug your program interactively. It
does of course have the drawback that you have to repeat the whole run if
there is another command you want to run in dbx. A control file might
look like
$ cat dbx.in
catch FPE
catch SIGSEGV
catch SIGBUS
run inputfile
where
dump
quit
This can then be run with
mprun -np ntasks -o dbx a.out < dbx.in
This gives you the output from all the processes, which is pretty
confusing, and a lot of it may prove redundant. It might be easier to
just look at a few of the MPI processes.
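$ cat rundebug.ksh
#!/bin/ksh
# Restrict debugging to a subset of MPI processes: ranks 0 and 1 run
# under dbx, all other ranks run the program normally.
# -lt compares numerically; < inside [[ ]] would compare as strings.
if [[ $MP_RANK -lt 2 ]]
then
    # Feed dbx the control file and keep one output file per rank.
    dbx a.out < dbx.in > dbx.out.t$MP_RANK
    # When dbx has finished, kill the whole job.
    mpkill -9 $MP_JOBID
else
    ./a.out
fi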
This script needs to be executable, so change its permissions with
chmod +x rundebug.ksh
Then run it with
mprun -np ntasks rundebug.ksh
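The rank test in rundebug.ksh can be generalised to pick out any set of ranks instead of the first two. The sketch below is one way to do that; the function name rank_selected and the variable DEBUG_RANKS mentioned afterwards are invented for this example, and only MP_RANK comes from the script above.

```shell
# rank_selected RANK LIST
# Succeed (return 0) if RANK appears in the space-separated LIST of ranks.
# Ranks are compared as numbers, so rank 10 is not mistaken for rank 1.
rank_selected() {
    rank=$1
    for candidate in $2; do
        if [ "$candidate" -eq "$rank" ]; then
            return 0
        fi
    done
    return 1
}
```

With this function defined near the top of the script, the condition could become if rank_selected "$MP_RANK" "$DEBUG_RANKS", with DEBUG_RANKS set to something like "0 3 7" before the job is started.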
The alternative to dbx is prism. This is a graphical tool that provides
most of the functionality of dbx but in a friendlier way.
To run your job with prism
mprun -np ntasks prism a.out
Then, when prism has started up, you need to hit run. Programs take
longer to run under prism than under dbx, but some people find the
interface more intuitive.