Debugging in Parallel
Using file output
As always, console output is a simple and effective way to debug. It is recommended to prefix every line of output with the MPI rank that printed it.
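A minimal sketch of such rank-tagged output follows. The helper name formatRankLine is hypothetical and not part of the provided code. In a real program the rank would come from MPI_Comm_rank or from your partitioning class; here it is passed in so that the example does not depend on MPI.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: builds one complete line of debug output that is
// tagged with the MPI rank that produced it.
// The rank is passed in; in an MPI program it would come from MPI_Comm_rank.
std::string formatRankLine(int rank, const std::string &message)
{
  std::ostringstream line;
  line << "[rank " << rank << "] " << message;
  return line.str();
}
```

A call such as std::cout << formatRankLine(ownRankNo, "starting pressure solver") << std::endl; then writes the whole line in one piece, which can help to keep output from different ranks from being mixed within a line.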
Another option is to use the provided output writers, especially OutputWriterTextParallel. After adapting them to your classes, each timestep produces a separate output file for each process. You can run your program with one process, rename the out folder to out_serial, rerun the program with two processes, and rename out to out_parallel. The output files can then be conveniently compared with the program meld, for example:
meld out_serial/output_0000.txt out_parallel/output_0000.0.txt
In the resulting window, the two text files are shown side by side and differences are highlighted. Fig. 9 shows the comparison with the first parallel output file. You can also select the second parallel output file and compare it to the serial file, as shown in Fig. 10.
Fig. 9 Comparing the serial output (left) with the first parallel output (right).
Fig. 10 Comparing the serial output (left) with the second parallel output (right).
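The per-rank file names used in the meld command could be produced by a helper like the following. This is only a sketch: the function name parallelOutputFileName is hypothetical, and the naming pattern (four-digit file number, then the rank) is inferred from the example file output_0000.0.txt, not taken from the actual output writer.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: composes the name of the text output file written by
// one rank, e.g. "out/output_0000.1.txt" for file number 0 and rank 1.
// The pattern is an assumption based on the file names shown above.
std::string parallelOutputFileName(int fileNo, int rank)
{
  std::ostringstream name;
  // std::setw only applies to the next value, so only fileNo is zero-padded
  name << "out/output_" << std::setw(4) << std::setfill('0') << fileNo
       << "." << rank << ".txt";
  return name.str();
}
```

Including the rank in the file name ensures that processes never write to the same file, which is what makes the file-by-file comparison with the serial run possible.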
Using gdb
With gdb, the parallel debugging procedure is a bit different from the serial one. We need to start the parallel program and attach the debugger later. If the process ID (PID) of the numsim_parallel program is known, gdb can be simply attached by using the -p option:
gdb -p 123 # for PID 123
Every process of the MPI program has its own PID, so we need multiple instances of gdb in separate terminals, one attached to each process of the parallel program. The process ID of any running program can easily be obtained with htop. In Fig. 11 you see the output of htop while numsim_parallel is running with 4 processes. The PIDs can be read in the left-most column: 22719, 22717, 22718 and 22716. Exit htop by pressing q. You can now attach to the processes with gdb -p, press Ctrl+C to interrupt execution and continue to examine the program in gdb, e.g. with backtrace.
Fig. 11 “htop” while a simulation is running.
A more elegant method to simplify the attachment procedure is by including a debugging barrier. Consider the following method:
 1  void Partitioning::gdbParallelDebuggingBarrier()
 2  {
 3    volatile int gdbResume = 0;
 4
 5    if (nRanks_ > 0)
 6    {
 7      // print instructions
 8      int pid = getpid();
 9      std::cout << "Rank " << ownRankNo_ << ", PID " << pid << " is waiting for gdbResume=" << gdbResume
10                << " to become 1 " << std::endl << std::endl
11                << "gdb -p " << pid << std::endl << std::endl
12                << "select-frame 2" << std::endl
13                << "set var gdbResume = 1" << std::endl
14                << "info locals " << std::endl
15                << "continue" << std::endl << std::endl;
16
17      // busy wait until the variable gdbResume was set to 1 from the debugger
18      while (gdbResume == 0)
19      {
20        std::this_thread::sleep_for(std::chrono::milliseconds(5));
21      }
22      std::cout << "Rank " << ownRankNo_ << ", PID " << pid << " resumes because gdbResume=" << gdbResume << std::endl;
23    }
24  }
This method is to be called simultaneously on all processes if you want to debug the program. In line 3, the variable gdbResume is defined to be 0. The while loop in lines 18-21 waits until it has a different value. Without any external intervention, nothing would happen and the program would be trapped in this infinite loop. But you could attach gdb while all processes are waiting like this and alter the value of gdbResume to continue execution. How to do this is printed by the method in lines 9-15.
If you start a program where this method is called with 2 processes, you get an output similar to the following:
$ mpirun -n 2 ./numsim_parallel lid_driven_cavity.txt
Rank 0, PID 25734 is waiting for gdbResume=0 to become 1
gdb -p 25734
select-frame 2
set var gdbResume = 1
info locals
continue
Rank 1, PID 25735 is waiting for gdbResume=0 to become 1
gdb -p 25735
select-frame 2
set var gdbResume = 1
info locals
continue
Now you have enough time to open two new terminals (or panes in terminator).
Copy the lines from “gdb -p 25734” to the first “continue” and paste them in the first terminal.
Copy the lines from “gdb -p 25735” to the last “continue” and paste them in the second terminal.
The program now continues to execute with gdb attached to the two processes in the two terminals. If it crashes, you can get the stack traces via “bt” in gdb.
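Note that this is an officially recommended procedure for debugging MPI-parallel programs; the Open MPI documentation, for example, describes the same approach of waiting in a loop until a debugger has been attached.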
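Since the barrier blocks every run until a debugger is attached, it can be convenient to activate it only on request. The following sketch is a free-function variant of the barrier that is controlled by an environment variable. The variable name NUMSIM_GDB_BARRIER and both function names are hypothetical, the rank is passed in as a parameter, and getpid requires a POSIX system.

```cpp
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <thread>
#include <unistd.h>  // getpid (POSIX)

// Returns true if the hypothetical environment variable NUMSIM_GDB_BARRIER
// is set to a non-empty value.
bool gdbBarrierRequested()
{
  const char *value = std::getenv("NUMSIM_GDB_BARRIER");
  return value != nullptr && value[0] != '\0';
}

// Waits for a debugger only if the barrier was requested, otherwise returns
// immediately. The rank is passed in instead of being read from a class member.
void gdbBarrierIfRequested(int rank)
{
  if (!gdbBarrierRequested())
    return;

  // volatile: the value is changed from outside the program (by gdb)
  volatile int gdbResume = 0;
  std::cout << "Rank " << rank << ", PID " << getpid()
            << " is waiting, attach with: gdb -p " << getpid() << std::endl;

  // wait until gdbResume was set to 1 from the debugger
  while (gdbResume == 0)
  {
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
  }
}
```

With this variant, a normal run is unaffected, while starting the program with NUMSIM_GDB_BARRIER=1 set in the environment makes every rank wait. Whether the variable reaches all ranks depends on the MPI launcher, especially when running on several machines.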
Using gdb and core dumps
The kernel can write a core dump when the program crashes. A core dump contains the memory state of the program at the time of the crash and can be examined with gdb later. To enable the creation of core dumps, execute the following in the shell from which you start the program:
ulimit -c unlimited
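Now if you run your parallel program and it crashes, the system writes a core dump, typically to a new file named core in the working directory (the exact name and location depend on the system configuration). To examine the core dump, call gdb on it: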
gdb numsim_parallel core
(Replace numsim_parallel by your program’s name and core by the path to the core file.) Then you can use the normal GDB commands, e.g. backtrace.
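The core dump limit can also be raised from inside the program, so that it does not depend on the shell settings. The following is a minimal sketch that assumes a POSIX system; the function name enableCoreDumps is hypothetical. It raises the soft limit to the hard limit, which matches ulimit -c unlimited only if the hard limit is itself unlimited.

```cpp
#include <sys/resource.h>  // getrlimit, setrlimit, RLIMIT_CORE (POSIX)

// Hypothetical helper: raises the soft limit for the core file size of the
// calling process to its hard limit. Returns true on success.
bool enableCoreDumps()
{
  struct rlimit limit;
  if (getrlimit(RLIMIT_CORE, &limit) != 0)
    return false;

  // a process may always raise its soft limit up to the hard limit
  limit.rlim_cur = limit.rlim_max;
  return setrlimit(RLIMIT_CORE, &limit) == 0;
}
```

Calling this function at the beginning of main on every rank applies the limit to each process of the parallel program. Where the core file ends up is still decided by the system configuration.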