Debugging in Parallel

Using file output

As always, console output is a simple and effective debugging tool. In a parallel run, prefix every line of output with the MPI rank that printed it, so that the interleaved output of the different processes can be told apart.
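A minimal sketch of such rank-tagged output is shown here. The helper names rankTagged and debugPrint are hypothetical and not part of the provided code; the rank passed in is assumed to come from MPI_Comm_rank or from a member such as ownRankNo_.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of the provided code): builds one complete
// output line tagged with the MPI rank that produced it.
std::string rankTagged(int rank, const std::string &message)
{
  std::ostringstream line;
  line << "[rank " << rank << "] " << message << '\n';
  return line.str();
}

// Writes the tagged line with a single stream insertion and flushes it,
// so the text of one line is handed to the stream in one piece.
void debugPrint(int rank, const std::string &message, std::ostream &out = std::cout)
{
  out << rankTagged(rank, message) << std::flush;
}
```

A call such as debugPrint(ownRankNo_, "velocities exchanged") then produces a line starting with the rank. Building the whole line first reduces the chance that fragments from different processes are mixed within one line, although the order of lines from different ranks remains arbitrary.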

Another option is to use the provided output writers, especially the OutputWriterTextParallel. Once adapted to your classes, it writes a separate file per process for each timestep. To compare serial and parallel results, run your program with one process and rename the out folder to out_serial, then rerun it with two processes and rename out to out_parallel. The output files can then be compared conveniently with the program meld, for example:

meld out_serial/output_0000.txt out_parallel/output_0000.0.txt

In the resulting window you can compare the text files. Differences will be highlighted. You can also select the second parallel output file and compare it to the serial file as shown in Fig. 9 and Fig. 10.

Fig. 9 (../_images/meld1.png): Comparing the serial output (left) with the first parallel output (right).

Fig. 10 (../_images/meld2.png): Comparing the serial output (left) with the second parallel output (right).

Using gdb

With gdb, the parallel debugging procedure is a bit different from the serial one. We need to start the parallel program and attach the debugger later. If the process ID (PID) of the numsim_parallel program is known, gdb can be simply attached by using the -p option:

gdb -p 123   # for PID 123

Every process of the MPI program has its own PID, so we need multiple instances of gdb in separate terminals, one attached to each process of the parallel program. The process ID of any running program can easily be obtained with htop. In Fig. 11 you see the output of htop while numsim_parallel is running with 4 processes. The PIDs can be read in the left-most column: 22719, 22717, 22718 and 22716. Exit htop by pressing q. You can now attach to the processes with gdb -p, press Ctrl+C to interrupt execution and continue to examine the program in gdb, e.g. with backtrace.

Fig. 11 (../_images/htop.png): “htop” while a simulation is running.

A more elegant method to simplify the attachment procedure is by including a debugging barrier. Consider the following method:

void Partitioning::gdbParallelDebuggingBarrier()
{
  volatile int gdbResume = 0;

  if (nRanks_ > 0)
  {
    // print instructions
    int pid = getpid();
    std::cout << "Rank " << ownRankNo_ << ", PID " << pid << " is waiting for gdbResume=" << gdbResume
      << " to become 1 " << std::endl << std::endl
      << "gdb -p " << pid << std::endl << std::endl
      << "select-frame 2" << std::endl
      << "set var gdbResume = 1" << std::endl
      << "info locals " << std::endl
      << "continue" << std::endl << std::endl;

    // busy wait until the variable gdbResume was set to 1 from the debugger
    while (gdbResume == 0)
    {
      std::this_thread::sleep_for (std::chrono::milliseconds(5));
    }
    std::cout << "Rank " << ownRankNo_ << ", PID " << pid << " resumes because gdbResume=" << gdbResume << std::endl;
  }
}

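Call this method on all processes at the point where you want to start debugging. In line 3, the variable gdbResume is initialized to 0. It is declared volatile so that the compiler cannot optimize away the repeated reads in the loop, even though nothing in the program itself ever changes the value. The while loop in lines 18-21 waits until the variable has a different value. Without external intervention, the program would stay trapped in this loop forever. While all processes are waiting like this, you can attach gdb to each of them and set gdbResume to 1 to continue execution. The commands for doing so are printed by the method in lines 9-15. The frame number in select-frame 2 can differ between systems; if gdb reports that gdbResume is unknown, use backtrace to find the frame of gdbParallelDebuggingBarrier and select that one instead.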

If you start a program where this method is called with 2 processes, you get an output similar to the following:

$ mpirun -n 2 ./numsim_parallel lid_driven_cavity.txt

Rank 0, PID 25734 is waiting for gdbResume=0 to become 1

gdb -p 25734

select-frame 2
set var gdbResume = 1
info locals
continue

Rank 1, PID 25735 is waiting for gdbResume=0 to become 1

gdb -p 25735

select-frame 2
set var gdbResume = 1
info locals
continue

Now you have enough time to open two new terminals (or panes in terminator).
Copy the lines from “gdb -p 25734” to the first “continue” and paste them in the first terminal.
Copy the lines from “gdb -p 25735” to the last “continue” and paste them in the second terminal.

The program now continues to execute with gdb attached to the two processes in the two terminals. If it crashes, you can get the stack trace of each process via “bt” in gdb.

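Note that this is an officially recommended procedure for debugging MPI-parallel programs: the Open MPI documentation describes the same approach of printing the PID and waiting in a loop until a variable is changed from the debugger.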
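Because the barrier blocks every run until a debugger sets gdbResume, it can be convenient to make it optional. The sketch below gates it on an environment variable. The variable name NUMSIM_GDB_WAIT and the function gdbBarrierRequested are assumptions made for this example and do not exist in the provided code.

```cpp
#include <cstdlib>
#include <string>

// Returns true if the (hypothetical) environment variable NUMSIM_GDB_WAIT
// is set to "1". The variable name is an assumption made for this sketch.
bool gdbBarrierRequested()
{
  const char *value = std::getenv("NUMSIM_GDB_WAIT");
  return value != nullptr && std::string(value) == "1";
}
```

In the program, the call would then read: if (gdbBarrierRequested()) partitioning.gdbParallelDebuggingBarrier(); where partitioning stands for your Partitioning object. Starting the program with NUMSIM_GDB_WAIT=1 in front of the mpirun command enables the barrier for that run only. On a single machine the MPI processes typically inherit the variable; when running on several nodes, check how your MPI launcher forwards environment variables.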

Using gdb and core dumps

The kernel can write a core dump when the program crashes. A core dump contains the memory state of the program at the time of the crash and can be examined with gdb later. To enable the creation of core dumps, execute the following in the shell from which you start the program (the setting only applies to that shell session):

ulimit -c unlimited

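Now if you run your parallel program and it crashes, the system writes a core dump, by default to a file named core in the working directory. On some Linux systems the name and location are configured differently; the pattern in /proc/sys/kernel/core_pattern shows where core dumps go. To examine the core dump, call gdb on it: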

gdb numsim_parallel core

(Replace numsim_parallel with your program’s name and core with the path to the core file.) Then you can use the normal gdb commands, e.g. backtrace.
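As an alternative to setting the limit in the shell, a process can raise its own core file size limit with the POSIX calls getrlimit and setrlimit. The following is a sketch under the assumption of a POSIX system; the function name enableCoreDumps is hypothetical. It only matches ulimit -c unlimited if the hard limit is itself unlimited, because the soft limit cannot be raised beyond the hard limit.

```cpp
#include <sys/resource.h>

// Raises the soft core file size limit of the calling process to its hard
// limit. Returns true on success, false if either system call fails.
bool enableCoreDumps()
{
  struct rlimit limit;
  if (getrlimit(RLIMIT_CORE, &limit) != 0)
    return false;
  limit.rlim_cur = limit.rlim_max;   // the soft limit may go up to the hard limit
  return setrlimit(RLIMIT_CORE, &limit) == 0;
}
```

Calling enableCoreDumps() at the beginning of main applies the setting to every rank, independent of the shell from which the MPI launcher was started.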