Debugging in Parallel
-------------------------

Using file output
=====================

Printing to the console is the simplest way to debug, and it also works for parallel programs. Prefix every line of output with the MPI rank that printed it, so that the messages of the different processes can be told apart.
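
A minimal sketch of such a rank prefix is given below. The helper names `rankPrefix` and `taggedLine` are made up for this illustration and are not part of the provided code. The rank and the number of ranks are passed as plain arguments; in an MPI program they would come from `MPI_Comm_rank` and `MPI_Comm_size`.

.. code-block:: c++

  #include <cassert>
  #include <sstream>
  #include <string>

  // Hypothetical helper: builds a prefix such as "[rank 1/4] ".
  // rank and nRanks are plain arguments here, so the helper itself
  // does not depend on MPI.
  std::string rankPrefix(int rank, int nRanks)
  {
    assert(rank >= 0 && rank < nRanks);
    std::ostringstream s;
    s << "[rank " << rank << "/" << nRanks << "] ";
    return s.str();
  }

  // Hypothetical helper: assembles a complete output line, including the
  // trailing newline, so that it can be written with a single stream insertion.
  std::string taggedLine(int rank, int nRanks, const std::string &message)
  {
    return rankPrefix(rank, nRanks) + message + "\n";
  }

Inside a class that stores `ownRankNo_` and `nRanks_`, a call could look like `std::cout << taggedLine(ownRankNo_, nRanks_, "starting time step") << std::flush;`. Writing a whole line at once tends to reduce output from different ranks being mixed within a line, but the order of lines from different ranks is still not guaranteed.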

Another option is to use the provided output writers, in particular `OutputWriterTextParallel`. Once it is adapted to your classes, it writes a separate file per process for each timestep. To compare a serial and a parallel run, run your program with one process and rename the `out` folder to `out_serial`, then rerun it with two processes and rename `out` to `out_parallel`. The output files can then be compared conveniently with the program `meld`, for example:

.. code-block:: bash
  
  meld out_serial/output_0000.txt out_parallel/output_0000.0.txt

In the resulting window you can compare the text files; differences are highlighted. You can also select the second parallel output file and compare it to the serial file. Both comparisons are shown in :numref:`meld1` and :numref:`meld2`.

.. _meld1:
.. figure:: images/meld1.png
  :width: 100%
  
  Comparing the serial output (left) with the first parallel output (right).

.. _meld2:
.. figure:: images/meld2.png
  :width: 100%
  
  Comparing the serial output (left) with the second parallel output (right).
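
If you only need to know whether and where two output files start to differ, a small helper can do the check without a graphical tool. The following is a sketch under that assumption; the function name `firstDifferingLine` is hypothetical and not part of the provided code.

.. code-block:: c++

  #include <cstddef>
  #include <istream>
  #include <string>

  // Hypothetical helper: returns the 1-based number of the first line at
  // which the two streams differ, or 0 if their lines are identical.
  // If one stream ends earlier, the first missing line counts as a difference.
  std::size_t firstDifferingLine(std::istream &a, std::istream &b)
  {
    std::string lineA, lineB;
    std::size_t lineNo = 0;
    while (true)
    {
      const bool hasA = static_cast<bool>(std::getline(a, lineA));
      const bool hasB = static_cast<bool>(std::getline(b, lineB));
      if (!hasA && !hasB)
        return 0;  // both streams ended without a difference
      ++lineNo;
      if (hasA != hasB || lineA != lineB)
        return lineNo;
    }
  }

To use it on files, open them with `std::ifstream`, e.g. `out_serial/output_0000.txt` and `out_parallel/output_0000.0.txt`, and pass both streams to the function. It only reports the position of the first difference; for inspecting the differences themselves, `meld` remains the more convenient tool.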

Using gdb
===============

With `gdb`, debugging a parallel program works a bit differently from the serial case: we start the parallel program first and attach the debugger afterwards. If the process ID (PID) of the `numsim_parallel` program is known, gdb can be attached with the `-p` option:

.. code-block:: bash
  
  gdb -p 123   # for PID 123
  
Every process of the MPI program has its own PID, so we need one instance of gdb per process, each in a separate terminal, to attach to all processes of the parallel program. The PID of any running program can be looked up with `htop`. :numref:`htop` shows the output of htop while `numsim_parallel` is running with 4 processes. The PIDs are listed in the left-most column: 22719, 22717, 22718 and 22716. Exit `htop` by pressing `q`. You can now attach to the processes with `gdb -p`. Attaching interrupts the process, so you can examine it in gdb right away, e.g. with `backtrace`. After a `continue`, press `Ctrl+C` to interrupt execution again.

.. _htop:
.. figure:: images/htop.png
  :width: 100%
  
  "htop" while a simulation is running.

A more convenient way to attach the debugger is to include a debugging barrier in the program. Consider the following method:

.. code-block:: c++
  :linenos:

  void Partitioning::gdbParallelDebuggingBarrier()
  {
    volatile int gdbResume = 0;
    
    if (nRanks_ > 0)
    {
      // print instructions
      int pid = getpid();
      std::cout << "Rank " << ownRankNo_ << ", PID " << pid << " is waiting for gdbResume=" << gdbResume
        << " to become 1 " << std::endl << std::endl
        << "gdb -p " << pid << std::endl << std::endl
        << "select-frame 2" << std::endl
        << "set var gdbResume = 1" << std::endl
        << "info locals " << std::endl
        << "continue" << std::endl << std::endl;

      // busy wait until the variable gdbResume was set to 1 from the debugger
      while (gdbResume == 0)
      {
        std::this_thread::sleep_for (std::chrono::milliseconds(5));
      }
      std::cout << "Rank " << ownRankNo_ << ", PID " << pid << " resumes because gdbResume=" << gdbResume << std::endl;
    }
  }
  
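Call this method on all processes at the point where you want to start debugging. Line 3 defines the variable `gdbResume` with the value 0. It is declared `volatile` so that the compiler re-reads it in every loop iteration. The `while` loop in lines 18-21 waits until the variable has a different value. Without external intervention, the program would stay trapped in this loop forever. While all processes are waiting like this, you can attach gdb and set `gdbResume` to 1 to continue execution. The commands for doing so are printed by the method in lines 9-15. The frame number in `select-frame 2` may differ on your system; if gdb cannot find `gdbResume`, use `backtrace` to find the frame of `gdbParallelDebuggingBarrier`. The method needs the headers `<iostream>`, `<thread>` and `<chrono>`, as well as `<unistd.h>` for `getpid`.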
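
Since the barrier blocks every run until a debugger is attached, it can be useful to enable it only on request. The following sketch checks an environment variable for this purpose. Both the function name `debuggerWaitRequested` and the variable name `NUMSIM_GDB_WAIT` are assumptions made for this illustration; they are not part of the provided code.

.. code-block:: c++

  #include <cstdlib>
  #include <string>

  // Hypothetical helper: returns true if the given environment variable
  // is set to exactly "1". The default name NUMSIM_GDB_WAIT is made up.
  bool debuggerWaitRequested(const char *variableName = "NUMSIM_GDB_WAIT")
  {
    const char *value = std::getenv(variableName);
    return value != nullptr && std::string(value) == "1";
  }

The call to `gdbParallelDebuggingBarrier()` would then be wrapped in `if (debuggerWaitRequested())`, and the barrier would only be active when the program is started with, for example, `NUMSIM_GDB_WAIT=1 mpirun -n 2 ./numsim_parallel lid_driven_cavity.txt`. On a single machine the launched processes normally inherit the variable; for runs across several nodes the MPI launcher may have to be told to forward it.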

If you run a program that calls this method with 2 processes, the output looks similar to the following:

.. code-block:: bash
  
  $ mpirun -n 2 ./numsim_parallel lid_driven_cavity.txt 

  Rank 0, PID 25734 is waiting for gdbResume=0 to become 1 

  gdb -p 25734

  select-frame 2
  set var gdbResume = 1
  info locals 
  continue

  Rank 1, PID 25735 is waiting for gdbResume=0 to become 1 

  gdb -p 25735

  select-frame 2
  set var gdbResume = 1
  info locals 
  continue

* Now you have enough time to open two new terminals (or panes in Terminator).
* Copy the lines from `gdb -p 25734` to the first `continue` and paste them in the first terminal.
* Copy the lines from `gdb -p 25735` to the last `continue` and paste them in the second terminal.

The program now continues to execute with gdb attached to the two processes in the two terminals. If it crashes, you can get the stack traces via `bt` in gdb.

Note that this is an `officially recommended procedure <https://www.open-mpi.org/faq/?category=debugging#serial-debuggers>`_ for debugging MPI-parallel programs.

Using gdb and core dumps
==============================
The kernel can write a `core dump <https://en.wikipedia.org/wiki/Core_dump>`_ when the program crashes. A core dump contains the memory state of the program at the time of the crash and can be examined with gdb later.
To enable the creation of core dumps, execute the following in the shell from which you start the program:

.. code-block:: bash
  
  ulimit -c unlimited
  
Now if you run your parallel program and it crashes, the system writes a core file, typically named `core`, to the current working directory. The exact name and location depend on the system configuration. To examine the core dump, call gdb on it:

.. code-block:: bash
  
  gdb numsim_parallel core
  
(Replace `numsim_parallel` by your program's name and `core` by the path to the core file.) Then you can use the normal :ref:`gdb` commands, e.g. `backtrace`.
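
As an alternative to calling `ulimit` in the shell, a program can raise its own core file size limit at startup with the POSIX functions `getrlimit` and `setrlimit`. The following is a minimal sketch, assuming a POSIX system such as Linux; the function name `enableCoreDumps` is hypothetical and not part of the provided code.

.. code-block:: c++

  #include <sys/resource.h>

  // Hypothetical helper: raises the soft limit for the core file size to
  // the hard limit of this process and returns true on success.
  // The effect is similar to "ulimit -c unlimited", but the soft limit
  // can never exceed the hard limit that is already in place.
  bool enableCoreDumps()
  {
    struct rlimit limit;
    if (getrlimit(RLIMIT_CORE, &limit) != 0)
      return false;
    limit.rlim_cur = limit.rlim_max;
    return setrlimit(RLIMIT_CORE, &limit) == 0;
  }

Calling `enableCoreDumps()` near the start of `main` applies the limit to that process only, so every MPI rank has to call it. Whether and where a core file is actually written still depends on the system configuration.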