About MPI
-------------

Introduction
=============
As you may know, `MPI` is the abbreviation for `Message Passing Interface`, a standard with which the processes of parallel programs can communicate with each other.
Two popular open-source implementations of this standard are OpenMPI and MPICH. On our cluster we will use OpenMPI. On supercomputers, the hardware vendor typically provides its own optimized version of MPI, e.g. Cray on the supercomputer at HLRS in Stuttgart.

An MPI-parallel program is a single binary that contains MPI-specific function calls. It can be launched by executing the `mpirun` command. On a compute cluster, the binary gets distributed to multiple compute nodes. It is then launched multiple times as separate processes with potentially multiple processes on every compute node. Consequently, we have multiple running instances of our program that do not share memory, i.e. they cannot have common variables. But they share an MPI context, the `communicator`. Using the communicator, the instances of the program can exchange values by sending messages to each other. This approach is called `distributed-memory` parallelism. It is not to be confused with `shared-memory` parallelism, for which, e.g., `OpenMP` can be used.

Each running instance is given an MPI `rank` number by which it can be identified in the code. Typically for a simulation in 2D space, such as the NumSim exercise, the computational domain is partitioned into multiple rectangular subdomains. Each process only computes the unknowns in its own subdomain, so values have to be exchanged between neighbouring subdomains. Neighbouring processes can be identified by their rank numbers if a global numbering of the subdomains has been defined. The canonical numbering of subdomains is the same as for entries in a 2D array: start at the lower left subdomain with rank number 0, continue increasing with the right neighbours, then continue with the next row above, etc.
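
The numbering described above can be turned into neighbour rank numbers with a few integer operations. The following is a minimal sketch, not part of the exercise template: the struct `Neighbours`, the function `computeNeighbours` and the parameters `nRanksX` and `nRanksY` (number of subdomains in x and y direction) are names assumed for illustration. A value of -1 marks a missing neighbour at the boundary of the global domain.

.. code-block:: c++

  // Rank numbers of the four neighbours of a subdomain.
  // -1 means "no neighbour", i.e. the subdomain touches the global boundary.
  struct Neighbours
  {
    int left;
    int right;
    int bottom;
    int top;
  };

  // Hypothetical helper: neighbour ranks for the row-wise numbering
  // (rank 0 at the lower left, increasing to the right, then upwards).
  Neighbours computeNeighbours(int ownRankNo, int nRanksX, int nRanksY)
  {
    const int columnIndex = ownRankNo % nRanksX;  // position in x direction
    const int rowIndex = ownRankNo / nRanksX;     // position in y direction

    Neighbours neighbours;
    neighbours.left   = (columnIndex > 0)           ? ownRankNo - 1       : -1;
    neighbours.right  = (columnIndex < nRanksX - 1) ? ownRankNo + 1       : -1;
    neighbours.bottom = (rowIndex > 0)              ? ownRankNo - nRanksX : -1;
    neighbours.top    = (rowIndex < nRanksY - 1)    ? ownRankNo + nRanksX : -1;
    return neighbours;
  }

For example, with 3 x 2 subdomains, rank 4 sits in the middle of the upper row and has the left neighbour 3, the right neighbour 5, the bottom neighbour 1 and no top neighbour.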

Learning MPI 
==============
To get started with MPI, `this tutorial website <https://mpitutorial.com/tutorials/>`_ is a good starting point. Begin with the `MPI Hello World tutorial <https://mpitutorial.com/tutorials/mpi-hello-world/>`_.
Note how the compiler wrapper `mpicc` is used to compile the example program, which is written in C. For C++ you can use `mpic++`. On the cluster you don't need to set any environment variables for the wrappers to work.

For the tutorials, using the plain wrappers is fine. For the NumSim exercise, you will later need to tell CMake how to handle MPI programs.

The `2nd tutorial <https://mpitutorial.com/tutorials/mpi-send-and-receive/>`_ covers sending and receiving. The `MPI Reduce and Allreduce <https://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/>`_ tutorial is also recommended. The explanations and examples in the linked tutorials are quite verbose, so you can skim through the text quickly.

Using CMake
=============
Compiling the MPI program for the submission with CMake can be achieved by adding the following lines to the inner `CMakeLists.txt` (the one in the `src` directory):


.. code-block:: cmake
  :linenos:
    
  find_package(MPI REQUIRED)

  include_directories(${MPI_INCLUDE_PATH})
  target_link_libraries(${PROJECT_NAME} ${MPI_LIBRARIES})

  if(MPI_COMPILE_FLAGS)
    set_target_properties(${PROJECT_NAME} PROPERTIES
      COMPILE_FLAGS "${MPI_COMPILE_FLAGS}")
  endif()

  if(MPI_LINK_FLAGS)
    set_target_properties(${PROJECT_NAME} PROPERTIES
      LINK_FLAGS "${MPI_LINK_FLAGS}")
  endif()

Further hints
================

The following MPI calls might be needed for this exercise:

.. code-block:: bash
    
    MPI_Init, MPI_Finalize,
    MPI_Comm_rank, MPI_Comm_size,
    MPI_Reduce, MPI_Allreduce,
    MPI_Send, MPI_Recv
    or MPI_Isend, MPI_Irecv, MPI_Waitall
    only for debugging: MPI_Barrier, MPI_Abort
    
How to specify send and receive buffers
=============================================

The MPI functions that send and receive data need to be given send and receive data buffers. Because MPI is a C-based standard, the functions require raw memory pointers.
We agreed that we don't want to use plain pointers in C++. The proper way is therefore to use containers of the C++ standard template library (STL) for data management and only get raw pointers when needed in the MPI calls. The following two examples demonstrate the use of the STL containers `std::array` and `std::vector` with MPI calls.
The following code sends 2 double values from MPI rank 0 to rank 1:

.. code-block:: c++

  #include <array>
  ...
  
  std::array<double,2> sendBuffer = {1.0, 2.0};
  MPI_Send(sendBuffer.data(), 2, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);

Note how the data is stored in the `std::array` `sendBuffer` and how its `data()` method is used in the MPI call.

If the number of values to send is not fixed at compile-time, e.g. because it depends on the size of the subdomain, the following example can be used:

.. code-block:: c++

  #include <vector>
  ...
  
  int nValuesToSend = 123;
  std::vector<double> sendBuffer(nValuesToSend, 0);  // the ", 0" explicitly sets all values to zero; if it is omitted, std::vector<double> still initializes all values to zero
  
  // fill sendBuffer
  // ...
  
  MPI_Send(sendBuffer.data(), nValuesToSend, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
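
MPI expects the values to send in contiguous memory, so the "fill sendBuffer" step often means copying boundary values of the subdomain into the buffer. The following sketch assumes, purely for illustration, that the field is stored row by row in a 1D `std::vector` with `nCellsX` values per row. The function name `packRightColumn` and this storage layout are assumptions, not something prescribed by the exercise.

.. code-block:: c++

  #include <vector>

  // Hypothetical helper: copy the right-most column of a field into a
  // contiguous buffer. The field is assumed to be stored row by row,
  // i.e. the value at (i,j) is located at index j*nCellsX + i.
  std::vector<double> packRightColumn(const std::vector<double> &field, int nCellsX, int nCellsY)
  {
    std::vector<double> sendBuffer(nCellsY);
    for (int j = 0; j < nCellsY; j++)
    {
      sendBuffer[j] = field[j * nCellsX + (nCellsX - 1)];
    }
    return sendBuffer;
  }

The returned vector can then be passed to `MPI_Send` in the same way as above, using `sendBuffer.data()` as the pointer and the number of rows as the count.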

Pitfalls with blocking calls
=================================
    
Consider the following example (it can be downloaded `here <https://gitlab.com/ipvs_sgs/numsim_exercises/raw/master/exercise2/resources/main.cpp>`_). Compile it with `mpic++ main.cpp` and run it with two ranks: `mpirun -n 2 ./a.out`.

.. code-block:: c++
  :linenos:

  #include <mpi.h>
  #include <iostream>

  #include <vector>

  int main (int argc, char *argv[])
  {
    MPI_Init(&argc, &argv);
    
    // get own rank number and total number of ranks
    int ownRankNo = 0;
    int nRanks = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &ownRankNo);
    MPI_Comm_size(MPI_COMM_WORLD, &nRanks);
    
    // define the number of values to send and receive in this example
    const int nValuesToSend = 1e2;
    const int nValuesToReceive = nValuesToSend;
    
    // allocate the send and receive buffers
    std::vector<double> sendBuffer(nValuesToSend, ownRankNo);
    std::vector<double> receiveBuffer(nValuesToReceive);
    
    // determine other rank
    int otherRankNo = 0;
    if (ownRankNo == 0)
      otherRankNo = 1;
    
    // send to other rank and receive values from other rank
    MPI_Send(sendBuffer.data(), nValuesToSend, MPI_DOUBLE, otherRankNo, 0, MPI_COMM_WORLD);
    MPI_Recv(receiveBuffer.data(), nValuesToReceive, MPI_DOUBLE, otherRankNo, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    
    std::cout << "Rank " << ownRankNo << "/" << nRanks << ": The first received value is " << receiveBuffer[0] << "." << std::endl;
    
    MPI_Finalize();
  }

The code should be easily understandable. Rank 0 sends an array containing 100 times the value "0" to rank 1, rank 1 sends an array containing 100 times the value "1" to rank 0. Both ranks output their first received value at the end.

If you increase the number of values to communicate in line 17, e.g. to `1e3` or more, the program might no longer work and hang in a deadlock. The reason is that `MPI_Send` and `MPI_Recv` are blocking calls. `MPI_Send` is only guaranteed to return once the communication partner has called the matching `MPI_Recv` and the data can be transferred over the network. For small numbers of values, like 100 in this example, the send functions can typically copy the values to an internal buffer and return immediately. For larger arrays, both `MPI_Send` calls wait for the other rank to call `MPI_Recv`, which never happens because both ranks are stuck in their send call.

One obvious fix would be to swap `MPI_Send` and `MPI_Recv` on one rank. This would need an `if` statement on the own rank number. For 2D grids of processes the analogous case gets more complicated. Therefore, a second, more elegant solution would be to use non-blocking calls.
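
To give an idea of how the `if`-based fix could look on a 2D grid of processes, the following sketch colours the subdomains like a checkerboard: ranks on "even" fields call `MPI_Send` first, ranks on "odd" fields call `MPI_Recv` first. This is only one possible scheme. The function name `sendsFirst` is hypothetical, and the sketch assumes the row-wise rank numbering with `nRanksX` subdomains per row and no periodic boundaries.

.. code-block:: c++

  // Hypothetical helper: decide whether a rank calls MPI_Send before MPI_Recv
  // in a pairwise exchange with a direct neighbour.
  // Assumes row-wise rank numbering with nRanksX subdomains per row.
  bool sendsFirst(int ownRankNo, int nRanksX)
  {
    const int columnIndex = ownRankNo % nRanksX;  // position in x direction
    const int rowIndex = ownRankNo / nRanksX;     // position in y direction

    // checkerboard colouring: direct neighbours always differ in parity
    return (columnIndex + rowIndex) % 2 == 0;
  }

Two direct neighbours (left/right or bottom/top) always get opposite results, so in each pairwise exchange one side sends first while the other side receives first. The order in which a rank works through its neighbours still has to be chosen consistently on all ranks, which is part of what makes this approach more cumbersome than non-blocking calls.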

The non-blocking MPI calls in this case would be `MPI_Isend` and `MPI_Irecv`. The `I` stands for `immediate`. The function calls return immediately and provide a request object that can be used to wait for the operation to finish. One difference from the blocking calls is that the send and receive buffers must stay allocated until the operation has completed. In addition, the send buffer must not be modified and the receive buffer must not be read before that point.

The following code demonstrates the use of non-blocking communication for the previous example.

.. code-block:: c++
  :linenos:

  #include <mpi.h>
  #include <iostream>

  #include <vector>

  int main (int argc, char *argv[])
  {
    MPI_Init(&argc, &argv);
    
    // get own rank number and total number of ranks
    int ownRankNo = 0;
    int nRanks = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &ownRankNo);
    MPI_Comm_size(MPI_COMM_WORLD, &nRanks);
    
    // define the number of values to send and receive in this example
    const int nValuesToSend = 1e3;
    const int nValuesToReceive = nValuesToSend;
    
    // allocate the send and receive buffers
    std::vector<double> sendBuffer(nValuesToSend, ownRankNo);
    std::vector<double> receiveBuffer(nValuesToReceive);
    
    // determine other rank
    int otherRankNo = 0;
    if (ownRankNo == 0)
      otherRankNo = 1;
    
    // send to other rank and receive values from other rank
    MPI_Request sendRequest;
    MPI_Request receiveRequest;
    
    // post all nonblocking send and receive calls, the order is not relevant
    MPI_Isend(sendBuffer.data(), nValuesToSend, MPI_DOUBLE, otherRankNo, 0, MPI_COMM_WORLD, &sendRequest);
    MPI_Irecv(receiveBuffer.data(), nValuesToReceive, MPI_DOUBLE, otherRankNo, 0, MPI_COMM_WORLD, &receiveRequest);
    
    // wait for send and receive calls to complete
    MPI_Wait(&sendRequest, MPI_STATUS_IGNORE);
    MPI_Wait(&receiveRequest, MPI_STATUS_IGNORE);
    
    std::cout << "Rank " << ownRankNo << "/" << nRanks << ": The first received value is " << receiveBuffer[0] << "." << std::endl;
    
    MPI_Finalize();
  }

If you want to use nonblocking calls in the NumSim exercise, you'll probably have multiple send and receive calls. Then you should have a list of MPI_Requests and use `MPI_Waitall` at the end to wait for completion of all posted MPI calls:

.. code-block:: c++
  :linenos:

  std::vector<MPI_Request> sendRequests;  // similar for receive requests
  
  // note that all send buffers need to be available for all send calls until they complete.
  // Maybe you need to define something like
  std::vector<std::vector<double>> sendBuffers;
  
  // add a new send call
  sendRequests.emplace_back();    // create new sendRequest handle
  
  MPI_Isend(...sth. with sendBuffers ... MPI_COMM_WORLD, &sendRequests.back());
  
  // more MPI_Isend / MPI_Irecv calls
  
  
  // wait for all pending send requests to complete
  MPI_Waitall(sendRequests.size(), sendRequests.data(), MPI_STATUSES_IGNORE);
  
  // do the same for receive requests