About MPI

Introduction

As you may know, MPI stands for Message Passing Interface, a standard through which the processes of a parallel program communicate with each other. There are two popular open-source implementations of this standard: OpenMPI and MPICH. On our cluster we will use OpenMPI. On supercomputers, the hardware vendor typically provides its own optimized MPI implementation, e.g., Cray on the supercomputer at HLRS in Stuttgart.

An MPI-parallel program is a single binary that contains MPI-specific function calls. It can be launched by executing the mpirun command. On a compute cluster, the binary gets distributed to multiple compute nodes. It is then launched multiple times as separate processes with potentially multiple processes on every compute node. Consequently, we have multiple running instances of our program that do not share memory, i.e. they cannot have common variables. But they share an MPI context, the communicator. Using the communicator, the instances of the program can exchange values by sending messages to each other. This approach is called distributed-memory parallelism. It is not to be confused with shared-memory parallelism, for which, e.g., OpenMP can be used.

Each running instance is given an MPI rank number by which it can be identified in the code. Typically, for a simulation in 2D space such as the NumSim exercise, the computational domain is partitioned into multiple rectangular subdomains. Each process only computes the unknowns in its own subdomain, so values have to be exchanged between neighbouring subdomains. Neighbouring processes can be identified by their rank numbers if a global numbering of the subdomains has been defined. The canonical numbering of subdomains is the same as for the entries of a 2D array: start at the lower left subdomain with rank number 0, increase the rank number towards the right neighbours, then continue with the next row above, and so on.
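The numbering described above can be turned into neighbour ranks with a little index arithmetic. The following is a minimal sketch, not part of the exercise template: the names Neighbours, computeNeighbours and noNeighbour are made up for illustration, and the sentinel value -1 for "no neighbour" is an assumption (in real MPI code, MPI_PROC_NULL could serve that purpose).

```cpp
// Hypothetical sentinel for "there is no neighbour on this side" (domain boundary).
constexpr int noNeighbour = -1;

// Rank numbers of the four neighbouring subdomains.
struct Neighbours
{
  int left;
  int right;
  int bottom;
  int top;
};

// Compute the neighbours of rank ownRankNo in a grid of nColumns x nRows subdomains
// that are numbered row by row, starting with rank 0 at the lower left.
Neighbours computeNeighbours(int ownRankNo, int nColumns, int nRows)
{
  // position of the own subdomain in the grid of subdomains
  const int column = ownRankNo % nColumns;
  const int row = ownRankNo / nColumns;

  Neighbours neighbours;
  neighbours.left   = (column > 0)            ? ownRankNo - 1        : noNeighbour;
  neighbours.right  = (column < nColumns - 1) ? ownRankNo + 1        : noNeighbour;
  neighbours.bottom = (row > 0)               ? ownRankNo - nColumns : noNeighbour;
  neighbours.top    = (row < nRows - 1)       ? ownRankNo + nColumns : noNeighbour;
  return neighbours;
}
```

For example, in a grid of 3 x 2 subdomains the bottom row holds ranks 0, 1, 2 and the top row holds ranks 3, 4, 5, so rank 4 has the left neighbour 3, the right neighbour 5, the bottom neighbour 1 and no top neighbour.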

Learning MPI

To get started with MPI, this tutorial website is a good starting point. Begin with the MPI Hello World tutorial. Note how the compiler wrapper mpicc is used to compile the example program, which is written in C. For C++ you can use mpic++. On the cluster you don’t need to set any environment variables for the wrappers to work.

For the tutorials, using the plain wrappers is fine. For the NumSim exercise you will later need to tell CMake how to handle MPI programs.

The second tutorial covers sending and receiving. The MPI Reduce and Allreduce tutorial is also recommended. Because the explanations and examples in the linked tutorials are quite verbose, you may skim the text.

Using CMake

Compiling the MPI program for the submission with CMake can be achieved by adding the following lines to the inner CMakeLists.txt (the one in the src directory):

find_package(MPI REQUIRED)

include_directories(${MPI_INCLUDE_PATH})
target_link_libraries(${PROJECT_NAME} ${MPI_LIBRARIES})

if(MPI_COMPILE_FLAGS)
  set_target_properties(${PROJECT_NAME} PROPERTIES
    COMPILE_FLAGS "${MPI_COMPILE_FLAGS}")
endif()

if(MPI_LINK_FLAGS)
  set_target_properties(${PROJECT_NAME} PROPERTIES
    LINK_FLAGS "${MPI_LINK_FLAGS}")
endif()

Further hints

The following MPI calls might be needed for this exercise:

MPI_Init, MPI_Finalize,
MPI_Comm_rank, MPI_Comm_size,
MPI_Reduce, MPI_Allreduce,
MPI_Send, MPI_Recv
or MPI_Isend, MPI_Irecv, MPI_Waitall
only for debugging: MPI_Barrier, MPI_Abort

How to specify send and receive buffers

The MPI functions that send and receive data need to be given send and receive data buffers. Because MPI is a C-based standard, its functions require raw memory pointers. We agreed that we don’t want to use plain pointers in C++. The proper way is therefore to use containers of the C++ standard template library (STL) for data management and to obtain raw pointers only where they are needed in the MPI calls. The following two examples demonstrate the use of the STL containers std::array and std::vector with MPI calls. The following code sends 2 double values from MPI rank 0 to rank 1:

#include <array>
...

std::array<double,2> sendBuffer = {1.0, 2.0};
MPI_Send(sendBuffer.data(), 2, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);

Note how the data is stored in the std::array sendBuffer and how its data() method is used in the MPI call.

If the number of values to send is not fixed at compile-time, e.g. because it depends on the size of the subdomain, the following example can be used:

#include <vector>
...

int nValuesToSend = 123;
std::vector<double> sendBuffer(nValuesToSend, 0);  // the ", 0" initializes all values to zero; if it is omitted, std::vector<double> still sets all values to zero

// fill sendBuffer
// ...

MPI_Send(sendBuffer.data(), nValuesToSend, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
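In a setting like the NumSim exercise, the values to send are often one boundary row or column of the subdomain. If the field is stored row by row, a column is not contiguous in memory and has to be copied into the send buffer first. The following minimal sketch shows one way to do the "fill sendBuffer" step. The function name packColumn, the parameters nCellsX and nCellsY and the row-major storage in a single std::vector are assumptions made for illustration; your own data structures may look different.

```cpp
#include <vector>

// Copy column i of a field into a contiguous buffer.
// Assumption: the field has nCellsX x nCellsY values and is stored row by row,
// i.e. the value at column i and row j is located at index j * nCellsX + i.
std::vector<double> packColumn(const std::vector<double> &field, int nCellsX, int nCellsY, int i)
{
  std::vector<double> buffer(nCellsY);
  for (int j = 0; j < nCellsY; j++)
  {
    buffer[j] = field[j * nCellsX + i];
  }
  return buffer;
}
```

The returned vector can be passed to MPI_Send in the same way as sendBuffer above, using its data() and size() methods.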

Pitfalls with blocking calls

Consider the following example (it can be downloaded here). Compile it with mpic++ main.cpp and run it with two ranks: mpirun -n 2 ./a.out.

#include <mpi.h>
#include <iostream>

#include <vector>

int main (int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  // get own rank number and total number of ranks
  int ownRankNo = 0;
  int nRanks = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &ownRankNo);
  MPI_Comm_size(MPI_COMM_WORLD, &nRanks);

  // define the number of values to send and receive in this example
  const int nValuesToSend = 1e2;
  const int nValuesToReceive = nValuesToSend;

  // allocate the send and receive buffers
  std::vector<double> sendBuffer(nValuesToSend, ownRankNo);
  std::vector<double> receiveBuffer(nValuesToReceive);

  // determine other rank
  int otherRankNo = 0;
  if (ownRankNo == 0)
    otherRankNo = 1;

  // send to other rank and receive values from other rank
  MPI_Send(sendBuffer.data(), nValuesToSend, MPI_DOUBLE, otherRankNo, 0, MPI_COMM_WORLD);
  MPI_Recv(receiveBuffer.data(), nValuesToReceive, MPI_DOUBLE, otherRankNo, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  std::cout << "Rank " << ownRankNo << "/" << nRanks << ": The first received value is " << receiveBuffer[0] << "." << std::endl;

  MPI_Finalize();
}

The code should be easy to understand. Rank 0 sends an array containing 100 times the value “0” to rank 1, and rank 1 sends an array containing 100 times the value “1” to rank 0. Both ranks output their first received value at the end.

If you increase the number of values to communicate in line 17 (the definition of nValuesToSend), e.g. to 1e3 or beyond, the program might no longer work. The cause is a deadlock. MPI_Send and MPI_Recv are blocking calls. MPI_Send is only guaranteed to return once the communication partner has called the matching MPI_Recv and the data can be transferred over the network. For small numbers of values, like 100 in this example, the send functions can copy the values to an internal buffer and return immediately. For larger arrays, both MPI_Send calls wait for the other rank to call MPI_Recv, which never happens because both ranks are still stuck in MPI_Send.

One obvious fix would be to swap MPI_Send and MPI_Recv on one rank. This would need an if-statement on the own rank number. For 2D grids of processes the analogous case gets more complicated. Therefore, a second, more elegant solution would be to use non-blocking calls.

The non-blocking MPI calls in this case would be MPI_Isend and MPI_Irecv. The I stands for immediate. The function calls return immediately and provide a request object that can be used to wait for the operation to finish. One difference to the blocking calls is that the send and receive buffers must stay allocated until the operation has completed. In addition, the send buffer must not be modified and the receive buffer must not be read before that point.

The following code demonstrates the use of non-blocking communication for the previous example.

#include <mpi.h>
#include <iostream>

#include <vector>

int main (int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  // get own rank number and total number of ranks
  int ownRankNo = 0;
  int nRanks = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &ownRankNo);
  MPI_Comm_size(MPI_COMM_WORLD, &nRanks);

  // define the number of values to send and receive in this example
  const int nValuesToSend = 1e3;
  const int nValuesToReceive = nValuesToSend;

  // allocate the send and receive buffers
  std::vector<double> sendBuffer(nValuesToSend, ownRankNo);
  std::vector<double> receiveBuffer(nValuesToReceive);

  // determine other rank
  int otherRankNo = 0;
  if (ownRankNo == 0)
    otherRankNo = 1;

  // send to other rank and receive values from other rank
  MPI_Request sendRequest;
  MPI_Request receiveRequest;

  // post all nonblocking send and receive calls, the order is not relevant
  MPI_Isend(sendBuffer.data(), nValuesToSend, MPI_DOUBLE, otherRankNo, 0, MPI_COMM_WORLD, &sendRequest);
  MPI_Irecv(receiveBuffer.data(), nValuesToReceive, MPI_DOUBLE, otherRankNo, 0, MPI_COMM_WORLD, &receiveRequest);

  // wait for send and receive calls to complete
  MPI_Wait(&sendRequest, MPI_STATUS_IGNORE);
  MPI_Wait(&receiveRequest, MPI_STATUS_IGNORE);

  std::cout << "Rank " << ownRankNo << "/" << nRanks << ": The first received value is " << receiveBuffer[0] << "." << std::endl;

  MPI_Finalize();
}

If you want to use nonblocking calls in the NumSim exercise, you’ll probably have multiple send and receive calls. Then you should have a list of MPI_Requests and use MPI_Waitall at the end to wait for completion of all posted MPI calls:

std::vector<MPI_Request> sendRequests;  // similar for receive requests

// note that all send buffers need to be available for all send calls until they complete.
// Maybe you need to define something like
std::vector<std::vector<double>> sendBuffers;

// add a new send call
sendRequests.emplace_back();    // create new sendRequest handle

MPI_Isend(... something with sendBuffers ... MPI_COMM_WORLD, &sendRequests.back());

// more MPI_Isend / MPI_Irecv calls


// wait for all pending send requests to complete
MPI_Waitall(sendRequests.size(), sendRequests.data(), MPI_STATUSES_IGNORE);

// do the same for receive requests