Profiling

One goal for the second exercise is to make the program run as fast as possible. To measure the total execution time of your program you can either prepend time on the command line:

$ time mpirun -n 2 ./numsim_parallel lid_driven_cavity.txt
...

real  0m23,020s
user  0m45,108s
sys   0m0,465s

(“real” is the actual wallclock time for the program execution.) Or you can measure the runtime yourself in the program, e.g. by using MPI_Wtime.

It can be useful to know which parts of a program take most of the computational time. These can then be optimized. This optimization approach is generally independent of the parallel execution of the program.

Using Gprof

In the following we will cover how to use the tool gprof for this purpose. Gprof collects some stats while your program is running and presents them in form of a text file afterwards. In order to use it you have to compile all files with -pg -g and link with -pg. In CMake this can be done by adding a PROFILE option like this:

option(PROFILE "Add flags to profile the program with gprof." OFF)
if(PROFILE)
  SET(CMAKE_CXX_FLAGS  "${CMAKE_CXX_FLAGS} -pg -g")
  SET(CMAKE_EXE_LINKER_FLAGS  "${CMAKE_EXE_LINKER_FLAGS} -pg -g")
endif()

Then, you can configure inside the build directory with this option turned on:

$ cmake -DPROFILE=ON ..

After make install you get the new executable. It will run slower by a factor of roughly 2 but will produce runtime statistics in a binary file gmon.out. To visualize this file, use gprof with your program as argument:

$ gprof numsim_parallel > out1.txt
$ gprof -l numsim_parallel > out2.txt

Now, two text files with results were produced. out1.txt contains runtime information on function call level. out2.txt took much longer to create and contains runtime information on code line level.

The following are exemplary results. Head of file “out1.txt”:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 15.99      0.43     0.43 320326804     0.00     0.00  std::__array_traits<int, 2ul>::_S_ref(int const (&) [2], unsigned long)
 15.99      0.86     0.43 320139373     0.00     0.00  std::array<int, 2ul>::operator[](unsigned long) const
 15.61      1.28     0.42 67447502     0.00     0.00  Array2D::operator()(int, int)
  8.18      1.50     0.22      384     0.00     0.01  SORParallel::solve()

Head of file “out2.txt”:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
  8.55      0.23     0.23                             std::array<int, 2ul>::operator[](unsigned long) const (array:190 @ 20ec4)
  8.18      0.45     0.22                             std::__array_traits<int, 2ul>::_S_ref(int const (&) [2], unsigned long) (array:56 @ 15e0d)
  7.81      0.66     0.21 320326804     0.66     0.66  std::__array_traits<int, 2ul>::_S_ref(int const (&) [2], unsigned long) (array:55 @ 15df7)
  7.44      0.86     0.20 320139373     0.62     0.62  std::array<int, 2ul>::operator[](unsigned long) const (array:189 @ 20eae)
  4.83      0.99     0.13                             Array2D::operator()(int, int) (array2d.cpp:36 @ 2fc0a)

The first column states how many percent of the total runtime was spent in either the function call or the code line. The example presented here shows that the program spends most of the time in the “[]” operator of std::array. Furthermore, the SOR solver accounts for 8% of the total runtime. If you scroll down the files, you’ll find more context on these function calls. With this information you could the further optimize the highlighted code.

More advanced optimization

If you want to understand and optimize your program even better, you could:

add more efficient compilation options to GCC,
improve vectorization (e.g. it is allowed to use #pragma opemp simd),
evaluate your progress using perf or inspecting the assembly,
improve the numerics, by more efficient abortion criteria in the solver, a CG-method etc.,
implement communication hiding.

These advanced topics are not necessary for the submission. But as we will evaluate the total runtime of your program, experts/enthusiasts in these fields will be rewarded.