Profiling
One goal for the second exercise is to make the program run as fast as possible. To measure the total execution time of your program you can either prepend time on the command line:
$ time mpirun -n 2 ./numsim_parallel lid_driven_cavity.txt
...
real 0m23,020s
user 0m45,108s
sys 0m0,465s
(“real” is the actual wallclock time for the program execution.) Or you can measure the runtime yourself in the program, e.g. by using MPI_Wtime.
It can be useful to know which parts of a program take most of the computational time. These can then be optimized. This optimization approach is generally independent of the parallel execution of the program.
Using Gprof
In the following we will cover how to use the tool gprof for this purpose. Gprof collects some stats while your program is running and presents them in form of a text file afterwards. In order to use it you have to compile all files with -pg -g and link with -pg. In CMake this can be done by adding a PROFILE option like this:
1option(PROFILE "Add flags to profile the program with gprof." OFF)
2if(PROFILE)
3 SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -pg -g")
4 SET(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -pg -g")
5endif()
Then, you can configure inside the build directory with this option turned on:
$ cmake -DPROFILE=ON ..
After make install you get the new executable. It will run slower by a factor of roughly 2 but will produce runtime statistics in a binary file gmon.out. To visualize this file, use gprof with your program as argument:
$ gprof numsim_parallel > out1.txt
$ gprof -l numsim_parallel > out2.txt
Now, two text files with results were produced. out1.txt contains runtime information on function call level. out2.txt took much longer to create and contains runtime information on code line level.
The following are exemplary results. Head of file “out1.txt”:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
15.99 0.43 0.43 320326804 0.00 0.00 std::__array_traits<int, 2ul>::_S_ref(int const (&) [2], unsigned long)
15.99 0.86 0.43 320139373 0.00 0.00 std::array<int, 2ul>::operator[](unsigned long) const
15.61 1.28 0.42 67447502 0.00 0.00 Array2D::operator()(int, int)
8.18 1.50 0.22 384 0.00 0.01 SORParallel::solve()
Head of file “out2.txt”:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ns/call ns/call name
8.55 0.23 0.23 std::array<int, 2ul>::operator[](unsigned long) const (array:190 @ 20ec4)
8.18 0.45 0.22 std::__array_traits<int, 2ul>::_S_ref(int const (&) [2], unsigned long) (array:56 @ 15e0d)
7.81 0.66 0.21 320326804 0.66 0.66 std::__array_traits<int, 2ul>::_S_ref(int const (&) [2], unsigned long) (array:55 @ 15df7)
7.44 0.86 0.20 320139373 0.62 0.62 std::array<int, 2ul>::operator[](unsigned long) const (array:189 @ 20eae)
4.83 0.99 0.13 Array2D::operator()(int, int) (array2d.cpp:36 @ 2fc0a)
The first column states how many percent of the total runtime was spent in either the function call or the code line. The example presented here shows that the program spends most of the time in the “[]” operator of std::array. Furthermore, the SOR solver accounts for 8% of the total runtime. If you scroll down the files, you’ll find more context on these function calls. With this information you could the further optimize the highlighted code.
More advanced optimization
If you want to understand and optimize your program even better, you could:
add more efficient compilation options to GCC,
improve vectorization (e.g. it is allowed to use #pragma opemp simd),
evaluate your progress using perf or inspecting the assembly,
improve the numerics, by more efficient abortion criteria in the solver, a CG-method etc.,
implement communication hiding.
These advanced topics are not necessary for the submission. But as we will evaluate the total runtime of your program, experts/enthusiasts in these fields will be rewarded.