Profiling
-------------------------

One goal for the second exercise is to make the program run as fast as possible. 
To measure the total execution time of your program you can either prepend `time` on the command line:

.. code-block:: bash
  
  $ time mpirun -n 2 ./numsim_parallel lid_driven_cavity.txt
  ...
  
  real	0m23,020s
  user	0m45,108s
  sys	0m0,465s

("real" is the actual wallclock time for the program execution.)
Or you can measure the runtime yourself in the program, e.g. by using `MPI_Wtime <https://www.mpich.org/static/docs/v3.2/www3/MPI_Wtime.html>`_.

It can be useful to know which parts of a program take most of the computational time. These can then be optimized. This optimization approach is generally independent of the parallel execution of the program.

Using Gprof
============

In the following we will cover how to use the tool `gprof` for this purpose. `Gprof` collects some stats while your program is running and presents them in form of a text file afterwards. In order to use it you have to compile all files with `-pg -g` and link with `-pg`. In CMake this can be done by adding a PROFILE option like this:

.. code-block:: cmake
  :linenos:

  option(PROFILE "Add flags to profile the program with gprof." OFF)
  if(PROFILE)
    SET(CMAKE_CXX_FLAGS  "${CMAKE_CXX_FLAGS} -pg -g")
    SET(CMAKE_EXE_LINKER_FLAGS  "${CMAKE_EXE_LINKER_FLAGS} -pg -g")
  endif()

Then, you can configure inside the `build` directory with this option turned on:

.. code-block:: bash
  
  $ cmake -DPROFILE=ON ..

After `make install` you get the new executable. It will run slower by a factor of roughly 2 but will produce runtime statistics in a binary file `gmon.out`. 
To visualize this file, use gprof with your program as argument:

.. code-block:: bash
    
  $ gprof numsim_parallel > out1.txt
  $ gprof -l numsim_parallel > out2.txt

Now, two text files with results were produced. `out1.txt` contains runtime information on function call level. `out2.txt` took much longer to create and contains runtime information on code line level.

The following are exemplary results. Head of file "out1.txt":

.. code-block:: bash
      
  Flat profile:

  Each sample counts as 0.01 seconds.
    %   cumulative   self              self     total           
   time   seconds   seconds    calls   s/call   s/call  name    
   15.99      0.43     0.43 320326804     0.00     0.00  std::__array_traits<int, 2ul>::_S_ref(int const (&) [2], unsigned long)
   15.99      0.86     0.43 320139373     0.00     0.00  std::array<int, 2ul>::operator[](unsigned long) const
   15.61      1.28     0.42 67447502     0.00     0.00  Array2D::operator()(int, int)
    8.18      1.50     0.22      384     0.00     0.01  SORParallel::solve()
    
Head of file "out2.txt":

.. code-block:: bash
      
  Flat profile:

  Each sample counts as 0.01 seconds.
    %   cumulative   self              self     total           
   time   seconds   seconds    calls  ns/call  ns/call  name    
    8.55      0.23     0.23                             std::array<int, 2ul>::operator[](unsigned long) const (array:190 @ 20ec4)
    8.18      0.45     0.22                             std::__array_traits<int, 2ul>::_S_ref(int const (&) [2], unsigned long) (array:56 @ 15e0d)
    7.81      0.66     0.21 320326804     0.66     0.66  std::__array_traits<int, 2ul>::_S_ref(int const (&) [2], unsigned long) (array:55 @ 15df7)
    7.44      0.86     0.20 320139373     0.62     0.62  std::array<int, 2ul>::operator[](unsigned long) const (array:189 @ 20eae)
    4.83      0.99     0.13                             Array2D::operator()(int, int) (array2d.cpp:36 @ 2fc0a)

The first column states how many percent of the total runtime was spent in either the function call or the code line. The example presented here shows that the program spends most of the time in the "[]" operator of `std::array`. Furthermore, the SOR solver accounts for 8% of the total runtime. If you scroll down the files, you'll find more context on these function calls. With this information you could the further optimize the highlighted code.

More advanced optimization
=============================
If you want to understand and optimize your program even better, you could:

* add `more efficient compilation options <https://stackoverflow.com/questions/14492436/g-optimization-beyond-o3-ofast>`_ to GCC, 
* improve `vectorization <https://stackoverflow.com/questions/1422149/what-is-vectorization>`_ (e.g. it is allowed to use `#pragma opemp simd <https://bisqwit.iki.fi/story/howto/openmp/#SimdConstructOpenmp%204%200>`_),
* evaluate your progress using `perf <https://perf.wiki.kernel.org/index.php/Tutorial>`_ or `inspecting the assembly <https://stackoverflow.com/questions/840321/how-can-i-see-the-assembly-code-for-a-c-program>`_,
* improve the numerics, by more efficient abortion criteria in the solver, a CG-method etc.,
* implement communication hiding.

These advanced topics are not necessary for the submission. But as we will evaluate the total runtime of your program, experts/enthusiasts in these fields will be rewarded.