The main purpose of this thread was to find ways to optimize the serial code rather than focusing on parallelization either by Threads
or using Distributed
. With the code finally running faster than the serial Fortran equivalent on my personal PC , I think the goal (at least in terms of programming optimization of the serial code) has been achieved. Hence I am marking this threads as solved. Huge thanks to @ufechner7 @DNF @nilshg @abraemer and others for their constant support.
Further questions deal with parallelizing the code and running it on supercomputers. I have opened a new thread to carry on the further discussion.