How to improve performance in a function that repeatedly defines and multiplies matrices

The main purpose of this thread was to find ways to optimize the serial code rather than focusing on parallelization either by Threads or using Distributed . With the code finally running faster than the serial Fortran equivalent on my personal PC , I think the goal (at least in terms of programming optimization of the serial code) has been achieved. Hence I am marking this threads as solved. Huge thanks to @ufechner7 @DNF @nilshg @abraemer and others for their constant support.

Further questions deal with parallelizing the code and running it on supercomputers. I have opened a new thread to carry on the further discussion.