I tried profiling, but the largest count I found in the trace was 5. Do I need to increase the number of measurements in this case? One evaluation takes around 300 ms.
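For reference, one way to collect more samples is to run the call repeatedly under @profile and shorten the sampling interval. This is only a sketch: evaluate_once is a hypothetical stand-in for the real 300 ms evaluation, and the delay value is an arbitrary choice that may not be honored on every platform.

```julia
using Profile

# Hypothetical stand-in for one evaluation of the real code.
function evaluate_once(n)
    a = rand(n, n)
    s = 0.0
    for _ in 1:50
        s += sum(a * a)
    end
    return s
end

evaluate_once(50)                  # warm up so compilation is not profiled
Profile.clear()                    # drop samples from earlier runs
Profile.init(delay = 0.0005)       # assumed value: sample every 0.5 ms
@profile for _ in 1:20             # repeat the call to accumulate samples
    evaluate_once(50)
end
Profile.print(maxdepth = 12)
```

Repeating the call is usually the simpler of the two knobs, since the counts then scale with the number of repetitions.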
In the optimized Python code I rely heavily on vectorization with NumPy, plus Cython to avoid overhead in the inner loop. The matrices can be quite small (20x20), but in some cases they are around 200x200. So not really big, I guess.
I’ll try to allocate aa, bb, cc outside the loop. Thanks for the hint.
You can check how many allocations fox_goodwin_step! does using @time fox_goodwin_step!(...) (just keep in mind the first run will be contaminated by JIT compilation overhead, both in time and allocations).
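A minimal sketch of that measurement, assuming a hypothetical in-place function step! in place of fox_goodwin_step! (whose signature isn't shown here). Wrapping @allocated in a small helper function avoids counting allocations that come from working with global variables:

```julia
# Hypothetical stand-in for fox_goodwin_step!; writes its result into `out`.
function step!(out, a, b)
    out .= a .+ 2 .* b        # fused in-place broadcast, no temporary array
    return out
end

# Helper so the measurement runs inside a function, not at global scope.
measure(out, a, b) = @allocated step!(out, a, b)

a = rand(200, 200)
b = rand(200, 200)
out = similar(a)

step!(out, a, b)              # first call pays the JIT compilation cost
@time step!(out, a, b)        # second call gives a representative timing
measure(out, a, b)            # warm up the helper as well
bytes = measure(out, a, b)
println("bytes allocated: ", bytes)
```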
Making it allocation-free will help, but by the looks of it that will uglify the implementation. You’ll need:
- some named working arrays (aa, bb, cc at least)
- more in-place broadcasting with .=
- LinearAlgebra.mul! for in-place matrix multiplication
- probably lu! plus ldiv! to replace the \ solve
- maybe some manual loops for adding to the diagonals in place; I couldn’t see how to do this with stdlib LinearAlgebra
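To make the list concrete, here is a rough sketch of how those pieces fit together. It is not the actual fox_goodwin_step!: the function solve_step!, its arguments, and the computation (solving (A*B + shift*I) X = R) are made up for illustration.

```julia
using LinearAlgebra

# Hypothetical step: solve (A*B + shift*I) * X = R, overwriting R with X.
# `work` is a caller-provided buffer with the same size as A.
function solve_step!(R, A, B, shift, work)
    mul!(work, A, B)                    # work = A*B without a temporary
    @inbounds for i in axes(work, 1)    # add to the diagonal in place
        work[i, i] += shift
    end
    F = lu!(work)                       # factorize in place (overwrites work)
    ldiv!(F, R)                         # R = F \ R in place
    return R
end

n = 20
A = rand(n, n)
B = rand(n, n)
R = rand(n, n)
shift = 100.0
work = similar(A)                       # allocated once, outside any loop

expected = (A * B + shift * I) \ R      # allocating reference version
solve_step!(R, A, B, shift, work)
println(R ≈ expected)
```

Note that lu! still allocates a small pivot vector, so this is not strictly zero-allocation, but the n-by-n temporaries are gone. As an alternative to the manual loop, a view over diagind(work) with .+= should also update the diagonal in place.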