Thank you for the advice Chris, this worked amazingly well.
Using @btime, the original code (UMF + algebraicmultigrid) took 10.20 s. Changing to KrylovJL_GMRES() brought it down to 5.293 s. Changing the preconditioner from algebraicmultigrid to incompletelu, it takes 590 ms! I played around with \tau, using values from 0.1 to 10,000, and the 50 from the example works perfectly. All the results above use tspan=(0.0,1.0).
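For anyone who wants to repeat the \tau sweep outside the solver, here is a rough sketch using IncompleteLU.jl directly. The matrix W below is a hypothetical stand-in (a shifted 2D Laplacian), not the W from my actual problem, so the numbers it prints say nothing about my case. It only shows how the drop tolerance trades fill-in against how well the factorization approximates a solve.

```julia
using LinearAlgebra, SparseArrays
using IncompleteLU  # provides ilu(A; τ)

# Hypothetical stand-in for the solver's W matrix: I + 100 * (2D Laplacian).
n = 32
T = spdiagm(-1 => fill(-1.0, n - 1), 0 => fill(2.0, n), 1 => fill(-1.0, n - 1))
Id = sparse(1.0I, n, n)
W = sparse(1.0I, n^2, n^2) + 100.0 * (kron(Id, T) + kron(T, Id))

b = ones(n^2)
results = map((0.1, 1.0, 50.0, 10_000.0)) do τ
    F = ilu(W, τ = τ)              # incomplete LU with drop tolerance τ
    x = ldiv!(F, copy(b))          # apply the preconditioner as an approximate solve
    (τ = τ,
     fill = nnz(F) / nnz(W),       # stored entries relative to W
     relres = norm(W * x - b) / norm(b))
end

for r in results
    println("τ = ", r.τ, "  fill = ", r.fill, "  relres = ", r.relres)
end
```

A larger \tau drops more entries, which makes the preconditioner cheaper to build and apply but a worse approximation of W, so the best value depends on the scaling of the matrix.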
I am sticking to KenCarp4 since I plan to split the equation ASAP and will post the results.
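To make the splitting plan concrete, this is a minimal sketch of what I have in mind, assuming OrdinaryDiffEq's SplitODEProblem where the first function is treated implicitly and the second explicitly. The toy equation u' = -k*u + s and the names stiff!, nonstiff!, k and s are made up for illustration and are not my real model.

```julia
using OrdinaryDiffEq

# Stiff linear part, handled implicitly by the IMEX method.
function stiff!(du, u, p, t)
    du .= -p.k .* u
    return nothing
end

# Nonstiff forcing, handled explicitly.
function nonstiff!(du, u, p, t)
    du .= p.s
    return nothing
end

u0 = [1.0]
tspan = (0.0, 1.0)
p = (k = 50.0, s = 1.0)   # hypothetical parameters

prob = SplitODEProblem(stiff!, nonstiff!, u0, tspan, p)
sol = solve(prob, KenCarp4())

println("u(1) ≈ ", sol.u[end][1], "  (steady state s/k = ", p.s / p.k, ")")
```

In the real problem the implicit part would still need the linear solver and preconditioner settings, so only the expensive stiff terms go through GMRES.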
Here’s the profile for the most optimized case, running it 100 times. I am still not very good at reading these graphs so let me know if you catch anything.
I can tell, for example, that there’s a lot of time spent copying (copyto!) and broadcasting (materialize!). BLAS.axpy! and BLAS.dot are also significant, so I can work on optimizing BLAS.
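Before rebuilding anything, a quick way to see which BLAS is loaded and how the two hot calls behave in isolation might look like the sketch below. It uses only the LinearAlgebra standard library; the vector length and repeat count are arbitrary choices, and BLAS.get_config() assumes Julia 1.7 or newer.

```julia
using LinearAlgebra

println("BLAS threads: ", BLAS.get_num_threads())
println(BLAS.get_config())        # shows which BLAS library is loaded (Julia 1.7+)

n = 1_000_000                     # arbitrary size, adjust to match the real state vector
x = rand(n)
y0 = rand(n)

# Sanity check: axpy! computes y = a*x + y in place, dot is the inner product.
y1 = BLAS.axpy!(2.0, x, copy(y0))
d = BLAS.dot(x, y0)

# Rough timings over 100 repetitions (use @btime from BenchmarkTools for real numbers).
y = copy(y0)
t_axpy = @elapsed for _ in 1:100
    BLAS.axpy!(1e-9, x, y)
end
t_dot = @elapsed for _ in 1:100
    BLAS.dot(x, y)
end
println("axpy!: ", t_axpy / 100, " s per call,  dot: ", t_dot / 100, " s per call")
```

Level-1 routines like these tend to be limited by memory bandwidth for large vectors, so it is worth trying a few values with BLAS.set_num_threads before assuming a custom build will help.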
The next step for now is splitting the ODE and using an IMEX method like you suggested. Then I will work on compiling OpenBLAS, as suggested by @ImreSamu, with a build optimized for Alder Lake and AVX2, and we’ll see what happens.
This is fun and so educational!
EDIT: It is worth noting that the optimized code at this stage runs on the M1 Mac in 700 ms, so the 590 ms above is only about 16% faster. I should definitely be able to squeeze more performance out of this.