Multithreaded code on a beefy computer runs just as fast as serial code on an M1 Mac

Thank you for the advice, Chris; this worked amazingly well.

Using @btime, the original code (UMFPACK + algebraicmultigrid) took 10.20 s. Switching to KrylovJL_GMRES() brings that down to 5.293 s, and switching the preconditioner from algebraicmultigrid to incompletelu brings it down to 590 ms! I played around with τ, using values from 0.1 to 10,000, and the τ = 50 from the example works perfectly. All the results above use tspan = (0.0, 1.0).
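For anyone finding this later, here is roughly the setup that produced those numbers, shown on a toy 1-D diffusion problem rather than my actual PDE. The `incompletelu` callback follows the `precs` interface from the stiff-ODE tutorial, and τ = 50.0 is the value from the example there (it is problem-dependent):

```julia
using OrdinaryDiffEq, LinearSolve, IncompleteLU
using LinearAlgebra, SparseArrays, BenchmarkTools

N = 200
# Sparse tridiagonal Laplacian: stands in for the real Jacobian sparsity pattern
A = spdiagm(-1 => ones(N - 1), 0 => fill(-2.0, N), 1 => ones(N - 1))

f!(du, u, p, t) = mul!(du, A, u)

# Preconditioner callback (OrdinaryDiffEq `precs` interface): rebuild the
# incomplete LU factorization of W only when the solver has recomputed W.
function incompletelu(W, du, u, p, t, newW, Plprev, Prprev, solverdata)
    if newW === nothing || newW
        Pl = ilu(convert(AbstractMatrix, W), τ = 50.0)
    else
        Pl = Plprev
    end
    Pl, nothing
end

prob = ODEProblem(ODEFunction(f!, jac_prototype = A), rand(N), (0.0, 1.0))

@btime solve($prob,
             KenCarp4(linsolve = KrylovJL_GMRES(),
                      precs = incompletelu, concrete_jac = true),
             save_everystep = false);
```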

I am sticking with KenCarp4 since I plan to split the equation ASAP, and I will post the results.

Here’s the profile for the most optimized case, running it 100 times. I am still not very good at reading these graphs, so let me know if you catch anything.
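For context, this is roughly how I collected it (assuming `prob` and the `incompletelu` callback from the sketch above, with a warm-up run first so compilation doesn't dominate the samples):

```julia
using Profile

alg = KenCarp4(linsolve = KrylovJL_GMRES(), precs = incompletelu, concrete_jac = true)

solve(prob, alg, save_everystep = false)   # warm up: exclude compilation from the profile
Profile.clear()
@profile for _ in 1:100
    solve(prob, alg, save_everystep = false)
end
# Text summary; a flame graph via ProfileView.jl or the VS Code profiler works too
Profile.print(format = :flat, sortedby = :count)
```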

I can tell, for example, that there's a lot of time spent copying (copyto!) and broadcasting (materialize!). BLAS.axpy! and BLAS.dot are also significant, so I can work on optimizing the BLAS side as well.
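Before touching the BLAS build, it's worth confirming which BLAS Julia is actually loading and how many threads it is using; this is plain stdlib, nothing specific to my setup:

```julia
using LinearAlgebra

BLAS.get_config()         # which BLAS library (e.g. the bundled OpenBLAS) is loaded
BLAS.get_num_threads()    # current BLAS thread count
# BLAS.set_num_threads(8) # can be tuned independently of Julia's own threads
```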

The next step for now is splitting the ODE and using an IMEX method like you suggested; a sketch of the plan is below. Then I will work on building OpenBLAS optimized for Alder Lake and AVX2, as suggested by @ImreSamu, and we'll see what happens.
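For the splitting step, the idea is along these lines: the stiff linear part goes in `f1` and is treated implicitly, the rest goes in `f2` and is treated explicitly, and KenCarp4 then acts as an IMEX (additive Runge-Kutta) method on the SplitODEProblem. The functions here are placeholders, not my actual equations:

```julia
using OrdinaryDiffEq, LinearSolve, LinearAlgebra, SparseArrays

N = 200
A = spdiagm(-1 => ones(N - 1), 0 => fill(-2.0, N), 1 => ones(N - 1))

f1!(du, u, p, t) = mul!(du, A, u)          # stiff linear part, handled implicitly
f2!(du, u, p, t) = (@. du = -0.1 * u^3)    # mild nonlinearity, handled explicitly

prob = SplitODEProblem(f1!, f2!, rand(N), (0.0, 1.0))

sol = solve(prob, KenCarp4(linsolve = KrylovJL_GMRES()), save_everystep = false)
```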

This is fun and so educational!

EDIT: It is worth noting that the optimized code at this stage runs on the M1 Mac in 700 ms, so the beefy machine is now about 16% faster. I should definitely be able to squeeze more performance out of this.
