Multithreaded code on beefy computer runs just as fast as serial code on M1 Mac

ash · January 26, 2022, 5:50pm

Thank you for the advice Chris, this worked amazingly well.

Using @btime, the original code (UMF + algebraicmultigrid) took 10.20 s. Changing the linear solver to KrylovJL_GMRES() brings it down to 5.293 s. Changing the preconditioner from algebraicmultigrid to incompletelu brings it down to 590 ms! I played around with τ, trying values from 0.1 to 10,000, and the 50 from the example works perfectly. All the results above use tspan=(0.0,1.0).
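For anyone following along, here is a minimal sketch of that solver configuration. It is not the code from this thread: the problem (`rhs!`, `jac!`, `pattern`, `N`) is a hypothetical stand-in, and the `incompletelu` callback follows the form used in the SciML tutorial on large stiff equations. It assumes OrdinaryDiffEq, LinearSolve and IncompleteLU are installed, and the `precs` signature may differ between package versions.

```julia
using OrdinaryDiffEq, LinearSolve, IncompleteLU, SparseArrays

# Hypothetical stand-in problem: 1D diffusion with a cubic sink,
# zero boundary values, N grid points. p is the diffusion coefficient.
const N = 200

function rhs!(du, u, p, t)
    for i in 1:N
        left  = i == 1 ? zero(eltype(u)) : u[i - 1]
        right = i == N ? zero(eltype(u)) : u[i + 1]
        du[i] = p * (left - 2u[i] + right) - u[i]^3
    end
    return nothing
end

# Analytic Jacobian, written into the tridiagonal sparsity pattern.
function jac!(J, u, p, t)
    for i in 1:N
        J[i, i] = -2p - 3u[i]^2
        i > 1 && (J[i, i - 1] = p)
        i < N && (J[i, i + 1] = p)
    end
    return nothing
end

pattern = spdiagm(-1 => ones(N - 1), 0 => ones(N), 1 => ones(N - 1))
f = ODEFunction(rhs!; jac = jac!, jac_prototype = pattern)
u0 = [sin(π * i / (N + 1)) for i in 1:N]
prob = ODEProblem(f, u0, (0.0, 1.0), 100.0)

# Preconditioner callback: rebuild the incomplete LU only when W changed.
# τ is the drop tolerance; 50.0 is the value from the tutorial example.
function incompletelu(W, du, u, p, t, newW, Plprev, Prprev, solverdata)
    if newW === nothing || newW
        Pl = ilu(convert(AbstractMatrix, W), τ = 50.0)
    else
        Pl = Plprev
    end
    return Pl, nothing   # left preconditioner only
end

sol = solve(prob,
    KenCarp4(linsolve = KrylovJL_GMRES(), precs = incompletelu, concrete_jac = true);
    save_everystep = false)
```

Wrapping the `solve` call in `@btime` from BenchmarkTools gives timings comparable to the ones quoted above. The best τ is problem dependent, so the value 50.0 should be treated as a starting point.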

I am sticking to KenCarp4 since I plan to split the equation ASAP and will post the results.
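A minimal sketch of what such a split could look like, assuming the OrdinaryDiffEq convention that the first function of a `SplitODEProblem` is the stiff part treated implicitly and the second is treated explicitly. The names `stiff!`, `nonstiff!` and `M` and the toy diffusion-reaction problem are hypothetical, not the equation from this thread.

```julia
using OrdinaryDiffEq

const M = 32  # hypothetical grid size

# Stiff part (implicit): discrete diffusion with zero boundary values.
function stiff!(du, u, p, t)
    for i in 1:M
        left  = i == 1 ? zero(eltype(u)) : u[i - 1]
        right = i == M ? zero(eltype(u)) : u[i + 1]
        du[i] = p * (left - 2u[i] + right)
    end
    return nothing
end

# Non-stiff part (explicit): local cubic reaction term.
function nonstiff!(du, u, p, t)
    @. du = -u^3
    return nothing
end

u0 = [sin(π * i / (M + 1)) for i in 1:M]
prob = SplitODEProblem(stiff!, nonstiff!, u0, (0.0, 1.0), 100.0)

# KenCarp4 accepts both ordinary and split problems, so the solver can stay.
sol = solve(prob, KenCarp4())
```

Only the Jacobian of the stiff part enters the implicit solves, which is where the split can pay off when the explicit part is expensive to differentiate.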

Here’s the profile for the most optimized case, running it 100 times. I am still not very good at reading these graphs so let me know if you catch anything.
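For reference, a profile like this can be collected with the Profile standard library. This is a minimal sketch in which `work` is a hypothetical stand-in for the actual `solve` call:

```julia
using Profile

# Hypothetical stand-in for the call being profiled.
function work(n)
    s = 0.0
    for i in 1:n
        s += sin(i)
    end
    return s
end

work(10)          # run once first so compilation is not part of the profile
Profile.clear()   # discard samples from earlier runs

# Repeat the call so that enough samples accumulate.
@profile for _ in 1:100
    work(100_000)
end

# Flat listing sorted by sample count; rows with few samples are hidden.
Profile.print(format = :flat, sortedby = :count, mincount = 5)
```

In the flat listing, the count column is the number of samples in which a line appeared on the call stack, so the rows with the largest counts are the ones to look at first.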

I can tell for example that there’s a lot of time spent copying (copyto!) and broadcasting (materialize!). BLAS.axpy! and BLAS.dot are also significant, so I can work on optimizing BLAS.
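To make those names concrete, here is a small sketch using only the LinearAlgebra standard library. The vectors are made up for illustration; the calls are the same kind of operations that show up in the profile.

```julia
using LinearAlgebra

x = collect(1.0:4.0)
y = ones(4)
z = similar(y)

copyto!(z, y)                         # element-wise copy of y into z
BLAS.axpy!(2.0, x, y)                 # in place: y = 2.0 * x + y
d = BLAS.dot(length(x), x, 1, y, 1)   # inner product with explicit strides
z .= 2.0 .* x .+ z                    # fused broadcast, lowers to materialize!

# The BLAS library has its own thread pool, separate from Julia threads.
nthreads_blas = BLAS.get_num_threads()
BLAS.set_num_threads(1)               # e.g. to compare against single-threaded BLAS
```

`axpy!` and `dot` are level-1 BLAS routines that do very little arithmetic per byte of memory touched, so it may be worth timing the solver with different `BLAS.set_num_threads` values before rebuilding the library.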

The next step for now is splitting the ODE and using an IMEX method like you suggested. Then I will work on compiling OpenBLAS, as suggested by @ImreSamu, with a build optimized for Alder Lake and AVX2, and we’ll see what happens.

This is fun and so educational!

EDIT: It is worth noting that the optimized code at this stage runs on the M1 Mac in 700 ms, so the 590 ms above is only about 16% less time. I should definitely be able to squeeze more performance out of this.
