Multithreaded code on beefy computer runs just as fast as serial code on M1 Mac

I have 2 machines:

  1. M1 Mac Mini, with horrible thermals and a peak of 3.2 GHz and 8 GB of RAM (absolutely love it though)
  2. A beefy linux (Ubuntu 21.10) workstation with 64 GB of DDR5 RAM and a 12th-Gen 4.9 GHz processor with 16 threads. (8 physical cores, efficiency cores disabled)

I have some code, seen here, that is used as a function to solve a PDE. This is the parallel version of the code, to get the serial version you obviously just remove Threads.@spawn and @sync.

Running example.jl from that repo and timing the DifferentialEquations.jl solve function, we get:

Mac Work Station Serial Work Station FLoops
t (s) ~60 ~60 ~60

As you can see, the performance is basically the same between the two machines (serial version), and the parallel version runs at the same speed as well after moving to FLoops.

I also tried

@btime rand(10000,10000)*rand(10000,10000)

The M1 takes 11.591 s and the Workstation takes 9.730 s, almost the same. Something is going on.

Here are my questions:

  1. (General) does higher clock speed always equal quicker code if everything is OK? Or are there some situations where some bottlenecks won’t be overcome by clock speed?
  2. How do I improve the serial performance? Why is a water-cooled 4.9 GHz processor barely keeping up with a 3.2 GHz processor with poor thermals?
  3. What’s wrong with my parallel implementation? I tried LoopVectorization.jl with @tturbo but couldn’t get it working at all.
  4. Is there some good benchmarking code for this?

Thanks in advance, I hope this brings up some interesting discussion and learning opportunities.

Have you tried writing this code with DifferentialEquations.jl I’d expect it to be a few orders of magnitude better since it knows fancier time stepping algorithms than you do.

The time solution is done using DifferentialEquations.jl, see here for how I do it.

This part is the spatial discretization, which is actually done in this devectorized way for performance. For example, see here. Writing it this way is 10x quicker than using DiffEqOperators or the matrix form of the Laplacian.

Don’t use @spawn like this. Use Threads.@threads or FLoops.@floop.

For an introduction to multicore parallelism in Julia, see Data-parallel programming in Julia

3 Likes

Thank you for the advice. I tried doing it like this:

function GS_Neumann0!(du,u,p,t) # Works only with square grids.
  local f, k, D₁, D₂, dx, dy, M = p
  local N = Int(M)

  @floop for j in 2:N-1, i in 2:N-1
    du[i,j,1] = D₁*(1/dx^2*(u[i-1,j,1] + u[i+1,j,1] - 2u[i,j,1])+ 1/dy^2*(u[i,j+1,1] + u[i,j-1,1] - 2u[i,j,1])) +
                -u[i,j,1]*u[i,j,2]^2 + f*(1-u[i,j,1])
  end
  @floop for j in 2:N-1
    local i = N
    du[i,j,1] = D₁*(1/dx^2*(2u[i-1,j,1] - 2u[i,j,1])+ 1/dy^2*(u[i,j+1,1] + u[i,j-1,1] - 2u[i,j,1])) +
                -u[i,j,1]*u[i,j,2]^2 + f*(1-u[i,j,1])
  end
  # Many more of these

and I get the warning/error:

 Warning: Correctness and/or performance problem detected
│   error =
│    HasBoxedVariableError: Closure ##reducing_function#389#115 (defined in PatternFormation) has 1 boxed variable: N

I haven’t quite been able to figure it out for N
, but fixed it for i and j using local. This code runs in ~60 seconds, just like the serial version, so there should be a ton of room for improvement.

You are hitting this case: https://juliafolds.github.io/FLoops.jl/dev/explanation/faq/#uncertain-values

This is the problem:

Either use let N = N ... as mentioned in the FAQ or use

  f, k, D₁, D₂, dx, dy, Ntmp = p
  N = Int(Ntmp)

i.e., introduce Ntmp to avoid setting N twice

1 Like

Thank you, let fixed the error. The code still runs in ~60 seconds same as the sequential so I have to work on figuring that out (assuming I can figure out why the Workstation and the Mac run at the same wall time).

You can use --threads= with different numbers and see how the run-time change in the workstation. Also, monitor top/htop/etc. output see if CPUs are active as you’d expect. If the run-time does not change with varying --threads=, it’s likely that the bottleneck is elsewhere.

Bonus: you can also make the executor explicit as in

function GS_Periodic!(du,u,p,t; ex = nothing)
  ...

  @floop ex for ...

and pass ex = ThreadedEx(basesize = ...) to change the number of “effective threads” https://juliafolds.github.io/data-parallelism/howto/faq/#can_i_change_the_number_of_execution_threads_without_restarting_julia

I tried doing it like this:

N = 256 # Grid is 256x256, each un-nested for loop goes over N-2 elements
N_threads = 16
bs = (N -2)÷ N_threads
ex = ThreadedEx(basesize = bs)

And for N = 4,8,16, they all take 60 seconds. I checked the processor utilization with top and it seems that it mostly hovers at 100% and sometimes jumps to 200%, but that’s it. I assume this means it is not properly using the 16 threads I have access to. /sigh

You need to launch julia with julia --threads=16

Thank you, I have already done this prior and changed it in the VSCode settings for the remote machine (the Workstation). Running Threads.nthreads() returns 16.

As suggested by @tkf, I ran htop and monitored Julia spawning the threads, it’s just that most threads use no computing power whatsoever. Weird.

I just skimmed the code bit more. You have a lot of for loops but they don’t depend on each other and the iteration space are all the same. So, IIUC, it looks like you can rewrite them to use one for j in 2:N-1, i in 2:N-1 loop and one for j in 2:N-1 loop? Also, if N = 256 is a typical size, I’d use @floop only for the nested loop.

2 Likes

I changed the code, partially implementing your suggestions. It did not make an appreciable difference in the results. Then I changed the linear solver of DifferentialEquations.jl from linsolve=KLUFactorization to whatever the default is. The results are in the table below:

WS 1t WS 8t M1 1t M1 8t
t (s) 35.58 36.24 70.34 73.6

where WS is the workstation and M1 is the Mac. 1t is 1 thread (sequential execution). As you can see, the WS is now twice as fast! But the parallelization makes basically no difference.

So now I have to figure out why the parallelization is not working as it should.

@ChrisRackauckas sorry for the direct ping, but would you happen to know why KLU factorization causes this issue? (tl;dr KLU factorization runs at the same speed on two machines, one fast one slow. Default algorithm is twice as fast on fast machine).

I know this does not directly address your question, and maybe you know the resource already, but if not this lecture by Chris Rackauckas is really an awesome resource for improving ones understanding of exactly these kinds of problems: https://www.youtube.com/playlist?list=PLCAl7tjCwWyGjdzOOnlbGnVNZk0kB8VSa

1 Like

Thank you, I wasn’t aware of this actually. Looks fantastic!

1 Like

Did you profile? How much time is spent in the KLU? IIRC, KLU is a fairly serial factorization algorithm for sparsity patterns with low amounts of structure. The default (UMFPACK) can exploit more structure and threading, but requires some repeated structures in the sparsity pattern. Which one will be better depends on the problem and the compute hardware.

I ran a little experiment to figure out the effects of multithreading on both systems. I extended the simulation time by 100 fold and ran 2 simulations, one with 1 thread and one with 16 threads on the workstation or 8 on the Mac. I also used BLAS.set_num_threads(N_threads). I additionally started using MKL, which didn’t make much of a difference, unfortunately.

The run times are in the table below. As you can see, they are all more or less the same, so we’re back to square zero:

  1. The multithreaded code is as fast as the sequential code on the WS
  2. Parallelism cripples the Mac for some reason, but I am more worried about debugging the workstation’s performance.
  3. The tiny M1 Mac Mini runs the code as fast as the workstation, even when using 16 threads on the workstation and 1 on the Mac.
WS 1t WS 16t M1 1t M1 8t
t (s) 327 338 333 546

I was monitoring htop the whole time, and for the single threaded simulation, one thread stayed at 100% the whole time. On the other hand, in the multithreaded calculation, all 16 threads jumped around between 0 and 50%, with an average overall CPU usage of around ~400-500%.

I think it makes sense the multithreaded and sequential versions have the same run time based on this data, accounting for the parallelism overhead.

Could you post an annotated profile?

Embarrassingly, I don’t know much about profiling and don’t know how to annotate one.

However, I profiled some code on both systems and tried UMF vs KLU. There is a weird bug where @profview from VSCode does not capture the true time it takes to run the function vs @time. Thus, the M1 results below aren’t reliable until I figure out how to profile it more properly.

WS M1
Total 59 39 (55)
KLU.LibKLU.klu_l_factor 46.1 33.1
KLU.LibKLU.klu_l_solve 8.7 0.8
WS M1
Total` 32 14 (33)
SuiteSparse.UMFPACK.solve! 20.9 1.8
SuiteSparse.UMFPACK.#umfpack_numeric!#13 4.9 2.6
SuiteSparse.UMFPACK.umfpack_symbolic! 2.22 2.2

The number between parentheses for the M1 is the actual run time if I use @time. So basically both systems perform identically now with UMF being almost twice as fast.

Unfortunately, I don’t think this data is very useful due to the M1 bug, but at least it can tell us more about the WS.

EDIT: all tests were done with 1 thread and OpenBLAS.

shows how to profile. Profile a small solve. Don’t optimize what the profile doesn’t show.

3 Likes