Multithreaded code on beefy computer runs just as fast as serial code on M1 Mac

ash · January 26, 2022, 2:30am

I have 2 machines:

M1 Mac Mini, with horrible thermals and a peak of 3.2 GHz and 8 GB of RAM (absolutely love it though)
A beefy Linux (Ubuntu 21.10) workstation with 64 GB of DDR5 RAM and a 12th-Gen 4.9 GHz processor with 16 threads (8 physical cores, efficiency cores disabled).

I have some code, seen here, that is used as a function to solve a PDE. This is the parallel version of the code; to get the serial version, you just remove Threads.@spawn and @sync.

Running example.jl from that repo and timing the DifferentialEquations.jl solve function, we get:

	Mac	Work Station Serial	Work Station FLoops
t (s)	~60	~60	~60

As you can see, the performance is basically the same between the two machines (serial version), and the parallel version runs at the same speed as well after moving to FLoops.

I also tried

@btime rand(10000,10000)*rand(10000,10000)

The M1 takes 11.591 s and the Workstation takes 9.730 s, almost the same. Something is going on.
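
Note that `@btime rand(10000,10000)*rand(10000,10000)` also times the two `rand` calls and their allocations. A minimal sketch that times only the multiplication and reports the BLAS thread count, using only the standard library. The helper name `time_matmul` and the small size `n = 500` are illustrative choices, not part of the original benchmark:

```julia
using LinearAlgebra

# Hypothetical helper: time only the multiplication, not the random fills.
function time_matmul(n::Integer)
    A = rand(n, n)
    B = rand(n, n)
    C = similar(A)
    mul!(C, A, B)                  # warm-up call so compilation is not timed
    t = @elapsed mul!(C, A, B)     # in-place multiply into preallocated C
    gflops = 2 * n^3 / t / 1e9     # a dense n×n product costs about 2n^3 flops
    return (seconds = t, gflops = gflops)
end

println("BLAS threads: ", BLAS.get_num_threads())
println(time_matmul(500))
```

Comparing the GFLOP/s figure across machines, and across `BLAS.set_num_threads` settings, separates BLAS performance from allocation and random-number cost.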

Here are my questions:

(General) does higher clock speed always equal quicker code if everything is OK? Or are there some situations where some bottlenecks won’t be overcome by clock speed?
How do I improve the serial performance? Why is a water-cooled 4.9 GHz processor barely keeping up with a 3.2 GHz processor with poor thermals?
What’s wrong with my parallel implementation? I tried LoopVectorization.jl with @tturbo but couldn’t get it working at all.
Is there some good benchmarking code for this?

Thanks in advance, I hope this brings up some interesting discussion and learning opportunities.

Oscar_Smith · January 26, 2022, 2:34am

Have you tried writing this code with DifferentialEquations.jl? I’d expect it to be a few orders of magnitude better, since it knows fancier time-stepping algorithms than you do.

ash · January 26, 2022, 2:42am

The time solution is done using DifferentialEquations.jl, see here for how I do it.

This part is the spatial discretization, which is actually done in this devectorized way for performance. For example, see here. Writing it this way is 10x quicker than using DiffEqOperators or the matrix form of the Laplacian.

tkf · January 26, 2022, 3:05am

Don’t use @spawn like this. Use Threads.@threads or FLoops.@floop.

For an introduction to multicore parallelism in Julia, see Data-parallel programming in Julia
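
A minimal sketch of the `Threads.@threads` pattern on a stencil loop, using only Base. The kernel below is a plain 5-point Laplacian chosen for illustration, not the PDE right-hand side from the repository, and the function names are made up. `Threads.@threads` takes a single loop, so the nested form is written as an outer threaded loop over columns with an inner serial loop:

```julia
using Base.Threads

# 5-point Laplacian on the interior of a square grid (edges are left untouched).
function laplacian_serial!(du, u, dx)
    N = size(u, 1)
    for j in 2:N-1, i in 2:N-1
        du[i, j] = (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4u[i, j]) / dx^2
    end
    return du
end

# Same kernel with the outer (column) loop split across threads.
# Each iteration writes to a distinct column of `du`, so there is no data race.
function laplacian_threads!(du, u, dx)
    N = size(u, 1)
    @threads for j in 2:N-1
        for i in 2:N-1
            du[i, j] = (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4u[i, j]) / dx^2
        end
    end
    return du
end

u = rand(256, 256)
du_s = zeros(256, 256)
du_t = zeros(256, 256)
laplacian_serial!(du_s, u, 0.1)
laplacian_threads!(du_t, u, 0.1)
println("threads: ", nthreads(), ", results match: ", du_s == du_t)
```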

ash · January 26, 2022, 4:14am

Thank you for the advice. I tried doing it like this:

function GS_Neumann0!(du,u,p,t) # Works only with square grids.
  local f, k, D₁, D₂, dx, dy, M = p
  local N = Int(M)

  @floop for j in 2:N-1, i in 2:N-1
    du[i,j,1] = D₁*(1/dx^2*(u[i-1,j,1] + u[i+1,j,1] - 2u[i,j,1])+ 1/dy^2*(u[i,j+1,1] + u[i,j-1,1] - 2u[i,j,1])) +
                -u[i,j,1]*u[i,j,2]^2 + f*(1-u[i,j,1])
  end
  @floop for j in 2:N-1
    local i = N
    du[i,j,1] = D₁*(1/dx^2*(2u[i-1,j,1] - 2u[i,j,1])+ 1/dy^2*(u[i,j+1,1] + u[i,j-1,1] - 2u[i,j,1])) +
                -u[i,j,1]*u[i,j,2]^2 + f*(1-u[i,j,1])
  end
  # Many more of these

and I get the warning/error:

 Warning: Correctness and/or performance problem detected
│   error =
│    HasBoxedVariableError: Closure ##reducing_function#389#115 (defined in PatternFormation) has 1 boxed variable: N

I haven’t quite been able to figure it out for N, but fixed it for i and j using local. This code runs in ~60 seconds, just like the serial version, so there should be a ton of room for improvement.

tkf · January 26, 2022, 4:20am

You are hitting this case: https://juliafolds.github.io/FLoops.jl/dev/explanation/faq/#uncertain-values

This is the problem:

https://github.com/oashour/PatternFormation.jl/blob/1768c25b1f465b3980f9f1c7baaf865b198cea23/src/PDEDefs.jl#L2-L4

Either use let N = N ... as mentioned in the FAQ or use

  f, k, D₁, D₂, dx, dy, Ntmp = p
  N = Int(Ntmp)

i.e., introduce Ntmp to avoid setting N twice
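
A small sketch of the same effect with plain closures, independent of FLoops. The function names and the parameter tuple are made up for illustration. A captured variable that is assigned more than once is stored in a `Core.Box`; renaming the temporary or rebinding with `let` gives the closure a concretely typed capture:

```julia
# Closure capturing a variable that is assigned twice: `N` ends up boxed.
function make_boxed(p)
    N = p[end]
    N = Int(N)              # second assignment to the same name
    return () -> N - 1
end

# Fix 1: use a separate name so that `N` is assigned exactly once.
function make_renamed(p)
    Ntmp = p[end]
    N = Int(Ntmp)
    return () -> N - 1
end

# Fix 2: rebind `N` in a `let` block; the closure captures the fresh binding.
function make_let(p)
    N = p[end]
    N = Int(N)
    f = let N = N
        () -> N - 1
    end
    return f
end

p = (0.04, 0.06, 256.0)     # made-up parameter tuple; the last entry plays the role of M
for make in (make_boxed, make_renamed, make_let)
    f = make(p)
    println(make, ": captured field type = ", fieldtype(typeof(f), :N), ", f() = ", f())
end
```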

ash · January 26, 2022, 4:31am

Thank you, let fixed the error. The code still runs in ~60 seconds same as the sequential so I have to work on figuring that out (assuming I can figure out why the Workstation and the Mac run at the same wall time).

tkf · January 26, 2022, 4:46am

You can use --threads= with different numbers and see how the run time changes on the workstation. Also, monitor the top/htop/etc. output to see if the CPUs are as active as you’d expect. If the run time does not change with varying --threads=, it’s likely that the bottleneck is elsewhere.
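
One way to reason about "the bottleneck is elsewhere" is Amdahl's law: if only a fraction `p` of the run time is spent in the threaded loops, the best possible speedup on `n` threads is `1 / ((1 - p) + p / n)`. A minimal sketch; the fractions below are made-up values, not measurements from this thread:

```julia
# Amdahl's law: ideal speedup on n threads when a fraction p of the run time is parallelizable.
amdahl_speedup(p, n) = 1 / ((1 - p) + p / n)

for p in (0.1, 0.5, 0.9)
    speedups = [round(amdahl_speedup(p, n), digits = 2) for n in (1, 4, 8, 16)]
    println("parallel fraction ", p, ": ", speedups)
end
```

If the threaded part is a small share of the total, adding threads barely changes the wall time, which is what varying --threads= would then show.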

Bonus: you can also make the executor explicit as in

function GS_Periodic!(du,u,p,t; ex = nothing)
  ...

  @floop ex for ...

and pass ex = ThreadedEx(basesize = ...) to change the number of “effective threads” https://juliafolds.github.io/data-parallelism/howto/faq/#can_i_change_the_number_of_execution_threads_without_restarting_julia

ash · January 26, 2022, 5:10am

I tried doing it like this:

N = 256 # Grid is 256x256, each un-nested for loop goes over N-2 elements
N_threads = 16
bs = (N -2)÷ N_threads
ex = ThreadedEx(basesize = bs)

And for N_threads = 4, 8, 16, they all take 60 seconds. I checked the processor utilization with top and it seems that it mostly hovers at 100% and sometimes jumps to 200%, but that’s it. I assume this means it is not properly using the 16 threads I have access to. /sigh

Oscar_Smith · January 26, 2022, 5:18am

You need to launch julia with julia --threads=16

ash · January 26, 2022, 5:29am

Thank you, I have already done this prior and changed it in the VSCode settings for the remote machine (the Workstation). Running Threads.nthreads() returns 16.

As suggested by @tkf, I ran htop and monitored Julia spawning the threads, it’s just that most threads use no computing power whatsoever. Weird.

tkf · January 26, 2022, 5:48am

I just skimmed the code a bit more. You have a lot of for loops, but they don’t depend on each other and the iteration spaces are all the same. So, IIUC, it looks like you can rewrite them to use one for j in 2:N-1, i in 2:N-1 loop and one for j in 2:N-1 loop? Also, if N = 256 is a typical size, I’d use @floop only for the nested loop.
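
A minimal sketch of the loop fusion described above, under simplifying assumptions: only the x-direction term of the stencil is kept, the `i = N` update follows the form shown earlier in the thread, and the `i = 1` update is assumed to be its mirror image. The function names are hypothetical:

```julia
# Two separate edge loops over the same range 2:N-1 ...
function edges_separate!(du, u, dx)
    N = size(u, 1)
    for j in 2:N-1                  # edge i = N
        du[N, j] = (2u[N-1, j] - 2u[N, j]) / dx^2
    end
    for j in 2:N-1                  # edge i = 1 (assumed mirror image of the i = N case)
        du[1, j] = (2u[2, j] - 2u[1, j]) / dx^2
    end
    return du
end

# ... fused into a single pass over j, since the two updates are independent.
function edges_fused!(du, u, dx)
    N = size(u, 1)
    for j in 2:N-1
        du[N, j] = (2u[N-1, j] - 2u[N, j]) / dx^2
        du[1, j] = (2u[2, j] - 2u[1, j]) / dx^2
    end
    return du
end

u = rand(256, 256)
du_a = zeros(256, 256)
du_b = zeros(256, 256)
edges_separate!(du_a, u, 0.1)
edges_fused!(du_b, u, 0.1)
println("fused result matches: ", du_a == du_b)
```

With only N - 2 = 254 iterations of cheap work per edge loop, the scheduling overhead of threading such a loop can outweigh any gain, which is consistent with threading only the nested interior loop.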

ash · January 26, 2022, 7:53am

I changed the code, partially implementing your suggestions. It did not make an appreciable difference in the results. Then I changed the linear solver of DifferentialEquations.jl from linsolve=KLUFactorization to whatever the default is. The results are in the table below:

	WS 1t	WS 8t	M1 1t	M1 8t
t (s)	35.58	36.24	70.34	73.6

where WS is the workstation and M1 is the Mac. 1t is 1 thread (sequential execution). As you can see, the WS is now twice as fast! But the parallelization makes basically no difference.

So now I have to figure out why the parallelization is not working as it should.

@ChrisRackauckas sorry for the direct ping, but would you happen to know why KLU factorization causes this issue? (tl;dr KLU factorization runs at the same speed on two machines, one fast one slow. Default algorithm is twice as fast on fast machine).

trahflow · January 26, 2022, 9:11am

I know this does not directly address your question, and maybe you know the resource already, but if not, this lecture series by Chris Rackauckas is a really awesome resource for improving one’s understanding of exactly these kinds of problems: https://www.youtube.com/playlist?list=PLCAl7tjCwWyGjdzOOnlbGnVNZk0kB8VSa

ash · January 26, 2022, 9:40am

Thank you, I wasn’t aware of this actually. Looks fantastic!

ChrisRackauckas · January 26, 2022, 9:46am

Did you profile? How much time is spent in the KLU? IIRC, KLU is a fairly serial factorization algorithm for sparsity patterns with low amounts of structure. The default (UMFPACK) can exploit more structure and threading, but requires some repeated structures in the sparsity pattern. Which one will be better depends on the problem and the compute hardware.
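
A rough way to see how much time goes into factorization versus the triangular solves is to time them separately on a representative sparse matrix. The sketch below uses only the standard library, where `lu` on a `SparseMatrixCSC{Float64}` goes through UMFPACK; KLU needs a separate package and is not shown. The test matrix (identity minus `dt` times a 5-point Laplacian) and the helper names are assumptions for illustration, not the Jacobian of the actual problem:

```julia
using SparseArrays, LinearAlgebra

# Hypothetical test matrix: I - dt*Δ for a 5-point Laplacian on an n×n grid.
function implicit_laplacian(n::Integer; dt = 0.1)
    D = spdiagm(-1 => ones(n - 1), 0 => fill(-2.0, n), 1 => ones(n - 1))
    Id = sparse(1.0I, n, n)
    Δ = kron(Id, D) + kron(D, Id)
    return sparse(1.0I, n^2, n^2) - dt * Δ
end

# Time the sparse LU factorization and the subsequent solve separately.
function time_sparse_lu(A, b)
    F = lu(A)                       # warm-up so compilation is not timed
    x = F \ b
    t_factor = @elapsed F = lu(A)
    t_solve = @elapsed x = F \ b
    return (factor = t_factor, solve = t_solve, residual = norm(A * x - b))
end

A = implicit_laplacian(64)
b = rand(size(A, 1))
println(time_sparse_lu(A, b))
```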

ash · January 26, 2022, 10:11am

I ran a little experiment to figure out the effects of multithreading on both systems. I extended the simulation time by 100 fold and ran 2 simulations, one with 1 thread and one with 16 threads on the workstation or 8 on the Mac. I also used BLAS.set_num_threads(N_threads). I additionally started using MKL, which didn’t make much of a difference, unfortunately.

The run times are in the table below. As you can see, they are all more or less the same, so we’re back to square zero:

The multithreaded code is as fast as the sequential code on the WS
Parallelism cripples the Mac for some reason, but I am more worried about debugging the workstation’s performance.
The tiny M1 Mac Mini runs the code as fast as the workstation, even when using 16 threads on the workstation and 1 on the Mac.

	WS 1t	WS 16t	M1 1t	M1 8t
t (s)	327	338	333	546

I was monitoring htop the whole time, and for the single threaded simulation, one thread stayed at 100% the whole time. On the other hand, in the multithreaded calculation, all 16 threads jumped around between 0 and 50%, with an average overall CPU usage of around ~400-500%.

I think it makes sense the multithreaded and sequential versions have the same run time based on this data, accounting for the parallelism overhead.

ChrisRackauckas · January 26, 2022, 10:50am

Could you post an annotated profile?

ash · January 26, 2022, 11:20am

Embarrassingly, I don’t know much about profiling and don’t know how to annotate one.

However, I profiled some code on both systems and tried UMF vs KLU. There is a weird bug where @profview from VSCode does not capture the true time it takes to run the function vs @time. Thus, the M1 results below aren’t reliable until I figure out how to profile it more properly.

KLU, t (s)	WS	M1
Total	59	39 (55)
`KLU.LibKLU.klu_l_factor`	46.1	33.1
`KLU.LibKLU.klu_l_solve`	8.7	0.8

UMFPACK, t (s)	WS	M1
Total	32	14 (33)
`SuiteSparse.UMFPACK.solve!`	20.9	1.8
`SuiteSparse.UMFPACK.#umfpack_numeric!#13`	4.9	2.6
`SuiteSparse.UMFPACK.umfpack_symbolic!`	2.22	2.2

The number between parentheses for the M1 is the actual run time if I use @time. So basically both systems perform identically now with UMF being almost twice as fast.

Unfortunately, I don’t think this data is very useful due to the M1 bug, but at least it can tell us more about the WS.

EDIT: all tests were done with 1 thread and OpenBLAS.

ChrisRackauckas · January 26, 2022, 11:30am

shows how to profile. Profile a small solve. Don’t optimize what the profile doesn’t show.
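
A minimal sketch of profiling with the `Profile` standard library, for readers who have not used it. The workload here is a made-up stand-in for "a small solve"; in practice the call to `solve` on a short time span would go in its place:

```julia
using Profile

# Hypothetical stand-in for a small solve: any function that runs long enough to be sampled.
function small_workload(n)
    s = 0.0
    for _ in 1:n
        s += sum(sin, rand(1000))
    end
    return s
end

small_workload(10)                  # warm-up so compilation is not profiled
Profile.clear()                     # drop samples from earlier runs
@profile small_workload(50_000)
# Flat listing, sorted by sample count, hiding lines with fewer than 10 samples.
Profile.print(format = :flat, sortedby = :count, mincount = 10)
```

The lines with the largest counts are where the time goes; anything that does not show up there is not worth optimizing first.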
