Multithreaded code on a beefy computer runs just as fast as serial code on an M1 Mac

1.)
Only OpenBLAS v0.3.19 (the Dec 19, 2021 release) contains “added cpu detection for Intel Alder Lake”, so earlier versions probably do not generate optimal code. (Alder Lake == 12th Gen)

  • “+ added cpu detection for Intel Alder Lake”
  • “+ fixed cpuid detection for Apple M1 and improved performance”

Current Status:

  • Julia 1.7.1 == OpenBLAS v0.3.13
  • Julia Nightly == OpenBLAS v0.3.17

2.) Side note: the default OpenBLAS does not use AVX-512 on 12th Gen even when it is “enabled” in the BIOS (possible with an older BIOS).
But a workaround exists: tune OpenBLAS for an Alder Lake AVX-512 target by building with

  • -O3 -march=sapphirerapids -mno-amx-tile -mno-amx-int8 -mno-amx-bf16

EDIT:
And a “quick and dirty patch”: Propose optional CPUARCH target AlderLakeAVX512, as aliase of SapphireRapids · Issue #3490 · OpenMathLib/OpenBLAS · GitHub

--- cpuid_x86.c 2021-12-19 14:07:28.243783197 +0100
+++ cpuid_x86.cnew      2021-12-20 22:50:44.629040294 +0100
@@ -1495,6 +1495,10 @@
         switch (model) {
         case 7: // Alder Lake desktop
         case 10: // Alder Lake mobile
+          if(support_avx512_bf16())
+            return CPUTYPE_COOPERLAKE;
+          if(support_avx512())
+            return CPUTYPE_SKYLAKEX;
           if(support_avx2())
             return CPUTYPE_HASWELL;
           if(support_avx())

Thank you for the video, I watched (almost) all of it and definitely learned a lot. My ODE function was already nonallocating (thanks to you in another thread, actually).

Here is a flame graph for UMF, a short 10s run with left-heavy mode (I don’t know how to read it otherwise honestly).

It is set up to highlight the ODE function, GS_Neumann0. Here’s one for KLU.

Both were executed sequentially on the WS with default OpenBLAS.

With what little knowledge I possess, I think it’s safe to conclude that my function is fine and takes up only a small fraction of the execution time. It is the linear solver that needs to be optimized for this code to work better, but honestly I have no idea how to do that.

Thank you for posting this, I really appreciate it. I am not surprised by this architecture’s growing pain. I will get some sleep and work on compiling Julia from source with the new version of OpenBLAS and report back.


Okay yes, that looks beautiful. So then did you play around with configurations of preconditioned GMRES as described in Solving Large Stiff Equations · DifferentialEquations.jl? When it’s the linear solve that’s the limiting factor, the only things you can do are (a) try a different ODE solver or (b) try a different linear solver (or (c) optimize the stepping heuristics for your problem, but let’s ignore that one).

When PDEs get asymptotically large, preconditioned Krylov will almost always be the best option, but Pardiso is worth a try too. With AlgebraicMultigrid, the default Gauss-Seidel smoother has a bit of overhead, so that tutorial shows that using a Jacobi smoother speeds it up a bit. But a tuned ILU should be more efficient; you just have to sit down and play with τ a bit.
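For concreteness, here is a minimal sketch of the ILU-preconditioner pattern from that tutorial; the τ = 50.0 value is just a placeholder to tune:

using OrdinaryDiffEq, LinearSolve, IncompleteLU

# A `precs` builder in the style of the "Solving Large Stiff Equations" tutorial:
# refactorize an incomplete LU of W only when the solver signals a new Jacobian.
function incompletelu(W, du, u, p, t, newW, Plprev, Prprev, solverdata)
    if newW === nothing || newW
        Pl = ilu(convert(AbstractMatrix, W), τ = 50.0)  # τ is the drop tolerance to tune
    else
        Pl = Plprev
    end
    Pl, nothing  # (left preconditioner, right preconditioner)
end

# used as, e.g.:
# solve(prob, KenCarp4(linsolve = KrylovJL_GMRES(), precs = incompletelu, concrete_jac = true))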

For the choice of ODE solver, KenCarp4 vs KenCarp47 vs TRBDF2 vs CVODE_BDF vs QNDF vs FBDF vs Rodas5 (last one only with preconditioned GMRES, not with KLU) is somewhat of a wash and depends on your problem. I would time a few things.

Then if you really want to get extreme, you can think of splitting the problem so that you solve it in IMEX form:

https://diffeq.sciml.ai/stable/types/split_ode_types/

https://diffeq.sciml.ai/stable/solvers/split_ode_solve/

KenCarp4 is compatible with that, so you could, say, make the implicit part have a tridiagonal Jacobian and handle it implicitly while treating the other terms explicitly. The exact splitting will be a bit of an art.
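For the mechanics, a SplitODEProblem just takes the two right-hand sides; a minimal runnable sketch with made-up stiff/nonstiff parts (not your PDE):

using OrdinaryDiffEq

# f1! is integrated implicitly (the stiff part, e.g. diffusion);
# f2! is integrated explicitly (the nonstiff part). Both are illustrative placeholders.
f1!(du, u, p, t) = (du .= -100.0 .* u)
f2!(du, u, p, t) = (du .= u .* (1 .- u))

u0 = fill(0.5, 16)
prob = SplitODEProblem(f1!, f2!, u0, (0.0, 1.0))
sol = solve(prob, KenCarp4())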


Thank you for the advice Chris, this worked amazingly well.

Using @btime, the original code (UMF + algebraicmultigrid) took 10.20 s. Changing to KrylovJL_GMRES(), it takes 5.293 s. Changing from algebraicmultigrid to incompletelu, it takes 590 ms! I played around with τ, using values from 0.1 to 10,000, and the 50 from the example works perfectly. All the results above use tspan=(0.0,1.0).

I am sticking to KenCarp4 since I plan to split the equation ASAP and will post the results.

Here’s the profile for the most optimized case, running it 100 times. I am still not very good at reading these graphs so let me know if you catch anything.

I can tell for example that there’s a lot of time spent copying (copyto!) and broadcasting (materialize!). BLAS.axpy! and BLAS.dot are also significant, so I can work on optimizing BLAS.

The next step for now is splitting the ODE and using an IMEX method like you suggested. Then I will work on compiling OpenBLAS as suggested by @ImreSamu, with a build optimized for Alder Lake and AVX2, and we’ll see what happens.

This is fun and so educational!

EDIT: It is worth noting that the optimized code at this stage runs on the M1 Mac in 700 ms, so the workstation is now about 16% faster. I should definitely be able to squeeze more performance out of this.


This can be CPU dependent. We’ve actually been working through this today. See Krylov.jl optimizations for ODE usage · Issue #1563 · SciML/OrdinaryDiffEq.jl · GitHub. I would suggest doing using MKL: it apparently matters a lot more on Ryzen than on Intel, though.
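For anyone following along: on Julia ≥ 1.7 loading the package swaps the BLAS backend at runtime via libblastrampoline, and you can confirm which backend is active:

using MKL            # swap the active BLAS/LAPACK backend to MKL
using LinearAlgebra
BLAS.get_config()    # lists the loaded BLAS libraries (should now show MKL)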


I just realized I WAS using MKL on the WS :sweat_smile:. Shouldn’t the calls be to MKL and not BLAS then?

I thought MKL was only optimized for Intel/only worked with Intel processors?

Note the pure Julia @tturbo is faster than MKL and OpenBLAS on Ryzen. That’s fun. :sweat_smile:
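For reference, the kind of pure-Julia kernel meant here is the classic LoopVectorization matmul demo; a sketch (not necessarily the exact kernel benchmarked in that issue):

using LoopVectorization

# @tturbo is the threaded variant of @turbo: it SIMD-vectorizes and threads the loop nest.
function mygemm!(C, A, B)
    @tturbo for m in axes(A, 1), n in axes(B, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end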

MKL is a BLAS implementation.


I got a ton of help from the dev yesterday and still haven’t got @tturbo working, but I will fix it soon.

Thanks for the video, I know it’s a BLAS implementation but I didn’t realize the stack trace would still show BLAS.x instead of MKL.x.

I tried the following:

  1. Turning on multithreading with 16 threads. I have an @floop over a nested loop with 254 elements per loop; I ignored the single loop as per @tkf’s suggestion. Run time decreased from 590 ms to 510 ms.

  2. Turning on OpenBLAS multithreading (16 threads). Run time went up from 510 ms to 783 ms (2498 ms with 8 threads).

  3. Switching between MKL and OpenBLAS in serial and parallel. No appreciable change in serial; MKL is 4 times slower in parallel!

When running this in parallel, the average CPU usage is low, like 200-300% or so.

Something funky is going on with BLAS and MKL.

Disclaimer: I didn’t read the entire thread / didn’t follow too closely.

Did you benchmark everything with BLAS.set_num_threads(1)? If not, note that you need to take care when using MKL and, at the same time, multiple Julia threads (the same can be said about OpenBLAS). I find that MKL_NUM_THREADS defaults to the number of cores of the system, and since this number is used per Julia thread, you will readily have too many MKL threads running (and thus oversubscribe your cores). See Matrix multiplication is slower when multithreading in Julia - #12 by carstenbauer for more.

TLDR: I suggest you benchmark everything with BLAS.set_num_threads(1) (independent of whether you use MKL or OpenBLAS).
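Concretely, something like this at the top of the benchmark script (Julia threads are set separately, e.g. with julia -t 16):

using LinearAlgebra
BLAS.set_num_threads(1)   # one BLAS thread, so the Julia threads don't oversubscribe the cores
@show Threads.nthreads() BLAS.get_num_threads()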

I am back with some new data. As per @ImreSamu’s advice, I compiled OpenBLAS v0.3.19 using the flags from his post above.

This adds Alder Lake support with AVX/AVX2/AVX512 (not sure of the difference). I then compiled Julia with this version of OpenBLAS. The performance table is below:

# Threads    0.3.19 (Alder Lake support, ms)    0.3.13 (default, ms)
1            585                                530
4            575                                1193
8            571                                2167
16           580                                710

Here, # Threads is the # of threads used with OpenBLAS. As you can see, it has no effect on performance in 0.3.19 and significantly affects performance (more threads = worse performance) in the default version of OpenBLAS.

I think that the default OpenBLAS is insufficient if you are doing parallel workloads on a 12th Gen CPU, but 0.3.19 should ship with Julia sooner or later. I am also not sure why the time doesn’t go down with the thread count in 0.3.19; it remains completely static.

I have benchmarked with both 1 and 16 threads in the posts above. I just read that post; it is very helpful, thank you. What do you think about the data above, where different versions of OpenBLAS behave differently? Do you think it’s just the optimization for the architecture?

Next Steps:

  1. Split the ODE
  2. Try @tturbo again.

I just wanted to add that, comparing @floop against plain @inbounds on the loop here, 16 threads are a factor of 2 faster than 1 thread. So parallelism is definitely partially working now; it’s just slower than I expected.

One more note: if you do what I did and compile v0.3.19 for Alder Lake support, you will lose access to Sundials.jl. It is compiled against libblas64_.so, the default, so it won’t work with a custom BLAS. See here.

Looking at the profile you posted above, it looks like you spend most of the time outside of GS_Neumann0! and most of the work is in the ODE solver. It’s expected that parallelizing the derivative computation won’t help much. See: Amdahl's law - Wikipedia
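For reference, Amdahl’s law bounds the overall speedup when only a fraction of the runtime is parallelized; a quick worked example with assumed numbers:

% Amdahl's law: overall speedup when a fraction p of the runtime
% is parallelized with a per-part speedup of s:
S(s) = \frac{1}{(1 - p) + p/s}
% Assumed example: p = 0.2 (20% of the time in the derivative), s = 16 threads:
% S = 1/(0.8 + 0.2/16) = 1/0.8125 \approx 1.23, i.e. at most ~23% faster overall.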


I agree with you, and I think I am done trying to parallelize the function. Thanks for referencing Amdahl’s law, I hadn’t heard of it. Here’s some data.

I managed to partially fix the @tturbo issues, so now I have access to 5 macros for the loops: @tturbo, @turbo, @avx, @floop, @inbounds. I put these into a matrix where the rows refer to the double loop and the columns refer to the four single loops. For example, cell (1,2), with value 91, means that with the double loop vectorized with @tturbo and the single loops with @turbo, the run time is 91 μs.

Updated code here. The results are from @btime GS_Neumann0!(rand(N,N,2), rand(N,N,2), [ones(6); Float64(N)], 0.0). This code still doesn’t run with OrdinaryDiffEq.solve(); it throws the warning “LoopVectorization.check_args on your inputs failed; running fallback @inbounds @fastmath loop instead”.

All tests were run with 16 Julia threads and 16 BLAS threads using the Alder Lake BLAS I mentioned above.

Double \ Single    @tturbo    @turbo    @inbounds    @floop    @avx
@tturbo            90         91        90           -         89
@turbo             -          170       171          -         -
@inbounds          -          -         202          -         -
@floop             -          126       127          198       -
@avx               -          -         124          211       -

(times in μs; rows are the double-loop macro, columns the single-loop macro)

It can be observed that:

  1. Parallelizing the single loop with N = 256 does virtually nothing, as @tkf expected in an earlier comment.
  2. @tturbo/@tturbo performs best on the double loop, a bit better than @floop/@turbo. It will be my choice if I or someone else can fix that warning; it’s @floop/@inbounds for now.
  3. The performance boost over serial code (@inbounds/@inbounds) is about 2x, as I mentioned before.
  4. Running the ODE solver with the improved code gives

I think it is safe to conclude that I’ve done all I can for this function (unless someone disagrees) and it is time to focus on the solver. Next step: ODE splitting

BTW, I don’t know if you want to hear this after extensive testing :), but I think GitHub - mcabbott/Tullio.jl: ⅀ can also be used for code like you have.

I just looked at the code you posted, but it looks like you don’t use @inbounds inside @floop. Since @floop is a very generic loop framework, it doesn’t insert @inbounds like @tturbo does. So, you’d need an @inbounds annotation inside the loop, as in

@floop ex for j in 2:N-1, i in 2:N-1
    @inbounds begin
        du[i,j,1] = ...
    end
end

You can also pass ThreadedEx(simd = Val(:ivdep)) for this type of computation (which is what LoopVectorization.jl assumes IIRC).
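Putting both suggestions together, the loop might look like this (a sketch; the body is a placeholder for the real stencil update):

using FLoops

N  = 256
u  = rand(N, N, 2)
du = similar(u)

ex = ThreadedEx(simd = Val(:ivdep))   # executor with ivdep SIMD, as suggested above
@floop ex for j in 2:N-1, i in 2:N-1
    @inbounds begin
        du[i, j, 1] = u[i, j, 1]      # placeholder body
    end
end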


Thank you, I wasn’t aware of this package. I will see if I can integrate it into my code.

Both fixed! New code pushed to the repo; I used ThreadedEx(simd = Val(true)). Barely any change in performance, unfortunately, but as you said, I’ve probably done everything I can for this function. Thank you so much for all the help! I will focus on the ODE solver as per your advice.

I have spent the past few days heavily benchmarking: 1. ODE solvers, 2. Linear solvers, 3. Nonlinear solvers, 4. Preconditioners.

I also did the following:

  1. Overclocked my CPU from the default 3.6 GHz with 4.7 GHz turbo boost to 5.1 GHz fixed.
  2. Overclocked my memory to 5200CL36. This is less relevant but worth mentioning.
  3. Benchmarked different multithreading options (settled on @floop with fully merged loops as per @tkf’s suggestion; see the updated source code: https://github.com/oashour/PatternFormation.jl).

Anyway, it turns out KenCarp47 is the fastest solver for this problem, with IterativeSolversJL_GMRES() as the linear solver and NLAnderson() as the nonlinear solver. Overall, I’m currently using:

KenCarp47(linsolve=IterativeSolversJL_GMRES(), precs=incompletelu, concrete_jac=true, nlsolve=NLAnderson())
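In context, the full call would be something like this (a sketch; prob and the incompletelu preconditioner builder are as defined earlier in the thread):

using OrdinaryDiffEq, LinearSolve

sol = solve(prob,
            KenCarp47(linsolve = IterativeSolversJL_GMRES(),
                      precs = incompletelu,
                      concrete_jac = true,
                      nlsolve = NLAnderson()))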

Back to the parallelism, as @carstenbauer said, 1 BLAS thread is the right call. See here for a discussion I just started.

Here are the results so far:

                     Starting Code    Current Code
t = 100, μ mode      60               0.538
t = 10000, β mode    2250             47

μ mode and β mode are two solutions of the equations. The former reaches a steady state after some t > 100, while the latter oscillates forever. You can see that the performance improvement is 112x for the non-oscillating simulation and 48x for the oscillating mode. Not bad.

Things to figure out:

  1. The overclocking made absolutely no difference; if you look at the other thread I linked about OpenBLAS, you’ll see I have relatively low CPU usage. How do I improve this?
  2. Splitting the equation as @ChrisRackauckas suggested. I have already written the code but simply need to optimize it.

Any more suggestions are welcome!

You could delete all of the linear solver stuff then: Anderson Acceleration does not use a linear solver.
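In other words, the call above could be trimmed to something like (a sketch, dropping the now-unused linear solver and preconditioner settings):

sol = solve(prob, KenCarp47(nlsolve = NLAnderson()))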

In fact with these results you might want to try ROCK2 as the solver?
