Thank you for the video, I watched (almost) all of it and definitely learned a lot. My ODE function was already nonallocating (thanks to you in another thread, actually).
Here is a flame graph for UMF, from a short 10 s run viewed in left-heavy mode (honestly, I don't know how to read it any other way).
Both were executed sequentially on the workstation with the default OpenBLAS.
With what little knowledge I possess, I think it's safe to conclude that my function is fine and takes up only a fraction of the execution time. It is the linear solver that needs to be optimized for this code to run faster, but honestly I have no idea how to do that.
Thank you for posting this, I really appreciate it. I am not surprised by this architecture's growing pains. I will get some sleep, then work on compiling Julia from source with the new version of OpenBLAS and report back.
Okay yes, that looks beautiful. So did you play around with configurations of preconditioned GMRES as described in Solving Large Stiff Equations · DifferentialEquations.jl? When it's the linear solve that's the limiting factor, the only things you can do are (a) try a different ODE solver or (b) try a different linear solver (or (c) optimize the stepping heuristics for your problem, but let's ignore that one).
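For reference, swapping in a Krylov method is just a keyword change on the solver. A minimal sketch, assuming an ODEProblem `prob` is already defined and using the LinearSolve.jl interface:

```julia
using OrdinaryDiffEq, LinearSolve

# Option (b): keep the same stiff ODE solver, but replace the direct linear solve
# with (for now unpreconditioned) GMRES. `prob` is your ODEProblem.
sol = solve(prob, KenCarp4(linsolve = KrylovJL_GMRES()), save_everystep = false)
```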
When PDEs get asymptotically large, preconditioned Krylov will almost always be the best option. But Pardiso is worth a try too. With AlgebraicMultigrid, the default Gauss-Seidel smoother has a bit of overhead, so that tutorial shows that using a Jacobi smoother speeds it up a bit. But a tuned ILU should be more efficient; you do have to sit down and play with τ a bit, though.
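The ILU setup looks roughly like the following sketch (modeled on the tutorial linked above; exact keyword names may differ by version, `prob` is your ODEProblem, and τ is the knob to tune):

```julia
using OrdinaryDiffEq, LinearSolve, IncompleteLU

# Build/update an ILU preconditioner from the current W matrix; τ controls how
# aggressively entries are dropped (bigger τ = sparser, cheaper, weaker preconditioner).
function incompletelu(W, du, u, p, t, newW, Plprev, Prprev, solverdata)
    if newW === nothing || newW
        Pl = ilu(convert(AbstractMatrix, W), τ = 50.0)
    else
        Pl = Plprev
    end
    Pl, nothing
end

# Hook it into the solver together with GMRES (needs a concrete, sparse Jacobian).
sol = solve(prob,
            KenCarp4(linsolve = KrylovJL_GMRES(),
                     precs = incompletelu, concrete_jac = true),
            save_everystep = false)
```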
For the choice of ODE solver, KenCarp4 vs KenCarp47 vs TRBDF2 vs CVODE_BDF vs QNDF vs FBDF vs Rodas5 (last one only with preconditioned GMRES, not with KLU) is somewhat of a wash and depends on your problem. I would time a few things.
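If it helps, a quick way to time a few of these on the same problem (a sketch; `prob` stands for your ODEProblem, and only the pure-Julia methods are listed here):

```julia
using OrdinaryDiffEq, BenchmarkTools

# Time several candidate stiff solvers on the same (hypothetical) problem `prob`.
for alg in (KenCarp4(), KenCarp47(), TRBDF2(), QNDF(), FBDF())
    t = @belapsed solve($prob, $alg, save_everystep = false)
    println(nameof(typeof(alg)), ": ", round(t, digits = 3), " s")
end
```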
Then if you really want to get extreme, you can think of splitting the problem so that you solve it in IMEX form:
KenCarp4 is compatible with that, so you could, say, make the Jacobian tridiagonal and handle that part implicitly while treating the other terms explicitly. The exact splitting will be a bit of an art.
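For concreteness, a minimal sketch of the IMEX form with SplitODEProblem (toy right-hand sides, not your PDE; the first function is treated implicitly, the second explicitly):

```julia
using OrdinaryDiffEq, LinearAlgebra

# Toy split system du/dt = A*u + g(u): stiff linear part implicit, the rest explicit.
N  = 16
u0 = rand(N)
A  = -100.0 .* Matrix(1.0I, N, N)                  # stand-in stiff operator
f_stiff!(du, u, p, t)    = mul!(du, A, u)          # handled implicitly by KenCarp4
f_nonstiff!(du, u, p, t) = (du .= u .* (1 .- u))   # handled explicitly

prob = SplitODEProblem(f_stiff!, f_nonstiff!, u0, (0.0, 1.0))
sol  = solve(prob, KenCarp4())
```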
Thank you for the advice Chris, this worked amazingly well.
Using @btime, the original code (UMF + AlgebraicMultigrid) took 10.20 s. Changing to KrylovJL_GMRES() brings it to 5.293 s. Changing from AlgebraicMultigrid to IncompleteLU brings it to 590 ms! I played around with τ, using values from 0.1 to 10,000, and the τ = 50 from the example works perfectly. All the results above use tspan = (0.0, 1.0).
I am sticking to KenCarp4 since I plan to split the equation ASAP and will post the results.
Here’s the profile for the most optimized case, running it 100 times. I am still not very good at reading these graphs so let me know if you catch anything.
I can tell for example that there’s a lot of time spent copying (copyto!) and broadcasting (materialize!). BLAS.axpy! and BLAS.dot are also significant, so I can work on optimizing BLAS.
The next step for now is splitting the ODE and using an IMEX method like you suggested. Then I will work on compiling OpenBLAS as suggested by @ImreSamu, with a build optimized for Alder Lake and AVX2, and we'll see what happens.
This is fun and so educational!
EDIT: It is worth noting that the optimized code at this stage runs on the M1 Mac in 700 ms, a speedup of 16%. I should definitely be able to squeeze more performance out of this.
- Turning on Julia multithreading with 16 threads: I have an @floop over the nested loop with 254 elements per dimension (I left the single loops alone, as per @tkf's suggestion). Run time decreased from 590 ms to 510 ms.
- Turning on OpenBLAS multithreading (16 threads): run time went up from 510 ms to 783 ms (2498 ms with 8 threads).
- Switching between MKL and OpenBLAS, in serial and parallel: no appreciable change in serial, but MKL is 4 times slower in parallel!
- When running this in parallel, the average CPU usage is low, around 200-300%.
Disclaimer: I didn’t read the entire thread / didn’t follow too closely.
Did you benchmark everything with BLAS.set_num_threads(1)? If not, note that you need to take care when using MKL together with multiple Julia threads (the same can be said about OpenBLAS). I find that MKL_NUM_THREADS defaults to the number of cores of the system, and since this number is used per Julia thread, you will readily have too many MKL threads running (and thus oversubscribe your cores). See Matrix multiplication is slower when multithreading in Julia - #12 by carstenbauer for more.
TLDR: I suggest you benchmark everything with BLAS.set_num_threads(1) (independent of whether you use MKL or OpenBLAS).
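Concretely, a two-line sketch (BLAS.get_num_threads is available on recent Julia versions):

```julia
using LinearAlgebra

BLAS.set_num_threads(1)   # one BLAS thread; let Julia's threads provide the parallelism
BLAS.get_num_threads()    # sanity check
```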
I am back with some new data. As per @ImreSamu's advice, I compiled OpenBLAS v0.3.19 using the flags below.
This adds Alder Lake support with AVX/AVX2/AVX512 (not sure of the difference). I then compiled Julia with this version of OpenBLAS. The performance table is below:
| # Threads | 0.3.19 (Alder Lake support), ms | 0.3.13 (default), ms |
| --- | --- | --- |
| 1 | 585 | 530 |
| 4 | 575 | 1193 |
| 8 | 571 | 2167 |
| 16 | 580 | 710 |
Here, # Threads is the number of threads used by OpenBLAS. As you can see, the thread count has no effect on performance with 0.3.19, while with the default version more threads means significantly worse performance.
I think the default OpenBLAS is insufficient if you are doing parallel workloads on a 12th-gen CPU, but 0.3.19 should ship with Julia sooner or later. I am also not sure why the time doesn't go down with the thread count in 0.3.19; it remains completely flat.
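For reference, the scan above can be reproduced with something like this sketch, using a stand-in matmul as the workload (in my case the timed call was the full solve):

```julia
using LinearAlgebra, BenchmarkTools

# Time the same BLAS-heavy workload at several OpenBLAS thread counts.
A = rand(2000, 2000); B = similar(A)
for n in (1, 4, 8, 16)
    BLAS.set_num_threads(n)
    t = @belapsed mul!($B, $A, $A)
    println(n, " BLAS threads: ", round(t * 1000, digits = 1), " ms")
end
```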
I have benchmarked with both 1 and 16 threads in the posts above. I just read that post, it is very helpful, thank you. What do you think about the data above where different versions of OpenBLAS behave differently? Do you think it’s just the optimization for the architecture?
I just wanted to add that, for the loop here, @floop with 16 threads is about a factor of 2 faster than @inbounds with 1 thread. So parallelism is definitely partially working now, it's just slower than I expected.
One more note: if you do what I did and compile v0.3.19 for Alder Lake support, you will lose access to Sundials.jl. It is compiled against libblas64_.so, the default, so it won't work with a custom BLAS. See here.
Looking at the profile you posted above, it looks like you spend most of the time outside of GS_Neumann0! and most of the work is in the ODE solver. It’s expected that parallelizing the derivative computation won’t help much. See: Amdahl's law - Wikipedia
I agree with you, and I think I am done trying to parallelize the function. Thanks for referencing Amdahl’s law, I hadn’t heard of it. Here’s some data.
I managed to partially fix the @tturbo issues, so now I have access to 5 macros for the loops: @tturbo, @turbo, @avx, @floop, @inbounds. I put these into a matrix where the rows refer to the double loop and the columns refer to the four single loops. For example, cell (1,2), which contains 91, means that with the double loop vectorized with @tturbo and the single loops with @turbo, the run time is 91 μs.
Updated code here. The results show @btime GS_Neumann0!(rand(N,N,2), rand(N,N,2), [ones(6); Float64(N)], 0.0). This code still doesn't run with OrdinaryDiffEq.solve(); it throws the warning "LoopVectorization.check_args on your inputs failed; running fallback @inbounds @fastmath loop instead".
All tests were run with 16 Julia threads and 16 BLAS threads using the Alder Lake BLAS I mentioned above.
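For anyone reproducing this, the benchmark with interpolated, pre-allocated inputs looks like this sketch (GS_Neumann0! is the RHS function from my repo, and N = 256 as elsewhere in this thread):

```julia
using BenchmarkTools

# GS_Neumann0! is the ODE right-hand side from the linked repo (not defined here).
N  = 256
du = rand(N, N, 2); u = rand(N, N, 2)
p  = [ones(6); Float64(N)]
@btime GS_Neumann0!($du, $u, $p, 0.0)
```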
| Double loop \ Single loops (μs) | @tturbo | @turbo | @inbounds | @floop | @avx |
| --- | --- | --- | --- | --- | --- |
| @tturbo | 90 | 91 | 90 | - | 89 |
| @turbo | - | 170 | 171 | - | - |
| @inbounds | - | - | 202 | - | - |
| @floop | - | 126 | 127 | 198 | - |
| @avx | - | - | 124 | 211 | - |
It can be observed that:
- Parallelizing the single loops with N = 256 does virtually nothing, as @tkf expected in an earlier comment.
- @tturbo/@tturbo performs best for the double loop, a bit better than @floop/@turbo. It will be my choice if I (or someone else) can fix that warning; until then I'm using @floop/@inbounds. A rough sketch of the @tturbo version is below.
- The performance boost over serial code (@inbounds/@inbounds) is about 2x, as I mentioned before.
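Here's roughly what the @tturbo version of the double loop looks like (a sketch with a stand-in stencil body, not the actual GS_Neumann0! kernel):

```julia
using LoopVectorization

# @tturbo = threaded @turbo, applied to the nested interior loop.
N  = 256
u  = rand(N, N, 2)
du = similar(u)
@tturbo for j in 2:N-1, i in 2:N-1
    du[i, j, 1] = u[i-1, j, 1] + u[i+1, j, 1] + u[i, j-1, 1] + u[i, j+1, 1] - 4 * u[i, j, 1]
end
```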
Running the ODE Solver with the improved code gives
I think it is safe to conclude that I’ve done all I can for this function (unless someone disagrees) and it is time to focus on the solver. Next step: ODE splitting
BTW, I don't know if you want to hear this after all that extensive testing :), but I think Tullio.jl (GitHub - mcabbott/Tullio.jl) can also be used for code like yours.
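Something like this toy 5-point stencil, just to show the shape of the syntax (a sketch, not your kernel; Tullio infers the interior range from the index shifts):

```julia
using Tullio, OffsetArrays   # OffsetArrays because the inferred range 2:N-1 doesn't start at 1

N = 256
u = rand(N, N)
# Toy Laplacian-style stencil; Tullio generates a threaded/SIMD loop over i, j in 2:N-1.
@tullio lap[i, j] := u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4 * u[i, j]
```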
I just looked at the code you posted, but it looks like you don't use @inbounds inside @floop. Since @floop is a very generic loop framework, it doesn't insert @inbounds the way @tturbo does. So you'd need an @inbounds annotation inside the loop, as in
@floop ex for j in 2:N-1, i in 2:N-1
    @inbounds begin
        du[i,j,1] = ...
    end
end
You can also pass ThreadedEx(simd = Val(:ivdep)) for this type of computation (which is what LoopVectorization.jl assumes IIRC).
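i.e., something like this sketch (dummy arrays x and y, just to show where the executor goes):

```julia
using FLoops

# Executor with the ivdep SIMD hint, as suggested above.
x = rand(100); y = similar(x)
ex = ThreadedEx(simd = Val(:ivdep))
@floop ex for i in eachindex(x)
    @inbounds y[i] = 2 * x[i]
end
```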
Thank you, I wasn't aware of this package. I will see if I can integrate it into my code.
Both fixed! New code pushed to repo, I used ThreadedEx(simd = Val(true)). Barely any change in performance, unfortunately, but as you said, I’ve probably done everything I can for this function. Thank you so much for all the help! I will focus on the ODE solver as per your advice.
Anyway, it turns out KenCarp47 is the fastest solver for this problem, with IterativeSolversJL_GMRES() as the linear solver and NLAnderson() as the nonlinear solver. Overall, I'm currently using:
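Roughly, the pieces fit together like this (a sketch of the configuration described above, not necessarily my exact call; `prob` is the ODE problem from earlier in the thread):

```julia
using OrdinaryDiffEq, LinearSolve

# KenCarp47 with GMRES (via IterativeSolvers.jl) for the linear solves and
# Anderson acceleration for the nonlinear solves.
sol = solve(prob,
            KenCarp47(linsolve = IterativeSolversJL_GMRES(),
                      nlsolve = NLAnderson()))
```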
Back to the parallelism, as @carstenbauer said, 1 BLAS thread is the right call. See here for a discussion I just started.
Here are the results so far:
| Simulation | Starting Code | Current Code |
| --- | --- | --- |
| t = 100, μ mode | 60 | 0.538 |
| t = 10000, β mode | 2250 | 47 |
μ-mode and β-mode are two solutions of the equations; the former reaches a steady state for some t > 100 while the latter oscillates forever. You can see that the performance improvement is 112x for the non-oscillating simulation and 48x for the oscillating one. Not bad.
Things to figure out:
- The overclocking made absolutely no difference; if you look at the other thread I linked about OpenBLAS, you'll see I have relatively low CPU usage. How do I improve this?
- Splitting the equation as @ChrisRackauckas suggested. I have already written the code but simply need to optimize it.