Thanks, everyone, for your suggestions!
We'd need to know what propagate_splitting is actually doing.
Well, nothing trivial, but nothing crazy either: initialize a model for a quantum system and a complex vector of size 1024, and then do a whole lot of FFT/iFFT, vector-vector multiplication (or rather, diagonal-matrix-vector multiplication), and elementwise exponentiation.
I put the full code for the benchmark here: GitHub - goerz-testing/2023-01_rotating_tai_benchmark: Benchmarking of parallelization for rotating TAI
It’s “minimal”, but still somewhat complicated. Maybe I can do a self-contained single file that does something bogus but structurally similar to the original code. You should be able to run it by just instantiating the project, as described in the README.
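Along those lines, a bogus-but-structurally-similar single file might look like the sketch below. Everything here is an assumption, not taken from the repo: the names propagate_step!, V, and K are mine, and a dense unitary DFT matrix stands in for the FFT/iFFT pair so that only the standard library is needed.

```julia
using LinearAlgebra

# Hypothetical stand-in for propagate_splitting: one split-operator step.
# F is a unitary DFT matrix used in place of FFTW's fft!/ifft!.
const N = 64
const F = [exp(-2π * im * j * k / N) for j in 0:N-1, k in 0:N-1] ./ sqrt(N)

function propagate_step!(psi, V, K, dt)
    psi .*= exp.(-0.5im * dt .* V)  # diagonal potential, half step
    psi .= F * psi                  # "FFT" to momentum space
    psi .*= exp.(-im * dt .* K)     # diagonal kinetic term, full step
    psi .= F' * psi                 # "iFFT" back to position space
    psi .*= exp.(-0.5im * dt .* V)  # second potential half step
    return psi
end

psi = normalize!(randn(ComplexF64, N))
V = randn(N)
K = abs2.(randn(N))
propagate_step!(psi, V, K, 0.01)
println(norm(psi))  # real V and K make the step unitary, so ≈ 1.0
```

Structurally this is the same workload as described above: elementwise exponentiation, diagonal-matrix-vector products, and a transform pair in the middle.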
Threads.@threads is applied only to the M loop, which has only 13 iterations.
One thread doing 6 and another doing 7 should still get much better speedup than you’re seeing.
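For reference, here is a small sketch (not from the repo) of how Threads.@threads carves up those 13 iterations: with :static scheduling the range is split into nthreads() contiguous chunks of near-equal size, so at 2 threads one thread gets 7 iterations and the other 6, capping the ideal speedup at 13/7 ≈ 1.86.

```julia
using Base.Threads

# Count how many of the 13 iterations each thread executes.
# :static pins each chunk to one thread, so threadid() is stable here.
counts = zeros(Int, nthreads())
@threads :static for m in 1:13
    counts[threadid()] += 1
end
println(counts)  # e.g. [7, 6] when started with -t 2
```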
Yeah, but I’m generally OK with that. Actually, a potentially larger issue is that the different calls to
propagate_splitting have significantly different runtimes, and potentially all the “short” calls end up in one thread and all the “long” calls in another. To avoid both of these issues, I tried setting
N=M=16, and I’m also using a constant value for the input arrays, so that every call does exactly the same thing. This is “benchmark 2”. It brings out even worse scaling:
:> JULIA_EXCLUSIVE=1 julia --project=. -t 1 benchmark2.jl
153.666410 seconds (290.36 M allocations: 410.962 GiB, 5.05% gc time, 0.04% compilation time)
:> JULIA_EXCLUSIVE=1 julia --project=. -t 2 benc
98.435518 seconds (309.35 M allocations: 411.444 GiB, 6.31% gc time, 0.08% compilation time)
:> JULIA_EXCLUSIVE=1 julia --project=. -t 4 benchmark2.jl
99.298586 seconds (333.82 M allocations: 412.088 GiB, 5.35% gc time, 0.08% compilation time)
:> JULIA_EXCLUSIVE=1 julia --project=. -t 8 benchmark2.jl
121.498563 seconds (374.74 M allocations: 413.286 GiB, 3.96% gc time, 0.07% compilation time)
It gets slower again with more than 2 threads!
What if you replace
propagate_splitting with a sleep to simulate work? Any difference?
Yeah, I tried that as “benchmark 5” in the test repo, and the scaling is perfect in that case.
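For illustration, a sketch of such a sleep test (the repo's "benchmark 5" presumably looks similar; the function name fake_work is mine): sleep blocks the task without touching memory or the CPU, so the loop scales essentially perfectly with the number of threads.

```julia
using Base.Threads

# Replace the real work with sleep to isolate the threading overhead.
function fake_work(M; dt=0.05)
    @threads for m in 1:M
        sleep(dt)   # stands in for one propagate_splitting call
    end
end

fake_work(nthreads())   # warm-up / compilation
t = @elapsed fake_work(8)
println(t)              # ≈ 8 * 0.05 / nthreads() seconds
```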
So yes, the “problem” is whatever happens inside
propagate_splitting, but my question is: why should it matter? It’s a self-contained black-box function, accessing non-overlapping data on each thread. Everything should be completely independent.
The only thing I can imagine is something like saturating the memory bandwidth… but that seems like very bad behavior from Julia for this kind of code. What I mean is: I’ve never had a problem like that when doing similar kinds of numerics “trivially” parallelized with OpenMP in Fortran.
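If bandwidth saturation is indeed the suspect, one quick check (a sketch; all names here are mine, not from the repo) is to time a purely streaming kernel: it stops scaling once the cores jointly saturate the memory bus, whereas a compute-bound kernel keeps scaling. Comparing -t 1 against -t 4 on this tells you whether the machine, rather than Julia, is the bottleneck.

```julia
using Base.Threads

# Memory-bound kernel: one multiply per load/store, large non-overlapping
# arrays per thread, so any scaling limit comes from the memory bus.
function stream!(dsts, srcs)
    @threads :static for t in 1:length(dsts)
        dsts[t] .= 2.0 .* srcs[t]
    end
end

n = 1 << 20                              # 8 MiB per array, well past L2
srcs = [rand(n) for _ in 1:nthreads()]
dsts = [similar(s) for s in srcs]
stream!(dsts, srcs)                      # warm-up / compilation
println(@elapsed stream!(dsts, srcs))
```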
I mean, maybe I need to go in and optimize allocations. They seem a bit high in the benchmark, but not outrageous. Although the point of the question here isn’t necessarily to optimize the propagation code, at this point. This is kinda throw-away “daily research” code, and I’m just trying to shave a few hours off some calculations by utilizing the cores in my workstation. I do wish Julia had manual memory management like Fortran, though!
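For what it's worth, the usual allocation fix doesn't require manual memory management, just hoisting every temporary out of the loop and using in-place operations. A sketch (the names are mine, and a Diagonal matrix stands in for the real operators; with FFTW one would additionally build an in-place plan via plan_fft! once, outside the loop):

```julia
using LinearAlgebra

# Allocating version: two fresh vectors per step.
function propagate_alloc(D, psi, steps)
    for _ in 1:steps
        psi = exp(-0.01im) .* (D * psi)
    end
    return psi
end

# In-place version: one preallocated buffer, reused every step.
function propagate_inplace!(buf, D, psi, steps)
    phase = exp(-0.01im)
    for _ in 1:steps
        mul!(buf, D, psi)     # buf = D * psi, no allocation
        psi .= phase .* buf   # in-place broadcast, no allocation
    end
    return psi
end

D = Diagonal(randn(1024))
psi = randn(ComplexF64, 1024)
buf = similar(psi)
propagate_alloc(D, copy(psi), 1)    # compile both variants
propagate_inplace!(buf, D, psi, 1)
b = @allocated propagate_alloc(D, copy(psi), 100)
a = @allocated propagate_inplace!(buf, D, psi, 100)
println((b, a))                     # in-place variant allocates almost nothing
```

Fewer allocations also means less GC work, and the GC can be a genuine scaling bottleneck for threaded code.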
Perhaps it’s not a problem here, but in general don’t put @inbounds in front of …
The @inbounds annotations actually don’t make any difference at all. I didn’t have any @inbounds initially and added them later just to make sure. And then I left them in, under the assumption that otherwise someone would say “Try adding @inbounds”.
The most important part is assigning t_r outside of the inner loop.
There could be some false sharing.
Worth a try, but didn’t make a difference (results are in the test repo).
Just to get some more information, what processor are you using?
An Intel Core i9 with 8 cores (16 hyperthreads), running Linux. See “System Information” in the example repo for more details.