Parallelizing multiple Crank–Nicolson solvers

Well, A and B are not sparse; they are dense nr×nl matrices, and each pair of columns is associated with a separate complex-symmetric tridiagonal matrix h_i of dimension nr×nr.

Did you turn off BLAS threading? You can check it with BLAS.get_num_threads() and set it with BLAS.set_num_threads(1), or just set the environment variable OMP_NUM_THREADS=1.
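I.e., from within Julia (these calls live in the LinearAlgebra standard library):

using LinearAlgebra

BLAS.get_num_threads()    # how many threads BLAS is currently using
BLAS.set_num_threads(1)   # restrict BLAS to a single thread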

I don’t think BLAS is used here since the matrices are SymTridiagonal.

4 Likes

That should not matter, since the multiplication and solve steps are not in BLAS but in pure Julia.

2 Likes

I think an important thing for me to understand is the threading scheduler; is Threads.@threads a no-frills, plain even distribution of work over the available threads, like #pragma omp for in OpenMP? For this kind of problem, would FLoops.jl make a difference? Where can I expect overhead? I’ve gathered that Threads.@spawn introduces more overhead and is more suited for non-deterministic problems (?).

1 Like

Yes

You can potentially have some false sharing, since different threads are writing to parts of memory that are close to each other. What happens if you use:

A = [rand(ComplexF64, nr) for i in 1:nl]   # one vector per system, instead of matrix columns
B = [zeros(ComplexF64, nr) for i in 1:nl]

function expv!(B, cs::ACollectionOfCranks, A)
    Threads.@threads for j in eachindex(A)  # A is now a vector of vectors
        expv!(B[j], cs.c[j], A[j])
    end
end

?

2 Likes

Interesting, but weird (all timings are faster, but there is a slow-down instead of a speed-up with more threads):

However, I really need to store them as columns of dense matrices for other parts of the propagator to work. I would be very interested to know whether one can use the data locality to speed up the propagation; I believe the tiling in Tullio.jl does something like this.
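For example, keeping the dense nr×nl layout but only handing column views to the threaded loop would look something like this (untested, and assuming expv!(w, cn, v) accepts any AbstractVector):

function expv!(B::AbstractMatrix, cs::ACollectionOfCranks, A::AbstractMatrix)
    Threads.@threads for j in axes(A, 2)
        # views avoid copying, while the storage stays a dense matrix
        expv!(view(B, :, j), cs.c[j], view(A, :, j))
    end
end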

1 Like

FLoops.jl makes it easy to, e.g., re-use w in the same task but across different iterations. It could be a bit nicer for the cache, but that’s really a micro-optimization and I’m not sure it’d matter.
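Something like this sketch (w here is just a hypothetical scratch vector):

using FLoops

function process_columns!(A)
    @floop ThreadedEx() for j in axes(A, 2)
        # @init allocates w once per task; it is then re-used across that task's iterations
        @init w = Vector{ComplexF64}(undef, size(A, 1))
        # ... use w as scratch space while processing column j ...
    end
end

process_columns!(rand(ComplexF64, 3200, 64))   # illustrative sizes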

A sanity check you can do is to measure the overhead of the task scheduling with @btime wait(@spawn nothing) and Threads.@threads for x in 1:Threads.nthreads(); end with (say) julia -t41 on your machine. Actually, it looks like the overhead is pretty close to the timings you shared in the plot, so maybe the saturation in performance is unavoidable.

julia> Threads.nthreads()
41

julia> g(xs) = Threads.@threads for x in xs; end
g (generic function with 1 method)

julia> @btime g(1:Threads.nthreads())
  71.039 μs (206 allocations: 18.38 KiB)

Edit: I was making an incorrect argument here; see below for my mistake :)

Original argument:

FYI, it looks like FLoops is actually a bit faster for starting and synchronizing loops, presumably because it uses divide-and-conquer scheduling and syncing, as opposed to the sequential strategy used by @threads:

julia> f(xs) = @floop ThreadedEx() for x in xs; end
f (generic function with 1 method)

julia> @btime f(1:Threads.nthreads())
  53.349 μs (244 allocations: 18.88 KiB)

But I don’t think this matters much here.


Edit: The above argument was actually not correct. First of all, a difference like this is Julia-version specific; for example, I don’t see it on 1.5. Also, the comparison was not fair: ThreadedEx can choose not to spawn tasks on multiple OS threads, whereas @threads does so by design. Furthermore, when the loop is a no-op like this, it may be faster not to use OS threads for all iterations.

1 Like

That could very well be the case; if I run many steps of the propagator and simultaneously monitor it with perf top -F 30 -r 0, I find this:

  28.52%  [kernel]                  [k] acpi_idle_do_entry
  18.84%  libjulia-internal.so.1.7  [.] get_next_task
  12.64%  libjulia-internal.so.1.7  [.] jl_generate_fptr
   9.65%  [kernel]                  [k] acpi_processor_ffh_cstate_enter
   2.53%  libjulia-internal.so.1.7  [.] jl_safepoint_wait_gc
   1.10%  libjulia-internal.so.1.7  [.] jl_task_get_next
   0.62%  [JIT] tid 29853           [.] 0x00007f068b040c6c
   0.62%  [JIT] tid 29853           [.] 0x00007f068b040c81
   0.59%  [JIT] tid 29853           [.] 0x00007f068b040c66
   0.47%  [JIT] tid 29853           [.] 0x00007f068b028951
   0.38%  [JIT] tid 29853           [.] 0x00007f068b040c7d
   0.34%  [JIT] tid 29853           [.] 0x00007f068b040c85
   0.33%  [JIT] tid 29853           [.] 0x00007f068b040c60
   0.32%  [JIT] tid 29853           [.] 0x00007f068b028955
   0.31%  [JIT] tid 29853           [.] 0x00007f068b040c72
   0.30%  [kernel]                  [k] native_sched_clock
   0.28%  [JIT] tid 29853           [.] 0x00007f068b040c77
   0.25%  [JIT] tid 29853           [.] 0x00007f068b040dca
   0.24%  [JIT] tid 29853           [.] 0x00007f068b028731

where the first column is “Overhead”.

EDIT:

julia> Threads.nthreads()
41
julia> @btime g(1:Threads.nthreads())
  101.481 μs (206 allocations: 18.38 KiB)
1 Like

Do I understand correctly that I am limited because Threads.@threads uses task-based parallelism instead of data parallelism, and that the latter would be more applicable to my problem?

1 Like

I’d say both @threads and @floop are data-parallel and @spawn is task-parallel. I think the main problem here is that the task spawn/wait overhead is too close to the compute time of expv!(w, cn::CrankNicolson, v). If you, for example, call expv!(B, cs::ACollectionOfCranks, A) repeatedly from a sequential outer loop with the arguments of the same size, it may be possible to amortize the spawn/wait cost by re-using the tasks (it’d be tricky but maybe doable).
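A rough, untested sketch of the kind of thing I mean (all names here are made up): spawn long-lived workers once, and drive them with channels at every time step, so the spawn cost is paid only once.

function make_workers(step!, nl; ntasks = Threads.nthreads())
    chunks = collect(Iterators.partition(1:nl, cld(nl, ntasks)))
    go   = [Channel{Int}(1) for _ in chunks]       # per-worker "start step t" signal
    done = Channel{Int}(length(chunks))            # workers report completion here
    for (w, chunk) in enumerate(chunks)
        Threads.@spawn for t in go[w]              # blocks until a step number arrives
            foreach(j -> step!(j, t), chunk)       # process this worker's columns
            put!(done, w)
        end
    end
    return go, done
end

# One propagation step: wake every worker, then wait for all of them to finish.
function run_step!(go, done, t)
    foreach(ch -> put!(ch, t), go)
    foreach(_ -> take!(done), go)
end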

1 Like

(FYI: I just noticed that my argument that @floop was faster than @threads in the above comment was actually incorrect. It’s fixed now. Benchmarking is hard…)

1 Like

I’m afraid that approach would not work, since it is part of a larger split-propagator:

V = \exp(C/2)\exp(B/2)\exp(A)\exp(B/2)\exp(C/2) U

and the time-stepping loop is outside this, i.e. you swap U and V and repeat the above until you’re finished. You also have to perform the exponentials in a fixed order (since the operations do not commute), so you basically have to thread each exponential separately. The good thing about it is that the work is entirely deterministic, i.e. you know which operations \exp(C/2) will perform every iteration. Furthermore, since it is almost exclusively linear algebra, I think this is a good candidate for the fork–join approach, rather than starting tasks and waiting on them, but here I must confess I don’t really know what I am talking about anymore :slight_smile:.
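In outline, the time loop looks something like this (the kernel names are purely illustrative; they are passed in as arguments just to keep the sketch self-contained):

function propagate!(U, V, expChalf!, expBhalf!, expA!, nsteps)
    for _ in 1:nsteps
        expChalf!(V, U)   # V ← exp(C/2) U   (rightmost factor first)
        expBhalf!(U, V)   # U ← exp(B/2) V
        expA!(V, U)       # V ← exp(A)   U   (the Crank–Nicolson part)
        expBhalf!(U, V)   # U ← exp(B/2) V
        expChalf!(V, U)   # V ← exp(C/2) U
        U, V = V, U       # swap buffers and repeat
    end
    return U              # U now holds the propagated state
end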

Anecdotally, I’m told (by a rather experienced programmer) that the above problem scales linearly, if not superlinearly, with the number of threads in Fortran+OpenMP, so I guess there is a difference in the threading approach (one that I do not yet fully understand).

1 Like

If you have five exponentials, I wonder if you can call the five expv!(_, ::CrankNicolson, _) in sequence inside the body of the @threads for loop? That’d reduce the number of spawns and waits, so maybe you can stretch the scaling of the speedup a bit.

They are not all Crank–Nicolson; only the centre one is. The others are completely different (and more involved), but I wanted to start by parallelizing the simplest kernel.

1 Like

Did you check that there is zero allocation during the computation?

Are the data allocated sequentially? I remember that it can be beneficial for each computing core to allocate the data that it will eventually work with.

I think that an actual Fortran+OpenMP benchmark would be a very interesting starting point. The largest machine has 24 cores, and I would be interested to see whether HT can bring much in a compute-bound problem like this one.

In addition, speed-up with a very large number of threads on a single node is quite challenging: many memory collisions can happen in the cache hierarchy, and there are thread-affinity issues, NUMA effects and so on.

Out of curiosity, have you considered GPU computing for this task?

2 Likes

All very good points; I try to preallocate all the temporary storage arrays I need, but I have still not been able to make my propagator allocation-free. This is of course an important point to address. When I want to thread a section that needs temporaries, I usually do something like

tmps = [Vector{ComplexF64}(undef, nr) for i in 1:Threads.nthreads()]  # one scratch buffer per thread

Threads.@threads for j = 1:nl
    tmp = tmps[Threads.threadid()]
    # ... use tmp as scratch space for iteration j ...
end

but I guess maybe you were referring to something else by “each computing core allocates the data it will eventually work with”?

I am by no means well-versed in Fortran (or OpenMP, for that matter), but I will see what I can come up with (and ask the programmer I know for help). For the 2×24 cores ⇒ 96 threads of the Epyc machine, I was told that AMD’s equivalent of HT is very efficient, and since my problem usually fits in L2 cache, it’s basically a vector machine for all intents and purposes.

All the problems you mention with SMP are, I believe, very relevant for my problem, and something I would like to understand and be able to handle. It would be awesome if Julia could help me here.

I have been thinking about GPUs earlier, and parts of the propagator would work well there, I believe, but I am not sure about the whole.

1 Like

I would first ensure zero allocation, because I have seen (though not really proved) that allocations can be real speed-up killers.

Edit: there is no allocation in your expv! example (at least on my machine/version…).
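A quick way to check, assuming w, cn, and v are set up as in your expv!(w, cn::CrankNicolson, v) call:

using BenchmarkTools

@btime expv!($w, $cn, $v)   # the report should show "0 allocations"
@allocated expv!(w, cn, v)  # should return 0 (after a first call to compile)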

That is exactly what I had in mind. The only problem is that Julia does not (AFAIK) give you a way to control thread affinity, so a given thread may jump from one core to another.

I think that C/C++ + OpenMP should perform equally well, if you prefer those over Fortran. The actual compiler and OpenMP implementation (e.g. Intel’s) may play a larger role than the language (C++ or Fortran).

HT allows fast context switches, reducing latency issues, but if the cores are fully used it should make no difference (AFAIU).

This is indeed a key point. Is L2 not shared between AMD’s cores?

I agree 100%! I am unsure about that right now, and I guess that other lower-level language approaches (e.g. C++ with TBB) are still faster and offer better control and lower latencies than Julia, although those alternatives do not seriously care about parallel composability and, consequently, aim only at explicit multithreaded parallelism. I have seen different claims on Discourse, so it should be investigated.

Therefore, an honest comparison with other approaches (Fortran, C++, …) would be very beneficial for evaluating the strengths and weaknesses of Julia’s multithreading.

It may be worth an investigation. My experience is that, while GPU programming can be cumbersome when you need to write many kernels, the speed-ups are much easier to evaluate and predict than with multithreaded programming on CPUs (lower level, better control, no cache-coherency issues, …). In this respect Julia is really good at programming GPUs (CUDA.jl, oneAPI.jl), because you can rely on standard collective operations (broadcast, map, mapreduce, …) for large parts of the code, which lets you focus on the complicated kernels.
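For instance, a trivial illustration (assuming CUDA.jl and a CUDA-capable GPU; sizes made up):

using CUDA

A_d = CUDA.rand(Float32, 3200, 64)   # array resident on the device
B_d = similar(A_d)

B_d .= 2f0 .* A_d .+ 1f0             # broadcast fuses into a single GPU kernel
s   = mapreduce(abs2, +, A_d)        # collective reduction, also executed on the device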

2 Likes

Not completely related to the original question on MT scaling, but on the specific example (multiple tridiagonal systems of the same size, nc = 3200, to be solved): you can achieve a faster solution with the Thomas algorithm (without pre-factorization) applied in a SIMD manner to blocks of systems (you have to adapt the layout). See these slides and this paper. Note that this part at least can easily be ported to the GPU.
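A rough sketch of what I mean (the function name and the (nsys, n) layout, with the system index innermost so the inner loop can vectorize, are only illustrative):

# dl, d, du hold the sub-, main- and super-diagonals of each system, b the right-hand sides;
# all arrays have layout (nsys, n). b is overwritten with the solutions; cp, dp are scratch.
function thomas_batched!(b, dl, d, du, cp, dp)
    nsys, n = size(b)
    @inbounds begin
        @simd for s in 1:nsys                       # forward sweep, first row
            cp[s, 1] = du[s, 1] / d[s, 1]
            dp[s, 1] = b[s, 1] / d[s, 1]
        end
        for i in 2:n                                # forward sweep, remaining rows
            @simd for s in 1:nsys
                denom    = d[s, i] - dl[s, i] * cp[s, i-1]
                cp[s, i] = du[s, i] / denom
                dp[s, i] = (b[s, i] - dl[s, i] * dp[s, i-1]) / denom
            end
        end
        @simd for s in 1:nsys                       # back substitution, last row
            b[s, n] = dp[s, n]
        end
        for i in n-1:-1:1                           # back substitution, remaining rows
            @simd for s in 1:nsys
                b[s, i] = dp[s, i] - cp[s, i] * b[s, i+1]
            end
        end
    end
    return b
end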

2 Likes

This seems like a highly relevant discussion: Overhead of `Threads.@threads` - #29 by Elrod

2 Likes