Parallelizing multiple Crank–Nicolson solvers

Well, A and B are not sparse; they are dense nr×nl matrices, and each pair of columns is associated with a separate complex-symmetric tridiagonal matrix h_i of dimension nr×nr.

Did you turn off BLAS threading? You can check it with BLAS.get_num_threads() and set it with BLAS.set_num_threads(1), or just set the environment variable OMP_NUM_THREADS=1.
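I.e., from within Julia (these calls live in the LinearAlgebra standard library):

using LinearAlgebra

BLAS.get_num_threads()    # how many threads BLAS is currently using
BLAS.set_num_threads(1)   # restrict BLAS to a single thread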

I don’t think BLAS is used here since the matrices are SymTridiagonal.

4 Likes

That should not matter, since the multiplication and solve steps are not in BLAS but in pure Julia.

2 Likes

I think an important thing for me to understand is the threading scheduler; is Threads.@threads a no-frills, plain even distribution of work over the available threads, like #pragma omp for in OpenMP? For this kind of problem, would FLoops.jl make a difference? Where can I expect overhead? I’ve gathered that Threads.@spawn introduces more overhead and is more suited for non-deterministic problems (?).

1 Like

Yes

You can potentially have some false sharing, since different threads are writing to parts of memory that are close to each other. What happens if you use:

A = [rand(ComplexF64, nr) for i in 1:nl]   # one vector per system, instead of matrix columns
B = [zeros(ComplexF64, nr) for i in 1:nl]

function expv!(B, cs::ACollectionOfCranks, A)
    Threads.@threads for j in eachindex(A)  # A is now a vector of vectors
        expv!(B[j], cs.c[j], A[j])
    end
end

?

2 Likes

Interesting, but weird (all timings are faster, but there is a slow-down instead of a speed-up with more threads):

However, I really need to store them as columns of dense matrices for other parts of the propagator to work. I would be very interested to know whether one can use the data locality to speed up the propagation; I believe the tiling in Tullio.jl does something like this.
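For example, keeping the dense nr×nl layout but only handing column views to the threaded loop would look something like this (untested, and assuming expv!(w, cn, v) accepts any AbstractVector):

function expv!(B::AbstractMatrix, cs::ACollectionOfCranks, A::AbstractMatrix)
    Threads.@threads for j in axes(A, 2)
        # views avoid copying, while the storage stays a dense matrix
        expv!(view(B, :, j), cs.c[j], view(A, :, j))
    end
end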

1 Like

FLoops.jl makes it easy to, e.g., re-use w in the same task but across different iterations. It could be a bit nicer for the cache, but that’s really a micro-optimization and I’m not sure it’d matter.
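Something like this sketch (w here is just a hypothetical scratch vector):

using FLoops

function process_columns!(A)
    @floop ThreadedEx() for j in axes(A, 2)
        # @init allocates w once per task; it is then re-used across that task's iterations
        @init w = Vector{ComplexF64}(undef, size(A, 1))
        # ... use w as scratch space while processing column j ...
    end
end

process_columns!(rand(ComplexF64, 3200, 64))   # illustrative sizes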

A sanity check you can do is to measure the overhead of the task scheduling with @btime wait(@spawn nothing) and Threads.@threads for x in 1:Threads.nthreads(); end with (say) julia -t41 on your machine. Actually, it looks like the overhead is pretty close to the timings you shared in the plot, so maybe the saturation in performance is unavoidable.

julia> Threads.nthreads()
41

julia> g(xs) = Threads.@threads for x in xs; end
g (generic function with 1 method)

julia> @btime g(1:Threads.nthreads())
  71.039 μs (206 allocations: 18.38 KiB)

Edit: I was making an incorrect argument here; see below for my mistake :)

Original argument:

FYI, it looks like FLoops is actually a bit faster for starting and synchronizing loops, presumably because it uses divide-and-conquer scheduling and syncing, as opposed to the sequential strategy used by @threads:

julia> f(xs) = @floop ThreadedEx() for x in xs; end
f (generic function with 1 method)

julia> @btime f(1:Threads.nthreads())
  53.349 μs (244 allocations: 18.88 KiB)

But I don’t think this matters much here.


Edit: The above argument was actually not correct. First of all, a difference like this is Julia-version specific; for example, I don’t see it on 1.5. Also, the comparison was not fair: ThreadedEx can choose not to spawn tasks on multiple OS threads, whereas @threads does so by design. Furthermore, when the loop is a no-op like this, it may be faster not to use OS threads for all iterations.

1 Like

That could very well be the case; if I run many steps of the propagator and simultaneously monitor it with perf top -F 30 -r 0, I find this:

  28.52%  [kernel]                  [k] acpi_idle_do_entry
  18.84%  libjulia-internal.so.1.7  [.] get_next_task
  12.64%  libjulia-internal.so.1.7  [.] jl_generate_fptr
   9.65%  [kernel]                  [k] acpi_processor_ffh_cstate_enter
   2.53%  libjulia-internal.so.1.7  [.] jl_safepoint_wait_gc
   1.10%  libjulia-internal.so.1.7  [.] jl_task_get_next
   0.62%  [JIT] tid 29853           [.] 0x00007f068b040c6c
   0.62%  [JIT] tid 29853           [.] 0x00007f068b040c81
   0.59%  [JIT] tid 29853           [.] 0x00007f068b040c66
   0.47%  [JIT] tid 29853           [.] 0x00007f068b028951
   0.38%  [JIT] tid 29853           [.] 0x00007f068b040c7d
   0.34%  [JIT] tid 29853           [.] 0x00007f068b040c85
   0.33%  [JIT] tid 29853           [.] 0x00007f068b040c60
   0.32%  [JIT] tid 29853           [.] 0x00007f068b028955
   0.31%  [JIT] tid 29853           [.] 0x00007f068b040c72
   0.30%  [kernel]                  [k] native_sched_clock
   0.28%  [JIT] tid 29853           [.] 0x00007f068b040c77
   0.25%  [JIT] tid 29853           [.] 0x00007f068b040dca
   0.24%  [JIT] tid 29853           [.] 0x00007f068b028731

where the first column is “Overhead”.

EDIT:

julia> Threads.nthreads()
41
julia> @btime g(1:Threads.nthreads())
  101.481 μs (206 allocations: 18.38 KiB)
1 Like

Do I understand correctly that I am limited because Threads.@threads uses task-based parallelism instead of data parallelism, and that the latter would be more applicable to my problem?

1 Like

I’d say both @threads and @floop are data-parallel and @spawn is task-parallel. I think the main problem here is that the task spawn/wait overhead is too close to the compute time of expv!(w, cn::CrankNicolson, v). If you, for example, call expv!(B, cs::ACollectionOfCranks, A) repeatedly from a sequential outer loop with the arguments of the same size, it may be possible to amortize the spawn/wait cost by re-using the tasks (it’d be tricky but maybe doable).
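A rough, untested sketch of the kind of thing I mean (all names here are made up): spawn long-lived workers once, and drive them with channels at every time step, so the spawn cost is paid only once.

function make_workers(step!, nl; ntasks = Threads.nthreads())
    chunks = collect(Iterators.partition(1:nl, cld(nl, ntasks)))
    go   = [Channel{Int}(1) for _ in chunks]       # per-worker "start step t" signal
    done = Channel{Int}(length(chunks))            # workers report completion here
    for (w, chunk) in enumerate(chunks)
        Threads.@spawn for t in go[w]              # blocks until a step number arrives
            foreach(j -> step!(j, t), chunk)       # process this worker's columns
            put!(done, w)
        end
    end
    return go, done
end

# One propagation step: wake every worker, then wait for all of them to finish.
function run_step!(go, done, t)
    foreach(ch -> put!(ch, t), go)
    foreach(_ -> take!(done), go)
end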

1 Like

(FYI: I just noticed that my argument that @floop was faster than @threads in the above comment was actually incorrect. It’s fixed now. Benchmarking is hard…)

1 Like

I’m afraid that approach would not work, since it is part of a larger split-propagator:

V = \exp(C/2)\exp(B/2)\exp(A)\exp(B/2)\exp(C/2) U

and the time-stepping loop is outside this, i.e. you swap U and V and repeat the above until you’re finished. You also have to perform the exponentials in a fixed order (since the operations do not commute), so you basically have to thread each exponential separately. The good thing about it is that the work is entirely deterministic, i.e. you know which operations \exp(C/2) will perform every iteration. Furthermore, since it is almost exclusively linear algebra, I think this is a good candidate for the fork–join approach, rather than starting tasks and waiting on them, but here I must confess I don’t really know what I am talking about anymore :slight_smile:.
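In outline, the time loop looks something like this (the kernel names are purely illustrative; they are passed in as arguments just to keep the sketch self-contained):

function propagate!(U, V, expChalf!, expBhalf!, expA!, nsteps)
    for _ in 1:nsteps
        expChalf!(V, U)   # V ← exp(C/2) U   (rightmost factor first)
        expBhalf!(U, V)   # U ← exp(B/2) V
        expA!(V, U)       # V ← exp(A)   U   (the Crank–Nicolson part)
        expBhalf!(U, V)   # U ← exp(B/2) V
        expChalf!(V, U)   # V ← exp(C/2) U
        U, V = V, U       # swap buffers and repeat
    end
    return U              # U now holds the propagated state
end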

Anecdotally, I’m told (by a rather experienced programmer) that the above problem scales linearly, if not superlinearly, with the number of threads in Fortran+OpenMP, so I guess there is a difference in the threading approach (one that I do not yet fully understand).

1 Like

If you have five exponentials, I wonder if you can call the five expv!(_, ::CrankNicolson, _) in sequence inside the body of the @threads for loop? That’d reduce the number of spawns and waits, so maybe you can stretch the scaling of the speedup a bit.

They are not all Crank–Nicolson; only the centre one is. The others are completely different (and more involved), but I wanted to start by parallelizing the simplest kernel.

1 Like

Did you check that there is zero allocation during the computation?

Are the data allocated sequentially? I remember that it can be beneficial for each computing core to allocate the data that it will eventually work with.

I think that an actual Fortran+OpenMP benchmark would be a very interesting starting point. The largest machine has 24 cores, and I would be interested to see whether HT can bring much in a compute-bound problem like this one.

In addition, speed-up with a very large number of threads on a single node is quite challenging: many memory collisions can happen in the cache hierarchy, and there are thread-affinity issues, NUMA effects and so on.

Out of curiosity, have you considered GPU computing for this task?

2 Likes

All very good points; I try to preallocate all the temporary storage arrays I need, but I have still not been able to make my propagator allocation-free. This is of course an important point to address. When I want to thread a section that needs temporaries, I usually do something like

tmps = [Vector{ComplexF64}(undef, nr) for i in 1:Threads.nthreads()]  # one scratch buffer per thread

Threads.@threads for j = 1:nl
    tmp = tmps[Threads.threadid()]
    # ... use tmp as scratch space for iteration j ...
end

but I guess maybe you were referring to something else by “each computing core allocates the data it will eventually work with”?

I am by no means well-versed in Fortran (or OpenMP, for that matter), but I will see what I can come up with (and ask the programmer I know for help). For the 2×24 cores ⇒ 96 threads of the Epyc machine, I was told that AMD’s equivalent of HT is very efficient, and since my problem usually fits in L2 cache, it’s basically a vector machine for all intents and purposes.

All the problems you mention with SMP are, I believe, very relevant for my problem, and something I would like to understand and be able to handle. It would be awesome if Julia could help me here.

I have been thinking about GPUs earlier, and parts of the propagator would work well there, I believe, but I am not sure about the whole.

1 Like

I would first ensure zero allocation, because I have seen (though not really proved) that allocations can be real speed-up killers.

Edit: there is no allocation in your expv! example (at least on my machine/version…).
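A quick way to check, assuming w, cn, and v are set up as in your expv!(w, cn::CrankNicolson, v) call:

using BenchmarkTools

@btime expv!($w, $cn, $v)   # the report should show "0 allocations"
@allocated expv!(w, cn, v)  # should return 0 (after a first call to compile)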

That is exactly what I had in mind. The only problem is that Julia does not (AFAIK) give you a way to control thread affinity, so a given thread may jump from one core to another.

I think that C/C++ + OpenMP should perform equally well, if you prefer those over Fortran. The actual compiler and OpenMP implementation (e.g. Intel’s) may play a larger role than the language (C++ or Fortran).

HT allows fast context switches, reducing latency issues, but if the cores are fully used it should make no difference (AFAIU).

This is indeed a key point. Is L2 not shared between AMD’s cores?

I agree 100%! I am unsure about that right now, and I guess that other lower-level language approaches (e.g. C++ with TBB) are still faster and offer better control and lower latencies than Julia, although those alternatives do not seriously care about parallel composability and, consequently, aim only at explicit multithreaded parallelism. I have seen different claims on Discourse, so it should be investigated.

Therefore, an honest comparison with other approaches (Fortran, C++, …) would be very beneficial for evaluating the strengths and weaknesses of Julia’s multithreading.

It may be worth an investigation. My experience is that, while GPU programming can be cumbersome when you need to write many kernels, the speed-ups are much easier to evaluate and predict than with multithreaded programming on CPUs (lower level, better control, no cache-coherency issues, …). In this respect Julia is really good at programming GPUs (CUDA.jl, oneAPI.jl), because you can rely on standard collective operations (broadcast, map, mapreduce, …) for large parts of the code, which lets you focus on the complicated kernels.
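For instance, a trivial illustration (assuming CUDA.jl and a CUDA-capable GPU; sizes made up):

using CUDA

A_d = CUDA.rand(Float32, 3200, 64)   # array resident on the device
B_d = similar(A_d)

B_d .= 2f0 .* A_d .+ 1f0             # broadcast fuses into a single GPU kernel
s   = mapreduce(abs2, +, A_d)        # collective reduction, also executed on the device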

2 Likes

Not completely related to the original question on MT scaling, but on the specific example (multiple tridiagonal systems of the same size, nc = 3200, to be solved): you can achieve a faster solution with the Thomas algorithm (without pre-factorization) applied in a SIMD manner to blocks of systems (you have to adapt the layout). See these slides and this paper. Note that this part at least can easily be ported to the GPU.
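A rough sketch of what I mean (the function name and the (nsys, n) layout, with the system index innermost so the inner loop can vectorize, are only illustrative):

# dl, d, du hold the sub-, main- and super-diagonals of each system, b the right-hand sides;
# all arrays have layout (nsys, n). b is overwritten with the solutions; cp, dp are scratch.
function thomas_batched!(b, dl, d, du, cp, dp)
    nsys, n = size(b)
    @inbounds begin
        @simd for s in 1:nsys                       # forward sweep, first row
            cp[s, 1] = du[s, 1] / d[s, 1]
            dp[s, 1] = b[s, 1] / d[s, 1]
        end
        for i in 2:n                                # forward sweep, remaining rows
            @simd for s in 1:nsys
                denom    = d[s, i] - dl[s, i] * cp[s, i-1]
                cp[s, i] = du[s, i] / denom
                dp[s, i] = (b[s, i] - dl[s, i] * dp[s, i-1]) / denom
            end
        end
        @simd for s in 1:nsys                       # back substitution, last row
            b[s, n] = dp[s, n]
        end
        for i in n-1:-1:1                           # back substitution, remaining rows
            @simd for s in 1:nsys
                b[s, i] = dp[s, i] - cp[s, i] * b[s, i+1]
            end
        end
    end
    return b
end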

2 Likes

This seems like a highly relevant discussion: Overhead of `Threads.@threads` - #29 by Elrod

2 Likes