FLoops.jl makes it easy to, e.g., re-use w in the same task but across different iterations. That could be a bit nicer for cache locality, but it's really a micro-optimization and I'm not sure it'd matter.
A sanity check you can do is to measure the overhead of the task scheduling itself with @btime wait(@spawn nothing) and Threads.@threads for x in 1:Threads.nthreads(); end in (say) julia -t41 on your machine. Actually, that overhead looks pretty close to the timing you shared in the plot, so maybe the saturation in the performance is unavoidable.
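For reference, here is the first check as a self-contained script (a minimal sketch; it assumes the BenchmarkTools package is installed):

using BenchmarkTools
using Base.Threads: @spawn

# Cost of scheduling one empty task and waiting for it to finish; this
# is roughly the floor for any @spawn-based parallel construct.
@btime wait(@spawn nothing)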
julia> using BenchmarkTools
julia> Threads.nthreads()
41
julia> g(xs) = Threads.@threads for x in xs; end
g (generic function with 1 method)
julia> @btime g(1:Threads.nthreads())
71.039 μs (206 allocations: 18.38 KiB)
Edit: I was making an incorrect argument here. Read on to see my mistake :)
Original argument:
FYI, it looks like FLoops is actually a bit faster for starting and synchronizing loops, presumably because it uses divide-and-conquer scheduling and syncing, as opposed to the sequential strategy used by @threads:
julia> using FLoops
julia> f(xs) = @floop ThreadedEx() for x in xs; end
f (generic function with 1 method)
julia> @btime f(1:Threads.nthreads())
53.349 μs (244 allocations: 18.88 KiB)
But I don’t think this matters much here.
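(For intuition, here is a hand-rolled sketch of the divide-and-conquer spawn/sync pattern; dc_foreach is a hypothetical illustration, not FLoops' actual implementation:)

using Base.Threads: @spawn

# Recursively split the range, spawning a task for one half, so task
# creation and the waits form a tree of depth O(log n) instead of a
# sequential chain of spawns followed by a sequential chain of waits.
function dc_foreach(f, r::UnitRange)
    if length(r) <= 1
        foreach(f, r)
    else
        m = (first(r) + last(r)) >>> 1
        t = @spawn dc_foreach(f, (m + 1):last(r))
        dc_foreach(f, first(r):m)
        wait(t)
    end
end

dc_foreach(_ -> nothing, 1:Threads.nthreads())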
Edit: The above argument was actually not correct. First of all, a difference like this is Julia-version specific; for example, I don't see it in 1.5. Also, this comparison was not fair: ThreadedEx can choose not to spawn tasks in multiple OS threads, but @threads does so by design. Furthermore, when the loop is a no-op like this, it may be faster to not use OS threads for all iterations.
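A fairer variant would force one task per iteration via the basesize option (a sketch; ThreadedEx(basesize = 1) makes each spawned task handle a single iteration, which is closer to what @threads does when there is exactly one iteration per thread):

using FLoops

# basesize = 1 forces one task per iteration instead of letting the
# executor batch iterations (or skip spawning tasks entirely):
f1(xs) = @floop ThreadedEx(basesize = 1) for x in xs; end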