There is a certain overhead of using `Threads.@threads` even if only a single thread is used.
For example, running Julia v1.5.3 with `julia --threads=1` yields
julia> using BenchmarkTools
julia> function foo_serial!(dest, src)
           for i in eachindex(dest, src)
               @inbounds dest[i] = sin(cos(src[i])) # proxy for some operation
           end
           return nothing
       end
julia> function foo_threaded!(dest, src)
           Threads.@threads for i in eachindex(dest, src)
               @inbounds dest[i] = sin(cos(src[i])) # proxy for some operation
           end
           return nothing
       end
julia> Threads.nthreads()
julia> src = rand(10^3); dest = similar(src);
julia> @benchmark foo_serial!($dest, $src)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 10.426 μs (0.00% GC)
median time: 10.532 μs (0.00% GC)
mean time: 10.653 μs (0.00% GC)
maximum time: 46.086 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark foo_threaded!($dest, $src)
BenchmarkTools.Trial:
memory estimate: 832 bytes
allocs estimate: 6
--------------
minimum time: 12.653 μs (0.00% GC)
median time: 12.929 μs (0.00% GC)
mean time: 13.255 μs (0.00% GC)
maximum time: 44.788 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Of course, the threaded version will become faster if multiple threads are used, e.g. using `julia --threads=4`:
julia> using BenchmarkTools
julia> function foo_serial!(dest, src)
           for i in eachindex(dest, src)
               @inbounds dest[i] = sin(cos(src[i])) # proxy for some operation
           end
           return nothing
       end
julia> function foo_threaded!(dest, src)
           Threads.@threads for i in eachindex(dest, src)
               @inbounds dest[i] = sin(cos(src[i])) # proxy for some operation
           end
           return nothing
       end
julia> Threads.nthreads()
julia> src = rand(10^3); dest = similar(src);
julia> @benchmark foo_serial!($dest, $src)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 10.048 μs (0.00% GC)
median time: 10.289 μs (0.00% GC)
mean time: 10.407 μs (0.00% GC)
maximum time: 29.898 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark foo_threaded!($dest, $src)
BenchmarkTools.Trial:
memory estimate: 2.98 KiB
allocs estimate: 21
--------------
minimum time: 5.005 μs (0.00% GC)
median time: 5.827 μs (0.00% GC)
mean time: 6.155 μs (1.51% GC)
maximum time: 333.063 μs (95.86% GC)
--------------
samples: 10000
evals/sample: 6
We are writing a library of numerical algorithms for partial differential equations that should be relatively easy to use for students and extensible for researchers. At the same time, the performance should not be too bad. Currently, we just use `Threads.@threads` for loops where multithreading will give us improvements when using common problem sizes and multiple threads. However, we would also like to avoid the overhead of this approach when only one thread is used. Hence, we would like to know what you think will be the best approach for us.
- Just use `Threads.@threads for ...` as we do now and hope that the overhead will be reduced in the future. Are there any plans/roadmaps for that?
- Write functions that can benefit from using threads twice, one version with `Threads.@threads`, the other version without, maybe using a macro internally to simplify things. There are different ways to realize something like this.
  - We could just check the number of threads at runtime, e.g. end up having something like

        julia> function foo_maybethreaded!(dest, src)
                   if Threads.nthreads() == 1
                       foo_serial!(dest, src)
                   else
                       foo_threaded!(dest, src)
                   end
               end

    This is probably okay since the loops are usually more costly than a single `if` check.
  - We could add another type parameter to our discretization structs and use that to dispatch, e.g. something like `parallelization::Val{:serial}` or `parallelization::Val{:threaded}`.
  - We could use functions returning `true` or `false` and use these to decide whether to use threading or not globally in our package, similar to `enable_debug_timing`/`disable_debug_timing` in TimerOutputs.jl
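The last two options could look roughly like the following sketch. All names in it (`Discretization`, `foo!`, `use_threads`, `enable_threading`, `disable_threading`, `foo_switchable!`) are made up for illustration and do not come from any existing package. The global switch assumes that redefining a small function via `@eval` is acceptable, in the spirit of the TimerOutputs.jl mechanism mentioned above.

```julia
# Hypothetical names throughout; this is a sketch, not an existing API.

# Option: a type parameter on the discretization struct selects the loop variant.
struct Discretization{Parallelization}
    parallelization::Parallelization
end

# Serial method, chosen when the struct carries Val(:serial).
function foo!(dest, src, ::Discretization{Val{:serial}})
    for i in eachindex(dest, src)
        @inbounds dest[i] = sin(cos(src[i])) # proxy for some operation
    end
    return nothing
end

# Threaded method, chosen when the struct carries Val(:threaded).
function foo!(dest, src, ::Discretization{Val{:threaded}})
    Threads.@threads for i in eachindex(dest, src)
        @inbounds dest[i] = sin(cos(src[i])) # proxy for some operation
    end
    return nothing
end

# Option: a package-wide switch implemented as a function returning a constant.
# Redefining it changes the behavior of callers on their next invocation.
use_threads() = true

function enable_threading()
    @eval use_threads() = true
    return nothing
end

function disable_threading()
    @eval use_threads() = false
    return nothing
end

# The branch depends only on the constant returned by `use_threads()`.
function foo_switchable!(dest, src)
    if use_threads()
        Threads.@threads for i in eachindex(dest, src)
            @inbounds dest[i] = sin(cos(src[i]))
        end
    else
        for i in eachindex(dest, src)
            @inbounds dest[i] = sin(cos(src[i]))
        end
    end
    return nothing
end
```

With the first variant, a user would choose the behavior once when constructing the discretization, e.g. `foo!(dest, src, Discretization(Val(:serial)))`. With the second, calling `disable_threading()` at the top level switches all such loops in the package at once, at the cost of recompiling the affected methods.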
What would you propose?