It took some effort to come up with a short MWE for this issue. The goal is to avoid a few small allocations when a field of a mutable struct, which is also passed as an argument to an inner function, is mutated inside a parallel loop:
MWE
using BenchmarkTools, Parameters, Polyester, Base.Threads
# Composite-type struct
@with_kw mutable struct Str
    A::String = "A"
    res::Vector{Int64} = fill(0, 10)
end
### Inner Function ###
function addone(x, idx)
    x[idx] + 1
end
### Outer Functions ###
# Sequential
function foo(n, str)
    for i in 1:n
        if i >= n
            str.res[i] = addone(str.res, i)
        end
    end
end
# Polyester.@batch
function foobatch(n, str)
    @batch for i in 1:n
        if i >= n
            str.res[i] = addone(str.res, i)
        end
    end
end
# Threads.@threads
function foothreads(n, str)
    @threads for i in 1:n
        if i >= n
            str.res[i] = addone(str.res, i)
        end
    end
end
### Benchmarks ###
str = Str()
@btime foo(10, x) setup = (x = deepcopy($str)) evals = 1
@btime foobatch(10, x) setup = (x = deepcopy($str)) evals = 1
@btime foothreads(10, x) setup = (x = deepcopy($str)) evals = 1
### Check Results ###
foo(10, str); str
foobatch(10, str); str
foothreads(10, str); str
Benchmarks:
julia> @btime foo(10, x) setup = (x = deepcopy($str)) evals = 1
  20.000 ns (0 allocations: 0 bytes)
julia> @btime foobatch(10, x) setup = (x = deepcopy($str)) evals = 1
  1.410 μs (1 allocation: 32 bytes)
julia> @btime foothreads(10, x) setup = (x = deepcopy($str)) evals = 1
  34.601 μs (162 allocations: 16.81 KiB)
All functions mutate the (mutable) struct correctly. The same str is reused for the three calls, so the last element goes from 1 to 2 to 3:
Results
julia> foo(10, str); str
Str
  A: String "A"
  res: Array{Int64}((10,)) [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
julia> foobatch(10, str); str
Str
  A: String "A"
  res: Array{Int64}((10,)) [0, 0, 0, 0, 0, 0, 0, 0, 0, 2]
julia> foothreads(10, str); str
Str
  A: String "A"
  res: Array{Int64}((10,)) [0, 0, 0, 0, 0, 0, 0, 0, 0, 3]
The issue appears to be related to this topic, which never got fully resolved:
Is it possible to avoid these allocations? Using @threads would be ideal for my purposes, since @batch is slower when nested inside another parallel loop. But even a way to avoid them with @batch alone would be very informative/useful.
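One variant that might be worth benchmarking is a function barrier: read `str.res` from the struct once and pass the plain `Vector{Int64}` to a kernel that contains the parallel loop, so the closure generated by the macro captures a concretely typed array rather than the mutable struct. This is only a sketch of something to try, not a confirmed fix. `StrSketch`, `foothreads_kernel!` and `foothreads_hoisted` are hypothetical names, and the struct is written without Parameters.jl to keep the example self-contained. Since `@threads` creates new tasks on every call, and creating a task allocates, this is unlikely to bring the `@threads` version down to zero allocations.

```julia
using Base.Threads

# Hypothetical stand-in for the MWE struct (plain struct, no Parameters.jl)
mutable struct StrSketch
    A::String
    res::Vector{Int64}
end

# Same inner function as in the MWE
addone(x, idx) = x[idx] + 1

# Kernel behind a function barrier: it only sees the concrete Vector{Int64}
function foothreads_kernel!(res::Vector{Int64}, n)
    @threads for i in 1:n
        if i >= n
            res[i] = addone(res, i)
        end
    end
    return nothing
end

# Outer function: the field access happens once, outside the parallel loop
foothreads_hoisted(n, str) = foothreads_kernel!(str.res, n)

s = StrSketch("A", fill(0, 10))
foothreads_hoisted(10, s)                      # warm-up call, triggers compilation
allocs = @allocated foothreads_hoisted(10, s)  # bytes allocated by the second call
println("res = ", s.res, ", bytes allocated = ", allocs)
```

The same kernel can be tried with `@batch` in place of `@threads` (after `using Polyester`) to check whether the single 32-byte allocation reported above changes.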