Is it possible to avoid these excess allocations when using @batch or @threads?

It took some effort to come up with a short MWE for this issue. The goal is to avoid a few small allocations that appear when mutating an argument of an inner function in parallel:

MWE
using BenchmarkTools, Parameters, Polyester, Base.Threads

# Composite-type struct
@with_kw mutable struct Str
    A   :: String        = "A"
    res :: Vector{Int64} = fill(0, 10)
end


### Inner Function ###
function addone(x, idx) 
    x[idx] + 1
end


### Outer Functions ###
# Sequential
function foo(n, str)
    for i in 1:n
        if i >= n
            str.res[i] = addone(str.res, i)
        end
    end
end


# Polyester.@batch
function foobatch(n, str)
    @batch for i in 1:n
        if i >= n
            str.res[i] = addone(str.res, i)
        end
    end
end

# Threads.@threads
function foothreads(n, str)
    @threads for i in 1:n
        if i >= n
            str.res[i] = addone(str.res, i)
        end
    end
end


### Benchmarks ###
str = Str()
@btime foo(10,        x) setup = (x = deepcopy($str)) evals = 1
@btime foobatch(10,   x) setup = (x = deepcopy($str)) evals = 1
@btime foothreads(10, x) setup = (x = deepcopy($str)) evals = 1


### Check Results ###
foo(10,        str); str
foobatch(10,   str); str
foothreads(10, str); str

Benchmarks:

julia> @btime foo(10,        x) setup = (x = deepcopy($str)) evals = 1
  20.000 ns (0 allocations: 0 bytes)

julia> @btime foobatch(10,   x) setup = (x = deepcopy($str)) evals = 1
  1.410 μs (1 allocation: 32 bytes)

julia> @btime foothreads(10, x) setup = (x = deepcopy($str)) evals = 1
  34.601 μs (162 allocations: 16.81 KiB)

All functions mutate the (mutable) struct correctly:

Results
julia> foo(10,        str); str
Str
  A: String "A"
  res: Array{Int64}((10,)) [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]


julia> foobatch(10,   str); str
Str
  A: String "A"
  res: Array{Int64}((10,)) [0, 0, 0, 0, 0, 0, 0, 0, 0, 2]


julia> foothreads(10, str); str
Str
  A: String "A"
  res: Array{Int64}((10,)) [0, 0, 0, 0, 0, 0, 0, 0, 0, 3]

The issue appears to be related to an earlier topic that never got fully resolved.

Is it possible to avoid these allocations? Using @threads would be ideal for my purposes, since @batch is slower when nested inside another parallel loop, but even a solution for the @batch case would be very useful.

I am not sure there is a way to reduce the allocations with @threads, unfortunately. With @batch, allocations occur when non-isbits structs appear inside the loop body, because they cannot be converted to PtrArrays (at least as far as I understand it; @Elrod probably knows more about that).
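One quick way to see which value is the problematic one is a plain isbitstype check on the objects from the MWE above (nothing Polyester-specific):

julia> isbitstype(typeof(str))      # Str is a mutable struct => not isbits
false

julia> isbitstype(eltype(str.res))  # the Int64 elements themselves are isbits
true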

Therefore, in this case you can eliminate the single allocation by creating an alias to the Vector inside str before the loop and using that alias in the loop body:

function foobatch(n, str)
    res = str.res

    @batch for i in 1:n
        if i >= n
            res[i] = addone(res, i)
        end
    end
end

Thanks, that would explain it. Unfortunately, the real code uses (roughly) an MVector of structs that are in turn made of more structs; the innermost fields are mostly isbits types (e.g., SVectors).

The parallel loop selects elements of the MVector and passes them on (roughly like the sketch below), so I don’t see how this workaround could be applied without restructuring that and hurting readability.
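To make the description concrete, the layout is roughly like this sketch (hypothetical names and sizes):

using StaticArrays, Polyester

# Hypothetical stand-ins for the real nested types; the leaves are isbits
struct Leaf
    v::SVector{3, Float64}
end

mutable struct Container
    elems::MVector{8, Leaf}   # the MVector whose elements the loop selects
end

function process!(c::Container, n)
    # `c` is a mutable struct, hence non-isbits, and it is referenced in the loop body
    @batch for i in 1:n
        leaf = c.elems[i]   # each iteration picks an element and passes it on
        # ... inner functions work with `leaf` ...
    end
end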

I’ll mark it as solved and maybe try rearranging things armed with this new knowledge.

Do I understand correctly that a mutable struct is always non-isbits (even if all of its fields are isbits)? If so, I cannot avoid that small allocation when passing one to @batch.

help?> isbitstype(A)
  isbitstype(T)

  Return true if type T is a "plain data" type, meaning it is immutable and contains no references to other values, only primitive types and other isbitstype types. [...]

So, yes.
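For instance, with a toy pair of types (hypothetical names):

julia> mutable struct MutPoint
           x::Float64
           y::Float64
       end

julia> isbitstype(MutPoint)   # mutable => never isbits, even though every field is isbits
false

julia> struct Point
           x::Float64
           y::Float64
       end

julia> isbitstype(Point)      # immutable with only isbits fields => isbits
true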
