Is it possible to avoid these excess allocations when using @batch or @threads?

Tetrakai · November 3, 2024, 9:47pm

It took some effort to come up with a short MWE for this issue. The goal is to avoid a few small allocations when mutating a field of a struct (which is also passed to an inner function) inside a parallel loop:

MWE

using BenchmarkTools, Parameters, Polyester, Base.Threads

# Composite-type struct
@with_kw mutable struct Str
    A   :: String        = "A"
    res :: Vector{Int64} = fill(0, 10)
end


### Inner Function ###
function addone(x, idx) 
    x[idx] + 1
end


### Outer Functions ###
# Sequential
function foo(n, str)
    for i in 1:n
        if i >= n
            str.res[i] = addone(str.res, i)
        end
    end
end


# Polyester.@batch
function foobatch(n, str)
    @batch for i in 1:n
        if i >= n
            str.res[i] = addone(str.res, i)
        end
    end
end

# Threads.@threads
function foothreads(n, str)
    @threads for i in 1:n
        if i >= n
            str.res[i] = addone(str.res, i)
        end
    end
end


### Benchmarks ###
str = Str()
@btime foo(10,        x) setup = (x = deepcopy($str)) evals = 1
@btime foobatch(10,   x) setup = (x = deepcopy($str)) evals = 1
@btime foothreads(10, x) setup = (x = deepcopy($str)) evals = 1


### Check Results ###
foo(10,        str); str
foobatch(10,   str); str
foothreads(10, str); str

Benchmarks:

julia> @btime foo(10,        x) setup = (x = deepcopy($str)) evals = 1
  20.000 ns (0 allocations: 0 bytes)

julia> @btime foobatch(10,   x) setup = (x = deepcopy($str)) evals = 1
  1.410 μs (1 allocation: 32 bytes)

julia> @btime foothreads(10, x) setup = (x = deepcopy($str)) evals = 1
  34.601 μs (162 allocations: 16.81 KiB)

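All functions mutate the (mutable) struct correctly. The three calls run one after another on the same str, so the last element accumulates from 1 to 3: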

Results

julia> foo(10,        str); str
Str
  A: String "A"
  res: Array{Int64}((10,)) [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]


julia> foobatch(10,   str); str
Str
  A: String "A"
  res: Array{Int64}((10,)) [0, 0, 0, 0, 0, 0, 0, 0, 0, 2]


julia> foothreads(10, str); str
Str
  A: String "A"
  res: Array{Int64}((10,)) [0, 0, 0, 0, 0, 0, 0, 0, 0, 3]

The issue appears to be related to this topic, which never got fully resolved:

Is it possible to avoid these? Using @threads would be ideal for my purposes, since @batch is slower when nested inside another parallel loop. But even the latter would be very informative/useful.

Gianluca_Fuwa · November 3, 2024, 10:25pm

Unfortunately, I am not sure whether there is a way to reduce the allocations with @threads. With @batch, however, allocations occur when non-isbits structs are used inside the loop body, because it cannot convert those to PtrArrays (at least as far as I understand it; @Elrod probably knows more about that).

Therefore, in this case you can eliminate the single allocation by creating an alias to the Vector in str before the loop and using that one inside it:

function foobatch(n, str)
    res = str.res

    @batch for i in 1:n
        if i >= n
            res[i] = addone(res, i)
        end
    end
end
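
To check whether the same aliasing makes any difference under @threads, a single call can be measured with Base's @allocated. This is only a minimal sketch using Base alone; the name foothreads_alias! is made up for illustration. Since @threads creates tasks on every call, some allocation is expected to remain, and the exact number will depend on the thread count and Julia version.

```julia
using Base.Threads

# Hypothetical variant of foothreads that takes the vector directly,
# so the loop body never touches the mutable struct.
function foothreads_alias!(res::Vector{Int}, n)
    @threads for i in 1:n
        if i >= n
            res[i] += 1
        end
    end
    return res
end

res = zeros(Int, 10)
foothreads_alias!(res, 10)                     # warm-up call, so compilation is not measured
bytes = @allocated foothreads_alias!(res, 10)  # bytes allocated by one further call
println("bytes allocated per call: ", bytes)
```

Comparing this number with the same measurement on the original foothreads gives a rough idea of how much of the allocation comes from the struct access and how much from @threads itself.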

Tetrakai · November 3, 2024, 11:12pm

Thanks, that would explain it. Unfortunately, the real code uses (roughly) an MVector of structs that are in turn composed of more structs. At the bottom level those are mostly isbits types (e.g., SVectors).

The parallel loop is selecting elements of the MVector to pass on, so I don’t see how this solution could work without changing that and hurting the readability.

I’ll mark it as solved and maybe try rearranging things armed with this new knowledge.
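
One possible rearrangement, as a rough sketch only: keep the per-element data in immutable isbits structs stored in a plain Vector, and alias that vector before the loop. Cell, Grid, bump and update! are hypothetical names, and the loop uses @threads purely so the snippet runs with Base alone. Whether swapping in @batch removes the allocation in the real MVector-based code is untested.

```julia
using Base.Threads

# Hypothetical isbits element type: immutable, with only isbits fields.
struct Cell
    pos   :: NTuple{3,Float64}
    count :: Int
end

# Hypothetical mutable container, playing the role of `Str` in the MWE.
mutable struct Grid
    name  :: String
    cells :: Vector{Cell}
end

# "Mutate" an immutable element by building an updated copy.
bump(c::Cell) = Cell(c.pos, c.count + 1)

function update!(grid::Grid)
    cells = grid.cells              # alias taken outside the parallel loop
    @threads for i in eachindex(cells)
        cells[i] = bump(cells[i])   # each iteration writes only its own slot
    end
    return grid
end

grid = Grid("demo", [Cell((0.0, 0.0, 0.0), 0) for _ in 1:4])
update!(grid)
```

The loop body then only sees a Vector of isbits elements, which is the situation the alias trick above relies on.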

Tetrakai · November 8, 2024, 10:24am

Do I understand correctly that a mutable struct is always non-isbits (even if all of its fields are isbits), and that I therefore cannot avoid that small allocation when passing one to @batch?

carstenbauer · November 8, 2024, 10:37am

help?> isbitstype(A)
  isbitstype(T)

  Return true if type T is a "plain data" type, meaning it is immutable and contains no references to other values, only primitive types and other isbitstype types. [...]

So, yes.
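
This can be checked quickly in the REPL. The types below are hypothetical and exist only for illustration:

```julia
# Immutable struct with only primitive fields
struct ImmutablePoint
    x :: Float64
    y :: Float64
end

# Same fields, but declared mutable
mutable struct MutablePoint
    x :: Float64
    y :: Float64
end

# Immutable, but holds a reference to a heap-allocated Vector
struct HoldsVector
    v :: Vector{Int}
end

isbitstype(ImmutablePoint)   # true
isbitstype(MutablePoint)     # false: mutable, even though all fields are isbits
isbitstype(HoldsVector)      # false: contains a reference to another value
```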
