Hidden allocations when parallelizing with @batch

Hello guys!

I’ve hit a roadblock and need some help to overcome it. I’ve written a parallelized reduction function using Polyester.@batch. It works very well for one-dimensional arrays, but now I am extending it to work with StaticArrays. I have the following code:


using Base.Threads
using JET
using Bumper
using StrideArrays # Not necessary, but can make operations like broadcasting with Bumper.jl faster.
using Polyester
using BenchmarkTools
using LoopVectorization
using ChunkSplitters
using StaticArrays
using StructArrays

struct DimensionalData{D, T <: AbstractFloat}
    vectors::Tuple{Vararg{Vector{T}, D}}
    V::StructArray{SVector{D, T}, 1, Tuple{Vararg{Vector{T}, D}}}

    # General constructor for vectors
    function DimensionalData(vectors::Vector{T}...) where {T}
        D = length(vectors)
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(Tuple(vectors), V)
    end

    # Constructor for initializing with all zeros, adapting to dimension D
    function DimensionalData{D, T}(len::Int) where {D, T}
        vectors = ntuple(d -> zeros(T, len), D) # Create D vectors of zeros
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(vectors, V)
    end
end


function ReductionFunctionChunk!(dρdtI, I, J, drhopLp, drhopLn)
    XT = eltype(dρdtI); XL = length(dρdtI); X0 = zero(XT)
    nchunks = nthreads()  # one scratch column per thread

    @inbounds @no_escape begin
        local_X = @alloc(XT, XL, nchunks)

        fill!(local_X, X0)

        # Directly iterate over the chunks
        @batch for ichunk in 1:nchunks
            chunk_inds = getchunk(I, ichunk; n=nchunks)
            for idx in chunk_inds
                i = I[idx]
                j = J[idx]

                # Accumulate the contributions into the correct place
                local_X[i, ichunk] += drhopLp[idx]
                local_X[j, ichunk] += drhopLn[idx]
            end
        end

        # Reduction step
        # Using @tturbo here is slightly faster (~28 μs) than @batch (~30 μs), but @batch works for SVector, so we prefer it.
        @batch for ix in 1:XL
            for chunk in 1:nchunks
                dρdtI[ix] += local_X[ix, chunk]
            end
        end
    end
    
    return nothing
end

function ReductionFunctionChunkNoBatch!(dρdtI, I, J, drhopLp, drhopLn)
    XT = eltype(dρdtI); XL = length(dρdtI); X0 = zero(XT)
    nchunks = nthreads()  # one scratch column per thread

    @inbounds @no_escape begin
        local_X = @alloc(XT, XL, nchunks)

        fill!(local_X, X0)

        # Directly iterate over the chunks
        for ichunk in 1:nchunks
            chunk_inds = getchunk(I, ichunk; n=nchunks)
            for idx in chunk_inds
                i = I[idx]
                j = J[idx]

                # Accumulate the contributions into the correct place
                local_X[i, ichunk] += drhopLp[idx]
                local_X[j, ichunk] += drhopLn[idx]
            end
        end

        # Reduction step
        # Using @tturbo here is slightly faster (~28 μs) than @batch (~30 μs), but @batch works for SVector, so we prefer it.
        for ix in 1:XL
            for chunk in 1:nchunks
                dρdtI[ix] += local_X[ix, chunk]
            end
        end
    end
    
    return nothing
end


begin
    ProblemScaleFactor  = 1
    NumberOfPoints      = 6195*ProblemScaleFactor
    NumberOfIterations  = 50000*ProblemScaleFactor
    I           = rand(1:NumberOfPoints, NumberOfIterations)
    J           = I # rand(1:NumberOfPoints, NumberOfIterations)

    V           = zeros(SVector{2,Float64}, NumberOfPoints)
    VL          = rand(eltype(V), NumberOfIterations)

    VD          = DimensionalData{2, Float64}(NumberOfPoints)
    VDL         = DimensionalData{2, Float64}(NumberOfIterations)
    VDL.V      .= VL


    V .*= 0
    ReductionFunctionChunk!(V,I,J,VL,VL)
    println("Value when doing chunk svector reduction: ", sum(V))

    println(@report_opt    ReductionFunctionChunk!(V,I,J,VL,VL))
    println(@report_call   ReductionFunctionChunk!(V,I,J,VL,VL))
    #println(@code_warntype ReductionFunctionChunk!(V,I,J,VL,VL))

    VD.V .*= 0
    ReductionFunctionChunk!(VD.V,I,J,VDL.V,VDL.V)
    println("Value when doing chunk svector reduction with dimensional data: ", sum(VD.V))


    println(@report_opt    ReductionFunctionChunk!(VD.V,I,J,VDL.V,VDL.V))
    println(@report_call   ReductionFunctionChunk!(VD.V,I,J,VDL.V,VDL.V))
    #println(@code_warntype ReductionFunctionChunk!(VD.V,I,J,VDL.V,VDL.V))

    VD.V .*= 0
    ReductionFunctionChunkNoBatch!(VD.V,I,J,VDL.V,VDL.V)
    println("Value when doing chunk svector reduction with dimensional data and no @batch: ", sum(VD.V))


    println(@report_opt    ReductionFunctionChunkNoBatch!(VD.V,I,J,VDL.V,VDL.V))
    println(@report_call   ReductionFunctionChunkNoBatch!(VD.V,I,J,VDL.V,VDL.V))
    #println(@code_warntype ReductionFunctionChunkNoBatch!(VD.V,I,J,VDL.V,VDL.V))

    println("Chunk function:")
    display(@benchmark ReductionFunctionChunk!($V,$I,$J,$VL,$VL))

    println("Chunk function with dimensional data:")
    display(@benchmark ReductionFunctionChunk!($(VD.V),$I,$J,$(VDL.V),$(VDL.V)))

    println("Chunk function with dimensional data and no @batch:")
    display(@benchmark ReductionFunctionChunkNoBatch!($(VD.V),$I,$J,$(VDL.V),$(VDL.V)))

end

It outputs the results as follows:

[screenshot of the program output]

So all calculations are correct. In the benchmarking:

*Chunk function:*
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  37.900 μs … 437.500 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     55.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   55.356 μs ±   9.583 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                 ▃  ▂     █
  ▁▃▂▁▁▁▁▁▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂█▄▅█▅▆▄▂▄█▆▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁ ▂
  37.9 μs         Histogram: frequency by time         67.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

*Chunk function with dimensional data:*
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  48.000 μs … 388.700 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     78.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   72.682 μs ±  14.848 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▂                                      █ ▂▆█
  ██▂▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▆█████▆▄▃▃▂▂▂▂▂▂▂▂▂▂▂ ▃
  48 μs           Histogram: frequency by time         90.9 μs <

 Memory estimate: 176 bytes, allocs estimate: 2.

*Chunk function with dimensional data and no @batch:*
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  157.500 μs … 562.500 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     163.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   178.801 μs ±  43.349 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▄▅▄▅▄▃▂▁                                                    ▁
  ████████████▇▇▇▇▇▆▆▆▆▆▃▄▆▅▅▇▇▇▇▆▆▆▅▅▄▄▄▄▅▄▃▄▄▄▃▄▃▃▃▄▃▄▄▂▄▆▇██ █
  158 μs        Histogram: log(frequency) by time        402 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

The first benchmark shows what happens when I use a Vector{SVector{2,Float64}} directly. The second benchmark allocates; it uses my struct DimensionalData, which holds a StructArray (field V) with eltype SVector{2,Float64}. The third benchmark shows that when I remove @batch, no allocations happen.

Comparing the @code_warntype output for the first and second benchmark:

[screenshot of the @code_warntype comparison]

Everything is the same other than what is shown above.

Could anyone kindly show me how to fix this, so that I can use DimensionalData without allocating?

Kind regards


Okay, I’ve found that if I do:

    println("Chunk function:")
    K  = convert(Vector{SVector{2,Float64}},VD.V)
    KL = convert(Vector{SVector{2,Float64}},VDL.V)
    display(@benchmark ReductionFunctionChunk!($K,$I,$J,$KL,$KL))

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  38.300 μs … 359.100 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     54.800 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   52.532 μs ±  10.564 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▅                        ▇▃▃   █  ▁
  ▇█▃▁▃▄▃▆▃▇▂▁▁▁▁▁▁▂▁▁▂▁▁▁▂▃███▅▄▇█▃▃█▄▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  38.3 μs         Histogram: frequency by time         74.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

This makes it seem like some run-time dispatch is going on, where Polyester cannot properly resolve the type of the StructArray:

julia> VD.V
6195-element StructArray(::Vector{Float64}, ::Vector{Float64}) with eltype SVector{2, Float64}:

Even though it has eltype SVector{2, Float64}.

So no solution yet, but understood a bit more!


I’m confused by your usage of Bumper in combination with @batch. Is it intentional to allocate local_X outside of the @batch loop?

And if so, what’s the purpose? You could just allocate it normally?

Hi!

Yes, it is intentional. I am not really using Bumper the way it is meant to be used; I am simply using its buffer as local storage so that I do not have to pass an extra array through the function signature.
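
In other words, the pattern is roughly this (a minimal sketch of the idea; scratch_demo! and the trivial accumulation step are just stand-ins for the real kernel):

using Bumper

function scratch_demo!(out, src)
    @no_escape begin
        # The scratch array lives on Bumper’s task-local buffer; it never
        # escapes this block, so there is no GC allocation and no extra
        # array to thread through the function signature.
        scratch = @alloc(eltype(out), length(out))
        fill!(scratch, zero(eltype(out)))
        # ... accumulate into scratch (here a trivial elementwise add,
        # assuming length(src) == length(out)) ...
        scratch .+= src
        out .+= scratch
    end
    return nothing
end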

Kind regards

I see. I assume you are calling these reduction functions multiple times in your application?

Yes, indeed.

I call them very often and that is why even the smallest allocation is important. I call this reduction function perhaps over 100000 times.

Kind regards

That looks like 3 parameters, while StructArray is documented to have 4 (check typeof(VD); also note the Int64 at the end of the StructVector alias, which only hides the N = 1 parameter), so the field type is abstract. You could easily refactor to include that extra parameter. I’m pretty sure this does not affect your benchmark at all, though, because VD.V provides the concrete StructArray type prior to the benchmark call; that is also consistent with the no-@batch version’s lack of allocations. It really does look like @batch and StructArray are interacting somehow to produce those 2 small allocations.
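
One way to see the abstractness for yourself (a quick check, assuming StructArrays’ current four-parameter layout):

using StaticArrays, StructArrays

# With only three of the four parameters supplied, the annotation is a
# UnionAll, i.e. abstract when used as a field type:
T3 = StructArray{SVector{2, Float64}, 1, Tuple{Vector{Float64}, Vector{Float64}}}
isconcretetype(T3)  # false

# Supplying the fourth (index) parameter makes it concrete:
T4 = StructArray{SVector{2, Float64}, 1, Tuple{Vector{Float64}, Vector{Float64}}, Int64}
isconcretetype(T4)  # true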


Hi!

This is the full type:

julia> typeof(VD.V)
StructVector{SVector{2, Float64}, Tuple{Vector{Float64}, Vector{Float64}}, Int64} (alias for StructArray{SArray{Tuple{2}, Float64, 1, 2}, 1, Tuple{Array{Float64, 1}, Array{Float64, 1}}, Int64})

Could you explain what you mean by refactoring?

And to the last line: yes, exactly! Modifying @batch is a bit tricky; I am not much into macro programming, so I hope there is an easier fix that still lets me use my data structure, which has worked great up till now.

Kind regards

V::StructArray{SVector{D, T}, 1, Tuple{Vararg{Vector{T}, D}}}
#= to =#
V::StructArray{SVector{D, T}, 1, Tuple{Vararg{Vector{T}, D}}, I}
#= to match =#
StructArray{SArray{Tuple{2}, Float64, 1, 2}, 1, Tuple{Array{Float64, 1}, Array{Float64, 1}}, Int64}

Since Int64 doesn’t mirror any of the other parameters, I think you have to use a whole new parameter, which I called I to mirror struct StructArray{T, N, C<:Tup, I}. You would also need to add it at struct DimensionalData{D, T <: AbstractFloat, I} and in the new call. Now that I look closer, I’m not sure how you would write new, actually; the straightforward way would be new{D, T, StructArrays.index_type(V)}, like StructArray does, but index_type isn’t documented, so I don’t know whether that’s intended. Some equivalent is needed to make it feasible to put fully parameterized StructArrays in fields.

Does the @code_warntype show any poorly inferred types indicated by red, like in the Locals list or the method body? It’s also possible that the allocation isn’t from poor inference but rather something @batch does. If you want to see how @batch transforms the loop, you could @macroexpand the expression, though I’m not sure that would explain much because Polyester’s README looks complicated.
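
For example (a toy loop, just to inspect the transformation; @macroexpand only expands the macro, nothing is executed):

using Polyester

x = zeros(10)
# Expanding rather than running the macro is a cheap way to inspect the
# code @batch generates around the loop body:
expr = @macroexpand @batch for i in eachindex(x)
    x[i] += 1
end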

I filed a bug report here:

I was unfortunately not able to find a solution :blush:

Kind regards

As @Benny has said, this will result in type instabilities. It should probably be:

struct DimensionalData{D, T <: AbstractFloat}
    vectors::Tuple{Vararg{Vector{T}, D}}
    V::StructArray{SVector{D, T}, 1, Tuple{Vararg{Vector{T}, D}}, Int}
    # ... constructors as before ...
end

If you still get allocations, then you may need to PR StrideArraysCore to add explicit support for StructArrays as a pkg extension. Basically, object_and_preserve should turn the inner vectors into PtrArrays.
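
For concreteness, here is a minimal sketch of that refactor. Because the struct parameters stay {D, T} (the Int, which is StructArray’s index-type parameter for plain Vector components, is hard-coded in the field annotation), the original constructors work unchanged; whether this removes the @batch allocations is a separate question:

using StaticArrays, StructArrays

struct DimensionalData{D, T <: AbstractFloat}
    vectors::Tuple{Vararg{Vector{T}, D}}
    # Fully parameterized (concrete) field type, including the trailing
    # index-type parameter Int:
    V::StructArray{SVector{D, T}, 1, Tuple{Vararg{Vector{T}, D}}, Int}

    function DimensionalData(vectors::Vector{T}...) where {T}
        D = length(vectors)
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(Tuple(vectors), V)
    end

    function DimensionalData{D, T}(len::Int) where {D, T}
        vectors = ntuple(d -> zeros(T, len), D)
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(vectors, V)
    end
end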


Thank you both, @Elrod and @Benny!

Further specifying the type with “Int” at the end did not work.

In regards to extending StrideArraysCore, I will give it a look. I think it is a bit above what I am able to do just at the moment.

Kind regards

How did you write the struct definition, and what error was thrown by the overall program, exactly? I’m curious whether the aforementioned index_type doesn’t guarantee an Int type for I.


Hi!

Sorry for the delayed response. I don’t think I got an error as such, just the type instability mentioned before.

I ended up rewriting my code to not use StructArrays and just use StaticArrays directly. StructArrays has some nice features, but it seems it does not play well with multithreaded code.

Kind regards