Hidden allocations when parallelizing with @batch

Hello guys!

I’ve hit a roadblock and need some help to overcome it. I’ve written a parallelized reduction function using Polyester.@batch. It works very well for one-dimensional arrays, but now I am extending it to work with StaticArrays. I have the following code:


using Base.Threads
using JET
using Bumper
using StrideArrays # Not necessary, but can make operations like broadcasting with Bumper.jl faster.
using Polyester
using BenchmarkTools
using LoopVectorization
using ChunkSplitters
using StaticArrays
using StructArrays

struct DimensionalData{D, T <: AbstractFloat}
    vectors::Tuple{Vararg{Vector{T}, D}}
    V::StructArray{SVector{D, T}, 1, Tuple{Vararg{Vector{T}, D}}}

    # General constructor for vectors
    function DimensionalData(vectors::Vector{T}...) where {T}
        D = length(vectors)
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(Tuple(vectors), V)
    end

    # Constructor for initializing with all zeros, adapting to dimension D
    function DimensionalData{D, T}(len::Int) where {D, T}
        vectors = ntuple(d -> zeros(T, len), D) # Create D vectors of zeros
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(vectors, V)
    end
end


function ReductionFunctionChunk!(dρdtI, I, J, drhopLp, drhopLn)
    XT = eltype(dρdtI); XL = length(dρdtI); X0 = zero(XT)
    nchunks = nthreads()  # one scratch column per thread

    @inbounds @no_escape begin
        local_X = @alloc(XT, XL, nchunks)

        fill!(local_X, X0)

        # Directly iterate over the chunks
        @batch for ichunk in 1:nchunks
            chunk_inds = getchunk(I, ichunk; n=nchunks)
            for idx in chunk_inds
                i = I[idx]
                j = J[idx]

                # Accumulate the contributions into the correct place
                local_X[i, ichunk] += drhopLp[idx]
                local_X[j, ichunk] += drhopLn[idx]
            end
        end

        # Reduction step
        # Using @tturbo here is slightly faster (~28 μs) than @batch (~30 μs), but @batch works for SVector, so we prefer it.
        @batch for ix in 1:XL
            for chunk in 1:nchunks
                dρdtI[ix] += local_X[ix, chunk]
            end
        end
    end
    
    return nothing
end

function ReductionFunctionChunkNoBatch!(dρdtI, I, J, drhopLp, drhopLn)
    XT = eltype(dρdtI); XL = length(dρdtI); X0 = zero(XT)
    nchunks = nthreads()  # one scratch column per thread

    @inbounds @no_escape begin
        local_X = @alloc(XT, XL, nchunks)

        fill!(local_X, X0)

        # Directly iterate over the chunks
        for ichunk in 1:nchunks
            chunk_inds = getchunk(I, ichunk; n=nchunks)
            for idx in chunk_inds
                i = I[idx]
                j = J[idx]

                # Accumulate the contributions into the correct place
                local_X[i, ichunk] += drhopLp[idx]
                local_X[j, ichunk] += drhopLn[idx]
            end
        end

        # Reduction step
        # Using @tturbo here is slightly faster (~28 μs) than @batch (~30 μs), but @batch works for SVector, so we prefer it.
        for ix in 1:XL
            for chunk in 1:nchunks
                dρdtI[ix] += local_X[ix, chunk]
            end
        end
    end
    
    return nothing
end


begin
    ProblemScaleFactor  = 1
    NumberOfPoints      = 6195*ProblemScaleFactor
    NumberOfIterations  = 50000*ProblemScaleFactor
    I           = rand(1:NumberOfPoints, NumberOfIterations)
    J           = I # rand(1:NumberOfPoints, NumberOfIterations)

    V           = zeros(SVector{2,Float64}, NumberOfPoints)
    VL          = rand(eltype(V), NumberOfIterations)

    VD          = DimensionalData{2, Float64}(NumberOfPoints)
    VDL         = DimensionalData{2, Float64}(NumberOfIterations)
    VDL.V      .= VL


    V .*= 0
    ReductionFunctionChunk!(V,I,J,VL,VL)
    println("Value when doing chunk svector reduction: ", sum(V))

    println(@report_opt    ReductionFunctionChunk!(V,I,J,VL,VL))
    println(@report_call   ReductionFunctionChunk!(V,I,J,VL,VL))
    #println(@code_warntype ReductionFunctionChunk!(V,I,J,VL,VL))

    VD.V .*= 0
    ReductionFunctionChunk!(VD.V,I,J,VDL.V,VDL.V)
    println("Value when doing chunk svector reduction with dimensional data: ", sum(VD.V))


    println(@report_opt    ReductionFunctionChunk!(VD.V,I,J,VDL.V,VDL.V))
    println(@report_call   ReductionFunctionChunk!(VD.V,I,J,VDL.V,VDL.V))
    #println(@code_warntype ReductionFunctionChunk!(VD.V,I,J,VDL.V,VDL.V))

    VD.V .*= 0
    ReductionFunctionChunkNoBatch!(VD.V,I,J,VDL.V,VDL.V)
    println("Value when doing chunk svector reduction with dimensional data and no @batch: ", sum(VD.V))


    println(@report_opt    ReductionFunctionChunkNoBatch!(VD.V,I,J,VDL.V,VDL.V))
    println(@report_call   ReductionFunctionChunkNoBatch!(VD.V,I,J,VDL.V,VDL.V))
    #println(@code_warntype ReductionFunctionChunkNoBatch!(VD.V,I,J,VDL.V,VDL.V))

    println("Chunk function:")
    display(@benchmark ReductionFunctionChunk!($V,$I,$J,$VL,$VL))

    println("Chunk function with dimensional data:")
    display(@benchmark ReductionFunctionChunk!($(VD.V),$I,$J,$(VDL.V),$(VDL.V)))

    println("Chunk function with dimensional data and no @batch:")
    display(@benchmark ReductionFunctionChunkNoBatch!($(VD.V),$I,$J,$(VDL.V),$(VDL.V)))

end

It outputs the results as follows:

[screenshot of the program output]

So all calculations are correct. In the benchmarking:

*Chunk function:*
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  37.900 μs … 437.500 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     55.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   55.356 μs ±   9.583 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                 ▃  ▂     █
  ▁▃▂▁▁▁▁▁▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂█▄▅█▅▆▄▂▄█▆▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁ ▂
  37.9 μs         Histogram: frequency by time         67.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

*Chunk function with dimensional data:*
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  48.000 μs … 388.700 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     78.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   72.682 μs ±  14.848 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▂                                      █ ▂▆█
  ██▂▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▆█████▆▄▃▃▂▂▂▂▂▂▂▂▂▂▂ ▃
  48 μs           Histogram: frequency by time         90.9 μs <

 Memory estimate: 176 bytes, allocs estimate: 2.

*Chunk function with dimensional data and no @batch:*
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  157.500 μs … 562.500 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     163.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   178.801 μs ±  43.349 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▄▅▄▅▄▃▂▁                                                    ▁
  ████████████▇▇▇▇▇▆▆▆▆▆▃▄▆▅▅▇▇▇▇▆▆▆▅▅▄▄▄▄▅▄▃▄▄▄▃▄▃▃▃▄▃▄▄▂▄▆▇██ █
  158 μs        Histogram: log(frequency) by time        402 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

The first benchmark shows what happens when I use a Vector{SVector{2,Float64}} directly. The second benchmark allocates; it uses my struct DimensionalData, which holds a StructArray (field V) with eltype SVector{2,Float64}. The third benchmark shows that when I remove @batch, no allocations happen.

Comparing the @code_warntype output for the first and second benchmark:

[screenshot of the @code_warntype comparison]

Everything is the same other than what is shown above.

Could anyone kindly show me how to fix this, so that I can use DimensionalData without allocating?

Kind regards


Okay, I’ve found that if I do:

    println("Chunk function:")
    K  = convert(Vector{SVector{2,Float64}},VD.V)
    KL = convert(Vector{SVector{2,Float64}},VDL.V)
    display(@benchmark ReductionFunctionChunk!($K,$I,$J,$KL,$KL))

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  38.300 μs … 359.100 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     54.800 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   52.532 μs ±  10.564 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▅                        ▇▃▃   █  ▁
  ▇█▃▁▃▄▃▆▃▇▂▁▁▁▁▁▁▂▁▁▂▁▁▁▂▃███▅▄▇█▃▃█▄▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  38.3 μs         Histogram: frequency by time         74.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

This makes it seem like some run-time dispatch is going on, where Polyester cannot properly resolve the type of the StructArray:

julia> VD.V
6195-element StructArray(::Vector{Float64}, ::Vector{Float64}) with eltype SVector{2, Float64}:

Even though it has eltype SVector{2, Float64}.

So no solution yet, but understood a bit more!


I’m confused by your usage of Bumper in combination with @batch. Is it intentional to allocate local_X outside of the @batch loop?

And if so, what’s the purpose? You could just allocate it normally?

Hi!

Yes, it is intentional. I am not really using Bumper the way it is meant to be used; I am simply using its buffer as local storage so that I do not have to pass an extra array through the function signature.
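
In other words, the pattern is roughly this (a minimal sketch of the idea; scratch_demo! and the trivial accumulation step are just stand-ins for the real kernel):

using Bumper

function scratch_demo!(out, src)
    @no_escape begin
        # The scratch array lives on Bumper’s task-local buffer; it never
        # escapes this block, so there is no GC allocation and no extra
        # array to thread through the function signature.
        scratch = @alloc(eltype(out), length(out))
        fill!(scratch, zero(eltype(out)))
        # ... accumulate into scratch (here a trivial elementwise add,
        # assuming length(src) == length(out)) ...
        scratch .+= src
        out .+= scratch
    end
    return nothing
end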

Kind regards

I see. I assume you are calling these reduction functions multiple times in your application?

Yes, indeed.

I call them very often and that is why even the smallest allocation is important. I call this reduction function perhaps over 100000 times.

Kind regards

That looks like 3 parameters, while StructArray is documented to have 4 (check typeof(VD); also note the Int64 at the end of the StructVector alias, which only hides the N = 1 parameter), so the field type is abstract. You could easily refactor to include that extra parameter. I’m pretty sure this does not affect your benchmark at all, though, because VD.V provides the concrete StructArray type prior to the benchmark call; that is also consistent with the no-@batch version’s lack of allocations. It really does look like @batch and StructArray are interacting somehow to produce those 2 small allocations.
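
One way to see the abstractness for yourself (a quick check, assuming StructArrays’ current four-parameter layout):

using StaticArrays, StructArrays

# With only three of the four parameters supplied, the annotation is a
# UnionAll, i.e. abstract when used as a field type:
T3 = StructArray{SVector{2, Float64}, 1, Tuple{Vector{Float64}, Vector{Float64}}}
isconcretetype(T3)  # false

# Supplying the fourth (index) parameter makes it concrete:
T4 = StructArray{SVector{2, Float64}, 1, Tuple{Vector{Float64}, Vector{Float64}}, Int64}
isconcretetype(T4)  # true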


Hi!

This is the full type:

julia> typeof(VD.V)
StructVector{SVector{2, Float64}, Tuple{Vector{Float64}, Vector{Float64}}, Int64} (alias for StructArray{SArray{Tuple{2}, Float64, 1, 2}, 1, Tuple{Array{Float64, 1}, Array{Float64, 1}}, Int64})

Could you explain what you mean by refactoring?

And to the last line: yes, exactly! Modifying @batch is a bit tricky; I am not much into macro programming, so I hope there is an easier fix that still lets me use my data structure, which has worked great up till now.

Kind regards

V::StructArray{SVector{D, T}, 1, Tuple{Vararg{Vector{T}, D}}}
#= to =#
V::StructArray{SVector{D, T}, 1, Tuple{Vararg{Vector{T}, D}}, I}
#= to match =#
StructArray{SArray{Tuple{2}, Float64, 1, 2}, 1, Tuple{Array{Float64, 1}, Array{Float64, 1}}, Int64}

Since Int64 doesn’t mirror any of the other parameters, I think you have to use a whole new parameter, which I called I to mirror struct StructArray{T, N, C<:Tup, I}. You would also need to add it at struct DimensionalData{D, T <: AbstractFloat, I} and in the new call. Now that I look closer, I’m not sure how you would write new, actually; the straightforward way would be new{D, T, StructArrays.index_type(V)}, like StructArray does, but index_type isn’t documented, so I don’t know whether that’s intended. Some equivalent is needed to make it feasible to put fully parameterized StructArrays in fields.

Does the @code_warntype show any poorly inferred types indicated by red, like in the Locals list or the method body? It’s also possible that the allocation isn’t from poor inference but rather something @batch does. If you want to see how @batch transforms the loop, you could @macroexpand the expression, though I’m not sure that would explain much because Polyester’s README looks complicated.
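
For example (a toy loop, just to inspect the transformation; @macroexpand only expands the macro, nothing is executed):

using Polyester

x = zeros(10)
# Expanding rather than running the macro is a cheap way to inspect the
# code @batch generates around the loop body:
expr = @macroexpand @batch for i in eachindex(x)
    x[i] += 1
end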

I filed a bug report here:

I was unfortunately not able to find a solution :blush:

Kind regards

As @Benny has said, this will result in type instabilities. It should probably be:

struct DimensionalData{D, T <: AbstractFloat}
    vectors::Tuple{Vararg{Vector{T}, D}}
    V::StructArray{SVector{D, T}, 1, Tuple{Vararg{Vector{T}, D}}, Int}
    # ... constructors as before ...
end

If you still get allocations, then you may need to PR StrideArraysCore to add explicit support for StructArrays as a pkg extension. Basically, object_and_preserve should turn the inner vectors into PtrArrays.
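
For concreteness, here is a minimal sketch of that refactor. Because the struct parameters stay {D, T} (the Int, which is StructArray’s index-type parameter for plain Vector components, is hard-coded in the field annotation), the original constructors work unchanged; whether this removes the @batch allocations is a separate question:

using StaticArrays, StructArrays

struct DimensionalData{D, T <: AbstractFloat}
    vectors::Tuple{Vararg{Vector{T}, D}}
    # Fully parameterized (concrete) field type, including the trailing
    # index-type parameter Int:
    V::StructArray{SVector{D, T}, 1, Tuple{Vararg{Vector{T}, D}}, Int}

    function DimensionalData(vectors::Vector{T}...) where {T}
        D = length(vectors)
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(Tuple(vectors), V)
    end

    function DimensionalData{D, T}(len::Int) where {D, T}
        vectors = ntuple(d -> zeros(T, len), D)
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(vectors, V)
    end
end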


Thank you both, @Elrod and @Benny!

Further specifying the type with “Int” at the end did not work.

In regards to extending StrideArraysCore, I will give it a look. I think it is a bit above what I am able to do just at the moment.

Kind regards

How did you write the struct definition, and what error was thrown by the overall program, exactly? I’m curious whether the aforementioned index_type doesn’t guarantee an Int type for I.


Hi!

Sorry for the delayed response. I don’t think I got an error as such, just the type instability mentioned before.

I ended up rewriting my code to not use StructArrays and just use StaticArrays directly. StructArrays has some nice features, but it seems it does not play well with multithreaded code.

Kind regards