Strange allocations during broadcasting

Can you please explain why, in the following code, the second broadcast results in 3 allocations?

using BenchmarkTools

u = rand(100)
k1 = similar(u)
k2 = similar(u)
k3 = similar(u)
k4 = similar(u)

@btime @. $u = $u + 1 * ($k1 + $k2 + $k3)
@btime @. $u = $u + 1 * ($k1 + $k2 + $k3 + $k4)
@btime @. $u = $u + ($k1 + $k2 + $k3 + $k4)
  100.137 ns (0 allocations: 0 bytes)
  143.974 ns (3 allocations: 80 bytes)
  127.223 ns (0 allocations: 0 bytes)

What is even more confusing to me is that if I swap the last two lines, there are no allocations at all.
Thank you.


Up.

Meanwhile, in Julia 1.4 there are already 4 allocations in the above code.
Just to highlight the problem, the following code

using BenchmarkTools
# import DifferentialEquations

function foo(u, k1, k2, k3, k4)
    for i=1:1000
        @. u = u + 1.0 * (k1 + k2 + k3 + k4)
    end
    return nothing
end

function main()
    u = rand(100)
    k1 = zero(u)
    k2 = zero(u)
    k3 = zero(u)
    k4 = zero(u)

    @btime foo($u, $k1, $k2, $k3, $k4)
end

main()

results in

155.048 μs (4000 allocations: 93.75 KiB)

Surprisingly, if I import DifferentialEquations (uncomment the second line), the result will be

122.759 μs (0 allocations: 0 bytes)

That is, there are no more allocations.

The above code is typical for various Runge-Kutta methods. So far, in my implementations of Runge-Kutta methods of order higher than 3, I inevitably see these broadcasting allocations. Can you please help me understand what is going on, and especially why importing DifferentialEquations solves the problem?
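For concreteness, here is a stripped-down sketch of the kind of RK4 step where this broadcast pattern shows up (f!, h, and the k buffers are placeholders, not code from an actual solver):

# Minimal RK4 step sketch; f!(k, u, t) writes the right-hand side into k.
# The intermediate `u .+ ...` expressions allocate temporaries on purpose
# to keep the sketch short; the last line is the four-term broadcast
# discussed in this thread.
function rk4_step!(f!, u, k1, k2, k3, k4, t, h)
    f!(k1, u, t)
    f!(k2, u .+ (h / 2) .* k1, t + h / 2)
    f!(k3, u .+ (h / 2) .* k2, t + h / 2)
    f!(k4, u .+ h .* k3, t + h)
    @. u = u + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
    return u
end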

It seems some heuristic is being hit, since removing the + k4 term makes the allocations go away.

Do you have any ideas where I can start to dig in?

Doing

julia> function foo(u, k1, k2, k3, k4)
           @. u = u + 1.0 * (k1 + k2 + k3 + k4)
           return nothing
       end
foo (generic function with 2 methods)

julia> @code_warntype optimize=true foo(u, k1, k2, k3, k4)

I see a

8 ─── %24  = invoke Base.Broadcast.combine_axes(1.0::Float64, %2::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(+),NTuple{4,Array{Float64,1}}})::Tuple{Union{Base.OneTo{Int64}, UnitRange{Int64}},Vararg{Any,N} where N}

which is not inferred to a concrete type. That’s the place I would probably start poking at.
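One way to poke at that call in isolation is to build the inner Broadcasted object by hand and run inference on combine_axes directly (a rough sketch):

# Rebuild the lazy Broadcasted object for k1 + k2 + k3 + k4 and check
# whether combine_axes(1.0, bc) infers to a concrete return type.
bc = Base.Broadcast.broadcasted(+, k1, k2, k3, k4)
@code_warntype Base.Broadcast.combine_axes(1.0, bc)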

Probably due to our broadcast fixes, i.e. https://github.com/SciML/DiffEqBase.jl/blob/master/src/diffeqfastbc.jl#L18-L19


Do you have a PR that upstreams those?

There wasn’t before, but https://github.com/JuliaLang/julia/pull/35260 is it. Someone should probably double check that those changes are sufficient to get rid of these allocations.

It seems this is not the case:

using BenchmarkTools

import Base.Broadcast: Broadcasted
@inline combine_axes(A, B) = broadcast_shape(broadcast_axes(A), broadcast_axes(B))
@inline check_broadcast_axes(shp, A::Union{Number, Array, Broadcasted}) = check_broadcast_shape(shp, axes(A))

function foo(u, k1, k2, k3, k4)
    for i=1:1000
        @. u = u + 1.0 * (k1 + k2 + k3 + k4)
    end
    return nothing
end

function main()
    u = rand(100)
    k1 = zero(u)
    k2 = zero(u)
    k3 = zero(u)
    k4 = zero(u)

    @btime foo($u, $k1, $k2, $k3, $k4)
end

main()

177.050 μs (4000 allocations: 93.75 KiB)

You’re shadowing and not overloading:

import Base.Broadcast: broadcast_shape, broadcast_axes, check_broadcast_shape, Broadcasted

@inline Base.Broadcast.combine_axes(A, B) = broadcast_shape(broadcast_axes(A), broadcast_axes(B))
@inline Base.Broadcast.check_broadcast_axes(shp, A::Union{Number, Array, Broadcasted}) = check_broadcast_shape(shp, axes(A))

Thank you. Now it works:

using BenchmarkTools

@inline Base.Broadcast.combine_axes(A, B) = Base.Broadcast.broadcast_shape(Base.Broadcast.broadcast_axes(A), Base.Broadcast.broadcast_axes(B))

function foo(u, k1, k2, k3, k4)
    for i=1:1000
        @. u = u + 1.0 * (k1 + k2 + k3 + k4)
    end
    return nothing
end

function main()
    u = rand(100)
    k1 = zero(u)
    k2 = zero(u)
    k3 = zero(u)
    k4 = zero(u)

    @btime foo($u, $k1, $k2, $k3, $k4)
end

main()
136.010 μs (0 allocations: 0 bytes)

If it’s not too hard, could you give a brief explanation of what this patch does?

Splatting can sometimes cause a tuple to allocate if everything doesn’t inline or infer just right. The patch adds a dedicated method for the common two-argument case so it never splats at all, and then the allocation simply can’t happen.
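The pattern looks roughly like this (a toy sketch with made-up names, not the actual Base code):

# A varargs definition covers the general case but has to carry a tuple of
# all the shapes through `foldl`; the extra two-argument method dispatches
# first for the common binary case, so no splatting tuple is ever built.
merge_shapes(shapes...) = foldl(merge_two, shapes)   # general, splatting path
merge_shapes(a, b)      = merge_two(a, b)            # binary fast path
merge_two(a, b)         = map(max, a, b)             # stand-in for broadcast_shape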


OK. Thank you.

I find it’s easiest to just do @eval Base.Broadcast begin ... end to paste these sorts of patches in interactively, without editing Base’s sources. You can also use Revise, but that can be trickier and isn’t as obvious.
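For example, to apply the patch above inside the module itself:

# Evaluate the patched definition inside the Base.Broadcast module, so it
# extends the real combine_axes instead of shadowing it in Main.
@eval Base.Broadcast begin
    @inline combine_axes(A, B) = broadcast_shape(broadcast_axes(A), broadcast_axes(B))
end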
