Strange allocations during broadcasting

Can you please explain why, in the following code, the second broadcast results in 3 allocations?

using BenchmarkTools

u = rand(100)
k1 = similar(u)
k2 = similar(u)
k3 = similar(u)
k4 = similar(u)

@btime @. $u = $u + 1 * ($k1 + $k2 + $k3)
@btime @. $u = $u + 1 * ($k1 + $k2 + $k3 + $k4)
@btime @. $u = $u + ($k1 + $k2 + $k3 + $k4)
  100.137 ns (0 allocations: 0 bytes)
  143.974 ns (3 allocations: 80 bytes)
  127.223 ns (0 allocations: 0 bytes)

What is even more confusing to me is that if I swap the last two lines, there are no allocations at all.
Thank you.


Up.

Meanwhile, in Julia 1.4 there are already 4 allocations in the above code.
Just to highlight the problem, the following code

using BenchmarkTools
# import DifferentialEquations

function foo(u, k1, k2, k3, k4)
    for i=1:1000
        @. u = u + 1.0 * (k1 + k2 + k3 + k4)
    end
    return nothing
end

function main()
    u = rand(100)
    k1 = zero(u)
    k2 = zero(u)
    k3 = zero(u)
    k4 = zero(u)

    @btime foo($u, $k1, $k2, $k3, $k4)
end

main()

results in

155.048 μs (4000 allocations: 93.75 KiB)

Surprisingly, if I import DifferentialEquations (uncomment the second line), the result will be

122.759 μs (0 allocations: 0 bytes)

That is, there are no more allocations.

The above code is typical for various Runge-Kutta methods. So far, in my implementations of Runge-Kutta methods of order higher than 3, I inevitably see these broadcasting allocations. Can you please help me understand what is going on, and especially why importing DifferentialEquations solves the problem?
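For concreteness, here is a stripped-down sketch of the kind of RK4 step where this broadcast pattern shows up (f!, h, and the k buffers are placeholders, not code from an actual solver):

# Minimal RK4 step sketch; f!(k, u, t) writes the right-hand side into k.
# The intermediate `u .+ ...` expressions allocate temporaries on purpose
# to keep the sketch short; the last line is the four-term broadcast
# discussed in this thread.
function rk4_step!(f!, u, k1, k2, k3, k4, t, h)
    f!(k1, u, t)
    f!(k2, u .+ (h / 2) .* k1, t + h / 2)
    f!(k3, u .+ (h / 2) .* k2, t + h / 2)
    f!(k4, u .+ h .* k3, t + h)
    @. u = u + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
    return u
end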

It seems some heuristic is being hit, since removing the + k4 term makes the allocations go away.

Do you have any ideas where I can start to dig in?

Doing

julia> function foo(u, k1, k2, k3, k4)
           @. u = u + 1.0 * (k1 + k2 + k3 + k4)
           return nothing
       end
foo (generic function with 2 methods)

julia> @code_warntype optimize=true foo(u, k1, k2, k3, k4)

I see a

8 ─── %24  = invoke Base.Broadcast.combine_axes(1.0::Float64, %2::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(+),NTuple{4,Array{Float64,1}}})::Tuple{Union{Base.OneTo{Int64}, UnitRange{Int64}},Vararg{Any,N} where N}

which is not inferred to a concrete type. That’s the place I would probably start poking at.
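One way to poke at that call in isolation is to build the inner Broadcasted object by hand and run inference on combine_axes directly (a rough sketch):

# Rebuild the lazy Broadcasted object for k1 + k2 + k3 + k4 and check
# whether combine_axes(1.0, bc) infers to a concrete return type.
bc = Base.Broadcast.broadcasted(+, k1, k2, k3, k4)
@code_warntype Base.Broadcast.combine_axes(1.0, bc)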

Probably due to our broadcast fixes, i.e. https://github.com/SciML/DiffEqBase.jl/blob/master/src/diffeqfastbc.jl#L18-L19


Do you have a PR that upstreams those?

There wasn’t before, but https://github.com/JuliaLang/julia/pull/35260 is it. Someone should probably double check that those changes are sufficient to get rid of these allocations.

It seems this is not the case:

using BenchmarkTools

import Base.Broadcast: Broadcasted
@inline combine_axes(A, B) = broadcast_shape(broadcast_axes(A), broadcast_axes(B))
@inline check_broadcast_axes(shp, A::Union{Number, Array, Broadcasted}) = check_broadcast_shape(shp, axes(A))

function foo(u, k1, k2, k3, k4)
    for i=1:1000
        @. u = u + 1.0 * (k1 + k2 + k3 + k4)
    end
    return nothing
end

function main()
    u = rand(100)
    k1 = zero(u)
    k2 = zero(u)
    k3 = zero(u)
    k4 = zero(u)

    @btime foo($u, $k1, $k2, $k3, $k4)
end

main()

177.050 μs (4000 allocations: 93.75 KiB)

You’re shadowing and not overloading:

import Base.Broadcast: broadcast_shape, broadcast_axes, check_broadcast_shape, Broadcasted

@inline Base.Broadcast.combine_axes(A, B) = broadcast_shape(broadcast_axes(A), broadcast_axes(B))
@inline Base.Broadcast.check_broadcast_axes(shp, A::Union{Number, Array, Broadcasted}) = check_broadcast_shape(shp, axes(A))

Thank you. Now it works:

using BenchmarkTools

@inline Base.Broadcast.combine_axes(A, B) = Base.Broadcast.broadcast_shape(Base.Broadcast.broadcast_axes(A), Base.Broadcast.broadcast_axes(B))

function foo(u, k1, k2, k3, k4)
    for i=1:1000
        @. u = u + 1.0 * (k1 + k2 + k3 + k4)
    end
    return nothing
end

function main()
    u = rand(100)
    k1 = zero(u)
    k2 = zero(u)
    k3 = zero(u)
    k4 = zero(u)

    @btime foo($u, $k1, $k2, $k3, $k4)
end

main()
136.010 μs (0 allocations: 0 bytes)

If it’s not too hard, could you give a brief explanation of what this patch does?

Splatting can sometimes cause a tuple to allocate if everything doesn’t inline or infer just right. The patch adds a dedicated method for the common two-argument case so it never splats at all, and then the allocation simply can’t happen.
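The pattern looks roughly like this (a toy sketch with made-up names, not the actual Base code):

# A varargs definition covers the general case but has to carry a tuple of
# all the shapes through `foldl`; the extra two-argument method dispatches
# first for the common binary case, so no splatting tuple is ever built.
merge_shapes(shapes...) = foldl(merge_two, shapes)   # general, splatting path
merge_shapes(a, b)      = merge_two(a, b)            # binary fast path
merge_two(a, b)         = map(max, a, b)             # stand-in for broadcast_shape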


OK. Thank you.

I find it’s easiest to just do @eval Base.Broadcast begin ... end to paste these sorts of patches in interactively, without editing Base’s sources. You can also use Revise, but that can be trickier and isn’t as obvious.
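For example, to apply the patch above inside the module itself:

# Evaluate the patched definition inside the Base.Broadcast module, so it
# extends the real combine_axes instead of shadowing it in Main.
@eval Base.Broadcast begin
    @inline combine_axes(A, B) = broadcast_shape(broadcast_axes(A), broadcast_axes(B))
end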
