Compiler Can't Optimize Away Unnecessary Memory Allocs with Convenience Variables

I have some code doing broadcasted operations, and it takes ~2 microseconds, which is great:

ΔU2 += sum((Δuᵢ .* u .* @view F2[:,i]) .+ (Δuⱼ .* u .* @view F2[:,j]) .+ (Δuₖ .* u .* @view F2[:,k]))

However, this is entirely unreadable in my opinion. My original code had extra variables defined for some parts of this operation:

uᵢrow = Δuᵢ .* u
uⱼrow = Δuⱼ .* u
uₖrow = Δuₖ .* u

These variables are never used again in the function and were only created for readability. If I benchmark the two cases, the one where I put everything on one line is faster because the compiler did not have to allocate new memory for the intermediate variables. Is this an intended feature of the compiler or just an existing limitation? Is there some way to keep these variables for readability without taking a performance hit?


You could use the macros @views and @. to avoid repeated view macros and broadcasting dots.

Maybe this would fit your style

sum(@views @. u * ((Δuᵢ * F2[:,i]) + (Δuⱼ * F2[:,j]) + (Δuₖ * F2[:,k])))

What you propose is a kind of lazy evaluation of arrays. There are packages that do this, such as LazyArrays.jl (https://github.com/JuliaArrays/LazyArrays.jl), which provides a @~ macro to denote lazy evaluation.
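
A rough sketch along those lines, assuming Δuᵢ, Δuⱼ, Δuₖ are scalars and F2i, F2j, F2k are column views (e.g. F2i = view(F2, :, i)); untested, so check the LazyArrays README for the exact API:

using LazyArrays

# @~ builds a lazy broadcast object instead of allocating an Array,
# so the "convenience variables" below cost nothing by themselves.
uᵢrow = @~ Δuᵢ .* u
uⱼrow = @~ Δuⱼ .* u
uₖrow = @~ Δuₖ .* u

# LazyArray wraps the fused lazy expression as an AbstractArray,
# so sum can iterate it without materializing an intermediate result.
ΔU2 += sum(LazyArray(@~ uᵢrow .* F2i .+ uⱼrow .* F2j .+ uₖrow .* F2k))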

Yeah that is definitely prettier and good enough for what I’m doing.

Does the Julia compiler not check stuff like this though? My experience with C++ compilers is that they make this optimization for the user.

a = b .* c semantically has to materialize the broadcast into a result, in this case an Array. Since creating an Array crosses an FFI barrier (unfortunately), it is harder than usual for the compiler to optimize the allocation onto the stack instead of having it tracked by the GC.
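
A tiny illustration with dummy data (sizes chosen arbitrarily):

using BenchmarkTools

b = randn(1000); c = randn(1000);

@btime $b .* $c;        # 1 allocation: the 1000-element result Array (~7.94 KiB) is heap-allocated
@btime sum($b .* $c);   # still 1 allocation, because the broadcast materializes before sum sees it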

You can create un-materialized broadcasting (“lazy broadcasting”) via the Broadcast.broadcasted function. Note that this functionality appears to be internal so this usage may break at some future time, although it is used as an integral part of lowering (see @code_lowered sin.(x) .* 2, for example), so there’s a good chance it remains mostly stable.

Unlike Iterators.map, the resulting Broadcast.Broadcasted object should fuse with further broadcast operations.

julia> x = zeros(5); z = similar(x);

julia> ym = Iterators.map(sin, x)
Base.Generator{Vector{Float64}, typeof(sin)}(sin, [0.0, 0.0, 0.0, 0.0, 0.0])

julia> yb = Broadcast.broadcasted(sin, x)
Base.Broadcast.Broadcasted(sin, ([0.0, 0.0, 0.0, 0.0, 0.0],))

julia> using BenchmarkTools

julia> @btime $z .= $ym .* 2;
  38.042 ns (1 allocation: 96 bytes)

julia> @btime $z .= $yb .* 2;
  14.729 ns (0 allocations: 0 bytes)

You can also write a loop, which might be clearer, and intermediate scalars will be optimized out.
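
Inside your function that could look roughly like this (a sketch assuming Δuᵢ, Δuⱼ, Δuₖ are scalars and i, j, k index columns of F2); the named intermediates become plain scalars, so they stay in registers and are never allocated:

acc = zero(eltype(u))
for n in eachindex(u)
    uᵢrow = Δuᵢ * u[n]   # readability variables, but now per-element scalars
    uⱼrow = Δuⱼ * u[n]
    uₖrow = Δuₖ * u[n]
    acc += uᵢrow * F2[n, i] + uⱼrow * F2[n, j] + uₖrow * F2[n, k]
end
ΔU2 += acc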

From the look of your initial expression, it appears that your broadcast-based expression might be doing redundant work. It’s impossible to say without knowing what types your variables actually are.

I’ll assume F2 is a 2-d array. Here are some example implementations that may be more efficient:

If your Δu variables are scalar and u is a vector, you can write this as ΔU2 += @views Δuᵢ * (u' * F2[:,i]) + Δuⱼ * (u' * F2[:,j]) + Δuₖ * (u' * F2[:,k]). If u has additional dimensions you can precompute vec(sum(u; dims=2:ndims(u))) and use that instead.

If your Δu values are scalar and u is a vector that does not change values very often compared to i,j,k, you can precompute uF2 = u' * F2 and then ΔU2 += Δuᵢ * uF2[i] + Δuⱼ * uF2[j] + Δuₖ * uF2[k]. If u has more dimensions, you can do the same reduction as suggested above.

Regardless, as others have suggested you can do this whole thing as a loop (or a function argument passed to sum, etc) to avoid any allocations at all from the broadcast.
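
The "function argument passed to sum" variant might look like this (same assumptions as above; it also avoids materializing the intermediate array):

ΔU2 += sum(n -> u[n] * (Δuᵢ * F2[n, i] + Δuⱼ * F2[n, j] + Δuₖ * F2[n, k]), eachindex(u))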

Yeah, this is faster and gets rid of all the allocations. I guess the BLAS implementation of the dot product is faster than sum. Unfortunately u changes every time I call this function, so I don't think I can do much more optimization on this.

It’s very likely not related to BLAS, but is due to avoiding materialization of an intermediate array.

Can you divulge the types of your data? Is Δuᵢ scalar? Are the values floats or Ints?

If you want better help you should provide runnable code with actual dummy data.

Oh yeah, all your assumptions were correct:
Δuᵢ is a Float64 and changes
u is an N×1 Vector{Float64} and changes
F2 is an N×N Matrix{Float64} and is constant

I’m not sure exactly how you want to call your function, but here’s one way:

function foo(Δu, u, F2)
    (Δuᵢ, Δuⱼ, Δuₖ) = Δu
    (F2i, F2j, F2k) = F2
    return sum((Δuᵢ .* u .* F2i) .+ (Δuⱼ .* u .* F2j) .+ (Δuₖ .* u .* F2k))
end

function bar(Δu, u, F2)
    (Δuᵢ, Δuⱼ, Δuₖ) = Δu
    (F2i, F2j, F2k) = F2
    val = zero(eltype(u))  # should actually be more careful with this
    for n in eachindex(u, F2i, F2j, F2k)
        val += u[n] * (Δuᵢ * F2i[n] + Δuⱼ * F2j[n] + Δuₖ * F2k[n])
    end
    return val
end

using LoopVectorization

function tbar(Δu, u, F2)
    (Δuᵢ, Δuⱼ, Δuₖ) = Δu
    (F2i, F2j, F2k) = F2
    val = zero(eltype(u))
    @turbo for n in eachindex(u, F2i, F2j, F2k)
        val += u[n] * (Δuᵢ * F2i[n] + Δuⱼ * F2j[n] + Δuₖ * F2k[n])
    end
    return val
end


Δu = (randn(), randn(), randn());
u = randn(1000);
F = randn(1000, 10);
(i, j, k) = (2, 5, 8);

julia> @btime @views foo($Δu, $u, ($F[:, $i], $F[:, $j], $F[:, $k]))
  2.200 μs (1 allocation: 7.94 KiB)
51.65534551576206

julia> @btime @views bar($Δu, $u, ($F[:, $i], $F[:, $j], $F[:, $k]))
  1.200 μs (0 allocations: 0 bytes)
51.65534551576204

julia> @btime @views tbar($Δu, $u, ($F[:, $i], $F[:, $j], $F[:, $k]))
  242.469 ns (0 allocations: 0 bytes)
51.65534551576206

Could you explain broadcasting vs. for loops with @turbo? I thought LoopVectorization was more or less equivalent to using broadcasting. Does broadcasting not take advantage of SIMD?

My current implementation is ~550 ns, but I'm not passing the F matrix as a view, which I probably should do.

Thanks!

In the broadcast version specifically, it has to materialize the array before doing the sum.

(You can use @turbo on broadcasted operations as well, by the way.)

Ok cool. @turbo didn't really make a difference in the speed. My current method has no allocations, so as long as the broadcasting is using SIMD I don't think I can get much faster without parallelization.

@turbo is more aggressive about vectorization at the expense of compilation time (for now?). Thus, in some situations it does get faster than the compiler's own SIMD. LoopVectorization also has the @tturbo macro, which parallelizes the loop (when it can), but here on my machine it is not faster; I think in this example the memory accesses are rate-limiting.

For instance, this is possible:

julia> function foo(Δu, u, F2, buff)
           (Δuᵢ, Δuⱼ, Δuₖ) = Δu
           (F2i, F2j, F2k) = F2
           @turbo @. buff = (Δuᵢ * u * F2i) + (Δuⱼ * u * F2j) + (Δuₖ * u * F2k)
           return sum(buff)
       end
foo (generic function with 2 methods)

julia> @btime @views foo($Δu, $u, ($F[:, $i], $F[:, $j], $F[:, $k]), $(similar(u)))
  383.355 ns (0 allocations: 0 bytes)
48.42891544451572

But in these MWEs, it is still slower than the tbar option where the sum is fused:

julia> @btime @views tbar($Δu, $u, ($F[:, $i], $F[:, $j], $F[:, $k]))
  184.815 ns (0 allocations: 0 bytes)
48.428915444515724

It seems that 184 ns is roughly the cost of running over an array of that size using SIMD.
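
For reference, the @tturbo variant mentioned above is just the tbar loop with the macro swapped (a sketch; whether it beats @turbo depends on N and on memory bandwidth):

function ttbar(Δu, u, F2)
    (Δuᵢ, Δuⱼ, Δuₖ) = Δu
    (F2i, F2j, F2k) = F2
    val = zero(eltype(u))
    @tturbo for n in eachindex(u, F2i, F2j, F2k)   # threaded + SIMD version of the @turbo loop
        val += u[n] * (Δuᵢ * F2i[n] + Δuⱼ * F2j[n] + Δuₖ * F2k[n])
    end
    return val
end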