Compiler Can't Optimize Away Unnecessary Memory Allocs with Convenience Variables

I have some code doing broadcasted operations, and it takes ~2 microseconds, which is great:

ΔU2 += sum((Δuᵢ .* u .* @view F2[:,i]) .+ (Δuⱼ .* u .* @view F2[:,j]) .+ (Δuₖ .* u .* @view F2[:,k]))

However, this is entirely unreadable in my opinion. My original code had extra variables defined for some parts of this operation:

uᵢrow = Δuᵢ .* u
uⱼrow = Δuⱼ .* u
uₖrow = Δuₖ .* u

These variables are never used again in the function and were only created for readability. If I benchmark the two cases, the one where I put everything on one line is faster because the compiler did not have to allocate new memory for the intermediate variables. Is this an intended feature of the compiler or just an existing limitation? Is there some way to keep these variables for readability without taking a performance hit?


You could use the macros @views and @. to avoid repeated view macros and broadcasting dots.

Maybe this would fit your style

sum(@views @. u * ((Δuᵢ * F2[:,i]) + (Δuⱼ * F2[:,j]) + (Δuₖ * F2[:,k])))

What you propose is a kind of lazy evaluation of arrays. There are packages that do this, such as LazyArrays.jl (https://github.com/JuliaArrays/LazyArrays.jl), which provides a @~ macro to denote lazy evaluation.
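
A rough sketch along those lines, assuming Δuᵢ, Δuⱼ, Δuₖ are scalars and F2i, F2j, F2k are column views (e.g. F2i = view(F2, :, i)); untested, so check the LazyArrays README for the exact API:

using LazyArrays

# @~ builds a lazy broadcast object instead of allocating an Array,
# so the "convenience variables" below cost nothing by themselves.
uᵢrow = @~ Δuᵢ .* u
uⱼrow = @~ Δuⱼ .* u
uₖrow = @~ Δuₖ .* u

# LazyArray wraps the fused lazy expression as an AbstractArray,
# so sum can iterate it without materializing an intermediate result.
ΔU2 += sum(LazyArray(@~ uᵢrow .* F2i .+ uⱼrow .* F2j .+ uₖrow .* F2k))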

Yeah that is definitely prettier and good enough for what I’m doing.

Does the Julia compiler not check stuff like this though? My experience with C++ compilers is that they make this optimization for the user.

a = b .* c semantically has to materialize the broadcast into a result, in this case an Array. Since creating an Array crosses an FFI barrier (unfortunately), it is harder than usual for the compiler to optimize the allocation onto the stack instead of having it tracked by the GC.
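
A tiny illustration with dummy data (sizes chosen arbitrarily):

using BenchmarkTools

b = randn(1000); c = randn(1000);

@btime $b .* $c;        # 1 allocation: the 1000-element result Array (~7.94 KiB) is heap-allocated
@btime sum($b .* $c);   # still 1 allocation, because the broadcast materializes before sum sees it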

You can create un-materialized broadcasting (“lazy broadcasting”) via the Broadcast.broadcasted function. Note that this functionality appears to be internal so this usage may break at some future time, although it is used as an integral part of lowering (see @code_lowered sin.(x) .* 2, for example), so there’s a good chance it remains mostly stable.

Unlike Iterators.map, the resulting Broadcast.Broadcasted object should fuse with further broadcast operations.

julia> x = zeros(5); z = similar(x);

julia> ym = Iterators.map(sin, x)
Base.Generator{Vector{Float64}, typeof(sin)}(sin, [0.0, 0.0, 0.0, 0.0, 0.0])

julia> yb = Broadcast.broadcasted(sin, x)
Base.Broadcast.Broadcasted(sin, ([0.0, 0.0, 0.0, 0.0, 0.0],))

julia> using BenchmarkTools

julia> @btime $z .= $ym .* 2;
  38.042 ns (1 allocation: 96 bytes)

julia> @btime $z .= $yb .* 2;
  14.729 ns (0 allocations: 0 bytes)

You can also write a loop, which might be clearer, and intermediate scalars will be optimized out.
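
Inside your function that could look roughly like this (a sketch assuming Δuᵢ, Δuⱼ, Δuₖ are scalars and i, j, k index columns of F2); the named intermediates become plain scalars, so they stay in registers and are never allocated:

acc = zero(eltype(u))
for n in eachindex(u)
    uᵢrow = Δuᵢ * u[n]   # readability variables, but now per-element scalars
    uⱼrow = Δuⱼ * u[n]
    uₖrow = Δuₖ * u[n]
    acc += uᵢrow * F2[n, i] + uⱼrow * F2[n, j] + uₖrow * F2[n, k]
end
ΔU2 += acc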

From the look of your initial expression, it appears that your broadcast-based expression might be doing redundant work. It’s impossible to say without knowing what types your variables actually are.

I’ll assume F2 is a 2-d array. Here are some example implementations that may be more efficient:

If your Δu variables are scalar and u is a vector, you can write this as ΔU2 += @views Δuᵢ * (u' * F2[:,i]) + Δuⱼ * (u' * F2[:,j]) + Δuₖ * (u' * F2[:,k]). If u has additional dimensions you can precompute vec(sum(u; dims=2:ndims(u))) and use that instead.

If your Δu values are scalar and u is a vector that does not change values very often compared to i,j,k, you can precompute uF2 = u' * F2 and then ΔU2 += Δuᵢ * uF2[i] + Δuⱼ * uF2[j] + Δuₖ * uF2[k]. If u has more dimensions, you can do the same reduction as suggested above.

Regardless, as others have suggested you can do this whole thing as a loop (or a function argument passed to sum, etc) to avoid any allocations at all from the broadcast.
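
The "function argument passed to sum" variant might look like this (same assumptions as above; it also avoids materializing the intermediate array):

ΔU2 += sum(n -> u[n] * (Δuᵢ * F2[n, i] + Δuⱼ * F2[n, j] + Δuₖ * F2[n, k]), eachindex(u))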

Yeah, this is faster and gets rid of all the allocations. I guess the BLAS implementation of the dot product is faster than sum. Unfortunately u changes every time I call this function, so I don't think I can do much more optimization on this.

It’s very likely not related to BLAS, but is due to avoiding materialization of an intermediate array.

Can you divulge the types of your data? Is Δuᵢ scalar? Are the values floats or Ints?

If you want better help you should provide runnable code with actual dummy data.

Oh yeah, all your assumptions were correct:
Δuᵢ is a Float64 and changes
u is an N×1 Vector{Float64} and changes
F2 is an N×N Matrix{Float64} and is constant

I’m not sure exactly how you want to call your function, but here’s one way:

function foo(Δu, u, F2)
    (Δuᵢ, Δuⱼ, Δuₖ) = Δu
    (F2i, F2j, F2k) = F2
    return sum((Δuᵢ .* u .* F2i) .+ (Δuⱼ .* u .* F2j) .+ (Δuₖ .* u .* F2k))
end

function bar(Δu, u, F2)
    (Δuᵢ, Δuⱼ, Δuₖ) = Δu
    (F2i, F2j, F2k) = F2
    val = zero(eltype(u))  # should actually be more careful with this
    for n in eachindex(u, F2i, F2j, F2k)
        val += u[n] * (Δuᵢ * F2i[n] + Δuⱼ * F2j[n] + Δuₖ * F2k[n])
    end
    return val
end

using LoopVectorization

function tbar(Δu, u, F2)
    (Δuᵢ, Δuⱼ, Δuₖ) = Δu
    (F2i, F2j, F2k) = F2
    val = zero(eltype(u))
    @turbo for n in eachindex(u, F2i, F2j, F2k)
        val += u[n] * (Δuᵢ * F2i[n] + Δuⱼ * F2j[n] + Δuₖ * F2k[n])
    end
    return val
end


Δu = (randn(), randn(), randn());
u = randn(1000);
F = randn(1000, 10);
(i, j, k) = (2, 5, 8);

julia> @btime @views foo($Δu, $u, ($F[:, $i], $F[:, $j], $F[:, $k]))
  2.200 μs (1 allocation: 7.94 KiB)
51.65534551576206

julia> @btime @views bar($Δu, $u, ($F[:, $i], $F[:, $j], $F[:, $k]))
  1.200 μs (0 allocations: 0 bytes)
51.65534551576204

julia> @btime @views tbar($Δu, $u, ($F[:, $i], $F[:, $j], $F[:, $k]))
  242.469 ns (0 allocations: 0 bytes)
51.65534551576206

Could you explain broadcasting vs. for loops with @turbo? I thought LoopVectorization was more or less equivalent to using broadcasting. Does broadcasting not take advantage of SIMD?

My current implementation is ~550 ns, but I'm not passing the F matrix as a view, which I probably should do.

Thanks!

In the broadcast version specifically, it has to materialize the array before doing the sum.

(You can use @turbo on broadcasted operations as well, by the way.)

Ok cool. @turbo didn't really make a difference in the speed. My current method has no allocations, so as long as the broadcasting is using SIMD I don't think I can get much faster without parallelization.

@turbo is more aggressive about vectorization at the expense of compilation time (for now?). Thus, in some situations it does get faster than the compiler's own SIMD. LoopVectorization also has the @tturbo macro, which parallelizes the loop (when it can), but here on my machine it is not faster; I think in this example the memory accesses are rate-limiting.

For instance, this is possible:

julia> function foo(Δu, u, F2, buff)
           (Δuᵢ, Δuⱼ, Δuₖ) = Δu
           (F2i, F2j, F2k) = F2
           @turbo @. buff = (Δuᵢ * u * F2i) + (Δuⱼ * u * F2j) + (Δuₖ * u * F2k)
           return sum(buff)
       end
foo (generic function with 2 methods)

julia> @btime @views foo($Δu, $u, ($F[:, $i], $F[:, $j], $F[:, $k]), $(similar(u)))
  383.355 ns (0 allocations: 0 bytes)
48.42891544451572

But in these MWEs, it is still slower than the tbar option where the sum is fused:

julia> @btime @views tbar($Δu, $u, ($F[:, $i], $F[:, $j], $F[:, $k]))
  184.815 ns (0 allocations: 0 bytes)
48.428915444515724

It seems that 184 ns is roughly the cost of running over an array of that size using SIMD.
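
For reference, the @tturbo variant mentioned above is just the tbar loop with the macro swapped (a sketch; whether it beats @turbo depends on N and on memory bandwidth):

function ttbar(Δu, u, F2)
    (Δuᵢ, Δuⱼ, Δuₖ) = Δu
    (F2i, F2j, F2k) = F2
    val = zero(eltype(u))
    @tturbo for n in eachindex(u, F2i, F2j, F2k)   # threaded + SIMD version of the @turbo loop
        val += u[n] * (Δuᵢ * F2i[n] + Δuⱼ * F2j[n] + Δuₖ * F2k[n])
    end
    return val
end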