Performance of function with pre-allocated outputs

Consider the following function that mutates its vector argument in place:

function f_test!(xout::Vector{T}, a::T, b::T) where T <: Real
	xout .= a .* b
	view(xout, 1:max(1, lastindex(xout)-5)) .-= b
	return xout
end

I timed this with Julia version 1.9.0-rc3, with the following results:

ntest = 100
const xout = zeros(ntest)
@btime f_test!($xout, 1.0, 2.0)
24.293 ns (0 allocations: 0 bytes)

Inspecting the LLVM code shows that SIMD instructions are generated.

I tried wrapping the function in a constructor that creates a pre-allocated vector and returns an anonymous function, which takes only two arguments and uses the vector as hidden internal state.

function test_allocations(n::Integer)
	xout = zeros(n)
	(a, b) -> f_test!(xout, a, b)
end

And then I created an instance of this function with the same pre-allocated size as the above test.

f_alloc = test_allocations(ntest)

Timing this gave slower results, and I also noticed that the generated code didn’t contain any SIMD instructions, although other versions of the f_test! function did produce SIMD instructions here too.

@btime f_alloc(1.0, 2.0)
36.827 ns (0 allocations: 0 bytes)
@code_llvm f_alloc(1.0, 2.0)
define nonnull {}* @"julia_#1_19394"([1 x {}*]* nocapture noundef nonnull readonly align 8 dereferenceable(8) %0, double %1, double %2) #0 {
top:
  %3 = getelementptr inbounds [1 x {}*], [1 x {}*]* %0, i64 0, i64 0
  %4 = load atomic {}*, {}** %3 unordered, align 8
  %5 = call nonnull {}* @"j_f_test!_19396"({}* nonnull %4, double %1, double %2) #0
  ret {}* %5
}

Is there a way to define this constructor so that the compiled code is the same? In both cases there are zero allocations, but I gather from reading previous discussions that always passing the arguments explicitly is better. If I did want to keep pre-allocated outputs as internal state of a function like this, is there a better way of doing it than what I wrote here?

Looking to understand what is going on with this example and to have general tips on the best way to design this sort of thing. Thanks!

f_alloc is not a const, so the benchmark calls it through an untyped global variable. Interpolating it with $:

julia> @btime $f_alloc(1.0, 2.0);
  21.310 ns (0 allocations: 0 bytes)

fixes the problem for me.

If you intend to call f_alloc from other functions without passing it around as an input, you’ll need to make it a const.

const f_alloc = test_allocations(ntest)
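
To make the two options concrete, here is a minimal sketch. f_test! and test_allocations are copied from the question; the names apply_twice, apply_twice_global, f_local and f_alloc_const are made up for illustration and do not come from the thread.

```julia
# Copied from the question.
function f_test!(xout::Vector{T}, a::T, b::T) where T <: Real
    xout .= a .* b
    view(xout, 1:max(1, lastindex(xout)-5)) .-= b
    return xout
end

function test_allocations(n::Integer)
    xout = zeros(n)
    (a, b) -> f_test!(xout, a, b)
end

# Option 1 (hypothetical helper): pass the closure in as an argument.
# The method is compiled for the closure's concrete type, so it does not
# matter whether the caller's binding is const.
function apply_twice(f, a, b)
    f(a, b)
    return f(a, b)
end

# Option 2 (hypothetical helper): refer to a global directly.
# The global has to be const so its type is known when the caller is compiled.
const f_alloc_const = test_allocations(100)
apply_twice_global(a, b) = (f_alloc_const(a, b); f_alloc_const(a, b))

f_local = test_allocations(100)      # non-const is fine, it is passed as an argument
y1 = apply_twice(f_local, 1.0, 2.0)
y2 = apply_twice_global(1.0, 2.0)
```

Both variants return the closure's internal buffer, so the caller never has to handle the pre-allocated vector explicitly.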

As for the code, the main “issue” is that f_test! did not inline. You could mark it @inline.
Alternatively, just use Cthulhu.jl to descend into any functions of interest when inspecting code.
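
As a rough sketch of the first suggestion, the only change to the code from the question is the @inline annotation. The name f_inl is made up for illustration, and the IR is captured into a string with code_llvm from InteractiveUtils so that it can be inspected outside the REPL. The Cthulhu call is left as a comment because it needs the package installed and is interactive.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Same body as in the question, with an @inline hint added.
@inline function f_test!(xout::Vector{T}, a::T, b::T) where T <: Real
    xout .= a .* b
    view(xout, 1:max(1, lastindex(xout)-5)) .-= b
    return xout
end

function test_allocations(n::Integer)
    xout = zeros(n)
    (a, b) -> f_test!(xout, a, b)
end

f_inl = test_allocations(100)   # hypothetical name for the wrapped closure

# Capture the LLVM IR of the closure call as a string and print it.
io = IOBuffer()
code_llvm(io, f_inl, Tuple{Float64, Float64}; debuginfo=:none)
ir = String(take!(io))
println(ir)

# Interactive alternative, assuming Cthulhu.jl is installed:
# using Cthulhu
# @descend f_inl(1.0, 2.0)

result = f_inl(1.0, 2.0)
```

Whether the body of f_test! actually appears in the printed IR depends on the compiler honouring the hint, so it is worth checking the output rather than assuming it.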

I see, thanks. Yeah, setting it as a constant makes the timings equal. Do you know why the timing is equal even though one uses SIMD and the other doesn’t? I saw that marking it with @inline made the compiled code the same.

Without @inline, the one not using SIMD calls the one that does. Hence, they take the same amount of time.

This is why I suggested using Cthulhu. Pressing L in its interactive menu switches the view to the LLVM code, and you can then descend into f_test! to find the SIMD instructions.
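
Without Cthulhu, a cruder way to see the same thing is to print the LLVM IR of the callee f_test! itself instead of the wrapper. The sketch below assumes that vectorized code shows up as LLVM vector types such as `<4 x double>`; the regex and the variable names ir_inner and has_vector_types are illustrative only, and whether a match is found depends on the CPU and Julia version.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Copied from the question (no @inline here).
function f_test!(xout::Vector{T}, a::T, b::T) where T <: Real
    xout .= a .* b
    view(xout, 1:max(1, lastindex(xout)-5)) .-= b
    return xout
end

# Ask for the IR of the inner function directly, using its argument types.
io = IOBuffer()
code_llvm(io, f_test!, Tuple{Vector{Float64}, Float64, Float64}; debuginfo=:none)
ir_inner = String(take!(io))

# Heuristic check for LLVM vector types like <2 x double> or <4 x double>.
has_vector_types = occursin(r"<\d+ x double>", ir_inner)
println("vector types found in f_test! IR: ", has_vector_types)
```

This only inspects one level of the call chain, which is the part Cthulhu automates by letting you descend call by call.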