Temporary pre-allocated array within function is slower than non-preallocated?

I have the following code, which performs an intermediate operation and stores the final result in the output vector out using the max function. There are two ways to do this: one is to pre-allocate a temporary array to hold the intermediate values and then write the result into out (function tt); the other is to create temporary vectors inside the loop (function tt1) and store the result in out.

I thought that pre-allocating would be the best practice and the most efficient one, but it is actually much slower. Am I doing something wrong, or is it supposed to be like this?

using BenchmarkTools
using Statistics  # provides mean

function tt(X::Vector{S}, Y::Vector{S}) where S
    out = Vector{S}(undef, length(X))
    temp = Array{S,2}(undef, length(Y), 2)  # pre-allocated temporary
    for ii = eachindex(out)
        temp[:, 1] = @. X[ii]^2 + Y
        temp[:, 2] = @. X[ii]^3 + Y
        out[ii] = mean(max.(temp[:,1], temp[:,2]))
    end
    return out
end

function tt1(X::Vector{S}, Y::Vector{S}) where S
    out = Vector{S}(undef, length(X))
    for ii = eachindex(out)
        v1 = @. X[ii]^2 + Y  # fresh temporary on every iteration
        v2 = @. X[ii]^3 + Y
        out[ii] = mean(max.(v1,v2))
    end
    return out
end

@btime tt(X,Y);
@btime tt1(X,Y);#identical outputs as expected

  93.869 ms (10003 allocations: 382.01 MiB) #slower
  47.507 ms (6001 allocations: 229.12 MiB)

The line temp[:, 1] = @. X[ii]^2 + Y isn’t doing what you want. The right-hand side creates a brand new vector, and then you’re (unnecessarily) copying that new vector into temp.

Use .= to combine the assignment with the rest of the broadcasted operation to avoid that, or just put @. in front of the whole line.
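
A minimal sketch of that suggestion, assuming the same computation as tt in the question. The name tt_fused is made up for illustration, and the use of view to avoid copying the columns on the right-hand side is an extra assumption that goes beyond the advice above.

```julia
using Statistics  # provides mean

# Hypothetical variant of tt: @. in front of the whole line turns the
# assignment into .=, so the result is written into temp in place.
function tt_fused(X::Vector{S}, Y::Vector{S}) where S
    out = Vector{S}(undef, length(X))
    temp = Matrix{S}(undef, length(Y), 2)
    for ii in eachindex(out)
        @. temp[:, 1] = X[ii]^2 + Y   # same as temp[:, 1] .= X[ii]^2 .+ Y
        @. temp[:, 2] = X[ii]^3 + Y
        # view avoids copying the columns; max.(...) still allocates one vector
        out[ii] = mean(max.(view(temp, :, 1), view(temp, :, 2)))
    end
    return out
end
```

Indexing such as temp[:, 1] on the right-hand side of an expression makes a copy, which is why the sketch reads the columns through view.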

Oh, but also isn’t ii a scalar? Why do you have any broadcasting at all?

You are right, the problem was where I put the @. macro. Thanks! The broadcasting is necessary because Y is a vector while X[ii] is a scalar. In the larger program I’m writing, the computation is more complicated than this, so broadcasting is needed there as well.
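
To make the scalar-with-vector case concrete, here is a tiny sketch with made-up values; x stands in for X[ii].

```julia
x = 2.0                 # a scalar, standing in for X[ii]
Y = [1.0, 2.0, 3.0]     # a vector (illustrative values)

# @. dots every call, so this is the same as x^2 .+ Y:
# the scalar x^2 is added to each element of Y.
r = @. x^2 + Y
```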

I now realize that the difference in speed comes from this:

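# Scalar version: Y is a single number; broadcast at the call site
function ttt(X, Y)
    exp(X[1]*2 + Y^X[2])
end

# Vector version: Y is a vector; broadcast inside the function
function ttt1(X, Y)
    @. exp(X[1]*2 + Y^X[2])
end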

@btime ttt.(Ref(X),Y) #1.060 ms (5 allocations: 78.28 KiB)
@btime ttt1(X,Y) #1.047 ms (2 allocations: 78.20 KiB)

Obviously, with such simple code the difference in time is negligible, but the difference in allocations is there. In my real application, the difference in the number of allocations is huge and is the main bottleneck. Is it always better to use @. instead of relying on Ref?
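
For anyone who wants to reproduce the comparison, here is a small self-contained sketch. The input values and the buffer name buf are made up for illustration. It only shows that the two call styles compute the same values and that either one can write into a pre-allocated output with .=; it makes no claim about which one allocates less in a larger program.

```julia
# Scalar version, broadcast at the call site with Ref(X)
function ttt(X, Y)
    exp(X[1]*2 + Y^X[2])
end

# Vector version, broadcast inside the function with @.
function ttt1(X, Y)
    @. exp(X[1]*2 + Y^X[2])
end

X = [0.5, 2.0]          # illustrative parameters
Y = [1.0, 2.0, 3.0]     # illustrative data

a = ttt.(Ref(X), Y)     # Ref(X) makes broadcast treat X as a scalar
b = ttt1(X, Y)

# Writing into a pre-allocated buffer: .= stores the results in buf
# instead of creating a new output vector.
buf = similar(Y)
buf .= ttt.(Ref(X), Y)
```

When benchmarking with global variables, BenchmarkTools recommends interpolating them, as in @btime ttt1($X, $Y), so that the global lookup is not part of the measurement.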