Why does the first variant (function foo!) allocate a large amount of memory while the second variant (function bar!) does not? The only difference is whether the loop is inside or outside the function. Is there a way to keep the loop outside the function without so many allocations?
using BenchmarkTools
function foo!(F, A)
    F[1] = A[2]
    F[2] = -A[1]
    return nothing
end
function bar!(F, A)
    for i = 1:size(A,1)
        F[i,1] = A[i,2]
        F[i,2] = -A[i,1]
    end
    return nothing
end
n = 15_000
A = rand(n,2)
F = similar(A)
@btime begin
    for i = 1:n
        @views foo!((F[i, :]), (A[i, :]))
    end
end
@btime begin
    bar!(F, A)
end
@btime begin
    for i = 1:n
        @views foo!((F[i, :]), (A[i, :]))
    end
end
n is a (non-const, untyped) global variable, so the compiler knows neither its type nor that of i, and the types involved have to be checked in every iteration. The same is true for A and F. With all three declared const (note that this still allows in-place mutation of the arrays), I get
julia> @btime begin
           for i = 1:n
               @views foo!((F[i, :]), (A[i, :]))
           end
       end
13.000 μs (0 allocations: 0 bytes)
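For reference, here is a minimal sketch of what the const setup might look like in a fresh session. It assumes the same foo! as in the question; the write to F[1, 1] is only there to illustrate that const fixes the binding, not the contents of the array. Timings will differ from machine to machine.

```julia
using BenchmarkTools

function foo!(F, A)
    F[1] = A[2]
    F[2] = -A[1]
    return nothing
end

# const tells the compiler that these names always refer to values of the same type
const n = 15_000
const A = rand(n, 2)
const F = similar(A)

# Mutating the contents of a const array is still allowed
F[1, 1] = 0.0

@btime begin
    for i = 1:n
        @views foo!(F[i, :], A[i, :])
    end
end
```

Note that a name which was already defined as a non-const global in the current session cannot be redeclared const, so this needs a fresh session.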
Alternatively, BenchmarkTools.jl also allows for interpolating global variables using $:
julia> ... # non-const n, A, F
julia> @btime begin
           for i = 1:$n
               @views foo!(($F[i, :]), ($A[i, :]))
           end
       end
11.000 μs (0 allocations: 0 bytes)
I’m not sure why we don’t need such interpolation for bar!, though. But even if we did, the types of F and A would only need to be determined once, at the single call to bar!, instead of once in every iteration, so the timing and allocation difference between the interpolated and non-interpolated versions would be much less pronounced.
Keep in mind that you probably want everything where performance matters to be in a function, because code at global scope that touches non-const globals cannot be specialized on their types. See the Performance Tips section of the Julia manual for more such suggestions.
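As a rough sketch of that advice applied to the original question: the loop can stay outside foo! as long as it lives in some function. The wrapper name apply_foo! is hypothetical and not part of the original code; everything else follows the question.

```julia
using BenchmarkTools

function foo!(F, A)
    F[1] = A[2]
    F[2] = -A[1]
    return nothing
end

# Hypothetical wrapper: the loop is outside foo! but inside a function,
# so the types of F, A and i are known when the loop is compiled.
function apply_foo!(F, A)
    for i in axes(A, 1)
        @views foo!(F[i, :], A[i, :])
    end
    return nothing
end

n = 15_000
A = rand(n, 2)
F = similar(A)

apply_foo!(F, A)            # the globals are passed in as arguments
@btime apply_foo!($F, $A)   # interpolate so the benchmark itself sees concrete types
```

With this structure n, A and F can remain ordinary non-const globals, since their types are only looked up once, at the call to apply_foo!.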