I am trying to reduce memory allocations in a function through preallocations. However, implementing the Perfomance Tips still leaves a significant amount of probably unnecessary allocations, which I can circumvent if I write the code outside of the function. I have reduced the code to this trivial example:

julia> n = 10;
julia> T = UInt32;
julia> M = rand(T, (n,n));
julia> V = zeros(T, n);
julia> function f(T, M::Matrix, n::Int64)
V = zeros(T, n);
@views @inbounds for i = 1:n
V .= view(M,:, i);
end
end;
julia> @btime f($T,$M,$n)
400.000 ns (11 allocations: 576 bytes)
julia> @btime for i = 1:$n
copyto!($V, view($M,:, $i));
end
48.927 ns (0 allocations: 0 bytes)

The same behaviour is achieved when replacing the .= by copyto!. An even simpler example is the following

julia> f(M::Matrix) = [view(M,:,i) for i = 1:n];
julia> @btime f(M);
143.457 ns (4 allocations: 576 bytes)
julia> @btime [view($M,:,$i) for i = 1:$n];
59.440 ns (1 allocation: 496 bytes)

Whatâ€™s the origin of the additional allocations and how can they be avoided?

Since this is quite a basic question, I assume there is some documentation or discussion on this, but I havenâ€™t been lucky so far.

In the first function you have two things going on:

You actually explicitly create a vector inside the function. Thatâ€™s an allocation right there. If you want to avoid this, create the vector outside, and pass it into the function.

The second thing is the use of the parameter T. The function will not specialize to the type UInt32. T is a value, not a type in this case, and the compiler only sees it as a DataType. If you want specialization, write

Thanks so much! I am aware of the explicit allocation, however I didnâ€™t expect so many different allocations to happen. In my actual use case, I am bound by the allocations that occur in every iteration of the loop. So maybe this example is better suited to demonstrate my issue:

julia> function f(::Type{T}, M::Matrix, n::Int64, k::Int64)
V = zeros(T, n);
@inbounds for _ = 1:k
copyto!(V, view(M,:, 1));
end
end;
julia> n1 = 1000;
julia> n2= 100_000;
julia> @btime f($T,$M,$n,n1)
13.041 ÎĽs (1001 allocations: 46.97 KiB)
julia> @btime f($T,$M,$n,n2)
1.311 ms (100001 allocations: 4.58 MiB)
julia> @btime @inbounds for _ = 1:$n1
copyto!($V, view($M,:, 1));
end
4.744 ÎĽs (0 allocations: 0 bytes)
julia> @btime @inbounds for _ = 1:$n2
copyto!($V, view($M,:, 1));
end
473.041 ÎĽs (0 allocations: 0 bytes)

Iâ€™m not at my computer, and must admit I donâ€™t immediately see the cause. I notice you didnâ€™t interpolate n1 and n2, but I doubt that is the cause.

(Aside: does it not make more sense to pass T as part of Mâ€™s signature?)

Yes of course, youâ€™re right that solved the issue in the trivial example. Based on that, I have also tried to narrow it down and outsourced the allocation of V and pass it as an argument instead:

julia> n = 10;
julia> T = UInt32;
julia> M = rand(T, (n,n));
julia> V = zeros(T, n);
julia> function f1(::Type{T}, M::Matrix{T}, n::Int64, k::Int64)
V::Vector{T} = zeros(T, n);
for _ = 1:k
copyto!(V, view(M,:, 1));
end
end;
julia> function f2(::Type{T}, M::Matrix{T}, V::Vector{T}, n::Int64, k::Int64)
for _ = 1:k
copyto!(V, view(M,:, 1));
end
end;
julia> n2= 100_000;
julia> @btime f1($T,$M,$n,$n2)
1.286 ms (100002 allocations: 4.58 MiB)
julia> @btime f2($T,$M,$V,$n,$n2)
473.208 ÎĽs (1 allocation: 16 bytes)

So it appears that V gets reallocated in every iteration of the loop, but I donâ€™t understand why? It is probably related to scoping but shouldnâ€™t V be still in scope?

would not be allowed, and you would get ERROR: UndefVarError: T not defined. You need to tell Julia that T is a type variable or parameter, which you do by appending where {T}, which basically means â€śwhere T is a type parameterâ€ť.

If you donâ€™t have the where clause, Julia will try to look up what T means, and in this case it was already defined earlier in your code as T = UInt32. This will then be used, and your slow function defnition is (almost):

function f3(::Type{UInt32}, M::Matrix{UInt32}, n::Int64, k::Int64)

except the type parameter is stored in the variable T, which happens to be a non-const global variable. So, firstly, your definition will only work for UInt32, and, secondly, it will be slow.

Adding where {T} tells Julia that T is a parameter, not a particular global value.

The where { T } bit can also be written as where { T <: Any }. Youâ€™re essentially telling the Julia compiler that you want to allow a different version of the function to be compiled for any element type of the incoming matrix.