Of course, your code is a bit odd in that it does the same thing at every iteration, so you could just run the body once and multiply the resulting `Arg1` by `N`. For the sake of comparison, however, I wrote another function that performs exactly the same computations.
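Just to make that shortcut concrete, here is a minimal sketch (the name `single_pass_prod` is mine, and I'm assuming the loop body matches yours, as in the version further below):

```julia
julia> using LinearAlgebra

julia> function single_pass_prod(N, n)
           vec1 = 1:n
           mat = vec1 .* vec1'
           matt1 = (mat .* mat) ./ n^2
           matt2 = (mat * mat) ./ n^2
           # each of the N identical iterations contributes dot(matt1, matt2) / N,
           # so the total is simply dot(matt1, matt2)
           return N * (dot(matt1, matt2) / N)
       end;
```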
The improved memory management comes from pre-allocating the outputs and fusing the vectorized operations.
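As a quick illustration of the fusion part (with throwaway variable names), broadcasting into a pre-allocated output with `.=` avoids creating a temporary array:

```julia
julia> x = rand(1000); y = similar(x);

julia> y = 2 .* x .+ 1;    # allocates a brand-new vector and rebinds y

julia> y .= 2 .* x .+ 1;   # fused in-place broadcast: fills y, no allocation
```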
In addition I used a few specific linear algebra functions:

- `dot(matt1, matt2)` to replace `sum(matt1 .* matt2)` (which allocates the elementwise product before summing it);
- `mul!(matt2, mat, mat)` instead of `matt2 = mat * mat`, to avoid allocating the matrix product since I already know where I'm gonna store it.

Both are illustrated on a toy example just below.
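Here is that toy example, checking both substitutions agree with the naive versions (`a`, `b`, `c` are just scratch matrices):

```julia
julia> using LinearAlgebra

julia> a, b = rand(3, 3), rand(3, 3);

julia> dot(a, b) ≈ sum(a .* b)   # same value, but the right-hand side allocates a temporary
true

julia> c = similar(a);

julia> mul!(c, a, b) == a * b    # mul! writes the product into the existing c
true
```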
```julia
julia> using BenchmarkTools, LinearAlgebra

julia> function better_vec_prod_aux!(mat, matt1, matt2, N, n)
           Arg1 = 0.0
           for i = 1:N
               vec1 = 1:n
               mat .= vec1 .* vec1'
               matt1 .= (mat .* mat) .* (1 / n^2)
               mul!(matt2, mat, mat)
               matt2 .*= 1 / n^2
               Arg1 = Arg1 + dot(matt1, matt2) / N
           end
           return Arg1
       end;

julia> function better_vec_prod(N, n)
           mat = Matrix{Float64}(undef, n, n)
           matt1 = Matrix{Float64}(undef, n, n)
           matt2 = Matrix{Float64}(undef, n, n)
           return better_vec_prod_aux!(mat, matt1, matt2, N, n)
       end;

julia> vec_prod(2, 5) == better_vec_prod(2, 5)
true

julia> @btime vec_prod(10, 1000);
  3.820 s (154 allocations: 473.33 MiB)

julia> @btime better_vec_prod(10, 1000);
  238.816 ms (6 allocations: 22.89 MiB)

julia> @btime better_vec_prod_aux!(mat, matt1, matt2, 10, 1000) setup = (
           n = 1000;
           mat = Matrix{Float64}(undef, n, n);
           matt1 = Matrix{Float64}(undef, n, n);
           matt2 = Matrix{Float64}(undef, n, n)
       );
  247.210 ms (0 allocations: 0 bytes)
```
As you can see, in this version the only allocations happen when the three buffers are created in `better_vec_prod`, before the loop: `better_vec_prod_aux!` itself performs none. This is especially important when dealing with matrices as large as yours (1000 x 1000).
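If you want a quick check outside of BenchmarkTools, Base's `@allocated` tells the same story, provided you call the function once beforehand so compilation doesn't pollute the count (the 0 below matches the "0 allocations" reported by `@btime` above):

```julia
julia> mat = rand(5, 5); matt1 = similar(mat); matt2 = similar(mat);

julia> better_vec_prod_aux!(mat, matt1, matt2, 10, 5);  # warm-up run (compilation)

julia> @allocated better_vec_prod_aux!(mat, matt1, matt2, 10, 5)
0
```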