Of course, your code is a bit odd because it does the same thing at every iteration, so we could just do the computation once and multiply the resulting Arg1 by N. For the sake of comparison, however, I wrote another function that performs exactly the same computations.
The better memory management comes from pre-allocating the outputs and fusing the vectorized (broadcast) operations.
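The fusion point is worth a one-line illustration: in Julia, a chain of dotted operators compiles into a single loop, and with `.=` the result is written into an existing array, so the whole expression allocates nothing. A minimal sketch (the names `x` and `y` are just for illustration):

```julia
x = rand(1000)
y = similar(x)           # pre-allocated output buffer

# All the dots fuse into one loop over x, writing directly into y,
# so this line allocates nothing once y exists:
y .= 2 .* x .^ 2 .+ 1
# versus y = 2 .* x .^ 2 .+ 1, which allocates a new vector every time
```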
In addition I used a few specific linear algebra functions:
dot(matt1, matt2) to replace sum(matt1 .* matt2), which allocates the elementwise product before summing it
mul!(matt2, mat, mat) instead of matt2 = mat * mat, to avoid allocating the matrix product since I already know where I'm going to store it
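Both substitutions compute the same values; they just skip the temporary arrays. A quick sketch of the equivalences, using small throwaway matrices for illustration:

```julia
using LinearAlgebra

a = rand(4, 4)
b = rand(4, 4)

# dot(a, b) returns the same scalar as sum(a .* b), but without
# materializing the elementwise product a .* b first.
dot(a, b) ≈ sum(a .* b)

# mul!(c, a, b) writes the product a * b into an existing buffer c,
# instead of allocating a fresh matrix the way c = a * b does.
c = similar(a)
mul!(c, a, b)
c ≈ a * b
```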
julia> using BenchmarkTools, LinearAlgebra
julia> function better_vec_prod_aux!(mat, matt1, matt2, N, n)
           Arg1 = 0.0
           for i = 1:N
               vec1 = 1:n
               mat .= vec1 .* vec1'
               matt1 .= (mat .* mat) .* (1 / n^2)
               mul!(matt2, mat, mat)
               matt2 .*= 1 / n^2
               Arg1 = Arg1 + dot(matt1, matt2) / N
           end
           return Arg1
       end;
julia> function better_vec_prod(N, n)
           mat = Matrix{Float64}(undef, n, n)
           matt1 = Matrix{Float64}(undef, n, n)
           matt2 = Matrix{Float64}(undef, n, n)
           return better_vec_prod_aux!(mat, matt1, matt2, N, n)
       end;
julia> vec_prod(2, 5) == better_vec_prod(2, 5)
true
julia> @btime vec_prod(10, 1000);
3.820 s (154 allocations: 473.33 MiB)
julia> @btime better_vec_prod(10, 1000);
238.816 ms (6 allocations: 22.89 MiB)
julia> @btime better_vec_prod_aux!(mat, matt1, matt2, 10, 1000) setup = (
           n = 1000;
           mat = Matrix{Float64}(undef, n, n);
           matt1 = Matrix{Float64}(undef, n, n);
           matt2 = Matrix{Float64}(undef, n, n)
       );
247.210 ms (0 allocations: 0 bytes)
As you can see, in this version the only allocations happen before the loop, when the three matrices are created: better_vec_prod_aux! itself performs none. This is especially important when dealing with large matrices like yours (1000 x 1000).