Of course, your code is a bit weird because it does the same thing at every iteration, so we could just do it once and multiply the resulting `Arg1` by `N`. For the sake of comparison, however, I wrote another function that performs exactly the same computations.

The better memory management comes from pre-allocating the outputs and fusing the vectorized operations.
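To see what fusion buys you, here is a minimal sketch (with made-up arrays `a` and `b`, unrelated to the benchmark): broadcasting into a pre-allocated array with `.=` evaluates the whole right-hand side in a single fused loop, while a plain assignment allocates a fresh matrix for each intermediate.

```julia
a = rand(4, 4)
b = rand(4, 4)
out = Matrix{Float64}(undef, 4, 4)  # pre-allocated output

# Fused broadcast: the whole right-hand side is computed element by
# element and written straight into `out`, with no temporary arrays.
out .= (a .+ b) .^ 2

# Plain assignment allocates a new matrix for `a .+ b` and another
# one for the final result.
out2 = (a .+ b) .^ 2
```

Both versions compute the same values elementwise; only the allocation behavior differs.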

In addition, I used a few specific linear algebra functions:

- `dot(matt1, matt2)` to replace `sum(matt1 .* matt2)` (which allocates the elementwise product before summing it)
- `mul!(matt2, mat, mat)` instead of `matt2 = mat * mat`, to avoid allocating the matrix product since I already know where I'm going to store it
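As a small sketch of those two substitutions (toy 3×3 matrix, names mirroring the ones above), each pair computes the same values; the `dot` / `mul!` versions just skip the temporary arrays:

```julia
using LinearAlgebra

n = 3
mat = [Float64(i * j) for i in 1:n, j in 1:n]
matt1 = (mat .* mat) ./ n^2
matt2 = similar(mat)  # pre-allocated destination

# mul!(C, A, B) stores A * B in the pre-allocated C instead of
# allocating a fresh matrix like `matt2 = mat * mat` would.
mul!(matt2, mat, mat)

# dot(x, y) accumulates x[i] * y[i] directly, whereas
# sum(matt1 .* matt2) first allocates the full elementwise
# product and then sums it.
s_dot = dot(matt1, matt2)
s_sum = sum(matt1 .* matt2)
```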

```
julia> using BenchmarkTools, LinearAlgebra

julia> function better_vec_prod_aux!(mat, matt1, matt2, N, n)
           Arg1 = 0.0
           for i = 1:N
               vec1 = 1:n
               mat .= vec1 .* vec1'
               matt1 .= (mat .* mat) .* (1 / n^2)
               mul!(matt2, mat, mat)
               matt2 .*= 1 / n^2
               Arg1 = Arg1 + dot(matt1, matt2) / N
           end
           return Arg1
       end;

julia> function better_vec_prod(N, n)
           mat = Matrix{Float64}(undef, n, n)
           matt1 = Matrix{Float64}(undef, n, n)
           matt2 = Matrix{Float64}(undef, n, n)
           return better_vec_prod_aux!(mat, matt1, matt2, N, n)
       end;

julia> vec_prod(2, 5) == better_vec_prod(2, 5)
true

julia> @btime vec_prod(10, 1000);
  3.820 s (154 allocations: 473.33 MiB)

julia> @btime better_vec_prod(10, 1000);
  238.816 ms (6 allocations: 22.89 MiB)

julia> @btime better_vec_prod_aux!(mat, matt1, matt2, 10, 1000) setup = (
           n = 1000;
           mat = Matrix{Float64}(undef, n, n);
           matt1 = Matrix{Float64}(undef, n, n);
           matt2 = Matrix{Float64}(undef, n, n)
       );
  247.210 ms (0 allocations: 0 bytes)
```

As you can see, in this version the only allocations happen up front, when the matrices are pre-allocated in `better_vec_prod`: `better_vec_prod_aux!` itself performs none. This is especially important when dealing with large matrices like yours (1000 × 1000).