How to pre-allocate matrices to avoid constructing new matrices during the for-loop?

Hi guys, I have some code as follows:

function sim(M,R,S,q,numIters,numGrids)
    for i = 1:numIters
        tmp = R + reshape(q' * S, numGrids, numGrids);
        q = (M - tmp) \ ((M + tmp) * q);
    end
    return q;
end

n = 100;
numIters = 10000;
M = rand(n,n);
R = rand(n,n);
S = rand(n,n*n);
q = rand(n);
@time sim(M,R,S,q,numIters,n);

If you run the code, it shows 23.508989 seconds (220.00 k allocations: 3.765 GB, 1.33% gc time)

Clearly there are a lot of memory allocations. The main reason is that the intermediate matrix tmp is constructed anew in each loop iteration. I know that I should pre-allocate the matrix to avoid this issue, but how? I’m very new to Julia and really appreciate any comments and suggestions!

There is an issue with your code. Do you need to initialise tmp?

Then you have to look for in-place versions of reshape and \. A starter is the following (for me this shaves off about 1 s out of 6):

function sim!(M,R,S,q,numIters,numGrids,tmp)
    fill!(tmp, 0)  # zero the buffer (not strictly needed: tmp is overwritten in the loop)
    for i = 1:numIters
        tmp .= R .+ reshape(q' * S, numGrids, numGrids);
        q .= (M .- tmp) \ ((M .+ tmp) * q);  # note: the right-hand side still allocates temporaries
    end
end

n = 100;
numIters = 10000;
M = rand(n,n);
R = rand(n,n);
S = rand(n,n*n);
q = rand(n);

tmp = rand(n,n);
@time sim!(M,R,S,q,numIters,n,tmp);

reshape shouldn’t be an allocation bottleneck. The array it returns shares the same underlying data as its argument; when it can’t share the memory directly, it returns a ReshapedArray, which is just a view.

julia> @btime reshape(1:100000, 2, :);
  25.831 ns (0 allocations: 0 bytes)

julia> @btime reshape($(collect(1:100000)), 2, :);
  37.775 ns (2 allocations: 96 bytes)
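A quick way to convince yourself that reshape shares memory: mutating through the reshaped array is visible in the original.

```julia
A = collect(1:4)       # plain Vector{Int}
B = reshape(A, 2, 2)   # no copy: B wraps the same memory as A
B[1, 1] = 99           # write through the reshaped array...
A[1] == 99             # ...and the original sees it (true)
```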

You can also pre-allocate tmp before the for loop with tmp = fill(0.0, size(M)), unless you need it outside the function for some reason.

The following code cuts down allocations:

function sim2(M,R,S,q,numIters,numGrids)
     tmp1 = zeros(size(q,2), size(S,2))
     tmp2 = zeros(numGrids, numGrids)
     tmp3 = zeros(M)
     tmp4 = zeros(M)
     tmp5 = zeros(size(M,1), size(q,2))
     for i = 1:numIters
         At_mul_B!(tmp1, q, S)
         tmp2 .= R .+ reshape(tmp1, numGrids, numGrids)
         tmp3 .= M .- tmp2
         tmp4 .= M .+ tmp2
         A_mul_B!(tmp5, tmp4, q)
         q = tmp3 \ tmp5
     end
     return q
end
# 258.348244 seconds (100.01 k allocations: 782.930 MiB, 0.03% gc time)

The bottleneck is the system solve q = tmp3 \ tmp5, because it both solves the system and allocates the output in every iteration. If you pre-factorize tmp3, you can use the in-place version A_ldiv_B!. The following code is significantly faster, but it re-arranges the lines. It is obviously a cheat because it only factorizes once; but if you can cheat like this in your actual code, then go for it!

function sim3(M,R,S,q,numIters,numGrids)
   tmp1 = zeros(size(q,2), size(S,2))
   tmp2 = zeros(numGrids, numGrids)
   tmp3 = zeros(M)
   tmp4 = zeros(M)
   tmp5 = zeros(size(M,1), size(q,2))

   At_mul_B!(tmp1, q, S)
   tmp2 .= R .+ reshape(tmp1, numGrids, numGrids)
   tmp3 .= M .- tmp2
   tmp4 .= M .+ tmp2
   A_mul_B!(tmp5, tmp4, q)   # tmp5 = (M + tmp) * q must be set up before the first solve
   fact = lufact(tmp3)       # factorize (M - tmp) once and reuse it (lufact copies tmp3)
   for i = 1:numIters-1
       A_ldiv_B!(q, fact, tmp5)
       At_mul_B!(tmp1, q, S)
       tmp2 .= R .+ reshape(tmp1, numGrids, numGrids)
       tmp3 .= M .- tmp2
       tmp4 .= M .+ tmp2
       A_mul_B!(tmp5, tmp4, q)
   end
   A_ldiv_B!(q, fact, tmp5)

   return q
end
# 10.253734 seconds (30.02 k allocations: 1.452 MiB)
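For readers on Julia ≥ 1.0: the At_mul_B!/A_mul_B!/lufact/A_ldiv_B! names used above were removed in the 0.7 deprecation cycle. A sketch of the same factorize-once trick against the current LinearAlgebra API (mul!, lu, ldiv!) might look like this; the name sim3_modern is mine, and as above it cheats by factorizing only once:

```julia
using LinearAlgebra

# Same factorize-once idea, written against the post-1.0 LinearAlgebra API:
#   At_mul_B!(y, q, S) -> mul!(y, transpose(S), q)  (stored as a vector here;
#                         same entries as q' * S, so the reshape is identical)
#   lufact(A)          -> lu(A)
#   A_ldiv_B!(x, F, b) -> ldiv!(x, F, b)
function sim3_modern(M, R, S, q, numIters, numGrids)
    tmp1 = zeros(size(S, 2))            # holds S' * q
    tmp2 = zeros(numGrids, numGrids)
    tmp3 = zeros(size(M))
    tmp4 = zeros(size(M))
    tmp5 = zeros(size(M, 1))

    mul!(tmp1, transpose(S), q)
    tmp2 .= R .+ reshape(tmp1, numGrids, numGrids)
    tmp3 .= M .- tmp2
    tmp4 .= M .+ tmp2
    mul!(tmp5, tmp4, q)                 # tmp5 = (M + tmp) * q before the first solve
    fact = lu(tmp3)                     # factorize (M - tmp) once and reuse it
    for i = 1:numIters-1
        ldiv!(q, fact, tmp5)            # q = (M - tmp) \ tmp5, in place
        mul!(tmp1, transpose(S), q)
        tmp2 .= R .+ reshape(tmp1, numGrids, numGrids)
        tmp3 .= M .- tmp2
        tmp4 .= M .+ tmp2
        mul!(tmp5, tmp4, q)
    end
    ldiv!(q, fact, tmp5)
    return q
end
```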

Try these on your machine, and check the timings.

You can use an in-place version of an iterative solver when solving the linear system, e.g.,

using IterativeSolvers

function sim_cg(M,R,S,q,numIters,numGrids)
     tmp1 = zeros(size(q,2), size(S,2))
     tmp2 = zeros(numGrids, numGrids)
     tmp3 = zeros(M)
     tmp4 = zeros(M)
     tmp5 = zeros(size(M,1), size(q,2))
     for i = 1:numIters
         At_mul_B!(tmp1, q, S)
         tmp2 .= R .+ reshape(tmp1, numGrids, numGrids)
         tmp3 .= M .- tmp2
         tmp4 .= M .+ tmp2
         A_mul_B!(tmp5, tmp4, q)
         cg!(q, tmp3, tmp5) # solve tmp3 * q = tmp5 in place with conjugate gradients; the current q is used as the initial guess
     end
     return q
end

If q is not expected to change much between iterations, the last q will be a very good initial guess for the iterative solver and the solution will be found very fast. (Note that plain cg assumes a symmetric positive definite matrix; for a general tmp3, a method such as gmres! from the same package may be more appropriate.)
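As a small, self-contained illustration of warm starting (the matrix, sizes, and perturbation here are made up for the demo): cg! uses the contents of x on entry as the initial guess, so re-solving a slightly perturbed system starting from the previous solution converges in very few iterations.

```julia
using LinearAlgebra, IterativeSolvers

# Illustrative sketch of warm starting cg!. cg requires a symmetric
# positive definite system, so we construct one for the demo.
n = 100
A = randn(n, n)
A = A' * A + n * I          # make A symmetric positive definite
b = randn(n)

x = zeros(n)
cg!(x, A, b)                # cold start from the zero vector
b .+= 1e-3 .* randn(n)      # small perturbation of the right-hand side
cg!(x, A, b)                # warm start: the previous x is the initial guess
```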

You can also cut down on the allocations caused by the solver by using lufact!, without cheating. This is not entirely allocation-free, but it allocates far less. Although I agree with @baggepinnen that iterative solvers are almost surely better here, depending on the numerical values (moderately small updates).

function sim_lu(M,R,S,q,numIters,numGrids)
     tmp1 = zeros(size(q,2), size(S,2))
     tmp2 = zeros(numGrids, numGrids)
     tmp3 = zeros(M)
     tmp4 = zeros(M)
     tmp5 = zeros(size(M,1), size(q,2))
     for i = 1:numIters
         At_mul_B!(tmp1, q, S)
         tmp2 .= R .+ reshape(tmp1, numGrids, numGrids)
         tmp3 .= M .- tmp2
         tmp4 .= M .+ tmp2
         A_mul_B!(tmp5, tmp4, q)
         fact = lufact!(tmp3)
         A_ldiv_B!(q,fact,tmp5)
      end
     return q
end