How to pre-allocate matrices to avoid constructing new matrices during the for-loop?

Hi guys, I have some codes as follows:

function sim(M,R,S,q,numIters,numGrids)
    for i = 1:numIters
        tmp = R + reshape(q' * S, numGrids, numGrids);
        q = (M - tmp) \ ((M + tmp) * q);
    end
    return q;
end

n = 100;
numIters = 10000;
M = rand(n,n);
R = rand(n,n);
S = rand(n,n*n);
q = rand(n);
@time sim(M,R,S,q,numIters,n);

If you run the codes, it will show 23.508989 seconds (220.00 k allocations: 3.765 GB, 1.33% gc time)

Clearly there is a lot of memory allocations. The main reasons are that I need to construct the intermediate matrix tmp in each for loop. I know that I should pre-allocate the matrix to avoid this issue, but how? I’m very new in Julia and really appreciate any comments and suggestions!

1 Like

There is an issue with your code. Do you need to initialise tmp?

Then you have to look for inplace reshape and \. A starter is (to me this shaves of 1s out of 6):

function sim!(M,R,S,q,numIters,numGrids,tmp)
    tmp .= tmp .* 0
    for i = 1:numIters
        tmp .= R .+ reshape(q' * S, numGrids, numGrids);
        q .= (M .- tmp) \ ((M .+ tmp) * q);
    end
end

n = 100;
numIters = 10000;
M = rand(n,n);
R = rand(n,n);
S = rand(n,n*n);
q = rand(n);

tmp = rand(n,n);
@time sim!(M,R,S,q,numIters,n,tmp);
2 Likes

reshape shouldn’t be an allocation bottleneck. The array it returns shares the same underlying data. If it can’t it returns a ReshapedArray which is just a view.

julia> @btime reshape(1:100000, 2, :);
  25.831 ns (0 allocations: 0 bytes)

julia> @btime reshape($(collect(1:100000)), 2, :);
  37.775 ns (2 allocations: 96 bytes)
3 Likes

You can also pre-allocate temp before the for loop with temp = fill(0.0,size(M)), unless you need it outside the function for some reason.

1 Like

The following code cuts down allocations:

function sim2(M,R,S,q,numIters,numGrids)
     tmp1 = zeros(size(q,2), size(S,2))
     tmp2 = zeros(numGrids, numGrids)
     tmp3 = zeros(M)
     tmp4 = zeros(M)
     tmp5 = zeros(size(M,1), size(q,2))
     for i = 1:numIters
         At_mul_B!(tmp1, q, S)
         tmp2 .= R .+ reshape(tmp1, numGrids, numGrids)
         tmp3 .= M .- tmp2
         tmp4 .= M .+ tmp2
         A_mul_B!(tmp5, tmp4, q)
         q = tmp3 \ tmp5
     end
     return q
end
#258.348244 seconds (100.01 k allocations: 782.930 MiB, 0.03% gc time)

The bottleneck is the system solve q = tmp3 \ tmp5 because it solves the system and allocates for the output every iteration. If you pre-factorize tmp3, you can use the inplace version A_ldiv_B!. The following code is significantly faster but it re-arranges the lines. It is obviously a cheat because it only factorizes once, but if you can cheat like this in your actual code then go for it!

function sim3(M,R,S,q,numIters,numGrids)
   tmp1 = zeros(size(q,2), size(S,2))
   tmp2 = zeros(numGrids, numGrids)
   tmp3 = zeros(M)
   tmp4 = zeros(M)
   tmp5 = zeros(size(M,1), size(q,2))

   At_mul_B!(tmp1, q, S)
   tmp2 .= R .+ reshape(tmp1, numGrids, numGrids)
   tmp3 .= M .- tmp2
   tmp4 .= M .+ tmp2
   fact = lufact(tmp4)
   for i = 1:numIters-1
       A_ldiv_B!(q, fact, tmp5)
       At_mul_B!(tmp1, q, S)
       tmp2 .= R .+ reshape(tmp1, numGrids, numGrids)
       tmp3 .= M .- tmp2
       tmp4 .= M .+ tmp2
       A_mul_B!(tmp5, tmp4, q)
   end
   A_ldiv_B!(q, fact, tmp5)

   return q
end
# 10.253734 seconds (30.02 k allocations: 1.452 MiB)

Try these on your machine, and check the timings.

2 Likes

You can use an in-place version of an iterative solver when solving the linear system, e.g.,

using IterativeSolvers

function sim2(M,R,S,q,numIters,numGrids)
     tmp1 = zeros(size(q,2), size(S,2))
     tmp2 = zeros(numGrids, numGrids)
     tmp3 = zeros(M)
     tmp4 = zeros(M)
     tmp5 = zeros(size(M,1), size(q,2))
     for i = 1:numIters
         At_mul_B!(tmp1, q, S)
         tmp2 .= R .+ reshape(tmp1, numGrids, numGrids)
         tmp3 .= M .- tmp2
         tmp4 .= M .+ tmp2
         A_mul_B!(tmp5, tmp4, q)
         cg!(q, tmp3, tmp5) # Solve for q in-place using iterative solver ConjugateGradients
     end
     return q
end

If q is not expected to change much between iterations, the last q will be a very good initial guess for the iterative solver and the solution will be found very fast.

2 Likes

You can also cut down on the allocations caused by the solver by using lufact!, without cheating. This is not entirely allocation-free but uses much fewer. Although I agree with @baggepinnen that iterative solvers are almost surely better here, depending on numerical values (moderately small updates).

function sim_lu(M,R,S,q,numIters,numGrids)
     tmp1 = zeros(size(q,2), size(S,2))
     tmp2 = zeros(numGrids, numGrids)
     tmp3 = zeros(M)
     tmp4 = zeros(M)
     tmp5 = zeros(size(M,1), size(q,2))
     for i = 1:numIters
         At_mul_B!(tmp1, q, S)
         tmp2 .= R .+ reshape(tmp1, numGrids, numGrids)
         tmp3 .= M .- tmp2
         tmp4 .= M .+ tmp2
         A_mul_B!(tmp5, tmp4, q)
         fact = lufact!(tmp3)
         A_ldiv_B!(q,fact,tmp5)
      end
     return q
end
2 Likes