# Reduce allocation in nested for loops

I have a function that calculates the expectation and covariance of random variables on a grid with dimensions `(nx, ny, nz)`. The variables `y` and `z` are Markovian, with `Qy` and `Qz` being their Markov matrices. Functions `f` and `g` compute the exact values of random variables `a` and `b`. The example I gave below is a bit awkward since it is turning `Int` to `Float` but that is for illustration only.

Given the large scale of the problem (with additional layers), preallocating large matrices `as` and `bs` outside the loop is impractical due to memory constraints. To avoid data races, I instead allocate these matrices inside the threaded loops.

Profiling reveals significant allocations related to tensor contractions and the matrices `as` and `bs`. Is there a way to optimize this function to reduce allocations and improve efficiency?

Thank you!

```julia
using TensorOperations
using LinearAlgebra

function calc_mats(nx, ny, nz, Qy, Qz)
    ## preallocation, output
    μ_mat = Array{Float64}(undef, (nx, ny, nz))
    C_mat = similar(μ_mat)
    ## loop over all states and decisions today
    Threads.@threads for iz in 1:nz  # states today
        for iy in 1:ny
            ## preallocation, temporary variables, has to be within the threads
            # given state and decisions today
            as = Array{Float64}(undef, (ny, nz))
            bs = Vector{Float64}(undef, nz)
            abs = similar(as)
            Qziz = @view Qz[iz, :]
            Qyiy = @view Qy[iy, :]
            for ix in 1:nx
                # uncertainties tomorrow
                # prepare return and flow shock matrices
                for izz in 1:nz
                    bs[izz] = f(izz)
                    for iyy in 1:ny
                        as[iyy, izz] = g(iyy, izz, ix, iy, iz)
                        abs[iyy, izz] = as[iyy, izz] * bs[izz]
                    end
                end
                # calculate mean and covariance
                μ = @tensor as[iiyy, iizz] * Qziz[iizz] * Qyiy[iiyy]
                μ_mat[ix, iy, iz] = μ
                Eb = dot(bs, Qziz)
                Eab = @tensor abs[iiyy, iizz] * Qziz[iizz] * Qyiy[iiyy]
                C = Eab - μ * Eb
                C_mat[ix, iy, iz] = C
            end
        end
    end
    return μ_mat, C_mat
end

g(a, b, c, d, e) = a/b + c - c/d + exp(e)
f(x) = sqrt(x)

nx = 3
ny = 5
nz = 7
Qy = randn(ny, ny)
Qz = randn(nz, nz)

calc_mats(nx, ny, nz, Qy, Qz)
```

I tried different packages for the tensor contraction, but TensorOperations already seems to be the best option there. I also tried computing the Kronecker product of the Markov matrices, reshaping the result, and multiplying it with `as` and `abs`, but that was not very helpful either.
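One observation: this particular contraction is just a bilinear form, `Qyiy' * as * Qziz`, so `@tensor` is not strictly needed for it. With a scratch vector allocated once outside the hot loop, it can be evaluated without any per-iteration allocation. A minimal sketch with made-up sizes and random data:

```julia
using LinearAlgebra

ny, nz = 5, 7
as   = rand(ny, nz)
Qziz = rand(nz)
Qyiy = rand(ny)

# Scratch vector: allocated once, reused on every loop iteration.
tmp = Vector{Float64}(undef, ny)

mul!(tmp, as, Qziz)   # tmp = as * Qziz, in place, no allocation
μ = dot(Qyiy, tmp)    # same value as @tensor as[i, j] * Qziz[j] * Qyiy[i]
```

The same two-step `mul!`/`dot` pattern applies to the `abs` contraction for `Eab`.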

I haven't figured out a way to avoid allocating `as`, `bs`, and `abs` inside the threaded loops; any help on that front would also be appreciated.

Is there any reason you need to allocate these arrays inside the loops? I would suggest pulling them out.
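One way to pull the allocations out of the hot loop while staying race-free is to hand each task its own chunk of `iz` values and allocate one set of buffers per task, reused across every `(iy, ix)` in that chunk. A rough sketch (the chunking via `Threads.@spawn`, the scratch vector `tmp`, and the manual `Eab` loop that eliminates `abs` entirely are my additions, not from the original code; `f`/`g` are the stand-ins from the question):

```julia
using LinearAlgebra

# Stand-in functions from the question.
f(x) = sqrt(x)
g(a, b, c, d, e) = a/b + c - c/d + exp(e)

function calc_mats_chunked(nx, ny, nz, Qy, Qz)
    μ_mat = Array{Float64}(undef, nx, ny, nz)
    C_mat = similar(μ_mat)
    # One task per chunk of iz values; each task allocates its buffers ONCE
    # and reuses them for every (iy, ix) it handles. No data races, because
    # no two tasks share a buffer.
    chunks = Iterators.partition(1:nz, cld(nz, Threads.nthreads()))
    tasks = map(chunks) do iz_range
        Threads.@spawn begin
            as  = Matrix{Float64}(undef, ny, nz)
            bs  = Vector{Float64}(undef, nz)
            tmp = Vector{Float64}(undef, ny)   # scratch for the contraction
            for iz in iz_range, iy in 1:ny
                Qziz = @view Qz[iz, :]
                Qyiy = @view Qy[iy, :]
                for ix in 1:nx
                    for izz in 1:nz
                        bs[izz] = f(izz)
                        for iyy in 1:ny
                            as[iyy, izz] = g(iyy, izz, ix, iy, iz)
                        end
                    end
                    mul!(tmp, as, Qziz)   # tmp = as * Qziz, in place
                    μ  = dot(Qyiy, tmp)   # E[a]
                    Eb = dot(bs, Qziz)    # E[b]
                    # E[ab] without materializing `abs`:
                    # sum over as[iyy,izz] * bs[izz] * Qziz[izz] * Qyiy[iyy]
                    Eab = 0.0
                    for izz in 1:nz
                        s = 0.0
                        for iyy in 1:ny
                            s += as[iyy, izz] * Qyiy[iyy]
                        end
                        Eab += s * bs[izz] * Qziz[izz]
                    end
                    μ_mat[ix, iy, iz] = μ
                    C_mat[ix, iy, iz] = Eab - μ * Eb
                end
            end
        end
    end
    foreach(wait, tasks)
    return μ_mat, C_mat
end
```

This keeps the total buffer memory proportional to the number of tasks rather than to `ny * nz` iterations, and the inner loop body allocates nothing.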

Somehow I missed `Threads.@threads` in the code above; sorry, I have now added it back. I tested two versions: one allocating these variables outside the loop, and one threading over the `iz` and `iy` levels and allocating them inside the loop to avoid data races. The threaded version turned out to be faster.