Reduce allocation in nested for loops

I have a function that calculates the expectation and covariance of random variables on a grid with dimensions (nx, ny, nz). The variables y and z are Markovian, with Qy and Qz being their Markov matrices. The functions f and g compute the exact values of the random variables b and a, respectively. The example below is a bit awkward in that it just converts Int to Float, but that is for illustration only.

Given the large scale of the real problem (which has additional layers), preallocating large matrices as and bs outside the loop is impractical due to memory constraints. To avoid data races, I instead allocate these matrices within the threaded loops.

Profiling reveals significant allocations related to tensor contractions and the matrices as and bs. Is there a way to optimize this function to reduce allocations and improve efficiency?

Thank you!

using TensorOperations
using LinearAlgebra

function calc_mats(nx, ny, nz, Qy, Qz)
    ## preallocation, output
    μ_mat = Array{Float64}(undef, (nx, ny, nz))
    C_mat = similar(μ_mat)
    ## loop over all states and decisions today
    # states today
    Threads.@threads for iz in 1:nz
        Threads.@threads for iy in 1:ny
            ## preallocation, temporary variables, has to be within the threads
            # given state and decisions today
            as = Array{Float64}(undef, ny, nz)
            bs = Vector{Float64}(undef, nz)
            abs = similar(as)
            Qziz = @view Qz[iz, :]
            Qyiy = @view Qy[iy, :]
            for ix in 1:nx
                # uncertainties tomorrow
                # prepare return and flow shock matrices
                for izz in 1:nz
                    bs[izz] = f(izz)
                    for iyy in 1:ny
                        as[iyy, izz] = g(iyy, izz, ix, iy, iz)
                        abs[iyy, izz] = as[iyy, izz] * bs[izz]  # b varies only with izz
                    end
                end
                # calculate mean and covariance
                μ = @tensor as[iiyy, iizz] * Qziz[iizz] * Qyiy[iiyy]
                μ_mat[ix, iy, iz] = μ
                Eb = dot(bs, Qziz)
                Eab = @tensor abs[iiyy, iizz] * Qziz[iizz] * Qyiy[iiyy]
                C = Eab - μ * Eb
                C_mat[ix, iy, iz] = C
            end
        end
    end
    return μ_mat, C_mat
end

g(a, b, c, d, e) = a / b + c - c / d + exp(e)
f(x) = sqrt(x)

nx = 3
ny = 5
nz = 7
Qy = randn(ny, ny)
Qz = randn(nz, nz)

calc_mats(nx, ny, nz, Qy, Qz)
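
For reference, timing and allocation counts can be checked with BenchmarkTools (assuming it is available; the $-interpolation keeps global-variable access out of the measurement):

using BenchmarkTools
@btime calc_mats($nx, $ny, $nz, $Qy, $Qz)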

I tried different packages for the tensor contraction, but it seems TensorOperations is already the best solution there. I also tried calculating the Kronecker product of the Markov matrices, reshaping the result, and multiplying it with as and abs, but that was not very helpful either.
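
Since both contractions here just reduce a matrix against two vectors, one alternative that might avoid the temporaries is the three-argument dot from LinearAlgebra, which evaluates x' * A * y without materializing A * y. A minimal sketch of the replacement inside the ix loop (not benchmarked on the full problem):

using LinearAlgebra  # already loaded above

# dot(x, A, y) contracts both indices of A in one pass without a temporary
μ = dot(Qyiy, as, Qziz)      # same scalar as @tensor as[iiyy, iizz] * Qziz[iizz] * Qyiy[iiyy]
Eab = dot(Qyiy, abs, Qziz)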

I haven't figured out a way to bypass the allocation of as, bs, and abs within the threaded loops. Help with that would also be appreciated.

Is there any reason you need to allocate these arrays inside the loops? I would suggest pulling them out.

Somehow I dropped Threads.@threads there, sorry; I have now added it back. I tested two versions: one allocating these variables outside the loop, and one threading over the iz and iy levels and allocating them inside the loop to avoid data races. It turns out the threaded version is faster.
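
For completeness, here is a sketch of a variant that might combine both ideas: spawn one task per chunk of iz values, so each task allocates the temporaries once and reuses them across its whole chunk, and contract with the three-argument dot. It assumes the same f and g as above; the name calc_mats_chunked and the chunk size are illustrative, and I have not tested this on the real problem:

using LinearAlgebra

function calc_mats_chunked(nx, ny, nz, Qy, Qz)
    μ_mat = Array{Float64}(undef, nx, ny, nz)
    C_mat = similar(μ_mat)
    # each task owns a disjoint range of iz, so writes to μ_mat and C_mat cannot race
    @sync for chunk in Iterators.partition(1:nz, cld(nz, Threads.nthreads()))
        Threads.@spawn begin
            # one buffer set per task, reused across the whole chunk
            as = Array{Float64}(undef, ny, nz)
            bs = Vector{Float64}(undef, nz)
            abs = similar(as)
            for iz in chunk, iy in 1:ny
                Qziz = @view Qz[iz, :]
                Qyiy = @view Qy[iy, :]
                for ix in 1:nx
                    for izz in 1:nz
                        bs[izz] = f(izz)
                        for iyy in 1:ny
                            as[iyy, izz] = g(iyy, izz, ix, iy, iz)
                            abs[iyy, izz] = as[iyy, izz] * bs[izz]
                        end
                    end
                    μ = dot(Qyiy, as, Qziz)   # allocation-free contraction
                    μ_mat[ix, iy, iz] = μ
                    C_mat[ix, iy, iz] = dot(Qyiy, abs, Qziz) - μ * dot(bs, Qziz)
                end
            end
        end
    end
    return μ_mat, C_mat
end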