Matrix multiplication with Duals

One more trick is to avoid the temporary matrices altogether and write to θ1 directly by unrolling the multiplication into an explicit triple for-loop, accessing the memory in an efficient order. I've made this change for the last loop only (keeping a small row buffer for correctness), and, assuming I didn't make a linear algebra mistake, it is again a bit faster. The first loop should allow a similar speed-up, at least by eliminating temp2 (see the sketch after the benchmark below).

using LinearAlgebra  # for mul! and the ⋅ (dot) product

function uncompress_further_optimized(compressionIndexes, XC::AbstractVector{T}, DC, IDC) where {T<:Real}
    nb = size(DC[1], 1)
    nk = size(DC[2], 1)
    ny = size(DC[3], 1)
    # Entries of θ1 that are not filled from XC below must be zero, so start from zeros rather than undef.
    θ1 = zeros(eltype(XC), nb, nk, ny)

    # Populate the 3D array using compressionIndexes.
    @inbounds for j in eachindex(XC)
        θ1[compressionIndexes[j]] = XC[j]
    end

    @views temp1 = similar(θ1[:, :, 1])
    @views temp2 = similar(θ1[:, :, 1])

    @inbounds @views for yy in axes(θ1, 3)
        # Multiply: temp1 = IDC[1] * θ1[:, :, yy]
        mul!(temp1, IDC[1], θ1[:, :, yy])
        # Multiply: temp2 = temp1 * DC[2]
        mul!(temp2, temp1, DC[2])
        # Write the result back.
        θ1[:, :, yy] .= temp2
    end

    # Unrolled multiplication along the third dimension: θ1[bb, i, :] = θ1[bb, i, :]' * DC[3].
    # The small row buffer is needed for correctness: without it, entries of θ1[bb, i, :]
    # would be overwritten while they are still being read for the remaining k.
    rowbuf = Vector{eltype(θ1)}(undef, ny)
    @inbounds @views for i in axes(θ1, 2), bb in axes(θ1, 1)
        copyto!(rowbuf, θ1[bb, i, :])
        for k in axes(θ1, 3)
            θ1[bb, i, k] = rowbuf ⋅ DC[3][:, k]
        end
    end

    return reshape(θ1, nb * nk * ny)
end

gives

BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  35.625 μs …  4.564 ms  ┊ GC (min … max): 0.00% … 98.41%
 Time  (median):     44.625 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.842 μs ± 88.113 μs  ┊ GC (mean ± σ):  9.19% ±  5.42%

              ▃▆██▄▁                                           
  ▂▄▄▄▄▄▃▃▃▄▅███████▇▆▅▄▄▃▂▂▂▂▂▁▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  35.6 μs         Histogram: frequency by time        71.3 μs <

 Memory estimate: 153.30 KiB, allocs estimate: 33.
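
The same idea should carry over to the first loop: temp2 can be dropped by letting the second mul! write straight into the slice of θ1, while temp1 has to stay because the first product reads the very slice it would otherwise be overwriting. A minimal sketch of that loop, assuming the same θ1, temp1, IDC, and DC as above (I haven't benchmarked this variant):

    @inbounds @views for yy in axes(θ1, 3)
        # temp1 = IDC[1] * θ1[:, :, yy]; this buffer cannot be removed, because the
        # product reads the same slice of θ1 that the result replaces.
        mul!(temp1, IDC[1], θ1[:, :, yy])
        # Write temp1 * DC[2] directly into the slice, so temp2 is no longer needed.
        mul!(θ1[:, :, yy], temp1, DC[2])
    end

As far as I can tell, one nb×nk scratch matrix is the minimum for this step, since the result of the first product needs somewhere to land before the slice of θ1 is overwritten.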