One more trick is to avoid the temporary allocations altogether and write to theta directly by unrolling the matrix product into a triple for-loop that accesses memory in an efficient order. I've made this change for the last for-loop and, if I didn't make a linear algebra mistake, it is again a bit faster. The other loop should allow a similar speed-up (at least by eliminating temp2):
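```julia
using LinearAlgebra  # provides mul! and the ⋅ (dot) operator

function uncompress_further_optimized(compressionIndexes, XC::AbstractVector{T}, DC, IDC) where {T<:Real}
    nb = size(DC[1], 1)
    nk = size(DC[2], 1)
    ny = size(DC[3], 1)
    # Note: `undef` leaves any entry not listed in compressionIndexes uninitialized.
    θ1 = Array{eltype(XC)}(undef, nb, nk, ny)
    # Populate the 3D array using compressionIndexes.
    @inbounds for j in eachindex(XC)
        θ1[compressionIndexes[j]] = XC[j]
    end
    @views temp1 = similar(θ1[:, :, 1])
    @views temp2 = similar(θ1[:, :, 1])
    @inbounds @views for yy in axes(θ1, 3)
        # Multiply: temp1 = IDC[1] * θ1[:, :, yy]
        mul!(temp1, IDC[1], θ1[:, :, yy])
        # Multiply: temp2 = temp1 * DC[2]
        mul!(temp2, temp1, DC[2])
        # Write the result back.
        θ1[:, :, yy] .= temp2
    end
    # Third-mode product: θ1[bb, i, k] = Σ_j θ1[bb, i, j] * DC[3][j, k].
    # Each fiber is copied first, because θ1[bb, i, :] is a view and writing
    # θ1[bb, i, k] in place would change the inputs of the remaining dot products.
    fiber = Vector{eltype(θ1)}(undef, ny)
    @inbounds @views for i in axes(θ1, 2), bb in axes(θ1, 1)
        fiber .= θ1[bb, i, :]
        for k in axes(θ1, 3)
            θ1[bb, i, k] = fiber ⋅ DC[3][:, k]
        end
    end
    return reshape(θ1, nb * nk * ny)
end
```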
This gives:
```
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  35.625 μs …  4.564 ms  ┊ GC (min … max): 0.00% … 98.41%
 Time  (median):     44.625 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.842 μs ± 88.113 μs  ┊ GC (mean ± σ):  9.19% ±  5.42%

             ▃▆██▄▁
  ▂▄▄▄▄▄▃▃▃▄▅███████▇▆▅▄▄▃▂▂▂▂▂▁▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  35.6 μs         Histogram: frequency by time        71.3 μs <

 Memory estimate: 153.30 KiB, allocs estimate: 33.
```
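
For the other loop, eliminating temp2 might look roughly like the sketch below. `transform_slices!` is a hypothetical helper written for illustration, not code from this thread, and the small random matrices only stand in for `IDC[1]` and `DC[2]`. The idea is that the second `mul!` can write straight into the slice view, since that output does not alias either of its inputs (`temp` and `B`); I have not benchmarked this variant.

```julia
using LinearAlgebra

# Hypothetical helper (an assumption, not from the original post):
# overwrites every slice θ[:, :, yy] with A * θ[:, :, yy] * B,
# using a single scratch matrix instead of two.
function transform_slices!(θ::AbstractArray{T,3}, A::AbstractMatrix, B::AbstractMatrix) where {T}
    temp = Matrix{T}(undef, size(A, 1), size(θ, 2))
    @views for yy in axes(θ, 3)
        mul!(temp, A, θ[:, :, yy])   # temp = A * slice
        mul!(θ[:, :, yy], temp, B)   # slice = temp * B, written in place
    end
    return θ
end

# Small check against the allocating reference computation.
θ = rand(4, 3, 2)
A = rand(4, 4)   # stands in for IDC[1]
B = rand(3, 3)   # stands in for DC[2]
expected = cat((A * θ[:, :, yy] * B for yy in axes(θ, 3))...; dims = 3)
transform_slices!(θ, A, B)
@assert θ ≈ expected
```

The first `mul!` still needs the scratch matrix, because `mul!` does not allow its output to alias an input, so `temp1` cannot be removed the same way.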