A better strategy than your current one would be to avoid the array-of-arrays layout and instead write this as a 3D tensor contraction. BLAS does not support that directly, but Tullio.jl + LoopVectorization.jl is basically just as good, if not better.

```
julia> using Tullio, LoopVectorization, BenchmarkTools

julia> multiplyAndSum(Gmρ, Gρq) = sum(Gmρ .* Gρq)
multiplyAndSum (generic function with 1 method)

julia> multiplyAndSum_tullio(Gmρ, Gρq) = @tullio out[i, j] := Gmρ[i, k, l] * Gρq[k, j, l]
multiplyAndSum_tullio (generic function with 1 method)
```

```
julia> foreach([50, 500, 1000]) do dim
           Gmρ = [rand(dim, 18) for _ in 1:1024]
           Gρq = [rand(18, dim) for _ in 1:1024]
           Gmρ_tensor = reduce((x, y) -> cat(x, y, dims=3), Gmρ)
           Gρq_tensor = reduce((x, y) -> cat(x, y, dims=3), Gρq)
           @show dim
           out1 = @btime multiplyAndSum_tullio($Gmρ_tensor, $Gρq_tensor)
           out2 = @btime multiplyAndSum($Gmρ, $Gρq)
           @show out1 ≈ out2
           println()
       end
dim = 50
1.556 ms (74 allocations: 24.88 KiB)
12.235 ms (4095 allocations: 39.27 MiB)
out1 ≈ out2 = true
dim = 500
74.228 ms (76 allocations: 1.91 MiB)
721.703 ms (4095 allocations: 3.81 GiB)
out1 ≈ out2 = true
dim = 1000
308.981 ms (77 allocations: 7.63 MiB)
6.334 s (4095 allocations: 15.25 GiB)
out1 ≈ out2 = true
```

A couple of notes:

```
Gmρ_tensor = reduce((x, y) -> cat(x, y, dims=3), Gmρ)
Gρq_tensor = reduce((x, y) -> cat(x, y, dims=3), Gρq)
```

are just me turning your arrays of arrays into equivalent 3D tensors. It's best to generate your data in this form from the start, *not* convert it after first creating it as an array of arrays, since that `reduce`/`cat` conversion is itself inefficient.
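For instance, rather than building 1024 separate matrices and concatenating them, you can preallocate the 3D array and fill its slices directly. This is only a sketch: the loop body here just fills each slice with `rand`, standing in for whatever computation actually produces your `l`-th matrix.

```
# Sketch: build a (dim × 18 × n) tensor slice-by-slice instead of
# concatenating n separate matrices with reduce/cat.
function build_tensor(dim, n)
    G = Array{Float64}(undef, dim, 18, n)
    for l in 1:n
        # stand-in for whatever computation produces your l-th matrix
        G[:, :, l] .= rand(dim, 18)
    end
    return G
end
```

Each `G[:, :, l]` view then plays the role of one matrix from your original vector of matrices, with no copying or concatenation step.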

The Tullio tensor contraction I wrote:

```
multiplyAndSum_tullio(Gmρ, Gρq) = @tullio out[i, j] := Gmρ[i, k, l] * Gρq[k, j, l]
```

essentially says: create a 2D matrix `out` whose `[i, j]`th element is the sum over the indices `k` and `l` of `Gmρ[i, k, l] * Gρq[k, j, l]`.
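Ignoring the threading and SIMD machinery Tullio generates, that contraction is equivalent to this naive loop (a sketch for clarity, not something you'd benchmark against):

```
# Plain-loop equivalent of
#   @tullio out[i, j] := Gmρ[i, k, l] * Gρq[k, j, l]
# i.e. out == sum over l of Gmρ[:, :, l] * Gρq[:, :, l]
function multiplyAndSum_loops(Gmρ, Gρq)
    I, K, L = size(Gmρ)
    J = size(Gρq, 2)
    out = zeros(I, J)
    for l in 1:L, j in 1:J, k in 1:K, i in 1:I
        out[i, j] += Gmρ[i, k, l] * Gρq[k, j, l]
    end
    return out
end
```

That is, it sums `Gmρ[:, :, l] * Gρq[:, :, l]` (an ordinary matrix product) over the third index `l`, which is exactly what your original `sum(Gmρ .* Gρq)` computes on the vectors of matrices.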