A better strategy than your current one would be to avoid the array-of-arrays layout and instead write this as a 3D tensor contraction. BLAS does not support that directly, but Tullio.jl + LoopVectorization.jl is basically just as good, if not better.

```
julia> using Tullio, LoopVectorization, BenchmarkTools

julia> multiplyAndSum(Gmρ, Gρq) = sum(Gmρ .* Gρq)
multiplyAndSum (generic function with 1 method)

julia> multiplyAndSum_tullio(Gmρ, Gρq) = @tullio out[i, j] := Gmρ[i, k, l] * Gρq[k, j, l]
multiplyAndSum_tullio (generic function with 1 method)
```

```
julia> foreach([50, 500, 1000]) do dim
           Gmρ = [rand(dim, 18) for _ in 1:1024]
           Gρq = [rand(18, dim) for _ in 1:1024]
           Gmρ_tensor = reduce((x, y) -> cat(x, y, dims=3), Gmρ)
           Gρq_tensor = reduce((x, y) -> cat(x, y, dims=3), Gρq)
           @show dim
           out1 = @btime multiplyAndSum_tullio($Gmρ_tensor, $Gρq_tensor)
           out2 = @btime multiplyAndSum($Gmρ, $Gρq)
           @show out1 ≈ out2
           println()
       end
dim = 50
1.556 ms (74 allocations: 24.88 KiB)
12.235 ms (4095 allocations: 39.27 MiB)
out1 ≈ out2 = true
dim = 500
74.228 ms (76 allocations: 1.91 MiB)
721.703 ms (4095 allocations: 3.81 GiB)
out1 ≈ out2 = true
dim = 1000
308.981 ms (77 allocations: 7.63 MiB)
6.334 s (4095 allocations: 15.25 GiB)
out1 ≈ out2 = true
```

A couple of notes:

```
Gmρ_tensor = reduce((x, y) -> cat(x, y, dims=3), Gmρ)
Gρq_tensor = reduce((x, y) -> cat(x, y, dims=3), Gρq)
```

are just me turning your arrays of arrays into equivalent 3D tensors. It's best to generate your data in this form from the start, *not* convert it after first creating it as an array of arrays, since that `reduce`/`cat` conversion is itself inefficient.
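For instance, rather than building 1024 separate matrices and concatenating them, you can preallocate the 3D array and fill its slices directly. This is only a sketch: the loop body here just fills each slice with `rand`, standing in for whatever computation actually produces your `l`-th matrix.

```
# Sketch: build a (dim × 18 × n) tensor slice-by-slice instead of
# concatenating n separate matrices with reduce/cat.
function build_tensor(dim, n)
    G = Array{Float64}(undef, dim, 18, n)
    for l in 1:n
        # stand-in for whatever computation produces your l-th matrix
        G[:, :, l] .= rand(dim, 18)
    end
    return G
end
```

Each `G[:, :, l]` view then plays the role of one matrix from your original vector of matrices, with no copying or concatenation step.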

The Tullio tensor contraction I wrote:

```
multiplyAndSum_tullio(Gmρ, Gρq) = @tullio out[i, j] := Gmρ[i, k, l] * Gρq[k, j, l]
```

essentially says: create a 2D matrix `out` whose `[i, j]`th element is the sum over the indices `k` and `l` of `Gmρ[i, k, l] * Gρq[k, j, l]`.
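Ignoring the threading and SIMD machinery Tullio generates, that contraction is equivalent to this naive loop (a sketch for clarity, not something you'd benchmark against):

```
# Plain-loop equivalent of
#   @tullio out[i, j] := Gmρ[i, k, l] * Gρq[k, j, l]
# i.e. out == sum over l of Gmρ[:, :, l] * Gρq[:, :, l]
function multiplyAndSum_loops(Gmρ, Gρq)
    I, K, L = size(Gmρ)
    J = size(Gρq, 2)
    out = zeros(I, J)
    for l in 1:L, j in 1:J, k in 1:K, i in 1:I
        out[i, j] += Gmρ[i, k, l] * Gρq[k, j, l]
    end
    return out
end
```

That is, it sums `Gmρ[:, :, l] * Gρq[:, :, l]` (an ordinary matrix product) over the third index `l`, which is exactly what your original `sum(Gmρ .* Gρq)` computes on the vectors of matrices.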