Multiply dense arrays

`reduce(hcat, v)` and `reduce(vcat, v)` have specialized methods, so they are much more performant than `hcat(v...)` and `vcat(v...)` — splatting a long vector of arrays forces the compiler to handle each argument separately and allocates heavily.
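As a quick sketch (sizes here are illustrative, not the ones from the benchmark), the two spellings produce the same matrix — only the performance differs:

```julia
# A vector of 3×2 matrices.
v = [rand(3, 2) for _ in 1:4]

# Single call to the specialized reduce(hcat, ...) method: one 3×8 allocation.
A = reduce(hcat, v)

# Splatting passes all 4 matrices as separate arguments — same result,
# but far slower for long vectors.
B = hcat(v...)

A == B  # true
```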

Here are the numbers I see comparing an optimized version of your solution (and a version where the _cat doesn’t happen at runtime) to the Tullio version:

julia> multiplyAndSum2(Gmρ,Gρq) = reduce(hcat, Gmρ)*reduce(vcat, Gρq)
multiplyAndSum2 (generic function with 1 method)

julia> multiplyAndSum3(Gmρ_block,Gρq_block) = Gmρ_block*Gρq_block
multiplyAndSum3 (generic function with 1 method)

julia> let dim = 500
           Gmρ = [rand(dim,18) for _ in 1:1024]
           Gρq = [rand(18,dim) for _ in 1:1024]
           Gmρ_tensor = reduce((x, y) -> cat(x, y, dims=3), Gmρ)
           Gρq_tensor = reduce((x, y) -> cat(x, y, dims=3), Gρq)
           
           Gmρ_block = reduce(hcat, Gmρ)
           Gρq_block = reduce(vcat, Gρq)
           
           out1 = @btime multiplyAndSum_tullio($Gmρ_tensor, $Gρq_tensor)
           out2 = @btime multiplyAndSum($Gmρ, $Gρq)
           out3 = @btime multiplyAndSum2($Gmρ, $Gρq)
           out4 = @btime multiplyAndSum3($Gmρ_block, $Gρq_block) 
           @show out1 ≈ out2 ≈ out3
       end
  71.913 ms (76 allocations: 1.91 MiB)
  1.241 s (4095 allocations: 3.81 GiB)
  118.128 ms (6 allocations: 142.53 MiB)
  70.622 ms (2 allocations: 1.91 MiB)
out1 ≈ out2 ≈ out3 = true

If the concatenation is hoisted out of the timed call, your solution is ever so slightly faster. Nice!
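For completeness, the reason the blocked version computes the same thing is the block-matrix identity `[A1 A2 … An] * [B1; B2; …; Bn] == A1*B1 + A2*B2 + … + An*Bn` — concatenating and doing one big matmul is equivalent to summing the per-block products. A small check (illustrative sizes, not the benchmark's):

```julia
n = 4
As = [rand(5, 3) for _ in 1:n]   # blocks laid out side by side via hcat
Bs = [rand(3, 5) for _ in 1:n]   # blocks stacked vertically via vcat

blocked = reduce(hcat, As) * reduce(vcat, Bs)   # one 5×12 by 12×5 matmul
summed  = sum(As[i] * Bs[i] for i in 1:n)       # n small 5×5 matmuls, summed

blocked ≈ summed  # true, up to floating-point roundoff
```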
