# Multiply dense arrays

Is there a way to speed up basic multiplication of dense arrays? I have the following:

``````
using BenchmarkTools

multiplyAndSum(Gmρ,Gρq) = sum(Gmρ .* Gρq)

dim = 500 #target is 3000
Gmρ = [rand(dim,18) for _ in 1:1024]
Gρq = [rand(18,dim) for _ in 1:1024]

@btime multiplyAndSum(Gmρ,Gρq)
``````

With `dim = 500` I get
`2.129 seconds (116.26 k allocations: 3.820 GiB, 16.75% gc time)`

and with `dim = 3000` I get
`414.674 seconds (116.30 k allocations: 137.335 GiB, 29.32% gc time)`


Use a GPU


A better strategy than your current one would be to not use an array-of-arrays and instead write this as a 3-d tensor contraction. BLAS does not support that, but Tullio.jl + LoopVectorization.jl is basically just as good if not better.

``````
julia> using Tullio, LoopVectorization, BenchmarkTools

julia> multiplyAndSum(Gmρ,Gρq) = sum(Gmρ .* Gρq)
multiplyAndSum (generic function with 1 method)

julia> multiplyAndSum_tullio(Gmρ, Gρq) = @tullio out[i, j] := Gmρ[i, k, l] * Gρq[k, j, l]
multiplyAndSum_tullio (generic function with 1 method)
``````
``````
julia> foreach([50, 500, 1000]) do dim
           Gmρ = [rand(dim,18) for _ in 1:1024]
           Gρq = [rand(18,dim) for _ in 1:1024]
           Gmρ_tensor = reduce((x, y) -> cat(x, y, dims=3), Gmρ)
           Gρq_tensor = reduce((x, y) -> cat(x, y, dims=3), Gρq)
           @show dim
           out1 = @btime multiplyAndSum_tullio($Gmρ_tensor, $Gρq_tensor)
           out2 = @btime multiplyAndSum($Gmρ, $Gρq)
           @show out1 ≈ out2
           println()
       end

dim = 50
1.556 ms (74 allocations: 24.88 KiB)
12.235 ms (4095 allocations: 39.27 MiB)
out1 ≈ out2 = true

dim = 500
74.228 ms (76 allocations: 1.91 MiB)
721.703 ms (4095 allocations: 3.81 GiB)
out1 ≈ out2 = true

dim = 1000
308.981 ms (77 allocations: 7.63 MiB)
6.334 s (4095 allocations: 15.25 GiB)
out1 ≈ out2 = true

``````

A couple notes:

• you need to do the `using LoopVectorization` part to get high performance here.

• The lines

``````
Gmρ_tensor = reduce((x, y) -> cat(x, y, dims=3), Gmρ)
Gρq_tensor = reduce((x, y) -> cat(x, y, dims=3), Gρq)
``````

are just me turning your arrays of arrays into equivalent 3D tensors. It’s best to generate your data in this 3D form from the start rather than converting it after the fact, since the conversion itself is inefficient.

• The Tullio tensor contraction I wrote:
``````
multiplyAndSum_tullio(Gmρ, Gρq) = @tullio out[i, j] := Gmρ[i, k, l] * Gρq[k, j, l]
``````

essentially says create a 2D matrix `out` whose `[i,j]`th element is the sum over the indices `k` and `l` of `Gmρ[i, k, l] * Gρq[k, j, l]`.
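To make the index notation concrete, here is the same contraction written as plain nested loops, a deliberately naive sketch for illustration (not something you'd want to run at full size):

```julia
# Naive loop version of the @tullio contraction above, for illustration only.
# out[i, j] accumulates Gmρ[i, k, l] * Gρq[k, j, l] over both k and l.
function multiplyAndSum_loops(Gmρ, Gρq)
    I, K, L = size(Gmρ)          # Gmρ has shape (i, k, l)
    J = size(Gρq, 2)             # Gρq has shape (k, j, l)
    out = zeros(I, J)
    for l in 1:L, j in 1:J, k in 1:K, i in 1:I
        out[i, j] += Gmρ[i, k, l] * Gρq[k, j, l]
    end
    return out
end
```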


If I’m understanding correctly, it looks like `multiplyAndSum` may also be formulated as a product of block matrices, i.e.,

``````
multiplyAndSum2(Gmρ,Gρq) = hcat(Gmρ...)*vcat(Gρq...)
``````

This likely isn’t the most efficient way to go about it, but on my machine it gets a decent speed-up!
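The identity at work here is that a sum of per-pair products equals one big block-matrix product, ∑ᵢ AᵢBᵢ = [A₁ ⋯ Aₙ][B₁; …; Bₙ]. A tiny sanity check with made-up sizes:

```julia
# Sum of pairwise products vs. one product of the concatenated blocks.
A = [rand(3, 2) for _ in 1:4]
B = [rand(2, 3) for _ in 1:4]
sum(A .* B) ≈ hcat(A...) * vcat(B...)  # true
```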

``````
julia> multiplyAndSum(Gmρ,Gρq) ≈ multiplyAndSum2(Gmρ,Gρq)
true

julia> @btime multiplyAndSum($Gmρ,$Gρq);
2.507 s (4095 allocations: 3.81 GiB)

julia> @btime multiplyAndSum2($Gmρ,$Gρq);
108.533 ms (22 allocations: 142.64 MiB)
``````

(these are for `dim = 500` from your MWE)

Edit: Just saw @Mason’s post above using `Tullio` and `LoopVectorization` - really cool! Was curious (I’m not super familiar with these packages yet) so was eager to try it out and compare with the block matrix approach:

``````
julia> using Tullio, LoopVectorization, BenchmarkTools

julia> multiplyAndSum(Gmρ,Gρq) = sum(Gmρ .* Gρq)
multiplyAndSum (generic function with 1 method)

julia> multiplyAndSum_tullio(Gmρ, Gρq) = @tullio out[i, j] := Gmρ[i, k, l] * Gρq[k, j, l]
multiplyAndSum_tullio (generic function with 1 method)

julia> foreach([50, 500, 1000]) do dim
           Gmρ = [rand(dim,18) for _ in 1:1024]
           Gρq = [rand(18,dim) for _ in 1:1024]
           Gmρ_tensor = reduce((x, y) -> cat(x, y, dims=3), Gmρ)
           Gρq_tensor = reduce((x, y) -> cat(x, y, dims=3), Gρq)
           @show dim
           out1 = @btime multiplyAndSum_tullio($Gmρ_tensor, $Gρq_tensor)
           out2 = @btime multiplyAndSum($Gmρ, $Gρq)
           @show out1 ≈ out2
           Gmρ_block = hcat(Gmρ...)
           Gρq_block = vcat(Gρq...)
           out3 = @btime *($Gmρ_block, $Gρq_block)
           @show out1 ≈ out3
           println()
       end
dim = 50
3.042 ms (2 allocations: 19.64 KiB)
9.694 ms (4095 allocations: 39.27 MiB)
out1 ≈ out2 = true
1.390 ms (2 allocations: 19.64 KiB)
out1 ≈ out3 = true

dim = 500
290.432 ms (2 allocations: 1.91 MiB)
2.566 s (4095 allocations: 3.81 GiB)
out1 ≈ out2 = true
67.066 ms (2 allocations: 1.91 MiB)
out1 ≈ out3 = true

dim = 1000
1.125 s (2 allocations: 7.63 MiB)
6.858 s (4095 allocations: 15.25 GiB)
out1 ≈ out2 = true
276.525 ms (2 allocations: 7.63 MiB)
out1 ≈ out3 = true
``````

Not sure this is an apples to apples comparison though, since (if I recall) the BLAS multiply is multi-threaded and I’m not sure if I’m calling `multiplyAndSum_tullio` correctly to get the same benefit…
Seems it was not! The two approaches look much more similar in @Mason’s results below.


`reduce(hcat, v)` and `reduce(vcat, v)` are actually optimized to be much more performant than `hcat(v...)` and `vcat(v...)`.
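A quick way to see the difference on your own machine (exact timings will vary, but the results are identical; the sizes here are just the ones from this thread):

```julia
using BenchmarkTools

v = [rand(18, 500) for _ in 1:1024]
@btime vcat($v...);       # splats 1024 arguments into one call
@btime reduce(vcat, $v);  # specialized method, much faster
```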

Here are the numbers I see comparing an optimized version of your solution (plus a version where the concatenation doesn’t happen at runtime) to the Tullio version:

``````
julia> multiplyAndSum2(Gmρ,Gρq) = reduce(hcat, Gmρ)*reduce(vcat, Gρq)
multiplyAndSum2 (generic function with 1 method)

julia> multiplyAndSum3(Gmρ_block,Gρq_block) = Gmρ_block*Gρq_block
multiplyAndSum3 (generic function with 1 method)

julia> let dim = 500
           Gmρ = [rand(dim,18) for _ in 1:1024]
           Gρq = [rand(18,dim) for _ in 1:1024]
           Gmρ_tensor = reduce((x, y) -> cat(x, y, dims=3), Gmρ)
           Gρq_tensor = reduce((x, y) -> cat(x, y, dims=3), Gρq)

           Gmρ_block = reduce(hcat, Gmρ)
           Gρq_block = reduce(vcat, Gρq)

           out1 = @btime multiplyAndSum_tullio($Gmρ_tensor, $Gρq_tensor)
           out2 = @btime multiplyAndSum($Gmρ, $Gρq)
           out3 = @btime multiplyAndSum2($Gmρ, $Gρq)
           out4 = @btime multiplyAndSum3($Gmρ_block, $Gρq_block)
           @show out1 ≈ out2 ≈ out3
       end
71.913 ms (76 allocations: 1.91 MiB)
1.241 s (4095 allocations: 3.81 GiB)
118.128 ms (6 allocations: 142.53 MiB)
70.622 ms (2 allocations: 1.91 MiB)
out1 ≈ out2 ≈ out3 = true
``````

If the `cat`ing is hoisted out, your solution is ever so slightly faster. Nice!


Oh! Haha you beat me to it! Great to know about `reduce(hcat, v)` vs `hcat(v...)` - thanks!

Also, looking at your numbers, I’m pretty convinced I wasn’t calling `multiplyAndSum_tullio` correctly. Will edit my post to make note of that.

Did I need to launch Julia multi-threaded? Sorry, I’m not too familiar with `Tullio` and `LoopVectorization` - though you are doing a great job of motivating me to learn!


Jinx!

Tullio is multithreaded, but it uses Julia’s composable, task-based multithreading. This means you can call multi-threaded Tullio functions from other multithreaded functions without a catastrophic loss of performance (unlike BLAS), but it also means you need to tell Julia to start with multiple threads.
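For example (a hypothetical sketch, assuming Julia was started with more than one thread), a threaded Tullio kernel can safely be called from inside a `Threads.@threads` loop:

```julia
using Tullio, LoopVectorization

# A small threaded Tullio kernel...
matmul(A, B) = @tullio C[i, j] := A[i, k] * B[k, j]

# ...called from an outer threaded loop. Tullio's task-based scheduling
# composes with the outer loop instead of oversubscribing the machine.
results = Vector{Matrix{Float64}}(undef, 8)
Threads.@threads for n in 1:8
    results[n] = matmul(rand(32, 32), rand(32, 32))
end
```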

If you’re on Julia 1.5, you can do this with the `-t` flag:

``````
[mason@mason-pc ~]$ julia -t6
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.5.2 (2020-09-23)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |

julia> Threads.nthreads()
6
``````

otherwise, you’ll have to use the `JULIA_NUM_THREADS` environment variable:

``````
[mason@mason-pc ~]$ JULIA_NUM_THREADS=6 julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.5.2 (2020-09-23)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |

julia> Threads.nthreads()
6
``````

Aha! I thought that might be the case - very cool. Thanks for explaining!

Side note: It’s really neat to see how well a flexible approach like `Tullio` does here - impressive!


Excellent. Thanks all! Awesome community here.


Here’s a pretty straightforward solution that doesn’t explicitly use threading, but still gives close to a 10x speedup:

``````
using LinearAlgebra

function mulsum(A, B)
    out = first(A) * first(B)
    temp = similar(out)
    for i in eachindex(A, B)[2:end]
        mul!(temp, A[i], B[i])
        out .+= temp
    end
    return out
end
``````

I didn’t bother with BenchmarkTools for once, since it just takes too long.
(After warmup):

``````
julia> @time multiplyAndSum(Gmρ, Gρq);
15.742944 seconds (4.09 k allocations: 15.251 GiB, 13.37% gc time)

julia> @time mulsum(Gmρ, Gρq);
1.842141 seconds (4 allocations: 15.259 MiB, 0.32% gc time)
``````

The point is to avoid unnecessary allocations by doing the multiplications in place with `LinearAlgebra.mul!`.


Even better, you can use `mul!(out, A[i], B[i], true, true)`, which avoids needing `temp`. This is the fastest variant for me:
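`mulsum_5arg` isn’t shown in the post, but given the description it is presumably something like this sketch (5-argument `mul!(C, A, B, α, β)` computes `C = α*A*B + β*C` in place):

```julia
using LinearAlgebra

# Presumed definition behind the mulsum_5arg timings: accumulate each
# product directly into `out` with 5-argument mul!, so no temporary is needed.
function mulsum_5arg(A, B)
    out = first(A) * first(B)
    for i in eachindex(A, B)[2:end]
        mul!(out, A[i], B[i], true, true)  # out = A[i]*B[i] + out, in place
    end
    return out
end
```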

``````
julia> @btime mulsum_5arg($Gmρ, $Gρq);
126.844 ms (2 allocations: 1.91 MiB)

julia> @btime mulsum($Gmρ, $Gρq);
288.329 ms (4 allocations: 3.81 MiB)

julia> @btime multiplyAndSum_tullio($Gmρ_tensor, $Gρq_tensor);
184.888 ms (52 allocations: 1.91 MiB)

julia> @time multiplyAndSum(Gmρ, Gρq);
5.014933 seconds (4.09 k allocations: 3.813 GiB, 16.50% gc time)

julia> BLAS.vendor() # using MKL.jl, on a 2-core intel laptop
:mkl

julia> size(Gmρ_tensor) # sizes as in original question
(500, 18, 1024)
``````

Tullio wins over the range of 50-3000 on my computer:

``````
julia> using Tullio, LoopVectorization, BenchmarkTools, LinearAlgebra

julia> BLAS.vendor()
:mkl

julia> multiplyAndSum_tullio(Gmρ, Gρq) = @tullio out[i, j] := Gmρ[i, k, l] * Gρq[k, j, l]
multiplyAndSum_tullio (generic function with 1 method)

julia> multiplyAndSum(Gmρ,Gρq) = sum(Gmρ .* Gρq)
multiplyAndSum (generic function with 1 method)

julia> function mulsum(A::AbstractArray{<:Any,3}, B::AbstractArray{<:Any,3})
           axis3 = axes(A,3)
           @assert (axis3 == axes(B,3)) && !isempty(axis3)
           out = @view(A[:,:,begin]) * @view(B[:,:,begin])
           for i in first(axis3)+1:last(axis3)
               @views mul!(out, A[:,:,i], B[:,:,i], true, true)
           end
           out
       end
mulsum (generic function with 1 method)

julia> function mulsum(A::AbstractVector{<:AbstractMatrix}, B::AbstractVector{<:AbstractVecOrMat})
           out = first(A) * first(B)
           for i in eachindex(A, B)[2:end]
               mul!(out, A[i], B[i], true, true)
           end
           return out
       end
mulsum (generic function with 2 methods)

julia> foreach([50,100,150,200,250,500,750,1000,3000]) do dim
           Gmρ = rand(dim,18,1024); Gmρ_array = collect(eachslice(Gmρ, dims=3));
           Gρq = rand(18,dim,1024); Gρq_array = collect(eachslice(Gρq, dims=3));
           @show dim
           out1 = @btime multiplyAndSum_tullio($Gmρ, $Gρq)
           out2 = @btime multiplyAndSum($(eachslice(Gmρ, dims=3)), $(eachslice(Gρq, dims=3)))
           out3 = @btime mulsum($Gmρ, $Gρq)
           out4 = @btime mulsum($Gmρ_array, $Gρq_array)
           @show out1 ≈ out2 ≈ out3 ≈ out4
           println()
       end
dim = 50
494.930 μs (243 allocations: 37.38 KiB)
6.093 ms (4099 allocations: 39.36 MiB)
1.731 ms (2 allocations: 19.64 KiB)
1.738 ms (2 allocations: 19.64 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 100
1.459 ms (244 allocations: 95.97 KiB)
20.063 ms (4099 allocations: 156.43 MiB)
10.355 ms (2 allocations: 78.20 KiB)
11.484 ms (2 allocations: 78.20 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 150
2.328 ms (244 allocations: 193.66 KiB)
38.352 ms (4099 allocations: 351.71 MiB)
20.922 ms (2 allocations: 175.89 KiB)
19.467 ms (2 allocations: 175.89 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 200
4.849 ms (246 allocations: 330.41 KiB)
66.895 ms (4099 allocations: 624.95 MiB)
9.340 ms (2 allocations: 312.58 KiB)
9.316 ms (2 allocations: 312.58 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 250
5.737 ms (246 allocations: 506.22 KiB)
114.354 ms (4099 allocations: 976.41 MiB)
31.950 ms (2 allocations: 488.39 KiB)
32.034 ms (2 allocations: 488.39 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 500
15.597 ms (245 allocations: 1.92 MiB)
540.475 ms (4099 allocations: 3.81 GiB)
34.431 ms (2 allocations: 1.91 MiB)
34.891 ms (2 allocations: 1.91 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 750
29.306 ms (244 allocations: 4.31 MiB)
1.631 s (4099 allocations: 8.58 GiB)
79.687 ms (2 allocations: 4.29 MiB)
78.346 ms (2 allocations: 4.29 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 1000
55.742 ms (244 allocations: 7.65 MiB)
2.616 s (4099 allocations: 15.25 GiB)
66.634 ms (2 allocations: 7.63 MiB)
66.764 ms (2 allocations: 7.63 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 3000
448.724 ms (249 allocations: 68.68 MiB)
32.772 s (4099 allocations: 137.26 GiB)
1.496 s (2 allocations: 68.66 MiB)
1.517 s (2 allocations: 68.66 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true
``````

Isn’t that just because of AVX512? For most people, the other tools will probably do just as well.


Not sure what exactly you mean.
LoopVectorization and MKL both use AVX512.
So does OpenBLAS on Julia `VERSION >= v"1.5"`, and so do `sum` & `.*` on Julia `VERSION < v"1.6"`.

EDIT:
Although if you mean that most people’s CPUs will probably be more similar to @mcabbott’s, and that they should expect to see results closer to his, that is a fair point.


My guess is that it multi-threads better, as many of the individual slice-multiplications aren’t big enough to benefit much, but the total problem is. On my 6-core desktop (without AVX512) I see similar numbers to Chris’s (although with OpenBLAS). While on my 2-core laptop, sequential `mul!` is fastest from size 150 on.

Edit – now I tried with MKL, and sequential `mul!` catches up. So I guess it’s about OpenBLAS being bad at threads.

``````
julia> BLAS.set_num_threads(18); BLAS.vendor()
:openblas64

dim = 50
625.519 μs (74 allocations: 24.64 KiB)
9.591 ms (4099 allocations: 39.36 MiB)
2.991 ms (2 allocations: 19.64 KiB)
2.997 ms (2 allocations: 19.64 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 100
2.253 ms (74 allocations: 83.20 KiB)
41.021 ms (4099 allocations: 156.43 MiB)
9.755 ms (2 allocations: 78.20 KiB)
9.754 ms (2 allocations: 78.20 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 150
5.251 ms (75 allocations: 180.92 KiB)
363.106 ms (4099 allocations: 351.71 MiB)
71.963 ms (2 allocations: 175.89 KiB)
71.158 ms (2 allocations: 175.89 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 200
7.870 ms (76 allocations: 317.64 KiB)
572.025 ms (4099 allocations: 624.95 MiB)
72.760 ms (2 allocations: 312.58 KiB)
72.461 ms (2 allocations: 312.58 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 250
12.850 ms (76 allocations: 493.45 KiB)
799.994 ms (4099 allocations: 976.41 MiB)
80.468 ms (2 allocations: 488.39 KiB)
82.289 ms (2 allocations: 488.39 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 500
45.773 ms (76 allocations: 1.91 MiB)
3.100 s (4099 allocations: 3.81 GiB)
118.703 ms (2 allocations: 1.91 MiB)
120.825 ms (2 allocations: 1.91 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 750
119.873 ms (76 allocations: 4.30 MiB)
5.366 s (4099 allocations: 8.58 GiB)
212.210 ms (2 allocations: 4.29 MiB)
217.976 ms (2 allocations: 4.29 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true
``````

First is `@tullio`, last two are `mul!` on slices.

Above: OpenBLAS. Below: MKL.

``````
julia> BLAS.set_num_threads(18); BLAS.vendor()
:mkl

dim = 50
625.765 μs (75 allocations: 24.67 KiB)
8.207 ms (4099 allocations: 39.36 MiB)
2.257 ms (2 allocations: 19.64 KiB)
2.245 ms (2 allocations: 19.64 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 100
2.262 ms (74 allocations: 83.20 KiB)
34.287 ms (4099 allocations: 156.43 MiB)
4.756 ms (2 allocations: 78.20 KiB)
4.816 ms (2 allocations: 78.20 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 150
5.141 ms (76 allocations: 180.95 KiB)
152.586 ms (4099 allocations: 351.71 MiB)
5.740 ms (2 allocations: 175.89 KiB)
5.727 ms (2 allocations: 175.89 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 200
7.957 ms (76 allocations: 317.64 KiB)
254.642 ms (4099 allocations: 624.95 MiB)
8.018 ms (2 allocations: 312.58 KiB)
8.038 ms (2 allocations: 312.58 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 250
12.834 ms (76 allocations: 493.45 KiB)
384.841 ms (4099 allocations: 976.41 MiB)
10.891 ms (2 allocations: 488.39 KiB)
10.981 ms (2 allocations: 488.39 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 500
47.541 ms (76 allocations: 1.91 MiB)
1.857 s (4099 allocations: 3.81 GiB)
39.667 ms (2 allocations: 1.91 MiB)
39.889 ms (2 allocations: 1.91 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 750
120.173 ms (76 allocations: 4.30 MiB)
3.254 s (4099 allocations: 8.58 GiB)
71.169 ms (2 allocations: 4.29 MiB)
70.832 ms (2 allocations: 4.29 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 1000
186.777 ms (77 allocations: 7.63 MiB)
5.054 s (4099 allocations: 15.25 GiB)
121.888 ms (2 allocations: 7.63 MiB)
122.270 ms (2 allocations: 7.63 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true
``````