Multiply dense arrays

Is there a way to speed up basic multiplication of dense arrays? I have the following:

using BenchmarkTools

multiplyAndSum(Gmρ,Gρq) = sum(Gmρ .* Gρq)

dim = 500 #target is 3000
Gmρ = [rand(dim,18) for _ in 1:1024]
Gρq = [rand(18,dim) for _ in 1:1024]

@btime multiplyAndSum(Gmρ,Gρq)

with dim = 500 I get:
2.129 seconds (116.26 k allocations: 3.820 GiB, 16.75% gc time)

with dim = 3000 I get:
414.674 seconds (116.30 k allocations: 137.335 GiB, 29.32% gc time)


Use a GPU


A better strategy than your current one would be not to use an array of arrays, and instead write this as a 3D tensor contraction. BLAS does not support that, but Tullio.jl + LoopVectorization.jl is basically just as good, if not better.

julia> using Tullio, LoopVectorization, BenchmarkTools

julia> multiplyAndSum(Gmρ,Gρq) = sum(Gmρ .* Gρq)
multiplyAndSum (generic function with 1 method)

julia> multiplyAndSum_tullio(Gmρ, Gρq) = @tullio out[i, j] := Gmρ[i, k, l] * Gρq[k, j, l]
multiplyAndSum_tullio (generic function with 1 method)
julia> foreach([50, 500, 1000]) do dim
           Gmρ = [rand(dim,18) for _ in 1:1024]
           Gρq = [rand(18,dim) for _ in 1:1024]
           Gmρ_tensor = reduce((x, y) -> cat(x, y, dims=3), Gmρ)
           Gρq_tensor = reduce((x, y) -> cat(x, y, dims=3), Gρq)
           @show dim
           out1 = @btime multiplyAndSum_tullio($Gmρ_tensor, $Gρq_tensor)
           out2 = @btime multiplyAndSum($Gmρ, $Gρq)
           @show out1 ≈ out2
           println()
       end
        
dim = 50
  1.556 ms (74 allocations: 24.88 KiB)
  12.235 ms (4095 allocations: 39.27 MiB)
  out1 ≈ out2 = true


dim = 500
  74.228 ms (76 allocations: 1.91 MiB)
  721.703 ms (4095 allocations: 3.81 GiB)
  out1 ≈ out2 = true


dim = 1000
  308.981 ms (77 allocations: 7.63 MiB)
  6.334 s (4095 allocations: 15.25 GiB)
  out1 ≈ out2 = true

A couple of notes:

  • you need to do the using LoopVectorization part to get high performance here.

  • The lines

           Gmρ_tensor = reduce((x, y) -> cat(x, y, dims=3), Gmρ)
           Gρq_tensor = reduce((x, y) -> cat(x, y, dims=3), Gρq)

are just me turning your arrays of arrays into equivalent 3D tensors. It’s best to generate the data in this form from the start; converting after creation, as I do here, is inefficient.
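For example (a sketch using the sizes from the original post), generating the data directly in tensor form looks like:

```julia
dim = 500
# build 3D arrays up front instead of vectors of 1024 matrices
Gmρ_tensor = rand(dim, 18, 1024)   # replaces [rand(dim, 18) for _ in 1:1024]
Gρq_tensor = rand(18, dim, 1024)   # replaces [rand(18, dim) for _ in 1:1024]
```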

  • The Tullio tensor contraction I wrote:
 multiplyAndSum_tullio(Gmρ, Gρq) = @tullio out[i, j] := Gmρ[i, k, l] * Gρq[k, j, l]

essentially says create a 2D matrix out whose [i,j]th element is the sum over the indices k and l of Gmρ[i, k, l] * Gρq[k, j, l].
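Spelled out as plain nested loops (just to show the semantics; Tullio generates a much faster threaded and vectorized equivalent), the contraction is:

```julia
# Naive loop version of the @tullio contraction, for illustration only
function multiplyAndSum_loops(Gmρ, Gρq)
    out = zeros(size(Gmρ, 1), size(Gρq, 2))
    for l in axes(Gmρ, 3), k in axes(Gmρ, 2), j in axes(out, 2), i in axes(out, 1)
        out[i, j] += Gmρ[i, k, l] * Gρq[k, j, l]
    end
    return out
end
```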


If I’m understanding correctly, it looks like multiplyAndSum may also be formulated as a product of block matrices, i.e.,

multiplyAndSum2(Gmρ,Gρq) = hcat(Gmρ...)*vcat(Gρq...)
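This works because a block row times a block column is the sum of the blockwise products: hcat(A1, ..., An) * vcat(B1, ..., Bn) == A1*B1 + ... + An*Bn. A tiny sanity check of that identity:

```julia
A = [rand(2, 3) for _ in 1:4]
B = [rand(3, 2) for _ in 1:4]
# block-matrix product equals the sum of the per-block products
hcat(A...) * vcat(B...) ≈ sum(A .* B)   # true
```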

It likely isn’t the most efficient way to go about this, but on my machine it gets a decent speed-up! :slight_smile:

julia> multiplyAndSum(Gmρ,Gρq) ≈ multiplyAndSum2(Gmρ,Gρq)
true

julia> @btime multiplyAndSum($Gmρ,$Gρq);
  2.507 s (4095 allocations: 3.81 GiB)

julia> @btime multiplyAndSum2($Gmρ,$Gρq);
  108.533 ms (22 allocations: 142.64 MiB)

(these are for dims=500 from your MWE)


Edit: Just saw @Mason’s post above using Tullio and LoopVectorization - really cool! :slight_smile: I’m not super familiar with these packages yet, so I was eager to try them out and compare with the block matrix approach:

julia> using Tullio, LoopVectorization, BenchmarkTools

julia> multiplyAndSum(Gmρ,Gρq) = sum(Gmρ .* Gρq)
multiplyAndSum (generic function with 1 method)

julia> multiplyAndSum_tullio(Gmρ, Gρq) = @tullio out[i, j] := Gmρ[i, k, l] * Gρq[k, j, l]
multiplyAndSum_tullio (generic function with 1 method)

julia> foreach([50, 500, 1000]) do dim
           Gmρ = [rand(dim,18) for _ in 1:1024]
           Gρq = [rand(18,dim) for _ in 1:1024]
           Gmρ_tensor = reduce((x, y) -> cat(x, y, dims=3), Gmρ)
           Gρq_tensor = reduce((x, y) -> cat(x, y, dims=3), Gρq)
           @show dim
           out1 = @btime multiplyAndSum_tullio($Gmρ_tensor, $Gρq_tensor)
           out2 = @btime multiplyAndSum($Gmρ, $Gρq)
           @show out1 ≈ out2
           Gmρ_block = hcat(Gmρ...)
           Gρq_block = vcat(Gρq...)
           out3 = @btime *($Gmρ_block, $Gρq_block)
           @show out1 ≈ out3
           println()
       end
dim = 50
  3.042 ms (2 allocations: 19.64 KiB)
  9.694 ms (4095 allocations: 39.27 MiB)
out1 ≈ out2 = true
  1.390 ms (2 allocations: 19.64 KiB)
out1 ≈ out3 = true

dim = 500
  290.432 ms (2 allocations: 1.91 MiB)
  2.566 s (4095 allocations: 3.81 GiB)
out1 ≈ out2 = true
  67.066 ms (2 allocations: 1.91 MiB)
out1 ≈ out3 = true

dim = 1000
  1.125 s (2 allocations: 7.63 MiB)
  6.858 s (4095 allocations: 15.25 GiB)
out1 ≈ out2 = true
  276.525 ms (2 allocations: 7.63 MiB)
out1 ≈ out3 = true

Not sure this is an apples-to-apples comparison though, since (if I recall) the BLAS multiply is multi-threaded and I’m not sure if I’m calling multiplyAndSum_tullio correctly to get the same benefit…
Seems it was not! The two approaches look much more similar in @Mason’s results below. :slight_smile:


reduce(hcat, v) and reduce(vcat, v) are actually optimized to be much more performant than hcat(v...) and vcat(v...).
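A quick sketch of the two forms: hcat(v...) splats every element into one call (compiling a method specialized on the exact argument count), while reduce(hcat, v) dispatches to a specialized method that allocates the result once:

```julia
v = [rand(3, 2) for _ in 1:1000]

A = hcat(v...)       # splatting: slow to compile and run for long vectors
B = reduce(hcat, v)  # specialized method: one allocation, one copy pass

A == B  # identical results, very different scaling
```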

Here are the numbers I see comparing an optimized version of your solution (and a version where the concatenation doesn’t happen at runtime) to the Tullio version:

julia> multiplyAndSum2(Gmρ,Gρq) = reduce(hcat, Gmρ)*reduce(vcat, Gρq)
multiplyAndSum2 (generic function with 1 method)

julia> multiplyAndSum3(Gmρ_block,Gρq_block) = Gmρ_block*Gρq_block
multiplyAndSum3 (generic function with 1 method)

julia> let dim = 500
           Gmρ = [rand(dim,18) for _ in 1:1024]
           Gρq = [rand(18,dim) for _ in 1:1024]
           Gmρ_tensor = reduce((x, y) -> cat(x, y, dims=3), Gmρ)
           Gρq_tensor = reduce((x, y) -> cat(x, y, dims=3), Gρq)
           
           Gmρ_block = reduce(hcat, Gmρ)
           Gρq_block = reduce(vcat, Gρq)
           
           out1 = @btime multiplyAndSum_tullio($Gmρ_tensor, $Gρq_tensor)
           out2 = @btime multiplyAndSum($Gmρ, $Gρq)
           out3 = @btime multiplyAndSum2($Gmρ, $Gρq)
           out4 = @btime multiplyAndSum3($Gmρ_block, $Gρq_block) 
           @show out1 ≈ out2 ≈ out3
       end
  71.913 ms (76 allocations: 1.91 MiB)
  1.241 s (4095 allocations: 3.81 GiB)
  118.128 ms (6 allocations: 142.53 MiB)
  70.622 ms (2 allocations: 1.91 MiB)
out1 ≈ out2 ≈ out3 = true

If the concatenation is hoisted out, your solution is ever so slightly faster. Nice!


Oh! Haha you beat me to it! Great to know about reduce(hcat, v) vs hcat(v...) - thanks!

Also, looking at your numbers, I’m pretty convinced I wasn’t calling multiplyAndSum_tullio correctly. Will edit my post to make note of that.

Did I need to launch Julia multi-threaded? Sorry, I’m not too familiar with Tullio and LoopVectorization - though you are doing a great job of motivating me to learn! :wink:


Jinx!

Tullio is multithreaded, but it uses Julia’s composable task-based multithreading. This means you can call multithreaded Tullio functions from other multithreaded functions without a catastrophic loss of performance (unlike BLAS), but it also means you need to tell Julia to start itself with multiple threads.

If you’re on Julia 1.5, you can do this with the -t flag:

[mason@mason-pc ~]$ julia -t6
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.5.2 (2020-09-23)
 _/ |\__'_|_|_|\__'_|  |  
|__/                   |

julia> Threads.nthreads()
6

Otherwise, you’ll have to use the JULIA_NUM_THREADS environment variable:

[mason@mason-pc ~]$ JULIA_NUM_THREADS=6 julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.5.2 (2020-09-23)
 _/ |\__'_|_|_|\__'_|  |  
|__/                   |

julia> Threads.nthreads()
6


Aha! I thought that might be the case - very cool. Thanks for explaining! :slight_smile:

Side note: It’s really neat to see how well a flexible approach like Tullio does here - impressive!


Excellent. Thanks all! Awesome community here.


Here’s a pretty straightforward solution that doesn’t explicitly use threading, but still gives close to a 10x speedup:

using LinearAlgebra

function mulsum(A, B)
    out = first(A) * first(B)
    temp = similar(out)
    for i in eachindex(A, B)[2:end]
        mul!(temp, A[i], B[i])
        out .+= temp
    end
    return out
end

I didn’t bother with BenchmarkTools for once, since it just takes too long.
(After warmup):

julia> @time multiplyAndSum(Gmp, Gpq);
 15.742944 seconds (4.09 k allocations: 15.251 GiB, 13.37% gc time)

julia> @time mulsum(Gmp, Gpq);
  1.842141 seconds (4 allocations: 15.259 MiB, 0.32% gc time)

The point is to avoid unnecessary allocations, by doing the multiplications in-place with LinearAlgebra.mul!


Even better, you can use mul!(out, A[i], B[i], true, true) which avoids needing temp. This is the fastest variant for me:

julia> @btime mulsum_5arg($Gmρ, $Gρq);
  126.844 ms (2 allocations: 1.91 MiB)

julia> @btime mulsum($Gmρ, $Gρq);
  288.329 ms (4 allocations: 3.81 MiB)

julia> @btime multiplyAndSum_tullio($Gmρ_tensor, $Gρq_tensor);
  184.888 ms (52 allocations: 1.91 MiB)

julia> @time multiplyAndSum(Gmρ, Gρq);
  5.014933 seconds (4.09 k allocations: 3.813 GiB, 16.50% gc time)

julia> BLAS.vendor() # using MKL.jl, on a 2-core intel laptop
:mkl

julia> size(Gmρ_tensor) # sizes as in original question
(500, 18, 1024)
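For reference, the mulsum_5arg benchmarked above is just mulsum with the temporary removed: 5-argument mul!(C, A, B, α, β) computes C = A*B*α + C*β in place, so α = β = true accumulates into out. A sketch:

```julia
using LinearAlgebra

function mulsum_5arg(A, B)
    out = first(A) * first(B)              # one allocation for the accumulator
    for i in eachindex(A, B)[2:end]
        mul!(out, A[i], B[i], true, true)  # out .= A[i]*B[i] .+ out, no temp
    end
    return out
end
```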

Tullio wins over the range of 50-3000 on my computer:

julia> using Tullio, LoopVectorization, BenchmarkTools, LinearAlgebra

julia> BLAS.set_num_threads(18); BLAS.vendor()
:mkl

julia> multiplyAndSum_tullio(Gmρ, Gρq) = @tullio out[i, j] := Gmρ[i, k, l] * Gρq[k, j, l]
multiplyAndSum_tullio (generic function with 1 method)

julia> multiplyAndSum(Gmρ,Gρq) = sum(Gmρ .* Gρq)
multiplyAndSum (generic function with 1 method)

julia> function mulsum(A::AbstractArray{<:Any,3}, B::AbstractArray{<:Any,3})
           axis3 = axes(A,3)
           @assert (axis3 == axes(B,3)) && !isempty(axis3)
           out = @view(A[:,:,begin]) * @view(B[:,:,begin])
           for i in first(axis3)+1:last(axis3)
               @views mul!(out, A[:,:,i], B[:,:,i], true, true)
           end
           out
       end
mulsum (generic function with 1 method)

julia> function mulsum(A::AbstractVector{<:AbstractMatrix}, B::AbstractVector{<:AbstractVecOrMat})
           out = first(A) * first(B)
           for i in eachindex(A, B)[2:end]
               mul!(out, A[i], B[i], true, true)
           end
           return out
       end
mulsum (generic function with 2 methods)

julia> foreach([50,100,150,200,250,500,750,1000,3000]) do dim
           Gmρ = rand(dim,18,1024); Gmρ_array = collect(eachslice(Gmρ, dims=3));
           Gρq = rand(18,dim,1024); Gρq_array = collect(eachslice(Gρq, dims=3));
           @show dim
           out1 = @btime multiplyAndSum_tullio($Gmρ, $Gρq)
           out2 = @btime multiplyAndSum($(eachslice(Gmρ, dims=3)), $(eachslice(Gρq, dims=3)))
           out3 = @btime mulsum($Gmρ, $Gρq)
           out4 = @btime mulsum($Gmρ_array, $Gρq_array)
           @show out1 ≈ out2 ≈ out3 ≈ out4
           println()
       end
dim = 50
  494.930 μs (243 allocations: 37.38 KiB)
  6.093 ms (4099 allocations: 39.36 MiB)
  1.731 ms (2 allocations: 19.64 KiB)
  1.738 ms (2 allocations: 19.64 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 100
  1.459 ms (244 allocations: 95.97 KiB)
  20.063 ms (4099 allocations: 156.43 MiB)
  10.355 ms (2 allocations: 78.20 KiB)
  11.484 ms (2 allocations: 78.20 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 150
  2.328 ms (244 allocations: 193.66 KiB)
  38.352 ms (4099 allocations: 351.71 MiB)
  20.922 ms (2 allocations: 175.89 KiB)
  19.467 ms (2 allocations: 175.89 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 200
  4.849 ms (246 allocations: 330.41 KiB)
  66.895 ms (4099 allocations: 624.95 MiB)
  9.340 ms (2 allocations: 312.58 KiB)
  9.316 ms (2 allocations: 312.58 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 250
  5.737 ms (246 allocations: 506.22 KiB)
  114.354 ms (4099 allocations: 976.41 MiB)
  31.950 ms (2 allocations: 488.39 KiB)
  32.034 ms (2 allocations: 488.39 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 500
  15.597 ms (245 allocations: 1.92 MiB)
  540.475 ms (4099 allocations: 3.81 GiB)
  34.431 ms (2 allocations: 1.91 MiB)
  34.891 ms (2 allocations: 1.91 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 750
  29.306 ms (244 allocations: 4.31 MiB)
  1.631 s (4099 allocations: 8.58 GiB)
  79.687 ms (2 allocations: 4.29 MiB)
  78.346 ms (2 allocations: 4.29 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 1000
  55.742 ms (244 allocations: 7.65 MiB)
  2.616 s (4099 allocations: 15.25 GiB)
  66.634 ms (2 allocations: 7.63 MiB)
  66.764 ms (2 allocations: 7.63 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 3000
  448.724 ms (249 allocations: 68.68 MiB)
  32.772 s (4099 allocations: 137.26 GiB)
  1.496 s (2 allocations: 68.66 MiB)
  1.517 s (2 allocations: 68.66 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

Isn’t that just because of AVX512? For most people, other tools will probably do just as well.


Not sure what exactly you mean.
LoopVectorization and MKL both use AVX512.
As do OpenBLAS on Julia VERSION >= v"1.5", and sum & .* on Julia VERSION < v"1.6".

EDIT:
Although if you mean that most people’s CPUs will probably be more similar to @mcabbott’s, and that they should expect to see results closer to his, that is a fair point.


My guess is that it multi-threads better, as many of the individual slice-multiplications aren’t big enough to benefit much, but the total problem is. On my 6-core desktop (without AVX512) I see similar numbers to Chris’s (although with OpenBLAS). While on my 2-core laptop, sequential mul! is fastest from size 150 on.

Edit – now I tried with MKL, and sequential mul! catches up. So I guess it’s about OpenBLAS being bad at threads.

julia> BLAS.set_num_threads(18); BLAS.vendor()
:openblas64

dim = 50
  625.519 μs (74 allocations: 24.64 KiB)
  9.591 ms (4099 allocations: 39.36 MiB)
  2.991 ms (2 allocations: 19.64 KiB)
  2.997 ms (2 allocations: 19.64 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 100
  2.253 ms (74 allocations: 83.20 KiB)
  41.021 ms (4099 allocations: 156.43 MiB)
  9.755 ms (2 allocations: 78.20 KiB)
  9.754 ms (2 allocations: 78.20 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 150
  5.251 ms (75 allocations: 180.92 KiB)
  363.106 ms (4099 allocations: 351.71 MiB)
  71.963 ms (2 allocations: 175.89 KiB)
  71.158 ms (2 allocations: 175.89 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 200
  7.870 ms (76 allocations: 317.64 KiB)
  572.025 ms (4099 allocations: 624.95 MiB)
  72.760 ms (2 allocations: 312.58 KiB)
  72.461 ms (2 allocations: 312.58 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 250
  12.850 ms (76 allocations: 493.45 KiB)
  799.994 ms (4099 allocations: 976.41 MiB)
  80.468 ms (2 allocations: 488.39 KiB)
  82.289 ms (2 allocations: 488.39 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 500
  45.773 ms (76 allocations: 1.91 MiB)
  3.100 s (4099 allocations: 3.81 GiB)
  118.703 ms (2 allocations: 1.91 MiB)
  120.825 ms (2 allocations: 1.91 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 750
  119.873 ms (76 allocations: 4.30 MiB)
  5.366 s (4099 allocations: 8.58 GiB)
  212.210 ms (2 allocations: 4.29 MiB)
  217.976 ms (2 allocations: 4.29 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

First is @tullio, last two are mul! on slices.

Julia 1.5.2, i7-8700, 6 threads.

Above, OpenBLAS, below, MKL.

julia> BLAS.set_num_threads(18); BLAS.vendor()
:mkl

dim = 50
  625.765 μs (75 allocations: 24.67 KiB)
  8.207 ms (4099 allocations: 39.36 MiB)
  2.257 ms (2 allocations: 19.64 KiB)
  2.245 ms (2 allocations: 19.64 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 100
  2.262 ms (74 allocations: 83.20 KiB)
  34.287 ms (4099 allocations: 156.43 MiB)
  4.756 ms (2 allocations: 78.20 KiB)
  4.816 ms (2 allocations: 78.20 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 150
  5.141 ms (76 allocations: 180.95 KiB)
  152.586 ms (4099 allocations: 351.71 MiB)
  5.740 ms (2 allocations: 175.89 KiB)
  5.727 ms (2 allocations: 175.89 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 200
  7.957 ms (76 allocations: 317.64 KiB)
  254.642 ms (4099 allocations: 624.95 MiB)
  8.018 ms (2 allocations: 312.58 KiB)
  8.038 ms (2 allocations: 312.58 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 250
  12.834 ms (76 allocations: 493.45 KiB)
  384.841 ms (4099 allocations: 976.41 MiB)
  10.891 ms (2 allocations: 488.39 KiB)
  10.981 ms (2 allocations: 488.39 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 500
  47.541 ms (76 allocations: 1.91 MiB)
  1.857 s (4099 allocations: 3.81 GiB)
  39.667 ms (2 allocations: 1.91 MiB)
  39.889 ms (2 allocations: 1.91 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 750
  120.173 ms (76 allocations: 4.30 MiB)
  3.254 s (4099 allocations: 8.58 GiB)
  71.169 ms (2 allocations: 4.29 MiB)
  70.832 ms (2 allocations: 4.29 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 1000
  186.777 ms (77 allocations: 7.63 MiB)
  5.054 s (4099 allocations: 15.25 GiB)
  121.888 ms (2 allocations: 7.63 MiB)
  122.270 ms (2 allocations: 7.63 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true