 # Poor performance multiplying many (large) matrices multithreaded

I am trying to multiply a bunch of (large and differently sized) matrices, and then do some reduction following each multiply.

Basically I tried to do this in a multithreaded loop, where I see good speed gains for small-ish matrices, but performance quickly degrades for large-ish matrices.

```julia
using LinearAlgebra, BenchmarkTools, Random

function mult_reduce(M::Vector{Matrix{Float64}})
    n = length(M)
    s = 0.0
    for i in 1:n
        tmp = Transpose(M[i]) * M[i] # calculate Mi' Mi
        s  += sum(tmp)               # reduction
    end
    return s
end

function mult_reduce_threaded(M::Vector{Matrix{Float64}})
    n = length(M)
    s = zeros(Threads.nthreads())    # one accumulator per thread
    Threads.@threads for i in 1:n
        tmp = Transpose(M[i]) * M[i]      # calculate Mi' Mi
        s[Threads.threadid()] += sum(tmp) # reduction
    end
    return sum(s) # reduce across threads
end
```

On small problems, I get 4~5x speedup with 8 physical cores:

```julia
# (small) test data
Random.seed!(2020)
n = 100
M = Vector{Matrix{Float64}}(undef, n)
for i in 1:n
    k = rand(100:200) # random matrix dimension
    M[i] = rand(k, k)
end

# serial
18.225 ms (200 allocations: 18.71 MiB)
1.0128044609863697e8

# Threads.@threads
4.451 ms (259 allocations: 18.72 MiB)
1.0128044609863696e8
```

```julia
# (larger) test data
Random.seed!(2020)
n = 100
M = Vector{Matrix{Float64}}(undef, n)
for i in 1:n
    k = rand(1000:3000) # random matrix dimension
    M[i] = rand(k, k)
end

# serial (multithreaded BLAS)
5.987 s (200 allocations: 3.12 GiB)
2.368913048551288e11

# serial (BLAS threads = 1)
21.042 s (200 allocations: 3.12 GiB)
2.368913048551288e11

# Threads.@threads (BLAS threads = 1)
9.937 s (275 allocations: 3.12 GiB)
2.3689130485512885e11
```

Discussions in this post seem to suggest that any allocation is bad for multithreaded code, since garbage collection is single-threaded.

Does this mean my application is doomed? I cannot preallocate the intermediate matrices (the variable `tmp`) since each matrix is differently sized.

I'd appreciate any suggestions for speeding up these calculations. On my real problem, I am seeing about 2x speed gains using 10 cores on a cluster. That isn't as bad as this example, but it is still quite disappointing.

The first thing to try is setting BLAS threads to 1. With big matrices, you're probably oversubscribing your CPU.


Yes, BLAS’s threads don’t play nicely with Julia’s threads (yet). Large matrix multiplication is already multithreaded but there’s a heuristic that turns it off for sufficiently small matrices. I don’t recall the exact cutoffs, but that may explain the differences here. Just use `BLAS.set_num_threads(1)` if you’re going to rely upon Julia’s threading at a higher level instead.
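In code, that looks something like this (the restore step at the end is optional and assumes you want full BLAS threading back for serial sections):

```julia
using LinearAlgebra

# Pin BLAS to one thread so each task in a Threads.@threads loop runs its
# gemm serially; 8 Julia threads × N BLAS threads oversubscribes the CPU.
BLAS.set_num_threads(1)

# ... run the Threads.@threads loop over the matrices here ...

# Optionally restore BLAS threading afterwards:
BLAS.set_num_threads(Sys.CPU_THREADS)
```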


Thank you both for your replies. I updated my post to include timings from `BLAS.set_num_threads(1)`, showing roughly 2x speedup on 8 cores. I had actually already tried this in my application, where it did not help.

2x speedup on 8 cores still seems poor, but I take this as "there's nothing we can do at the moment". Is this correct?

```julia
using LoopVectorization, LinearAlgebra, Random, BenchmarkTools

function sumAᵀA(A::AbstractMatrix{T}) where {T}
    s = zero(T)
    # Unfortunately, this pattern in loops seems necessary for good performance.
    # I'll look into why the more obvious
    # `for m ∈ axes(A,1), n ∈ axes(A,1), k ∈ axes(A,2); s += ...`
    # performs poorly later.
    @avx for m ∈ axes(A,1), n ∈ axes(A,1)
        t = zero(T)
        for k ∈ axes(A,2)
            t += A[m,k] * A[n,k]
        end
        s += t
    end
    s
end
basesumAᵀA(A) = sum(Transpose(A) * A)

function mult_reduce(f::F, M::Vector{Matrix{Float64}}) where {F}
    n = length(M)
    s = 0.0
    for i in 1:n
        s += f(M[i])
    end
    return s
end

function mult_reduce_threaded(f::F, M::Vector{Matrix{Float64}}) where {F}
    n = length(M)
    s = zeros(Threads.nthreads())
    Threads.@threads for i in 1:n
        s[Threads.threadid()] += f(M[i])
    end
    return sum(s) # reduce across threads
end

Random.seed!(2020);
n = 100
M = Vector{Matrix{Float64}}(undef, n);
for i in 1:n
    k = rand(100:200) # random matrix dimension
    M[i] = rand(k, k)
end
```

This yields:

```julia
julia> @btime mult_reduce(basesumAᵀA, $M) # 1 thread
  8.981 ms (200 allocations: 17.67 MiB)
9.15843796874148e7

julia> @btime mult_reduce(sumAᵀA, $M) # 1 thread
  7.832 ms (0 allocations: 0 bytes)
9.158351500138862e7

# threaded, basesumAᵀA
  919.900 μs (292 allocations: 17.69 MiB)
9.158437968741481e7

# threaded, sumAᵀA
  648.190 μs (92 allocations: 13.39 KiB)
9.158351500138861e7
```

How big are the arrays you need for the actual application?

Raw LoopVectorization will start performing pretty badly beyond a certain size for matrix multiplication. That size is heavily CPU dependent, and based on CPU cache.
For my newest/fastest computer (the one I ran these benchmarks on) that size is rather large, over 200x200. For an old laptop, it’s closer to 50x50.

You can address this with blocking strategies (i.e., perform the matrix multiplication blockwise, where you copy each block into a preallocated buffer first) or if they’re not too large, perhaps simply tiling the arrays (similar to blocking, but using views instead of copying memory).
If you do this, you can get 0 allocations per mul, and achieve better performance than BLAS + sum.
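A minimal sketch of the blocking idea, using plain BLAS `mul!` into a preallocated buffer rather than LoopVectorization to stay self-contained (`bs` is a tunable, cache-dependent block width; the function name is illustrative):

```julia
using LinearAlgebra

# Sum of AᵀA computed tile by tile; each small multiply reuses `buf`.
function sumAᵀA_tiled(A::AbstractMatrix{Float64}; bs::Int = 64)
    n = size(A, 2)
    buf = Matrix{Float64}(undef, bs, bs)   # scratch reused for every tile pair
    s = 0.0
    for j0 in 1:bs:n, i0 in 1:bs:n
        i1 = min(i0 + bs - 1, n)
        j1 = min(j0 + bs - 1, n)
        Ai = view(A, :, i0:i1)             # tile of columns, no copy
        Aj = view(A, :, j0:j1)
        C  = view(buf, 1:(i1 - i0 + 1), 1:(j1 - j0 + 1))
        mul!(C, Transpose(Ai), Aj)         # C = Aiᵀ * Aj, zero allocations
        s += sum(C)
    end
    return s
end
```

Since `(AᵀA)[i,j]` depends only on columns `i` and `j` of `A`, summing `Aiᵀ Aj` over all tile pairs reproduces `sum(Transpose(A) * A)` while keeping each small multiply cache-friendly.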


My matrices are typically 512 by 5~30k. I suppose this is too large for LoopVectorization? I can probably try the tiling strategy you suggested, thank you. It may partially alleviate my allocations too.

You should also note that 512 is a bad number for CPUs. It means your columns are separated in memory by a large power of two, so if you aren't a little careful, you can see tons of cache misses as columns map onto the same cache sets and evict one another. I've seen 10x performance drops from this before.


Hmm I didn’t know this. Sorry to keep this thread alive, but can you explain why columns will be over-writing other columns or what that means here? Does that mean I should choose something like 511 instead?

I actually purposely chose 512 because, according to graphs like this, I thought matrix dimensions in multiples of 8 maximizes CPU flops.

This is a bit in the weeds, but it led to the weirdest performance bug I've ever seen. I'm going to use Zen 2 processors as an example, but Intel is very similar here.

The L2 cache on this processor is 512 KB, 8-way associative, organized as 1024 sets of 64-byte lines. If you have a matrix with 128 rows of Float64s, each column occupies 1024 bytes, so [1,1] is 1024 bytes away in memory from [1,2]. Columns at a power-of-two stride like this keep mapping to the same cache sets. By itself this isn't a problem, since the cache is 8-way associative, but all entries in a row end up competing for the same 8 cache lines, causing lots of evictions. As a result, BLAS typically packs into panels with one of the dimensions prime, to ensure that when you access one of the matrices in the memory-unfriendly order you don't get collisions.
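The arithmetic behind those numbers can be sketched directly (cache parameters as quoted above for Zen 2):

```julia
l2_bytes = 512 * 1024                   # L2 size per core
line     = 64                           # bytes per cache line
ways     = 8                            # associativity
sets     = l2_bytes ÷ (line * ways)     # number of cache sets
critical_stride = sets * line           # byte distance mapping to the same set
# sets == 1024, critical_stride == 65536 (64 KiB):
# addresses 64 KiB apart compete for the same 8 cache lines.
```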


Since the matrices are of very different sizes, the fixed chunking strategy of `Threads.@threads` might not be very good. I have had some good experiences using a `Channel` for the work items and a `Threads.@threads for _ in 1:Threads.nthreads()` loop to process them. There is also ThreadPools.jl.
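A sketch of that pattern (all names illustrative): the `Channel` acts as a shared work queue, so a thread that drew several small matrices simply pulls more work instead of idling.

```julia
using LinearAlgebra

function mult_reduce_queue(M::Vector{Matrix{Float64}})
    work = Channel{Int}(length(M))
    foreach(i -> put!(work, i), eachindex(M))
    close(work)                           # iteration below drains the queue
    partial = zeros(Threads.nthreads())   # one accumulator per thread
    Threads.@threads for _ in 1:Threads.nthreads()
        for i in work                     # take! indices until the channel is empty
            partial[Threads.threadid()] += sum(Transpose(M[i]) * M[i])
        end
    end
    return sum(partial)                   # reduce across threads
end
```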

I think it is possible to use a single oversized `tmp` matrix and views into it to avoid memory allocations. Especially if the first dimension is always fixed, you will end up with contiguous memory for the view of `tmp`.
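That could look roughly like this (a sketch; the function name and buffer sizing are assumptions based on the serial version above):

```julia
using LinearAlgebra

function mult_reduce_prealloc(M::Vector{Matrix{Float64}})
    maxk = maximum(size(A, 2) for A in M)
    buf  = Matrix{Float64}(undef, maxk, maxk)  # one oversized scratch matrix
    s = 0.0
    for A in M
        k = size(A, 2)
        tmp = view(buf, 1:k, 1:k)   # strided view into buf, no new allocation
        mul!(tmp, Transpose(A), A)  # tmp = Aᵀ * A in place
        s += sum(tmp)
    end
    return s
end
```

For the threaded version you would want one `buf` per thread, since the scratch matrix cannot be shared safely.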

Actually, I don’t think this should necessarily be a problem here.
512:

```julia
julia> using LoopVectorization, BenchmarkTools

julia> function AmulB!(C, A, B)
           @avx for m ∈ axes(C,1), n ∈ axes(C,2)
               Cₘₙ = zero(eltype(C))
               for k ∈ axes(B,1)
                   Cₘₙ += A[m,k] * B[k,n]
               end
               C[m,n] = Cₘₙ
           end
       end
AmulB! (generic function with 1 method)

julia> M, K, N = 512, 72, 72;

julia> A = rand(M, K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N);

julia> @benchmark AmulB!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     44.239 μs (0.00% GC)
  median time:      44.365 μs (0.00% GC)
  mean time:        44.423 μs (0.00% GC)
  maximum time:     120.546 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> 2e-9M*K*N / 44.239e-6
119.9940324148376
```

vs 520:

```julia
julia> M, K, N = 520, 72, 72;

julia> A = rand(M, K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N);

julia> @benchmark AmulB!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     43.211 μs (0.00% GC)
  median time:      43.345 μs (0.00% GC)
  mean time:        43.409 μs (0.00% GC)
  maximum time:     104.449 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> 2e-9M*K*N / 43.211e-6
124.7682303117262
```

The CPU I ran this on (10980XE) has 8-way associative 32 KiB L1D cache/core

```julia
julia> 32 * (2 ^ 10) ÷ (8 * 8)
512
```

Yielding the fabled 512 as the magic(-ally bad) number.
But 119 (M = 512) vs 124 (M = 520) GFLOPS isn’t a big deal.
Both are very good; the theoretical peak for this CPU is 131.2.
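For context, the quoted 131.2 GFLOPS peak decomposes as clock × FMA units × SIMD width × 2 flops per FMA; the clock figure below is inferred from that number, the rest are the 10980XE's AVX-512 characteristics:

```julia
clock_ghz     = 4.1   # implied AVX-512 clock
fma_units     = 2     # two 512-bit FMA ports per core
lanes         = 8     # Float64 lanes per 512-bit register
flops_per_fma = 2     # one multiply + one add
peak_gflops   = clock_ghz * fma_units * lanes * flops_per_fma
# 4.1 * 2 * 8 * 2 == 131.2 (single core)
```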

Agner Fog’s example (see section 9.10, starting on page 103) is transposing a matrix.
We can try this:

```julia
julia> A = rand(512, 512); B = similar(A);

julia> @benchmark copyto!($B, $(A'))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     844.904 μs (0.00% GC)
  median time:      850.551 μs (0.00% GC)
  mean time:        850.523 μs (0.00% GC)
  maximum time:     1.722 ms (0.00% GC)
  --------------
  samples:          5876
  evals/sample:     1

julia> A = rand(520, 520); B = similar(A);

julia> @benchmark copyto!($B, $(A'))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     465.476 μs (0.00% GC)
  median time:      470.590 μs (0.00% GC)
  mean time:        470.852 μs (0.00% GC)
  maximum time:     873.865 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

Transposing a 512x512 matrix took about 80% longer than transposing a 520x520 matrix on my computer. Ouch!
(The problem is much less severe if you use FastTranspose.jl, where I get 285 microseconds for 512x512 vs 235 for 520x520.)

That is because what were contiguous loads (down a column) from `A` suddenly become 512 doubles apart when storing into `B`, so that each one evicts the previous cacheline.

So, to reproduce this in matrix multiplication, we should similarly expect pathological behavior when multiplying column-major `A * B` without packing into a tile-major format.
This is because (in column-major matmul) on each iteration of the `k` loop, we load vectors from a column of `A`, and a bunch of scalars from a row of `B`. So if `B`'s stride is 512, each of these scalars will be from a different cacheline.

Yet, performance is still fine:

```julia
julia> M, K, N = 72, 512, 72;

julia> A = rand(M, K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N); C2 = A * B;

julia> @benchmark AmulB!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     43.920 μs (0.00% GC)
  median time:      44.285 μs (0.00% GC)
  mean time:        44.367 μs (0.00% GC)
  maximum time:     120.774 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> 2e-9M*K*N / 43.920e-6
120.86557377049182
```

That’s 120 GFLOPS. I’m happy to see great performance, but it seems like I need to understand this problem better.

@biona001

My matrix are typically 512 by 5~30k. I suppose this is too large for LoopVectorization?

At that size, why not multithread the BLAS and sum? How does `ThreadsX.sum(A * A')` or simple `sum(A * A')` do?
Additionally, do they all really have 512 rows? If so, you can use a preallocated 512 x 512 `C`. `sum(A * A') ≈ sum(A' * A)`, so you’ll want to make the smaller dim the outer dim.
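A sketch of that suggestion (helper name hypothetical): with every `A` being 512 × k, `A * Transpose(A)` is always 512 × 512, so one buffer serves every multiply.

```julia
using LinearAlgebra

# Reuse a single preallocated 512×512 C across all multiplies.
function sum_AAᵀ!(C::Matrix{Float64}, A::Matrix{Float64})
    mul!(C, A, Transpose(A))   # C = A * Aᵀ, no allocation
    return sum(C)
end

# C = Matrix{Float64}(undef, 512, 512)
# s = sum(sum_AAᵀ!(C, A) for A in matrices)
```

Note that a shared `C` is not safe across threads; with `Threads.@threads` you would want one buffer per thread.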

Returning to my example, I looked into why the simply written version was slower, and it turns out the problem is with LLVM, so I’ve filed a bug there.

The simpler version gives LoopVectorization more freedom in choosing the order of the loops, in which case it chooses an order more like:

```julia
function sumAᵀA(A::AbstractMatrix{T}) where {T}
    s = zero(T)
    @avx for n ∈ axes(A,1), k ∈ axes(A,2)
        t = zero(T)
        for m ∈ axes(A,1)
            t += A[m,k] * A[n,k]
        end
        s += t
    end
    s
end
```

And this is indeed faster at many sizes than what I had written last time, and manages to workaround that LLVM performance-bug.

I’ll consider trying to automatically implement some workarounds that hopefully have less runtime overhead than doing something like the above, which should be slower than the simple
`for m ∈ axes(A,1), n ∈ axes(A,1), k ∈ axes(A,2); s += ...` version.

I suggest giving it a try; it seems to scale well without resorting to extra levels of tiling:

```julia
julia> A = rand(512, 1000);

julia> @btime sumAᵀA($A)
  4.047 ms (0 allocations: 0 bytes)
6.5659906453390986e7

julia> 2e-9*512^2*1000 / 4.047e-3
129.54978996787744
```