# Poor performance multiplying many (large) matrices multithreaded

I am trying to multiply a bunch of (large and differently sized) matrices, and then do some reduction following each multiply.

Basically I tried to do this in a multithreaded loop, where I see good speed gains for small-ish matrices, but performance quickly degrades for large-ish matrices.

```julia
using LinearAlgebra, BenchmarkTools, Random

function mult_reduce(M::Vector{Matrix{Float64}})
    n = length(M)
    s = 0.0
    for i in 1:n
        tmp = Transpose(M[i]) * M[i] # calculate Mi' Mi
        s  += sum(tmp)               # reduction
    end
    return s
end

function mult_reduce_threaded(M::Vector{Matrix{Float64}})
    n = length(M)
    s = zeros(Threads.nthreads())   # one partial sum per thread
    Threads.@threads for i in 1:n
        tmp = Transpose(M[i]) * M[i]      # calculate Mi' Mi
        s[Threads.threadid()] += sum(tmp) # reduction
    end
    return sum(s) # reduce across threads
end
```

On small problems, I get a 4~5x speedup with 8 physical cores:

```julia
# (small) test data
Random.seed!(2020)
n = 100
M = Vector{Matrix{Float64}}(undef, n)
for i in 1:n
    k = rand(100:200) # random matrix dimension
    M[i] = rand(k, k)
end

# mult_reduce (single-threaded)
18.225 ms (200 allocations: 18.71 MiB)
1.0128044609863697e8

# mult_reduce_threaded (8 threads)
4.451 ms (259 allocations: 18.72 MiB)
1.0128044609863696e8
```

```julia
# (larger) test data
Random.seed!(2020)
n = 100
M = Vector{Matrix{Float64}}(undef, n)
for i in 1:n
    k = rand(1000:3000) # random matrix dimension
    M[i] = rand(k, k)
end

# mult_reduce (single-threaded, default BLAS threading)
5.987 s (200 allocations: 3.12 GiB)
2.368913048551288e11

# mult_reduce_threaded (default BLAS threading)
21.042 s (200 allocations: 3.12 GiB)
2.368913048551288e11

# mult_reduce_threaded with BLAS.set_num_threads(1)
9.937 s (275 allocations: 3.12 GiB)
2.3689130485512885e11
```

Discussions in this post seem to suggest any allocation is terrible for multithreading since garbage collection is single-threaded.

Does this mean my application is doomed? I cannot preallocate those intermediate matrices (variable `tmp`) since each matrix is differently sized.

Please give me any suggestions for doing these calculations. On my real problem, I am looking at 2x speed gains using 10 cores on a cluster. That isn't as bad as this example, but it is also quite disappointing.

The first thing to try is setting BLAS threads to 1. With big matrices, you're probably oversubscribing your CPU.

Yes, BLAS's threads don't play nicely with Julia's threads (yet). Large matrix multiplication is already multithreaded, but there's a heuristic that turns it off for sufficiently small matrices. I don't recall the exact cutoffs, but that may explain the differences here. Just use `BLAS.set_num_threads(1)` if you're going to rely upon Julia's threading at a higher level instead.
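
For reference, a minimal way to apply this before starting the threaded loop:

```julia
using LinearAlgebra

# Let Julia's threads provide the parallelism; keep each BLAS call single-threaded
# so n Julia threads times m BLAS threads don't oversubscribe the cores.
BLAS.set_num_threads(1)
```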

Thank you both for your replies. I updated my post to include timings from `BLAS.set_num_threads(1)`, showing roughly a 2x speedup on 8 cores. I have actually tried it in my application, but it did not help.

A 2x speedup on 8 cores still seems poor, but I take this as "there's nothing we can do at the moment". Is this correct?

```julia
using LoopVectorization, LinearAlgebra, Random, BenchmarkTools

function sumAᵗA(A::AbstractMatrix{T}) where {T}
    s = zero(T)
    # Unfortunately, this pattern in loops seems necessary for good performance.
    # I'll look into why the more obvious
    # `for m ∈ axes(A,1), n ∈ axes(A,1), k ∈ axes(A,2); s += ...`
    # performs poorly later.
    @avx for m ∈ axes(A,1), n ∈ axes(A,1)
        t = zero(T)
        for k ∈ axes(A,2)
            t += A[m,k] * A[n,k]
        end
        s += t
    end
    s
end
basesumAᵗA(A) = sum(Transpose(A) * A)

function mult_reduce(f::F, M::Vector{Matrix{Float64}}) where {F}
    n = length(M)
    s = 0.0
    for i in 1:n
        s += f(M[i])
    end
    return s
end

function mult_reduce_threaded(f::F, M::Vector{Matrix{Float64}}) where {F}
    n = length(M)
    s = zeros(Threads.nthreads())   # one partial sum per thread
    Threads.@threads for i in 1:n
        s[Threads.threadid()] += f(M[i])
    end
    return sum(s) # reduce across threads
end

Random.seed!(2020);
n = 100
M = Vector{Matrix{Float64}}(undef, n);
for i in 1:n
    k = rand(100:200) # random matrix dimension
    M[i] = rand(k, k)
end
```

This yields:

```julia
julia> @btime mult_reduce(basesumAᵗA, $M) # 1 thread
  8.981 ms (200 allocations: 17.67 MiB)
9.15843796874148e7

julia> @btime mult_reduce(sumAᵗA, $M) # 1 thread
  7.832 ms (0 allocations: 0 bytes)
9.158351500138862e7

julia> @btime mult_reduce_threaded(basesumAᵗA, $M) # 8 threads
  919.900 μs (292 allocations: 17.69 MiB)
9.158437968741481e7

julia> @btime mult_reduce_threaded(sumAᵗA, $M) # 8 threads
  648.190 μs (92 allocations: 13.39 KiB)
9.158351500138861e7
```

How big are the arrays you need for the actual application?

Raw LoopVectorization will start performing pretty badly beyond a certain size for matrix multiplication. That size is heavily CPU-dependent, based on the CPU's cache.
For my newest/fastest computer (the one I ran these benchmarks on), that size is rather large, over 200x200. For an old laptop, it's closer to 50x50.

You can address this with blocking strategies (i.e., perform the matrix multiplication blockwise, where you copy each block into a preallocated buffer first) or, if the matrices aren't too large, perhaps by simply tiling the arrays (similar to blocking, but using views instead of copying memory).
If you do this, you can get 0 allocations per multiply and achieve better performance than BLAS + `sum`.
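
To make the tiling idea concrete, here is a rough sketch that tiles the `m`/`n` loops of the `sumAᵗA` kernel above over views; the name `sumAᵗA_tiled` and the default block size are my own guesses, and the block size would need tuning per CPU:

```julia
using LoopVectorization

function sumAᵗA_tiled(A::AbstractMatrix{T}; tile::Int = 128) where {T}
    s = zero(T)
    R = size(A, 1)
    for i0 in 1:tile:R, j0 in 1:tile:R
        Ai = view(A, i0:min(i0 + tile - 1, R), :)  # row tiles via views, no copying
        Aj = view(A, j0:min(j0 + tile - 1, R), :)
        t = zero(T)
        @avx for m ∈ axes(Ai, 1), n ∈ axes(Aj, 1)
            tmn = zero(T)
            for k ∈ axes(Ai, 2)
                tmn += Ai[m, k] * Aj[n, k]
            end
            t += tmn
        end
        s += t   # summing over all (i0, j0) tiles covers every (m, n) pair once
    end
    return s
end
```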

My matrices are typically 512 by 5k~30k. I suppose this is too large for LoopVectorization? I can probably try the tiling strategy you suggested, thank you. It may partially alleviate my allocations too.

You should also note that 512 is a bad number for CPUs. It means the length of your columns in memory is a multiple of the L2 cache entry size, so if you aren't a little careful, you can see tons of cache misses as a result of columns over-writing other columns. I've seen 10x performance drops from this before.

Hmm, I didn't know this. Sorry to keep this thread alive, but can you explain why columns will be over-writing other columns, or what that means here? Does that mean I should choose something like 511 instead?

I actually purposely chose 512 because, according to graphs like this, I thought matrix dimensions in multiples of 8 maximize CPU flops.

This is a bit in the weeds, but it led to the weirdest performance bug I've ever seen. I'm going to use Zen 2 processors as an example, but Intel is very similar here.

The L2 cache on this processor is 512 KB, 8-way associative with 64-byte lines, i.e. 1024 sets of eight cache lines each. If you have a matrix with 128 rows of Float64s, this means that A[1,1] is 1024 bytes away in memory from A[1,2]. Since the set an address maps to is determined by the address bits just above the line offset, a power-of-two column stride like this makes the entries of a row land in the same small group of sets over and over. By itself, this isn't a problem since the cache is 8-way associative, but far more than 8 entries of a row end up trying to fit in the same 8 cache lines, causing lots of evictions. As a result, BLAS typically uses panels with one of the dimensions prime, to ensure that when you are accessing one of the matrices in the wrong order for memory, you don't get collisions.
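
As a rough back-of-the-envelope check (my own numbers, applying the figures above to the 512-row matrices mentioned earlier in the thread):

```julia
cache_bytes     = 512 * 1024                         # Zen 2 L2 size
line_bytes      = 64                                 # cache line size
ways            = 8                                  # associativity
sets            = cache_bytes ÷ (line_bytes * ways)  # 1024 sets
critical_stride = cache_bytes ÷ ways                 # 65536 B between same-set addresses
col_bytes       = 512 * 8                            # one 512-row Float64 column = 4096 B
critical_stride ÷ col_bytes                          # 16: every 16th column shares a set
```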

Since the matrices are of very different sizes, the fixed assignment of iterations to threads used by `Threads.@threads` might not be very good. I have had some good experiences with using a `Channel` for the work items and a `Threads.@threads for _ in 1:Threads.nthreads()` loop to process them. There is also ThreadPools.jl.
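
A minimal sketch of that `Channel` pattern, applied to the OP's reduction (the function name is mine, just for illustration):

```julia
using LinearAlgebra

function mult_reduce_channel(M::Vector{Matrix{Float64}})
    work = Channel{Matrix{Float64}}(length(M))
    foreach(A -> put!(work, A), M)
    close(work)                       # iteration below stops once the channel drains
    partial = zeros(Threads.nthreads())
    Threads.@threads for _ in 1:Threads.nthreads()
        for A in work                 # each thread pulls the next unfinished item
            partial[Threads.threadid()] += sum(Transpose(A) * A)
        end
    end
    return sum(partial)               # reduce across threads
end
```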

I think it is possible to use a single too-large `tmp` matrix and views into it to avoid memory allocations. Especially if the first dimension is always fixed, you will end up with contiguous memory for the view of `tmp`.
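
Applied to the OP's loop, a sketch of that idea, assuming some known bound `kmax` on the matrix dimensions (the function name is mine); `mul!` writes into a strided view of one reusable buffer, so the multiply itself doesn't allocate:

```julia
using LinearAlgebra

function mult_reduce_buffered(M::Vector{Matrix{Float64}}, kmax::Int)
    buf = Matrix{Float64}(undef, kmax, kmax)  # one oversized buffer
    s = 0.0
    for A in M
        k = size(A, 2)
        tmp = view(buf, 1:k, 1:k)             # k×k window into the buffer
        mul!(tmp, Transpose(A), A)            # tmp = A'A without allocating
        s += sum(tmp)
    end
    return s
end
```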

Actually, I don't think this should necessarily be a problem here.
512:

```julia
julia> using LoopVectorization, BenchmarkTools

julia> function AmulB!(C, A, B)
           @avx for m ∈ axes(C,1), n ∈ axes(C,2)
               Cₘₙ = zero(eltype(C))
               for k ∈ axes(B,1)
                   Cₘₙ += A[m,k] * B[k,n]
               end
               C[m,n] = Cₘₙ
           end
       end
AmulB! (generic function with 1 method)

julia> M, K, N = 512, 72, 72;

julia> A = rand(M, K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N);

julia> @benchmark AmulB!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     44.239 μs (0.00% GC)
  median time:      44.365 μs (0.00% GC)
  mean time:        44.423 μs (0.00% GC)
  maximum time:     120.546 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> 2e-9M*K*N / 44.239e-6
119.9940324148376
```

vs 520:

```julia
julia> M, K, N = 520, 72, 72;

julia> A = rand(M, K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N);

julia> @benchmark AmulB!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     43.211 μs (0.00% GC)
  median time:      43.345 μs (0.00% GC)
  mean time:        43.409 μs (0.00% GC)
  maximum time:     104.449 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> 2e-9M*K*N / 43.211e-6
124.7682303117262
```

The CPU I ran this on (10980XE) has 8-way associative 32 KiB L1D cache/core

```julia
julia> 32 * (2 ^ 10) ÷ (8 * 8) # 32 KiB / (8 ways * 8 bytes per Float64)
512
```

Yielding the fabled 512 as the magic(-ally bad) number.
But 119 GFLOPS (M = 512) vs 124 GFLOPS (M = 520) isn't a big deal.
Both are very good; the theoretical peak for this CPU is 131.2 GFLOPS.

Agner Fog's example (see section 9.10, starting on page 103) is transposing a matrix.
We can try this:

```julia
julia> A = rand(512, 512); B = similar(A);

julia> @benchmark copyto!($B, $(A'))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     844.904 μs (0.00% GC)
  median time:      850.551 μs (0.00% GC)
  mean time:        850.523 μs (0.00% GC)
  maximum time:     1.722 ms (0.00% GC)
  --------------
  samples:          5876
  evals/sample:     1

julia> A = rand(520, 520); B = similar(A);

julia> @benchmark copyto!($B, $(A'))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     465.476 μs (0.00% GC)
  median time:      470.590 μs (0.00% GC)
  mean time:        470.852 μs (0.00% GC)
  maximum time:     873.865 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

Transposing a 512x512 matrix took about 80% longer than transposing a 520x520 matrix on my computer. Ouch!
(The problem is much less severe if you use FastTranspose.jl, where I get 285 microseconds for 512x512 vs 235 for 520x520.)

That is because what were contiguous loads (down a column) from `A` become stores 512 doubles apart into `B`, so that each one evicts the previous cache line.

So, to reproduce this in matrix multiplication, we should similarly expect pathological behavior when multiplying column-major `A * B` without packing into a tile-major format.
This is because (in column-major matmul) on each iteration of the `k` loop, we load vectors from a column of `A`, and a bunch of scalars from a row of `B`. So if `B`'s stride is 512, each of these scalars will be from a different cacheline.

Yet, performance is still fine:

```julia
julia> M, K, N = 72, 512, 72;

julia> A = rand(M, K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N); C2 = A * B;

julia> @benchmark AmulB!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     43.920 μs (0.00% GC)
  median time:      44.285 μs (0.00% GC)
  mean time:        44.367 μs (0.00% GC)
  maximum time:     120.774 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> 2e-9M*K*N / 43.920e-6
120.86557377049182
```

That's 120 GFLOPS. I'm happy to see great performance, but it seems like I need to understand this problem better.

@biona001:

> My matrices are typically 512 by 5k~30k. I suppose this is too large for LoopVectorization?

At that size, why not multithread the BLAS call and the sum? How do `ThreadsX.sum(A * A')` or plain `sum(A * A')` do?
Additionally, do they all really have 512 rows? If so, you can use a preallocated 512 x 512 `C`. `sum(A * A') ≈ sum(A' * A)`, so you'll want to make the smaller dim the outer dim.
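
If they really do all have 512 rows, a sketch of the preallocated-`C` idea, relying on the `sum(A * A') ≈ sum(A' * A)` observation above (the function name is mine, just for illustration):

```julia
using LinearAlgebra

function mult_reduce_fixed(M::Vector{Matrix{Float64}})
    C = Matrix{Float64}(undef, 512, 512)  # reused for every multiply
    s = 0.0
    for A in M                            # each A is 512 × (5k to 30k)
        mul!(C, A, Transpose(A))          # C = A*A', the small 512×512 product
        s += sum(C)
    end
    return s
end
```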

Returning to my example, I looked into why the simply written version was slower, and it turns out the problem is with LLVM, so I've filed a bug there.

The simpler version gives LoopVectorization more freedom in choosing the order of the loops, in which case it chooses an order more like:

```julia
function sumAᵗA(A::AbstractMatrix{T}) where {T}
    s = zero(T)
    @avx for n ∈ axes(A,1), k ∈ axes(A,2)
        t = zero(T)
        for m ∈ axes(A,1)
            t += A[m,k] * A[n,k]
        end
        s += t
    end
    s
end
```

And this is indeed faster at many sizes than what I had written last time, and it manages to work around that LLVM performance bug.

I'll consider trying to automatically implement some workarounds that hopefully have less runtime overhead than doing something like the above, which should be slower than the simple
`for m ∈ axes(A,1), n ∈ axes(A,1), k ∈ axes(A,2); s += ...` version.

I suggest giving it a try; it seems to scale well without resorting to extra levels of tiling:

```julia
julia> A = rand(512, 1000);

julia> @btime sumAᵗA($A)
  4.047 ms (0 allocations: 0 bytes)
6.5659906453390986e7

julia> 2e-9*512^2*1000 / 4.047e-3
129.54978996787744
```