Poor performance multiplying many (large) matrices multithreaded

I am trying to multiply a bunch of (large and differently sized) matrices, and then do some reduction following each multiply.

Basically I tried to do this in a multithreaded loop, where I see good speed gains for small-ish matrices, but performance quickly degrades for large-ish matrices.

using LinearAlgebra, BenchmarkTools, Random
BLAS.set_num_threads(1)

function mult_reduce(M::Vector{Matrix{Float64}})
    n = length(M)
    s = 0.0
    for i in 1:n
        tmp = Transpose(M[i]) * M[i] # calculate Mi' Mi
        s  += sum(tmp)               # reduction
    end
    return s
end

function mult_reduce_threaded(M::Vector{Matrix{Float64}})
    n = length(M)
    s = zeros(Threads.nthreads())
    Threads.@threads for i in 1:n
        tmp = Transpose(M[i]) * M[i]      # calculate Mi' Mi        
        s[Threads.threadid()] += sum(tmp) # reduction on each thread         
    end
    return sum(s) # reduce across threads
end

On small problems, I get 4~5x speedup with 8 physical cores:

# (small) test data
Random.seed!(2020)
n = 100
M = Vector{Matrix{Float64}}(undef, n)
for i in 1:n
    k = rand(100:200) # random matrix dimension
    M[i] = rand(k, k)
end

@btime mult_reduce(M) # 1 thread
  18.225 ms (200 allocations: 18.71 MiB)
1.0128044609863697e8

@btime mult_reduce_threaded(M) # 8 thread
  4.451 ms (259 allocations: 18.72 MiB)
1.0128044609863696e8

BUT multithreading seems to be slower than single-threaded for larger problems:

# (larger) test data
Random.seed!(2020)
n = 100
M = Vector{Matrix{Float64}}(undef, n)
for i in 1:n
    k = rand(1000:3000) # random matrix dimension
    M[i] = rand(k, k)
end

@btime mult_reduce(M) # 1 thread (not setting BLAS threads)
5.987 s (200 allocations: 3.12 GiB)
2.368913048551288e11

@btime mult_reduce(M) # 1 thread (setting BLAS threads = 1)
  21.042 s (200 allocations: 3.12 GiB)
2.368913048551288e11

@btime mult_reduce_threaded(M) # 8 thread (setting BLAS threads = 1)
  9.937 s (275 allocations: 3.12 GiB)
2.3689130485512885e11

Discussions in this post seem to suggest any allocation is terrible for multithreading since garbage collection is single-threaded.

Does this mean my application is doomed? I cannot preallocate those intermediate matrices (variable tmp) since each matrix is differently sized.

Please give me any suggestion to do these calculations. On my real problem, I am looking at 2x speed gains using 10 cores on a cluster. That isn’t as bad as this example but it is also quite disappointing.

The first thing to try is setting blas threads to 1. With big matrices, you’re probably over-allocating your CPU.

2 Likes

Yes, BLAS’s threads don’t play nicely with Julia’s threads (yet). Large matrix multiplication is already multithreaded but there’s a heuristic that turns it off for sufficiently small matrices. I don’t recall the exact cutoffs, but that may explain the differences here. Just use BLAS.set_num_threads(1) if you’re going to rely upon Julia’s threading at a higher level instead.

1 Like

Thank you for both of your replies. I updated my post to include timings from BLAS.set_num_threads(1) showing roughly 2x speedup on 8 cores. I have actually tried it in my application, which did not help.

2x speedup on 8 cores still seems poor, but I take this as “there’s nothing we can do at the moment”. Is this correct?

(mis-read, wrong answer)

1 Like
using LoopVectorization, LinearAlgebra, Random, BenchmarkTools
BLAS.set_num_threads(1)

function sumAᵀA(A::AbstractMatrix{T}) where {T}
    s = zero(T)
    # Unfortunately, this pattern in loops seems necessary for good performance.
    # I'll look into why the more obvious
    # `for m ∈ axes(A,1), n ∈ axes(A,1), k ∈ axes(A,2); s += ...`
    # performs poorly later.
    @avx for m ∈ axes(A,1), n ∈ axes(A,1)
        t = zero(T)
        for k ∈ axes(A,2)
            t += A[m,k] * A[n,k]
        end
        s += t
    end
    s
end
basesumAᵀA(A) = sum(Transpose(A) * A)

function mult_reduce(f::F, M::Vector{Matrix{Float64}}) where {F}
    n = length(M)
    s = 0.0
    for i in 1:n
        s  += f(M[i])
    end
    return s
end

function mult_reduce_threaded(f::F, M::Vector{Matrix{Float64}}) where {F}
    n = length(M)
    s = zeros(Threads.nthreads())
    Threads.@threads for i in 1:n
        s[Threads.threadid()] += f(M[i])
    end
    return sum(s) # reduce across threads
end

Random.seed!(2020);
n = 100
M = Vector{Matrix{Float64}}(undef, n);
for i in 1:n
    k = rand(100:200) # random matrix dimension
    M[i] = rand(k, k)
end

This yields:

julia> @btime mult_reduce(basesumAᵀA, $M) # 1 thread
  8.981 ms (200 allocations: 17.67 MiB)
9.15843796874148e7

julia> @btime mult_reduce(sumAᵀA, $M) # 1 thread
  7.832 ms (0 allocations: 0 bytes)
9.158351500138862e7

julia> @btime mult_reduce_threaded(basesumAᵀA, $M) # 18 thread
  919.900 μs (292 allocations: 17.69 MiB)
9.158437968741481e7

julia> @btime mult_reduce_threaded(sumAᵀA, $M) # 18 thread
  648.190 μs (92 allocations: 13.39 KiB)
9.158351500138861e7

How big are the arrays you need for the actual application?

Raw LoopVectorization will start performing pretty badly beyond a certain size for matrix multiplication. That size is heavily CPU dependent, and based on CPU cache.
For my newest/fastest computer (the one I ran these benchmarks on) that size is rather large, over 200x200. For an old laptop, it’s closer to 50x50.

You can address this with blocking strategies (i.e., perform the matrix multiplication blockwise, where you copy each block into a preallocated buffer first) or if they’re not too large, perhaps simply tiling the arrays (similar to blocking, but using views instead of copying memory).
If you do this, you can get 0 allocations per mul, and achieve better performance than BLAS + sum.

3 Likes

My matrix are typically 512 by 5~30k. I suppose this is too large for LoopVectorization? I can probably try the tiling strategy you suggested, thank you. It may partially alleviate my allocations too.

You should also note that 512 is a bad number for cpus. It means your vectors columns are a multiple of the L2 cache entry size, so if you aren’t a little careful, you can see tons of cache misses as a result of columns over-writing other columns. I’ve seen 10x performance drops from this before.

1 Like

Hmm I didn’t know this. Sorry to keep this thread alive, but can you explain why columns will be over-writing other columns or what that means here? Does that mean I should choose something like 511 instead?

I actually purposely chose 512 because, according to graphs like this, I thought matrix dimensions in multiples of 8 maximizes CPU flops.

This is a bit in the weeds, but led to the weirdest performance bug I’ve ever seen. I’m going to use zen 2 processors as an example, but intel is very similar here.

The L2 cache on this processor is 512 KB, stored in 1024 buckets of 64 bytes each. If you have a matrix with 128 rows of float64s, this means that the [1,1] is 1024 bytes in memory away from [2,1]. This means that these end up in the same cache line. By itself, this isn’t a problem since the cache is 8-way associative, but all entries in the first row end up trying to fit in the same 8 cache lies, causing lots of problems. As a result, blas typically uses panels with one of the numbers prime to ensure that when you are accessing one of the matrices in the wrong order for memory, you don’t get collisions.

1 Like

Since the matrices are of very different sizes the fixed allocation strategy of Threads.@threads might not be very good. I have had some good experiences with using Channel for the work items and Threads.@threads for _ in 1:Threads.nthreads() loop to process them. There is also ThreadsPools.jl.

I think it is possible to use a too large tmp matrix and views to avoid memory allocations. Especially if the first dimension is always fixed you will end up with contiguous memory for the view of tmp.

Actually, I don’t think this should necessarily be a problem here.
512:

julia> using LoopVectorization, BenchmarkTools

julia> function AmulB!(C, A, B)
       @avx for m ∈ axes(C,1), n ∈ axes(C,2)
       Cₘₙ = zero(eltype(C))
       for k ∈ axes(B,1)
       Cₘₙ += A[m,k] * B[k,n]
       end
       C[m,n] = Cₘₙ
       end
       end
AmulB! (generic function with 1 method)

julia> M, K, N = 512, 72, 72;

julia> A = rand(M, K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N);

julia> @benchmark AmulB!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     44.239 μs (0.00% GC)
  median time:      44.365 μs (0.00% GC)
  mean time:        44.423 μs (0.00% GC)
  maximum time:     120.546 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> 2e-9M*K*N / 44.239e-6
119.9940324148376

vs 520:

julia> M, K, N = 520, 72, 72;

julia> A = rand(M, K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N);

julia> @benchmark AmulB!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     43.211 μs (0.00% GC)
  median time:      43.345 μs (0.00% GC)
  mean time:        43.409 μs (0.00% GC)
  maximum time:     104.449 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> 2e-9M*K*N / 43.211e-6
124.7682303117262

The CPU I ran this on (10980XE) has 8-way associative 32 KiB L1D cache/core

julia> 32 * (2 ^ 10) ÷ (8 * 8)
512

Yielding the fabled 512 as the magic(-ally bad) number.
But 119 (M = 512) vs 124 (M = 520) GFLOPS isn’t a big deal.
Both are very good; the theoretical peak for this CPU is 131.2.

Agner Fog’s example (see section 9.10, starting on page 103) is transposing a matrix.
We can try this:

julia> A = rand(512, 512); B = similar(A);

julia> @benchmark copyto!($B, $(A'))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     844.904 μs (0.00% GC)
  median time:      850.551 μs (0.00% GC)
  mean time:        850.523 μs (0.00% GC)
  maximum time:     1.722 ms (0.00% GC)
  --------------
  samples:          5876
  evals/sample:     1

julia> A = rand(520, 520); B = similar(A);

julia> @benchmark copyto!($B, $(A'))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     465.476 μs (0.00% GC)
  median time:      470.590 μs (0.00% GC)
  mean time:        470.852 μs (0.00% GC)
  maximum time:     873.865 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Tranposing a 512x512 matrix took about 80% longer than transposing a 520x520 matrix on my computer. Ouch!
(The problem is much less severe if you use FastTranspose.jl, where I get 285 microseconds for 512x512 vs 235 for 520x520.)

That is because what were contiguous loads (down a column) from A suddenly become 512 doubles apart when storing into B, so that each one evicts the previous cacheline.

So, to reproduce this in matrix multiplication, we should similarly expect pathological behavior when multiplying column-major A * B without packing into a tile-major format.
This is because (in column-major matmul) on each iteration of the k loop, we load vectors from a column of A, and a bunch of scalars from a row of B. So if B's stride is 512, each of these scalars will be from a different cacheline.

Yet, performance is still fine:

julia> M, K, N = 72, 512, 72;

julia> A = rand(M, K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N); C2 = A * B;

julia> @benchmark AmulB!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     43.920 μs (0.00% GC)
  median time:      44.285 μs (0.00% GC)
  mean time:        44.367 μs (0.00% GC)
  maximum time:     120.774 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> 2e-9M*K*N / 43.920e-6
120.86557377049182

That’s 120 GFLOPS. I’m happy to see great performance, but it seems like I need to understand this problem better.

@biona001

My matrix are typically 512 by 5~30k. I suppose this is too large for LoopVectorization?

At that size, why not multithread the BLAS and sum? How does ThreadsX.sum(A * A') or simple sum(A * A') do?
Additionally, do they all really have 512 rows? If so, you can use a preallocated 512 x 512 C. sum(A * A') ≈ sum(A' * A), so you’ll want to make the smaller dim the outer dim.


Returning to my example, I looked into why the simply written version was slower, and it turns out the problem is with LLVM, so I’ve filed a bug there.

The simpler version gives LoopVectorization more freedom in choosing the order of the loops, in which case it chooses an order more like:

function sumAᵀA(A::AbstractMatrix{T}) where {T}
    s = zero(T)
    @avx for n ∈ axes(A,1), k ∈ axes(A,2)
        t = zero(T)
        for m ∈ axes(A,1)
            t += A[m,k] * A[n,k]
        end
        s += t
    end
    s
end

And this is indeed faster at many sizes than what I had written last time, and manages to workaround that LLVM performance-bug.

I’ll consider trying to automatically implement some workarounds that hopefully have less runtime overhead than doing something like the above, which should be slower than the simple
for m ∈ axes(A,1), n ∈ axes(A,1), k ∈ axes(A,2); s += ... version.

I suggest giving it a try; seems to scale well without resorting to extra levels of tiling:

julia> A = rand(512, 1000);

julia> @btime sumAᵀA($A)
  4.047 ms (0 allocations: 0 bytes)
6.5659906453390986e7

julia> 2e-9*512^2*1000 / 4.047e-3
129.54978996787744
2 Likes