Why is matrix multiplication much slower than in PyTorch?

For example, multiplying two 10,000 × 10,000 matrices, single-threaded.

In Julia:

using LinearAlgebra, BenchmarkTools

BLAS.set_num_threads(1)
A = randn(10000, 10000)
B = randn(10000, 10000)
C = Matrix{Float64}(undef, 10000, 10000)

@benchmark mul!($C, $A, $B)

The result shows:

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     18.192 s (0.00% GC)
  median time:      18.192 s (0.00% GC)
  mean time:        18.192 s (0.00% GC)
  maximum time:     18.192 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1

I also tried LoopVectorization.jl:

using LoopVectorization

# C[m,n] = Σ_k A[m,k] * B[k,n], as a plain triple loop vectorized by @turbo
function mygemmavx!(C, A, B)
    @turbo for m ∈ axes(A,1), n ∈ axes(B,2)
        Cmn = zero(eltype(C))
        for k ∈ axes(A,2)
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end
@benchmark mygemmavx!($C, $A, $B)

Results:

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     149.960 s (0.00% GC)
  median time:      149.960 s (0.00% GC)
  mean time:        149.960 s (0.00% GC)
  maximum time:     149.960 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1

However, using PyTorch:

import torch

torch.set_num_threads(1)
A = torch.randn(10000, 10000)
B = torch.randn(10000, 10000)
%timeit -n 3 torch.matmul(A, B)

It took only ~9 s on average (7 runs, 3 loops each).

I’m so confused. Could anyone shed some light on this? Thank you!

Maybe because of this: pytorch not respecting torch.set_num_threads · Issue #975 · pytorch/pytorch · GitHub

Thanks - but I monitored the usage, and PyTorch indeed used only up to 100% CPU. Does that suggest it was single-threaded?

Also, note that PyTorch's default dtype is float32…

In [8]: torch.set_num_threads(1)
   ...: A = torch.randn(1000, 1000, dtype=torch.float64)
   ...: B = torch.randn(1000, 1000, dtype=torch.float64)
   ...: %timeit -n 5 torch.matmul(A, B)
43.1 ms ± 810 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)

In [9]: torch.set_num_threads(1)
   ...: A = torch.randn(1000, 1000)
   ...: B = torch.randn(1000, 1000)
   ...: %timeit -n 5 torch.matmul(A, B)
22.1 ms ± 245 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
And the corresponding Float32 benchmark in Julia:

julia> A = randn(Float32, 1000, 1000);

julia> B = randn(Float32, 1000, 1000);

julia> C = Matrix{Float32}(undef, 1000, 1000);

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.951 ms (0.00% GC)
  median time:      22.395 ms (0.00% GC)
  mean time:        22.528 ms (0.00% GC)
  maximum time:     25.443 ms (0.00% GC)
  --------------
  samples:          222
  evals/sample:     1

We really need an FAQ saying: please don't benchmark a single BLAS call, because the results should be essentially identical across languages. When they aren't, it's either OpenBLAS vs. MKL, or one of the libraries is not respecting the thread-count setting.
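Both of those can be checked from the Julia side; a quick sanity check, assuming Julia 1.7+ for BLAS.get_config() (and 1.6+ for BLAS.get_num_threads()):

using LinearAlgebra

BLAS.get_config()        # which BLAS library is actually loaded (e.g. OpenBLAS vs. MKL), Julia 1.7+
BLAS.get_num_threads()   # how many threads that BLAS will use, Julia 1.6+

On the Python side, torch.get_num_threads() gives the corresponding thread count.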


The @turbo loop is missing the optimizations (mainly cache blocking) needed for matrices this large. Single-threaded, those optimizations start to matter above roughly 300×300.
Octavian.jl will do much better, but will probably still be a little behind OpenBLAS.
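To give a flavor of what those missing optimizations look like, here is a minimal loop-tiling sketch (hypothetical code, nowhere near what Octavian or OpenBLAS actually do; the block size blk is just a placeholder). The idea is that each block of A and C is reused from cache many times before being evicted:

# Minimal cache-tiling sketch; blk is a hypothetical tuning parameter.
function mygemm_tiled!(C, A, B; blk = 64)
    M, K = size(A); N = size(B, 2)
    fill!(C, zero(eltype(C)))
    for nb in 1:blk:N, kb in 1:blk:K, mb in 1:blk:M
        # multiply one block of A by one block of B, accumulating into C
        for n in nb:min(nb + blk - 1, N), k in kb:min(kb + blk - 1, K)
            Bkn = B[k, n]
            @inbounds @simd for m in mb:min(mb + blk - 1, M)
                C[m, n] += A[m, k] * Bkn
            end
        end
    end
    return C
end

Real BLAS implementations go much further (packing blocks into contiguous buffers, hand-tuned microkernels, prefetching), which is where the rest of the gap comes from.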

julia> using Octavian, LinearAlgebra

julia> M = K = N = 10_000; T = Float32; A = rand(T,M,K); B = rand(T,K,N); C0 = Array{T}(undef, M, N);

julia> @time mul!(C0, A, B);
  9.055349 seconds (2.27 M allocations: 120.548 MiB, 0.92% gc time, 5.70% compilation time)

julia> @time matmul_serial!(C0, A, B);
 21.109921 seconds (23.04 M allocations: 1.302 GiB, 0.51% gc time, 52.73% compilation time)

julia> @time mul!(C0, A, B);
  8.565136 seconds

julia> @time matmul_serial!(C0, A, B);
  9.987270 seconds

Octavian's single-threaded performance could use some work; the multithreaded performance is closer.
I think one reason the performance is worse is that Octavian will only use a fraction of the L3 cache corresponding to the number of threads it's using.
It (incorrectly, in this case) assumes that you aren't running single-threaded matmul benchmarks, and that if you are running single-threaded, it's because the other cores are busy doing something else that will also need the L3 cache, for example other single-threaded matmuls.
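As a rough illustration with made-up numbers (the cache size, core count, and exact formula here are hypothetical, just to show the effect of that assumption):

# Hypothetical numbers, only to illustrate the heuristic described above.
l3_bytes = 32 * 2^20                       # say, a 32 MiB shared L3
ncores   = 8                               # cores sharing that L3
nthreads = 1                               # threads actually used for the matmul
budget   = l3_bytes * nthreads ÷ ncores    # plans around ~4 MiB of L3,
                                           # even though all 32 MiB are free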
