Why matrix multiplication is much slower than PyTorch

yingqiuz · June 27, 2021, 7:43pm

For example, matrix multiplication of 10,000 x 10,100 matrices, single threaded

In julia:

BLAS.set_num_threads(1)
A = randn(10000, 10000)
B = randn(10000, 10000)
C = Matrix{Float64}(undef, 10000, 10000)

@benchmark mul!($C, $A, $B)

The result shows

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     18.192 s (0.00% GC)
  median time:      18.192 s (0.00% GC)
  mean time:        18.192 s (0.00% GC)
  maximum time:     18.192 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1

I also tried LoopVectorization:

function mygemmavx!(C, A, B)
    @turbo for m ∈ axes(A,1), n ∈ axes(B,2)
        Cmn = zero(eltype(C))
        for k ∈ axes(A,2)
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end
@benchmark mygemmavx!($C, $A, $B)

Results:

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     149.960 s (0.00% GC)
  median time:      149.960 s (0.00% GC)
  mean time:        149.960 s (0.00% GC)
  maximum time:     149.960 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1

However, using Pytorch:

import torch

torch.set_num_threads(1)
A = torch.randn(10000, 10000)
B = torch.randn(10000, 10000)
%timeit -n 3 torch.matmul(A, B)
torch.matmul(A, B)

It took only ~9s in average (7 runs, 3 loops each)

I’m so confused. Could anyone shed some light on this? Thank you!

jling · June 27, 2021, 7:46pm

because of this maybe: pytorch not respecting torch.set_num_threads · Issue #975 · pytorch/pytorch · GitHub

yingqiuz · June 27, 2021, 7:50pm

Thanks - but I monitored the usage and pytorch indeed took up to 100% cpu. Does it suggest it was single-threaded?

jling · June 27, 2021, 7:51pm

also, eh, pytorch default type is float32…

In [8]: torch.set_num_threads(1)
   ...: A = torch.randn(1000, 1000, dtype=torch.float64)
   ...: B = torch.randn(1000, 1000, dtype=torch.float64)
   ...: %timeit -n 5 torch.matmul(A, B)
43.1 ms ± 810 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)

In [9]: torch.set_num_threads(1)
   ...: A = torch.randn(1000, 1000)
   ...: B = torch.randn(1000, 1000)
   ...: %timeit -n 5 torch.matmul(A, B)
22.1 ms ± 245 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)

julia> A = randn(Float32, 1000, 1000);

julia> B = randn(Float32, 1000, 1000);

julia> C = Matrix{Float32}(undef, 1000, 1000);

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.951 ms (0.00% GC)
  median time:      22.395 ms (0.00% GC)
  mean time:        22.528 ms (0.00% GC)
  maximum time:     25.443 ms (0.00% GC)
  --------------
  samples:          222
  evals/sample:     1

we really need some FAQ saying please don’t benchmark a single BLAS instruction because they literally should be the same. In the case they are not, it’s either OpenBLAS vs. MKL, or one of them is not respecting # of threads setting

Elrod · June 27, 2021, 8:49pm

It’s missing optimizations for handling such large sizes. Single threaded, those optimizations are needed above 300x300, or so.
Octavian.jl will do much better, but will probably be a little behind OpenBLAS.

julia> M = K = N = 10_000; T = Float32; A = rand(T,M,K); B = rand(T,K,N); C0 = Array{T}(undef, M, N);

julia> @time mul!(C0, A, B);
  9.055349 seconds (2.27 M allocations: 120.548 MiB, 0.92% gc time, 5.70% compilation time)

julia> @time matmul_serial!(C0, A, B);
 21.109921 seconds (23.04 M allocations: 1.302 GiB, 0.51% gc time, 52.73% compilation time)

julia> @time mul!(C0, A, B);
  8.565136 seconds

julia> @time matmul_serial!(C0, A, B);
  9.987270 seconds

Octavian’s single threaded performance could use some work; the multithreaded performance is closer.
I think one reason why the performance is worse is because Octavian will only use a fraction of the L3 cache equal to the number of threads it’s using.
It (incorrectly, here) assumes that you aren’t running single threaded matmul benchmarks, and that if you’re running it single threaded, that’s because the other cores are busy doing something else that they’re going to need L3 cache for. Like maybe they’re also calling single threaded matmuls.

Topic		Replies	Views
Slow matrix multiplication in Julia compared to Python numpy New to Julia question	17	5467	May 19, 2018
Julia matrix-multiplication performance Performance linearalgebra	20	8208	October 30, 2022
Submatrix multiply Performance	7	965	February 29, 2020
Speeding up matrix multiplication with a subset of matrix's rows General Usage linearalgebra	12	950	July 24, 2019
Why is this code so slow in julia compared to a numpy implementation? Performance performance	9	3545	October 24, 2017

Why matrix multiplication is much slower than PyTorch

Related topics