Also, eh, PyTorch's default dtype is float32, while Julia's randn defaults to Float64, so you have to match precisions before comparing…
In [8]: torch.set_num_threads(1)
...: A = torch.randn(1000, 1000, dtype=torch.float64)
...: B = torch.randn(1000, 1000, dtype=torch.float64)
...: %timeit -n 5 torch.matmul(A, B)
43.1 ms ± 810 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
In [9]: torch.set_num_threads(1)
...: A = torch.randn(1000, 1000)
...: B = torch.randn(1000, 1000)
...: %timeit -n 5 torch.matmul(A, B)
22.1 ms ± 245 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
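If you want to check what your install defaults to, this is all it takes (a minimal sketch, not part of the session above; get_default_dtype / set_default_dtype are the actual PyTorch APIs, and the commented values are what a stock install prints):

import torch

# factory functions like torch.randn follow a global default dtype,
# which is float32 out of the box
print(torch.get_default_dtype())   # torch.float32
print(torch.randn(2, 2).dtype)     # torch.float32

# to match Julia's randn (Float64 by default), pass dtype=torch.float64
# explicitly or flip the global default
torch.set_default_dtype(torch.float64)
print(torch.randn(2, 2).dtype)     # torch.float64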
julia> A = randn(Float32, 1000, 1000);
julia> B = randn(Float32, 1000, 1000);
julia> C = Matrix{Float32}(undef, 1000, 1000);
julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.951 ms (0.00% GC)
  median time:      22.395 ms (0.00% GC)
  mean time:        22.528 ms (0.00% GC)
  maximum time:     25.443 ms (0.00% GC)
  --------------
  samples:          222
  evals/sample:     1
We really need an FAQ entry saying: please don't benchmark a single BLAS call, because the numbers should be essentially identical no matter which language makes the call. When they aren't, it's either OpenBLAS vs. MKL, or one of the libraries isn't respecting the thread-count setting.
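For the thread-setting case, a quick sanity check is cheap (a sketch using PyTorch's real APIs; on the Julia side, LinearAlgebra.BLAS.set_num_threads / BLAS.get_num_threads play the same role):

import torch

# confirm the thread cap actually took effect
torch.set_num_threads(1)
print(torch.get_num_threads())   # should print 1

# dump build info, which shows whether this wheel links MKL or OpenBLAS
print(torch.__config__.show())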