Why matrix multiplication is much slower than PyTorch

Also, note that PyTorch's default dtype is float32, so a fair comparison has to use the same precision on both sides. Here is PyTorch at float64 and then float32, single-threaded:

In [8]: torch.set_num_threads(1)
   ...: A = torch.randn(1000, 1000, dtype=torch.float64)
   ...: B = torch.randn(1000, 1000, dtype=torch.float64)
   ...: %timeit -n 5 torch.matmul(A, B)
43.1 ms ± 810 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)

In [9]: torch.set_num_threads(1)
   ...: A = torch.randn(1000, 1000)
   ...: B = torch.randn(1000, 1000)
   ...: %timeit -n 5 torch.matmul(A, B)
22.1 ms ± 245 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
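
For reference, randn in Julia returns Float64 by default, so Float32 has to be requested explicitly; a quick REPL check:

julia> eltype(randn(2, 2))
Float64

julia> eltype(randn(Float32, 2, 2))
Float32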

The matching Float32 benchmark in Julia (mul! comes from LinearAlgebra, @benchmark from BenchmarkTools):

julia> using LinearAlgebra, BenchmarkTools

julia> A = randn(Float32, 1000, 1000);

julia> B = randn(Float32, 1000, 1000);

julia> C = Matrix{Float32}(undef, 1000, 1000);

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.951 ms (0.00% GC)
  median time:      22.395 ms (0.00% GC)
  mean time:        22.528 ms (0.00% GC)
  maximum time:     25.443 ms (0.00% GC)
  --------------
  samples:          222
  evals/sample:     1

With matching dtypes and thread counts the two are effectively identical: 22.4 ms median in Julia vs. 22.1 ms in PyTorch. We really need an FAQ entry saying: please don't benchmark a single BLAS call, because the timings literally should be the same. When they are not, it's either OpenBLAS vs. MKL, or one of them is not respecting the thread-count setting.
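
If thread settings are the suspect, here is a minimal single-thread check on the Julia side (a sketch; BLAS.get_num_threads needs Julia 1.6 or newer, BLAS.get_config needs 1.7 or newer):

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1)  # match torch.set_num_threads(1) on the Python side

julia> BLAS.get_num_threads()
1

julia> BLAS.get_config()  # shows which BLAS backend (OpenBLAS, MKL, ...) is actually loaded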
