Also, eh, PyTorch's default dtype is float32, while Julia's randn defaults to Float64, so you have to match precisions before comparing…
In [8]: torch.set_num_threads(1)
...: A = torch.randn(1000, 1000, dtype=torch.float64)
...: B = torch.randn(1000, 1000, dtype=torch.float64)
...: %timeit -n 5 torch.matmul(A, B)
43.1 ms ± 810 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
In [9]: torch.set_num_threads(1)
...: A = torch.randn(1000, 1000)
...: B = torch.randn(1000, 1000)
...: %timeit -n 5 torch.matmul(A, B)
22.1 ms ± 245 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
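If you want to check what your install defaults to, this is all it takes (a minimal sketch, not part of the session above; get_default_dtype / set_default_dtype are the actual PyTorch APIs, and the commented values are what a stock install prints):

import torch

# factory functions like torch.randn follow a global default dtype,
# which is float32 out of the box
print(torch.get_default_dtype())   # torch.float32
print(torch.randn(2, 2).dtype)     # torch.float32

# to match Julia's randn (Float64 by default), pass dtype=torch.float64
# explicitly or flip the global default
torch.set_default_dtype(torch.float64)
print(torch.randn(2, 2).dtype)     # torch.float64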
julia> A = randn(Float32, 1000, 1000);
julia> B = randn(Float32, 1000, 1000);
julia> C = Matrix{Float32}(undef, 1000, 1000);
julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.951 ms (0.00% GC)
  median time:      22.395 ms (0.00% GC)
  mean time:        22.528 ms (0.00% GC)
  maximum time:     25.443 ms (0.00% GC)
  --------------
  samples:          222
  evals/sample:     1
We really need an FAQ entry saying: please don't benchmark a single BLAS call, because the numbers should be essentially identical no matter which language makes the call. When they aren't, it's either OpenBLAS vs. MKL, or one of the libraries isn't respecting the thread-count setting.
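For the thread-setting case, a quick sanity check is cheap (a sketch using PyTorch's real APIs; on the Julia side, LinearAlgebra.BLAS.set_num_threads / BLAS.get_num_threads play the same role):

import torch

# confirm the thread cap actually took effect
torch.set_num_threads(1)
print(torch.get_num_threads())   # should print 1

# dump build info, which shows whether this wheel links MKL or OpenBLAS
print(torch.__config__.show())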