For example, matrix multiplication of 10,000 × 10,000 matrices, single-threaded.
In Julia:
using LinearAlgebra, BenchmarkTools

BLAS.set_num_threads(1)
A = randn(10000, 10000)
B = randn(10000, 10000)
C = Matrix{Float64}(undef, 10000, 10000)
@benchmark mul!($C, $A, $B)
The result shows
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 18.192 s (0.00% GC)
median time: 18.192 s (0.00% GC)
mean time: 18.192 s (0.00% GC)
maximum time: 18.192 s (0.00% GC)
--------------
samples: 1
evals/sample: 1
I also tried LoopVectorization:
using LoopVectorization

function mygemmavx!(C, A, B)
    @turbo for m ∈ axes(A,1), n ∈ axes(B,2)
        Cmn = zero(eltype(C))
        for k ∈ axes(A,2)
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end
@benchmark mygemmavx!($C, $A, $B)
Results:
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 149.960 s (0.00% GC)
median time: 149.960 s (0.00% GC)
mean time: 149.960 s (0.00% GC)
maximum time: 149.960 s (0.00% GC)
--------------
samples: 1
evals/sample: 1
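For context, these timings can be translated into achieved FLOP rates: multiplying two n×n matrices costs roughly 2n³ floating-point operations. This is just back-of-the-envelope arithmetic on the numbers above, not a measurement:

```python
# Rough FLOP-rate estimate: an n x n matmul costs ~2*n^3 FLOPs.
n = 10_000
flops = 2 * n**3  # 2e12 floating-point operations

blas_time = 18.192    # seconds, single-threaded BLAS (from the benchmark above)
turbo_time = 149.960  # seconds, @turbo triple loop (from the benchmark above)

print(f"BLAS:  {flops / blas_time / 1e9:.1f} GFLOP/s")   # -> BLAS:  109.9 GFLOP/s
print(f"turbo: {flops / turbo_time / 1e9:.1f} GFLOP/s")  # -> turbo: 13.3 GFLOP/s
```

So the BLAS call is already running about 11× faster than the hand-written @turbo loop, which is expected: a plain triple loop without cache blocking is memory-bound at this size.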
However, using PyTorch:
import torch
torch.set_num_threads(1)
A = torch.randn(10000, 10000)
B = torch.randn(10000, 10000)
%timeit -n 3 torch.matmul(A, B)
It took only ~9 s on average (7 runs, 3 loops each).
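One factor that may be worth ruling out (an assumption on my part, not something verified on your machine): torch.randn defaults to float32, while Julia's randn produces Float64, so the two benchmarks may not be multiplying at the same precision. A minimal NumPy sketch of the precision effect (note n and the use of NumPy here are just for illustration, and unlike the snippets above this does not pin the BLAS thread count):

```python
import time
import numpy as np

n = 2000  # smaller than 10000 so the sketch runs in seconds

A64 = np.random.randn(n, n)  # float64, same element type as Julia's randn
B64 = np.random.randn(n, n)
# float32, the default precision of torch.randn
A32, B32 = A64.astype(np.float32), B64.astype(np.float32)

t0 = time.perf_counter(); A64 @ B64; t64 = time.perf_counter() - t0
t0 = time.perf_counter(); A32 @ B32; t32 = time.perf_counter() - t0
print(f"float64: {t64:.3f} s, float32: {t32:.3f} s")
```

On typical hardware the float32 multiply runs roughly twice as fast, which would account for a large part of the gap.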
I’m so confused. Could anyone shed some light on this? Thank you!