Here is a minimal working example:
using LoopVectorization

# Computes C .= A*B .+ bias (matrix-vector product plus bias)
function mygemmavx!(C, A, B, bias)
    @turbo for m ∈ axes(A,1)
        Cmk = bias[m]
        for k ∈ axes(A,2)
            Cmk += A[m,k] * B[k]
        end
        C[m] = Cmk
    end
    C
end
using BenchmarkTools

A1 = zeros(Float32, 32, 128)
A2 = zeros(Float32, 32, 128) .+ 1f-40  # identical shape, but every entry is 1f-40
B = rand(Float32, 128)
b = rand(Float32, 32)
C = rand(Float32, 32)
@btime mygemmavx!($C, $A1, $B, $b)
@btime mygemmavx!($C, $A2, $B, $b)
which gives:

114.423 ns (0 allocations: 0 bytes)
14.678 μs (0 allocations: 0 bytes)
I don’t understand where this comes from. I ran into it while training weights for a neural network.
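In case it is relevant: 1f-40 is smaller than floatmin(Float32) ≈ 1.1754944f-38, so every entry of A2 is a subnormal (denormal) number, which x86 CPUs typically handle through slow microcode paths. A quick sketch of how one could check this and try the flush-to-zero workaround (set_zero_subnormals only affects the calling thread, and whether flushing is acceptable depends on the application):

issubnormal(1f-40)         # true: 1f-40 is below floatmin(Float32)
set_zero_subnormals(true)  # ask the CPU to flush subnormals to zero
@btime mygemmavx!($C, $A2, $B, $b)  # if subnormals are the cause, this should match the A1 timing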
Julia Version 1.8.1
Commit afb6c60d69a (2022-09-06 15:09 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × Intel(R) Core(TM) i7-10700KF CPU @ 3.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
Threads: 8 on 16 virtual cores