I guess these differences are related to the fact that the BLAS call in the background uses multi-threading with all available CPUs, so what one gets depends on what else is running on the computer besides the code.
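If that is the case, one can make the timings more reproducible by pinning BLAS to a fixed number of threads from within Julia. A minimal sketch (the matrix size here is arbitrary):

using LinearAlgebra, BenchmarkTools

BLAS.get_num_threads()     # how many threads BLAS is using right now
BLAS.set_num_threads(1)    # pin BLAS to one thread for more reproducible timings

A = rand(300, 300); B = rand(300, 300);
@btime $A * $B;            # now less sensitive to whatever else is running on the machine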
Also, the Python version can be improved on most of the same points as the Julia version above, by preallocating the arrays. The performance difference will get smaller, since the only computation there is the matrix multiplication. But, then, I don't know how ugly or pretty the corresponding Python code would be.
One note: apparently NumPy defaults to 32-bit MKL. I am not sure what the implications of that are for the result of a matrix multiplication. Yet, with Octavian, which is pure Julia, switching from one number representation to the other depends only on the element types of the input matrices. With that, the Julia code becomes much faster:
using Octavian, BenchmarkTools

function test_fast(n, k, tmp2)
    # buffers are preallocated once; their element type follows tmp2
    t = Matrix{eltype(tmp2)}(undef, n, n)
    tmp1 = similar(t)
    for i in 1:k
        t .= rand.()         # refill t in place, no new allocation
        for j in 1:k
            tmp1 .= rand.()  # refill tmp1 in place
            # write the product directly into the corresponding block of tmp2
            buf = @view(tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n])
            matmul!(buf, t, tmp1)
        end
    end
end

n = 300; k = 30; tmp2 = zeros(Float32, n*k, n*k);

@btime test_fast($n, $k, $tmp2)
  170.436 ms (4 allocations: 703.28 KiB)
Note that the function itself is agnostic to the number type, which is defined only by how one initializes tmp2 outside the function. This is one of the beauties of Julia: in this case the element type is propagated all the way down to the matrix multiplication.