This is actually slightly (2%) slower.
julia> using LinearAlgebra, Random, BenchmarkTools
julia> function test(rng, n::Int64, k::Int64, tmp2::Matrix{Float64})
           t = Matrix{Float64}(undef, n, n)
           tmp1 = Matrix{Float64}(undef, n, n)
           for j in 1:k
               t .= rand.(rng)
               for i in 1:k
                   tmp1 .= rand.(rng)
                   @inbounds mul!(@view(tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n]), t, tmp1)
               end
           end
       end
test (generic function with 2 methods)
julia> @btime test(rng, n, k, tmp2) setup=(rng = Random.default_rng(); n = 300; k = 30; tmp2 = Matrix{Float64}(undef, n * k, n * k));
949.974 ms (4 allocations: 1.37 MiB)
This is strange. Your original function gives me:
julia> @btime test($n,$k,$tmp2)
1.137 s (3660 allocations: 1.23 GiB)
Similar to your original timing.
Yeah, but I could also have done the same in python/cython.
As pdeffebach wrote, it is perplexing why there is such a difference in performance between vanilla versions of python and julia.
@inbounds doesn’t change much; it’s mul! that accounts for the larger step.
Different BLAS library? The bottleneck seems to be the matrix multiplication.
@giordano’s version gives me:
julia> @btime test(rng, n, k, tmp2) setup=(rng = Random.default_rng(); n = 300; k = 30; tmp2 = Matrix{Float64}(undef, n * k, n * k));
850.540 ms (4 allocations: 1.37 MiB)
That’s what I think too.
That was my idea as well.
But I tried both openblas and mkl, and there is no difference.
As I reached my limit of replies as a newbie, I can’t reply any more for the next 22 hours.
In that example of code, with .= you can avoid many allocations, because the matrix values are modified in place rather than a whole new matrix being allocated.
function test2(n)
    k = 100
    t = rand(n, n)
    tmp1 = rand(n, n)
    for i in 1:k
        tmp1 .= t .+ t
    end
    for j in 1:k
        tmp1 .= tmp1 .* t
    end
end
julia> @btime test(100)
7.308 ms (404 allocations: 15.43 MiB)
julia> @btime test2(100)
990.166 μs (4 allocations: 156.41 KiB)
I cannot check your original code
test(n::Int64,k::Int64,tmp2::Array{Float64,2})
But maybe that strategy could be useful for you.
In the flood of messages I had missed this one, but it is very likely the explanation of the difference: we were all doing matrix multiplication in Julia, and element-wise multiplication in Python.
The Python code was adjusted to do this too, using t@tmp1.
Well, then I don’t understand the claim that Python is faster than Julia; using the original code (but with matrix multiplication for Python) I get
julia> @btime test(n, k, tmp2) setup=(n = 300; k = 30; tmp2 = Matrix{Float64}(undef, n * k, n * k));
1.258 s (3660 allocations: 1.23 GiB)
In [2]: %%timeit
...: test(n,k,tmp2)
...:
...:
1.98 s ± 56.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
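For readers following the `*` vs `@` point: in NumPy the two operators compute different things, which is what made the early comparison apples-to-oranges. A tiny illustration (my own example, not from the thread):

```python
import numpy as np

# `*` is element-wise (Hadamard) multiplication in NumPy,
# while `@` is true matrix multiplication -- the operation
# that the Julia code's t*tmp1 performs.
t = np.array([[1.0, 2.0],
              [3.0, 4.0]])
eye = np.eye(2)

elementwise = t * eye   # zeroes the off-diagonal entries of t
matmul = t @ eye        # returns t unchanged
```

Multiplying element-wise by the identity keeps only the diagonal of t, whereas the matrix product with the identity is t itself, so the two costs (and results) are not comparable.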
I think you are right:
function test2(n::Int64, k::Int64, tmp2::Array{Float64,2})
    for i in 1:k
        t = rand(n, n); tmp1 = similar(t)
        for j in 1:k
            rand!(tmp1)
            tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n] = t .* tmp1
        end
    end
end;
julia> @time test(n, k, tmp2)
1.625298 seconds (3.66 k allocations: 1.227 GiB, 0.60% gc time)
julia> @time test2(n,k,tmp2)
0.325837 seconds (1.92 k allocations: 659.253 MiB, 0.62% gc time)
I have only changed the matrix multiplication to broadcast multiplication and used rand! to avoid so many allocations.
@Semi, that version is faster than the Python version, isn’t it?
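If one wanted the rand!-style in-place fill on the Python side as well, NumPy’s Generator.random accepts an out= argument that overwrites an existing float64 array. A small sketch (variable names are mine):

```python
import numpy as np

# Fill a preallocated array in place, like rand!(tmp1) in Julia,
# instead of allocating a new random matrix each iteration.
rng = np.random.default_rng(0)
n = 4
tmp1 = np.empty((n, n))
rng.random(out=tmp1)      # overwrites tmp1; no new allocation
t = rng.random((n, n))
result = t * tmp1         # element-wise, like t .* tmp1
```

Note that out= requires the array dtype to be float64 (or float32 with dtype=np.float32); with a matching buffer the fill happens entirely in place.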
On the OP’s side it is a MacBook Pro running Julia 1.4, which also seems to make a difference here.
I can only check the Julia code, and for it, it went down from
julia> @btime test($n,$k,$tmp2)
1.137 s (3660 allocations: 1.23 GiB)
in the original Julia code to
@btime test($n,$k,$tmp2)
809.509 ms (1862 allocations: 639.34 MiB)
using mul! and @inbounds. Running in Julia 1.5.3 is significantly slower, but the performance gain from mul! is similar.
Please read the complete thread.
We are comparing matrix multiplication:
Julia: t*tmp1
Python: t@tmp1
which OP changed during the discussion.
Maybe OP could edit his original post?
Yes, you are right. I had only read the first message, and I did not notice the change in the multiplication. I think your proposal (using mul!) is the best option:
function test2(n::Int64, k::Int64, tmp2::Array{Float64,2})
    for i in 1:k
        t = rand(n, n); tmp1 = similar(t)
        for j in 1:k
            rand!(tmp1)
            @views @inbounds mul!(tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n], t, tmp1)
        end
    end
end
Comparing both versions in my slow laptop (with Julia 1.6):
# Original code
julia> @time test(n,k,tmp2)
1.806398 seconds (3.66 k allocations: 1.227 GiB, 1.27% gc time)
# New version
julia> @time test2(n,k,tmp2)
1.453730 seconds (120 allocations: 41.203 MiB, 0.11% gc time)
It takes 33% less time than the original (and a lot less memory).
Update: My laptop is clearly slower than I thought; I need another one.
julia> @btime test2(n,k,tmp2)
663.240 ms (120 allocations: 41.20 MiB)
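The mul!-into-a-view trick also has a NumPy counterpart, since np.matmul accepts out= and can, as far as I know, write directly into a block slice of the target array. A reduced sketch with toy sizes (n=3, k=2, not the thread’s n=300, k=30; t and tmp1 are kept fixed here for brevity):

```python
import numpy as np

n, k = 3, 2
tmp2 = np.zeros((n * k, n * k))
t = np.random.rand(n, n)
tmp1 = np.random.rand(n, n)

for i in range(k):
    for j in range(k):
        # write the n-by-n product straight into the (i, j) block of
        # tmp2, avoiding an intermediate allocation (0-based slicing,
        # vs. Julia's 1-based (i-1)*n+1:i*n ranges)
        np.matmul(t, tmp1, out=tmp2[i*n:(i+1)*n, j*n:(j+1)*n])
```

Whether this actually avoids a temporary for a non-contiguous block view may depend on the NumPy version, so treat it as a sketch rather than a guaranteed optimization.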
@dmolina’s version is best for me.
I cannot reproduce this at all:
%timeit test(n, k, tmp2)
3.27 s ± 7.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Original version in jl.
@btime test(n, k, tmp2)
1.648 s (3660 allocations: 1.23 GiB)
Version similar to @dmolina
julia> function test(n::Int64, k::Int64, tmp2::Array{Float64,2})
           tmp1 = Matrix{Float64}(undef, n, n); t = Matrix{Float64}(undef, n, n)
           for j in 1:k
               Random.rand!(t)
               for i in 1:k
                   Random.rand!(tmp1)
                   mul!(@inbounds(@view(tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n])), t, tmp1)
               end
           end
       end;
julia> @btime test(n, k, tmp2)
1.046 s (4 allocations: 1.37 MiB)
All with n=300;k=30
EDIT: No idea how to optimize the python version
I cannot see that difference here either. But I noticed that:
- Both are using multi-threading, but Julia is launching 8 threads while Python is launching 4. My laptop has 4 physical cores. Maybe that explains the differences on some systems?
- The measured times fluctuate by about 30% in both cases. Here I can get something between 1.2 and 0.9 seconds for either of them (using %timeit or @btime).
(I’m using Julia 1.6.)
Codes
import numpy
from numpy import zeros
from numpy.random import rand

def test(n, k, tmp2):
    for i in range(k):
        t = rand(n, n)
        for j in range(k):
            tmp1 = rand(n, n)
            tmp2[i*n:(i+1)*n, j*n:(j+1)*n] = t @ tmp1

n = 300
k = 30
tmp2 = zeros((n*k, n*k))

%timeit test(n, k, tmp2)
# 1.18 s ± 78.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
julia> function test(n::Int64, k::Int64, tmp2::Array{Float64,2})
           for i in 1:k
               t = rand(n, n)
               for j in 1:k
                   tmp1 = rand(n, n)
                   tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n] = t * tmp1
               end
           end
       end;
julia> n = 300
300
julia> k = 30
30
julia> tmp2=zeros((n*k,n*k));
julia> using BenchmarkTools
julia> @btime test($n,$k,$tmp2)
1.016 s (3660 allocations: 1.23 GiB)
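Regarding the thread-count observation above: before trusting a cross-language BLAS benchmark, it is worth checking what the threading environment looks like on each side. A minimal Python check (the environment variable names are the usual OpenBLAS/OpenMP/MKL ones; whether each applies depends on which BLAS the NumPy build links):

```python
import os

# Logical CPUs visible to the process; BLAS libraries often default
# to this many threads (or the physical-core count) when unset.
print("logical CPUs:", os.cpu_count())

# These variables must be set BEFORE numpy is imported to take effect.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    print(var, "=", os.environ.get(var, "<unset>"))
```

On the Julia side, the analogous check is `LinearAlgebra.BLAS.get_num_threads()`; pinning both runtimes to the same thread count removes one source of the 8-vs-4 discrepancy noted earlier.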