Why is this simple function twice as slow as its Python version?

This is actually slightly (2%) slower:

julia> using LinearAlgebra, Random, BenchmarkTools

julia> function test(rng, n::Int64, k::Int64, tmp2::Matrix{Float64})
           t = Matrix{Float64}(undef, n, n)
           tmp1 = Matrix{Float64}(undef, n, n)
           for j in 1:k
               t .= rand.(rng) 
               for i in 1:k
                   tmp1 .= rand.(rng)
                   @inbounds mul!(@view(tmp2[(i-1)*n+1:i*n,(j-1)*n+1:j*n]), t, tmp1)
               end
           end
       end
test (generic function with 2 methods)

julia> @btime test(rng, n, k, tmp2) setup=(rng = Random.default_rng(); n = 300; k = 30; tmp2 = Matrix{Float64}(undef, n * k, n * k));
  949.974 ms (4 allocations: 1.37 MiB)

This is strange. Your original function gives for me:

julia> @btime test($n,$k,$tmp2)
  1.137 s (3660 allocations: 1.23 GiB)

Similar to your original timing.

Yeah, but I could also have done the same in Python/Cython.
As pdeffebach wrote, it is perplexing why there is such a difference in performance between the vanilla versions of Python and Julia.


@inbounds doesn’t change much; it’s mul! that makes the bigger difference.
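For reference, a minimal sketch of the difference (array names here are just illustrative): * allocates a fresh result matrix on every call, while mul! writes into a preallocated one.

using LinearAlgebra, BenchmarkTools

A = rand(300, 300); B = rand(300, 300); C = similar(A);

@btime $A * $B;          # allocates a new 300×300 result on every call
@btime mul!($C, $A, $B); # writes the product into C in place, no allocation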

Different BLAS library? The bottleneck seems to be the matrix multiplication.
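(In case it helps to check: the active BLAS backend can be queried from LinearAlgebra.)

using LinearAlgebra

BLAS.vendor()       # Julia ≤ 1.6: returns a Symbol such as :openblas64
# BLAS.get_config() # Julia ≥ 1.7 replacement: lists the loaded BLAS/LAPACK libraries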


@giordano’s version gives for me:

julia> @btime test(rng, n, k, tmp2) setup=(rng = Random.default_rng(); n = 300; k = 30; tmp2 = Matrix{Float64}(undef, n * k, n * k));
  850.540 ms (4 allocations: 1.37 MiB)

That’s what I think too.


That was my idea as well.
But I tried both OpenBLAS and MKL, and there is no difference.
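(For anyone who wants to reproduce that comparison, a sketch assuming the MKL.jl package; on Julia ≤ 1.6 it swaps the BLAS backend when the package is built, on newer versions loading it is enough.)

using MKL           # from the MKL.jl package
using LinearAlgebra

BLAS.vendor()       # should now report :mkl instead of :openblas64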

As I reached my limit of replies as a newbie, I can’t reply any more for the next 22 hours.

In that example of code, with .= you can avoid many allocations, because the matrix values are overwritten in place instead of a complete new matrix being allocated.

function test2(n)
    k = 100
    t = rand(n, n)
    tmp1 = rand(n, n)
    for i in 1:k
        tmp1 .= t .+ t    # in-place broadcast: reuses tmp1's memory
    end
    for j in 1:k
        tmp1 .= tmp1 .* t # element-wise product, also in place
    end
end
julia> @btime test(100)
  7.308 ms (404 allocations: 15.43 MiB)

julia> @btime test2(100)
  990.166 μs (4 allocations: 156.41 KiB)

I cannot check your original code

test(n::Int64,k::Int64,tmp2::Array{Float64,2})

But maybe that strategy could be useful for you.
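To see the effect of .= in isolation, a minimal sketch (names are illustrative):

using BenchmarkTools

a = rand(100, 100); b = rand(100, 100); c = similar(a);

@btime $a .* $b;        # allocates a new 100×100 matrix on every call
@btime $c .= $a .* $b;  # fused broadcast into the existing c, no allocation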


In the flood of messages I had missed this one, but this is very likely the explanation of the difference: we were all doing matrix multiplication in Julia, and element-wise multiplication in Python.
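For readers skimming the thread: in Julia, * on matrices is the matrix product and .* is element-wise, which is the reverse of NumPy, where @ is the matrix product and * is element-wise. A quick illustration:

A = [1 2; 3 4]; B = [5 6; 7 8];

A * B    # matrix product (NumPy: A @ B)
A .* B   # element-wise product (NumPy: A * B)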


The Python code was adjusted to also do matrix multiplication, with t@tmp1.

Well, then I don’t understand the question about Python being faster than Julia; using the original code (but with matrix multiplication for Python) I get

julia> @btime test(n, k, tmp2) setup=(n = 300; k = 30; tmp2 = Matrix{Float64}(undef, n * k, n * k));
  1.258 s (3660 allocations: 1.23 GiB)
In [2]: %%timeit
   ...: test(n,k,tmp2)
   ...: 
   ...: 
1.98 s ± 56.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I think you are right:

using Random

function test2(n::Int64, k::Int64, tmp2::Array{Float64,2})
    for i in 1:k
        t = rand(n, n); tmp1 = similar(t)
        for j in 1:k
            rand!(tmp1)  # refill tmp1 in place instead of allocating a new matrix
            tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n] = t .* tmp1
        end
    end
end;
julia> @time test(n, k, tmp2)
 1.625298 seconds (3.66 k allocations: 1.227 GiB, 0.60% gc time)
julia> @time test2(n,k,tmp2)
  0.325837 seconds (1.92 k allocations: 659.253 MiB, 0.62% gc time)

I have only changed the matrix multiplication to broadcast multiplication, and used rand! to avoid so many allocations.
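The rand! part matters because rand(n, n) allocates a fresh n×n matrix on every inner iteration, while rand!(tmp1) refills the existing one:

using Random, BenchmarkTools

tmp1 = Matrix{Float64}(undef, 300, 300);

@btime rand(300, 300);  # allocates a ~700 KiB matrix on every call
@btime rand!($tmp1);    # fills tmp1 in place, no allocation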

@Semi, that version is faster than the Python version, isn’t it?


On the OP’s side it is a MacBook Pro running Julia 1.4, which also seems to make a difference here.

I can only check the Julia code, and for it, it went down from

julia> @btime test($n,$k,$tmp2)
  1.137 s (3660 allocations: 1.23 GiB)

in the original Julia code to

julia> @btime test($n,$k,$tmp2)
  809.509 ms (1862 allocations: 639.34 MiB)

using mul! and @inbounds.

Running on Julia 1.5.3 is significantly slower, but the performance gain from mul! is similar.

Please read the complete thread.
We are comparing matrix multiplication:
Julia: t*tmp1
Python: t@tmp1
which the OP changed during the discussion.
Maybe the OP could edit the original post?

Yes, you are right. I had read the first post, and I did not notice the change in the multiplication there. I think your proposal (using mul!) is the best option:

using LinearAlgebra, Random

function test2(n::Int64, k::Int64, tmp2::Array{Float64,2})
    for i in 1:k
        t = rand(n, n); tmp1 = similar(t)
        for j in 1:k
            rand!(tmp1)
            # write the product directly into the corresponding block of tmp2
            @views @inbounds mul!(tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n], t, tmp1)
        end
    end
end

Comparing both versions in my slow laptop (with Julia 1.6):

# Original code
julia> @time test(n,k,tmp2)
  1.806398 seconds (3.66 k allocations: 1.227 GiB, 1.27% gc time)
# New version
julia> @time test2(n,k,tmp2)
  1.453730 seconds (120 allocations: 41.203 MiB, 0.11% gc time)

It takes about 20% less time than the original (and a lot less memory).
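Most of the memory saving comes from mul! writing straight into a view of tmp2, instead of first materializing the product and then copying it into the slice. A minimal sketch of the two patterns:

using LinearAlgebra

n = 300;
tmp2 = zeros(2n, 2n); t = rand(n, n); tmp1 = rand(n, n);

tmp2[1:n, 1:n] = t * tmp1             # allocates the product, then copies it into the block
mul!(@view(tmp2[1:n, 1:n]), t, tmp1)  # writes the product directly into the block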

Update: My laptop is clearly slower than I thought, I need another one :slight_smile:.

julia> @btime test2(n,k,tmp2)
  663.240 ms (120 allocations: 41.20 MiB)

@dmolina’s version is the best for me.

I cannot reproduce this at all:

%timeit test(n, k, tmp2)
3.27 s ± 7.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Original version in Julia:

julia> @btime test(n, k, tmp2)
  1.648 s (3660 allocations: 1.23 GiB)

A version similar to @dmolina’s:

julia> function test(n::Int64, k::Int64, tmp2::Array{Float64,2})
           tmp1 = Matrix{Float64}(undef, n, n); t = Matrix{Float64}(undef, n, n)
           for j in 1:k
               Random.rand!(t)
               for i in 1:k
                   Random.rand!(tmp1)
                   mul!(@inbounds(@view(tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n])), t, tmp1)
               end
           end
       end;

julia> @btime test(n, k, tmp2)
  1.046 s (4 allocations: 1.37 MiB)

All with n = 300; k = 30.
EDIT: No idea how to optimize the Python version.


I cannot see that difference here either. But I noticed that:

  1. Both are using multi-threading, but Julia is launching 8 threads while Python is launching 4. My laptop has 4 physical cores. Maybe that explains the differences on some systems? (See the sketch after this list for how to check and set the BLAS thread count.)

  2. The measured times fluctuate by about 30% in both cases. Here I get anything between 0.9 and 1.2 seconds for either of them (using %timeit or @btime).

(I’m using Julia 1.6).
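(To check and pin the BLAS thread count mentioned in point 1: BLAS.set_num_threads has been around for a long time, and BLAS.get_num_threads is available since Julia 1.6.)

using LinearAlgebra

BLAS.get_num_threads()   # how many threads BLAS is currently using
BLAS.set_num_threads(4)  # pin BLAS to the number of physical cores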

Codes
from numpy import zeros
from numpy.random import rand
def test(n,k,tmp2):
    for i in range(k):
        t = rand(n, n) 
        for j in range(k):
            tmp1 = rand(n, n)
            tmp2[i*n:(i+1)*n,j*n:(j+1)*n] = t@tmp1
n = 300
k = 30
tmp2=zeros((n*k,n*k))
%timeit test(n, k, tmp2)
# 1.18 s ± 78.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
julia> function test(n::Int64,k::Int64,tmp2::Array{Float64,2})
           for i in 1:k
               t = rand(n, n) 
               for j in 1:k
                   tmp1 = rand(n, n)
                   tmp2[(i-1)*n+1:i*n,(j-1)*n+1:j*n] = t*tmp1
               end
           end
       end;

julia> n = 300
300

julia> k = 30
30

julia> tmp2=zeros((n*k,n*k));

julia> using BenchmarkTools

julia> @btime test($n,$k,$tmp2)
  1.016 s (3660 allocations: 1.23 GiB)