Why is this simple function twice as slow as its Python version?

I have problems timing the Python code (not a Pythonista here :slight_smile:). %time and %timeit don't seem to work inside a plain Python shell (do they only work in a notebook?). Can someone give me a hint? Despite what I said, I installed Python 3 anyway, out of curiosity.

The %time/%timeit magics are commands of the IPython shell (an alternative Python shell, included with Jupyter), not of the built-in Python REPL. You can benchmark either with the IPython magics (if you have IPython installed, run the ipython command) or with the built-in timeit standard library module, which is accessible from normal Python code.

I could run the code above in a Jupyter notebook, yes.

Tip of the day:

julia> using IPython

[Type `.` to enter IPython mode]

In [1]: %timeit 2+4
6.66 ns ± 0.00743 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
14 Likes

Thanks @GunnarFarneback, this is again an example of how user-friendly Julia is. Before your post, I installed Python 3, pip3, IPython, and numpy on my Debian Linux box, and for this I had to do quite some research, which led me through obscure environment setups with virtualenv --python=python3 .venv and so on, until I was finally able to run the OP's comparison.

Now, after your post and using IPython from Julia, I no longer want to know which one is faster. I just want to use Julia, because it is so much more elegant.

OK, for the sake of completeness, here is my timing with (I)Python:

Code
(@v1.6) pkg> add IPython
using IPython
IPython.install_dependency("ipython")
.   # type `.` at the julia> prompt to enter IPython mode
In [1]: import numpy
   ...: from numpy import zeros
   ...: from numpy.random import rand
   ...:

In [2]: def test(n,k,tmp2):
   ...:     for i in range(k):
   ...:         t = rand(n, n)
   ...:         for j in range(k):
   ...:             tmp1 = rand(n, n)
   ...:             tmp2[i*n:(i+1)*n,j*n:(j+1)*n] = t@tmp1
   ...:

In [3]: n = 300
   ...: k = 30
   ...: tmp2=zeros((n*k,n*k))
   ...:
In [4]: %timeit test(n,k,tmp2)
933 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Compared to the OP's original Julia code:

Code
using BenchmarkTools
function test(n::Int64,k::Int64,tmp2::Array{Float64,2})
    for i in 1:k
        t = rand(n, n) 
        for j in 1:k
            tmp1 = rand(n, n)
            tmp2[(i-1)*n+1:i*n,(j-1)*n+1:j*n] = t*tmp1
        end
    end
end
n = 300
k = 30
tmp2=zeros((n*k,n*k))
julia> @btime test($n,$k,$tmp2)
  1.380 s (3660 allocations: 1.23 GiB)

Compared to the current best Julia code (from @dmolina):

Code
using BenchmarkTools, Random, LinearAlgebra
function test2(n::Int64, k::Int64, tmp2::Array{Float64,2})
    for i in 1:k
        t = rand(n, n); tmp1 = similar(t)
        for j in 1:k
            rand!(tmp1)
            @views @inbounds mul!(tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n], t, tmp1)
        end
    end
end
n = 300
k = 30
tmp2=zeros((n*k,n*k))
julia> @btime test2($n,$k,$tmp2)
  665.953 ms (120 allocations: 41.20 MiB)

This doesn't add anything new.
What we actually need is someone with an iMac to check whether there is something special about these systems that results in bad Julia timings even when the LinearAlgebra package is used.

1 Like

What you are reporting there is exactly what the OP reports now (Python being ~30% faster). And that is related to the fact that BLAS launches 2N threads on a computer with N physical cores. At least here, the OP's original code runs like this:

julia> @btime test($n,$k,$tmp2)
  1.240 s (3660 allocations: 1.23 GiB)

but:

julia> using LinearAlgebra

julia> BLAS.set_num_threads(4)

julia> @btime test($n,$k,$tmp2)
  857.055 ms (3660 allocations: 1.23 GiB)

My computer has 4 physical cores. Before setting the number of BLAS threads manually, it launches 8 threads. That is the issue @ChrisRackauckas linked above; it is a known issue.

(test2 above also becomes faster with that change).
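
For reference, this is roughly how one can check and adjust the BLAS thread count from the REPL (a minimal sketch; it assumes a hyperthreaded machine where Sys.CPU_THREADS is twice the number of physical cores):

using LinearAlgebra

BLAS.get_num_threads()   # how many threads OpenBLAS is currently using

# Sys.CPU_THREADS reports logical CPUs; on a typical hyperthreaded machine
# halving it gives the physical-core count, which is usually the better choice.
BLAS.set_num_threads(Sys.CPU_THREADS ÷ 2)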

2 Likes

Not exactly.
Python being 30% faster is the comparison between matrix multiplication using numpy and Julia using the standard * operator (test). As I don't know enough Python and numpy, my guess is that numpy is already quite optimized for matrix multiplication, presumably already using BLAS(?), while Julia isn't without using LinearAlgebra, so we would be comparing apples with oranges.

This is not related to the number of BLAS threads.

Introducing LinearAlgebra (test2) already made Julia faster. The effect of the number of threads is not that strong, or rather, it is more complex:

julia> BLAS.get_num_threads()
8

julia> @btime test2($n,$k,$tmp2)
  665.953 ms (120 allocations: 41.20 MiB)
julia> BLAS.set_num_threads(4)

julia> @btime test2($n,$k,$tmp2)
  760.411 ms (120 allocations: 41.20 MiB)
julia> BLAS.set_num_threads(6)

julia> @btime test2($n,$k,$tmp2)
  658.991 ms (120 allocations: 41.20 MiB)
julia> BLAS.set_num_threads(24)

julia> @btime test2($n,$k,$tmp2)
  704.523 ms (120 allocations: 41.20 MiB)
julia> BLAS.set_num_threads(1)

julia> @btime test2($n,$k,$tmp2)
  1.008 s (120 allocations: 41.20 MiB)

Even with a single thread, Julia's test2 (1.008 s) is nearly as fast as Python/numpy (933 ms).
Using 6 threads seems to be the optimum (658.991 ms).
Going beyond 8 threads again decreases performance.
(This is problem- and CPU/system-specific, and IMPORTANT: test2 is slightly tweaked to avoid unneeded allocations, which is an important point when using Julia. I don't know whether a similar tweak is needed or possible for the Python test function, but test and test2 aren't equivalent anymore!)

The thread issue is interesting, but not so significant for this comparison IMO. I still suspect something special about macOS, because the OP said that the proposals we made still result in better Python timings:

No one except the OP has checked on a macOS system, or at least no one has told us.

2 Likes

Julia quite happily uses BLAS for matrix multiplication even without using LinearAlgebra. Why wouldn't it?
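
A quick, minimal way to see this for yourself (the exact file and line reported will vary by Julia version):

using InteractiveUtils   # provides @which; already loaded in the REPL

# `*` on two dense Float64 matrices dispatches to a method defined in
# LinearAlgebra (which ultimately calls BLAS gemm), even without an
# explicit `using LinearAlgebra`:
@which rand(3, 3) * rand(3, 3)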

1 Like

OK, but then why is m1*m2 so much slower than mul!(buf, m1, m2)?

Actually the standard matrix multiplication in Julia is a call to BLAS gemm!, and probably Python's is the same (I don't know how to check). test2 uses mul! in place, while in test a new matrix is allocated at every multiplication when one calls t*tmp1. That is probably the major difference between the two.
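
To illustrate the allocation difference with a minimal sketch (absolute times of course depend on the machine and the BLAS configuration):

using BenchmarkTools, LinearAlgebra

A = rand(300, 300)
B = rand(300, 300)
C = similar(A)

@btime $A * $B;           # allocates a fresh 300×300 result on every call
@btime mul!($C, $A, $B);  # writes into the preallocated buffer C, no allocation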

One needs to load LinearAlgebra to set the number of BLAS threads, even if BLAS is being used without explicitly loading the package.

The effect of the number of BLAS threads seems consistent to me:

julia> using BenchmarkTools

julia> @btime test($n,$k,$tmp2)
  938.132 ms (3660 allocations: 1.23 GiB)

julia> @btime test($n,$k,$tmp2)
  1.192 s (3660 allocations: 1.23 GiB)

julia> @btime test($n,$k,$tmp2)
  1.195 s (3660 allocations: 1.23 GiB)

julia> @btime test($n,$k,$tmp2)
  1.206 s (3660 allocations: 1.23 GiB)

julia> using LinearAlgebra

julia> @btime test($n,$k,$tmp2)
  1.175 s (3660 allocations: 1.23 GiB)

julia> BLAS.set_num_threads(4) # this is actually reducing from 8 to 4

julia> @btime test($n,$k,$tmp2)
  820.822 ms (3660 allocations: 1.23 GiB)

julia> @btime test($n,$k,$tmp2)
  844.377 ms (3660 allocations: 1.23 GiB)


2 Likes

Thanks, that explanation calms my rising confusion down again.
But it doesn't work out for me:

julia> @btime test($n,$k,$tmp2)
  1.369 s (3660 allocations: 1.23 GiB)

julia> using LinearAlgebra

julia> BLAS.set_num_threads(4)

julia> @btime test($n,$k,$tmp2)
  1.295 s (3660 allocations: 1.23 GiB)

julia> BLAS.set_num_threads(12)

julia> @btime test($n,$k,$tmp2)
  1.428 s (3660 allocations: 1.23 GiB)

julia> BLAS.set_num_threads(6)

julia> @btime test($n,$k,$tmp2)
  1.312 s (3660 allocations: 1.23 GiB)

on an AMD Ryzen 9 3900X 12-core processor.

Always slower than Python (933 ms), where test is the OP's original Julia version.

Well, I think we would need to figure out what exactly Python is calling there. Here (Intel i7 8th gen, 4 physical cores) the Julia code is slightly faster with 4 threads:


julia> BLAS.set_num_threads(8)

julia> @btime test($n,$k,$tmp2)
  1.186 s (3660 allocations: 1.23 GiB)

julia> BLAS.set_num_threads(4)

julia> @btime test($n,$k,$tmp2)
  857.846 ms (3660 allocations: 1.23 GiB)

In [17]: %timeit test(n, k, tmp2)
1.02 s ± 16.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

which is what I would expect, given that only a small speedup is possible in the parts of the code that are not the matrix multiplication itself.

Now I see that apparently my numpy is using MKL (Julia is using OpenBLAS). MKL is usually faster. Maybe that makes a greater difference on your system than on mine; I think these performance numbers are quite dependent on the architecture.
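
If someone wants to check from the Julia side which BLAS is in use, a minimal check (on Julia 1.7+ BLAS.get_config() gives more detail):

using LinearAlgebra

BLAS.vendor()   # reports the compiled-in BLAS on Julia 1.6, e.g. :openblas64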

2 Likes

Why are t = rand(n, n); tmp1 = similar(t) inside the outer loop? Wouldn’t this work:

function test3!(tmp2::Array{Float64,2}, n::Integer, k::Integer)
    t = similar(tmp2, n, n)
    tmp1 = similar(t)
    for i in 0:k-1
        rand!(t)
        for j in 0:k-1
            rand!(tmp1)
            @views @inbounds mul!(tmp2[i*n+1:(i+1)*n, j*n+1:(j+1)*n], t, tmp1)
        end
    end
end

Still, this is all down to the matrix muls.

Thanks everyone for such an interest in my problem!

I installed Julia 1.6 to be up to date. Now I'm using OpenBLAS, since I was not able to install MKL (building MKL fails for some reason). But anyway, that made no difference in 1.4.

The original Julia code is now 40% slower than the Python one.

The function

function test(n::Int64,k::Int64,tmp2::Array{Float64,2})
    buf=zeros(n,n)
    for j in 1:k
        t = rand(n, n) 
        for i in 1:k
            tmp1 = rand(n, n)
            tmp2[(i-1)*n+1:i*n,(j-1)*n+1:j*n] = mul!(buf, t, tmp1)
        end
    end
end

is as fast as the Python version.

And the function

using Random
function test2(n::Int64, k::Int64, tmp2::Array{Float64,2})
    for i in 1:k
        t = rand(n, n); tmp1 = similar(t)
        for j in 1:k
            rand!(tmp1)
            @views @inbounds mul!(tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n], t, tmp1)
        end
    end
end

is twice as fast (interchanging variables i and j makes no difference).
The gain here comes from @views @inbounds.
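
For context, a minimal illustration of what @views changes here (the array names are just for illustration):

A = zeros(10, 10)

B1 = A[1:5, 1:5]          # plain indexing copies the block into a new array
B2 = @views A[1:5, 1:5]   # @views yields a SubArray that aliases A, no copy

B2 .= 1.0                 # writes through to A
A[1, 1]                   # 1.0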

I don't understand why Julia needs such complex code to efficiently multiply two matrices and write the result into a third matrix. This makes no sense to me and makes for very unpleasant code to read.

I'd say you are not comparing the same things. Try to use MKL, and then you will be able to do fair comparisons.

As I already mentioned, with Julia 1.4 and MKL the results are the same; I doubt that Julia 1.6 + MKL would make a difference.

A further remark. With

using Random
function test2(n::Int64,k::Int64,tmp2::Array{Float64,2})
    t = zeros(n,n)
    tmp1 = zeros(n,n)
    for i in 1:k
        rand!(t) 
        for j in 1:k
            rand!(tmp1)
            tmp2[(i-1)*n+1:i*n,(j-1)*n+1:j*n] = t*tmp1
        end
    end
end;

Julia is 15% slower than Python, a significant gain from the 40% when it had to allocate memory for all the matrices inside the loops, but still slower than Python, which does allocate inside the loops.

If you want to time BLAS library calls, I think it might be less confusing to time just one multiplication, and then track down the relevant variables, which are, as people have mentioned, which BLAS library you are timing and how many threads it is using. (And what computer you use, of course.) Sort that out first, but don't pretend you are timing the host language.
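
For example, something along these lines isolates the BLAS call itself (a minimal sketch; pick thread counts that match your machine):

using BenchmarkTools, LinearAlgebra

A = rand(300, 300); B = rand(300, 300); C = similar(A)

BLAS.set_num_threads(1)
@btime mul!($C, $A, $B);   # single-threaded gemm

BLAS.set_num_threads(4)    # e.g. the number of physical cores
@btime mul!($C, $A, $B);   # multi-threaded gemm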

If you have some more complicated function of interest, then there are other factors. There are cases where Julia’s allocation and garbage collection are slower, and many of the improvements posted are working around this. For a loop containing many matrix multiplications, you would also probably want to set BLAS to one thread, and have Threads.@threads on the outermost loop. But it’s hard to know what to fix here, as the entire test function seems like a baroque way of generating normally distributed numbers. Does writing @. tmp = n/4 + $sqrt(n/20) * randn() count as a solution? (It should be pretty quick!)
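
A minimal sketch of that single-BLAS-thread-plus-threaded-outer-loop idea, assuming Julia was started with several threads (e.g. julia -t 4); the function name and exact structure are just illustrative:

using LinearAlgebra, Random

function test_threaded!(tmp2::Matrix{Float64}, n::Integer, k::Integer)
    BLAS.set_num_threads(1)    # let Julia's own threads provide the parallelism
    Threads.@threads for i in 1:k
        t = rand(n, n)         # per-iteration buffers, so threads don't share them
        tmp1 = similar(t)
        for j in 1:k
            rand!(tmp1)
            @views mul!(tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n], t, tmp1)
        end
    end
    return tmp2
end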

2 Likes

This is not really the point. The function doesn't do anything useful; it is just a test to see how Julia handles matrix operations inside loops. Such things are common in numerical linear algebra; this is not something esoteric.
In my opinion, I should not have to think about how many threads BLAS is using, as I didn't have to in the Python/numpy version.