Why is this simple function twice as slow as its Python version?

I have problems timing the Python code (not a Pythonista here :slight_smile:). %time and %timeit don't seem to work inside a plain Python shell (do they only work in a notebook?). Can someone give me a hint? Despite what I said, I installed Python 3 anyway, out of curiosity.

The %time/%timeit magics are commands of the IPython shell (an alternative Python shell, included with Jupyter), not of the built-in Python REPL. You can benchmark either with the IPython magics (if you have IPython installed, run the ipython command) or with the built-in timeit standard library module, which is accessible from normal Python code.

I could run the code above in a Jupyter notebook, yes.

Tip of the day:

julia> using IPython

[Type `.` to enter IPython mode]

In [1]: %timeit 2+4
6.66 ns ± 0.00743 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
14 Likes

Thanks @GunnarFarneback, this is again an example of how user-friendly Julia is. Before your post, I installed Python 3, pip3, IPython, and numpy on my Debian Linux box, and for this I had to do quite some research, which led me through obscure environment setups with virtualenv --python=python3 .venv and so on, until I was finally able to run the OP's comparison.

Now, after your post and using IPython from Julia, I no longer want to know which one is faster. I just want to use Julia, because it is so much more elegant.

OK, for the sake of completeness, here is my timing with (I)Python:

Code
(@v1.6) pkg> add IPython
using IPython
IPython.install_dependency("ipython")
.   # type `.` at the julia> prompt to enter IPython mode
In [1]: import numpy
   ...: from numpy import zeros
   ...: from numpy.random import rand
   ...:

In [2]: def test(n,k,tmp2):
   ...:     for i in range(k):
   ...:         t = rand(n, n)
   ...:         for j in range(k):
   ...:             tmp1 = rand(n, n)
   ...:             tmp2[i*n:(i+1)*n,j*n:(j+1)*n] = t@tmp1
   ...:

In [3]: n = 300
   ...: k = 30
   ...: tmp2=zeros((n*k,n*k))
   ...:
In [4]: %timeit test(n,k,tmp2)
933 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Compared to the OP's original Julia code:

Code
using BenchmarkTools
function test(n::Int64,k::Int64,tmp2::Array{Float64,2})
    for i in 1:k
        t = rand(n, n) 
        for j in 1:k
            tmp1 = rand(n, n)
            tmp2[(i-1)*n+1:i*n,(j-1)*n+1:j*n] = t*tmp1
        end
    end
end
n = 300
k = 30
tmp2=zeros((n*k,n*k))
julia> @btime test($n,$k,$tmp2)
  1.380 s (3660 allocations: 1.23 GiB)

Compared to the current best Julia code (from @dmolina):

Code
using BenchmarkTools, Random, LinearAlgebra
function test2(n::Int64, k::Int64, tmp2::Array{Float64,2})
    for i in 1:k
        t = rand(n, n); tmp1 = similar(t)
        for j in 1:k
            rand!(tmp1)
            @views @inbounds mul!(tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n], t, tmp1)
        end
    end
end
n = 300
k = 30
tmp2=zeros((n*k,n*k))
julia> @btime test2($n,$k,$tmp2)
  665.953 ms (120 allocations: 41.20 MiB)

This doesn't add anything new.
What we actually need is someone with an iMac to check whether there is something special about these systems that results in bad Julia timings even when the LinearAlgebra package is used.

1 Like

What you are reporting there is exactly what the OP reports now (Python being ~30% faster). And that is related to the fact that BLAS launches 2N threads on a computer with N physical cores. At least here, the OP's original code runs like this:

julia> @btime test($n,$k,$tmp2)
  1.240 s (3660 allocations: 1.23 GiB)

but:

julia> using LinearAlgebra

julia> BLAS.set_num_threads(4)

julia> @btime test($n,$k,$tmp2)
  857.055 ms (3660 allocations: 1.23 GiB)

My computer has 4 physical cores. Before setting the number of BLAS threads manually, it launches 8 threads. That is the issue @ChrisRackauckas linked above; it is a known issue.

(test2 above also becomes faster with that change).
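
For reference, this is roughly how one can check and adjust the BLAS thread count from the REPL (a minimal sketch; it assumes a hyperthreaded machine where Sys.CPU_THREADS is twice the number of physical cores):

using LinearAlgebra

BLAS.get_num_threads()   # how many threads OpenBLAS is currently using

# Sys.CPU_THREADS reports logical CPUs; on a typical hyperthreaded machine
# halving it gives the physical-core count, which is usually the better choice.
BLAS.set_num_threads(Sys.CPU_THREADS ÷ 2)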

2 Likes

Not exactly.
Python being 30% faster is the comparison between matrix multiplication using numpy and Julia using the standard * operator (test). As I don't know enough Python and numpy, my guess is that numpy is already quite optimized for matrix multiplication, presumably already using BLAS(?), while Julia isn't without using LinearAlgebra, so we would be comparing apples with oranges.

This is not related to the number of BLAS threads.

Introducing LinearAlgebra (test2) already made Julia faster. The effect of the number of threads is not that strong, or rather, it is more complex:

julia> BLAS.get_num_threads()
8

julia> @btime test2($n,$k,$tmp2)
  665.953 ms (120 allocations: 41.20 MiB)
julia> BLAS.set_num_threads(4)

julia> @btime test2($n,$k,$tmp2)
  760.411 ms (120 allocations: 41.20 MiB)
julia> BLAS.set_num_threads(6)

julia> @btime test2($n,$k,$tmp2)
  658.991 ms (120 allocations: 41.20 MiB)
julia> BLAS.set_num_threads(24)

julia> @btime test2($n,$k,$tmp2)
  704.523 ms (120 allocations: 41.20 MiB)
julia> BLAS.set_num_threads(1)

julia> @btime test2($n,$k,$tmp2)
  1.008 s (120 allocations: 41.20 MiB)

Even with a single thread, Julia's test2 (1.008 s) is nearly as fast as Python/numpy (933 ms).
Using 6 threads seems to be the optimum (658.991 ms).
Going beyond 8 threads again decreases performance.
(This is problem- and CPU/system-specific, and IMPORTANT: test2 is slightly tweaked to avoid unneeded allocations, which is an important point when using Julia. I don't know whether a similar tweak is needed or possible for the Python test function, but test and test2 aren't equivalent anymore!)

The thread issue is interesting, but not so significant for this comparison IMO. I still suspect something special about macOS, because the OP said that the proposals we made still result in better Python timings:

No one except the OP has checked on a macOS system, or at least no one has told us.

2 Likes

Julia quite happily uses BLAS for matrix multiplication even without using LinearAlgebra. Why wouldn't it?
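
A quick, minimal way to see this for yourself (the exact file and line reported will vary by Julia version):

using InteractiveUtils   # provides @which; already loaded in the REPL

# `*` on two dense Float64 matrices dispatches to a method defined in
# LinearAlgebra (which ultimately calls BLAS gemm), even without an
# explicit `using LinearAlgebra`:
@which rand(3, 3) * rand(3, 3)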

1 Like

OK, but then why is m1*m2 so much slower than mul!(buf, m1, m2)?

Actually the standard matrix multiplication in Julia is a call to BLAS gemm!, and probably Python's is the same (I don't know how to check). test2 uses mul! in place, while in test a new matrix is allocated at every multiplication when one calls t*tmp1. That is probably the major difference between the two.
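
To illustrate the allocation difference with a minimal sketch (absolute times of course depend on the machine and the BLAS configuration):

using BenchmarkTools, LinearAlgebra

A = rand(300, 300)
B = rand(300, 300)
C = similar(A)

@btime $A * $B;           # allocates a fresh 300×300 result on every call
@btime mul!($C, $A, $B);  # writes into the preallocated buffer C, no allocation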

One needs to load LinearAlgebra to set the number of BLAS threads, even if BLAS is being used without explicitly loading the package.

The effect of the number of BLAS threads seems consistent to me:

julia> using BenchmarkTools

julia> @btime test($n,$k,$tmp2)
  938.132 ms (3660 allocations: 1.23 GiB)

julia> @btime test($n,$k,$tmp2)
  1.192 s (3660 allocations: 1.23 GiB)

julia> @btime test($n,$k,$tmp2)
  1.195 s (3660 allocations: 1.23 GiB)

julia> @btime test($n,$k,$tmp2)
  1.206 s (3660 allocations: 1.23 GiB)

julia> using LinearAlgebra

julia> @btime test($n,$k,$tmp2)
  1.175 s (3660 allocations: 1.23 GiB)

julia> BLAS.set_num_threads(4) # this is actually reducing from 8 to 4

julia> @btime test($n,$k,$tmp2)
  820.822 ms (3660 allocations: 1.23 GiB)

julia> @btime test($n,$k,$tmp2)
  844.377 ms (3660 allocations: 1.23 GiB)


2 Likes

Thanks, that explanation calms my rising confusion down again.
But it doesn't work out for me:

julia> @btime test($n,$k,$tmp2)
  1.369 s (3660 allocations: 1.23 GiB)

julia> using LinearAlgebra

julia> BLAS.set_num_threads(4)

julia> @btime test($n,$k,$tmp2)
  1.295 s (3660 allocations: 1.23 GiB)

julia> BLAS.set_num_threads(12)

julia> @btime test($n,$k,$tmp2)
  1.428 s (3660 allocations: 1.23 GiB)

julia> BLAS.set_num_threads(6)

julia> @btime test($n,$k,$tmp2)
  1.312 s (3660 allocations: 1.23 GiB)

on an AMD Ryzen 9 3900X 12-core processor.

Always slower than Python (933 ms), where test is the OP's original Julia version.

Well, I think we would need to figure out what exactly Python is calling there. Here (Intel i7 8th gen, 4 physical cores) the Julia code is slightly faster with 4 threads:


julia> BLAS.set_num_threads(8)

julia> @btime test($n,$k,$tmp2)
  1.186 s (3660 allocations: 1.23 GiB)

julia> BLAS.set_num_threads(4)

julia> @btime test($n,$k,$tmp2)
  857.846 ms (3660 allocations: 1.23 GiB)

In [17]: %timeit test(n, k, tmp2)
1.02 s ± 16.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

which is what I would expect, given that only a small speedup is possible in the parts of the code that are not the matrix multiplication itself.

Now I see that apparently my numpy is using MKL (Julia is using OpenBLAS). MKL is usually faster. Maybe that makes a greater difference on your system than on mine; I think these performance numbers are quite dependent on the architecture.
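
If someone wants to check from the Julia side which BLAS is in use, a minimal check (on Julia 1.7+ BLAS.get_config() gives more detail):

using LinearAlgebra

BLAS.vendor()   # reports the compiled-in BLAS on Julia 1.6, e.g. :openblas64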

2 Likes

Why are t = rand(n, n); tmp1 = similar(t) inside the outer loop? Wouldn’t this work:

function test3!(tmp2::Array{Float64,2}, n::Integer, k::Integer)
    t = similar(tmp2, n, n)
    tmp1 = similar(t)
    for i in 0:k-1
        rand!(t)
        for j in 0:k-1
            rand!(tmp1)
            @views @inbounds mul!(tmp2[i*n+1:(i+1)*n, j*n+1:(j+1)*n], t, tmp1)
        end
    end
end

Still, this is all down to the matrix muls.

Thanks everyone for such an interest in my problem!

I installed Julia 1.6 to be up to date. Now I'm using OpenBLAS, since I was not able to install MKL (building MKL fails for some reason). But anyway, that made no difference in 1.4.

The original Julia code is now 40% slower than the Python one.

The function

function test(n::Int64,k::Int64,tmp2::Array{Float64,2})
    buf=zeros(n,n)
    for j in 1:k
        t = rand(n, n) 
        for i in 1:k
            tmp1 = rand(n, n)
            tmp2[(i-1)*n+1:i*n,(j-1)*n+1:j*n] = mul!(buf, t, tmp1)
        end
    end
end

is as fast as the Python version.

And the function

using Random
function test2(n::Int64, k::Int64, tmp2::Array{Float64,2})
    for i in 1:k
        t = rand(n, n); tmp1 = similar(t)
        for j in 1:k
            rand!(tmp1)
            @views @inbounds mul!(tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n], t, tmp1)
        end
    end
end

is twice as fast (interchanging variables i and j makes no difference).
The gain here comes from @views @inbounds.
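
For context, a minimal illustration of what @views changes here (the array names are just for illustration):

A = zeros(10, 10)

B1 = A[1:5, 1:5]          # plain indexing copies the block into a new array
B2 = @views A[1:5, 1:5]   # @views yields a SubArray that aliases A, no copy

B2 .= 1.0                 # writes through to A
A[1, 1]                   # 1.0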

I don't understand why Julia needs such complex code to efficiently multiply two matrices and write the result into a third matrix. This makes no sense to me and makes for very unpleasant code to read.

I'd say you are not comparing the same things. Try to use MKL, and then you will be able to do fair comparisons.

As I already mentioned, with Julia 1.4 and MKL the results are the same; I doubt that Julia 1.6 + MKL would make a difference.

A further remark. With

using Random
function test2(n::Int64,k::Int64,tmp2::Array{Float64,2})
    t = zeros(n,n)
    tmp1 = zeros(n,n)
    for i in 1:k
        rand!(t) 
        for j in 1:k
            rand!(tmp1)
            tmp2[(i-1)*n+1:i*n,(j-1)*n+1:j*n] = t*tmp1
        end
    end
end;

Julia is 15% slower than Python, a significant gain from the 40% when it had to allocate memory for all the matrices inside the loops, but still slower than Python, which does allocate inside the loops.

If you want to time BLAS library calls, I think it might be less confusing to time just one multiplication, and then track down the relevant variables, which are, as people have mentioned, which BLAS library you are timing and how many threads it is using. (And what computer you use, of course.) Sort that out first, but don't pretend you are timing the host language.
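
For example, something along these lines isolates the BLAS call itself (a minimal sketch; pick thread counts that match your machine):

using BenchmarkTools, LinearAlgebra

A = rand(300, 300); B = rand(300, 300); C = similar(A)

BLAS.set_num_threads(1)
@btime mul!($C, $A, $B);   # single-threaded gemm

BLAS.set_num_threads(4)    # e.g. the number of physical cores
@btime mul!($C, $A, $B);   # multi-threaded gemm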

If you have some more complicated function of interest, then there are other factors. There are cases where Julia’s allocation and garbage collection are slower, and many of the improvements posted are working around this. For a loop containing many matrix multiplications, you would also probably want to set BLAS to one thread, and have Threads.@threads on the outermost loop. But it’s hard to know what to fix here, as the entire test function seems like a baroque way of generating normally distributed numbers. Does writing @. tmp = n/4 + $sqrt(n/20) * randn() count as a solution? (It should be pretty quick!)
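
A minimal sketch of that single-BLAS-thread-plus-threaded-outer-loop idea, assuming Julia was started with several threads (e.g. julia -t 4); the function name and exact structure are just illustrative:

using LinearAlgebra, Random

function test_threaded!(tmp2::Matrix{Float64}, n::Integer, k::Integer)
    BLAS.set_num_threads(1)    # let Julia's own threads provide the parallelism
    Threads.@threads for i in 1:k
        t = rand(n, n)         # per-iteration buffers, so threads don't share them
        tmp1 = similar(t)
        for j in 1:k
            rand!(tmp1)
            @views mul!(tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n], t, tmp1)
        end
    end
    return tmp2
end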

2 Likes

This is not really the point. The function doesn't do anything useful; it is just a test to see how Julia handles matrix operations inside loops. Such things are common in numerical linear algebra; this is not something esoteric.
In my opinion, I should not have to think about how many threads BLAS is using, as I didn't have to in the Python/numpy version.