This is actually slightly (2%) slower.
julia> using LinearAlgebra, Random, BenchmarkTools
julia> function test(rng, n::Int64, k::Int64, tmp2::Matrix{Float64})
           t = Matrix{Float64}(undef, n, n)
           tmp1 = Matrix{Float64}(undef, n, n)
           for j in 1:k
               t .= rand.(rng)
               for i in 1:k
                   tmp1 .= rand.(rng)
                   @inbounds mul!(@view(tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n]), t, tmp1)
               end
           end
       end
test (generic function with 2 methods)
julia> @btime test(rng, n, k, tmp2) setup=(rng = Random.default_rng(); n = 300; k = 30; tmp2 = Matrix{Float64}(undef, n * k, n * k));
949.974 ms (4 allocations: 1.37 MiB)
This is strange. Your original function gives me:
julia> @btime test($n,$k,$tmp2)
1.137 s (3660 allocations: 1.23 GiB)
Similar to your original timing.
Yeah, but I could also have done the same in python/cython.
As pdeffebach wrote, it is perplexing why there is such a difference in performance between vanilla versions of python and julia.
@inbounds doesn’t change much; it’s mul! that accounts for the larger step.
Different BLAS library? The bottleneck seems to be the matrix multiplication.
@giordano’s version gives me:
julia> @btime test(rng, n, k, tmp2) setup=(rng = Random.default_rng(); n = 300; k = 30; tmp2 = Matrix{Float64}(undef, n * k, n * k));
850.540 ms (4 allocations: 1.37 MiB)
That’s what I think too.
That was my idea as well.
But I tried both openblas and mkl, and there is no difference.
As I reached my limit of replies as a newbie, I can’t reply any more for the next 22 hours.
In that example of code, with .= you can avoid many allocations, because the matrix values are modified in place rather than a whole new matrix being allocated.
function test2(n)
    k = 100
    t = rand(n, n)
    tmp1 = rand(n, n)
    for i in 1:k
        tmp1 .= t .+ t
    end
    for j in 1:k
        tmp1 .= tmp1 .* t
    end
end
julia> @btime test(100)
7.308 ms (404 allocations: 15.43 MiB)
julia> @btime test2(100)
990.166 μs (4 allocations: 156.41 KiB)
I cannot check your original code
test(n::Int64,k::Int64,tmp2::Array{Float64,2})
But maybe that strategy could be useful for you.
In the flood of messages I had missed this one, but it is very likely the explanation of the difference: we were all doing matrix multiplication in Julia, and element-wise multiplication in Python.
The Python code was adjusted to do this too, using t@tmp1.
Well, then I don’t understand the claim that Python is faster than Julia; using the original code (but with matrix multiplication for Python) I get
julia> @btime test(n, k, tmp2) setup=(n = 300; k = 30; tmp2 = Matrix{Float64}(undef, n * k, n * k));
1.258 s (3660 allocations: 1.23 GiB)
In [2]: %%timeit
...: test(n,k,tmp2)
...:
...:
1.98 s ± 56.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
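For readers following the `*` vs `@` point: in NumPy the two operators compute different things, which is what made the early comparison apples-to-oranges. A tiny illustration (my own example, not from the thread):

```python
import numpy as np

# `*` is element-wise (Hadamard) multiplication in NumPy,
# while `@` is true matrix multiplication -- the operation
# that the Julia code's t*tmp1 performs.
t = np.array([[1.0, 2.0],
              [3.0, 4.0]])
eye = np.eye(2)

elementwise = t * eye   # zeroes the off-diagonal entries of t
matmul = t @ eye        # returns t unchanged
```

Multiplying element-wise by the identity keeps only the diagonal of t, whereas the matrix product with the identity is t itself, so the two costs (and results) are not comparable.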
I think you are right:
function test2(n::Int64, k::Int64, tmp2::Array{Float64,2})
    for i in 1:k
        t = rand(n, n); tmp1 = similar(t)
        for j in 1:k
            rand!(tmp1)
            tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n] = t .* tmp1
        end
    end
end;
julia> @time test(n, k, tmp2)
1.625298 seconds (3.66 k allocations: 1.227 GiB, 0.60% gc time)
julia> @time test2(n,k,tmp2)
0.325837 seconds (1.92 k allocations: 659.253 MiB, 0.62% gc time)
I have only changed the matrix multiplication to broadcast multiplication and used rand! to avoid so many allocations.
@Semi, that version is faster than the Python version, isn’t it?
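If one wanted the rand!-style in-place fill on the Python side as well, NumPy’s Generator.random accepts an out= argument that overwrites an existing float64 array. A small sketch (variable names are mine):

```python
import numpy as np

# Fill a preallocated array in place, like rand!(tmp1) in Julia,
# instead of allocating a new random matrix each iteration.
rng = np.random.default_rng(0)
n = 4
tmp1 = np.empty((n, n))
rng.random(out=tmp1)      # overwrites tmp1; no new allocation
t = rng.random((n, n))
result = t * tmp1         # element-wise, like t .* tmp1
```

Note that out= requires the array dtype to be float64 (or float32 with dtype=np.float32); with a matching buffer the fill happens entirely in place.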
On the OP’s side it is a MacBook Pro running Julia 1.4, which also seems to make a difference here.
I can only check the Julia code, and for it, it went down from
julia> @btime test($n,$k,$tmp2)
1.137 s (3660 allocations: 1.23 GiB)
in the original Julia code to
@btime test($n,$k,$tmp2)
809.509 ms (1862 allocations: 639.34 MiB)
using mul! and @inbounds. Running in Julia 1.5.3 is significantly slower, but the performance gain from mul! is similar.
Please read the complete thread.
We are comparing matrix multiplication:
Julia: t*tmp1
Python: t@tmp1
which OP changed during the discussion.
Maybe OP could edit his original post?
Yes, you are right. I had only read the first message, and I did not notice the change in the multiplication. I think your proposal (using mul!) is the best option:
function test2(n::Int64, k::Int64, tmp2::Array{Float64,2})
    for i in 1:k
        t = rand(n, n); tmp1 = similar(t)
        for j in 1:k
            rand!(tmp1)
            @views @inbounds mul!(tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n], t, tmp1)
        end
    end
end
Comparing both versions in my slow laptop (with Julia 1.6):
# Original code
julia> @time test(n,k,tmp2)
1.806398 seconds (3.66 k allocations: 1.227 GiB, 1.27% gc time)
# New version
julia> @time test2(n,k,tmp2)
1.453730 seconds (120 allocations: 41.203 MiB, 0.11% gc time)
It takes 33% less time than the original (and a lot less memory).
Update: My laptop is clearly slower than I thought; I need another one.
julia> @btime test2(n,k,tmp2)
663.240 ms (120 allocations: 41.20 MiB)
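The mul!-into-a-view trick also has a NumPy counterpart, since np.matmul accepts out= and can, as far as I know, write directly into a block slice of the target array. A reduced sketch with toy sizes (n=3, k=2, not the thread’s n=300, k=30; t and tmp1 are kept fixed here for brevity):

```python
import numpy as np

n, k = 3, 2
tmp2 = np.zeros((n * k, n * k))
t = np.random.rand(n, n)
tmp1 = np.random.rand(n, n)

for i in range(k):
    for j in range(k):
        # write the n-by-n product straight into the (i, j) block of
        # tmp2, avoiding an intermediate allocation (0-based slicing,
        # vs. Julia's 1-based (i-1)*n+1:i*n ranges)
        np.matmul(t, tmp1, out=tmp2[i*n:(i+1)*n, j*n:(j+1)*n])
```

Whether this actually avoids a temporary for a non-contiguous block view may depend on the NumPy version, so treat it as a sketch rather than a guaranteed optimization.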
@dmolina’s version is best for me.
I cannot reproduce this at all:
%timeit test(n, k, tmp2)
3.27 s ± 7.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Original version in jl.
@btime test(n, k, tmp2)
1.648 s (3660 allocations: 1.23 GiB)
Version similar to @dmolina
julia> function test(n::Int64, k::Int64, tmp2::Array{Float64,2})
           tmp1 = Matrix{Float64}(undef, n, n); t = Matrix{Float64}(undef, n, n)
           for j in 1:k
               Random.rand!(t)
               for i in 1:k
                   Random.rand!(tmp1)
                   mul!(@inbounds(@view(tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n])), t, tmp1)
               end
           end
       end;
julia> @btime test(n, k, tmp2)
1.046 s (4 allocations: 1.37 MiB)
All with n=300;k=30
EDIT: No idea how to optimize the python version
I cannot see that difference here either. But I noticed that:
- Both are using multi-threading, but Julia is launching 8 threads while Python is launching 4. My laptop has 4 physical cores. Maybe that explains the differences on some systems?
- The measured times fluctuate by about 30% in both cases. Here I can get something between 1.2 and 0.9 seconds for either of them (using %timeit or @btime).
(I’m using Julia 1.6.)
Codes
import numpy
from numpy import zeros
from numpy.random import rand

def test(n, k, tmp2):
    for i in range(k):
        t = rand(n, n)
        for j in range(k):
            tmp1 = rand(n, n)
            tmp2[i*n:(i+1)*n, j*n:(j+1)*n] = t @ tmp1

n = 300
k = 30
tmp2 = zeros((n*k, n*k))

%timeit test(n, k, tmp2)
# 1.18 s ± 78.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
julia> function test(n::Int64, k::Int64, tmp2::Array{Float64,2})
           for i in 1:k
               t = rand(n, n)
               for j in 1:k
                   tmp1 = rand(n, n)
                   tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n] = t * tmp1
               end
           end
       end;
julia> n = 300
300
julia> k = 30
30
julia> tmp2=zeros((n*k,n*k));
julia> using BenchmarkTools
julia> @btime test($n,$k,$tmp2)
1.016 s (3660 allocations: 1.23 GiB)
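Regarding the thread-count observation above: before trusting a cross-language BLAS benchmark, it is worth checking what the threading environment looks like on each side. A minimal Python check (the environment variable names are the usual OpenBLAS/OpenMP/MKL ones; whether each applies depends on which BLAS the NumPy build links):

```python
import os

# Logical CPUs visible to the process; BLAS libraries often default
# to this many threads (or the physical-core count) when unset.
print("logical CPUs:", os.cpu_count())

# These variables must be set BEFORE numpy is imported to take effect.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    print(var, "=", os.environ.get(var, "<unset>"))
```

On the Julia side, the analogous check is `LinearAlgebra.BLAS.get_num_threads()`; pinning both runtimes to the same thread count removes one source of the 8-vs-4 discrepancy noted earlier.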