Why is this simple function twice as slow as its Python version

I’m in the planning stages of writing a piece of scientific software and I would like to use Julia for it, but I’m surprised how much slower Julia is than Python when tested on this simple code:

function test(n::Int64,k::Int64,tmp2::Array{Float64,2})
    for i in 1:k
        t = rand(n, n) 
        for j in 1:k
            tmp1 = rand(n, n)
            tmp2[(i-1)*n+1:i*n,(j-1)*n+1:j*n] = t*tmp1
        end
    end
end;

The corresponding Python code is

def test(n,k,tmp2):
    for i in range(k):
        t = rand(n, n) 
        for j in range(k):
            tmp1 = rand(n, n)
            tmp2[i*n:(i+1)*n,j*n:(j+1)*n] = t@tmp1

When tested with, for example,

n = 300
k = 30
tmp2=zeros((n*k,n*k))

Julia needs more than a second

@time test(n,k,tmp2)
1.122870 seconds (3.67 k allocations: 1.227 GiB, 15.44% gc time)
while Python is about 30% faster:
%%timeit
test(n,k,tmp2)
853 ms ± 7.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

When testing the same function but with
tmp2[(i-1)*n+1:i*n,(j-1)*n+1:j*n] = t*tmp1
changed to some element-wise operation on the array, Julia is much faster than Python, so I suppose this has to do with matrix operations.

I’m using Julia 1.4 on a MacBook Pro. I tried switching from OpenBLAS to MKL, but this did not improve the speed of Julia.
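For reference, a quick way to check which BLAS Julia is actually linked against (API as of Julia ≤ 1.6; note that the BLAS thread count can matter as much as the vendor):

julia> using LinearAlgebra

julia> BLAS.vendor()            # reports :openblas64, :mkl, etc.
:openblas64

julia> BLAS.set_num_threads(4)  # the BLAS thread count also affects matmul speed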

I would be thankful for any hint about what I’m doing wrong, or whether Julia is indeed this slow when dealing with lots of matrix operations. :slightly_smiling_face:

1 Like

Hey there!

Have you tried running the Julia code more than once? The first time Julia runs a function it compiles a fast version of it, so you might have timed the compilation rather than just the function!
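For example (illustrative REPL session; the exact numbers will differ on your machine):

julia> @time test(n, k, tmp2)        # first call: includes JIT compilation
julia> @time test(n, k, tmp2)        # second call: measures only the run time

julia> using BenchmarkTools          # or use @btime, which runs the function
julia> @btime test($n, $k, $tmp2)    # repeatedly and excludes compilation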

t*tmp1

In Python, it’s an elementwise multiplication, while in Julia it’s matrix-by-matrix multiplication.
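A quick sketch of the difference on a small example (output as printed by a recent Julia):

julia> A = [1.0 2.0; 3.0 4.0];

julia> A * A     # matrix product; NumPy spells this A @ A
2×2 Matrix{Float64}:
  7.0  10.0
 15.0  22.0

julia> A .* A    # elementwise product; NumPy spells this A * A
2×2 Matrix{Float64}:
 1.0   4.0
 9.0  16.0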

9 Likes

Good catch!

But with t@tmp1, Python is still quite a bit faster:
853 ms ± 7.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Yes, I did that.

Your Python code doesn’t run for me. Can you add the necessary imports and prefixes so I can check it by copying and pasting?

Hi, try switching the order of the loops over i and j (NumPy is row-major, while Julia is column-major).
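As a generic illustration of the rule (colsum is just an illustrative name, not your benchmark): in column-major Julia, the first index should vary fastest, so the inner loop should run over rows.

function colsum(A)
    s = 0.0
    for j in axes(A, 2)        # columns in the outer loop
        for i in axes(A, 1)    # rows in the inner loop: consecutive memory
            s += A[i, j]
        end
    end
    return s
end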

5 Likes

Sure,

import numpy
from numpy import zeros
from numpy.random import rand

This did not make any difference.

And please change t*tmp1 to t@tmp1, so that the comparison makes sense.

Stepping back a bit, I don’t think the thing that is slow here is “matrix operations”. You are creating lots and lots of somewhat large random matrices, as evidenced by the allocation numbers. If you just had 2 temporary arrays that you reused, you would see big speedups.
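For instance, a minimal sketch of that idea (the name test_prealloc is just for illustration), reusing preallocated buffers via Random.rand! and LinearAlgebra.mul!:

using LinearAlgebra, Random

function test_prealloc(n::Int, k::Int, tmp2::Matrix{Float64})
    # Allocate the three work arrays once, then overwrite them in place.
    t    = Matrix{Float64}(undef, n, n)
    tmp1 = Matrix{Float64}(undef, n, n)
    buf  = Matrix{Float64}(undef, n, n)
    for i in 1:k
        rand!(t)                  # refill t in place instead of rand(n, n)
        for j in 1:k
            rand!(tmp1)           # refill tmp1 in place
            mul!(buf, t, tmp1)    # buf = t * tmp1 without allocating
            tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n] .= buf   # copy into the block in place
        end
    end
    return tmp2
end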

1 Like

Thanks!

Interestingly, I don’t see the performance gain you mention.


In [8]: %timeit test(n,k,tmp2)
1.26 s ± 39.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

vs.

julia> @btime test($n,$k,$tmp2)
  1.178 s (3660 allocations: 1.23 GiB)

It seems this is not the case.
For example

function test(n)
    k = 100
    t = rand(n, n)
    tmp1 = rand(n, n) 
    for i in 1:k
        tmp1 = t + t;
    end
    for j in 1:k
        tmp1 = tmp1*t;
    end
end

gives me

@btime test(100)
8.380 ms (404 allocations: 15.43 MiB)

while

import numpy
from numpy.random import rand
def test(n):
    k = 100
    t = rand(n, n)
    tmp1 = rand(n, n) 
    for i in range(k):
        tmp1 = t + t;
    for j in range(k):
        tmp1 = tmp1@t;

gives me

%timeit test(100)
2.38 ms ± 40.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That’s very interesting!
What are your specs?
My colleague also tried this on a Windows machine and obtained results similar to mine. I don’t know which version of Julia he used.

Here:

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_PKG_DEVDIR = /home/peterwd/Documents/Development
  JULIA_EDITOR = subl

Can you check:

using LinearAlgebra
function test(n::Int64, k::Int64, tmp2::Array{Float64,2})
    buf = zeros(n, n)
    for i in 1:k
        t = rand(n, n)
        for j in 1:k
            tmp1 = rand(n, n)
            tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n] = mul!(buf, t, tmp1)
        end
    end
end
...

Julia has a different memory layout than Python; you’re doing the for-loops in the unfavourable order for Julia: Performance Tips · The Julia Language

6 Likes

This gives for me:

julia> @btime test($n,$k,$tmp2)
  866.969 ms (1862 allocations: 639.34 MiB)

But of course we have different systems, and I refuse to install Python now :wink:

Switching off bounds checking shaves off a few more milliseconds:

function test(n::Int64, k::Int64, tmp2::Array{Float64,2})
    buf = zeros(n, n)
    @inbounds for i in 1:k
        t = rand(n, n)
        for j in 1:k
            tmp1 = rand(n, n)
            tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n] = mul!(buf, t, tmp1)
        end
    end
end

julia> @btime test($n,$k,$tmp2)
  809.509 ms (1862 allocations: 639.34 MiB)

It’s very easy to find ways to speed up this function in Julia. But it remains an interesting, and unanswered, question why Python is faster with the “vanilla” code.

3 Likes