# What is the cause of this performance difference between Julia and Cython?

Hello all,

I’m fairly new to Julia. I have a simple array function I’m using for comparing the speed of different languages.
In this case Julia and Cython (subset of Python).

The function in Julia is:

```julia
function compute_array_normal(m, n)
    x = zeros(Int32, (m, n))
    for i = 0:m - 1
        for j = 0:n - 1
            x[i+1, j+1] = Int32(i*i + j*j)
        end
    end
    return x
end
```

The multiprocess version in Julia is:

```julia
using Distributed
using SharedArrays

function compute_array_distributed(m, n)
    x = SharedArray{Int32}(m, n)
    @sync @distributed for i = 0:m - 1
        for j = 0:n - 1
            x[i+1, j+1] = Int32(i*i + j*j)
        end
    end
    return x
end
```

The Cython version is:

```cython
# cython: infer_types=True
# distutils: extra_compile_args = -fopenmp
# distutils: extra_link_args = -fopenmp
cimport cython
from cython.parallel import prange
from cython.view cimport array as cvarray

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef cvarray compute_array_cython(int m, int n):
    cdef cvarray x = cvarray(shape=(m, n), itemsize=sizeof(int), format="i")
    cdef int [:, ::1] x_view = x
    cdef int i, j
    for i in prange(m, nogil=True):
        for j in range(n):  # inner prange collapses to a plain range anyway
            x_view[i, j] = i*i + j*j
    return x
```

Running the Julia versions with `--check-bounds=no`:

```julia
# run once first to compile the functions
@everywhere t0 = 100
compute_array_normal(t0, t0)
compute_array_distributed(t0, t0)

@everywhere t = 10000
@time println(compute_array_normal(t, t)[t, t])
@time println(compute_array_distributed(t, t)[t, t])
```

I get:

```
199960002
  0.803119 seconds (180.73 k allocations: 392.197 MiB, 0.84% gc time, 9.57% compilation time)
199960002
  0.415599 seconds (23.74 k allocations: 1.287 MiB, 2.79% compilation time)
```

Running the Cython version, after compiling:

```python
from time import time
from compute_array import compute_array_cython

t = 10000

s = time()
print(compute_array_cython(t, t)[t-1, t-1])
print(time() - s)
```

I get:

```
199960002
0.0892179012298584 seconds
```

Cython being so much faster seems off to me, especially given how fast Julia can be. I’ve gone over everything twice and can’t see the cause, and feel I must have overlooked something, hence this post.

Any advice on how to speed this function up in Julia would be greatly appreciated.
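
As a sanity check on the numbers being compared, the printed value can be verified independently of either implementation: the last element of an m×n array filled with `i*i + j*j` (0-based indices) is `(m-1)² + (n-1)²`. A minimal pure-Python sketch (the helper name is made up, and this is only a correctness check, not a benchmark):

```python
def compute_array_py(m, n):
    """Pure-Python reference for the same fill: x[i][j] = i*i + j*j (0-based)."""
    return [[i * i + j * j for j in range(n)] for i in range(m)]

# Small case: the corner element is (m-1)^2 + (n-1)^2.
x = compute_array_py(100, 100)
print(x[99][99])        # 2 * 99 * 99 = 19602

# For t = 10000 the corner is 9999^2 + 9999^2, matching the 199960002
# printed by both the Julia and Cython versions above.
print(2 * 9999 * 9999)  # 199960002
```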

1 Like

Your function is already faster than you think. It can be improved marginally by adding `@inbounds`, and by not filling the array with zeros first (use `Array{Int32}(undef, (m, n))`). It can be improved more substantially by making `i` the inner loop, so that elements adjacent in memory are accessed close together in time.

At this size I think you want to parallelise with Threads, not Distributed:

```julia
julia> @time println(compute_array_normal(t, t)[t, t])  # with t = 1000
1996002
  0.002596 seconds (11 allocations: 3.815 MiB)  # i.e. 2.6 ms

julia> using BenchmarkTools

julia> @btime compute_array_normal(1000, 1000);
  777.709 μs (2 allocations: 3.81 MiB)

julia> function compute_array(m, n)
           x = Array{Int32}(undef, (m, n))
           Threads.@threads for j = 0:n - 1  # timed below with and without @threads
               for i = 0:m - 1
                   x[i+1, j+1] = Int32(i*i + j*j)
               end
           end
           return x
       end;

julia> @btime compute_array(1000, 1000); # without @threads
  628.875 μs (2 allocations: 3.81 MiB)

julia> @btime compute_array(1000, 1000); # with @threads
  198.875 μs (23 allocations: 3.82 MiB)
```
6 Likes

Loop vectorization helps also quite a bit:

```julia
julia> using BenchmarkTools

julia> n = 10000;

julia> @btime compute_array_normal($n, $n);
  649.357 ms (2 allocations: 381.47 MiB)

julia> @btime compute_array_normal2($n, $n);
  116.450 ms (2 allocations: 381.47 MiB)

julia> @btime compute_array_normal3($n, $n);  # @threads + @avx
  44.573 ms (19 allocations: 381.47 MiB)
```

With:

```julia
function compute_array_normal(m, n)
    x = zeros(Int32, (m, n))
    for i = 0:m - 1
        for j = 0:n - 1
            x[i+1, j+1] = Int32(i*i + j*j)
        end
    end
    return x
end

using LoopVectorization
function compute_array_normal2(m, n)
    x = Matrix{Int32}(undef, m, n)
    for j = 0:n - 1
        @avx for i = 0:m - 1
            x[i+1, j+1] = i*i + j*j
        end
    end
    return x
end

function compute_array_normal3(m, n)
    x = Matrix{Int32}(undef, m, n)
    Threads.@threads for j = 0:n - 1
        @avx for i = 0:m - 1
            x[i+1, j+1] = i*i + j*j
        end
    end
    return x
end
```
2 Likes

FYI, since `@threads` creates a closure and `@inbounds` does not penetrate closures, you need to put `@inbounds` inside `@threads` (as in `@inbounds x[i+1, j+1] = Int32(i*i + j*j)`).

The difference is probably very hard to observe for this kind of code, though. It’s easier to observe with a slightly heavier SIMD’able computation inside. For example:

```julia
julia> function f!(ys, xs)
           Threads.@threads for i in eachindex(ys, xs)
               ys[i] = Base.FastMath.sqrt_fast(xs[i])
           end
       end;

julia> function g!(ys, xs)
           Threads.@threads for i in eachindex(ys, xs)
               @inbounds ys[i] = Base.FastMath.sqrt_fast(xs[i])
           end
       end;

julia> xs = ones(2^25); ys = similar(xs);

julia> @btime f!(ys, xs)
  13.981 ms (41 allocations: 3.64 KiB)

julia> @btime g!(ys, xs)
  5.630 ms (41 allocations: 3.64 KiB)
```
3 Likes

Thanks, I missed that. (And apparently I have made this mistake before: `using Einsum; @macroexpand @vielsum A[i] := B[i,j]` gets it wrong.)

Weirdly your examples show no difference for me, although the one above does:

```julia
julia> @btime f!(ys, xs)
  8.994 ms (21 allocations: 1.86 KiB)

julia> @btime g!(ys, xs)
  9.003 ms (21 allocations: 1.86 KiB)

  199.375 μs (23 allocations: 3.82 MiB)

julia> @btime compute_array_tkf(10^3);
  173.083 μs (22 allocations: 3.82 MiB)
```

Should this be documented somewhere, or is it? `@inbounds` looks like it applies to blocks of code like `@views` does, but the mechanism is different. Or can it be fixed? `@propagate_inbounds` in front of the function being defined does not seem to change anything.

1 Like

Oh, interesting that there’s no difference between `f!` and `g!` on your machine. Maybe it’s a CPU-dependent thing? If I remove `@threads` from `g!` and look at the LLVM IR, I see `call fast <4 x double> @llvm.sqrt.v4f64(<4 x double> %wide.load)`.

Yeah, I just noticed that `@inbounds` docstring does not mention the interaction with closures. I also couldn’t find it in the documentation with a quick search. I agree it makes sense to mention it in the docstring (and maybe also phrase it so that closure may be supported in the future, to let us fix it at some point if somebody comes up with a brilliant idea).

Yeah, it sounds difficult to support all cases, and it’s also not clear what should happen; e.g.,

```julia
xs = [1]
@inbounds begin
    f(i) = xs[i]
    f(1)  # I think it's OK to elide the bounds check?
end
f(2)  # throw? segfault?
```
1 Like

This alone makes a factor of 3 improvement:

```julia
julia> n = 10000;

julia> @btime compute_array_normal($n, $n);
  693.562 ms (2 allocations: 381.47 MiB)

julia> @btime compute_array_normal_ji($n, $n);
  235.468 ms (2 allocations: 381.47 MiB)
```
Code
```julia
function compute_array_normal_ji(m, n)
    x = zeros(Int32, (m, n))
    for j = 0:n - 1
        for i = 0:m - 1
            x[i+1, j+1] = Int32(i*i + j*j)
        end
    end
    return x
end
```
1 Like

Is this time multi-threaded (with 3 threads)? If it is, I think that exchanging the `i` and `j` loops is the major difference (Julia is column-major).
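
The effect of matching the loop order to the memory layout can be seen from the address arithmetic alone: in a column-major array, element (i, j) lives at flat offset i + j*m, so making i the inner loop touches consecutive addresses, while making j the inner loop jumps by m elements at each step. A small illustration in plain Python (the helper name is made up; this models only the offsets, not the actual timing):

```python
m, n = 4, 3

def colmajor_offset(i, j, m):
    # Column-major (Julia/Fortran order): consecutive i values are adjacent in memory.
    return i + j * m

# i as the inner loop: flat offsets advance by 1, a sequential sweep over memory.
inner_i = [colmajor_offset(i, j, m) for j in range(n) for i in range(m)]
print(inner_i)  # [0, 1, 2, ..., 11]

# j as the inner loop: consecutive accesses are m elements apart (strided).
inner_j = [colmajor_offset(0, j, m) for j in range(n)]
print(inner_j)  # [0, 4, 8]
```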

Probably the Cython version is performing some level of loop optimization, which can be achieved with the `@simd` macro or, more aggressively, with the `@avx` macro of the LoopVectorization package. Adding `@simd` makes your serial version about 3.5x faster, and with `@avx` it is 5.5x faster.

Finally, just for fun, I’ve added a version in which you don’t need to specify whether `x` is 32- or 64-bit; that is determined by the types of `n` and `m` on input, which shows how generic code can be written in Julia without any performance penalty.

```julia
julia> include("./smp.jl")
         orig:  653.444 ms (2 allocations: 381.47 MiB)
         swap:  266.822 ms (2 allocations: 381.47 MiB)
swap_threaded:  163.544 ms (18 allocations: 381.47 MiB)
    swap_simd:  198.080 ms (2 allocations: 381.47 MiB)
      swap_lv:  119.363 ms (2 allocations: 381.47 MiB)
  lv_threaded:   45.515 ms (18 allocations: 381.47 MiB)
    generic64:   94.271 ms (18 allocations: 762.94 MiB)
    generic32:   47.493 ms (18 allocations: 381.47 MiB)
```
Code
```julia
# original
function compute_array_normal(m, n)
    x = zeros(Int32, (m, n))
    for i = 0:m - 1
        for j = 0:n - 1
            x[i+1, j+1] = Int32(i*i + j*j)
        end
    end
    return x
end

# swap i and j loops
function compute_array_normal_swap(m, n)
    x = zeros(Int32, (m, n))
    for j = 0:n - 1
        for i = 0:m - 1
            x[i+1, j+1] = Int32(i*i + j*j)
        end
    end
    return x
end

# swap + threads
function compute_array_normal_swap_threaded(m, n)
    x = zeros(Int32, (m, n))
    Threads.@threads for j = 0:n - 1
        for i = 0:m - 1
            x[i+1, j+1] = Int32(i*i + j*j)
        end
    end
    return x
end

# swap + simd
function compute_array_normal_swap_simd(m, n)
    x = zeros(Int32, (m, n))
    @inbounds for j = 0:n - 1
        @simd for i = 0:m - 1
            x[i+1, j+1] = Int32(i*i + j*j)
        end
    end
    return x
end

using LoopVectorization
# swap + @avx
function compute_array_normal_swap_lv(m, n)
    x = Matrix{Int32}(undef, m, n)
    for j = 0:n - 1
        @avx for i = 0:m - 1
            x[i+1, j+1] = i*i + j*j
        end
    end
    return x
end

# swap + @avx + threads
function compute_array_lv_threaded(m, n)
    x = Matrix{Int32}(undef, m, n)
    Threads.@threads for j = 0:n - 1
        @avx for i = 0:m - 1
            x[i+1, j+1] = i*i + j*j
        end
    end
    return x
end

# any type of Int, taken from the input types
function compute_array_generic(m::T, n::T) where T <: Integer
    x = Matrix{T}(undef, m, n)
    Threads.@threads for j = 0:n - 1
        @avx for i = 0:m - 1
            x[i+1, j+1] = i*i + j*j
        end
    end
    return x
end

using BenchmarkTools
n = 10_000
print("         orig: "); @btime compute_array_normal($n, $n);
print("         swap: "); @btime compute_array_normal_swap($n, $n);
print("swap_threaded: "); @btime compute_array_normal_swap_threaded($n, $n);
print("    swap_simd: "); @btime compute_array_normal_swap_simd($n, $n);
print("      swap_lv: "); @btime compute_array_normal_swap_lv($n, $n);
print("  lv_threaded: "); @btime compute_array_lv_threaded($n, $n);
print("    generic64: "); @btime compute_array_generic($n, $n);
print("    generic32: "); @btime compute_array_generic(Int32($n), Int32($n));
```