What is the cause of this performance difference between Julia and Cython?

Hello all,

I’m fairly new to Julia. I have a simple array function I’m using for comparing the speed of different languages.
In this case, Julia and Cython (a superset of Python that compiles to C).

The function in Julia is:

function compute_array_normal(m, n)
    x = zeros(Int32, (m, n))
    for i = 0:m - 1
        for j = 0:n - 1
            x[i+1, j+1] = Int32(i*i + j*j)
        end
    end
    return x
end

The multiprocess version in Julia is:

using Distributed
using SharedArrays

addprocs(3)

function compute_array_distributed(m, n)
    x = SharedArray{Int32}(m, n)
    @sync @distributed for i = 0:m - 1
        for j = 0:n - 1
            x[i+1, j+1] = Int32(i*i + j*j)
        end
    end
    return x
end

The Cython version is:

# cython: infer_types=True
# distutils: extra_compile_args = -fopenmp
# distutils: extra_link_args = -fopenmp
cimport cython
from cython import nogil
from cython.parallel import prange
from cython.view cimport array as cvarray

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef cvarray compute_array_cython(int m, int n):
    cdef cvarray x = cvarray(shape=(m, n), itemsize=sizeof(int), format="i")
    cdef int [:, ::1] x_view = x
    cdef int i, j
    for i in prange(m, nogil=True):
        for j in prange(n):
            x_view[i, j] = i*i + j*j
    return x

Running the Julia versions with --check-bounds=no:

# run once to compile the functions
@everywhere t0 = 100
compute_array_normal(t0, t0)
compute_array_distributed(t0, t0)

@everywhere t = 10000
@time println(compute_array_normal(t, t)[t, t])
@time println(compute_array_distributed(t, t)[t, t])

I get:

199960002
  0.803119 seconds (180.73 k allocations: 392.197 MiB, 0.84% gc time, 9.57% compilation time)
199960002
  0.415599 seconds (23.74 k allocations: 1.287 MiB, 2.79% compilation time)

Running the Cython version, after compiling:

from time import time
from compute_array import *
t = 10000

s = time()
print(compute_array_cython(t, t)[t-1, t-1])
print(time() - s)

I get:

199960002
0.0892179012298584 seconds

Cython being so much faster seems off to me, especially given how fast Julia can be. I’ve gone over everything twice and can’t see the cause, so I suspect I’ve overlooked something, hence this post.

Any advice on how to speed this function up in Julia would be greatly appreciated :)

1 Like

Your function is already faster than you think. It could be improved marginally by adding @inbounds and by not filling the array with zeros first, and more substantially by making i the inner loop, so that elements adjacent in memory are written close together in time (Julia arrays are column-major).
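For example, a serial sketch with all three changes (compute_array_serial is just an illustrative name):

# Serial sketch: allocate with undef, put i (rows) innermost for
# column-major access, and skip bounds checks.
function compute_array_serial(m, n)
    x = Array{Int32}(undef, m, n)
    @inbounds for j = 0:n-1
        for i = 0:m-1
            x[i+1, j+1] = Int32(i*i + j*j)
        end
    end
    return x
end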

At this size I think you want to parallelise with Threads not Distributed:

julia> @time println(compute_array_normal(t, t)[t, t])
1996002
  0.002596 seconds (11 allocations: 3.815 MiB)  # i.e. 2.6 ms

julia> using BenchmarkTools

julia> @btime compute_array_normal(1000, 1000);
  777.709 μs (2 allocations: 3.81 MiB)

julia> function compute_array_threads(m, n=m)
           x = Array{Int32}(undef, (m, n))
           @inbounds Threads.@threads for j = 0:n - 1
               for i = 0:m - 1
                   x[i+1, j+1] = Int32(i*i + j*j)
               end
           end
           return x
       end;

julia> @btime compute_array(1000, 1000); # without @threads
  628.875 μs (2 allocations: 3.81 MiB)

julia> @btime compute_array_threads(1000, 1000);
  198.875 μs (23 allocations: 3.82 MiB)
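
(Note that threads have to be enabled when Julia starts, e.g. with julia --threads=4 or the JULIA_NUM_THREADS environment variable; Threads.nthreads() shows how many are active.)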
6 Likes

Loop vectorization also helps quite a bit:

julia> using BenchmarkTools

julia> n = 10000;

julia> @btime compute_array_normal($n,$n);
  649.357 ms (2 allocations: 381.47 MiB)

julia> @btime compute_array_normal2($n,$n);
  116.450 ms (2 allocations: 381.47 MiB)

julia> @btime compute_array_normal2_threaded($n,$n); # 3 threads
  44.573 ms (19 allocations: 381.47 MiB)

With:

function compute_array_normal(m, n)
    x = zeros(Int32, (m, n))
    for i = 0:m - 1
        for j = 0:n - 1
            x[i+1, j+1] = Int32(i*i + j*j)
        end
    end
    return x
end

using LoopVectorization
function compute_array_normal2(m, n)
    x = Matrix{Int32}(undef,m,n)
    for j = 0:n - 1
        @avx for i = 0:m - 1
            x[i+1, j+1] = i*i + j*j
        end
    end
    return x
end

function compute_array_normal2_threaded(m, n)
    x = Matrix{Int32}(undef,m,n)
    Threads.@threads for j = 0:n - 1
        @avx for i = 0:m - 1
            x[i+1, j+1] = i*i + j*j
        end
    end
    return x
end


2 Likes

FYI, since @threads creates a closure and @inbounds does not penetrate closures, you need to put @inbounds inside @threads (as in @inbounds x[i+1, j+1] = Int32(i*i + j*j)).
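
Applied to the threaded function above, that would look something like this (a sketch; the name is just illustrative):

# Corrected sketch: @inbounds sits inside the loop body that
# Threads.@threads turns into a closure.
function compute_array_threads2(m, n=m)
    x = Array{Int32}(undef, m, n)
    Threads.@threads for j = 0:n-1
        @inbounds for i = 0:m-1
            x[i+1, j+1] = Int32(i*i + j*j)
        end
    end
    return x
end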

The difference is probably very hard to observe for this kind of code, though. It’s easier to observe with a somewhat heavier SIMD’able computation inside the loop. For example:

julia> function f!(ys ,xs)
           @inbounds Threads.@threads for i in eachindex(xs, ys)
               ys[i] = Base.FastMath.sqrt_fast(xs[i])
           end
       end;

julia> function g!(ys ,xs)
           Threads.@threads for i in eachindex(xs, ys)
               @inbounds ys[i] = Base.FastMath.sqrt_fast(xs[i])
           end
       end;

julia> xs = ones(2^25); ys = similar(xs);

julia> @btime f!(ys ,xs)
  13.981 ms (41 allocations: 3.64 KiB)

julia> @btime g!(ys ,xs)
  5.630 ms (41 allocations: 3.64 KiB)

julia> Threads.nthreads()
8
5 Likes

Thanks, I missed that. (And apparently I have made this mistake before: using Einsum; @macroexpand @vielsum A[i] := B[i,j] gets it wrong.)

Weirdly your examples show no difference for me, although the one above does:

julia> @btime f!(ys ,xs)
  8.994 ms (21 allocations: 1.86 KiB)

julia> @btime g!(ys ,xs)
  9.003 ms (21 allocations: 1.86 KiB)

julia> @btime compute_array_threads(10^3);
  199.375 μs (23 allocations: 3.82 MiB)

julia> @btime compute_array_tkf(10^3);
  173.083 μs (22 allocations: 3.82 MiB)

Should this be documented somewhere, or is it? @inbounds looks like it applies to blocks of code like @views does, but the mechanism is different. Or can it be fixed? @propagate_inbounds in front of the function being defined does not seem to change anything.

1 Like

Oh, interesting that there’s no difference between f! and g! on your machine. Maybe it’s a CPU-dependent thing? If I remove @threads from g! and look at the LLVM IR, I see call fast <4 x double> @llvm.sqrt.v4f64(<4 x double> %wide.load).
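
(For reference, the IR can be inspected with @code_llvm; g_serial! below is just an illustrative name:)

# Inspect the LLVM IR of the non-threaded loop;
# debuginfo=:none keeps the output short.
function g_serial!(ys, xs)
    @inbounds for i in eachindex(xs, ys)
        ys[i] = Base.FastMath.sqrt_fast(xs[i])
    end
end
@code_llvm debuginfo=:none g_serial!(ones(8), ones(8))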

Yeah, I just noticed that the @inbounds docstring does not mention the interaction with closures. I also couldn’t find it in the documentation with a quick search. I agree it makes sense to mention it in the docstring (and maybe to phrase it so that closures may be supported in the future, leaving room to fix this if somebody comes up with a brilliant idea).

Yeah, it sounds difficult to support all cases, and it’s also not clear what should happen; e.g.,

xs = [1]
@inbounds begin
    f(i) = xs[i]
    f(1) # I think it's OK to elide the bounds check?
end
f(2)  # throw? segfault?
1 Like

This alone makes a factor of 3 improvement:

julia> n = 10000;
julia> @btime compute_array_normal($n,$n);
  693.562 ms (2 allocations: 381.47 MiB)

julia> @btime compute_array_normal_ji($n,$n);
  235.468 ms (2 allocations: 381.47 MiB)
Code
function compute_array_normal_ji(m, n)
    x = zeros(Int32, (m, n))
    for j = 0:n - 1
        for i = 0:m - 1
            x[i+1, j+1] = Int32(i*i + j*j)
        end
    end
    return x
end
2 Likes

Is this time multi-threaded (with 3 threads)? If it is, I think that exchanging the i and j loops is the major difference (Julia is column-major).
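
(A quick way to see the column-major layout:)

# Column-major: consecutive row indices i are adjacent in memory,
# so vec(x) matches iterating with i innermost.
x = reshape(1:6, 3, 2)
vec(x) == [x[i, j] for j in 1:2 for i in 1:3]  # true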

Probably the Cython version is performing some level of loop optimization, which can be achieved with the @simd macro or, more aggressively, with the @avx macro of the LoopVectorization package. Adding @simd makes your serial version about 3.5x faster, and with @avx it is 5.5x faster.

Finally, just for fun, I’ve added a version in which you don’t need to specify whether x is 32- or 64-bit; that is determined by the types of m and n on input, which shows how generic code can be written in Julia without any performance penalty.

julia> include("./smp.jl")
         orig:  653.444 ms (2 allocations: 381.47 MiB)
         swap:  266.822 ms (2 allocations: 381.47 MiB)
swap_threaded:  163.544 ms (18 allocations: 381.47 MiB)
    swap_simd:  198.080 ms (2 allocations: 381.47 MiB)
      swap_lv:  119.363 ms (2 allocations: 381.47 MiB)
  lv_threaded:   45.515 ms (18 allocations: 381.47 MiB)
    generic64:   94.271 ms (18 allocations: 762.94 MiB)
    generic32:   47.493 ms (18 allocations: 381.47 MiB)
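
(The Int32 run is about twice as fast as the Int64 one, presumably because it halves the memory traffic and fits twice as many elements per SIMD register.)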

Code
# original
function compute_array_normal(m, n)
    x = zeros(Int32, (m, n))
    for i = 0:m - 1
       for j = 0:n - 1
            x[i+1, j+1] = Int32(i*i + j*j)
       end
    end
    return x
end

# swap i and j loops
function compute_array_normal_swap(m, n)
    x = zeros(Int32, (m, n))
    for j = 0:n - 1
        for i = 0:m - 1
            x[i+1, j+1] = Int32(i*i + j*j)
        end
    end
    return x
end

# swapped and threaded
function compute_array_normal_swap_threads(m, n)
    x = zeros(Int32, (m, n))
    Threads.@threads for j = 0:n - 1
        for i = 0:m - 1
            x[i+1, j+1] = Int32(i*i + j*j)
        end
    end
    return x
end

# add simd operations
function compute_array_normal_swap_simd(m, n)
    x = zeros(Int32, (m, n))
    @inbounds for j = 0:n - 1
        @simd for i = 0:m - 1
            x[i+1, j+1] = Int32(i*i + j*j)
        end
    end
    return x
end

using LoopVectorization
# swapped, with @avx
function compute_array_normal_swap_lv(m, n)
    x = Matrix{Int32}(undef,m,n)
    for j = 0:n - 1
        @avx for i = 0:m - 1
            x[i+1, j+1] = i*i + j*j
        end
    end
    return x
end

# threaded
function compute_array_normal_swap_lv_threaded(m, n)
    x = Matrix{Int32}(undef,m,n)
    Threads.@threads for j = 0:n - 1
        @avx for i = 0:m - 1
            x[i+1, j+1] = i*i + j*j
        end
    end
    return x
end

# any Integer type
function compute_array_normal_swap_lv_threaded_generic(m::T, n::T) where T<:Integer
    x = Matrix{T}(undef,m,n)
    Threads.@threads for j = 0:n - 1
        @avx for i = 0:m - 1
            x[i+1, j+1] = i*i + j*j
        end
    end
    return x
end

using BenchmarkTools
n = 10_000
print("         orig: "); @btime compute_array_normal($n, $n);
print("         swap: "); @btime compute_array_normal_swap($n, $n);
print("swap_threaded: "); @btime compute_array_normal_swap_threads($n, $n);
print("    swap_simd: "); @btime compute_array_normal_swap_simd($n, $n);
print("      swap_lv: "); @btime compute_array_normal_swap_lv($n, $n);
print("  lv_threaded: "); @btime compute_array_normal_swap_lv_threaded($n, $n);
print("    generic64: "); @btime compute_array_normal_swap_lv_threaded_generic($n, $n);
print("    generic32: "); @btime compute_array_normal_swap_lv_threaded_generic(Int32($n), Int32($n));
nothing

11 Likes

There’s a recent benchmark (via an HN discussion) on extending Python projects with Cython to speed up list handling; it compares several Cython builds and puts Julia in the lead:

https://github.com/00sapo/cython_list_test

1 Like

The HN comments claim the Cython benchmark is bad.