Python 3 + @njit faster than Julia

Consider the following simple test:

julia> using Chairmarks

julia> function test(f, m)
           size = length(f)
           for i in 1:size
               for j in 1:size
                   dij = m[i,j]
                   f[i] = dij
                   f[j] = -dij 
               end
           end
           return f
       end
test (generic function with 1 method)

julia> f = zeros(1000); m = rand(1000,1000);

julia> @b test($f, $m)
797.525 μs

The same code runs faster in Python, using numba's @njit:

In [9]: import numpy as np

In [10]: from numba import njit, prange

In [11]: @njit(parallel=False, fastmath=False)
    ...: def test(f, m):
    ...:     size = len(f)
    ...:     for i in prange(size):
    ...:         for j in prange(size):
    ...:             dij = m[i,j]
    ...:             f[i] = dij
    ...:             f[j] = -dij
    ...:     return f
    ...: 

In [13]: f = np.zeros(1000)

In [14]: m = np.random.random((1000,1000))

In [15]: %timeit test(f,m)
455 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

I know how to rewrite the code in Julia to use @turbo and make it faster in this case; that is not the point. Is there a good reason for the Julia compiler not doing the same job as the @njit compiler here? Or is there some other reason for the difference?

(Adding @inbounds alone does not make a difference there.)
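For reference, the @turbo rewrite I have in mind is roughly the following sketch (it needs LoopVectorization.jl, and @turbo assumes iterations are independent and may reorder them, so the overlapping writes to f need not resolve in the same order as in the plain loop):

using LoopVectorization

# Sketch of the @turbo rewrite mentioned above. @turbo is free to reorder
# iterations, so which store to f "wins" can differ from the serial loop.
function test_turbo(f, m)
    n = length(f)
    @turbo for j in 1:n, i in 1:n
        dij = m[i, j]
        f[i] = dij
        f[j] = -dij
    end
    return f
end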

1 Like

Reversing the loops gave me a significant performance speed-up in Julia:

# your original function
julia> @btime test($f, $m);
  1.945 ms (0 allocations: 0 bytes)


julia> function test2(f, m)
                 size = length(f)
                 for j in 1:size
                     for i in 1:size
                         dij = m[i,j]
                         f[i] = dij
                         f[j] = -dij 
                     end
                 end
                 return f
             end
test2 (generic function with 1 method)

julia> @btime test2($f, $m);
  799.093 μs (0 allocations: 0 bytes)
11 Likes

Oh, yes, of course… dumb. Thanks.

1 Like

I am wondering if Julia is now on par with numba?

Yes, it is. But my actual code is different from that, and I may update the post if I don't find the reason for the difference now.

1 Like

Surprisingly, even with “julia -O3”, the loops in test(f, m) are apparently not reordered.
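One way to check what the optimizer actually did is to dump the LLVM IR for the concrete argument types (assuming f and m as defined above):

# Print the optimized LLVM IR for test(::Vector{Float64}, ::Matrix{Float64});
# debuginfo=:none just makes the output easier to scan for the loop structure.
@code_llvm debuginfo=:none test(f, m)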

Loop re-ordering is a very hard compiler optimization. LLVM currently does not handle loop optimizations like this very well.

3 Likes

Isn’t this the performance tip Access arrays in memory order, along columns? You’re looping over m[i,j] in row-major order, which is better for languages like C and Python (NumPy), while column-major is what Fortran and Julia use. Having the exact same code for Julia/Fortran and C/Python is pretty much guaranteed to favour one set of languages over the other when this access pattern is present.
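If the algorithm really did need the i-outer traversal order, one possible workaround (just a sketch, with a hypothetical test_t; only worthwhile if the cost of permutedims is amortized over many calls) is to materialize the transpose once so the inner loop still walks down a column:

# Same i-outer loop order as the original, but the inner loop now runs down
# a column of mt, since mt[j, i] == m[i, j] and Julia arrays are column-major.
function test_t(f, m)
    mt = permutedims(m)
    n = length(f)
    for i in 1:n
        for j in 1:n
            dij = mt[j, i]
            f[i] = dij
            f[j] = -dij
        end
    end
    return f
end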

7 Likes

Yes, of course. I’m simplifying other code where I have the performance issue, and I ended up with this MWE. But it does not capture the original problem, which I’ll have to dig into further.

1 Like

Isn’t this what LoopModels is supposed to do? That said, since Chris has moved to Modular, I have a nagging suspicion that LoopModels probably won’t see much more development :pensive_face:.

2 Likes

Oh no! Damn the Python universe, it just drowns out everything.

4 Likes