Python3 + @njit faster than Julia (rs)

Consider the following simple test:

julia> using Chairmarks

julia> function test(f, m)
           size = length(f)
           for i in 1:size
               for j in 1:size
                   dij = m[i,j]
                   f[i] = dij
                   f[j] = -dij 
               end
           end
           return f
       end
test (generic function with 1 method)

julia> f = zeros(1000); m = rand(1000,1000);

julia> @b test($f, $m)
797.525 μs

The same code runs faster in Python, using numba's @njit:

In [9]: import numpy as np

In [10]: from numba import njit, prange

In [11]: @njit(parallel=False, fastmath=False)
    ...: def test(f, m):
    ...:     size = len(f)
    ...:     for i in prange(size):
    ...:         for j in prange(size):
    ...:             dij = m[i,j]
    ...:             f[i] = dij
    ...:             f[j] = -dij
    ...:     return f
    ...: 

In [13]: f = np.zeros(1000)

In [14]: m = np.random.random((1000,1000))

In [15]: %timeit test(f,m)
455 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

I know how to rewrite the code in Julia to use @turbo and make it faster in this case; that is not the point. Is there a good reason for the Julia compiler not to do the same job as the @njit compiler here? Or is there another reason for the difference?

(Adding @inbounds alone does not make a difference there.)
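For reference, this is a sketch of what "adding @inbounds" means here — the same loop order as the original, with bounds checking disabled on the array accesses. As noted above, this alone does not close the gap, since the bottleneck is the memory-access pattern, not the bounds checks:

```julia
# Hypothetical variant of the original test, with bounds checks removed.
# Only safe because i and j are guaranteed to be valid indices of f and m.
function test_inbounds(f, m)
    n = length(f)
    @inbounds for i in 1:n
        for j in 1:n
            dij = m[i, j]
            f[i] = dij
            f[j] = -dij
        end
    end
    return f
end
```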

Reversing the loops gave me a significant performance speed-up in Julia:

# your original function
julia> @btime test($f, $m);
  1.945 ms (0 allocations: 0 bytes)


julia> function test2(f, m)
                 size = length(f)
                 for j in 1:size
                     for i in 1:size
                         dij = m[i,j]
                         f[i] = dij
                         f[j] = -dij 
                     end
                 end
                 return f
             end
test2 (generic function with 1 method)

julia> @btime test2($f, $m);
  799.093 μs (0 allocations: 0 bytes)

Oh, yes, of course… dumb. Thanks.

I am wondering if Julia is now on par with numba?

Yes, it is. But my actual code is different from that, and I may update the post if I don't find the reason for the difference now.

Surprisingly, even with `julia -O3`, the loops in test(f, m) are apparently not reordered.

Loop re-ordering is a very hard compiler optimization, and LLVM currently does not handle loop transformations like this well.
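If you cannot (or do not want to) swap the loops by hand, one workaround is to do the reordering in the data instead of the code: transpose the matrix once up front, so the inner j loop walks down a column of the transposed matrix, which is contiguous in Julia's column-major layout. A minimal sketch (`test_transposed` is a hypothetical name; it reproduces the original function's writes, with `mt[j, i]` holding what `m[i, j]` held, at the cost of one O(n²) `permutedims` up front):

```julia
# Same semantics as the original test(f, m), but expects mt = permutedims(m),
# so the inner loop over j varies the *first* index of mt: contiguous access.
function test_transposed(f, mt)
    n = length(f)
    for i in 1:n
        for j in 1:n
            dij = mt[j, i]   # equals m[i, j] when mt = permutedims(m)
            f[i] = dij
            f[j] = -dij
        end
    end
    return f
end

# usage sketch:
# mt = permutedims(m)
# test_transposed(f, mt)
```

Whether this pays off depends on how often the loop runs relative to the one-time transpose cost.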

Isn’t this the performance tip Access arrays in memory order, along columns? You’re looping over m[i,j] in row-major order, which is better for languages like C and Python (NumPy's default layout), while Julia and Fortran use column-major order. Running the exact same code in Julia/Fortran and in C/Python is pretty much guaranteed to favour one set of languages over the other whenever this pattern appears.

Yes, of course. I’m simplifying other code where I have the performance issue, and I ended up with this MWE. But it does not capture the original problem, so I’ll have to dig further.

Isn’t this what LoopModels is supposed to do? That said, since Chris has moved to Modular I have a nagging suspicion that LoopModels probably won’t see much more development :pensive_face:.

Oh no! Damn the Python universe, it just drowns out everything.