What is the cause of this performance difference between Julia and Cython?

Your function is already faster than you think. You can improve it marginally by adding @inbounds and by allocating the array uninitialised (with undef) instead of filling it with zeros first. A more substantial gain comes from making i the inner loop: Julia arrays are stored column-major, so elements that are adjacent in memory are then visited one right after another.
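For illustration, here is a minimal sketch of a serial version with those three changes applied. The name compute_array_serial is hypothetical, and I'm assuming the same i*i + j*j formula as in the threaded version further down:

```julia
# Hypothetical name: a serial sketch with the three changes applied.
function compute_array_serial(m, n=m)
    x = Array{Int32}(undef, m, n)       # uninitialised, so no zero-filling pass
    @inbounds for j in 0:n-1            # columns in the outer loop
        for i in 0:m-1                  # rows in the inner loop: adjacent in memory
            x[i+1, j+1] = Int32(i*i + j*j)
        end
    end
    return x
end

x = compute_array_serial(1000)
println(x[1000, 1000])   # 999^2 + 999^2 = 1996002
```

Swapping the two loops gives the same result but strides through memory m elements at a time, which is usually where the larger slowdown comes from.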

At this size I think you want to parallelise with Threads, not Distributed:

julia> @time println(compute_array_normal(t, t)[t, t])
1996002
  0.002596 seconds (11 allocations: 3.815 MiB)  # i.e. 2.6 ms

julia> using BenchmarkTools

julia> @btime compute_array_normal(1000, 1000);
  777.709 μs (2 allocations: 3.81 MiB)

julia> function compute_array_threads(m, n=m)
           x = Array{Int32}(undef, (m, n))
           @inbounds Threads.@threads for j = 0:n - 1
               for i = 0:m - 1
                   x[i+1, j+1] = Int32(i*i + j*j)
               end
           end
           return x
       end;

julia> @btime compute_array(1000, 1000); # same loops, but without @threads
  628.875 μs (2 allocations: 3.81 MiB)

julia> @btime compute_array_threads(1000, 1000);
  198.875 μs (23 allocations: 3.82 MiB)
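
Note that @threads only gives a speed-up if Julia was started with more than one thread, for example with `julia --threads 4` or by setting the JULIA_NUM_THREADS environment variable. A small sketch to check the thread count and confirm that the threaded fill matches a plain reference (fill_threads is a hypothetical name that mirrors the loop above):

```julia
using Base.Threads: nthreads, @threads

# Hypothetical helper mirroring the threaded loop above.
function fill_threads(m, n=m)
    x = Array{Int32}(undef, m, n)
    @threads for j in 0:n-1             # columns are split across threads
        @inbounds for i in 0:m-1
            x[i+1, j+1] = Int32(i*i + j*j)
        end
    end
    return x
end

# Reference result from a plain comprehension (i indexes rows, j columns).
reference = [Int32(i*i + j*j) for i in 0:999, j in 0:999]

println("threads available: ", nthreads())
println("matches reference: ", fill_threads(1000) == reference)
```

Each column is written by exactly one thread, so no locking is needed. With a single thread the function still runs correctly, just without the parallel gain.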