(This is a follow-up to Julia vs Numba; I have created a new Discourse account.)
I came across this article on the performance of Julia vs Python+Numba, and this section interested me. It claims parallel Julia code is considerably slower than Numba code. I didn’t want to take that on trust, so I ran my own benchmark (8 threads, i5-11320H @ 3.2GHz):
# julia -t8 loop_julia.jl
using BenchmarkTools

function loop_julia(a::Vector{Float32}, b::Vector{Float32})
    x::Vector{Float32} = zeros(Float32, length(a))
    Threads.@threads for i in 1:length(a)
        if a[i] < b[i]
            x[i] = 0.
        else
            x[i] = 1.
        end
    end
    return x
end

function main()
    a = randn(Float32, 1_000_000)
    b = randn(Float32, 1_000_000)
    @btime loop_julia($a, $b)
end

main()
The result is 1.44 ms (the minimum time, which is what `@btime` reports).
# python loop_numba.py
import timeit

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def loop_numba(a, b):
    x = np.zeros(a.shape, dtype=np.float32)
    for i in prange(a.shape[0]):
        if a[i] < b[i]:
            x[i] = 0.
        else:
            x[i] = 1.
    return x

def main():
    setup = "a = np.random.randn(1_000_000).astype(np.float32); b = np.random.randn(1_000_000).astype(np.float32)"
    t = timeit.timeit("loop_numba(a, b)", setup, globals=globals(), number=10000)
    print(t / 10000 * 1000)

main()
The result is 0.83 ms average.
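The Python figure is a mean over 10,000 calls, and the first of those calls also pays for Numba's JIT compilation. To get a minimum that is easier to compare with BenchmarkTools output, something along these lines might help. It is only a sketch: `best_and_mean_ms` and `stand_in` are hypothetical names, not part of the benchmark above, and `stand_in` is a placeholder workload so the snippet runs without Numba installed.

```python
import timeit


def best_and_mean_ms(fn, *args, repeat=5, number=100):
    """Return (best, mean) time per call in milliseconds."""
    fn(*args)  # warm-up call, so one-time JIT compilation is not timed
    timer = timeit.Timer(lambda: fn(*args))
    totals = timer.repeat(repeat=repeat, number=number)
    per_call_ms = [t / number * 1000 for t in totals]
    return min(per_call_ms), sum(per_call_ms) / len(per_call_ms)


def stand_in(n):
    """Placeholder workload (assumption); swap in loop_numba(a, b)."""
    return sum(range(n))


best, mean = best_and_mean_ms(stand_in, 10_000)
print(f"best {best:.4f} ms, mean {mean:.4f} ms")
```

With the real benchmark, the call would be `best_and_mean_ms(loop_numba, a, b)`, with `a` and `b` created once beforehand.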
The claims in the article are not wrong: Julia is indeed behind in performance, at least for this parallel loop. But where does the difference come from? This is a staggering gap of nearly 2x (1.44 ms vs 0.83 ms, about 1.7x), which I didn’t expect before running the benchmark.
Granted, optimized Numba code is a bit like C code, as I recall someone on Discourse saying. But since both tools use LLVM and multi-threading, a gap this large should be due to something else.
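As a rough sanity check on the two numbers, here is the arithmetic for the speed ratio and the effective memory throughput each timing implies. This is a back-of-envelope sketch under a stated assumption: each call reads `a` and `b` once and writes `x` once, ignoring the extra zero-fill pass and the allocation itself, which both versions perform.

```python
n = 1_000_000
bytes_per_elem = 4   # Float32 / np.float32
arrays_touched = 3   # assumption: read a, read b, write x (zero-fill ignored)
traffic_bytes = n * bytes_per_elem * arrays_touched


def effective_gb_per_s(ms):
    """Implied throughput in GB/s for one call taking `ms` milliseconds."""
    return traffic_bytes / (ms / 1000) / 1e9


timings_ms = {"Julia": 1.44, "Numba": 0.83}  # figures reported in this post
for name, ms in timings_ms.items():
    print(f"{name}: {effective_gb_per_s(ms):.1f} GB/s")
# Julia: 8.3 GB/s
# Numba: 14.5 GB/s

ratio = timings_ms["Julia"] / timings_ms["Numba"]
print(f"ratio: {ratio:.2f}x")
# ratio: 1.73x
```

If these estimates are in the right ballpark, the loop body is so cheap that allocation and memory traffic probably make up a large share of each call, so that may be a better place to look than the generated loop code.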