Improving parallel loop performance by using `numpy` for allocations

The speed gains also show up in single-threaded workloads. They do not seem to be purely due to GC pauses: the timings are similar with GC disabled. Here is an example that copies a vector and sums it. Letting numpy manage the memory gives a good speedup (roughly 0.055 s versus 0.074 s in the run below) on both x86 and Mac, and for a range of (big-ish) vector sizes.
Timings:

julia +release --project=. -t 1 copyadd_time.jl 
  0.256724 seconds (64.30 k allocations: 79.529 MiB, 53.59% gc time, 17.37% compilation time) # copy_jl, first call (includes compilation)
  2.745950 seconds (4.35 M allocations: 218.689 MiB, 3.69% gc time, 97.00% compilation time) # copy_np, first call (includes compilation)
  0.054762 seconds (46 allocations: 1.367 KiB) # copy_np (numpy/PyArray), GC disabled
  0.074416 seconds (4 allocations: 76.294 MiB) # copy_jl (native Julia), GC disabled

Code:

ENV["JULIA_CONDAPKG_BACKEND"] = "Null" # use system-wide python installation
# otherwise install numpy for this PythonCall environment:
# ] add CondaPkg; using CondaPkg; ] conda add numpy
using PythonCall
using Random
np = pyimport("numpy")
Random.seed!(42)

function copy_jl(arr)
    sum(copy(arr))
end

function copy_np(arr)
    pymem = np.empty(length(arr))
    pyarr = PyArray(pymem)
    pyarr .= arr
    ans = sum(pyarr)
    return ans
end

arr = rand(10_000_000)

@time copy_jl(arr)
@time copy_np(arr)
GC.gc()
GC.enable(false) # makes no noticeable difference to the timings

@time copy_np(arr)
@time copy_jl(arr)
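
One way to check how much of the native Julia time goes into allocating the copy, without involving Python at all, is to time the same copy-and-sum into a buffer allocated once up front. This is only a rough sketch: `copy_prealloc!` and `buf` are hypothetical names that are not part of the code above, and it does not reproduce the numpy measurement.

```julia
# Hypothetical helper (not in the original post): same copy-and-sum work as
# copy_jl, but writing into a buffer that was allocated beforehand, so the
# timed call does not allocate a new 10M-element vector.
function copy_prealloc!(buf, arr)
    copyto!(buf, arr)   # copy the data into the existing buffer
    return sum(buf)
end

arr = rand(10_000_000)
buf = similar(arr)      # allocated once, outside the timed call

copy_prealloc!(buf, arr)        # warm-up call so compilation is not timed
@time copy_prealloc!(buf, arr)  # compare against the copy_jl timing
```

If this lands close to the numpy/PyArray timing, that would suggest the difference comes from the allocation of the fresh array rather than from the copy or the sum.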