This example might be interesting for understanding what performance improvements can still be made in multi-threaded workflows. There was a related discussion on memory management in numpy not long ago: Why is Julia's performance more sensitive to memory allocations than Numpy's?
Thanks to recent updates in PythonCall.jl multithreading, it is now possible to call Python from Julia threads. I often notice GC overhead in parallel loops, so I decided to explore whether naively delegating memory management to numpy can help - and it appears to work. There is also an implementation with Bumper.jl, but it is still slower than calling numpy.

I'd be interested to hear whether an experiment like this is valid. The code is run with julia +release -t 8 --gcthreads 8,1 mem_alloc_pythoncall_script.jl. I have also tried running it on nightly - no big improvements there.
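For reference, the threading setup can be sanity-checked from within the session. This is just a minimal sketch; Threads.ngcthreads() assumes Julia 1.10 or newer.

println("compute threads: ", Threads.nthreads())    # set by -t 8
println("GC threads:      ", Threads.ngcthreads())   # set by --gcthreads 8,1
println("Julia version:   ", VERSION)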
@time f_py(); # 3.143514 seconds (4.12 M allocations: 206.531 MiB, 4.55% gc time, 423.79% compilation time)
@time f_jl(); # 1.498249 seconds (109.24 k allocations: 2.992 GiB, 24.85% gc time, 54.26% compilation time)
@time f_bumper(); # 0.918397 seconds (237.78 k allocations: 11.769 MiB, 206.76% compilation time)
GC.gc()
@time f_py(); # 0.398718 seconds (1.76 k allocations: 55.578 KiB)
GC.gc()
@time f_jl(); # 1.142241 seconds (164 allocations: 2.986 GiB, 36.85% gc time)
GC.gc()
@time f_bumper(); # 0.626195 seconds (100 allocations: 9.016 KiB)
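Single @time calls are fairly noisy; if one wanted more stable numbers, BenchmarkTools.jl could be used instead. A minimal sketch, assuming the package is installed (these @btime results are not part of the numbers above):

using BenchmarkTools
# @btime runs each function several times and reports the minimum,
# which also excludes compilation from the comparison.
@btime f_py();
@btime f_jl();
@btime f_bumper();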
And the code:
ENV["JULIA_CONDAPKG_BACKEND"] = "Null"  # skip CondaPkg and use an existing Python
using PythonCall
np = pyimport("numpy")

using Bumper

# Baseline: allocate a fresh Julia array in every iteration.
function f_jl()
    n = 5 * Threads.nthreads()
    out = zeros(n)
    Threads.@threads :static for i in 1:n
        N = 10_000_000 + i*1000  # make sizes different between calls
        arr = Vector{Float64}(undef, N)
        arr .= 1
        out[i] = minimum(arr)
    end
    out
end

# Delegate the allocation to numpy: the buffer is owned by Python,
# so Julia's GC only sees a small wrapper object.
function f_py()
    n = 5 * Threads.nthreads()
    out = zeros(n)
    PythonCall.GIL.@unlock Threads.@threads :static for i in 1:n
        N = 10_000_000 + i*1000  # make sizes different between calls
        arr = PythonCall.GIL.@lock PyArray(np.zeros(N))
        arr .= 1
        out[i] = minimum(arr)
        arr = nothing
        PythonCall.GIL.@lock PythonCall.GC.gc()  # trying to ensure fair comparison
    end
    out
end

# Bumper.jl: allocate from a bump allocator, freed at the end of @no_escape.
function f_bumper()
    n = 5 * Threads.nthreads()
    out = zeros(n)
    Threads.@threads :static for i in 1:n
        @no_escape begin
            N = 10_000_000 + i*1000  # make sizes different between calls
            arr = @alloc(Float64, N)
            arr .= 1
            out[i] = minimum(arr)
        end
    end
    out
end
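To separate raw allocation cost from GC pauses in the Julia version, one could also time f_jl with the collector temporarily disabled. A sketch only - disabling the GC while allocating ~3 GiB is only safe if enough free memory is available:

f_jl();            # warm up so compilation is excluded
GC.enable(false)   # suspend garbage collection for the measurement
@time f_jl();      # allocation cost without GC pauses
GC.enable(true)
GC.gc()            # collect the garbage left behind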