Improving parallel loop performance by using `numpy` for allocations

This example might be interesting for understanding what performance improvements can still be made in multi-threaded workflows. There was a related discussion on memory management in numpy not long ago: Why is Julia's performance more sensitive to memory allocations than Numpy's?

Thanks to recent multithreading updates in PythonCall.jl it is now possible to call Python from Julia threads. I often notice GC overhead in parallel loops, so I decided to explore whether naively delegating memory management to numpy can help - and it appears to work. There is also an implementation with Bumper.jl, but it is still slower than calling numpy.

I’d be interested to hear if an experiment like this is valid. The code is run with julia +release -t 8 --gcthreads 8,1 mem_alloc_pythoncall_script.jl. I have also tried running on nightly - no big improvements there.

@time f_py(); # 3.143514 seconds (4.12 M allocations: 206.531 MiB, 4.55% gc time, 423.79% compilation time)
@time f_jl(); # 1.498249 seconds (109.24 k allocations: 2.992 GiB, 24.85% gc time, 54.26% compilation time)
@time f_bumper(); # 0.918397 seconds (237.78 k allocations: 11.769 MiB, 206.76% compilation time)

GC.gc()
@time f_py(); # 0.398718 seconds (1.76 k allocations: 55.578 KiB)

GC.gc()
@time f_jl(); # 1.142241 seconds (164 allocations: 2.986 GiB, 36.85% gc time)

GC.gc()
@time f_bumper(); # 0.626195 seconds (100 allocations: 9.016 KiB)

And the code:

ENV["JULIA_CONDAPKG_BACKEND"] = "Null"
using PythonCall
np = pyimport("numpy")

using Bumper

function f_jl()
    n = 5 * Threads.nthreads()
    out = zeros(n)
    Threads.@threads :static for i in 1:n
        N = 10_000_000 + i*1000
        arr = Vector{Float64}(undef, N)
        arr .= 1
        out[i] = minimum(arr)
    end
    out
end

function f_py()
    n = 5 * Threads.nthreads()
    out = zeros(n)
    PythonCall.GIL.@unlock Threads.@threads :static for i in 1:n
        N = 10_000_000 + i*1000 # make sizes different between calls
        arr = PythonCall.GIL.@lock PyArray(np.zeros(N))
        arr .= 1
        out[i] = minimum(arr)
        arr = nothing
        PythonCall.GIL.@lock PythonCall.GC.gc() # trying to ensure fair comparison
    end
    out
end

function f_bumper()
    n = 5 * Threads.nthreads()
    out = zeros(n)
    Threads.@threads :static for i in 1:n
        @no_escape begin
            N = 10_000_000 + i*1000
            arr = @alloc(Float64, N)
            arr .= 1
            out[i] = minimum(arr)
        end
    end
    out
end
3 Likes

Interesting, I will give it a try.

Speed gains are visible in single-threaded workloads as well. Seemingly they are not purely due to GC pauses - the timings are similar with GC disabled. Here is an example that copies and sums a vector. Having numpy manage memory gives a good speedup on both x86 and Mac, and for different (big-ish) vector sizes.
Timings:

julia +release --project=. -t 1 copyadd_time.jl 
  0.256724 seconds (64.30 k allocations: 79.529 MiB, 53.59% gc time, 17.37% compilation time) # compile
  2.745950 seconds (4.35 M allocations: 218.689 MiB, 3.69% gc time, 97.00% compilation time) # compile
  0.054762 seconds (46 allocations: 1.367 KiB) # numpy/pyarray
  0.074416 seconds (4 allocations: 76.294 MiB) # native julia

Code:

ENV["JULIA_CONDAPKG_BACKEND"] = "Null" # use system-wide python installation
# otherwise install numpy for this PythonCall environment:
# ] add CondaPkg; using CondaPkg; ] conda add numpy
using PythonCall
using Random
np = pyimport("numpy")
Random.seed!(42)

function copy_jl(arr)
    sum(copy(arr))
end

function copy_np(arr)
    pymem = np.empty(length(arr))
    pyarr = PyArray(pymem)
    pyarr .= arr
    ans = sum(pyarr)
    return ans
end

arr = rand(10_000_000)

@time copy_jl(arr)
@time copy_np(arr)
GC.gc()
GC.enable(false) # doesn't matter

@time copy_np(arr)
@time copy_jl(arr)
1 Like

FYI: For me, Julia on a recent nightly is only 6.5% slower (for the minimum time, despite, and probably because of, GC activity), and Python was actually slower for the mean (perhaps because of a somewhat loaded machine, which would also affect the minimum), which is strange:

julia> @benchmark f_jl()
BenchmarkTools.Trial: 19 samples with 1 evaluation.
 Range (min … max):  252.914 ms … 274.440 ms  ┊ GC (min … max):  7.57% … 14.20%
 Time  (median):     269.352 ms               ┊ GC (median):    12.63%
 Time  (mean ± σ):   267.617 ms ±   5.511 ms  ┊ GC (mean ± σ):  12.21% ±  1.51%

  ▁   ▁                        ▁      ▁ ▁▁   █▁ ▁█ ▁█▁ █      ▁  
  █▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁█▁██▁▁▁██▁██▁███▁█▁▁▁▁▁▁█ ▁
  253 ms           Histogram: frequency by time          274 ms <

 Memory estimate: 381.60 MiB, allocs estimate: 24.

julia> @benchmark f_py()
BenchmarkTools.Trial: 12 samples with 1 evaluation.
 Range (min … max):  237.435 ms …    1.495 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     245.396 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   444.582 ms ± 400.081 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  237 ms           Histogram: frequency by time           1.5 s <

 Memory estimate: 6.59 KiB, allocs estimate: 204.

julia> @btime f_py()
/bin/bash: line 1: 2581436 Killed                  ( julia +nightly )

More concerning is f_py then failing there…

@benchmark f_bumper() is actually slower across the board, including the minimum, so I'm not sure what's happening there; this is all still single-threaded. I think getting its best single-threaded case first might be helpful before investigating multi-threaded.

Redoing (Julia is sensitive to load, and after lightening it, Julia wins across the board; I've yet to retry NumPy, which might then win, but it's likely very close either way):

$ killall -SIGSTOP firefox-bin

julia> @benchmark f_jl()
BenchmarkTools.Trial: 21 samples with 1 evaluation.
 Range (min … max):  226.414 ms … 274.126 ms  ┊ GC (min … max):  4.18% … 9.52%
 Time  (median):     242.641 ms               ┊ GC (median):    10.07%
 Time  (mean ± σ):   244.879 ms ±   8.331 ms  ┊ GC (mean ± σ):   9.99% ± 1.42%

I believe that PythonCall.GC.gc() is now a no-op…

I was only on defaults, single-threaded, am now investigating more.

The former is actually faster than the Python code, but you might as well just do ones(N), which is as fast, and O(n) like the Python code actually is.
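
A minimal sketch of that variant (my own naming, not from the original post), just replacing the undef allocation plus fill with ones(N):

function f_jl_ones()
    n = 5 * Threads.nthreads()
    out = zeros(n)
    Threads.@threads :static for i in 1:n
        N = 10_000_000 + i*1000
        arr = ones(N)   # allocate and fill with 1.0 in one step
        out[i] = minimum(arr)
    end
    out
end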

This allocated more than similar Julia:

julia> @time np.zeros(1000)
  0.000062 seconds (17 allocations: 288 bytes)
julia> @time np.zeros(1000)
  0.000060 seconds (15 allocations: 256 bytes)

Not sure why more first time around, and it also depends on the size. Then this costs extra:

julia> @time PyArray(np.zeros(1000))
  0.000111 seconds (37 allocations: 1.188 KiB)

From memory, it was zero-cost to transfer memory from Julia to Python/NumPy with PyCall; here, in the other direction, wasn't it also free?
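
As a quick check of that (my own sketch, not from the thread): if np.asarray on a Julia vector gives a zero-copy view, a mutation on the Julia side should be visible on the Python side; if it copies, it won't be.

using PythonCall
np = pyimport("numpy")

x = zeros(3)
y = np.asarray(x)   # wrap the Julia vector on the Python side
x[1] = 42.0
println(y)          # the 42.0 shows up here only if no copy was made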

2 Likes

For this example on Intel/Linux I played around with numpy settings and found that disabling hugepages eliminates the performance gap.

ENV["NUMPY_MADVISE_HUGEPAGE"] = "0"
...
arr = rand(10_000_000)

@time copy_jl(arr)
@time copy_np(arr)
GC.gc()
GC.enable(false) # doesn't matter

@time copy_np(arr)
@time copy_jl(arr)

Results in

0.257229 seconds (64.30 k allocations: 79.529 MiB, 53.72% gc time, 17.32% compilation time)
  2.771781 seconds (4.35 M allocations: 218.689 MiB, 3.64% gc time, 96.28% compilation time)
  0.073100 seconds (46 allocations: 1.367 KiB)
  0.074810 seconds (4 allocations: 76.294 MiB)
1 Like

In the Python code I initialized with np.zeros(N) (not np.ones) and then set the elements to 1, trying to ensure that no smart lazy-allocation magic is triggered and that f_py does at least the same amount of work.

I have posted an issue about adding similar functionality to Julia: Madvise HUGEPAGE for large arrays · Issue #56521 · JuliaLang/julia · GitHub
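
For context, here is a rough sketch (my own, assuming Linux; MADV_HUGEPAGE is the Linux-specific constant 14) of what doing this by hand could look like, using an anonymous mmap, which is page-aligned, plus a madvise hint:

using Mmap

N = 10_000_000
arr = Mmap.mmap(Vector{Float64}, N)   # anonymous, page-aligned mapping
MADV_HUGEPAGE = 14                    # Linux-specific value from <sys/mman.h>
ret = ccall(:madvise, Cint, (Ptr{Cvoid}, Csize_t, Cint),
            pointer(arr), N * sizeof(Float64), MADV_HUGEPAGE)
ret == 0 || @warn "madvise failed" Libc.errno()
arr .= 1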

FWIW since the arrays are so large, the Bumper.jl version here is just going to be a more complicated wrapper around malloc/free, so you might as well just use those directly.
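
For illustration, a minimal sketch of that (my own naming; unsafe_wrap is only for convenience, the raw pointer could be used directly):

function f_malloc()
    n = 5 * Threads.nthreads()
    out = zeros(n)
    Threads.@threads :static for i in 1:n
        N = 10_000_000 + i*1000
        ptr = Ptr{Float64}(Libc.malloc(N * sizeof(Float64)))
        arr = unsafe_wrap(Array, ptr, N)   # view over the malloc'd buffer, not GC-owned
        arr .= 1
        out[i] = minimum(arr)
        Libc.free(ptr)
    end
    out
end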

2 Likes

@cdoris, Julia or maybe even PythonCall.jl seemingly has a multi-threading bug. I seemingly ruled out a memory leak when single-threaded (only), but maybe the program has a memory leak otherwise?

For multi-threaded even as low as -t2:

$ julia +nightly -t2 --project=. test_mem.jl
..
f_py 1
Killed

While with -t1 I get:

  3.261937 seconds (4.76 M allocations: 234.516 MiB, 1.04% gc time, 143.75% compilation time: 3% of which was recompilation)
  0.840262 seconds (111.32 k allocations: 768.734 MiB, 40.94% gc time, 28.88% compilation time)
  0.508282 seconds (248.02 k allocations: 12.001 MiB, 77.98% compilation time)
  0.520475 seconds (44 allocations: 763.384 MiB, 35.93% gc time)
  0.281239 seconds (28 allocations: 2.000 KiB)
..
f_py 37
  226.847 ms (209 allocations: 6.67 KiB)
f_py 38
  226.203 ms (209 allocations: 6.67 KiB)

I monitored in top, and VIRT, RES and %MEM all go up and down, e.g. the latter up to 33% and then down to 1.3%; here are some samples over time:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND            

2600356 pharald+ 20 0 11,4g 10,6g 109268 R 99,7 34,0 4:18.15 julia

2600356 pharald+ 20 0 7599060 6,4g 109268 R 100,0 20,6 4:30.26 julia

2600356 pharald+ 20 0 1815812 978432 109268 R 100,0 3,0 4:39.35 julia

2600356 pharald+ 20 0 7755372 6,7g 109328 R 99,7 21,3 4:57.53 julia

This addition to the script seemingly just keeps running (without getting killed) in single-threaded mode:

for i in 1:1000
    println("f_py $i")
    @btime f_py();
end

If I do a similar loop for f_jl, then %MEM goes up to about 3.3%, and down and up again, over a much smaller range.

..
f_jl 24
  316.331 ms (44 allocations: 763.38 MiB)

This line is a bit regrettable:

    n = 5 * Threads.nthreads()

I thought I was running the same program multi-threaded (sort of true), and seeing the runtime go up was puzzling, but tying the workload linearly to the number of threads is, I think, not a good idea. It doesn't explain, I think, why single-threaded never got killed, since I also tried changing it to:

    n = 100 * Threads.nthreads()
1 Like

Ideal speedup with multi-threaded would be linear, here 8x, but for this test it’s only 2.47x:

test_malloc(n, s1=200, s2=200) = Threads.@threads :static for i in 1:n
    b1 = Libc.malloc(s1); b2 = Libc.malloc(s2); Libc.free(b1); Libc.free(b2)
end
test_malloc (generic function with 3 methods)

@btime test_malloc(10000)
  272.707 μs (7 allocations: 416 bytes)

vs. 3.57x for:

test_malloc_in_order(n, s1=200, s2=200) = Threads.@threads :static for i in 1:n
    b1 = Libc.malloc(s1); Libc.free(b1); b2 = Libc.malloc(s2); Libc.free(b2)
end
test_malloc_in_order (generic function with 3 methods)

@btime test_malloc_in_order(10000)
  76.322 μs (42 allocations: 2.88 KiB)

This is excluding any GC overhead (and note the allocation counts shown are bogus; they don't include Libc.malloc).

When you free immediately after your malloc, it's the best case, for cache reasons. Even slightly out-of-order frees are costly, as shown above, though probably not for cache reasons? With only one thread the two are very comparable, with only a 3.5% difference, which might even be no difference at all, just random noise from a machine with a web browser running.

So what does malloc do on a multi-threaded machine that makes it not scale? It must do some locking; you have only one shared heap. But what you CAN do, and Python might do, is have a better multi-threaded malloc. Python changed to a (patched) mimalloc in 3.13 (I'm testing with 3.12 though; someone might want to test with 3.13 and/or with a pure Python program, not calling from Julia).

What I can see would help, and what malloc could do, is divide the heap, conceptually at least, into 8 heaps, so that each of the (or up to) 8 threads can have its own without locking overhead. This alone doesn't work, since you do not want to be limited to 1/8th of your memory per thread (nor does it scale well to n threads), but there could be some escape hatch for when your "own" heap runs out: a shared space that does use locking.
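
To make the idea concrete, a toy sketch (my own, handling only a single fixed buffer size and omitting the escape hatch): each thread reuses buffers from its own free list, so the common path needs no shared lock.

# Toy illustration: one free list per thread, for buffers of one fixed size.
const FREE_LISTS = [Ptr{Cvoid}[] for _ in 1:Threads.nthreads()]

function pool_alloc(sz)
    list = FREE_LISTS[Threads.threadid()]
    isempty(list) ? Libc.malloc(sz) : pop!(list)   # reuse if possible, else fall back to malloc
end

pool_free(ptr) = push!(FREE_LISTS[Threads.threadid()], ptr)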

Julia doesn't actually call Libc.malloc directly; it has its own pools, organized by size. I think there is nothing like this per thread (only per size, which is also useful), deferring to malloc for that kind of work. But it seems the default Libc.malloc that's eventually called doesn't partition per thread, or not as well as it could, I guess.

@dnetto might know, and others too, about alternative mallocs like mimalloc for Julia; someone was working on that.

Inherently, with a tracing GC you free out of order, and it would be, and likely is, even more out of order. Python has reference counting, which minimizes this (with a tracing GC on top).

When I was trying to substitute mimalloc for Libc's malloc, at the time, it was ineffective, because Julia wasn't calling it that much. Julia's own pools seem like a good idea for patching over an inferior Libc.malloc, but if it isn't actually inferior then they seem like wasted work, and they actually block access to a potentially better Libc.malloc. All the work Julia does with pools seems like something malloc could and should do. This was written when Julia wasn't yet multi-threaded, and may have made sense back then. I think Julia should just strip it out for simple code and rely more on malloc. It doesn't have to be the default malloc; I proposed mimalloc before Python adopted it, and there are some other options too.

I’m looking into a similar test in Python, allocation and forced free (by calling some Python code, maybe not the C API directly), feel free to help:

void *PyMem_RawMalloc(size_t n)
    Added in version 3.4. Part of the Stable ABI since version 3.13.

I think my original timings are misleading because they do not fully incorporate Python's garbage-collection overhead. Apparently a non-trivial amount of time is spent during program exit. Running

f_py()
f_py()
f_py()
`date +%s%N` |> run |> println

with julia +release --project=. -t 8 script.jl; date +%s%N printed

1731594568703507834
1731594569832230869

i.e. ~1.2 seconds shutdown time. This grows about linearly with the number of f_py calls.

1 Like

Makes sense, since Python is reference counted.

You could try in Python:

os._exit() exits the program without calling cleanup handlers, flushing stdio buffers, etc. Thus, it is not a standard way to exit and should only be used in special cases.

When you're using Julia, there's likely some equivalent; when using both together, I'm not sure what works…
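
One hedged possibility (my own sketch, untested here): since os._exit terminates the whole process immediately, it could be called through PythonCall at the very end of the script, skipping both Python's and Julia's shutdown work:

using PythonCall
os = pyimport("os")

f_py(); f_py(); f_py()
`date +%s%N` |> run
os._exit(0)   # hard exit: no Python cleanup, no Julia atexit hooks, buffers not flushed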

In pure Python you would likely free as you go along (or have a memory leak), so doing this from Julia, does it in effect induce a memory leak? That was my experience: it blew up when used heavily, e.g. from @btime. So is there a workaround with PythonCall? If so, you would likely level the playing field and lose the Python/NumPy advantage.