We are having a fantastic experience with Julia, but then had a confusing experience. I am doing a multithreaded simulation study, and eventually Julia runs out of memory and crashes. This happens especially on Linux (on Intel) but can also happen on ARM (an M2 Mac). When I run the code below and monitor Julia's memory use, it keeps climbing until it crashes. Clearly, it is not doing garbage collection, especially on Linux. But depending on what else the code does, the same can happen on the M2. We are on the latest version.
best, d
using Pkg, Revise, StatsBase

function Simulate()
    Simulations = Int(1e7)
    Size = 1000
    result = Array{Float64}(undef, Simulations, 1)
    Threads.@threads for i = 1:Simulations
        x = randn(Size)
        s = sort(x)
        result[i, 1] = s[1]
    end
    println(median(result))
end

for i in 1:1000
    println(i)
    Simulate()
end
Not saying that there shouldn't be a built-in safeguard for this, but your code is allocating so much memory that it's not surprising it runs out. Each thread is allocating two 1000-element arrays per iteration.
I don't know if this is just meant to be an MWE, but in general it's a good idea to avoid this many allocations, especially in multithreaded code.
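To put a rough number on that (a back-of-the-envelope sketch; exact figures vary by Julia version and machine): each iteration allocates two 1000-element Float64 arrays of about 8 kB each, so one call to Simulate() produces on the order of 150 GB of allocation traffic for the GC to keep up with.

per_iter = @allocated begin
    x = randn(1000)   # one ~8 kB allocation
    s = sort(x)       # sort makes a copy, another ~8 kB
end
total_gb = per_iter * 1e7 / 2^30  # on the order of 150 GB per Simulate() call
println((per_iter, total_gb))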
I agree. That said, this code sample stripped out all the computationally intensive calculations, so the overhead is minimal in the actual code.
Even then, I just don't know how to preallocate in multithreaded code like this. One would need to preallocate one vector per core and use it correctly. Maybe it's possible, I just don't know how; I'll take a closer look at the docs.
And this problem is much worse on Linux than on a Mac under the same Julia version.
@rafael.guerra thanks, yes, I agree for this sample code, but in the actual code, which does a lot of calculations, it needs to be a matrix, and I just carried that over.
To avoid most of the allocations, the following should work (I haven't checked if it runs):
using Pkg, Revise, StatsBase
import ChunkSplitters

function Simulate()
    Simulations = round(Int, 1e7)
    Size = 1000
    buffers = [zeros(Size) for _ in 1:Threads.nthreads()]
    result = zeros(Simulations, 1)
    chunks = ChunkSplitters.chunk(1:Simulations, n=Threads.nthreads())
    Threads.@threads for (n_chunk, indices) in enumerate(chunks)
        x = buffers[n_chunk]
        for i = 1:Simulations
            randn!(x)
            sort!(x)
            result[i, 1] = x[1] # I guess in this case sorting is technically not needed and could be replaced by minimum, but perhaps you need the sorting for other reasons
        end
    end
    println(median(result))
end

for i in 1:1000
    println(i)
    Simulate()
end
PS: note I also replaced Int(1e7) by rounding. Probably not absolutely needed in your use case, but it seems less error-prone.
Maybe not to distract from the more fundamental problem, but does it work if you run Julia with the flag julia --heap-size-hint=8G (or whatever amount of memory you have available, say 80 percent of your RAM)?
In theory this should allow for more aggressive GC, though I'm not sure how well it works for multithreaded code.
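For example, on a 64 GB machine the invocation would be something like this (script.jl is just a placeholder for your simulation script, and the exact hint value is up to you):

julia --threads=8 --heap-size-hint=50G script.jl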
Thanks for this, I had not seen ChunkSplitters; so many wonderful packages to discover.
This code didn't quite solve it. Besides the small problems of needing using Random and ChunkSplitters.chunks, it kept on allocating more memory on both Linux and Mac (and was much slower on Linux, which is usually faster than the Mac). I think perhaps it repeats the inner loop over all Simulations for every thread, and one would need to split the inner loop into Simulations/number-of-threads chunks?
That did indeed force it to do GC, but did not solve it.
a) The aggressive GC really slows the code when it's close to the limit.
b) Previously, the inner simulation loop took memory allocation up to the limit (64 GB on these machines), which meant the post-simulation processing could not allocate, and the code crashed.
You could reuse the buffers and (at least in this example) result between different runs of Simulate by moving them into the arguments of the method:
function Simulate!(buffers, result)
    Simulations = size(result, 1)
    Size = length(first(buffers))
    chunks = ...
    ...
end

function SimulateLoop(runs=1000, Simulations=10^7, Size=1000)
    buffers = [zeros(Size) for _ in 1:Threads.nthreads()]
    result = zeros(Simulations, 1)
    for i in 1:runs
        println(i)
        Simulate!(buffers, result)
    end
end
using Pkg, Revise, StatsBase, Random
import ChunkSplitters

function Simulate()
    Simulations = round(Int, 1e7)
    Size = 1000
    result = zeros(Simulations)
    Threads.@threads for c in ChunkSplitters.chunks(1:Simulations, n=Threads.nthreads())
        x = zeros(Size)
        for i in c
            randn!(x)
            sort!(x)
            result[i] = x[1]
        end
    end
    println(median(result))
end

for i in 1:1000
    println(i)
    Simulate()
end
I think the above should work (I’m on the phone)
To be completely safe about allocations, you can preallocate the temporary arrays:
using Pkg, Revise, StatsBase, Random
import ChunkSplitters

function Simulate!(result, xt)
    result .= 0.0
    Threads.@threads for (ic, c) in enumerate(ChunkSplitters.chunks(eachindex(result), n=length(xt)))
        x = xt[ic]
        x .= 0.0
        for i in c
            randn!(x)
            sort!(x)
            result[i] = x[1]
        end
    end
    println(median(result))
end

function run(; ntasks=Threads.nthreads())
    Simulations = 10^7
    Size = 1000
    result = zeros(Simulations)
    xt = [zeros(Size) for _ in 1:ntasks]
    for i in 1:1000
        println(i)
        Simulate!(result, xt)
    end
end
@Davide98888 if your actual problem has this structure, the above can, in practice, solve the issues you are having. (It does not solve the GC bug.)
Or you could parallelize the loop over the simulations at a higher level.
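For example, a minimal sketch of that higher-level approach (names like simulate_serial and run_outer are illustrative, and I haven't benchmarked this): each task runs one full simulation serially with its own buffers, and it is the outer 1:1000 loop that gets threaded.

using Random, Statistics

# Run one full simulation serially, reusing the preallocated buffers.
function simulate_serial(x::Vector{Float64}, result::Vector{Float64})
    for i in eachindex(result)
        randn!(x)
        result[i] = minimum(x)  # same value as sort!(x); x[1]
    end
    return median(result)
end

# Thread over the outer runs instead of the inner simulation loop.
function run_outer(runs=1000, Simulations=10^7, Size=1000)
    medians = zeros(runs)
    Threads.@threads for r in 1:runs
        x = zeros(Size)              # per-run buffers, allocated once per run
        result = zeros(Simulations)
        medians[r] = simulate_serial(x, result)
        println(r)
    end
    return medians
end

With nthreads() tasks live at once, each holding one 10^7-element result vector (about 80 MB), the peak footprint stays bounded instead of growing with the number of inner iterations.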
Correct me if I am wrong, but it seems to me that most (all?) of the answers tend to improve the OP's code by reducing allocations, which is interesting per se, but do not address whether the original OP's MWE actually illustrates a genuine threading bug/problem with the GC (it looks like one to me). If that is the case, I guess an issue should be filed.
How are you measuring the leakage? Just looking at top? I'm curious how your code behaves on my machine, so I'd like to reproduce your methodology. The example you're showing does indeed allocate a lot of memory overall, but it shouldn't leak like this. There are plenty of opportunities here for the GC to run and eventually reuse those allocations internally.
I use btop on my Linux and Mac machines. My production code eventually crashes when it processes the results after the simulation loop, and btop shows me this happens when Julia has taken all of the 64 GB on my machines (each simulation loop needs about 0.5 GB). The Linux console also gives an out-of-memory message.
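One can also compare the GC's view of memory with the process footprint from inside Julia (a sketch: Base.gc_live_bytes and Sys.maxrss are internal/OS-level counters, and report_memory is just an illustrative helper name):

# Log memory once per outer run and compare against what btop reports.
function report_memory(i)
    live_gb = Base.gc_live_bytes() / 2^30  # bytes the GC considers live
    rss_gb  = Sys.maxrss() / 2^30          # peak resident set size of the process
    println("run $i: GC-live ≈ $(round(live_gb, digits=2)) GB, peak RSS ≈ $(round(rss_gb, digits=2)) GB")
end

Calling report_memory(i) inside the for i in 1:1000 loop shows whether the GC-live bytes stay flat while the process RSS keeps climbing, which would match what btop is showing.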