Hello,
While running some large-scale parallel computations, a colleague of mine has run into problems with memory leakage. To be more specific, the amount of memory used by the offending program gradually increases over the course of the computation until (for large enough problems) all system memory is used up and the program crashes. The memory required by the program should remain relatively constant after an initial spinup period. We've found the source of the problem but don't understand what's going on. I suspect that the observed behaviour is due to some bug in how `Future`s are marked as eligible for garbage collection, but it may be that I just don't understand how they work. We've reduced the problem to the following MWE, with one version exhibiting the "memory leak" and the other not doing so. I apologize that the MWE is a bit long. The basic idea is that we have a master function that distributes subproblem computations to remote worker processes and then collects the results on the master process. Each worker may host multiple subproblems. There is some input data on the master process that all subproblems need, and it must be communicated from the master process to the workers. Let's first define a couple of functions:
```julia
using Distributed

# NB: these definitions must be available on all processes (e.g. via @everywhere).

# Worker-side function: fetch the shared input and do something with it.
function workerFunction(sigmaRef::Union{Future,RemoteChannel})
    sigma = fetch(sigmaRef)
    println(sum(sigma))
    return true
end

# Leaking version: copy sigma to each worker with remotecall_wait and keep
# the returned Future as the reference to the remote copy.
function leakingMasterFunction(n1::Int, n2::Int)
    sigma      = ones(n1)
    nw         = nworkers()
    workerList = workers()
    sigmaRef   = Vector{Future}(undef, nw)
    @sync begin
        for i in 1:nw
            p = workerList[i]
            @async begin
                sigmaRef[i] = remotecall_wait(identity, p, sigma)
                # Each of the n2 subproblems on worker p reads the same copy.
                for idx = 1:n2
                    b = remotecall_fetch(workerFunction, p, sigmaRef[i])
                end
            end
        end
    end
end
```
Now, assume we've launched Julia with two or more workers and loaded the above functions on all processes. From the REPL, we execute:

```julia
julia> nPerWorker = 2

julia> while true; leakingMasterFunction(5000000, nPerWorker); end
```
For nPerWorker > 1, this causes the memory used by Julia to increase until the system runs out of memory and the code crashes. With nPerWorker = 1, the infinite loop runs forever with relatively constant memory usage. Running `finalize` on the elements of `sigmaRef` doesn't help.
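For reference, the finalization we tried looks roughly like this, placed at the end of `leakingMasterFunction` after the `@sync` block:

```julia
# Explicitly release each Future; this should send a delete message to the
# worker that owns the value, but in our tests it made no difference.
for r in sigmaRef
    finalize(r)
end
```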
However, if we take the following slightly different approach, where `RemoteChannel`s are used to communicate the input to the workers, we see no memory leak:
```julia
# Helper functions for the non-leaking version of the master function.
function initRemoteChannel(func::Union{Function,Type}, pid::Int64, args...; kwargs...)
    return RemoteChannel(() -> initChannel(func, args, kwargs), pid)
end

function initChannel(func::Union{Function,Type}, args::Tuple, kwargs)
    obj = func(args...; kwargs...)
    # Wrap the result in a size-1 Channel owned by the calling worker.
    chan = Channel{typeof(obj)}(1)
    put!(chan, obj)
    return chan
end
```
```julia
# Non-leaking version: communicate sigma via a RemoteChannel instead of a Future.
function nonLeakingMasterFunction(n1::Int, n2::Int)
    sigma      = ones(n1)
    nw         = nworkers()
    workerList = workers()
    sigmaRef   = Vector{RemoteChannel}(undef, nw)
    @sync begin
        for i in 1:nw
            p = workerList[i]
            @async begin
                # Store sigma in a Channel owned by worker p.
                sigmaRef[i] = initRemoteChannel(identity, p, sigma)
                for idx = 1:n2
                    b = remotecall_fetch(workerFunction, p, sigmaRef[i])
                end
            end
        end
    end
end
```
The `initRemoteChannel(func, pid, args...; kwargs...)` function creates a `RemoteChannel` on the master process that stores a reference to the output of `func`, which is computed and then wrapped in a `Channel` on worker `pid`.
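To make the mechanics concrete, here is a minimal usage sketch (hypothetical values; note that `fetch` on a `RemoteChannel` returns the stored value without removing it from the channel):

```julia
# Hypothetical usage: build the channel-backed reference on the first worker,
# then read the stored value back on the master process.
ref = initRemoteChannel(identity, workers()[1], ones(10))
v = fetch(ref)      # copies the vector out of the remote Channel
println(sum(v))     # prints 10.0
```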
So, in summary: if we use the approach in `leakingMasterFunction` to distribute `sigma` to the workers, we get memory leakage; if we use the approach in `nonLeakingMasterFunction`, there's no leak.
As far as I can understand from the parallel computing docs in the Julia manual, the value referenced by a `Future` should be eligible for garbage collection once all references to it have either been fetched or been garbage collected themselves. So why does the above code cause the system to run out of memory? Is this a bug, or am I missing something about how `Future`s work?
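For concreteness, this is the lifecycle I have in mind (a sketch of my reading of the docs, not a claim about the internals):

```julia
# My reading of the documented Future lifecycle:
f = remotecall(ones, workers()[1], 10)  # value is created on and owned by the worker
x = fetch(f)    # value is copied to the caller and cached in the Future
finalize(f)     # drop the reference; once all referencing Futures are gone,
                # the worker-side value should be deleted
```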
Many thanks for any help, Patrick
Edit: I should add that I've read https://github.com/JuliaLang/julia/issues/12099. I understand that if an unfetched `Future` isn't actually garbage collected, then the process holding the value referenced by that `Future` might never be notified that the value can be garbage collected. As I mentioned above, explicitly calling `finalize` doesn't help matters, so I don't think this is the issue, but I'd be happy if someone could explain to me why I'm wrong.
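In case it helps anyone trying to reproduce this: one diagnostic would be to force a collection on every process inside the driver loop, to distinguish "never collected" from "not collected yet" (a sketch, using `GC.gc` and `@everywhere` as in current Julia):

```julia
# Hypothetical diagnostic: force a GC pass on all processes each iteration.
while true
    leakingMasterFunction(5000000, nPerWorker)
    @everywhere GC.gc()
end
```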