Hello,
In running some large-scale parallel computations, a colleague of mine has run into problems with memory leakage. To be specific: the amount of memory used by the offending program gradually increases over the course of the computation until (for large enough problems) all system memory is used up and the program crashes. The memory required by the program should remain roughly constant after an initial spin-up period. We've found the source of the problem but don't understand what's going on. I suspect that the observed behaviour is due to some bug in how Futures are marked as eligible for garbage collection, but it may be that I just don't understand the mechanism. We've reduced the problem to the following MWE, with one version exhibiting the "memory leak" and the other not. I apologize that the MWE is a bit long.

The basic idea is that we have a master function that distributes subproblem computations to remote worker processes and then collects the results on the master process. Each worker may host multiple subproblems. There is some input data on the master process that all subproblems need, and it must be communicated from the master to the workers. Let's first define a couple of functions:
using Distributed

# Fetch the shared input data via the given remote reference and use it.
function workerFunction(sigmaRef::Union{Future,RemoteChannel})
    sigma = fetch(sigmaRef)
    println(sum(sigma))
    return true
end
function leakingMasterFunction(n1::Int, n2::Int)
    sigma = ones(n1)
    nw = nworkers()
    workerList = workers()
    sigmaRef = Vector{Future}(undef, nw)
    @sync begin
        for i in 1:nw
            p = workerList[i]
            @async begin
                # Ship sigma to worker p; sigmaRef[i] refers to the remote copy.
                sigmaRef[i] = remotecall_wait(identity, p, sigma)
                for idx = 1:n2
                    b = remotecall_fetch(workerFunction, p, sigmaRef[i])
                end # idx
            end # @async
        end # i
    end # @sync
end # leakingMasterFunction
Now, assume we've launched Julia with two or more workers and loaded the above functions. If we execute the following from the REPL:
> nPerWorker = 2
> while true; leakingMasterFunction(5000000, nPerWorker); end
then for nPerWorker > 1 the memory used by Julia increases until the system runs out of memory and the code crashes. With nPerWorker = 1, the infinite loop runs forever with relatively constant memory usage. Running finalize on the elements of sigmaRef doesn't help.
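For completeness, the finalize attempt looked roughly like this (a sketch only: it assumes a hypothetical variant of leakingMasterFunction that returns sigmaRef, which the version above does not):

```julia
# Hypothetical: suppose leakingMasterFunction returned sigmaRef.
sigmaRef = leakingMasterFunction(5000000, nPerWorker)
for r in sigmaRef
    # finalize should notify the owning process that this reference is dead
    finalize(r)
end
```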
However, if we take the following slightly different approach, in which RemoteChannels are used to communicate the input to the workers, we see no memory leak:
# Helper functions for the non-leaking version of the master function
function initRemoteChannel(func::Union{Function,Type}, pid::Int64, args...; kwargs...)
    return RemoteChannel(() -> initChannel(func, args, kwargs), pid)
end

# Runs on the remote worker: compute the object and wrap it in a Channel.
function initChannel(func::Union{Function,Type}, args::Tuple, kwargs)
    obj = func(args...; kwargs...)
    chan = Channel{typeof(obj)}(1)
    put!(chan, obj)
    return chan
end
function nonLeakingMasterFunction(n1::Int, n2::Int)
    sigma = ones(n1)
    nw = nworkers()
    workerList = workers()
    sigmaRef = Vector{RemoteChannel}(undef, nw)
    @sync begin
        for i in 1:nw
            p = workerList[i]
            @async begin
                # Store sigma in a Channel on worker p instead of using a Future.
                sigmaRef[i] = initRemoteChannel(identity, p, sigma)
                for idx = 1:n2
                    b = remotecall_fetch(workerFunction, p, sigmaRef[i])
                end # idx
            end # @async
        end # i
    end # @sync
end # nonLeakingMasterFunction
The initRemoteChannel(func, pid, args...; kwargs...) function creates a RemoteChannel on the master process that holds a reference to the output of func, which is computed on worker pid and then wrapped in a Channel there. So, in summary: if we use the approach in leakingMasterFunction to distribute sigma to the workers, we get memory leakage; if we use the approach in nonLeakingMasterFunction, there's no leak.
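To make the helper concrete, here is a minimal round trip for a single value (a sketch assuming a fresh session; note that initChannel must be defined on the workers, e.g. with @everywhere, because the closure passed to RemoteChannel runs on the remote process):

```julia
using Distributed
nprocs() == 1 && addprocs(1)

# initChannel runs on the worker, so define it everywhere
@everywhere function initChannel(func, args::Tuple, kwargs)
    obj = func(args...; kwargs...)
    chan = Channel{typeof(obj)}(1)
    put!(chan, obj)
    return chan
end

initRemoteChannel(func, pid, args...; kwargs...) =
    RemoteChannel(() -> initChannel(func, args, kwargs), pid)

p = workers()[1]
ref = initRemoteChannel(ones, p, 3)  # ones(3) now lives in a Channel on worker p
@assert fetch(ref) == ones(3)        # fetch reads the value without removing it
```

Because the value sits in a Channel owned by the RemoteChannel, it stays alive exactly as long as the RemoteChannel itself, and repeated fetch calls keep working.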
As far as I can tell from the parallel computing section of the Julia manual, the value referenced by a Future should become eligible for garbage collection once all references to it have either been fetched or been garbage collected themselves. So why does the above code cause the system to run out of memory? Is this a bug, or am I missing something about how Futures work?
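One experiment that might narrow this down (an assumption on my part, not something I have verified): remote values are released by finalizers, and finalizers only run when the local GC happens to collect the dead Future, so forcing a collection on every process between iterations could tell us whether the references are merely uncollected rather than truly leaked:

```julia
# Bounded version of the driver loop, forcing GC everywhere each iteration.
for iter in 1:100
    leakingMasterFunction(5000000, nPerWorker)
    @everywhere GC.gc()  # run GC (and hence remote-reference finalizers) on all processes
end
```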
Many thanks for any help, Patrick
Edit: I should add that I've read https://github.com/JuliaLang/julia/issues/12099. I understand that if an unfetched Future isn't actually garbage collected, then the process holding the value referenced by that Future might not get notified that the value can be freed. As I mentioned above, explicitly calling finalize doesn't help matters, so I don't think this is the issue, but I'd be happy if someone could explain why I'm wrong.