Tasks from Distributed don't release memory

Hello,

I use Julia on a Slurm cluster. I have recently encountered several OutOfMemory errors.

I run a package for Spiking Neural Network simulations that I co-develop, and in which there is no explicit memory management. In this package, the objects that allocate the most memory are the models: each model is a hierarchy of named tuples and structs. The structs are defined in the module and hold large chunks of data.
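
To make that concrete, the model is structured roughly like this (names and sizes are simplified for illustration; this is not the actual package code):

# Illustrative sketch of the model hierarchy (simplified stand-ins).
struct Population
    v::Vector{Float64}       # membrane potentials, one entry per neuron
    spikes::Vector{Bool}     # spike flags for the current step
end

struct Synapses
    W::Matrix{Float64}       # dense connectivity matrix, the big allocation
end

# gimme_model returns a hierarchy of named tuples holding these structs
gimme_model(n = 10_000) = (
    pop = Population(zeros(n), fill(false, n)),
    syn = Synapses(zeros(n, n)),     # ~800 MB for n = 10_000
)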

In my simulations on the cluster, I run something like this:

using Distributed
@everywhere using MyModule 

@everywhere function run_model()
    model = MyModule.gimme_model()   # build the model (the big allocation)
    MyModule.sim_model(model)        # run the simulation
    MyModule.store_model(model)      # write results to disk
    return nothing
end

@sync @distributed for w in workers()[1:3]
    @spawnat w run_model()
end

The models are always created inside function or let scopes; during a run they fill their memory and are then stored to disk. I assumed that when the scope closes the memory would be released, but apparently that is not the case.
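
For the let case, a stripped-down run looks roughly like this (again illustrative, not the real package code):

# Sketch of the let-scoped variant (illustrative).
@everywhere function run_once()
    let model = MyModule.gimme_model()    # the large allocation lives only here
        MyModule.sim_model(model)
        MyModule.store_model(model)       # results are written to disk
    end
    # `model` is unreachable past this point, so I expected the GC to be
    # able to reclaim it, yet the worker's resident memory stays high.
    return nothing
end

Is an explicit @everywhere GC.gc() after each run the intended way to get that memory back, or should closing the scope be enough on its own?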

I also noticed that if an error occurs in the function running on a worker, the worker holds on to the memory, and I have to resort to the rather inelegant pkill julia to free it!
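
A minimal example of this failure mode would be something like the following (sizes and names are illustrative, not my actual code):

# Illustrative repro: the remote call errors out, yet the worker keeps
# the memory until I kill the process externally.
using Distributed
addprocs(1)

@everywhere function allocate_and_fail()
    data = rand(10^8)              # roughly 800 MB of Float64
    error("simulated failure")     # throws right after the allocation
end

f = @spawnat first(workers()) allocate_and_fail()
# fetch(f) rethrows the error as a RemoteException on the master;
# afterwards the worker's resident memory stays high on my setup.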

So, what am I doing wrong? How should I properly manage memory, both in the Distributed framework and within my package?

PS.

I use a Python tool called Optuna to monitor the processes and run Bayesian parameter optimization. From the Optuna Dashboard I can see that some of the failed processes are still “running”, even though they were launched from a Julia kernel that has since been closed:

bash_kernel $ julia run_workers.jl
# this process has since been terminated and the terminal closed!