Hello,
I use Julia on a Slurm cluster, and I recently ran into several OutOfMemory errors.
I run a package for Spiking Neural Network simulations that I co-develop, and in which I have no explicit memory management. In this package, the objects that allocate a lot of memory are the models, which are hierarchies of named tuples and structs. The structs are defined in the module and hold large chunks of data.
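For context, the model hierarchy looks roughly like this (the names and fields below are made up for illustration, they are not the real package code):

```julia
# Hypothetical sketch of the kind of hierarchy I mean (not the actual API).
struct Population
    voltages::Vector{Float64}   # large arrays live inside structs like this
    spikes::Vector{Bool}
end

struct Synapses
    weights::Matrix{Float64}
end

# The "model" is then a named tuple tying these structs together.
model = (pop = Population(zeros(100_000), zeros(Bool, 100_000)),
         syn = Synapses(zeros(10_000, 10_000)))
```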
In my simulations on the cluster, I run something like this:
```julia
using Distributed

@everywhere using MyModule

@everywhere function run_model()
    model = MyModule.gimme_model()
    MyModule.sim_model(model)
    MyModule.store_model(model)
    return nothing
end

@sync @distributed for w in workers()[1:3]
    @spawnat w run_model()
end
```
The models are always created inside `function` or `let` scopes; they allocate their memory there and are stored to disk. I assumed that when the scope closes the memory would be released, but apparently that's not the case.
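Is explicitly dropping the reference and forcing a collection the right approach? Something like this is what I have in mind (just a sketch; the `model = nothing` and `GC.gc()` calls are my guess, not something the package does today):

```julia
@everywhere function run_model()
    model = MyModule.gimme_model()
    MyModule.sim_model(model)
    MyModule.store_model(model)
    model = nothing   # drop the last reference to the large data
    GC.gc()           # force a collection on this worker before returning
    return nothing
end
```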
I also noticed that if an error occurs in the function running on a worker, the worker holds on to the memory, and I have to resort to the rather ugly `pkill julia` to free it!
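Instead of `pkill`, is the way to recover to drop the stuck worker and add a fresh one? Something like this sketch is what I imagine (the pid and the `waitfor` value are just placeholders):

```julia
using Distributed

stuck_pid = 4                    # hypothetical pid of the worker that errored

# Remove just that worker instead of pkill-ing every Julia process ...
rmprocs(stuck_pid; waitfor=30)

# ... and replace it with a fresh process that has the module loaded.
new_pids = addprocs(1)
@everywhere new_pids using MyModule
```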
So, what am I doing wrong? How should I properly manage the memory, both in the Distributed framework and in my package?
PS.
I use a Python tool called Optuna to monitor the processes and run Bayesian parameter optimization. From the Optuna Dashboard I can see that some of the failed processes are still “running”, even though they were launched from a Julia session that is now closed:
```bash
bash_kernel $ julia run_workers.jl
# which is now terminated, the terminal is closed!
```