Hello,
I use Julia on a Slurm cluster, and I recently ran into several OutOfMemory errors.
I run a package for Spiking Neural Network simulations that I co-develop, and in which I have no explicit memory management. In this package, the objects that allocate a lot of memory are the models, which are hierarchies of named tuples and structs. The structs are defined in the module and hold large chunks of data.
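For context, the model hierarchy looks roughly like this (the names and fields below are made up for illustration, they are not the real package code):

```julia
# Hypothetical sketch of the kind of hierarchy I mean (not the actual API).
struct Population
    voltages::Vector{Float64}   # large arrays live inside structs like this
    spikes::Vector{Bool}
end

struct Synapses
    weights::Matrix{Float64}
end

# The "model" is then a named tuple tying these structs together.
model = (pop = Population(zeros(100_000), zeros(Bool, 100_000)),
         syn = Synapses(zeros(10_000, 10_000)))
```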
In my simulations on the cluster, I run something like this:
```julia
using Distributed

@everywhere using MyModule

@everywhere function run_model()
    model = MyModule.gimme_model()
    MyModule.sim_model(model)
    MyModule.store_model(model)
    return nothing
end

@sync @distributed for w in workers()[1:3]
    @spawnat w run_model()
end
```
The models are always created inside `function` or `let` scopes; they allocate their memory there and are stored to disk. I assumed that when the scope closes the memory would be released, but apparently that's not the case.
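Is explicitly dropping the reference and forcing a collection the right approach? Something like this is what I have in mind (just a sketch; the `model = nothing` and `GC.gc()` calls are my guess, not something the package does today):

```julia
@everywhere function run_model()
    model = MyModule.gimme_model()
    MyModule.sim_model(model)
    MyModule.store_model(model)
    model = nothing   # drop the last reference to the large data
    GC.gc()           # force a collection on this worker before returning
    return nothing
end
```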
I also noticed that if an error occurs in the function running on a worker, the worker holds on to the memory, and I have to resort to the rather ugly `pkill julia` to free it!
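Instead of `pkill`, is the way to recover to drop the stuck worker and add a fresh one? Something like this sketch is what I imagine (the pid and the `waitfor` value are just placeholders):

```julia
using Distributed

stuck_pid = 4                    # hypothetical pid of the worker that errored

# Remove just that worker instead of pkill-ing every Julia process ...
rmprocs(stuck_pid; waitfor=30)

# ... and replace it with a fresh process that has the module loaded.
new_pids = addprocs(1)
@everywhere new_pids using MyModule
```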
So, what am I doing wrong? How should I properly manage the memory, both in the Distributed framework and in my package?
PS.
I use a Python tool called Optuna to monitor the processes and run Bayesian parameter optimization. From the Optuna Dashboard I can see that some of the failed processes are still “running”, even though they were launched from a Julia session that is now closed:
```bash
bash_kernel $ julia run_workers.jl
# which is now terminated, the terminal is closed!
```