I’ve been using Slurm for a while with Julia. I’ve only recently been running into issues where I’m running out of memory for long-running processes (around 30 hours). The jobs are highly regular, effectively calling the same function over and over on the same model (i.e. a reinforcement learning agent). When I’m testing locally I don’t see any obvious memory leaks. What I’ve hypothesized is that on the cluster Julia doesn’t know about the memory boundary Slurm is putting up, and thus isn’t garbage collecting as aggressively as I need. Does anyone know if Julia respects the memory limits Slurm enforces? And if not, does anyone know how to get Julia to respect them?
The current fix proposed by someone on the Julia Slack is to add calls to the garbage collector every hour or so (which has solved the issue, afaict). I’ve done this in my current experiments, but it seems like a temporary fix. Any other ideas?
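For reference, here is a minimal sketch of the workaround from Slack: force a full collection on a timer inside the main loop. `run_episode!` and `train!` are hypothetical names standing in for the actual workload; the one-hour interval is the value suggested above.

```julia
# Periodically force a full GC during a long-running job so resident
# memory stays under the Slurm allocation. This is a workaround sketch,
# not a fix for the underlying issue.

const GC_INTERVAL_SECONDS = 3600  # roughly every hour, per the Slack suggestion

run_episode!(agent) = nothing      # placeholder for the real per-step work

function train!(agent, n_episodes)
    last_gc = time()
    for ep in 1:n_episodes
        run_episode!(agent)        # the regular, repeated work
        if time() - last_gc > GC_INTERVAL_SECONDS
            GC.gc(true)            # full collection; frees memory back to the OS
            last_gc = time()
        end
    end
end
```

`GC.gc(true)` requests a full (rather than incremental) collection, which is what matters here since the goal is returning memory to the OS before Slurm’s cgroup limit kills the job.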
This has been raised before (Should memory in worker be freed after fetching result?), and a solution adding the ability to set a maximum heap size was discussed quite a while ago in command line flag to limit heap memory usage? · Issue #17987 · JuliaLang/julia · GitHub.