Hi
I run my Julia code on a cluster and frequently save the output data to a JLD2 file using jldsave. On my desktop computer, performance remains constant, but on the cluster it drops significantly each time the data is stored (overwriting the previous file). I am new to Julia and not at all a specialist, but I suspect this is an issue with the computing environment rather than with Julia itself. I hope someone here can help.
Many thanks in advance, J
Hi,
Seeing as your issue depends on the computer environment, it would help if you could post more information about this environment (e.g. OS, filesystem), and in particular how it differs from the one where it does work. When you use System Monitor / Task Manager, do you see anything out of the ordinary (e.g. 100% disk usage)?
You should also post (the relevant part of) this code, ideally condensed down to an MWE. See also "Please read: make it easier to help you" (points 4 and 5).
Hi eldee
Thanks for the reply.
- On the cluster, there is a batch system (slurm) running on Ubuntu, CPU usage close to 100%, and memory increasing slightly after saving the file I talked about in the main post. The increase is about 200 MB, while the file remains at the same size of around 2 GB. Total memory use is ~9 GB.
- On the desktop I use VS Code on Windows. One timestep consistently takes around 30 seconds there, while on the cluster it takes around 17 seconds, increasing to 46 seconds after saving files.
- Julia 1.10.3 on the cluster and 1.10.4 on the desktop.
- I run a garbage collection after each time step, but it makes no difference compared to not doing so. Still, I think there may be a memory leak somewhere, again without knowing much about the topic.
For now, I use a workaround: I stop the code when saving a file and start a new run from the respective time step, which then runs at the initial performance again.
Sorry for not being very clear. I tried to make an MWE, but the code is so huge, and without knowing where the problem might be, I don’t know where to start…
Many thanks
If this does indeed work faster than just keeping the code running (i.e. takes less than 46 s per timestep), then it seems to me that it’s not just an issue with the computer environment, as it shows the OS is capable of more computations and writes. So I would look more into the concrete Julia code or JLD2.jl itself.
Assuming the title of the topic is correct, the logical place to start would be to isolate the part with jldsave. If it’s too difficult to reduce the full code into an MWE, i.e. reduce the complexity while retaining the problems, you could conversely also start with simple working code, and add more complexity until you encounter the same problems.
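As a starting point for the "simple working code" route, something like the following might do. It is only a sketch: fake_timestep! and time_save_loop are hypothetical names, the random array is a stand-in for your real data, and the sizes are arbitrary. It times the compute step and the jldsave call separately while overwriting the same file, so you can see whether either of them slows down over the iterations.

```julia
using JLD2

# Hypothetical stand-in for one timestep of the real simulation.
function fake_timestep!(state)
    state .= sin.(state) .+ 1.0
    return state
end

# Time compute and save separately, overwriting the same file each step.
function time_save_loop(path; nsteps = 5, n = 10^6)
    state = rand(n)
    step_times = Float64[]
    save_times = Float64[]
    for istep in 1:nsteps
        t_step = @elapsed fake_timestep!(state)
        t_save = @elapsed jldsave(path; state, istep)
        push!(step_times, t_step)
        push!(save_times, t_save)
        println("step $istep: compute $(round(t_step; digits = 4)) s, ",
                "save $(round(t_save; digits = 4)) s")
    end
    return step_times, save_times
end

path = joinpath(mktempdir(), "checkpoint.jld2")
step_times, save_times = time_save_loop(path)
```

If the compute time stays flat here on the cluster, you can gradually swap in pieces of the real code (larger arrays, your actual data structures) until the slowdown appears.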
Some things you could try, if you haven’t already:
- If you keep the full code, but comment out the jldsave line, does everything then work fine?
- What if you do save, but just some random new data?
- What happens if you don’t overwrite the same file every time, but instead always create a new one?
- Does switching from JLD2.jl to another package for saving help?
- If you don’t use SLURM (and I guess your code then only runs on a single node), do you encounter the same issues?
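To make a couple of these experiments concrete, here is a rough sketch that saves random data either to the same file or to a new file each step, and prints memory figures after every save. The names checkpoint_path and compare_save_modes are made up for illustration, and the array size is arbitrary. Note that Base.gc_live_bytes is an unexported internal function, so treat its output as a rough indication only; Sys.maxrss reports the peak resident memory of the process.

```julia
using JLD2

# Hypothetical helper: same file every time, or one file per step.
checkpoint_path(dir, istep; overwrite) =
    overwrite ? joinpath(dir, "checkpoint.jld2") :
                joinpath(dir, "checkpoint_$(istep).jld2")

function compare_save_modes(dir; nsteps = 5, n = 10^6)
    data = rand(n)  # random stand-in for the real output data
    results = Dict{Bool, Vector{Float64}}()
    for overwrite in (true, false)
        times = Float64[]
        for istep in 1:nsteps
            path = checkpoint_path(dir, istep; overwrite)
            t_save = @elapsed jldsave(path; data, istep)
            push!(times, t_save)
            GC.gc()
            live_mib = Base.gc_live_bytes() / 2^20  # internal, approximate
            peak_mib = Sys.maxrss() / 2^20
            println("overwrite=$overwrite step $istep: ",
                    "save $(round(t_save; digits = 4)) s, ",
                    "live $(round(live_mib; digits = 1)) MiB, ",
                    "peak RSS $(round(peak_mib; digits = 1)) MiB")
        end
        results[overwrite] = times
    end
    return results
end

dir = mktempdir()
results = compare_save_modes(dir)
```

If the live memory keeps growing only in one of the two modes, or the save time climbs with each overwrite, that would narrow down where to look next.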
Hi Eldee
Many thanks for the reply and the great advice. I will try to save much less data and go step by step through the other tips mentioned!
Great, thanks
jbr