I am using DrWatson to run simulations on a cluster. I use this scheme to run many jobs - Real World Examples · DrWatson. The problem is that when I use too many parameters (which generate ~180 files), tmpsave fails.
I get this error:
signal (7): Bus error
in expression starting at /home/labs/orenraz/roiho/Mpemba/AntiFerroMonteCarlo/scripts/monte_carlo_prepare_job.jl:43
unsafe_store! at ./pointer.jl:118 [inlined]
unsafe_store! at ./pointer.jl:118 [inlined]
jlunsafe_store! at /home/labs/orenraz/roiho/.julia/packages/JLD2/KN6F6/src/JLD2.jl:42 [inlined]
jlunsafe_store! at /home/labs/orenraz/roiho/.julia/packages/JLD2/KN6F6/src/misc.jl:15 [inlined]
_write at /home/labs/orenraz/roiho/.julia/packages/JLD2/KN6F6/src/mmapio.jl:190 [inlined]
jlwrite at /home/labs/orenraz/roiho/.julia/packages/JLD2/KN6F6/src/misc.jl:27 [inlined]
commit at /home/labs/orenraz/roiho/.julia/packages/JLD2/KN6F6/src/datatypes.jl:275
h5fieldtype at /home/labs/orenraz/roiho/.julia/packages/JLD2/KN6F6/src/data/writing_datatypes.jl:364
unknown function (ip: 0x7f7e4b771a3d)
When I run the same script on my machine, it works fine.
So it is probably related to the cluster. Does anyone know how to solve this?
How are you calling tmpsave?
I guess the directory ‘tmp’ is actually /home/username/tmp.
If tmp is actually /tmp, that is different on each cluster node.
I would start by looking at your limits when running a job. It would be very unusual to be limited to 180 open files - but who knows.
Assuming you are using Slurm, start an interactive job and then show the limits:
srun --nodes=1 --pty bash -i
When I call tmpsave I simply do tmpsave(dicts), which writes automatically to the folder _research/tmp/. This folder is located in the DrWatson project folder.
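For concreteness, the relevant part of the script looks roughly like this (a minimal sketch; the parameter names and values are placeholders, not the real ones):

```julia
using DrWatson
@quickactivate "AntiFerroMonteCarlo"   # project name taken from the path in the stack trace

# Placeholder parameters -- the real script has more of them, which is what
# produces the ~180 combinations.
general_args = Dict(
    "T" => collect(0.1:0.1:1.0),
    "L" => [16, 32, 64],
    "N_steps" => [10^6],
)

# dict_list expands the Dict into one Dict per parameter combination.
dicts = dict_list(general_args)

# tmpsave writes each Dict to a .jld2 file under _research/tmp/ and returns
# the file names; this is the call that crashes on the cluster.
filenames = tmpsave(dicts)
```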
I am not sure exactly how to identify the cluster so that people recognize it. Is IBM HPC sufficient?
Just to make sure I was clear: I am running a Julia script which saves 180 files, and then submits 180 jobs, where each submission takes a file as an argument.
So I encounter the problem even before the submission: simply running the Julia script causes the error when tmpsave is executed.
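The submission loop itself follows the pattern from the Real World Examples page. Roughly like this (a sketch only; the queue name, log path, run-script name, and bsub flags below are placeholders for whatever your scheduler expects):

```julia
for f in filenames
    # Each job gets one of the saved parameter files as its only argument.
    submit = `bsub -q short -o logs/out.%J julia scripts/monte_carlo_run_job.jl $(f)`
    run(submit)
end
```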
@roi.holtzman An IBM HPC will probably be using LSF
This is an IBM Blue Gene and not a Lenovo system?
I would still run an interactive batch job and have a look at what filesystems are present when you are running under batch, and whether you can ‘see’ those tmp files.
It is many years since I used LSF; however, I think this should work.
Once we have this figured out it might be possible to mount the HPC home directory as an sshfs filesystem.
You can then drag and drop files between both locations.
I did not tell you this…
I can ssh to the cluster. This is how I run the jobs.
You’re right that the files I have on the cluster are not the same as those on my own machine. However, I put everything there; all of the files are there.
I think that a crucial hint is that when I have a small number of files (~20), everything works fine.
That seems small to me. I usually have root access, so I would use the ‘lsof’ command to count the file descriptors a user has open.
Run ‘lsof -u youruserid’
The full output is not important - just the number of file descriptors.
My theory is that when you run with the 180 files you are exceeding the 1024 limit.
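If it helps, you can get the count (and the per-process limit for comparison) from inside Julia as well. Something like the following sketch, which just shells out to lsof (assuming lsof is installed on the node):

```julia
# Count the file descriptors currently open for your user via lsof
# (minus the header line that lsof prints).
user = ENV["USER"]
nfds = count(!isempty, split(read(`lsof -u $user`, String), '\n')) - 1
println("open file descriptors for $user: $nfds")

# The per-process soft limit (commonly 1024) for comparison:
run(`sh -c "ulimit -n"`)
```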
I could be very much leading you down the wrong track, so please do not just listen to me!
Have you asked the team who manage the HPC for some help?
It does seem like you’re on to something :) But I don’t know how to solve the issue.
I did contact the HPC team. Hope they will respond soon. Will update if they find a solution.
The solution has been found, and it is very simple!
The space on the cluster had run out, and therefore the Julia script was not able to save the parameter files.
Deleting files and clearing some space solved the issue.
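For anyone who hits the same thing: the bus error comes from JLD2 writing through a memory-mapped file (see mmapio.jl in the stack trace), and writing to an mmapped file on a full filesystem can raise SIGBUS. A quick sanity check before calling tmpsave (just a sketch) is to look at the free space on the filesystem holding the project:

```julia
using DrWatson
@quickactivate "AntiFerroMonteCarlo"

# Show the free space on the filesystem that contains the project
# (and therefore _research/tmp/, where tmpsave writes).
run(`df -h $(projectdir())`)
```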