DrWatson using tmpsave to save many files fails

I am using DrWatson to run simulations on a cluster, following the scheme for running many jobs from Real World Examples · DrWatson. The problem is that when I use too many parameters (which generate ~180 files), tmpsave fails.
I get this error:

signal (7): Bus error
in expression starting at /home/labs/orenraz/roiho/Mpemba/AntiFerroMonteCarlo/scripts/monte_carlo_prepare_job.jl:43
unsafe_store! at ./pointer.jl:118 [inlined]
unsafe_store! at ./pointer.jl:118 [inlined]
jlunsafe_store! at /home/labs/orenraz/roiho/.julia/packages/JLD2/KN6F6/src/JLD2.jl:42 [inlined]
jlunsafe_store! at /home/labs/orenraz/roiho/.julia/packages/JLD2/KN6F6/src/misc.jl:15 [inlined]
_write at /home/labs/orenraz/roiho/.julia/packages/JLD2/KN6F6/src/mmapio.jl:190 [inlined]
jlwrite at /home/labs/orenraz/roiho/.julia/packages/JLD2/KN6F6/src/misc.jl:27 [inlined]
commit at /home/labs/orenraz/roiho/.julia/packages/JLD2/KN6F6/src/datatypes.jl:275
h5fieldtype at /home/labs/orenraz/roiho/.julia/packages/JLD2/KN6F6/src/data/writing_datatypes.jl:364
unknown function (ip: 0x7f7e4b771a3d)

When I run the same script on my machine, it works fine.
So it is probably related to the cluster. Does anyone know how to solve this?

How are you calling tmpsave?
I guess the directory ‘tmp’ is actually /home/username/tmp.
If tmp is actually /tmp, that is a different directory on each cluster node.

I would start by looking at your limits when running a job. It would be very unusual to be limited to 180 open files - but who knows.
Assuming you are using Slurm, start an interactive job and then show the limits:
srun --nodes=1 --pty bash -i

ulimit -a
exit

When I call tmpsave I simply do tmpsave(dicts), which automatically writes to the folder _research/tmp/ inside the DrWatson project folder.
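
For reference (and to the earlier point about /tmp): tmpsave does not touch /tmp; its default target is the project-local directory and, if I read the docs right, the directory can also be passed explicitly as a second argument:

using DrWatson
@quickactivate                                    # activate the DrWatson project

tmpsave(dicts)                                    # default: projectdir("_research", "tmp")
tmpsave(dicts, projectdir("_research", "tmp"))    # same target, spelled out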

I am not sure exactly how to identify the cluster in a way that people would recognize. Is IBM HPC sufficient?

Just to make sure I was clear: I am running a Julia script which saves 180 files and then submits 180 jobs, where each submission takes a file as an argument.
So I hit the problem even before the submission: simply running the Julia script causes the error when tmpsave is executed.
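
Roughly, the prepare script follows the pattern from the DrWatson real-world example; the parameter grid, the run-script name, and the submission command below are placeholders for illustration:

using DrWatson
@quickactivate

general_args = Dict("T" => collect(0.1:0.1:3.0), "L" => [16, 32, 64])  # hypothetical grid
dicts = dict_list(general_args)      # ~180 parameter dictionaries in the real case
res = tmpsave(dicts)                 # <- the Bus error happens here, before any job is submitted
for r in res
    # each job receives one saved parameter file as its argument
    # placeholder submission; use whatever the scheduler expects (the DrWatson docs show qsub)
    run(`bsub julia scripts/monte_carlo_run.jl $r`)
end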

@roi.holtzman An IBM HPC will probably be using LSF.
This is an IBM Blue Gene and not a Lenovo system?
I would still run an interactive batch job and have a look at what filesystems are present when you are running under batch, and whether you can ‘see’ those tmp files.

It has been many years since I used LSF; however, I think this should work:

bsub -Is bash
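
Once inside the interactive session, a quick check along these lines would show the relevant state (the _research/tmp path assumes you are in the DrWatson project directory):

ulimit -a                      # resource limits on the batch/compute node
df -h .                        # which filesystem backs the current directory, and how full it is
ls _research/tmp | wc -l       # how many tmp files the prepare script has managed to write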


Emmmm… Looking at the pages for WEXAC… are you sure that the home folder on your personal machine is the same as your home folder on the HPC cluster?

I think you might have to do that interactive login and either
a) sync those files across
or
b) run tmpsave directly on the cluster storage

It looks like you are using Spectrum Scale (GPFS) storage on the HPC.
Can you ssh into the HPC and run these commands:
pwd
df -h .

Do you see the same set of files you see on your personal machine? Probably not…

Can you ssh into the HPC cluster?

Once we have this figured out, it might be possible to mount the HPC home directory as an sshfs filesystem.
You can then drag and drop files between both locations.
I did not tell you this… :wink:
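
For reference, a typical sshfs mount looks roughly like this (the host name is a placeholder):

mkdir -p ~/hpc_home
sshfs roiho@cluster.example.org:/home/labs/orenraz/roiho ~/hpc_home
# ... drag and drop / copy files via ~/hpc_home ...
fusermount -u ~/hpc_home       # unmount when done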

I can ssh to the cluster. This is how I run the jobs.
You’re right that the files on the cluster are not the same as the ones on my own machine. However, I put everything there; all of the files are there.

I think a crucial hint is that when I have a small number of files (~20), everything works fine.

This is weird. The ulimits I have on an example HPC cluster are below.
Try ulimit -a please.

$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1028842
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 1048576
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 250556
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

This is what I find

$  ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 771761
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) 3000000
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) 43200
max user processes              (-u) 4096
virtual memory          (kbytes, -v) 3000000
file locks                      (-x) unlimited

That seems small to me. I usually have root access, so I would use the ‘lsof’ command to count the file descriptors a user has open.
Run ‘lsof -u youruserid’

Not a lot. The number of lines is 90:

$ lsof -u roiho | wc -l
90

Is the full output important? I didn’t want to spam…

The full output is not important - just the number of file descriptors.
My theory is that when you run with the 180 files you are exceeding the 1024 limit.
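
If that theory holds, a low-effort test would be to raise the soft limit in the same shell (or in the batch script) before launching Julia, up to whatever hard limit the site allows:

ulimit -n               # current soft limit (1024 here)
ulimit -Hn              # hard limit; the soft limit can be raised up to this value
ulimit -n 4096          # raise it for this shell, then re-run the Julia script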

I could be very much leading you down the wrong track, so please do not just listen to me!
Have you asked the team who manage the HPC for some help?

It does seem like you’re on to something :) But I don’t know how to solve the issue.
I did contact the HPC team. I hope they respond soon and will update here if they find a solution.

I have logged in to another access server and managed to get higher limits:

$ ulimit -a 
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 767357
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 512000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) 14400
max user processes              (-u) 1024
virtual memory          (kbytes, -v) 50194304
file locks                      (-x) unlimited

But unfortunately, this does not solve the issue.
The cluster team is trying to figure out what the issue is.

The solution has been found, and it is very simple!
The disk space on the cluster had run out, so the Julia script was not able to save the parameter files.
Deleting files and clearing some space solved the issue.
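
For anyone who hits the same signal (7): Bus error from JLD2’s mmapio in the future: a SIGBUS while writing a memory-mapped file is consistent with the underlying filesystem being full, so checking free space and quota first is worthwhile, e.g.:

df -h ~                   # free space on the filesystem holding the home/project directory
du -sh _research/tmp      # how much the tmp files already occupy
quota -s                  # per-user quota, if the site has quotas enabled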