Issues using Stan.jl on a cluster to run embarrassingly parallelisable experiments

HarrisonWilde · May 15, 2020, 3:01pm

Hi,

I have an experiment with some Stan code that works on my local machine. But when I wrap the main part in a pmap, whose body loads a series of models, samples from them, and writes evaluation metrics to a file, I run into issues. I am struggling to debug what goes wrong when I move from local to cluster, as there is no error output and Julia does not crash; the models even seem to compile:

julia_worker:9189#10.1.24.33
slurmstepd: error: *** STEP 252668.0 ON cnode33 CANCELLED AT 2020-05-15T15:17:40 ***

signal (15): Terminated
in expression starting at none:0
epoll_pwait at /lib64/libc.so.6 (unknown line)
uv__io_poll at /workspace/srcdir/libuv/src/unix/linux-core.c:270
uv_run at /workspace/srcdir/libuv/src/unix/core.c:359
jl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:449
poptaskref at ./task.jl:702
wait at ./task.jl:709 [inlined]
task_done_hook at ./task.jl:444
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2144 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2322
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1700 [inlined]
jl_finish_task at /buildworker/worker/package_linux64/build/src/task.c:198
start_task at /buildworker/worker/package_linux64/build/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
Allocations: 145913469 (Pool: 145888642; Big: 24827); GC: 111
1.15

/gpfs/home/dcs/csrxgb/julia_stuff/src/logistic_regression/tmp/HGCoIY86ky7B/Beta.stan updated.

/gpfs/home/dcs/csrxgb/julia_stuff/src/logistic_regression/tmp/dsTQsTeiYfZw/Weighted.stan updated.

/gpfs/home/dcs/csrxgb/julia_stuff/src/logistic_regression/tmp/55ovK4FHBXjB/Naive.stan updated.

/gpfs/home/dcs/csrxgb/julia_stuff/src/logistic_regression/tmp/BwHGP7ZeJzpD/NoSynth.stan updated.
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection refused

However, progress then hangs. Sampling should take around 3 seconds, but I left this running for 50 minutes before eventually cancelling it, at which point I could get the output above from the worker logs. There is no further error or indication of what is going wrong. Does anyone have experience with problems like this?
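
For reference, the pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual experiment code: `run_experiment` is a hypothetical placeholder for the part that loads the models, samples, and writes metrics, it makes no Stan.jl calls, and the two local workers stand in for the workers the cluster would provide. Passing `on_error` to `pmap` makes a failing task return its exception rather than abort the map, which may help when no error output is visible.

```julia
using Distributed

# Assumption: on the cluster the workers come from the Slurm allocation;
# two local workers stand in for them so the sketch runs anywhere.
nprocs() == 1 && addprocs(2)

@everywhere function run_experiment(i)
    # Hypothetical placeholder for: load models, sample, compute metrics.
    workdir = mktempdir()                       # private scratch directory per task
    metric = i^2                                # dummy "evaluation metric"
    outfile = joinpath(workdir, "metrics.csv")
    write(outfile, "id,metric\n$i,$metric\n")   # write the metric to a file
    return (id = i, host = gethostname(), file = outfile)
end

# on_error = identity returns the exception as that task's result
# instead of aborting the whole pmap call.
results = pmap(run_experiment, 1:4; on_error = identity)

for r in results
    println(r isa Exception ? "failed: $r" : "task $(r.id) ran on $(r.host)")
end
```

Printing the host name for each task is only there to show which node actually ran it; a hang inside a task would still not be reported by `on_error`, since that only captures thrown exceptions.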
