I have a Stan experiment that works on my local machine. However, I run into problems when I wrap the main part in a pmap whose body loads a series of models, runs sampling, and writes evaluation metrics to a file. I am struggling to debug what goes wrong when moving from my local machine to the cluster, because there is no error output and Julia does not crash. The models even seem to compile:
julia_worker:9189#10.1.24.33
slurmstepd: error: *** STEP 252668.0 ON cnode33 CANCELLED AT 2020-05-15T15:17:40 ***

signal (15): Terminated
in expression starting at none:0
epoll_pwait at /lib64/libc.so.6 (unknown line)
uv__io_poll at /workspace/srcdir/libuv/src/unix/linux-core.c:270
uv_run at /workspace/srcdir/libuv/src/unix/core.c:359
jl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:449
poptaskref at ./task.jl:702
wait at ./task.jl:709 [inlined]
task_done_hook at ./task.jl:444
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2144 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2322
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1700 [inlined]
jl_finish_task at /buildworker/worker/package_linux64/build/src/task.c:198
start_task at /buildworker/worker/package_linux64/build/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
Allocations: 145913469 (Pool: 145888642; Big: 24827); GC: 111
1.15
/gpfs/home/dcs/csrxgb/julia_stuff/src/logistic_regression/tmp/HGCoIY86ky7B/Beta.stan updated.
/gpfs/home/dcs/csrxgb/julia_stuff/src/logistic_regression/tmp/dsTQsTeiYfZw/Weighted.stan updated.
/gpfs/home/dcs/csrxgb/julia_stuff/src/logistic_regression/tmp/55ovK4FHBXjB/Naive.stan updated.
/gpfs/home/dcs/csrxgb/julia_stuff/src/logistic_regression/tmp/BwHGP7ZeJzpD/NoSynth.stan updated.
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
However, progress then hangs. I left the job running for 50 minutes, although sampling should take around 3 seconds, so I eventually cancelled it and could then get this output from the worker logs. There is no further error or indication of what is going wrong. Does anyone have experience with problems like this?
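
For reference, here is a minimal sketch of the structure described above. It is not the actual experiment code: the names run_task and outdir are hypothetical, two local workers stand in for the SLURM-launched ones, and a trivial placeholder stands in for loading the Stan models and sampling.

```julia
using Distributed

# Assumption: two local workers stand in for the workers started via SLURM.
addprocs(2)
@everywhere using Distributed

# Hypothetical per-task body. In the real experiment this is where the Stan
# models are loaded, sampled from, and evaluated.
@everywhere function run_task(i, outdir)
    metric = i^2  # placeholder for sampling plus evaluation
    path = joinpath(outdir, "metrics_$(i).txt")
    # Each task writes its own file, so workers never share a file handle.
    open(path, "w") do io
        println(io, "task=$(i) worker=$(myid()) metric=$(metric)")
    end
    return path
end

# Assumption: a temporary directory stands in for the real output location.
outdir = mktempdir()

# pmap distributes the tasks over the available workers and collects results.
paths = pmap(i -> run_task(i, outdir), 1:4)
println(paths)
```

In this sketch every task writes to its own file and returns the path, so the main process can check afterwards which tasks actually finished.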