Issues using Stan.jl on a cluster to run embarrassingly parallelisable experiments


So I have a working Stan experiment on my local machine. But when I wrap the main part in a pmap, inside which I load a series of models, run sampling, and write evaluation metrics to a file, I run into problems on the cluster. I am struggling to debug what is going wrong in the move from local to cluster, because there is no error output and Julia does not crash; the models even seem to compile:
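For context, the structure of my script is roughly the following sketch. The helper names (load_model, run_sampling, write_metrics) are stand-ins for the actual Stan.jl compile/sample/write steps, not real API calls, and the worker count and experiment grid are illustrative:

```julia
using Distributed
addprocs(4)  # on the cluster this would match the Slurm allocation

@everywhere begin
    # Hypothetical stand-ins for the real Stan.jl calls:
    load_model(name) = name                 # compiles <name>.stan in a tmp dir
    run_sampling(model, data) = rand(1000)  # sampling, ~3 s per model locally
    write_metrics(path, draws) =
        open(io -> println(io, sum(draws) / length(draws)), path, "w")
end

# One entry per (model, seed) combination; the model names match my logs.
experiments = [(model = m, seed = s)
               for m in ["Beta", "Weighted", "Naive", "NoSynth"], s in 1:10]

# Each experiment is independent, so pmap farms them out to the workers.
pmap(experiments) do exp
    model = load_model(exp.model)
    draws = run_sampling(model, nothing)
    write_metrics("metrics_$(exp.model)_$(exp.seed).csv", draws)
end
```

Locally this whole loop finishes in well under a minute; it is only on the cluster that it hangs.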

slurmstepd: error: *** STEP 252668.0 ON cnode33 CANCELLED AT 2020-05-15T15:17:40 ***

signal (15): Terminated
in expression starting at none:0
epoll_pwait at /lib64/ (unknown line)
uv__io_poll at /workspace/srcdir/libuv/src/unix/linux-core.c:270
uv_run at /workspace/srcdir/libuv/src/unix/core.c:359
jl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:449
poptaskref at ./task.jl:702
wait at ./task.jl:709 [inlined]
task_done_hook at ./task.jl:444
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2144 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2322
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1700 [inlined]
jl_finish_task at /buildworker/worker/package_linux64/build/src/task.c:198
start_task at /buildworker/worker/package_linux64/build/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
Allocations: 145913469 (Pool: 145888642; Big: 24827); GC: 111

/gpfs/home/dcs/csrxgb/julia_stuff/src/logistic_regression/tmp/HGCoIY86ky7B/Beta.stan updated.

/gpfs/home/dcs/csrxgb/julia_stuff/src/logistic_regression/tmp/dsTQsTeiYfZw/Weighted.stan updated.

/gpfs/home/dcs/csrxgb/julia_stuff/src/logistic_regression/tmp/55ovK4FHBXjB/Naive.stan updated.

/gpfs/home/dcs/csrxgb/julia_stuff/src/logistic_regression/tmp/BwHGP7ZeJzpD/NoSynth.stan updated.
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection refused

However, progress then hangs. Sampling should take around 3 seconds, but I left this running for 50 minutes before eventually cancelling it, at which point I could retrieve the output above from the worker logs. There is no further error or any indication of what is going wrong. Does anyone have experience with problems like this?