I am currently running some data analysis on a cluster with 68 cores per node. To this end, I set up a SharedArray for 68 workers and let them operate on it via @sync @distributed. For small datasets this works fine on my local machine.
If, however, I launch my Julia script on the cluster node, the workers seem to connect to the master (at least nprocs() = 69) before the iteration, but then top shows at most 3 workers actually doing anything. At the same time, the master keeps consuming more and more memory until the code crashes. I really do not understand the problem, since locally I cannot observe any memory leak. Any ideas how to test what's going on with the workers?
I am using Julia 1.0.1.
Hoping somebody can help out.
EDIT 1: Below you can find an MWE that reproduces this for me.
using Distributed
using SharedArrays
addprocs(68, topology=:master_worker)
A = ones(Float64, 100000000)
B = SharedArray{Float64, 1}(length(A))
@sync @distributed for i in 1:length(A)
    B[i] = A[i]
end
EDIT 2:
The same behavior also shows up without the shared array: use e.g. println(A[i]) in the loop body instead of writing to B. This yields
From Worker N: 1.0
However, the workers print their output (with id N) sequentially, only one operating at a time: first N = 2, then N = 7, then some other id, and so on.
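For reference, the print variant I mean looks roughly like this (just a sketch, with a much smaller array so the output stays readable; myid() is the id of the worker executing the iteration):

using Distributed
addprocs(68, topology=:master_worker)
A = ones(Float64, 1000)   # much smaller than in the MWE, only to keep the output short
@sync @distributed for i in 1:length(A)
    # myid() is the worker running this iteration; with 68 workers operating
    # in parallel the printed ids should interleave rather than appear in blocks.
    println("worker $(myid()): A[$i] = $(A[i])")
end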
EDIT 3:
After further investigation, it seems to me that I am doing something wrong when allocating resources for an interactive session or when submitting the job. My guess is that all processes are started on the same core, causing them to execute serially rather than in parallel. Can somebody explain how to allocate resources on a Slurm cluster for the above shared-memory problem? So far I used
salloc --nodes=1 --ntasks=1 --cpus-per-task=68
and then started the Julia code from above via julia MWE.jl.
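To test the pinning suspicion from inside the allocation, something like the following sketch should work (assuming the compute node is Linux and has taskset from util-linux); it prints the CPU affinity mask of the master and of every worker, so if all masks contain only a single CPU, the processes really are pinned to one core:

using Distributed
addprocs(68, topology=:master_worker)
# Affinity mask of the master process.
run(`taskset -cp $(Base.Libc.getpid())`)
# Affinity mask of each worker process.
for w in workers()
    wpid = remotecall_fetch(Base.Libc.getpid, w)
    run(`taskset -cp $wpid`)
end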
Do the allocations of A and B succeed without crashing? And can you confirm that all 68 workers remain alive while that loop is running (they don’t get killed by OOM or something similar)?
The allocation succeeds. procs(B) shows all 68 workers, so B should also be properly mapped. How would you confirm the latter? From top I only see that a small number of workers (2 or 3) is active, with the rest apparently sleeping. The IDs of the active processes change over time, though.
Are those IDs actual workers, or just other threads of the master process? It’s possible the libuv threads of the master process are doing all that work for whatever reason, which wouldn’t be very obvious from the output of top (which is why I use htop).
Julia itself uses threading for each process (via libuv) to service things like syscalls and other blocking operations. My point was that, if for some reason you were hitting an endless stream of syscalls, it would probably look like 2-3 threads running eternally (although I think libuv starts 4 by default). Especially if you’re seeing a ton of epoll_pwait, which is what libuv calls to wait on its set of file descriptors.
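If you want to see how many OS threads your master Julia process actually has, here is a quick sketch (Linux only, since it just reads /proc):

# The "Threads:" field of /proc/self/status is the kernel's thread count
# for the current (master) Julia process.
for line in eachline("/proc/self/status")
    startswith(line, "Threads:") && println(line)
end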
Once you enable "Tree view" in the htop settings, htop shows threads branching off the tree just like processes, but puts them (in my case) in a slightly dimmer color than child processes. For example, in my htop, nvim has one thread (also called nvim) and one child process (called languageclient).
It has not been responded to yet. I should have written this more explicitly, however. After the salloc I do srun --pty bash and start the script afterwards. Alternatively, I can also call it directly from the Julia REPL. That should be the same as running it directly with srun, right?
What happens when you use 48 workers?
I ask since there is this in libuv:
assert(timeout >= -1);
base = loop->time;
count = 48; /* Benchmarks suggest this gives the best throughput. */
real_timeout = timeout;

for (;;) {
  /* See the comment for max_safe_timeout for an explanation of why
   * this is necessary. Executive summary: kernel bug workaround.
   */
  if (sizeof(int32_t) == sizeof(long) && timeout >= max_safe_timeout)
    timeout = max_safe_timeout;

  nfds = epoll_pwait(loop->backend_fd,
                     events,
                     ARRAY_SIZE(events),
                     timeout,
                     psigset);
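So, purely to test whether that 48 matters, here is a sketch of your MWE with 48 workers and everything else unchanged:

using Distributed
using SharedArrays
addprocs(48, topology=:master_worker)   # 48 instead of 68, otherwise identical to the MWE
A = ones(Float64, 100000000)
B = SharedArray{Float64, 1}(length(A))
@sync @distributed for i in 1:length(A)
    B[i] = A[i]
end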