@distributed fails for many workers?

Hello folks,

I am currently running some data analysis on a cluster with 68 cores per node. To this end I set up a SharedArray for 68 workers and let them operate on it via @sync @distributed. For small datasets this works fine on my local machine.

If I however launch my Julia script on the cluster node, the workers seem to connect to the master (at least nprocs() = 69) before the iteration, but then I see at most 3 workers really doing something using top. Simultaneously the master keeps consuming more and more memory until the code crashes. I really do not get the problem behind this, since locally I cannot observe any memory leakage. Any ideas how to test what’s going on with the workers?

I am using Julia 1.0.1.

Hoping somebody can help out.

EDIT 1: Below you can find a MWE that causes this for me.

using Distributed
using SharedArrays

addprocs(68, topology=:master_worker)

A = ones(Float64, 100000000)
B = SharedArray{Float64, 1}((length(A)))

@sync @distributed for i in 1 : length(A)
   B[i] = A[i]
end

EDIT 2:

The MWE also works without the shared array. Use e.g. println(A[i]) in the loop. This yields

From Worker N: 1.0

But the workers print their id (=N) sequentially, only one operating at a time. So first N = 2, then N = 7, then some other id and so on.

EDIT 3:

After further investigation, it seems to me that I am doing something wrong when allocating resources for an interactive session or when submitting the job. Namely, I guess all processes are started on the same core, causing them to execute serially and not in parallel. Can somebody explain how to allocate resources on a slurm cluster for the above shared memory problem? So far I used

salloc --nodes=1 --ntasks=1 --cpus-per-task=68

and then started the julia code from above via julia MWE.jl.

Do the allocations of A and B succeed without crashing? And can you confirm that all 68 workers remain alive while that loop is running (they don’t get killed by OOM or something similar)?

The allocation succeeds. procs(B) shows all 68 workers, so B should also be properly mapped. How would you confirm the latter? From top I only see that some small number of workers (2 or 3) is active with the rest apparently sleeping. The IDs of the respective processes change though.

Are those IDs actual workers, or just other threads of the master process? It’s possible the libuv threads of the master process are doing all that work for whatever reason, which wouldn’t be very obvious from the output of top (which is why I use htop).

These are separate processes, not threads.
I see many processes sitting in epoll_pwait as three or four processes are running.

If I run thei minimal example I do not get a crash - ARM 64 platform, lots of RAM, Julia 1.1

Julia itself uses threading for each process (via libuv) to service things like syscalls and other blocking operations. My point was that, if for some reason you were hitting an endless stream of syscalls, it would probably look like 2-3 threads running eternally (although I think libuv starts 4 by default). Especially if you’re seeing a ton of epoll_pwait, which is what libuv calls to wait on its set of file descriptors.

I am actually not sure if they are threads or processes, I’d have to check. How do u tell from htop?

But you can reproduce that not all workers participate equally in executing the loop?

Remember not all workers in a parallel computation are guaranteed to finish at the same time.

Once you enable “Tree view” in the htop settings, htop shows them branching off of the tree just like processes, but puts them (in my case) in a slightly dimmer color than child processes:
htop

In the above image, nvim has one thread (also called nvim), and also one child process (called languageclient).

Sure. But they should at least all start more or less simultaneously and not one after another

Thanks. I’ll try that out

I don’t know if your Edit has been responded to, but try running srun instead of salloc.

It has not been responded to yet. Should have written this more explicitly however. After the salloc I do srun - - pty bash and start the script afterwards. Alternatively I can also call it directly from the Julia REPL. That should be the same as running directly with srun right?

Tested it, these are indeed julia processes not process threads.

You have a 68 core compute node? May I Ask what architecture the processors are? Intel, AMD or ARM?

Its an Intel Architecture with KNL processors

What happens when you use 48 workers?
I ask since there is this in lubuv

assert(timeout >= -1);
  base = loop->time;
  count = 48; /* Benchmarks suggest this gives the best throughput. */
  real_timeout = timeout;

  for (;;) {
    /* See the comment for max_safe_timeout for an explanation of why
     * this is necessary.  Executive summary: kernel bug workaround.
     */
    if (sizeof(int32_t) == sizeof(long) && timeout >= max_safe_timeout)
      timeout = max_safe_timeout;

    nfds = epoll_pwait(loop->backend_fd,
                       events,
                       ARRAY_SIZE(events),
                       timeout,
                       psigset);

Someone please remind me how to quote code on here…

use ``` before and after your code snippet