@distributed fails for many workers?

#1

Hello folks,

I am currently running some data analysis on a cluster with 68 cores per node. To this end I set up a SharedArray for 68 workers and let them operate on it via @sync @distributed. For small datasets this works fine on my local machine.

If, however, I launch my Julia script on the cluster node, the workers seem to connect to the master before the iteration starts (at least nprocs() = 69), but then top shows at most 3 workers actually doing anything. Meanwhile, the master keeps consuming more and more memory until the code crashes. I really do not understand what is going wrong here, since locally I cannot observe any memory leak. Any ideas how to test what's going on with the workers?

I am using Julia 1.0.1.

Hoping somebody can help out.

EDIT 1: Below you can find a MWE that causes this for me.

using Distributed
using SharedArrays

addprocs(68, topology=:master_worker)

A = ones(Float64, 100000000)
B = SharedArray{Float64,1}(length(A))

@sync @distributed for i in 1:length(A)
    B[i] = A[i]
end

EDIT 2:

The problem also shows up without the shared array: replace the loop body with e.g. println(A[i]). This yields

From Worker N: 1.0

But the workers print their id (=N) sequentially, with only one operating at a time: first N = 2, then N = 7, then some other id, and so on.

EDIT 3:

After further investigation, it seems that I am doing something wrong when allocating resources for the interactive session or when submitting the job: I suspect all processes are started on the same core, causing them to execute serially rather than in parallel. Can somebody explain how to allocate resources on a Slurm cluster for the above shared-memory problem? So far I used

salloc --nodes=1 --ntasks=1 --cpus-per-task=68

and then started the Julia code from above via julia MWE.jl.
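Before launching Julia, one thing I can at least check is what the interactive shell is actually allowed to use (a sketch, Linux-only; on the cluster node I would replace `-p $$` with `-C julia` to inspect the worker processes instead of the current shell):

```shell
# Number of CPUs available to this shell -- if this prints 1, the
# serial execution of the workers would be explained:
nproc

# The kernel's view of this shell's CPU affinity:
grep Cpus_allowed_list /proc/self/status

# PSR = processor each process last ran on; identical PSR values for
# all julia processes would mean they share a single core:
ps -o pid,psr,comm -p $$
```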

#2

Do the allocations of A and B succeed without crashing? And can you confirm that all 68 workers remain alive while that loop is running (they don’t get killed by OOM or something similar)?

#3

The allocation succeeds. procs(B) shows all 68 workers, so B should also be properly mapped. How would you confirm the latter? From top I only see that a small number of workers (2 or 3) are active, with the rest apparently sleeping. The IDs of the active processes change over time, though.

#5

Are those IDs actual workers, or just other threads of the master process? It’s possible the libuv threads of the master process are doing all that work for whatever reason, which wouldn’t be very obvious from the output of top (which is why I use htop).

#6

These are separate processes, not threads.
I see many processes sitting in epoll_pwait while three or four processes are running.
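On Linux you can double-check this from /proc (a sketch, using the current shell's PID as a stand-in for a Julia worker PID): every thread of a process appears as an entry under /proc/&lt;pid&gt;/task, and /proc/&lt;pid&gt;/wchan names the kernel function a blocked task is waiting in.

```shell
# One subdirectory per thread; a single-threaded process lists only
# its own PID here, a multi-threaded one lists every thread id:
ls /proc/$$/task

# Kernel wait channel of the process (0 or empty when runnable):
cat /proc/$$/wchan; echo
```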

If I run the minimal example I do not get a crash (ARM64 platform, lots of RAM, Julia 1.1).

#7

Julia itself uses threading for each process (via libuv) to service things like syscalls and other blocking operations. My point was that, if for some reason you were hitting an endless stream of syscalls, it would probably look like 2-3 threads running eternally (although I think libuv starts 4 by default). Especially if you’re seeing a ton of epoll_pwait, which is what libuv calls to wait on its set of file descriptors.

#8

I am actually not sure if they are threads or processes; I'd have to check. How do you tell from htop?

#9

But you can reproduce that not all workers participate equally in executing the loop?

#10

Remember not all workers in a parallel computation are guaranteed to finish at the same time.

#11

Once you enable “Tree view” in the htop settings, htop shows them branching off of the tree just like processes, but puts them (in my case) in a slightly dimmer color than child processes:
(screenshot of htop's tree view)

In the above image, nvim has one thread (also called nvim), and also one child process (called languageclient).
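If htop is not available, plain ps can make the same distinction (a sketch, assuming Linux with procps; LWP is the thread id, NLWP the thread count):

```shell
# -L prints one line per thread: threads share the PID column but
# have distinct LWP ids; a child process would have a different PID.
ps -o pid,lwp,nlwp,comm -L -p $$
```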

#12

Sure. But they should at least all start more or less simultaneously, not one after another.

#13

Thanks. I’ll try that out

#14

I don’t know if your Edit has been responded to, but try running srun instead of salloc.

#15

It has not been responded to yet; I should have written this more explicitly, however. After the salloc I do srun --pty bash and start the script afterwards. Alternatively I can also call it directly from the Julia REPL. That should be the same as running directly with srun, right?
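Spelled out, the full sequence looks like the sketch below; the repeated --cpus-per-task on srun is a guess on my part, since I am not sure the inner srun inherits it from the salloc allocation:

```shell
# Request one node, one task, 68 CPUs for that task:
salloc --nodes=1 --ntasks=1 --cpus-per-task=68

# Interactive shell inside the allocation (repeating --cpus-per-task
# here is a guess, in case srun does not inherit it):
srun --ntasks=1 --cpus-per-task=68 --pty bash

# Then run the script:
julia MWE.jl
```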

#16

Tested it; these are indeed separate Julia processes, not threads.

#17

You have a 68-core compute node? May I ask what architecture the processors are: Intel, AMD, or ARM?

#18

It's an Intel architecture with KNL (Knights Landing) processors.

#19

What happens when you use 48 workers?
I ask since there is this in libuv:

  assert(timeout >= -1);
  base = loop->time;
  count = 48; /* Benchmarks suggest this gives the best throughput. */
  real_timeout = timeout;

  for (;;) {
    /* See the comment for max_safe_timeout for an explanation of why
     * this is necessary.  Executive summary: kernel bug workaround.
     */
    if (sizeof(int32_t) == sizeof(long) && timeout >= max_safe_timeout)
      timeout = max_safe_timeout;

    nfds = epoll_pwait(loop->backend_fd,
                       events,
                       ARRAY_SIZE(events),
                       timeout,
                       psigset);

#20

Someone please remind me how to quote code on here…

#21

Use ``` before and after your code snippet.