I think I’ve found a bug with @everywhere on 1.3, but it’s hard to reproduce. Basically, it takes a very long time to load a package on the workers when using a cluster with multiple nodes (the workload manager is Slurm). See the following results, particularly the output of @time ...
julia> using Distributed, ClusterManagers
julia> addprocs(SlurmManager(512), N=16)
512-element Array{Int64,1}:
2
3
4
⋮
511
512
513
julia> @time @everywhere using mypackage
519.233305 seconds (17.21 M allocations: 878.619 MiB, 0.08% gc time)
It takes over 8 minutes!
At first I thought it was precompiling, but that’s not the case: I tested this by running using mypackage locally first to trigger compilation, and the compiled cache files are automatically copied to the other nodes over the InfiniBand network. So precompilation is not the problem.
I initially thought this was a cluster configuration problem, but running it on 1.0.5 is much, much quicker. In a fresh 1.0.5 session, doing addprocs(SlurmManager(512), N=16) as above, I get
julia> @time @everywhere using mypackage
8.020800 seconds (5.05 M allocations: 245.780 MiB, 1.75% gc time)
(This further shows that precompilation is not the problem.)
I don’t know what has happened to @everywhere between 1.0.5 and 1.3. The problem also doesn’t exist when just doing addprocs(32) (i.e. using only the local processors rather than distributing over multiple machines). It seems like a bug that @everywhere has regressed so much. I am willing to give ssh access to a Julia contributor to debug this if possible. I can also create an issue on GitHub if this is indeed a bug.
I raised a similar (probably the same) issue here:
I am still not satisfied with the current situation, but I now know how to avoid the biggest pitfalls: first load all the packages you need on the master, then add the workers, and only then load the packages the workers actually need on the workers.
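In code, the ordering I mean looks roughly like this (MyPackage is a placeholder for whatever you actually load; adapt the addprocs call to your own cluster):

```julia
using Distributed, ClusterManagers

# 1. Load everything the master needs *before* adding workers, so the master
#    is not competing with hundreds of workers for the same package files.
using MyPackage

# 2. Only then add the workers.
addprocs(SlurmManager(512), N=16)

# 3. Finally, load on the workers only what they really need.
@everywhere using MyPackage
```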
Changing the topology doesn’t do anything; it’s still about 8 minutes for using mypackage… I just can’t figure out what’s going on. I am not sure what the problem is, but it seems to be related to the using statements inside my package itself.
For example, removing using DataFrames reduces the time to about 3 minutes, a huge improvement over the 8 minutes. To reiterate, the packages are already precompiled, so the slowdown must be coming from the __init__ functions or from running the code for any global consts.
But I am still not sure why it takes longer on 1.3 than on 1.0.5; there is code loading in both cases.
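One way to narrow this down (a rough sketch; the extra package names below are just examples of direct dependencies) is to time the distributed load of each dependency separately once the workers are up:

```julia
using Distributed
# workers already added, e.g. addprocs(SlurmManager(512), N=16)

@time @everywhere using DataFrames    # accounted for most of the time in my case
@time @everywhere using CSV           # example dependency
@time @everywhere using Distributions # example dependency
@time @everywhere using mypackage
```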
OK, nowadays I usually use an ElasticManager to manage my workers and tweak the worker connect command (see the linked page) so that all the packages are loaded before ClusterManagers.elastic_worker(...) is called. Depending on how you start your workers, this is gentler on the filesystem. I suspect all this loading takes so long because, all of a sudden, many nodes (in my case 100+) want to load code (precompiled or not) from the same files.
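For concreteness, a rough sketch of that setup (the address, port, cookie and package name are placeholders; the exact connect command is printed by the ElasticManager when you create it):

```julia
# On the master: start an ElasticManager that workers connect to later.
using Distributed, ClusterManagers
em = ElasticManager(addr=:auto, port=9009)

# On each worker (e.g. launched from the Slurm job script): load the heavy
# packages *first*, then hand control over to elastic_worker. Use the connect
# command that `em` prints; the values below are placeholders.
#
#   julia -e 'using MyPackage;
#             using ClusterManagers;
#             ClusterManagers.elastic_worker("cookie", "10.0.0.1", 9009)'
```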
The macro essentially runs remotecall_eval(Main, procs, ex), which my tests show is also slow. Increasing nworkers makes the execution time (as seen from the master) increase linearly. This indicates that it essentially serialises the job …
I checked the code inside remotecall_eval(Main, procs, ex) and it uses @sync_add, which is not defined for me as of Julia 1.5.3.
Or rather, I don’t understand why I get “@sync_add not defined” error messages when I try to use it in a @sync block.
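For reference, the kind of timing experiment mentioned above looks roughly like this (the evaluated expression is a stand-in for the real using statement):

```julia
using Distributed
addprocs(8)

# Call remotecall_eval directly on growing subsets of workers and watch how the
# wall time, as seen from the master, scales with the number of workers.
for n in (1, 2, 4, 8)
    procs = workers()[1:n]
    t = @elapsed Distributed.remotecall_eval(Main, procs, :(sum(rand(10^6))))
    println("n = $n workers: $(round(t, digits = 3)) s")
end
```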
Hi Sixzero,
I have tested @everywhere and pmap() on HPC and the results are here:
Basically, I tested a parallelised version of findpi on the HPC cluster and put timestamps between the lines. That let me pinpoint the time-consuming part, which is the @everywhere macro. Did you have similar situations? How did you work around it?
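A minimal sketch of what I mean by putting timestamps in between (the Monte Carlo pi estimate here is a stand-in for the actual findpi code):

```julia
using Distributed, Dates, Statistics

println(now(), "  adding workers")
addprocs(4)

println(now(), "  running @everywhere (the slow part for me)")
@time @everywhere montecarlo_pi(n) = 4 * count(_ -> rand()^2 + rand()^2 <= 1, 1:n) / n

println(now(), "  running pmap")
@time est = mean(pmap(_ -> montecarlo_pi(10^7), 1:nworkers()))

println(now(), "  done, estimate = ", est)
```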
@moble Thanks for the workaround! I am still seeing quite a lot of overhead even after applying your precompile/sysimage approach. The same block of code executed in @everywhere vs locally is something like 10x slower. Is this consistent with your experience?
For my use case, at least, the scaling was almost perfect once I precompiled. I’m no expert here, but one question jumps out at me: Is that 10x consistent? Like, even after you run whatever computation that is, if you keep the subprocesses up and just ask to re-run it, is it still 10x slower?
If it is consistent, I would guess that you’re doing something that doesn’t actually benefit from parallelization. Is there lots of communication, for example? Or is the test simply too small?
If it’s not consistent, maybe you’re making some subtle mistake on the precompilation.
Did you precompile the code that you actually need to run? In my gist, I used @everywhere 1+2, but you could run some short version of the computation you actually need there instead.
Did you make sure to combine both the main process’s precompilation output and the subprocess precompilation output?
Did you look at the output from the compilation, to check for any warnings?
Are you entirely sure you’re actually using the resulting sysimage?
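In case it helps, here is roughly the shape of the workflow I have in mind (file names, package names and flags below are placeholders, not the exact contents of the gist):

```julia
# Step 1: record precompile statements for the code you actually run, on the
# master *and* on the workers, e.g. by starting Julia with --trace-compile:
#
#   julia --trace-compile=master_trace.jl -p 4 -e '
#       using Distributed, MyPackage
#       @everywhere using MyPackage
#       @everywhere 1 + 2    # replace with a short version of the real work
#   '
#
# (worker-side statements can be captured by also passing --trace-compile to
# the workers via the exeflags keyword of addprocs).

# Step 2: combine the trace files and bake them into a sysimage.
using PackageCompiler

create_sysimage(
    [:MyPackage];                                  # packages baked into the image
    sysimage_path = "sys_mypackage.so",
    precompile_statements_file = ["master_trace.jl", "worker_trace.jl"],
)

# Step 3: make sure the master *and* every worker are actually started with
# the image, e.g. julia -J sys_mypackage.so, passing the same -J flag to the
# workers through exeflags or the cluster script.
```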