I think I’ve found a bug with @everywhere on 1.3, but it’s hard to reproduce. Basically, it takes a very long time to load a package on the workers when using a cluster with multiple nodes (the workload manager is Slurm). See the following results, particularly the output of @time ...
julia> using Distributed, ClusterManagers
julia> addprocs(SlurmManager(512), N=16)
512-element Array{Int64,1}:
2
3
4
⋮
511
512
513
julia> @time @everywhere using mypackage
519.233305 seconds (17.21 M allocations: 878.619 MiB, 0.08% gc time)
It takes over 8 minutes!
At first I thought it was precompiling, but that’s not the case: I tested this by running using mypackage locally first to trigger compilation, and the compiled cache files are automatically copied to the other nodes over the InfiniBand network. So precompilation is not the problem.
I initially thought this was a cluster configuration problem, but running it on 1.0.5 is much, much quicker. In a fresh 1.0.5 session, doing addprocs(SlurmManager(512), N=16) as above, I get
julia> @time @everywhere using mypackage
8.020800 seconds (5.05 M allocations: 245.780 MiB, 1.75% gc time)
(This further shows that precompilation is not the problem.)
I don’t know what has happened to @everywhere between 1.0.5 and 1.3. The problem also doesn’t exist when just doing addprocs(32) (i.e. using only the local processors rather than distributing over multiple machines). It seems like a bug that @everywhere has regressed so much. I am willing to give ssh access to a Julia contributor to debug this if possible. I can also create an issue on GitHub if this is indeed a bug.
I raised a similar (probably the same) issue here:
I am still not satisfied with the current situation, but I now know how to avoid the biggest pitfalls: first load all the packages you need on the master, then add the workers, and only then load the packages the workers actually need on the workers.
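In code, the ordering I mean looks roughly like this (MyPackage is a placeholder for whatever you actually load; adapt the addprocs call to your own cluster):

```julia
using Distributed, ClusterManagers

# 1. Load everything the master needs *before* adding workers, so the master
#    is not competing with hundreds of workers for the same package files.
using MyPackage

# 2. Only then add the workers.
addprocs(SlurmManager(512), N=16)

# 3. Finally, load on the workers only what they really need.
@everywhere using MyPackage
```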
Changing the topology doesn’t do anything; it’s still about 8 minutes for using mypackage… I just can’t figure out what’s going on. I am not sure what the problem is, but it seems to be related to the using statements inside my package itself.
For example, removing using DataFrames reduces the time to about 3 minutes, a huge improvement over the 8 minutes. To reiterate, the packages are already precompiled, so the slowdown must be coming from the __init__ functions or from running the code for any global consts.
But I am still not sure why it takes longer on 1.3 than on 1.0.5; there is code loading in both cases.
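One way to narrow this down (a rough sketch; the extra package names below are just examples of direct dependencies) is to time the distributed load of each dependency separately once the workers are up:

```julia
using Distributed
# workers already added, e.g. addprocs(SlurmManager(512), N=16)

@time @everywhere using DataFrames    # accounted for most of the time in my case
@time @everywhere using CSV           # example dependency
@time @everywhere using Distributions # example dependency
@time @everywhere using mypackage
```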
OK, nowadays I usually use an ElasticManager to manage my workers and tweak the worker connect command (see the linked page) so that all the packages are loaded before ClusterManagers.elastic_worker(...) is called. Depending on how you start your workers, this is gentler on the filesystem. I suspect all this loading takes so long because, all of a sudden, many nodes (in my case 100+) want to load code (precompiled or not) from the same files.
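For concreteness, a rough sketch of that setup (the address, port, cookie and package name are placeholders; the exact connect command is printed by the ElasticManager when you create it):

```julia
# On the master: start an ElasticManager that workers connect to later.
using Distributed, ClusterManagers
em = ElasticManager(addr=:auto, port=9009)

# On each worker (e.g. launched from the Slurm job script): load the heavy
# packages *first*, then hand control over to elastic_worker. Use the connect
# command that `em` prints; the values below are placeholders.
#
#   julia -e 'using MyPackage;
#             using ClusterManagers;
#             ClusterManagers.elastic_worker("cookie", "10.0.0.1", 9009)'
```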
The macro essentially runs remotecall_eval(Main, procs, ex), which my tests show is also slow. Increasing nworkers makes the execution time (as seen from the master) increase linearly. This indicates that it essentially serialises the job …
I checked the code inside remotecall_eval(Main, procs, ex) and it uses @sync_add, which is not defined for me as of Julia 1.5.3.
Or rather, I don’t understand why I get “@sync_add not defined” error messages when I try to use it in a @sync block.
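For reference, the kind of timing experiment mentioned above looks roughly like this (the evaluated expression is a stand-in for the real using statement):

```julia
using Distributed
addprocs(8)

# Call remotecall_eval directly on growing subsets of workers and watch how the
# wall time, as seen from the master, scales with the number of workers.
for n in (1, 2, 4, 8)
    procs = workers()[1:n]
    t = @elapsed Distributed.remotecall_eval(Main, procs, :(sum(rand(10^6))))
    println("n = $n workers: $(round(t, digits = 3)) s")
end
```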
Hi Sixzero,
I have tested @everywhere and pmap() on HPC and the results are here:
Basically, I tested a parallelised version of findpi on the HPC cluster and put timestamps between the lines. That let me pinpoint the time-consuming part, which is the @everywhere macro. Did you have similar situations? How did you work around it?
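A minimal sketch of what I mean by putting timestamps in between (the Monte Carlo pi estimate here is a stand-in for the actual findpi code):

```julia
using Distributed, Dates, Statistics

println(now(), "  adding workers")
addprocs(4)

println(now(), "  running @everywhere (the slow part for me)")
@time @everywhere montecarlo_pi(n) = 4 * count(_ -> rand()^2 + rand()^2 <= 1, 1:n) / n

println(now(), "  running pmap")
@time est = mean(pmap(_ -> montecarlo_pi(10^7), 1:nworkers()))

println(now(), "  done, estimate = ", est)
```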
@moble Thanks for the workaround! I am still seeing quite a lot of overhead even after applying your precompile/sysimage approach. The same block of code executed in @everywhere vs locally is something like 10x slower. Is this consistent with your experience?
For my use case, at least, the scaling was almost perfect once I precompiled. I’m no expert here, but one question jumps out at me: Is that 10x consistent? Like, even after you run whatever computation that is, if you keep the subprocesses up and just ask to re-run it, is it still 10x slower?
If it is consistent, I would guess that you’re doing something that doesn’t actually benefit from parallelization. Is there lots of communication, for example? Or is the test simply too small?
If it’s not consistent, maybe you’re making some subtle mistake on the precompilation.
Did you precompile the code that you actually need to run? In my gist, I used @everywhere 1+2, but you could run some short version of the computation you actually need there instead.
Did you make sure to combine both the main process’s precompilation output and the subprocess precompilation output?
Did you look at the output from the compilation, to check for any warnings?
Are you entirely sure you’re actually using the resulting sysimage?
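In case it helps, here is roughly the shape of the workflow I have in mind (file names, package names and flags below are placeholders, not the exact contents of the gist):

```julia
# Step 1: record precompile statements for the code you actually run, on the
# master *and* on the workers, e.g. by starting Julia with --trace-compile:
#
#   julia --trace-compile=master_trace.jl -p 4 -e '
#       using Distributed, MyPackage
#       @everywhere using MyPackage
#       @everywhere 1 + 2    # replace with a short version of the real work
#   '
#
# (worker-side statements can be captured by also passing --trace-compile to
# the workers via the exeflags keyword of addprocs).

# Step 2: combine the trace files and bake them into a sysimage.
using PackageCompiler

create_sysimage(
    [:MyPackage];                                  # packages baked into the image
    sysimage_path = "sys_mypackage.so",
    precompile_statements_file = ["master_trace.jl", "worker_trace.jl"],
)

# Step 3: make sure the master *and* every worker are actually started with
# the image, e.g. julia -J sys_mypackage.so, passing the same -J flag to the
# workers through exeflags or the cluster script.
```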