Unexpected behavior of package loading in workers using Distributed

The following phenomenon gave me quite a headache and took me a while to track down. Take this example (observed on Julia 1.11.1 and 1.10.6, and probably older versions as well):

julia> using Distributed, Statistics

julia> addprocs(1)
1-element Vector{Int64}:
 2

julia> remotecall_fetch(mean, 2, [1,2])
ERROR: On worker 2:
KeyError: key Base.PkgId(Base.UUID("10745b16-79ce-11e8-11f9-7d13ad32a3b2"), "Statistics") not found
...

This is fine and expected: the worker process hasn't loaded the Statistics module. Now consider a slightly modified example:

julia> using Distributed

julia> addprocs(1)
1-element Vector{Int64}:
 2

julia> using Statistics

julia> remotecall_fetch(mean, 2, [1,2])
1.5

julia> remotecall_fetch(typeof, 2, Statistics)
Module

Notice that I deliberately haven't used @everywhere using Statistics. So how on earth does the worker process know about the Statistics module in this case? What makes @everywhere implicit? I tried to find anything in the manual at least hinting at this kind of behavior, but failed.

Note that the same happens if I put the above lines in a source file and run Julia non-interactively. E.g., this throws an exception (as expected):

using Distributed, Statistics
addprocs(1)
println(remotecall_fetch(mean, 2, [1, 2]))

Whereas this prints 1.5:

using Distributed
addprocs(1)
using Statistics
println(remotecall_fetch(mean, 2, [1, 2]))
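
For reference, the explicit, documented way to make mean available on the workers, which is what I expected to be required, is @everywhere after the workers have been added:

using Distributed, Statistics
addprocs(1)
@everywhere using Statistics                 # load it on every process explicitly
println(remotecall_fetch(mean, 2, [1, 2]))   # prints 1.5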

Two sides of the same coin: the implicit propagation also fails, this time with an error on the main process, when a worker runs in a project where the package is not installed:

pkg> activate --temp; add Results
  Activating new project at `/tmp/jl_0SG8Ja`
   Resolving package versions...
    Updating `/tmp/jl_0SG8Ja/Project.toml`
  [a6f57dee] + Results v0.1.2
    Updating `/tmp/jl_0SG8Ja/Manifest.toml`
  [a6f57dee] + Results v0.1.2

julia> using Distributed

julia> addprocs(1, exeflags=`--project=/dev/null`)
1-element Vector{Int64}:
 2

julia> using Results
┌ Error: Error during package callback
│   exception =
│    1-element ExceptionStack:
│    On worker 2:
│    ArgumentError: Package Results [a6f57dee-de39-11ea-01ba-9f0f7c89923c] is required but does not seem to be installed:
│     - Run `Pkg.instantiate()` to install all recorded dependencies.
...

After further digging, I found the function _require_callback() in Distributed.jl, which is registered in Base.package_callbacks when Distributed itself is loaded. Every callback in that list is invoked whenever a new module is loaded via Base.require (which is what using/import call under the hood), and it is this callback that tells all workers to load the module as well.
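
A quick way to confirm this from the REPL; Base.package_callbacks is internal, so treat this as a sanity check rather than something to rely on:

julia> using Distributed

julia> addprocs(1);

julia> any(cb -> cb === Distributed._require_callback, Base.package_callbacks)
true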

It is unfortunate that this behavior is not documented at all. It is also unfortunate, IMHO, that it is the default, since the same effect could be achieved with an explicit @everywhere, and now there seems to be no way to load a module in only the main process after (some) workers have been started (or is there?). Perhaps this is just a corner case, but still.
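
The only workaround I can see is to temporarily unregister Distributed's callback around the using. This pokes at internals, so treat it as a fragile sketch rather than a recommendation:

using Distributed
addprocs(1)

# Temporarily remove Distributed's package callback so that the next `using`
# is not propagated to the workers, then restore it afterwards.
cb = Distributed._require_callback
filter!(f -> f !== cb, Base.package_callbacks)
using Statistics                   # now loaded only on the main process
push!(Base.package_callbacks, cb)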

This is definitely an uncomfortable edge case for Distributed users, and is something we’re aiming to fix in DistributedNext.jl.

I would say that 99% of users generally want newly-added workers to inherit the same project and loaded packages as on worker 1, but we definitely don’t achieve that properly right now (unless everything is done in just the right order and the project specified manually, as you’ve shown).
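
Concretely, the manual version today looks roughly like this (a sketch, assuming you just want the workers to run under the main process's active project):

using Distributed

# Start the worker with the same project as the main process, then load
# packages explicitly on every process.
addprocs(1; exeflags=`--project=$(Base.active_project())`)
@everywhere using Statistics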

Keep an eye on DistributedNext - this feature is definitely one of our highest priorities.

(And an important aside: our goal is to merge any changes to DistributedNext back into Distributed in the future, once they've been thoroughly tested and validated.)

I had no idea - cool!

Are you assuming that all workers must run on the exact same platform? What about heterogeneous settings?

Consider a simple setup where node 1 has a GPU and node 2 has a CPU. I would like to dispatch different implementations on these nodes. The same environment can be loaded on all workers by default, but the code that gets compiled should be different. Is that what you have in mind too?
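
For example, something along these lines is what I have in mind (purely illustrative: the host names are made up, and the GPU node is assumed to have CUDA.jl installed):

using Distributed

gpu_worker = addprocs(["gpu-node"])[1]   # hypothetical machine specs
cpu_worker = addprocs(["cpu-node"])[1]

# Load different packages on different nodes: the GPU stack only where there is a GPU.
@everywhere [gpu_worker] using CUDA
@everywhere [cpu_worker] using LinearAlgebra

# Node-specific implementations, defined where they will run.
@everywhere [gpu_worker] compute(x) = sum(abs2, CuArray(x))
@everywhere [cpu_worker] compute(x) = sum(abs2, x)

# Call through a closure so that `compute` is resolved on the worker itself.
remotecall_fetch(() -> compute(rand(10)), cpu_worker)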

Great point! We’re also planning for heterogeneous system support. The approach we’re thinking of taking is attaching project/package loading information to some kind of “config” object, which will store an ordered list of operations to be performed. There will be a default config object, used when one is not otherwise specified to addprocs (to meet that 99% case most easily), but it will also be possible to build and specify a custom config object to be applied to only certain workers. This would allow you to load only the code you need for each worker.

The more important thing I’m getting at is that, if you want more than one worker to share a given code configuration and environment, it’s very hard to do that right now. I want to add tools to make this automatic when users want that, but also make it just as easy to configure to suit your needs.

We have already done something like that with PlatformAware.jl. You can find the corresponding publication, Structured platform-aware programming, here:

https://www.researchgate.net/publication/377248434_Structured_platform-aware_programming

@decarvalhojuniorfh has just registered his other package CloudClusters.jl, built on top of PlatformAware.jl, for distributed execution on heterogeneous hardware:

It is still a prototype, but it already works with AWS clusters and other cluster definitions. I suggest you contact him to learn more about these ongoing efforts.