Distributed workers automatically load code on "using"

So in the docs for Distributed we have:

Finally, if DummyModule.jl is not a standalone file but a package, then using DummyModule will load DummyModule.jl on all processes, but only bring it into scope on the process where using was called.

First, what exactly does it mean to load? Is it the same as import? And second, why is this behaviour wanted?

I found it a bit annoying in a few cases, and though it seems easy enough to work around, it just didn't feel consistent to me, since I still have to use @everywhere using ... to actually use packages I want on all workers.

One example is that I like having some extra packages on my main process for plotting the results and so on. Now I have to make sure to load all the packages I might want for post-processing the results before I add any workers, because if I try to run using Plots afterwards, when I do want to plot the results, it will fail since my workers do not have Plots installed.
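For concreteness, the failure I mean looks roughly like this (a sketch, assuming Plots is installed only in the master's environment and not in the workers' default one):

```julia
using Distributed
addprocs(2)      # workers added first, with their default environment

using Plots      # fails: Distributed also tries to load Plots on workers 2 and 3,
                 # which don't have it installed
```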

EDIT:
Realised I still had problems with Plots even after loading it beforehand: when I ran plot(some_results) it complained about GR_jll missing on some of the workers. Why should the plot command even affect the workers if I run it locally?

When package Foo is loaded on a worker, everything that normally happens when you call import Foo happens on that worker except that the name Foo is not added to the global namespace, so Foo.stuff on that worker won’t work.

I'm not sure it's the best possible choice of behavior; I personally wouldn't hate it if things were fully imported on the workers rather than just loaded. But the general idea is that it lets the workers operate on objects which were created on the master process, since the workers will always have the relevant package loaded, e.g.:

julia> using Distributed

julia> addprocs(1)
1-element Vector{Int64}:
 2

julia> using ComponentArrays # this loads the package on the worker too

julia> arr = ComponentArray(x=1)
ComponentVector{Int64}(x = 1)

julia> @fetch sum(arr) # this ran on the worker just fine
1

julia> @fetch ComponentArray # even though the name itself is not imported
ERROR: On worker 2:
UndefVarError: ComponentArray not defined

This could be because Plots lazily loads some packages on the first plot command, so these end up being loaded on the workers even though you ran using Plots before the workers were added. You could try triggering this lazy load by doing a dummy plot before adding any workers, or figuring out which packages it's loading (via the error message?) and loading them yourself first. Also note that by default addprocs workers don't share the same environment as the master (which will hopefully get fixed), but you can change that by doing addprocs(N, exeflags = "--project=$(Base.active_project())"), which might also solve those load errors.
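Something like this sketch, run before any addprocs (that GR/GR_jll is what Plots loads lazily is a guess based on your error message, not something I've verified):

```julia
using Plots
plot(rand(3))    # dummy plot: triggers Plots' lazy loading of backend packages
                 # (e.g. GR_jll) before any workers exist

using Distributed
# start the workers with the same project environment as the master,
# so packages loaded from now on can also be found on the workers
addprocs(4, exeflags = "--project=$(Base.active_project())")
```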


When package Foo is loaded on a worker, everything that normally happens when you call import Foo happens on that worker except that the name Foo is not added to the global namespace, so Foo.stuff on that worker won’t work.

I'm not sure it's the best possible choice of behavior; I personally wouldn't hate it if things were fully imported on the workers rather than just loaded. But the general idea is that it lets the workers operate on objects which were created on the master process, since the workers will always have the relevant package loaded, e.g.:

Okay, I guess that could make sense, but I'm not sure it is the best solution.

So if I want to use mean instead of sum in your example, I would need to run @everywhere using Statistics, but only using ComponentArrays? Why not just require that things which should be used everywhere are also loaded with @everywhere? Then it would be very clear when you load something everywhere, and you could also choose to load a package only locally, which seems to be impossible now.
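To illustrate what I mean, a minimal sketch with just the stdlib Statistics package:

```julia
using Distributed
addprocs(1)

# mean is not enough with a plain `using`: Statistics has to be loaded with
# @everywhere so the name is in scope on the workers, not just the master.
@everywhere using Statistics

# Runs on worker 2, where mean is now defined:
@fetch mean([1, 2, 3])   # returns 2.0
```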

…but you can change that by doing addprocs(N, exeflags = "--project=$(Base.active_project())") which might also solve those load errors.

Yeah, the problem is that I don't run processes on the local machine but on many remote machines, and I felt it would be nicer to replicate only the computation environment on those, while plotting stays on the local one. It is easy to solve by also installing the plotting packages on the remote ones, but it feels like it would be simple to allow this, and since they are not needed on the workers it would be nice to not have to have them there.