Module loading on workers

Through some trial and error I realized that modules loaded with using are also loaded on all already existing workers. Therefore the following code runs without an error:

using Distributed
addprocs(2)

using PyPlot

pmap(workers()) do w
   w, myid(), PyPlot.version
end

I find this behavior quite inconvenient: with many workers (generating data) connected to one Jupyter notebook kernel (analyzing that data), I would rather not load a plotting package on all workers.

Somewhat inconsistently with this behavior, I found that a module which is already loaded in the main process will not be loaded on the workers if it was loaded with using before the workers were created. Julia seems to know that the module is already there in process 1, but does not load it on the workers.

Therefore this code gives errors:

using Distributed
using PyPlot

addprocs(2)

using PyPlot

pmap(workers()) do w
   w, myid(), PyPlot.version
end

which can certainly be resolved by prefixing the second using with @everywhere:

using Distributed
using PyPlot

addprocs(2)

@everywhere using PyPlot

pmap(workers()) do w
   w, myid(), PyPlot.version
end

Is this the intended behavior? Is there a way to load a module only on the local process even if there are already workers running?

The intended behaviour is that using loads the packages on workers so that type definitions are available and you can serialize and deserialize messages between the workers.
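To illustrate why the definitions need to exist on the workers, here is a minimal sketch. It does not use a package at all: a hypothetical Point type (the name and fields are assumptions for illustration) is defined on every process with @everywhere, which stands in for what loading a package on the workers provides.

```julia
using Distributed
addprocs(2)

# Hypothetical type for illustration. It is defined on all processes so that
# workers can deserialize Point values sent from process 1.
@everywhere struct Point
    x::Float64
    y::Float64
end

# The Point values are serialized on process 1 and deserialized on the workers.
norms = pmap(p -> hypot(p.x, p.y), [Point(3, 4), Point(6, 8)])
println(norms)
```

If the struct were defined on process 1 only, the workers would have no Point type to deserialize the arguments into, and the pmap call would be expected to fail.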

I consider the second behaviour, where using; addprocs; using does not load the package on the recently added workers, a bug, which I attempted to fix in https://github.com/JuliaLang/julia/pull/28860

I have the exact same use case, although I’ve slowly trained myself to separate imports before/after addprocs depending on whether I want them on the workers or not.

That said, I just played around with it a bit and the following seems like a way to remove the auto-loading feature:

using Distributed
addprocs()

filter!(!=(Distributed._require_callback), Base.package_callbacks)

using PyPlot # now only loaded on master process

pmap(workers()) do w
   w, myid(), PyPlot.version
end # will now error

and you can still run @everywhere using PyPlot later and it will work. There may well be other things this breaks that I haven’t thought of, though, so I’d be careful messing with the internals like this.

I still find this a bit inconsistent, since statements like x=1 need an @everywhere to have an effect on workers - but using XYZ does not.

And yes - if using XYZ finished fast and only made type definitions visible on the workers I would not care. But right now I usually have 200+ workers connected to my master process. Running using Plots would probably crash my hard drive and the network.

In the meantime I thought that @everywhere [1] using XYZ would probably do the job - but no - using overrides the process specification [1].
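For comparison, a minimal sketch of how ordinary statements respect the process they run on. The variable names x, y and z are purely illustrative, and the check uses remotecall_fetch with isdefined to ask each worker whether the global exists in its Main.

```julia
using Distributed
addprocs(2)

x = 1                    # plain assignment: process 1 only
@everywhere y = 2        # runs on every process
@everywhere [1] z = 3    # restricted to the listed process, here only process 1

# Ask each worker whether the global is defined in its own Main module.
has_x = [remotecall_fetch(isdefined, w, Main, :x) for w in workers()]
has_y = [remotecall_fetch(isdefined, w, Main, :y) for w in workers()]
has_z = [remotecall_fetch(isdefined, w, Main, :z) for w in workers()]

println((has_x, has_y, has_z))
```

Only y is defined on the workers, so for assignments the process list is honored. That is the behavior one would naively expect from using as well.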

I would definitely vote for some kind of @onlyhere macro.

Does import also load onto every worker? If not, you could use ImportAll.jl
https://github.com/NTimmons/ImportAll.jl
There’s probably a better way, though (like the above, or just loading the package before adding workers)