@everywhere using MyLib fails when procs are on a remote node

package
parallel
cluster

#1

I am having trouble loading my package on a remote node with multiple procs:

julia> @everywhere using MyLib
INFO: Precompiling module MyLib.
WARNING: Node state is inconsistent: node 2 failed to load cache from /home/alha02/.julia/lib/v0.6/MyLib.ji. Got:
WARNING: can only precompile from node 1
WARNING: Node state is inconsistent: node 10 failed to load cache from /home/alha02/.julia/lib/v0.6/MyLib.ji. Got:
WARNING: can only precompile from node 1
WARNING: Node state is inconsistent: node 12 failed to load cache from /home/alha02/.julia/lib/v0.6/MyLib.ji. Got:
WARNING: can only precompile from node 1
WARNING: Node state is inconsistent: node 3 failed to load cache from /home/alha02/.julia/lib/v0.6/MyLib.ji. Got:
WARNING: can only precompile from node 1
WARNING: Node state is inconsistent: node 7 failed to load cache from /home/alha02/.julia/lib/v0.6/MyLib.ji. Got:
WARNING: can only precompile from node 1
WARNING: Node state is inconsistent: node 14 failed to load cache from /home/alha02/.julia/lib/v0.6/MyLib.ji. Got:
WARNING: can only precompile from node 1
WARNING: Node state is inconsistent: node 17 failed to load cache from /home/alha02/.julia/lib/v0.6/MyLib.ji. Got:
WARNING: can only precompile from node 1
WARNING: Node state is inconsistent: node 4 failed to load cache from /home/alha02/.julia/lib/v0.6/MyLib.ji. Got:
WARNING: can only precompile from node 1
WARNING: Node state is inconsistent: node 16 failed to load cache from /home/alha02/.julia/lib/v0.6/MyLib.ji. Got:
WARNING: can only precompile from node 1
WARNING: Node state is inconsistent: node 15 failed to load cache from /home/alha02/.julia/lib/v0.6/MyLib.ji. Got:
WARNING: can only precompile from node 1
WARNING: Node state is inconsistent: node 11 failed to load cache from /home/alha02/.julia/lib/v0.6/MyLib.ji. Got:
WARNING: can only precompile from node 1
WARNING: Node state is inconsistent: node 9 failed to load cache from /home/alha02/.julia/lib/v0.6/MyLib.ji. Got:
WARNING: can only precompile from node 1
WARNING: Node state is inconsistent: node 8 failed to load cache from /home/alha02/.julia/lib/v0.6/MyLib.ji. Got:
WARNING: can only precompile from node 1
WARNING: Node state is inconsistent: node 6 failed to load cache from /home/alha02/.julia/lib/v0.6/MyLib.ji. Got:
WARNING: can only precompile from node 1
WARNING: Node state is inconsistent: node 5 failed to load cache from /home/alha02/.julia/lib/v0.6/MyLib.ji. Got:
WARNING: can only precompile from node 1
WARNING: Node state is inconsistent: node 13 failed to load cache from /home/alha02/.julia/lib/v0.6/MyLib.ji. Got:
WARNING: can only precompile from node 1
ERROR: On worker 2:
can only precompile from node 1
compilecache at ./loading.jl:670
_require at ./loading.jl:456
require at ./loading.jl:398
_require_from_serialized at ./loading.jl:203
_require_search_from_serialized at ./loading.jl:236
_require at ./loading.jl:434
require at ./loading.jl:398
_require_from_serialized at ./loading.jl:203
_require_search_from_serialized at ./loading.jl:236
_require at ./loading.jl:434
require at ./loading.jl:398
eval at ./boot.jl:235
eval_ew_expr at ./distributed/macros.jl:116
#106 at ./distributed/process_messages.jl:268 [inlined]
run_work_thunk at ./distributed/process_messages.jl:56
macro expansion at ./distributed/process_messages.jl:268 [inlined]
#105 at ./event.jl:73
#remotecall_fetch#141(::Array{Any,1}, ::Function, ::Function, ::Base.Distributed.Worker, ::Expr, ::Vararg{Expr,N} where N) at ./distributed/remotecall.jl:354
remotecall_fetch(::Function, ::Base.Distributed.Worker, ::Expr, ::Vararg{Expr,N} where N) at ./distributed/remotecall.jl:346
#remotecall_fetch#144(::Array{Any,1}, ::Function, ::Function, ::Int64, ::Expr, ::Vararg{Expr,N} where N) at ./distributed/remotecall.jl:367
remotecall_fetch(::Function, ::Int64, ::Expr, ::Vararg{Expr,N} where N) at ./distributed/remotecall.jl:367
(::##1#3)() at ./distributed/macros.jl:102

...and 15 more exception(s).

Stacktrace:
 [1] sync_end() at ./task.jl:287
 [2] macro expansion at ./distributed/macros.jl:112 [inlined]
 [3] anonymous at ./<missing>:?

I have the following output from versioninfo() on the calling node (remote node is identical):

Julia Version 0.6.0
Commit 9036443 (2017-06-19 13:05 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

Does anyone know what the problem might be? I’m not sure whether this is a bug or whether I missed something, so I’m posting here before filing a bug report.


#2

Have you installed MyLib on the remote node (and made sure it is up to date with your local copy)?


#3

Yes, this was from a fresh install of MyLib on both nodes.

Edit: running using MyLib on its own also works perfectly on both the remote and the local node.


#4

I don’t think you need to use @everywhere to make the package available on all the nodes. For example:

$ julia -p 1
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  
|__/                   |  x86_64-unknown-linux-gnu

julia> using Colors

julia> remotecall_fetch(colormatch, 2, 500)
XYZ{Float64}(0.0049,0.323,0.272)
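
To make the mechanism in the transcript explicit: remotecall_fetch runs a function on one specific worker and returns the result to the caller. Here is a minimal sketch that uses only Base functions, so it does not depend on Colors being installed. The VERSION guard is my own addition: on Julia 0.6 (the version in this thread) these functions live in Base, while on 0.7 and later they are in the Distributed standard library. As far as I know, from 0.7 onward a plain using on node 1 no longer loads the package on the workers, so there an @everywhere using is needed after all.

```julia
# Assumption: on Julia 0.7+ the distributed primitives are in the Distributed stdlib;
# on 0.6 they are part of Base and no import is needed.
if VERSION >= v"0.7.0-"
    using Distributed
end

# Start one local worker if none exists yet (same effect as launching with `julia -p 1`).
nprocs() == 1 && addprocs(1)

# Run a Base function on the first worker and fetch the result back to node 1.
w = first(workers())
remote_id = remotecall_fetch(myid, w)
println("worker ", w, " reports myid() = ", remote_id)
```

If the call returns the worker's own id, the round trip to that process works; any function defined on the worker can be substituted for myid.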

#5

(facepalm) Of course. That got rid of the error, but the warnings remain. I suspect they are caused by some connectivity issue. Thanks!


#6

See also https://github.com/JuliaLang/julia/pull/21718, which I thought should have eliminated this gotcha?


#7

I think the first time you load a package it still needs to be a plain using MyLib; otherwise all the nodes try to precompile the package and then write the .ji file to the same location (if you have a shared home directory). Afterwards, @everywhere using MyLib seems to work.

This is all based on experimentation, though, so to be honest I’m not sure how it’s supposed to work.
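
The two-step order described above could be sketched roughly as follows. This is only an illustration under assumptions: it targets Julia 1.x (where Distributed is a standard library), it uses Statistics as a stand-in for MyLib, and it starts two local workers instead of remote ones.

```julia
# Assumes Julia 1.x; Statistics is only a stand-in for MyLib.
using Distributed
using Statistics                 # step 1: load on node 1 only, so any precompilation happens once

nprocs() == 1 && addprocs(2)     # two local workers standing in for the remote ones

@everywhere using Statistics     # step 2: every process loads the package

# Call a function from the package on each worker to confirm it is usable there.
means = [remotecall_fetch(mean, p, [1, 2, 3]) for p in workers()]
println(means)
```

The point of the ordering is that only node 1 ever has to write the cache file; the workers then just read it.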


#8

Strange, I can’t get it to work like that. For example, trying to load DataFrames:

julia> using DataFrames
INFO: Precompiling module DataFrames.

julia> addprocs([("user@remote_host", :auto)])
16-element Array{Int64,1}:
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17

julia> @everywhere using DataFrames
ERROR: On worker 2:
can only precompile from node 1
compilecache at ./loading.jl:670
_require at ./loading.jl:456
require at ./loading.jl:398
eval at ./boot.jl:235
eval_ew_expr at ./distributed/macros.jl:116
#106 at ./distributed/process_messages.jl:268 [inlined]
run_work_thunk at ./distributed/process_messages.jl:56
macro expansion at ./distributed/process_messages.jl:268 [inlined]
#105 at ./event.jl:73
#remotecall_fetch#141(::Array{Any,1}, ::Function, ::Function, ::Base.Distributed.Worker, ::Expr, ::Vararg{Expr,N} where N) at ./distributed/remotecall.jl:354
remotecall_fetch(::Function, ::Base.Distributed.Worker, ::Expr, ::Vararg{Expr,N} where N) at ./distributed/remotecall.jl:346
#remotecall_fetch#144(::Array{Any,1}, ::Function, ::Function, ::Int64, ::Expr, ::Vararg{Expr,N} where N) at ./distributed/remotecall.jl:367
remotecall_fetch(::Function, ::Int64, ::Expr, ::Vararg{Expr,N} where N) at ./distributed/remotecall.jl:367
(::##1#3)() at ./distributed/macros.jl:102

...and 15 more exception(s).

Stacktrace:
 [1] sync_end() at ./task.jl:287
 [2] macro expansion at ./distributed/macros.jl:112 [inlined]
 [3] anonymous at ./<missing>:?

And neither SSH nor TCP seems to have any problems between the two nodes…
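
One way to double-check from inside Julia that every worker is reachable, and which machine it runs on, is to ask each one for its hostname. This is a minimal sketch, not a diagnosis of the error above. The VERSION guard and the local addprocs fallback are my own additions so the snippet runs on its own; in the session above the workers already exist.

```julia
# Assumption: on Julia 0.7+ the distributed primitives are in the Distributed stdlib;
# on 0.6 they are part of Base and no import is needed.
if VERSION >= v"0.7.0-"
    using Distributed
end

# Local workers as stand-ins when no remote workers have been added yet.
nprocs() == 1 && addprocs(2)

# Ask every worker for the name of the machine it runs on.
hosts = Dict(p => remotecall_fetch(gethostname, p) for p in workers())

for p in sort(collect(keys(hosts)))
    println("worker ", p, " runs on ", hosts[p])
end
```

If every worker answers, the message channel between node 1 and the workers is functioning, which points away from a plain connectivity problem.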