Error starting Distributed on Linux with Julia 1.10.x

I am unable to use Distributed in Julia 1.10.x on a Linux machine. It works fine on macOS ARM, so I am not sure whether this is general enough to open an issue. It also works as expected with Julia 1.9.4 on the same machine.

Here’s the error message on Julia 1.10.4:

$ julia -p 4
ERROR: Unable to load dependent library /opt/local/julia/julia-1.10.4/bin/../lib/julia/libjulia-codegen.so.1.10
Message:libLLVM-15jl.so: failed to map segment from shared object
ERROR: TaskFailedException

    nested task error: Unable to read host:port string from worker. Launch command exited with error?
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:1093
     [2] worker_from_id
       @ /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:1090 [inlined]
     [3] remote_do
       @ /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:557 [inlined]
     [4] kill(manager::Distributed.LocalManager, pid::Int64, config::WorkerConfig; exit_timeout::Int64, term_timeout::Int64)
       @ Distributed /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/managers.jl:738
     [5] kill
       @ /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/managers.jl:736 [inlined]
     [6] create_worker(manager::Distributed.LocalManager, wconfig::WorkerConfig)
       @ Distributed /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:604
     [7] setup_launched_worker(manager::Distributed.LocalManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:545
     [8] (::Distributed.var"#45#48"{Distributed.LocalManager, Vector{Int64}, WorkerConfig})()
       @ Distributed /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:501

    caused by: Unable to read host:port string from worker. Launch command exited with error?
    Stacktrace:
     [1] read_worker_host_port(io::Base.PipeEndpoint)
       @ Distributed /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:330
     [2] connect(manager::Distributed.LocalManager, pid::Int64, config::WorkerConfig)
       @ Distributed /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/managers.jl:575
     [3] create_worker(manager::Distributed.LocalManager, wconfig::WorkerConfig)
       @ Distributed /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:600
     [4] setup_launched_worker(manager::Distributed.LocalManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:545
     [5] (::Distributed.var"#45#48"{Distributed.LocalManager, Vector{Int64}, WorkerConfig})()
       @ Distributed /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:501
Stacktrace:
  [1] sync_end(c::Channel{Any})
    @ Base ./task.jl:448
  [2] macro expansion
    @ ./task.jl:480 [inlined]
  [3] addprocs_locked(manager::Distributed.LocalManager; kwargs::@Kwargs{exeflags::Cmd})
    @ Distributed /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:490
  [4] addprocs_locked
    @ /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:456 [inlined]
  [5] addprocs(manager::Distributed.LocalManager; kwargs::@Kwargs{exeflags::Cmd})
    @ Distributed /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:450
  [6] addprocs
    @ /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:443 [inlined]
  [7] addprocs(np::Int32; restrict::Bool, kwargs::@Kwargs{exeflags::Cmd})
    @ Distributed /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/managers.jl:465
  [8] addprocs
    @ /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/managers.jl:462 [inlined]
  [9] process_opts(opts::Base.JLOptions)
    @ Distributed /opt/local/julia/julia-1.10.4/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:1364
 [10] #invokelatest#2
    @ ./essentials.jl:892 [inlined]
 [11] invokelatest
    @ ./essentials.jl:889 [inlined]
 [12] exec_options(opts::Base.JLOptions)
    @ Base ./client.jl:272
 [13] _start()
    @ Base ./client.jl:552

Is this using an official build of Julia?


Please provide the output of versioninfo().

Yes, this is an official version downloaded from the website.

julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 40 × Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
Threads: 1 default, 0 interactive, 1 GC (on 40 virtual cores)
Environment:
  JULIA_CPU_TARGET = generic
  JULIA_CONDAPKG_BACKEND = Null
  JULIA_PYTHONCALL_EXE = /opt/local/mamba/envs/py310/bin/python

I should add that I have tried this on several Linux machines, and the library errors sometimes differ. All machines run RHEL 9.1, although they may have slightly different versions of some packages. Here are other examples of libraries failing to load when I start julia -p 4:

ERROR: Unable to load dependent library /opt/local/julia/julia-1.10.2/bin/../lib/julia/libjulia-codegen.so.1.10
Message:libLLVM-15jl.so: failed to map segment from shared object
ERROR: Unable to load dependent library /opt/local/julia/julia-1.10.2/bin/../lib/julia/libjulia-internal.so.1.10
Message:libunwind.so.8: failed to map segment from shared object

And on another machine:

ERROR: Unable to load dependent library /opt/local/julia/julia-1.10.4/bin/../lib/julia/libstdc++.so.6
Message:/opt/local/julia/julia-1.10.4/bin/../lib/julia/libstdc++.so.6: failed to map segment from shared object
ERROR: Unable to load dependent library /opt/local/julia/julia-1.10.4/bin/../lib/julia/libjulia-internal.so.1.10
Message:/opt/local/julia/julia-1.10.4/bin/../lib/julia/libjulia-internal.so.1.10: failed to map segment from shared object

When I saw the libstdc++.so.6 errors, I followed some advice from a Julia issue and linked the system libstdc++.so.6 instead of the one shipped with Julia. Unfortunately, that did not solve the problem, and it often led to errors loading libjulia-internal.so.1.10.

Do you have a startup file? What happens when you run julia -p 4 --startup-file=no?

I have no startup file, and the error is the same.


I would add that I have been experiencing similar problems on our Linux cluster since the release of 1.10. The Julia version is 1.10.4, installed from the official juliaup binaries. It looks like the problem occurs when several Julia processes are launched at the same time. For example,

using Distributed
addprocs(10)

fails with the "Unable to load dependent library" error, while

for _ in 1:10
    addprocs(1)
end

works without problems. As a personal workaround, I currently have a branch of ClusterManagers.jl that adds sleep calls between consecutive job launches, which more or less solves the problem for me. I thought the error was specific to our architecture, and it was hard to make a reproducible example, so I have not reported it yet.
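For local workers, the same idea can be sketched as a small wrapper around addprocs. This is only an illustration of the workaround, not the code from my ClusterManagers.jl branch: the function name addprocs_staggered and the delay keyword are made up for this example, and the delay value is a guess that may need tuning per machine.

```julia
using Distributed

# Hypothetical helper (not part of Distributed or ClusterManagers.jl):
# launch workers one at a time, pausing between launches so that the
# worker processes do not all start at the same moment.
function addprocs_staggered(n::Integer; delay::Real=1.0, kwargs...)
    pids = Int[]
    for i in 1:n
        # addprocs(1) returns a vector with the id of the new worker
        append!(pids, addprocs(1; kwargs...))
        # no need to wait after the last worker
        i < n && sleep(delay)
    end
    return pids
end

pids = addprocs_staggered(4; delay=0.5)
```

Since addprocs(1) already blocks until the new worker has connected, the loop alone serializes the launches; the extra sleep only adds a safety margin and may be unnecessary.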

EDIT: In case it helps, here is my versioninfo():

julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)