ReverseDiff.jl interference with Distributed.jl (GitHub action mysteriously starts breaking on Windows)

Something strange is happening to the Windows GitHub actions for SymbolicRegression.jl, which is either caused by or manifesting in a Distributed.jl error.

The action was working 9 days ago, for all operating systems.

Today, I noticed my Windows runs were breaking. The error doesn’t seem to go away with new tweaks. Confused, I re-ran the last working commit with the exact same action, and those Windows runs now break!

Then, I thought: perhaps this is just the new patch of windows-latest breaking something? So I tried windows-2019 and windows-2022. Also now broken!

The weird thing is that Julia 1.5 is still working, but Julia 1.6 through 1.8 are all broken, on all versions of Windows, despite the code not changing at all.

Perhaps this is either:

  1. New GitHub action compute hardware, which breaks newer versions of Distributed.jl on Windows?
  2. New patches of windows-2019 and windows-2022, which break newer versions of Distributed.jl?
  3. One of my dependencies’ updates introducing a bug, even though the issue seems to be coming from Distributed.jl, which is in the standard library?

Any ideas or things to try greatly appreciated!

This is the specific error (this run):

Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base .\stream.jl:410
  [2] (::Base.var"#wait_locked#679")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base .\stream.jl:944
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base .\stream.jl:950
  [4] unsafe_read
    @ .\io.jl:759 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base .\io.jl:758
  [6] read!
    @ .\io.jl:760 [inlined]
  [7] deserialize_hdr_raw
    @ C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed .\task.jl:484
Error During Test
 at Worker 2 terminated.Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base .\stream.jl:410
  [2] (::Base.var"#wait_locked#679")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base .\stream.jl:944
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base .\stream.jl:950
  [4] unsafe_read
    @ .\io.jl:759 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base .\io.jl:758
  [6] read!
    @ .\io.jl:760 [inlined]
  [7] deserialize_hdr_raw
    @ C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed .\task.jl:484
C:\Users\runneradmin\.julia\packages\SafeTestsets\A83XK\src\SafeTestsets.jl:25

  Got exception outside of a @testWorker 5 terminated.Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base .\stream.jl:410
  [2] (::Base.var"#wait_locked#679")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base .\stream.jl:944
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base .\stream.jl:950
  [4] unsafe_read
    @ .\io.jl:759 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base .\io.jl:758
  [6] read!
    @ .\io.jl:760 [inlined]
  [7] deserialize_hdr_raw
    @ C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed .\task.jl:484


  LoadError: Distributed.ProcessExitedException(2)Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base .\stream.jl:410
  [2] (::Base.var"#wait_locked#679")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base .\stream.jl:944
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base .\stream.jl:950
  [4] unsafe_read
    @ .\io.jl:759 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base .\io.jl:758
  [6] read!
    @ .\io.jl:760 [inlined]
  [7] deserialize_hdr_raw
    @ C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed .\task.jl:484

  
  ...and 3 more exceptions.
  
  Stacktrace:
    [1] sync_end(c::Channel{Any})
      @ Base .\task.jl:436
    [2] macro expansion
      @ .\task.jl:455 [inlined]
    [3] remotecall_eval(m::Module, procs::Vector{Int64}, ex::Expr)
      @ Distributed C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\macros.jl:219
    [4] macro expansion
      @ C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\macros.jl:203 [inlined]
    [5] import_module_on_workers(procs::Vector{Int64}, filename::String, options::SymbolicRegression.CoreModule.OptionsStructModule.Options{Tuple{typeof(+), typeof(*)}, Tuple{typeof(cos)}, Nothing, Nothing, LossFunctions.L2DistLoss, Int64})
      @ SymbolicRegression D:\a\SymbolicRegression.jl\SymbolicRegression.jl\src\Configure.jl:199
    [6] _EquationSearch(::SymbolicRegression.CoreModule.ProgramConstantsModule.SRDistributed, datasets::Vector{SymbolicRegression.CoreModule.DatasetModule.Dataset{Float32}}; niterations::Int64, options::SymbolicRegression.CoreModule.OptionsStructModule.Options{Tuple{typeof(+), typeof(*)}, Tuple{typeof(cos)}, Nothing, Nothing, LossFunctions.L2DistLoss, Int64}, numprocs::Nothing, procs::Nothing, runtests::Bool, saved_state::Nothing, addprocs_function::Nothing)
      @ SymbolicRegression D:\a\SymbolicRegression.jl\SymbolicRegression.jl\src\SymbolicRegression.jl:513
    [7] EquationSearch(datasets::Vector{SymbolicRegression.CoreModule.DatasetModule.Dataset{Float32}}; niterations::Int64, options::SymbolicRegression.CoreModule.OptionsStructModule.Options{Tuple{typeof(+), typeof(*)}, Tuple{typeof(cos)}, Nothing, Nothing, LossFunctions.L2DistLoss, Int64}, numprocs::Nothing, procs::Nothing, multithreading::Bool, runtests::Bool, saved_state::Nothing, addprocs_function::Nothing)
      @ SymbolicRegression D:\a\SymbolicRegression.jl\SymbolicRegression.jl\src\SymbolicRegression.jl:327
    [8] EquationSearch(X::Matrix{Float32}, y::LinearAlgebra.Transpose{Float32, Matrix{Float32}}; niterations::Int64, weights::Nothing, varMap::Nothing, options::SymbolicRegression.CoreModule.OptionsStructModule.Options{Tuple{typeof(+), typeof(*)}, Tuple{typeof(cos)}, Nothing, Nothing, LossFunctions.L2DistLoss, Int64}, numprocs::Nothing, procs::Nothing, multithreading::Bool, runtests::Bool, saved_state::Nothing, addprocs_function::Nothing)
      @ SymbolicRegression D:\a\SymbolicRegression.jl\SymbolicRegression.jl\src\SymbolicRegression.jl:271
    [9] top-level scope
      @ D:\a\SymbolicRegression.jl\SymbolicRegression.jl\test\full.jl:104
   [10] include(mod::Module, _path::String)
      @ Base .\Base.jl:419
   [11] include(x::String)
      @ Main.var"##361" C:\Users\runneradmin\.julia\packages\SafeTestsets\A83XK\src\SafeTestsets.jl:23
   [12] macro expansion
      @ D:\a\SymbolicRegression.jl\SymbolicRegression.jl\test\runtests.jl:7 [inlined]
   [13] macro expansion
      @ C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Test\src\Test.jl:1357 [inlined]
   [14] top-level scope
      @ D:\a\SymbolicRegression.jl\SymbolicRegression.jl\test\runtests.jl:7
   [15] eval(m::Module, e::Any)
      @ Core .\boot.jl:368
   [16] top-level scope
      @ C:\Users\runneradmin\.julia\packages\SafeTestsets\A83XK\src\SafeTestsets.jl:23
   [17] include(fname::String)
      @ Base.MainInclude .\client.jl:476
   [18] top-level scope
      @ none:6
   [19] eval
      @ .\boot.jl:368 [inlined]
   [20] exec_options(opts::Base.JLOptions)
      @ Base .\client.jl:276
   [21] _start()
      @ Base .\client.jl:522
  in expression starting at D:\a\SymbolicRegression.jl\SymbolicRegression.jl\test\full.jl:12
Test Summary:   | Pass  Error  Total     Time
End to end test |    2      1      3  1m51.5s
ERROR: LoadError: Some tests did not pass: 2 passed, 0 failed, 1 errored, 0 broken.
in expression starting at D:\a\SymbolicRegression.jl\SymbolicRegression.jl\test\runtests.jl:6
ERROR: Package SymbolicRegression errored during testing
Stacktrace:
 [1] pkgerror(msg::String)
   @ Pkg.Types C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Pkg\src\Types.jl:67
 [2] test(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; coverage::Bool, julia_args::Cmd, test_args::Cmd, test_fn::Nothing, force_latest_compatible_version::Bool, allow_earlier_backwards_compatible_versions::Bool, allow_reresolve::Bool)
   @ Pkg.Operations C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Pkg\src\Operations.jl:1813
 [3] test(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; coverage::Bool, test_fn::Nothing, julia_args::Cmd, test_args::Cmd, force_latest_compatible_version::Bool, allow_earlier_backwards_compatible_versions::Bool, allow_reresolve::Bool, kwargs::Base.Pairs{Symbol, IOContext{Base.PipeEndpoint}, Tuple{Symbol}, NamedTuple{(:io,), Tuple{IOContext{Base.PipeEndpoint}}}})
   @ Pkg.API C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Pkg\src\API.jl:431
 [4] test(pkgs::Vector{Pkg.Types.PackageSpec}; io::IOContext{Base.PipeEndpoint}, kwargs::Base.Pairs{Symbol, Bool, Tuple{Symbol}, NamedTuple{(:coverage,), Tuple{Bool}}})
   @ Pkg.API C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Pkg\src\API.jl:156
 [5] test(; name::Nothing, uuid::Nothing, version::Nothing, url::Nothing, rev::Nothing, path::Nothing, mode::Pkg.Types.PackageMode, subdir::Nothing, kwargs::Base.Pairs{Symbol, Bool, Tuple{Symbol}, NamedTuple{(:coverage,), Tuple{Bool}}})
   @ Pkg.API C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Pkg\src\API.jl:171
 [6] top-level scope
   @ none:1

This is the line of code which the workers are failing on: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/284ec196702bb8d08769a58ac119cf14674b79a6/src/Configure.jl#L199-L201

@everywhere procs begin
    Base.MainInclude.eval(using SymbolicRegression)
end

This line imports the package on all the workers (since addprocs is called within the library, rather than having the user do it). I’ve never had an issue with this before, and have even scaled up to 1000s of workers across a Slurm cluster just fine.


Could it be caused by an update to some upstream package? When comparing the last successful CI run with commit message write new file ... with a newer CI run (2022-08-19) with commit message bump version with cp fix ... (2022-08-10), the following packages have changed:

Package               Old version   New version
ChainRules            v1.43.2       v1.44.2
PreallocationTools    v0.4.0        v0.4.2
RecursiveArrayTools   v2.31.2       v2.32.0
StaticArrays          v1.5.2        v1.5.5
StaticArraysCore      v1.0.1        v1.1.0
StructArrays          v0.6.11       v0.6.12
StructTypes           v1.8.1        v1.9.0
TimerOutputs          v0.5.20       v0.5.21
Zygote                v0.6.42       v0.6.44
ReverseDiff           -             v1.14.1
FunctionWrappers      -             v1.1.2

ReverseDiff was introduced by PreallocationTools and FunctionWrappers was introduced by ReverseDiff.

I’ve skimmed the new packages and the changes in the other packages a bit but didn’t see anything in particular that might cause your jobs to hang. What I would do next to figure out the problem is to pin versions in your Project.toml such as PreallocationTools = "=0.4.0" to figure out which package exactly introduces the problem.


Thanks @rikh! This is very helpful. I will make a CI on Project.toml for each of these and do a binary search to see which one is causing the issue.

For future reference, how did you measure the dependency changes? Was it by hand, or did you use some script that looks at version availability as a function of date?

:ok_hand::ok_hand:

I did it by hand so there might be a mistake in it now that I think of it. This problem of finding the root-cause comes up time and again so probably someone has some code for it lying around.
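For anyone who wants to script the comparison rather than do it by hand, here is a minimal sketch that diffs the package versions recorded in two Manifest.toml files. The function names `manifest_versions` and `manifest_diff` are made up for this illustration, and it assumes you have saved the manifest from a passing run and from a failing run (CI does not keep these by default). It relies only on the TOML standard library, which ships with Julia 1.6 and later.

```julia
using TOML  # standard library since Julia 1.6

"""Map package name => version string for every versioned entry in a manifest."""
function manifest_versions(path::AbstractString)
    manifest = TOML.parsefile(path)
    # Manifests with `manifest_format = "2.0"` nest packages under "deps";
    # older manifests list them at the top level.
    deps = get(manifest, "deps", manifest)
    versions = Dict{String,String}()
    for (name, entries) in deps
        entries isa AbstractVector || continue  # skip scalar metadata keys
        entry = first(entries)
        entry isa AbstractDict || continue
        # Standard libraries may carry no "version" key; those are skipped.
        haskey(entry, "version") && (versions[name] = entry["version"])
    end
    return versions
end

"""List (name, old, new) for packages whose version differs; `nothing` means absent."""
function manifest_diff(old_path::AbstractString, new_path::AbstractString)
    old, new = manifest_versions(old_path), manifest_versions(new_path)
    changes = Tuple{String,Union{Nothing,String},Union{Nothing,String}}[]
    for name in sort!(collect(union(keys(old), keys(new))))
        o, n = get(old, name, nothing), get(new, name, nothing)
        o == n || push!(changes, (name, o, n))
    end
    return changes
end

# Hypothetical usage (the file names are placeholders):
# foreach(println, manifest_diff("Manifest.old.toml", "Manifest.new.toml"))
```

A package that appears only in the newer manifest shows up with `nothing` as its old version, which is how newly introduced transitive dependencies would stand out.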

Okay, an update: on the windows-debug branch, I basically did a manual binary search over these to see which one was breaking the Windows tests. It turns out that just pinning PreallocationTools.jl (which is not even a direct dependency!) to 0.4.0 is enough to solve this issue. I am really confused why this solves the issue, or why the issue occurs in the first place. Anyways, this compat entry is enough to solve this issue in the short term.
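The manual binary search could be sketched roughly as follows. This is an illustration, not the procedure actually used on the branch: `bisect_culprit` and the `passes` callback are hypothetical names, and the sketch assumes exactly one upgraded package is responsible and that any subset of the upgrades can be resolved on its own.

```julia
"""
Find the single package whose upgrade breaks the build, assuming exactly one
culprit. `passes(upgraded)` must return `true` when the build succeeds with only
the packages in `upgraded` at their new versions (all others held at the old ones).
"""
function bisect_culprit(passes::Function, candidates::Vector{String})
    suspects = copy(candidates)
    while length(suspects) > 1
        half = suspects[1:(length(suspects) ÷ 2)]
        # If upgrading only this half still passes, the culprit is in the other half.
        suspects = passes(half) ? suspects[(length(half) + 1):end] : half
    end
    return only(suspects)  # throws if `candidates` was empty
end
```

In practice `passes` would stand for one CI run with the corresponding pins in Project.toml, so each step of the loop costs a full test cycle; with nine changed packages that is about four runs instead of nine.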

Another interesting thing is that PreallocationTools.jl is only for Julia 1.6 and up. The Julia 1.5 tests were passing, even on Windows - those tests therefore did not have PreallocationTools.jl installed.

Here’s the pull request: Fix strange windows issue from ReverseDiff.jl by MilesCranmer · Pull Request #117 · MilesCranmer/SymbolicRegression.jl · GitHub

I wonder if any direct dependency is using the most recent version of PreallocationTools - perhaps that direct dependency is causing the bug.


I generated the Manifest.toml with the PreallocationTools.jl requirement and without, and then ran a diff.

The only differences are the introduction of the FunctionWrappers.jl and ReverseDiff.jl packages in the flexible compat entry for PreallocationTools.jl. Would either of those packages, just being imported within a Distributed.jl process, cause such a bug to occur?


Even forcing FunctionWrappers.jl and ReverseDiff.jl to be installed doesn’t give the bug. It’s only PreallocationTools.jl which breaks things…

How are you using PreallocationTools? There are only two changes, both to LazyBufferCache.

One is that it reduces allocations by keeping views as views. The other is that there is now a specialization for ReverseDiff.jl tracked values.

Is the issue 0.4.1 or 0.4.2?

Our postings crossed - I describe it in some more detail here: Potential interference with Distributed.jl on Windows · Issue #34 · SciML/PreallocationTools.jl · GitHub.

PreallocationTools.jl is not a direct dependency - it is a dependency of LabelledArrays.jl which itself is a dependency of SymbolicUtils.jl.
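A chain like this can be traced from inside Julia. The sketch below is one way to do it, assuming the project environment is active; `reverse_deps` is a hypothetical helper name. It uses `Pkg.dependencies()`, which returns a `PackageInfo` per package in the environment, including a `dependencies` field keyed by package name.

```julia
using Pkg

"""Names of packages in the active environment that list `target` as a dependency."""
function reverse_deps(target::AbstractString)
    # Each PackageInfo has `dependencies::Dict{String,UUID}` keyed by package name.
    return sort!(String[info.name for (_, info) in Pkg.dependencies()
                        if haskey(info.dependencies, target)])
end

# Hypothetical usage: walk upwards one level at a time.
# reverse_deps("PreallocationTools")
# reverse_deps("LabelledArrays")
```

Calling it repeatedly on each result walks up the tree until a direct dependency of the project is reached.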


Will try 0.4.1 now.

Seems like the issue is from 0.4.2, as 0.4.1 is working: Attempt to raise PreallocationTools to 0.4.1 · MilesCranmer/SymbolicRegression.jl@45d0461 · GitHub

Okay thanks to @ChrisRackauckas’s help, we distilled the problem to simply being the presence of

import ReverseDiff

anywhere in the package or in its dependencies. For some reason, just importing ReverseDiff.jl, presumably by a worker process, causes this strange bug in Distributed.jl, and only on Windows!

(The reason this problem goes away on Julia 1.5 is that PreallocationTools.jl requires Julia >= 1.6, so it wasn’t being installed by the dependency tree.)


@mohamed82008 @devmotion any leads on what might be inside of ReverseDiff that could cause issues just by existing? Any known piracy?

This is the error traceback: Explicit import of ReverseDiff · MilesCranmer/SymbolicRegression.jl@6b7df7f · GitHub

Which fails around when the package is imported on the worker nodes here:

I wonder if it’s an issue with running the following code on Windows?

using Pkg
using Distributed
import ReverseDiff
procs = addprocs(4)

# Activate env on workers:
project_path = splitdir(Pkg.project().path)[1]
@everywhere procs begin
    Base.MainInclude.eval(
        quote
            using Pkg
            Pkg.activate($$project_path)
        end,
    )
end

# Import package on workers:
@everywhere procs begin
    Base.MainInclude.eval(import ReverseDiff)
end

since this seems to be where it is failing. Maybe the second import of ReverseDiff in a worker process messes with the internals somehow?

Yep, this code gives a similar bug. I’ll submit this as a MWE issue to ReverseDiff.jl.

any leads on what might be inside of ReverseDiff that could cause issues just by existing? Any known piracy?

Not that I am aware of.

Should I submit this as a bug to Julia as well? It could be some issue with Distributed.jl. I have no idea why only ReverseDiff.jl would experience this issue though (and just with import ReverseDiff, not even using!).

Ya I think that’s reasonable.

Here’s some code to reproduce this. It will:

  1. Dynamically create some worker processes.
  2. Activate the environment on each worker.
  3. Import the specified package on each worker.
  4. Close the worker processes.

I tested this on 12 different packages, and on macOS, Ubuntu, and Windows. The only failure case is the combination of ReverseDiff.jl and Windows.

using Pkg, Distributed

"""Try to dynamically create workers, and import the package."""
function test(package_name)
    procs = addprocs(4)
    project_path = splitdir(Pkg.project().path)[1]
    # Import package on head worker:
    Base.MainInclude.eval(
        quote
            import $(Symbol(package_name))
        end
    )
    # Import package on worker:
    @everywhere procs begin
        Base.MainInclude.eval(
            quote
                using Pkg
                Pkg.activate($$project_path)
                import $(Symbol($package_name))
            end,
        )
    end
    rmprocs(procs)
end

packages_to_test = [
    "Distributed",  "JSON3", "LineSearches", "LinearAlgebra",
    "LossFunctions", "Optim", "Printf", "Random",
    "Reexport", "SpecialFunctions", "Zygote", "ReverseDiff",
]
for package_name in packages_to_test
    println("Testing $(package_name)...")
    test(package_name)
    println("Success!")
end

Here’s the issue on the Julia repo: https://github.com/JuliaLang/julia/issues/46441.


I also note that the first import must be performed. If the package is only imported on the worker processes, but not on the head worker, the error does not occur.
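To compare the two cases side by side, a small toggle can help. This is a sketch, not the code from the issue: `import_on_workers` and its `on_head_first` flag are invented for illustration, environment activation on the workers is left out, and it assumes the package is loadable on the workers as spawned. It uses `remotecall_fetch` with `Core.eval` instead of `@everywhere` so that the head-process import is under explicit control.

```julia
using Distributed

"""
Import `package` on freshly spawned workers, optionally importing it on the
head process first. Returns the number of workers that completed the import.
"""
function import_on_workers(package::Symbol; on_head_first::Bool, n::Int=2)
    procs = addprocs(n)
    try
        # Build the expression `import <package>` explicitly.
        ex = Expr(:import, Expr(:., package))
        on_head_first && Core.eval(Main, ex)
        for p in procs
            # Evaluate the import in `Main` on worker `p` and wait for it to finish.
            remotecall_fetch(Core.eval, p, Main, ex)
        end
        return length(procs)
    finally
        rmprocs(procs)  # always clean up the workers, even after a failure
    end
end
```

Running `import_on_workers(:ReverseDiff; on_head_first=true)` and then, in a fresh session, `import_on_workers(:ReverseDiff; on_head_first=false)` on Windows would be the comparison of interest; whether this variant reproduces the failure has not been checked here.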
