Worker terminated in parallel computation with Gurobi

Hello everyone,

I am trying to run multiple Gurobi optimizations in parallel inside a genetic algorithm that I wrote myself.
The code is very long and structured in many parts, so here I report its conceptual structure:

using Distributed, SharedArrays # and other packages
addprocs(26)
@everywhere begin
    using JuMP, Gurobi # and other packages
end

@everywhere begin
    # load input data in all the processes
    if !(@isdefined env)
        const env = Gurobi.Env() # one Gurobi environment per worker, reused across solves
    end
end

fitness_values = SharedArray{Float64}(26)

function par_OF(population)
    @sync @distributed for i in 1:length(population)
        fitness_values[i] = OF(population[i]) # the OF function runs the Gurobi optimization
    end
    return fitness_values
end

population = [rand(15) for n in 1:26] # creation of initial population

conv_param = 1.0
while conv_param > 1e-2
    global conv_param, fitness_values # needed when this loop runs at the top level of a script
    fitness_values = par_OF(population)
    # update population according to fitness_values
    # update conv_param according to fitness_values
end

When executing the code, I encounter a strange error, which happens at a different iteration depending on the specific run.

      From worker 20:
      From worker 20:   Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 20:   Exception: EXCEPTION_ACCESS_VIOLATION at 0x7ffdc38bcd99 -- GRBrelaxmodel at C:\gurobi912\win64\bin\gurobi91.DLL (unknown line)
      From worker 20:   in expression starting at none:0
      From worker 20:   GRBrelaxmodel at C:\gurobi912\win64\bin\gurobi91.DLL (unknown line)
      From worker 20:   GRBrelaxmodel at C:\gurobi912\win64\bin\gurobi91.DLL (unknown line)
      From worker 20:   GRBfeasrelax at C:\Users\umbe\.julia\packages\Gurobi\FliRK\src\gen91\libgrb_api.jl:308
      From worker 20:   unknown function (ip: 000000000219eac5)
      From worker 20:   UL_OF at C:\Users\umbe\OneDrive\Script\UL\UL_OF.jl:69 
      From worker 20:   OF at C:\Users\umbe\OneDrive\Script\UL\UL_opt.jl:64   
      From worker 20:   macro expansion at C:\Users\umbe\OneDrive\Script\UL\UL_opt.jl:99 [inlined]
      From worker 20:   #133 at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.7\Distributed\src\macros.jl:303
      From worker 20:   #178 at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.7\Distributed\src\macros.jl:83
      From worker 20:   unknown function (ip: 000000000219ef23)
      From worker 20:   jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1788 [inlined]
      From worker 20:   do_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\builtins.c:713
      From worker 20:   #107 at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.7\Distributed\src\process_messages.jl:274
      From worker 20:   run_work_thunk at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.7\Distributed\src\process_messages.jl:63
      From worker 20:   unknown function (ip: 0000000032bfb526)
      From worker 20:   run_work_thunk at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.7\Distributed\src\process_messages.jl:72
      From worker 20:   #100 at .\task.jl:429
      From worker 20:   unknown function (ip: 0000000032bfb0b3)
      From worker 20:   jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1788 [inlined]
      From worker 20:   start_task at /cygdrive/c/buildbot/worker/package_win64/build/src\task.c:877
      From worker 20:   Allocations: 1417948818 (Pool: 1410181436; Big: 7767382); GC: 1053
Worker 20 terminated.
Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
 [1] (::Base.var"#wait_locked#648")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
   @ Base .\stream.jl:892
 [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
   @ Base .\stream.jl:900
 [3] unsafe_read
   @ .\io.jl:724 [inlined]
 [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
   @ Base .\io.jl:723
 [5] read!
   @ .\io.jl:725 [inlined]
 [6] deserialize_hdr_raw
   @ C:\Users\umbe\AppData\Local\Programs\Julia-1.7.3\share\julia\stdlib\v1.7\Distributed\src\messages.jl:167 [inlined]
 [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed C:\Users\umbe\AppData\Local\Programs\Julia-1.7.3\share\julia\stdlib\v1.7\Distributed\src\process_messages.jl:165
 [8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed C:\Users\umbe\AppData\Local\Programs\Julia-1.7.3\share\julia\stdlib\v1.7\Distributed\src\process_messages.jl:126
 [9] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
   @ Distributed .\task.jl:429
ERROR: LoadError: Unhandled Task ERROR: ProcessExitedException(20)
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base .\task.jl:381
 [2] (::Distributed.var"#177#179"{var"#133#135"{Vector{Any}, SharedVector{Float64}}, UnitRange{Int64}})()
   @ Distributed .\task.jl:400
TaskFailedException

    nested task error: ProcessExitedException(20)
    Stacktrace:
     [1] sync_end(c::Channel{Any})
       @ Base .\task.jl:381
     [2] (::Distributed.var"#177#179"{var"#133#135"{Vector{Any}, SharedVector{Float64}}, UnitRange{Int64}})()
       @ Distributed .\task.jl:400
Stacktrace:
  [1] sync_end(c::Channel{Any})
    @ Base .\task.jl:381
  [2] macro expansion
    @ .\task.jl:400 [inlined]
  [3] par_OF(population::Vector{Any}, num_elite::Int64, elite_fitness::Vector{Float64}, fitness_values::SharedVector{Float64})
    @ Main C:\Users\umbe\OneDrive\Script\UL\UL_opt.jl:98
  [4] select_parents(population::Vector{Any}, num_parents::Int64, num_elite::Int64, elite_fitness::Vector{Float64}, fitness_values::SharedVector{Float64})
    @ Main C:\Users\umbe\OneDrive\Script\UL\UL_opt.jl:120
  [5] genetic_algorithm(population_size::Int64, num_generations::Int64, num_parents::Int64, mutation_rate::Float64, num_elite::Int64, lb::Vector{Float64}, ub::Vector{Float64}, lc::Vector{Float64}, uc::Vector{Float64}, init_population::Vector{Vector{Float64}}, fitness_values::SharedVector{Float64})
    @ Main C:\Users\umbe\OneDrive\Script\UL\UL_opt.jl:187
  [6] top-level scope
    @ C:\Users\umbe\OneDrive\Script\UL\UL_opt.jl:222
  [7] include(fname::String)
    @ Base.MainInclude .\client.jl:451
  [8] top-level scope
    @ C:\Users\umbe\OneDrive\Script\Main.jl:40
  [9] eval
    @ .\boot.jl:373 [inlined]
 [10] include_string(mapexpr::typeof(identity), mod::Module, code::String, filename::String)
    @ Base .\loading.jl:1196
in expression starting at C:\Users\umbe\OneDrive\Script\UL\UL_opt.jl:222      
in expression starting at C:\Users\umbe\OneDrive\Script\Main.jl:40

I am having trouble understanding what went wrong… It may be related to the execution of the relaxed version of the problem, which my code falls back to when the original problem turns out to be infeasible, but I do not know how to fix it.
Here, instead, I found that the error

Unhandled Task ERROR: EOFError: read end of file

is the error you get when one of the parallel workers hits an error, but again, since I do not know what causes it, I do not know what I should do.

Can anyone help me, please?

GRBfeasrelax at C:\Users\umbe\.julia\packages\Gurobi\FliRK\src\gen91\libgrb_api.jl:308

Are you calling the C API of Gurobi?

There can be a number of causes for an error like this. It’s hard to tell without a reproducible example. Can you share the code?

Thank you for your answer, @odow. The code is a complete mess and I am afraid it is not very readable… However, I can summarize the Gurobi optimization:

model = direct_model(Gurobi.Optimizer(env))
# independent variables definition
# objective function definition
# constraints definition
JuMP.optimize!(model)

if JuMP.termination_status(model) != OPTIMAL # build and solve a partially relaxed version of the problem
    grb_model = backend(model)
    feasobjP = Ref{Cdouble}()
    # C signature: GRBfeasrelax(model, relaxobjtype, minrelax, lbpen, ubpen, rhspen, feasobjP);
    # rhspen holds one violation penalty per constraint row
    check_rlx = GRBfeasrelax(grb_model, 0, 1, C_NULL, C_NULL, rhspen, feasobjP)

    if check_rlx == 0 # the relaxation was built successfully (0 is Gurobi's "no error" code)
        grb_term_stat = GRBoptimize(grb_model) # returns an error code, not a termination status
        OFres_rlx = feasobjP[]
    end
else
    OFres = JuMP.objective_value(model)
end

In addition, I have done some "experiments" and found that this issue does not appear if I omit some constraints that can often (but not always) make the problem infeasible (both the original version and the partially relaxed one). However, this seems strange to me, since infeasibility can be handled by GRBoptimize…
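To clarify what I mean: a plain infeasible solve returns cleanly at the C level. A minimal sketch of how the status can be checked through the C API wrapped by Gurobi.jl (GRB_INFEASIBLE is the Gurobi C status constant, value 3):

ret = GRBoptimize(grb_model) # returns an error code; it is 0 even when the model is infeasible
status = Ref{Cint}()
GRBgetintattr(grb_model, "Status", status) # query the optimization status attribute
if status[] == GRB_INFEASIBLE
    # infeasibility is reported through the status, not through a crash
end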

Does the error happen if you don't use @distributed?
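That is, evaluating the population with a plain serial loop, something like this sketch based on your par_OF above:

for i in 1:length(population)
    fitness_values[i] = OF(population[i])
end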

I don’t see any immediate issues, so it could be something else in your code, a bug in Gurobi.jl, a bug in Gurobi, or a bug in Julia. It’s impossible to say without a reproducible example.

The issue seems related to GRBfeasrelax, so you could try using JuMP.relax_with_penalty! instead: Debugging · JuMP
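For reference, a minimal sketch of that approach (assuming a uniform penalty for every constraint; relax_with_penalty! modifies the model in place and returns a map from each original constraint to its penalty term):

using JuMP, Gurobi

model = Model(Gurobi.Optimizer)
@variable(model, x >= 0)
@constraint(model, x <= 1)
@constraint(model, x >= 2) # conflicts with the previous constraint, so the model is infeasible

optimize!(model)
if termination_status(model) != OPTIMAL
    penalty_map = relax_with_penalty!(model; default = 1.0) # soften every constraint with penalty 1.0
    optimize!(model) # re-solve the penalized version of the model
end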

Apparently the error does not appear if I do not use @distributed.
Another thing I noticed is that I had not correctly updated the vector rhspen, which contains the penalties to apply to the constraint violations, so it had fewer elements than the actual number of constraints. I do not know if this had an influence, but since this correction the error has not appeared anymore (even with all the constraints that I need).
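For anyone hitting the same thing: GRBfeasrelax reads one penalty per constraint row, so an undersized rhspen can mean an out-of-bounds read inside the C library. A minimal sketch of how I now size it, querying the row count from the model instead of hard-coding it (the penalty value 10.0 is just an arbitrary example):

num_constrs = Ref{Cint}()
GRBgetintattr(grb_model, "NumConstrs", num_constrs) # number of constraint rows in the model
rhspen = fill(10.0, num_constrs[]) # one violation penalty per row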
This is a good thing for me, because it lets the code run without crashing, even if the nature of the problem is still not clear to me.
I understand that a minimal working example is fundamental for receiving help, and I will try to provide one if the issue comes back. For the moment, thank you very much for your support!


Oh, if the vector was the wrong size, then that is the problem.
