Sudden stop of parallel optimization with JuMP and Juniper

I have an MINLP problem that I model in JuMP and solve in parallel using Juniper, with Cbc and Ipopt as the MIP and NLP solvers, respectively. I cannot provide an MWE, but my code works fine for shorter simulations (which might hint at where the issue lies). However, I can give an overall description of the scripts:

  • Script 1: optFunc.jl
using JuMP, Juniper, Ipopt, Cbc, NLopt, DataInterpolations, MAT

function optFunc(input1, ..., inputn)

    # some code using JuMP
    @variable
    @constraint
    @objective

    return model

end

  • Script 2: mySystem.jl
function mySystem(inputSys1, ..., inputSysn)

    # simple Julia code that simulates my system

    return myOutput

end

  • Main script: main.jl
@everywhere begin
using JuMP, Plots  # and other packages
include("optFunc.jl")
include("mySystem.jl")

# Define a bunch of variables for later

# i = 1:myLimit
end

# Note: this is already outside everywhere
for i = 1:myLimit

    # Optimize
    model = optFunc(input1, ..., inputn)
    JuMP.optimize!(model)

    # Store results
    myOptSol = value.(model[:myOptVariable])

    # Simulate my system
    mySimSol = mySystem(myOptSol)

    # Set mySimSol as the initial conditions for the next optimizations
    input1, ..., inputn = mySimSol

end

In my optimization problem, I have 756 variables, of which 180 are binary. The cost and memory footprint of the simulation are negligible compared with the optimization. myLimit = 150 in this test, but it could potentially be much larger.

To run main.jl in parallel I do the following in Julia’s REPL:
1 - cd("mypath")
2 - using Distributed
3 - addprocs(10)
4 - @everywhere using Pkg
5 - @everywhere Pkg.activate("myEnv")
6 - include("main.jl")

When I run this with loose constraints it works fine. However, when I make the constraints tighter and the optimization becomes challenging, some of the workers stop and I get the following:

From worker 3:
      From worker 3:    Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 3:    Exception: EXCEPTION_ACCESS_VIOLATION at 0x6f137880 -- mumps_cst_amf_ at C:\Users\user\.julia\artifacts\0316fdc27ab249eccb1f8e1c2fc3e111e8477070\bin\libmumps_common.dll (unknown line)
      From worker 3:    in expression starting at none:0
      From worker 3:    mumps_cst_amf_ at C:\Users\user\.julia\artifacts\0316fdc27ab249eccb1f8e1c2fc3e111e8477070\bin\libmumps_common.dll (unknown line)
      From worker 3:    __dmumps_ana_aux_m_MOD_dmumps_ana_f at C:\Users\user\.julia\artifacts\0316fdc27ab249eccb1f8e1c2fc3e111e8477070\bin\libdmumps.dll (unknown line)
      From worker 3:    dmumps_ana_driver_ at C:\Users\user\.julia\artifacts\0316fdc27ab249eccb1f8e1c2fc3e111e8477070\bin\libdmumps.dll (unknown line)
      From worker 3:    dmumps_ at C:\Users\user\.julia\artifacts\0316fdc27ab249eccb1f8e1c2fc3e111e8477070\bin\libdmumps.dll (unknown line)
      From worker 3:    .text at C:\Users\user\.julia\artifacts\0316fdc27ab249eccb1f8e1c2fc3e111e8477070\bin\libdmumps.dll (unknown line)
      From worker 3:    dmumps_c at C:\Users\user\.julia\artifacts\0316fdc27ab249eccb1f8e1c2fc3e111e8477070\bin\libdmumps.dll (unknown line)
      From worker 3:    _ZN5Ipopt20MumpsSolverInterface21SymbolicFactorizationEv at C:\Users\user\.julia\artifacts\6b0cdbf534d67d502bb3bdbbbcc79a89bcf10f7f\bin\libipopt-3.dll (unknown line)
      From worker 3:    _ZN5Ipopt20MumpsSolverInterface10MultiSolveEbPKiS2_iPdbi at C:\Users\user\.julia\artifacts\6b0cdbf534d67d502bb3bdbbbcc79a89bcf10f7f\bin\libipopt-3.dll (unknown line)
      From worker 3:    _ZN5Ipopt16TSymLinearSolver10MultiSolveERKNS_9SymMatrixERSt6vectorINS_8SmartPtrIKNS_6VectorEEESaIS8_EERS4_INS5_IS6_EESaISC_EEbi at C:\Users\user\.julia\artifacts\6b0cdbf534d67d502bb3bdbbbcc79a89bcf10f7f\bin\libipopt-3.dll (unknown line)
      From worker 3:    _ZN5Ipopt18StdAugSystemSolver10MultiSolveEPKNS_9SymMatrixEdPKNS_6VectorEdS6_dPKNS_6MatrixES6_dS9_S6_dRSt6vectorINS_8SmartPtrIS5_EESaISC_EESF_SF_SF_RSA_INSB_IS4_EESaISG_EESJ_SJ_SJ_bi at C:\Users\user\.julia\artifacts\6b0cdbf534d67d502bb3bdbbbcc79a89bcf10f7f\bin\libipopt-3.dll (unknown line)
      From worker 3:    _ZN5Ipopt15AugSystemSolver5SolveEPKNS_9SymMatrixEdPKNS_6VectorEdS6_dPKNS_6MatrixES6_dS9_S6_dRS5_SA_SA_SA_RS4_SB_SB_SB_bi at C:\Users\user\.julia\artifacts\6b0cdbf534d67d502bb3bdbbbcc79a89bcf10f7f\bin\libipopt-3.dll (unknown line)
      From worker 3:    _ZN5Ipopt22LeastSquareMultipliers20CalculateMultipliersERNS_6VectorES2_ at C:\Users\user\.julia\artifacts\6b0cdbf534d67d502bb3bdbbbcc79a89bcf10f7f\bin\libipopt-3.dll (unknown line)
      From worker 3:    _ZN5Ipopt25DefaultIterateInitializer18least_square_multsERKNS_10JournalistERNS_8IpoptNLPERNS_9IpoptDataERNS_25IpoptCalculatedQuantitiesERKNS_8SmartPtrINS_22EqMultiplierCalculatorEEEd at C:\Users\user\.julia\artifacts\6b0cdbf534d67d502bb3bdbbbcc79a89bcf10f7f\bin\libipopt-3.dll (unknown line)
      From worker 3:    _ZN5Ipopt25DefaultIterateInitializer18SetInitialIteratesEv at C:\Users\user\.julia\artifacts\6b0cdbf534d67d502bb3bdbbbcc79a89bcf10f7f\bin\libipopt-3.dll (unknown line)
      From worker 3:    _ZN5Ipopt14IpoptAlgorithm18InitializeIteratesEv at C:\Users\user\.julia\artifacts\6b0cdbf534d67d502bb3bdbbbcc79a89bcf10f7f\bin\libipopt-3.dll (unknown line)
      From worker 3:    _ZN5Ipopt14IpoptAlgorithm8OptimizeEb at C:\Users\user\.julia\artifacts\6b0cdbf534d67d502bb3bdbbbcc79a89bcf10f7f\bin\libipopt-3.dll (unknown line)
      From worker 3:    _ZN5Ipopt16IpoptApplication13call_optimizeEv at C:\Users\user\.julia\artifacts\6b0cdbf534d67d502bb3bdbbbcc79a89bcf10f7f\bin\libipopt-3.dll (unknown line)
      From worker 3:    _ZN5Ipopt16IpoptApplication11OptimizeNLPERKNS_8SmartPtrINS_3NLPEEERNS1_INS_16AlgorithmBuilderEEE at C:\Users\user\.julia\artifacts\6b0cdbf534d67d502bb3bdbbbcc79a89bcf10f7f\bin\libipopt-3.dll (unknown line)
      From worker 3:    _ZN5Ipopt16IpoptApplication11OptimizeNLPERKNS_8SmartPtrINS_3NLPEEE at C:\Users\user\.julia\artifacts\6b0cdbf534d67d502bb3bdbbbcc79a89bcf10f7f\bin\libipopt-3.dll (unknown line)
      From worker 3:    _ZN5Ipopt16IpoptApplication12OptimizeTNLPERKNS_8SmartPtrINS_4TNLPEEE at C:\Users\user\.julia\artifacts\6b0cdbf534d67d502bb3bdbbbcc79a89bcf10f7f\bin\libipopt-3.dll (unknown line)
      From worker 3:    IpoptSolve at C:\Users\user\.julia\artifacts\6b0cdbf534d67d502bb3bdbbbcc79a89bcf10f7f\bin\libipopt-3.dll (unknown line)
      From worker 3:    solveProblem at C:\Users\user\.julia\packages\Ipopt\vtrOr\src\Ipopt.jl:532
      From worker 3:    optimize! at C:\Users\user\.julia\packages\Ipopt\vtrOr\src\MOI_wrapper.jl:1713
      From worker 3:    optimize! at C:\Users\user\.julia\packages\MathOptInterface\YDdD3\src\Bridges\bridge_optimizer.jl:319
      From worker 3:    #process_node!#84 at C:\Users\user\.julia\packages\Juniper\8wso7\src\BnBTree.jl:69
      From worker 3:    process_node! at C:\Users\user\.julia\packages\Juniper\8wso7\src\BnBTree.jl:56 [inlined]
      From worker 3:    #branch!#85 at C:\Users\user\.julia\packages\Juniper\8wso7\src\BnBTree.jl:161
      From worker 3:    branch! at C:\Users\user\.julia\packages\Juniper\8wso7\src\BnBTree.jl:117
      From worker 3:    unknown function (ip: 0000000006c870b8)
      From worker 3:    one_branch_step! at C:\Users\user\.julia\packages\Juniper\8wso7\src\BnBTree.jl:320
      From worker 3:    unknown function (ip: 0000000006c6da63)
      From worker 3:    jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1691 [inlined]
      From worker 3:    do_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\builtins.c:674
      From worker 3:    #106 at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Distributed\src\process_messages.jl:294
      From worker 3:    run_work_thunk at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Distributed\src\process_messages.jl:79
      From worker 3:    macro expansion at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Distributed\src\process_messages.jl:294 [inlined]
      From worker 3:    #105 at .\task.jl:356
      From worker 3:    unknown function (ip: 00000000533ccfa3)
      From worker 3:    jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1691 [inlined]
      From worker 3:    start_task at /cygdrive/c/buildbot/worker/package_win64/build/src\task.c:721
      From worker 3:    Allocations: 218368401 (Pool: 218233746; Big: 134655); GC: 151

After this happens in several workers, I get the error:

ERROR: LoadError: TaskFailedException:
ProcessExitedException(3)
Stacktrace:
 [1] try_yieldto(::typeof(Base.ensure_rescheduled)) at .\task.jl:656
 [2] wait at .\task.jl:713 [inlined]
 [3] wait(::Base.GenericCondition{ReentrantLock}) at .\condition.jl:106
 [4] take_buffered(::Channel{Any}) at .\channels.jl:387
 [5] take!(::Channel{Any}) at .\channels.jl:381
 [6] take!(::Distributed.RemoteValue) at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Distributed\src\remotecall.jl:599
 [7] remotecall_fetch(::Function, ::Distributed.Worker, ::Nothing, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Distributed\src\remotecall.jl:390
 [8] remotecall_fetch(::Function, ::Distributed.Worker, ::Nothing, ::Vararg{Any,N} where N) at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Distributed\src\remotecall.jl:386
 [9] remotecall_fetch(::Function, ::Int64, ::Nothing, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Distributed\src\remotecall.jl:421
 [10] remotecall_fetch(::Function, ::Int64, ::Nothing, ::Vararg{Any,N} where N) at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Distributed\src\remotecall.jl:421
 [11] macro expansion at C:\Users\user\.julia\packages\Juniper\8wso7\src\BnBTree.jl:515 [inlined]
 [12] (::Juniper.var"#96#100"{typeof(Juniper.one_branch_step!),Juniper.BnBTreeObj,Float64,Array{Symbol,1},Array{Int64,1},Juniper.TimeObj,Array{Symbol,1},Int64})() at .\task.jl:356

...and 1 more exception(s).

Stacktrace:
 [1] sync_end(::Channel{Any}) at .\task.jl:314
 [2] macro expansion at .\task.jl:333 [inlined]
 [3] pmap(::Function, ::Juniper.BnBTreeObj, ::Array{Any,1}, ::Float64, ::Array{Symbol,1}, ::Array{Int64,1}, ::Juniper.TimeObj) at C:\Users\user\.julia\packages\Juniper\8wso7\src\BnBTree.jl:482
 [4] solvemip(::Juniper.BnBTreeObj) at C:\Users\user\.julia\packages\Juniper\8wso7\src\BnBTree.jl:601
 [5] optimize!(::Juniper.Optimizer) at C:\Users\user\.julia\packages\Juniper\8wso7\src\MOI_wrapper\MOI_wrapper.jl:297
 [6] optimize!(::MathOptInterface.Bridges.LazyBridgeOptimizer{Juniper.Optimizer}) at C:\Users\user\.julia\packages\MathOptInterface\YDdD3\src\Bridges\bridge_optimizer.jl:319
 [7] optimize!(::MathOptInterface.Utilities.CachingOptimizer{MathOptInterface.AbstractOptimizer,MathOptInterface.Utilities.UniversalFallback{MathOptInterface.Utilities.GenericModel{Float64,MathOptInterface.Utilities.ModelFunctionConstraints{Float64}}}}) at C:\Users\user\.julia\packages\MathOptInterface\YDdD3\src\Utilities\cachingoptimizer.jl:252
 [8] optimize!(::Model, ::Nothing; bridge_constraints::Bool, ignore_optimize_hook::Bool, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at C:\Users\user\.julia\packages\JuMP\Xrr7O\src\optimizer_interface.jl:185
 [9] optimize! at C:\Users\user\.julia\packages\JuMP\Xrr7O\src\optimizer_interface.jl:157 [inlined] (repeats 2 times)
 [10] macro expansion at .\timing.jl:174 [inlined]
 [11] top-level scope at $path\main.jl:72
 [12] include(::String) at .\client.jl:457
 [13] top-level scope at REPL[6]:1
in expression starting at $path\main.jl:58

These errors seem to happen faster when I have more workers added, which seems reasonable if it is a memory issue. Anyway, I cannot make sense of these errors, especially since the code runs perfectly in other, simpler cases. Does this happen because I am running out of memory? Is there a way to release it within the for loop so I can have larger simulations?
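One way to test the memory hypothesis is to log free system memory after each iteration and force a garbage collection on every process. The following is only a sketch: `run_with_memory_log` and its `step` argument are hypothetical names, `step(i)` stands in for one optimize-and-simulate iteration, and it assumes all workers are local so that `Sys.free_memory()` on the main process reflects the whole machine. A forced collection only releases Julia-managed memory, not memory held inside the solver libraries.

```julia
using Distributed

# Hypothetical helper: run `step(i)` for i = 1:n, force a garbage collection on
# every process after each iteration, and record free system memory in GiB.
function run_with_memory_log(step, n)
    free_gib = Float64[]
    for i in 1:n
        step(i)  # stand-in for: build model, optimize!, simulate
        # Ask the main process and every worker to run a full collection.
        for pid in procs()
            remotecall_fetch(GC.gc, pid)
        end
        # ASSUMPTION: workers are local, so this reflects the shared machine.
        push!(free_gib, Sys.free_memory() / 2^30)
    end
    return free_gib
end

# Usage with a dummy workload in place of the real optimization:
memory_log = run_with_memory_log(i -> sum(rand(1000)), 3)
```

If the logged values stay roughly flat while workers still crash, memory exhaustion becomes a less likely explanation.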

The current version of Ipopt and MUMPS is not thread safe, so you cannot run Ipopt models in parallel.

There is an upstream fix, but I haven’t compiled it for Julia yet.

See the GitHub issue “Ipopt.jl is not thread safe? (segmentation fault was happened.)”, Issue #190 in jump-dev/Ipopt.jl.


I have had good success running Ipopt in parallel with Distributed’s pmap. I think I did not run into the threading issue because each process runs in a separate Julia instance.


Were you using MUMPS or HSL? If you use a different linear algebra library it works.

The issue might be something other than thread safety, but it’s hard to know without an MWE.

I usually run lots of very long-running jobs with HSL but do my testing with MUMPS, and I have not yet noticed a crash with Distributed. With threads, I think it does crash.

Thanks! I think you might be right, considering the number of times mumps appears in the workers’ errors.

I will try to put together an MWE that isolates this issue and can help to fix it.

How would you apply pmap to this problem?

My understanding is that pmap applies a function to different elements in parallel, but what I want to do is to speed up the solution of the MINLP optimization by dividing it among the different workers, i.e. the line JuMP.optimize!(model) in main.jl. My input to the function JuMP.optimize! is just one (model in this case), but I want to solve it as fast as possible. Is this also possible with pmap?
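For reference, plain `pmap` usage looks roughly like the sketch below. `slow_square` is a hypothetical stand-in for an expensive, independent task; it is not part of the original scripts. `pmap` distributes separate calls across workers, so on its own it does not split a single `JuMP.optimize!(model)` call.

```julia
using Distributed

# Define the task on every process so that workers can run it.
# `slow_square` is a hypothetical placeholder for an expensive, independent job.
@everywhere function slow_square(x)
    sleep(0.1)  # pretend this is costly
    return x^2
end

# Each element is handled by whichever process is free; results keep input order.
results = pmap(slow_square, 1:5)
```

With no workers added, the same code simply runs on the main process.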

Juniper uses a modified version of pmap internally:
https://github.com/lanl-ansi/Juniper.jl/blob/f88cfbeb016947fd8da1f71671363691f14b8e95/src/BnBTree.jl#L440-L548

@ccoffrin is saying that he doesn’t think this is normally an issue. So having a minimal working example of your problem would be good. Or, follow the instructions on Ipopt.jl to use a different linear algebra package.
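As a rough sketch of what switching the linear solver could look like, Ipopt's `linear_solver` option can be passed through JuMP's `optimizer_with_attributes` when building the NLP sub-solver for Juniper. This assumes an Ipopt build that can load HSL (the default binaries do not ship it) and uses `"ma27"` purely as an example value. The `"processors" => 10` setting mirrors the `addprocs(10)` call in the question and is likewise an assumption about how the original model was configured.

```julia
using JuMP, Juniper, Ipopt, Cbc

# NLP sub-solver: Ipopt with a linear solver other than the default MUMPS.
# ASSUMPTION: "ma27" only works if this Ipopt build can find an HSL library.
nl_solver = optimizer_with_attributes(
    Ipopt.Optimizer,
    "print_level" => 0,
    "linear_solver" => "ma27",
)

# MIP sub-solver, with Cbc's logging turned off.
mip_solver = optimizer_with_attributes(Cbc.Optimizer, "logLevel" => 0)

# Juniper combines both; "processors" sets how many processes branch in parallel.
juniper = optimizer_with_attributes(
    Juniper.Optimizer,
    "nl_solver" => nl_solver,
    "mip_solver" => mip_solver,
    "processors" => 10,
)

model = Model(juniper)
```

The exact steps for making HSL available depend on the Ipopt.jl version, so the package's own installation notes take precedence over this sketch.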

Thanks for clarifying this. Also thanks to @ccoffrin for the inputs.

I have been trying to create an MWE, but I am struggling to get one that fails consistently. In fact, my original code no longer fails either, which makes the previous errors even odder. I will keep trying in my spare time and let you know if I succeed.

Thanks for your comments!
