Is JuMP.@variable slow when the number gets large?

I have a unit commitment problem (24 hours, 5000 scenarios, IEEE 118-bus system) that has these two lines:

@time JuMP.@variable(m, pA[i=_2(t,T,1), _0(i,t,S), agi])
@time JuMP.@variable(m, pf[i=_2(t,T,0), _0(i,t,S), 1:size(F,1)])

where

_3(i, t, s) = (s-1)*(i>t)+1 # returns `s` if i > t, else `1`
_0(i, t, S) = 1:_3(i, t, S) # scenario range; S in this case is 5000
_2(t, T, p) = t-p:t+T # time range, padded by p at the front
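For concreteness, here is what these helpers produce with small hypothetical values (not the real t, T, S):

```julia
# The three indexing helpers from above, with tiny illustrative values.
_3(i, t, s) = (s - 1) * (i > t) + 1   # `s` if i > t, else 1
_0(i, t, S) = 1:_3(i, t, S)           # scenario range: 1:1 (first stage) or 1:S
_2(t, T, p) = t-p:t+T                 # time range, padded by p at the front

t, T, S = 1, 3, 4
_2(t, T, 1)      # 0:4  -- time indices including one pre-period
_0(t, t, S)      # 1:1  -- a single first-stage copy
_0(t + 1, t, S)  # 1:4  -- one copy per scenario in later periods
```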

where F is the PTDF matrix and agi is a Vector{Int}.
The length of pA is approximately 24 * 5000 * 13 == 1560000,
whereas the length of pf is approximately 24 * 5000 * 186 == 22320000.

And their runtime performance is

adding pA variables...
  8.563364 seconds (36.34 M allocations: 1.792 GiB, 29.52% gc time, 2.12% compilation time)
adding pf variable...
146.426164 seconds (535.10 M allocations: 30.329 GiB, 39.43% gc time, 0.11% compilation time)

Note that this code is inside function bodies (in my custom package), so it had already been compiled.

Do you think taking 8.5 seconds to add 1.5 million variables is reasonable?

Also, linear scaling from pA would predict 8.56 / 13 * 186 ≈ 122.5 seconds for pf, but it actually takes 146.4 seconds.

My model is a Gurobi direct model.

I think building large models (in my case, with a large number of scenarios) is a real bottleneck: Gurobi can almost solve the model in 1 hour, but the modeling code runs for nearly 2 hours. I wonder what a good solution would be.

julia> JuMP.optimize!(m)
Gurobi Optimizer version 13.0.1 build v13.0.1rc0 (linux64gpu - "Ubuntu 24.04.4 LTS")

CPU model: AMD EPYC 7763 64-Core Processor, instruction set [SSE2|AVX|AVX2]
Thread count: 128 physical cores, 256 logical processors, using up to 32 threads

GPU model: NVIDIA RTX A6000, CUDA compute version 8.6, NVIDIA driver compatible with CUDA version 13

Non-default parameters:
TimeLimit  3600
Method  6
PDHGGPU  1
Threads  32

Optimize a model with 65080188 rows, 63480926 columns and 1130514176 nonzeros (Min)
Model fingerprint: 0x608a9aa9
Model has 0 linear objective coefficients
Variable types: 54510250 continuous, 8970676 integer (0 binary)
Coefficient statistics:
  Matrix range     [8e-06, 2e+01]
  Objective range  [0e+00, 0e+00]
  Bounds range     [2e-04, 6e+00]
  RHS range        [2e-07, 5e+01]

Presolve removed 0 rows and 0 columns (presolve time = 50s)...
Presolve removed 230001 rows and 231033 columns (presolve time = 73s)...
Presolve removed 42424933 rows and 42156055 columns (presolve time = 2295s)...
Presolve time: 2294.90s
Presolved: 22655255 rows, 21324871 columns, 176926424 nonzeros
Variable types: 12354819 continuous, 8970052 integer (8970052 binary)

Start PDHG on GPU

                       Objective                Residual
     Iter       Primal          Dual         Primal    Dual     Compl    Time
        0   0.00000000e+00  0.00000000e+00  5.25e+01 0.00e+00  0.00e+00 2962s
      202   0.00000000e+00  0.00000000e+00  5.25e+01 0.00e+00  0.00e+00 2967s
     1651   0.00000000e+00  4.58872503e-07  4.34e-09 1.66e-06  3.55e-12 3006s
     1751   0.00000000e+00 -7.24419036e-09  1.49e-09 2.59e-08  1.83e-18 3008s

PDHG solved model in 1751 iterations and 3008.98 seconds (5455.37 work units)
Optimal objective 0.00000000e+00


Root crossover log...

  166145 DPushes remaining with DInf 0.0000000e+00              3076s

 13843535 PPushes remaining with PInf 2.0464937e-05              3088s
 10311631 PPushes remaining with PInf 2.0047059e-04              3111s
 3091561 PPushes remaining with PInf 9.9011976e-05              3591s
Crossover changed status from Optimal to Time Limit

Root relaxation: time limit, 10783309 iterations, 871.47 seconds (2725.82 work units)

    Nodes    |    Current Node    |     Objective Bounds      |     Work
 Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

     0     0          -    0               -    0.00000      -     - 3611s

Explored 1 nodes (10783309 simplex iterations) in 3613.65 seconds (5600.42 work units)
Thread count was 32 (of 256 available processors)

Solution count 0

Time limit reached
Best objective -, best bound 0.000000000000e+00, gap -

User-callback calls 413829, time in user-callback 0.32 sec

julia> JuMP.solution_summary(m)
solution_summary(; result = 1, verbose = false)
├ solver_name          : Gurobi
├ Termination
│ ├ termination_status : TIME_LIMIT
│ ├ result_count       : 0
│ ├ raw_status         : Optimization terminated because the time expended exceeded the value specified in the TimeLimit parameter.
│ └ objective_bound    : 0.00000e+00
├ Solution (result = 1)
│ ├ primal_status        : NO_SOLUTION
│ ├ dual_status          : NO_SOLUTION
│ └ relative_gap         : 1.00000e+100
└ Work counters
  ├ solve_time (sec)   : 3.61365e+03
  ├ simplex_iterations : 10783309
  ├ barrier_iterations : 0
  └ node_count         : 1

julia> m
A JuMP Model
├ mode: DIRECT
├ solver: Gurobi
├ objective_sense: MIN_SENSE
│ └ objective_function_type: JuMP.AffExpr
├ num_variables: 63480926
├ num_constraints: 103492279
│ ├ JuMP.VariableRef in MOI.GreaterThan{Float64}: 17710727
│ ├ JuMP.AffExpr in MOI.EqualTo{Float64}: 49250187
│ ├ JuMP.VariableRef in MOI.LessThan{Float64}: 11730688
│ ├ JuMP.AffExpr in MOI.GreaterThan{Float64}: 230001
│ ├ JuMP.AffExpr in MOI.LessThan{Float64}: 15600000
│ └ JuMP.VariableRef in MOI.Integer: 8970676
└ Names registered in the model
  └ :p, :pA, :pf, :r, :rint, :u, :ur, :v, :x, :ϖ

It doesn’t look like it can solve in an hour? You hit a time limit.

Did you disable string names?
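For reference, a minimal sketch of what I mean (set_string_names_on_creation is JuMP's switch for this, and it is a common first step for large models; no solver is needed to try it):

```julia
using JuMP  # assumes JuMP is installed

model = Model()
set_string_names_on_creation(model, false)  # skip building "pA[1,2,3]"-style names
@variable(model, x[1:3])                    # variables are created without string names
name(x[1])                                  # "" -- no name was stored
```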

Let me have a try.

The Presolve+PDHGGPU part is reasonably fast, but the crossover part is somewhat slow since the system is large.

But that’s just the root node. You’re a long way from even finding a feasible solution. Your model is incredibly large, so I’m not particularly surprised it takes a long time to build. At that scale JuMP might not be the right tool for the job.

Maybe I should use the C API? Like this:

julia> function test(N)
           m = Settings.Model() # a JuMP Gurobi direct model
           JuMP.unset_silent(m)
           o = m.moi_backend
           v = fill(0., N)
           t = fill(Int8(67), N) # variable type: 67 == Cchar('C'), i.e. all CONTINUOUS
           @time JuMP.@variable(m, [1:N]) # via JuMP
           @time Settings.addvars(o,N,v,v,v,t) # via Gurobi.GRBaddvars
           Settings.opt_and_ter(m)
       end;

julia> test(3) # let it compile for the first run
Set parameter OutputFlag to value 1
  0.022536 seconds (35.39 k allocations: 1.754 MiB, 99.56% compilation time)
  0.010390 seconds (13.19 k allocations: 650.391 KiB, 99.63% compilation time)
2 # OPTIMAL STATUS CODE

julia> test(100_000_000) # start a test
Set parameter OutputFlag to value 1
127.085007 seconds (300.00 M allocations: 23.548 GiB, 42.83% gc time)
 28.775192 seconds (1 allocation: 16 bytes)
Gurobi Optimizer version 13.0.1 build v13.0.1rc0 (linux64gpu - "Ubuntu 24.04.4 LTS")

CPU model: AMD EPYC 7763 64-Core Processor, instruction set [SSE2|AVX|AVX2]
Thread count: 128 physical cores, 256 logical processors, using up to 1 threads

GPU model: NVIDIA RTX A6000, CUDA compute version 8.6, NVIDIA driver compatible with CUDA version 13

Non-default parameters:
Threads  1

Optimize a model with 0 rows, 200000000 columns and 0 nonzeros (Min)
Model fingerprint: 0x58ab99dd
Model has 0 linear objective coefficients
Coefficient statistics:
  Matrix range     [0e+00, 0e+00]
  Objective range  [0e+00, 0e+00]
  Bounds range     [0e+00, 0e+00]
  RHS range        [0e+00, 0e+00]

Presolve time: 26.98s
Iteration    Objective       Primal Inf.    Dual Inf.      Time
       0      handle free variables                         39s

Solved in 0 iterations and 51.90 seconds (5.11 work units)
Optimal objective  0.000000000e+00
2

And here is an example that uses Gurobi.jl only (I think it’s my first time writing in this style).

import Gurobi
module Settings
    import Gurobi
    addvars(o,N,ov,lv,uv,tv) = Gurobi.GRBaddvars(o,N,0,C_NULL,C_NULL,C_NULL,ov,lv,uv,tv,C_NULL) # add N columns at once, no constraint coefficients, no names
    addvar(o,obj,l,u,t) = Gurobi.GRBaddvar(o,0,C_NULL,C_NULL,obj,l,u,t,C_NULL) # add a single column
    const _Default = Dict{String, Any}("OutputFlag" => 0, "Threads" => 1) # "Method" => 6, "PDHGGPU" => 1
    Env() = Gurobi.Env(_Default)
    Model() = Gurobi.Optimizer(Env())
end;
function test()
    m = Settings.Model()
    e = Gurobi.GRBgetenv(m)
    Gurobi.GRBsetintparam(e, "OutputFlag", 1)
    Settings.addvars(m,4,C_NULL,C_NULL,C_NULL,C_NULL)
    Gurobi.GRBupdatemodel(m)
    ci = Cint[0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
    cd = Cdouble[3, 3, 4, -5, 0, 0, 6, -2, 3, -2, 3, 2, 0, 8, -1, -5]
    bg = Cint[0, 4, 8, 12]
    sn = fill(Int8(61), 4) # sense is ==
    b = Cdouble[1, 2, 3, 4]
    Gurobi.GRBaddconstrs(m, 4, 16, bg, ci, cd, sn, b, C_NULL)
    Gurobi.GRBsetdblparam(e, "TimeLimit", 7.)
    Gurobi.GRBoptimize(m)
    t = Ref{Cint}()
    Gurobi.GRBgetintattr(m, "Status", t)
    t.x == 2 || error()
    x = fill(NaN, 4)
    Gurobi.GRBgetdblattrarray(m, "X", 0, 4, x)
    a = [3 3 4 -5; 0 0 6 -2; 3 -2 3 2; 0 8 -1 -5] # cf. `cd`
    a * x - b
end;
test()
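For anyone reading along: the bg/ci/cd arguments to GRBaddconstrs form a CSR (compressed sparse row) triple, with 0-based indices as the C API requires. The snippet above keeps explicit zero coefficients, which works; dropping them is the usual practice. A pure-Julia sketch of building the triple from a dense matrix (to_csr is a hypothetical helper, not part of Gurobi.jl):

```julia
# Build the CSR triple (cbeg, cind, cval) that GRBaddconstrs expects
# from a dense constraint matrix A, skipping zero coefficients.
function to_csr(A::AbstractMatrix)
    cbeg = Cint[]; cind = Cint[]; cval = Cdouble[]
    for i in axes(A, 1)
        push!(cbeg, Cint(length(cind)))        # offset where row i starts
        for j in axes(A, 2)
            if A[i, j] != 0
                push!(cind, Cint(j - 1))       # 0-based column index
                push!(cval, Cdouble(A[i, j]))
            end
        end
    end
    cbeg, cind, cval
end

A = [3 3 4 -5; 0 0 6 -2; 3 -2 3 2; 0 8 -1 -5]  # same matrix as in `test()` above
cbeg, cind, cval = to_csr(A)
```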

fill(Int8(61), 4)

Note that you can do fill(Cchar('='), 4) instead:

julia> N = 3
3

julia> fill(Int8(61), N)
3-element Vector{Int8}:
 61
 61
 61

julia> fill(Cchar('='), N)
3-element Vector{Int8}:
 61
 61
 61
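The other Gurobi constraint senses are plain ASCII characters too, so Cchar is self-documenting:

```julia
# Gurobi senses are ASCII codes; on this platform Cchar is Int8.
sense_le = Cchar('<')  # 60, meaning <=
sense_eq = Cchar('=')  # 61, meaning ==
sense_ge = Cchar('>')  # 62, meaning >=
fill(sense_eq, 2)      # [61, 61]
```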

I gave up, because I don’t think it’s sane for a human to write a realistic model with the pure C API, without JuMP. (Although it can be useful in certain crucial places.)


I think JuMP is still the right tool, but maybe the modeler (e.g. me) needs to figure out how to write efficient code.

I wrote modeling code for the UC problem (IEEE 118-bus system: 118 nodes, 186 lines; 24 hours; two-stage stochastic with 10000 scenarios). It took JuMP 2713.6 seconds to build that model in monolithic form (=mono). Notably, 46.62% of the modeling time was spent in GC (how can that be improved?).
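On the GC question: one common source of allocation pressure when building expressions in plain function code (outside JuMP's macros, which already rewrite sums in place) is repeated `+` or `+=` on AffExprs, which allocates a fresh expression per term; add_to_expression! mutates a single expression instead. A hedged sketch, not my actual code:

```julia
using JuMP  # assumes JuMP is installed; no solver is needed to build the model

model = Model()
@variable(model, x[1:1000])
c = collect(1.0:1000.0)

# Naive loops with `expr += c[i] * x[i]` allocate a new AffExpr each iteration.
# In-place style: one AffExpr, mutated as we go.
expr = AffExpr(0.0)
for i in 1:1000
    add_to_expression!(expr, c[i], x[i])  # expr += c[i] * x[i], no intermediates
end
@constraint(model, expr <= 1.0)
```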

My memory resource is

julia> Sys.total_memory() / 1024^3
503.6526107788086

The experiment is like

julia> mono = Settings.Model();

julia> @time Uc.Mono.Model!(mono,t,T,F,GD,WD,DD,uvH,xH,pAH;PfMax=6.47,KupRe=0.1,SuCostK=20.,LsCost=50.,WcCost=5.);
2713.660804 seconds (3.35 G allocations: 766.005 GiB, 46.62% gc time, 229754 lock conflicts)

julia> JuMP.optimize!(mono)
Gurobi Optimizer version 13.0.1 build v13.0.1rc0 (linux64gpu - "Ubuntu 24.04.4 LTS")

CPU model: AMD EPYC 7763 64-Core Processor, instruction set [SSE2|AVX|AVX2]
Thread count: 128 physical cores, 256 logical processors, using up to 1 threads

GPU model: NVIDIA RTX A6000, CUDA compute version 8.6, NVIDIA driver compatible with CUDA version 13

Non-default parameters:
SoftMemLimit  400
Method  6
PDHGGPU  1
Threads  1

Optimize a model with 64300279 rows, 86250375 columns and 4891590992 nonzeros (Min)
Model fingerprint: 0xe1e5c5b8
Model has 0 linear objective coefficients
Variable types: 86250336 continuous, 39 integer (39 binary)
Coefficient statistics:
  Matrix range     [2e-06, 2e+01]
  Objective range  [0e+00, 0e+00]
  Bounds range     [2e-04, 6e+00]
  RHS range        [4e-09, 5e+01]

Presolve removed 0 rows and 0 columns (presolve time = 209s)...
Presolve removed 0 rows and 0 columns (presolve time = 241s)...
Presolve removed 0 rows and 0 columns (presolve time = 284s)...
Presolve removed 1150005 rows and 1150493 columns (presolve time = 301s)...
Presolve removed 1150007 rows and 1150493 columns (presolve time = 319s)...
Presolve removed 1150007 rows and 1150493 columns (presolve time = 440s)...
Presolve removed 1610009 rows and 1610495 columns (presolve time = 440s)...
Presolve removed 1610009 rows and 1610495 columns
Presolve time: 441.26s

Explored 0 nodes (0 simplex iterations) in 1125.56 seconds (1136.77 work units)
Thread count was 1 (of 256 available processors)

Solution count 0

Memory limit reached
Best objective -, best bound -, gap -

User-callback calls 452733, time in user-callback 0.33 sec

julia> mono
A JuMP Model
├ mode: DIRECT
├ solver: Gurobi
├ objective_sense: FEASIBILITY_SENSE
├ num_variables: 86250375
├ num_constraints: 224840938
│ ├ JuMP.VariableRef in MOI.GreaterThan{Float64}: 83260323
│ ├ JuMP.AffExpr in MOI.EqualTo{Float64}: 48990213
│ ├ JuMP.VariableRef in MOI.LessThan{Float64}: 77280297
│ ├ JuMP.AffExpr in MOI.GreaterThan{Float64}: 230001
│ ├ JuMP.AffExpr in MOI.LessThan{Float64}: 15080065
│ └ JuMP.VariableRef in MOI.ZeroOne: 39

The modeling process is as follows:

  1. I create all the variables sequentially (I don’t know how this step could be parallelized).
  2. I add the technical constraints of the generators sequentially, since these constraints are small and sparse.
  3. I spawn S=10000 tasks to add the system-level constraints (power balance, line flow limits). There are 186 lines in the power system, and I think this is the heaviest part.

I can share the code with you if you’re interested.

I tried another solve with S=8000; Gurobi ran for about 3800 seconds in the presolve phase and finally still reported "Memory limit reached". I don’t understand whether this means my GPU memory is insufficient, or something else?

You should contact Gurobi support.

Since you’re running out of memory in presolve, I assume it’s unrelated to your GPU.

Your problem is just likely too big. I don’t think it’s really possible to solve a MIP with nearly 10^8 variables and constraints.

Yes, I reset PDHGGPU to 0 and it still reports the memory limit, so it’s not due to the GPU. The presolve consumes a lot of memory (100+ GB, I guess).

I just intended to confirm that the monolithic formulation of a 2SSP with large S is not practical; otherwise we wouldn’t need to study decomposition algorithms.