Memory consumption growth with many large MILPs in JuMP

We are running into an issue with JuMP models run in an API that is leading to our servers running out of RAM. Oddly, the memory consumption does appear to plateau after some time/iterations. The issue can also be recreated outside of the API. A MWE is:

using REoptLite, JuMP, CPLEX

for _ in range(1,stop=1000)
    m = Model(CPLEX.Optimizer)
    r = run_reopt(m, "outage.json")
end

I have also run this MWE with Xpress. The memory consumption with both solvers looks like this:


Each problem takes about 30 seconds (faster in Xpress as you can see in the chart).

Here is outage.json in case anyone wants to recreate the MWE (to meet the maximum character limit I had to replace an array for the load profile with a built-in load profile, so results could vary):

{
    "Site": {
        "longitude": -105.1686,
        "latitude": 39.7407
    },
    "ElectricUtility": {
        "outage_start_timesteps": [5100, 5200],
        "outage_durations": [10]
    },
    "Generator": {
        "existing_kw": 0.0,
        "min_turn_down_pct": 0.0,
        "only_runs_during_grid_outage": true,
        "sells_energy_back_to_grid": false
    },
    "PV": {
        "existing_kw": 3580.54,
        "array_type": 0
    },
    "ElectricLoad": {
        "doe_reference_name": "RetailStore",
        "annual_kwh": 10000000.0,
        "city": "LosAngeles"
    },
    "ElectricTariff": {
        "urdb_label": "5e611fec5457a3665e019406"
    },
    "Storage": {
        "can_grid_charge": false,
        "total_itc_pct": 0.26,
        "macrs_option_years": 5
    },
    "Financial": {
        "VoLL": 100.0,
        "two_party_ownership": true,
        "offtaker_tax_pct": 0.0,
        "owner_discount_pct": 0.1,
        "offtaker_discount_pct": 0.03
    }
}

Each problem only needs 1-2 GB of RAM, so the expected behavior is that the memory consumption should just bounce around in the 1-2 GB range, rather than climbing with iterations.

We have tried forcing garbage collection as suggested in Memory usage · Issue #185 · jump-dev/CPLEX.jl · GitHub. I wonder if it is a scoping issue?

I have also run the MWE within a function:

using REoptLite, JuMP, CPLEX

function iteration()
    m = Model(CPLEX.Optimizer)
    r = run_reopt(m, "outage.json")
end

for _ in range(1,stop=1000)
    iteration()
end

with similar results:

You can force the GC with GC.gc(). You could also try turning off bridges (you may need to rewrite any Interval constraints into two separate inequalities).

using REoptLite, JuMP, CPLEX

function iteration()
    m = Model(CPLEX.Optimizer; bridge_constraints = false)
    r = run_reopt(m, "outage.json")
    GC.gc()
end
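
Rewriting an Interval constraint into two inequalities would look roughly like this (a sketch with a placeholder variable and bounds, not REoptLite's actual constraints):

using JuMP

m = Model()
@variable(m, x)

# Instead of an Interval constraint like
#     @constraint(m, 1.0 <= 2.0 * x <= 10.0)
# add two one-sided constraints, which need no bridge:
@constraint(m, 2.0 * x >= 1.0)
@constraint(m, 2.0 * x <= 10.0)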

You could also try something like:

function iteration()
    cpx = CPLEX.Optimizer()
    m = direct_model(cpx)
    r = run_reopt(m, "outage.json")
    finalize(cpx)
end

Thank you, I will try both today (I'll run just 500 jobs and have results a bit sooner). I have tried GC.gc() in the API without any change in the memory growth, but I have not tried it in the MWE.

@odow does the finalize method work with all solvers? (We are using Xpress in production).

Yes, it should work for Xpress.
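
i.e., the same pattern as the CPLEX example above, just with Xpress swapped in (an untested sketch):

using REoptLite, JuMP, Xpress

function iteration()
    xpr = Xpress.Optimizer()      # keep a handle to the raw optimizer
    m = direct_model(xpr)
    r = run_reopt(m, "outage.json")
    finalize(xpr)                 # free the underlying Xpress problem
end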

The other option is that the memory issue is something unrelated to JuMP. (It's weird that CPLEX and Xpress are similar, which likely means it isn't a CPLEX.jl or Xpress.jl issue.) If each model is ~1 GB, you wouldn't expect it to grow gradually.

Is the JSON file staying open?

JSON.parsefile("outage.json"; use_mmap = false)  # use_mmap = false avoids keeping the file memory-mapped

Also: what are you doing with the results? Are they staying around somewhere?

Yes, agreed on the issue not being in CPLEX.jl or Xpress.jl (I ran both solvers to make sure that our problem is not in Xpress.jl).

I think that the inputs and results are good threads to pull on: in the API we are not actually opening a JSON file, but are receiving JSON (via an HTTP.jl server):

function job(req::HTTP.Request)
    d = JSON.parse(String(req.body))
    timeout = pop!(d, "timeout_seconds")
    tol = pop!(d, "tolerance")
    m = xpress_model(timeout, tol)
    @info "Starting REopt with timeout of $(timeout) seconds..."
    results = reopt(m, d)
    # fix our memory leak? https://github.com/jump-dev/CPLEX.jl/issues/185
    m = nothing
    GC.gc()
    @info "REopt model solved with status $(results["status"])."
    return HTTP.Response(200, JSON.json(results))
end

I could add d = nothing, but will that help given that d is redefined with every call to job? The results can also be several hundred MB, but is there a way to remove the results when we need to return them from job? (And does it matter since results is also redefined with each call to job?)
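
For concreteness, the version I have in mind looks something like this (just a sketch of the same job function; whether nulling locals that fall out of scope anyway makes any difference is exactly my question):

function job(req::HTTP.Request)
    d = JSON.parse(String(req.body))
    timeout = pop!(d, "timeout_seconds")
    tol = pop!(d, "tolerance")
    m = xpress_model(timeout, tol)
    results = reopt(m, d)
    response = HTTP.Response(200, JSON.json(results))  # serialize before dropping references
    m = nothing
    d = nothing
    results = nothing
    GC.gc()
    return response
end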

@odow I also just realized that in the API we are effectively setting bridge_constraints=false by using JuMP.direct_model, right?
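
i.e., I am treating these two setups as equivalent in terms of bridging (a sketch; my understanding is that direct mode bypasses both the cache and the bridges entirely):

using JuMP, Xpress

# automatic mode with bridging turned off:
m1 = Model(Xpress.Optimizer; bridge_constraints = false)

# direct mode: no cache and no bridges, constraints go straight to Xpress:
m2 = direct_model(Xpress.Optimizer())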

Here are the GC results, i.e. I ran:

using REoptLite, JuMP, Xpress

for _ in range(1,stop=500)
    m = Model(Xpress.Optimizer)
    r = run_reopt(m, "outage.json")
    GC.gc()
end


(Note that I ran 1,000 scenarios without GC).

Interesting that the memory usage still grows, but not as fast as without GC. Does that tell us anything?

I don't think this is a JuMP/Xpress problem. If each solve takes 30 seconds, then you solved 500 problems and only increased the RAM by ~500 MB, or ~1 MB/solve. Without the GC.gc(), it was ~2 GB, or ~4 MB/solve.

Is some data file being kept alive? I think you’re looking in the wrong place.

All the data can be seen in the function job above: the dictionary d of inputs and the dictionary of results. Should I set d=nothing as well as do something with results?

Also, while running the test with finalize(m) (where m is the Model(Xpress.Optimizer)) I have come across a new, maybe unrelated issue, reported as an issue on Xpress.jl.

What if you just hard-code one of the result dictionaries and run:

function fake_results(d)
    return JSON.parsefile("cached_results.json")
end

function job(req::HTTP.Request)
    d = JSON.parse(String(req.body))
    results = fake_results(d)
    return HTTP.Response(200, JSON.json(results))
end

That should isolate the problem to HTTP or JuMP.

Not surprised there is an issue calling finalize. You have to be a bit careful. It isn't really meant to be called like this.

Thanks @odow, it is super helpful to get these examples of how to isolate the problem. I have run exactly what you suggested, hitting the job endpoint 5,000 times in a loop. Here is the memory consumption of that container:


So it looks like it is neither HTTP nor the results.

I wonder if it could be related to the struct that we make from the dictionary/POST from users? It is a relatively large struct that is instantiated with every job.
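
(One quick way to check how much memory those objects actually retain would be Base.summarysize, e.g.:)

using JSON

d = JSON.parsefile("outage.json")
@show Base.summarysize(d)   # bytes retained by the parsed inputs

# and similarly for the big inputs struct and the results dictionary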

Here is what it looks like if I build 1,000 Models without a solver, i.e. run

using REoptLite, JuMP

for i in range(1,stop=1000)
    m = Model()
    build_reopt!(m, "outage.json")
end


which shows an even smaller growth than the ~4 MB/solve.

If I add GC.gc() then I get a flat curve:


So I think that rules out any open files or the big struct, right?

What’s the difference between run_reopt and build_reopt!?

What happens if you build and solve a different JuMP model (e.g., just make a huge Knapsack problem), but still with Xpress?

function knapsack(model, M, N)
    # N binary variables with random objective coefficients
    @variable(model, 0 <= x[1:N] <= 1, Bin)
    @objective(model, Max, rand(N)' * x)
    # M dense knapsack constraints with random weights and capacities
    for i in 1:M
        w = rand(N)
        @constraint(model, w' * x <= N * rand())
    end
    optimize!(model)
    return value.(x)
end

The essential difference is that run_reopt calls JuMP.optimize! (build_reopt! just adds all the variables and constraints).
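
Roughly (ignoring the results extraction), run_reopt is equivalent to:

using REoptLite, JuMP, Xpress

m = Model(Xpress.Optimizer)
build_reopt!(m, "outage.json")   # add the variables and constraints only
optimize!(m)                     # the step that build_reopt! skips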

I had the exact same thought this morning and I have a two-stage stochastic problem running 1,000 times right now…

Ok, might have a lead here.

First, I did not see a memory growth issue with large LPs:

However, with the knapsack MILP I do see the memory growth issue:

This chart comes from running:

using JuMP, CPLEX   # knapsack(model, M, N) as defined above

for _ in range(1, stop=1000)
    m = Model(CPLEX.Optimizer)
    knapsack(m, 5000, 5000)
    GC.gc()
end

The 1,000 iterations have not completed yet, but I think that we may have what we need.

The large ramp from 5 to ~7.5 GB RAM occurred during one particularly tough problem (recall that each problem uses random numbers - unfortunately I forgot to seed). Here is the CPLEX output for that problem (truncated):

Version identifier: 12.10.0.0 | 2019-11-26 | 843d4de
Found incumbent of value 0.000000 after 0.00 sec. (0.15 ticks)
Tried aggregator 1 time.
MIP Presolve eliminated 2457 rows and 0 columns.
Reduced MIP has 2543 rows, 5000 columns, and 12715000 nonzeros.
Reduced MIP has 5000 binaries, 0 generals, 0 SOSs, and 0 indicators.
Presolve time = 7.53 sec. (3885.04 ticks)
Probing time = 0.14 sec. (80.44 ticks)
Tried aggregator 1 time.
Reduced MIP has 2543 rows, 5000 columns, and 12715000 nonzeros.
Reduced MIP has 5000 binaries, 0 generals, 0 SOSs, and 0 indicators.
Presolve time = 6.31 sec. (2864.29 ticks)
Probing time = 0.09 sec. (80.44 ticks)
MIP emphasis: balance optimality and feasibility.
MIP search method: dynamic search.
Parallel mode: deterministic, using up to 7 threads.
Root relaxation solution time = 10.25 sec. (2582.00 ticks)

        Nodes                                         Cuts/
   Node  Left     Objective  IInf  Best Integer    Best Bound    ItCnt     Gap

*     0+    0                            0.0000     2502.5079              --- 
*     0+    0                            4.5083     2502.5079              --- 
      0     0       22.4763    11        4.5083       22.4763       50  398.56%
*     0+    0                           18.2528       22.4763            23.14%
*     0+    0                           21.0700       22.4763             6.67%
*     0+    0                           21.7190       22.4763             3.49%
      0     0       22.4581    11       21.7190      Cuts: 18       64    3.40%
      0     0       22.4576    11       21.7190       Cuts: 6       69    3.40%
*     0+    0                           21.8001       22.4576             3.02%
      0     0        cutoff             21.8001       22.4576       69    3.02%

Repeating presolve.
Tried aggregator 1 time.
MIP Presolve eliminated 2040 rows and 3992 columns.
Reduced MIP has 503 rows, 1008 columns, and 507024 nonzeros.
Reduced MIP has 1008 binaries, 0 generals, 0 SOSs, and 0 indicators.
Presolve time = 0.97 sec. (679.28 ticks)
Probing time = 0.01 sec. (5.97 ticks)
Tried aggregator 1 time.
Reduced MIP has 503 rows, 1008 columns, and 507024 nonzeros.
Reduced MIP has 1008 binaries, 0 generals, 0 SOSs, and 0 indicators.
Presolve time = 0.17 sec. (124.83 ticks)
Represolve time = 1.34 sec. (911.12 ticks)
Probing time = 0.01 sec. (5.97 ticks)
MIP emphasis: balance optimality and feasibility.
MIP search method: dynamic search.
Parallel mode: deterministic, using up to 7 threads.
Root relaxation solution time = 0.08 sec. (104.41 ticks)

        Nodes                                         Cuts/
   Node  Left     Objective  IInf  Best Integer    Best Bound    ItCnt     Gap

*     0+    0                           21.8001       22.4576             3.02%
      0     0       22.4576    14       21.8001       22.4576      117    3.02%
      0     0       22.4575    14       21.8001    MIRcuts: 1      119    3.02%
      0     0       22.4572    14       21.8001       Cuts: 3      122    3.01%
      0     2       22.4572    16       21.8001       22.4572      122    3.01%
Elapsed time = 48.87 sec. (25725.37 ticks, tree = 0.02 MB, solutions = 6)
     56    49       22.3666    11       21.8001       22.4564      413    3.01%
    155   133       22.2527    11       21.8001       22.4538     1066    3.00%
    449   328       22.3635    11       21.8001       22.4470     2428    2.97%
    758   551       22.1993     9       21.8001       22.4470     4097    2.97%
*   882+  605                           21.8389       22.4470             2.78%
   1154   811        cutoff             21.8389       22.4470     6091    2.78%
   1555  1073       22.4009    11       21.8389       22.4470     8149    2.78%
   2163  1629       22.1155    11       21.8389       22.4470    11606    2.78%
   2803  1934       21.9721     8       21.8389       22.4380    14608    2.74%
   3484  2289       22.4030    13       21.8389       22.4303    17250    2.71%

Performing restart 1

Repeating presolve.
Tried aggregator 1 time.
MIP Presolve eliminated 68 rows and 129 columns.
Reduced MIP has 435 rows, 879 columns, and 382365 nonzeros.
Reduced MIP has 879 binaries, 0 generals, 0 SOSs, and 0 indicators.
Presolve time = 0.15 sec. (104.36 ticks)
Tried aggregator 1 time.
Reduced MIP has 435 rows, 879 columns, and 382365 nonzeros.
Reduced MIP has 879 binaries, 0 generals, 0 SOSs, and 0 indicators.
Presolve time = 0.15 sec. (99.48 ticks)
Represolve time = 0.41 sec. (267.73 ticks)
   3490     0       22.4559    16       21.8389      Fract: 1    18702    2.68%
   3490     0       22.4546    16       21.8389       Cuts: 3    18710    2.68%
   3490     0       22.4531    16       21.8389       Cuts: 3    18715    2.68%
   3490     0       22.4530    16       21.8389     Covers: 1    18717    2.68%
   3490     2       22.4530    16       21.8389       22.4252    18717    2.68%
Elapsed time = 54.32 sec. (29410.48 ticks, tree = 0.02 MB, solutions = 7)
*  3550+    5                           21.8462       22.4252             2.65%
*  3696+  126                           21.8728       22.4252             2.53%
   4151   459       22.3315    11       21.8728       22.4252    20723    2.53%
*  4881+ 1053                           21.9738       22.4246             2.05%
   6121  1850       22.3228     9       21.9738       22.4127    25581    2.00%
  11179  6291       22.2919    10       21.9738       22.3917    40104    1.90%
  17252 11458        cutoff             21.9738       22.3775    57685    1.84%
  23033 16105       22.0311     9       21.9738       22.3666    73881    1.79%
  29135 21444       22.2727    10       21.9738       22.3607    92276    1.76%
  34271 25467       22.0781    11       21.9738       22.3537   105937    1.73%
  38077 29100       22.2112     9       21.9738       22.3527   118626    1.72%
  41051 31833    infeasible             21.9738       22.3527   128317    1.72%
  44613 34563       22.1297     9       21.9738       22.3527   137987    1.72%
Elapsed time = 71.17 sec. (38949.44 ticks, tree = 19.38 MB, solutions = 10)
  48042 38375       22.2276    11       21.9738       22.3527   151219    1.72%
  51225 40450       22.1308     9       21.9738       22.3527   158722    1.72%
  54511 43685       22.1107     7       21.9738       22.3527   170701    1.72%
  58066 46238       22.2124     9       21.9738       22.3527   179692    1.72%
  61820 49602       22.2949    12       21.9738       22.3527   191931    1.72%
  65803 53016       22.2393     9       21.9738       22.3527   204364    1.72%
  69824 56280       22.1853    10       21.9738       22.3527   216057    1.72%
* 72037+26724                           22.0167       22.3527             1.53%
  73846 59398       22.3229     9       22.0167       22.3527   227706    1.53%

Performing restart 2

Repeating presolve.
Tried aggregator 1 time.
MIP Presolve eliminated 204 rows and 414 columns.
Reduced MIP has 231 rows, 465 columns, and 107415 nonzeros.
Reduced MIP has 465 binaries, 0 generals, 0 SOSs, and 0 indicators.
Presolve time = 0.05 sec. (40.98 ticks)
Tried aggregator 1 time.
Reduced MIP has 231 rows, 465 columns, and 107415 nonzeros.
Reduced MIP has 465 binaries, 0 generals, 0 SOSs, and 0 indicators.
Presolve time = 0.04 sec. (28.23 ticks)
Represolve time = 0.17 sec. (97.59 ticks)
  75341   502       22.1519     7       22.0167       22.3270   237311    1.41%
  80452  4546       22.2649     9       22.0167       22.3270   253043    1.41%
Elapsed time = 84.80 sec. (48488.94 ticks, tree = 1.69 MB, solutions = 11)
  88883 10128       22.3208    11       22.0167       22.3270   275251    1.41%
  98582 17881       22.1827    11       22.0167       22.3270   304991    1.41%
 ...
 11847407 65973        cutoff             22.0167       22.0212 46689655    0.02%
Elapsed time = 5944.18 sec. (3093759.75 ticks, tree = 158.24 MB, solutions = 11)
 11861268 50653        cutoff             22.0167       22.0202 46712256    0.02%
 11875147 37320        cutoff             22.0167       22.0194 46730984    0.01%

Cover cuts applied:  1151
Gomory fractional cuts applied:  4

Root node processing (before b&c):
  Real time             =   48.85 sec. (25720.58 ticks)
Parallel b&c, 7 threads:
  Real time             = 5911.32 sec. (3076135.19 ticks)
  Sync time (average)   =  457.98 sec.
  Wait time (average)   =    7.31 sec.
                          ------------
Total (root+branch&cut) = 5960.17 sec. (3101855.77 ticks)

It seems as though the branch and cut memory consumption is not recovered before a new problem starts. I can run this with Xpress as well if needed, but it seems like it has to do with the memory used to store (and explore) feasible solutions. Could this be the issue and what do we do to address it?

(Also, I think that the plateau in the second chart is a period when my laptop went to sleep - but the logging script kept running.)
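
(For anyone re-running the knapsack loop, seeding each iteration would make that hard instance reproducible; something like:)

using JuMP, CPLEX, Random   # knapsack(model, M, N) as defined above

for i in range(1, stop=1000)
    Random.seed!(i)          # make iteration i reproducible
    m = Model(CPLEX.Optimizer)
    knapsack(m, 5000, 5000)
    GC.gc()
end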

It seems as though the branch and cut memory consumption is not recovered before a new problem starts.

I suggest you talk to FICO or IBM support. They may have suggestions, but this doesn’t seem to be a JuMP/Julia problem.


Could it also be instructive to restrict all problems to a single thread and see if this changes the memory profile?
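
(e.g., something along these lines; "THREADS" is the Xpress control name as I understand it, and for CPLEX the raw parameter would be "CPXPARAM_Threads" instead - worth double-checking against the solver manuals:)

using JuMP, Xpress

m = Model(Xpress.Optimizer)
set_optimizer_attribute(m, "THREADS", 1)   # limit the solver to one thread

# and start Julia itself single-threaded:
#   JULIA_NUM_THREADS=1 julia knapsack.jl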

Ok, I will confirm by running the knapsack problem in Xpress alone. Before we switched to JuMP, we called Xpress using Python's subprocess package and we did not have a memory leak.

Yes I will try running with a single thread, good idea!

I have run three more tests:

  1. Mosel alone with knapsack.mos (see below)
  2. JuMP+Xpress knapsack.jl (see previous comments) with JULIA_NUM_THREADS=1
  3. JuMP+Xpress knapsack.jl (see previous comments) with JULIA_NUM_THREADS=1 and Xpress THREADS=1

The memory leak appears to be related to solver multithreading inside the Julia process.

1. Mosel alone

The results indicate that the memory leak is not a Mosel issue (422 iterations completed before I killed it):

2. JuMP+Xpress JULIA_NUM_THREADS=1

This one was killed during iteration 160 by Docker when it spiked to my 12 GB memory limit. You can see a baseline growth in the memory usage along with large spikes during branch and bound:

3. JuMP+Xpress JULIA_NUM_THREADS=1 and Xpress THREADS=1

These problems took much longer to solve since Xpress could not parallelize the branch and bound. I killed it during iteration 256. There does not appear to be any baseline growth in memory usage. It is interesting how much more memory is used compared to Mosel alone (an average of ~4 GB vs. less than 2 GB).

So I guess that the memory leak is related to multithreading inside Julia? We could set Xpress THREADS=1 in our API to stop the memory leak, but then each job will be much slower, which means that many more jobs will time out and fewer jobs will complete per time interval. We have spent a lot of time making our API fast and efficient, so the THREADS=1 solution is a non-starter.

Does anyone know how we can still use Xpress's multithreaded solve without the memory leak? Does it matter if JULIA_NUM_THREADS=THREADS? (I will run this case next).

knapsack.mos

model "Test"
uses "mmxprs"  

parameters
  N = 5000
  M = 5000
end-parameters

declarations
  arrN = 1..N
  arrM = 1..M
  binX: array(arrN) of mpvar
end-declarations  

forall (n in arrN) do
  binX(n) is_binary
end-do

forall (m in arrM) do
  forall(n in arrN) rand2(n) := random
  forall(n in arrN) binX(n) * rand2(n) <= N * random
end-do

forall(n in arrN) rand1(n):= random  
costs := sum(n in arrN) (binX(n) * rand1(n))

maximize(costs)

case getprobstat of
  XPRS_OPT: status:="Optimum found"
  XPRS_UNF: status:="Unfinished"
  XPRS_INF: status:="Infeasible"
  XPRS_UNB: status:="Unbounded"
  XPRS_OTH: status:="Failed"
  else status:="???"
end-case

writeln("Problem status = ", status)
writeln("Costs = ", getsol(costs))
end-model