Memory consumption growth with many large MILPs in JuMP

Thank you for the suggestions!

Yes, I started to experiment with these settings to see how they affect solve time, which is a high priority for our app given that we have users submitting thousands of jobs per day. However, I ran into the problem reported in "unrecognized control parameter RESOURCESTRATEGY" (Issue #130, jump-dev/Xpress.jl on GitHub).

I have also looked into using Kubernetes to restart the Julia pods based on some metric. Unfortunately, I have come up empty-handed so far (we are not iterating in the app, so there is no distinct counter of problems solved). The only idea that would work is using a cron job to restart the pods every day. However, this is just a band-aid, and I would rather find the true problem and fix it.
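One metric that needs no job counter is the process's own peak resident memory. The following is only a minimal sketch, assuming the orchestrator restarts a container whose main process exits and that the check runs between jobs so no in-flight request is lost. The function name and the 4 GiB default are hypothetical.

```julia
# Hypothetical threshold check; the name and the 4 GiB default are illustrative only.
function over_memory_limit(; limit_bytes::Integer = 4 * 2^30)
    # Sys.maxrss() reports the peak resident set size of this process in bytes.
    return Sys.maxrss() > limit_bytes
end

# Intended use between jobs (not executed here):
#     over_memory_limit() && exit(0)
println("over limit: ", over_memory_limit())
```

Like the cron job, this only contains the symptom; it does not explain where the memory goes.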

I have not found a reason to contact FICO yet as I have not identified any issue with Xpress. The problem, as you noted, appears to be in Julia.

I will keep pursuing this avenue in parallel.

Just to note: the app is not iterating over problems and is running 12-20 Julia instances. We have no control over how complex any given problem is, nor over how many problems are run in any given time interval (except by limiting user POSTs). We would like to process as many jobs in parallel as possible to give users the best possible experience. Unfortunately, all solutions so far (setting THREADS=1 or limiting the solver’s memory use) lead to slower jobs, which negates thousands of hours of work spent making the app faster.

Calling GC manually was not a solution?

No, we deployed GC in the API a few weeks ago and it has not solved the problem. You can see in the charts above that it just slows the memory growth.

Per my latest test in "Unable to call Xpress.createprob" (Issue #128, jump-dev/Xpress.jl on GitHub):

This happens on Linux but not on Windows.

That is correct: our app does not use Threads.

I have not, because testing with FICO Xpress alone does not lead to memory growth (see previous posts).

My understanding is that in the Xpress “alone” case you were calling it as a subprocess, which opens a process and completely closes it right after it finishes.

In terms of Julia this would be equivalent to something like:

myscript.jl =

using REoptLite, JuMP, Xpress
m = Model(Xpress.Optimizer)
r = run_reopt(m, "outage.json")

then call it as:

for _ in 1:1000
    run(`julia myscript.jl`)
end

Sorry, what I meant by Xpress “alone” is this case:

in which I used a shell script to loop over mosel knapsack.mos

Here is the chart with THREADS=1: what can we conclude from it? Note that this is running the REopt model (not the knapsack model).

The first two lines are using all 8 available threads on my MacBook. Here is an example output from Xpress:

julia_xpress     | Minimizing MILP 
julia_xpress     | Original problem has:
julia_xpress     |     175682 rows       140421 cols       662091 elements      8783 globals
julia_xpress     |        146 inds
julia_xpress     | Presolved problem has:
julia_xpress     |      40235 rows        28021 cols       149634 elements        23 globals
julia_xpress     |        103 inds
julia_xpress     | Will try to keep branch and bound tree memory usage below 8.0Gb
julia_xpress     | Starting concurrent solve with dual, primal and barrier (5 threads)
julia_xpress     | 
julia_xpress     |                            Concurrent-Solve,   1s
julia_xpress     |             Dual                      Primal                     Barrier      
julia_xpress     |     objective   dual inf       objective   sum inf         p.obj.     d.obj.  
julia_xpress     |  D  39481861.   .0000000 |  p  3.222E+08   .0000000 |  B  7.144E+11 -3.028E+12
julia_xpress     |  D  46242287.   .0000000 |  p  3.150E+08   .0000000 |  B  91252224.  32992757.
julia_xpress     |  D  53886486.   .0000000 |  p  2.988E+08   .0000000 |  B  66886949.  66071451.
julia_xpress     |  D  55357306.   .0000000 |  p  2.988E+08   .0000000 |           crossover     
julia_xpress     | ----- interrupted ------ | ----- interrupted ------ | ------- optimal --------
julia_xpress     | Concurrent statistics:
julia_xpress     |       Dual: 21867 simplex iterations, 2.45s
julia_xpress     |     Primal: 5077 simplex iterations, 2.44s
julia_xpress     |    Barrier: 48 barrier and 0 simplex iterations, 2.44s
julia_xpress     |             Barrier used 5 threads 5 cores, L1\L2 cache: 32K\6144K
julia_xpress     |             Barrier used AVX support
julia_xpress     | Optimal solution found
julia_xpress     | 
julia_xpress     |    Its         Obj Value      S   Ninf  Nneg        Sum Inf  Time
julia_xpress     |      0       66772948.83      P      0     0        .000000     3
julia_xpress     | Barrier solved problem
julia_xpress     |   48 barrier iterations in 3s
julia_xpress     | 
julia_xpress     | Final objective                         : 6.677294882798262e+07
julia_xpress     |   Max primal violation      (abs / rel) : 1.791e-12 / 1.697e-12
julia_xpress     |   Max dual violation        (abs / rel) : 1.137e-13 / 1.016e-13
julia_xpress     |   Max complementarity viol. (abs / rel) :       0.0 /       0.0
julia_xpress     | All values within tolerances
julia_xpress     | 
julia_xpress     | Starting root cutting & heuristics
julia_xpress     | 
julia_xpress     |  Its Type    BestSoln    BestBound   Sols    Add    Del     Gap     GInf   Time
julia_xpress     | c         73666143.19  66772948.83      1                  9.36%       0      3
julia_xpress     |    1  M   73666143.19  66772948.83      1     10      0    9.36%       3      4
julia_xpress     |    2  K   73666143.19  68497142.85      1     34      7    7.02%      23      4
julia_xpress     |    3  K   73666143.19  68497142.85      1      2     13    7.02%       2      4
julia_xpress     |    4  K   73666143.19  68497142.85      1      1      1    7.02%       2      5
julia_xpress     |    5  K   73666143.19  68497142.85      1      0      1    7.02%       2      5
julia_xpress     |    6  G   73666143.19  68497142.85      1      0      0    7.02%       2      5
julia_xpress     | Heuristic search started
julia_xpress     | Heuristic search stopped
julia_xpress     | 
julia_xpress     | Cuts in the matrix         : 25
julia_xpress     | Cut elements in the matrix : 67
julia_xpress     | Will try to keep branch and bound tree memory usage below 8.0Gb
julia_xpress     | 
julia_xpress     | Starting tree search.
julia_xpress     | Deterministic mode with up to 7 running threads and up to 16 tasks.
julia_xpress     | 
julia_xpress     |     Node     BestSoln    BestBound   Sols Active  Depth     Gap     GInf   Time
julia_xpress     |        1  73666143.19  68673927.30      1      2      1    6.78%       2     14
julia_xpress     |        2  73666143.19  68673927.30      1      1      2    6.78%       2     15
julia_xpress     |        5  73666143.19  73666076.42      1      0      3    0.00%       1     15
julia_xpress     |  *** Search completed ***     Time:    15 Nodes:          5
julia_xpress     | Number of integer feasible solutions found is 1
julia_xpress     | Best integer solution found is  73666143.19
julia_xpress     | Best bound is  73666143.19
julia_xpress     | Uncrunching matrix

It doesn’t hurt to tell them you would be interested in commercial support for Xpress.jl. They may have suggestions for parameters to set to help with the memory usage.

Here is the chart with THREADS=1 : what can we conclude from it?

That Xpress allocates less memory because it is not parallelizing the branch and bound.

I think you need something like this:

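function run_reopt(inputs)
    model = direct_model(Xpress.Optimizer())
    # ... build and solve the model
    results = get_results(model)
    xpress = backend(model)
    finalize(xpress)  # release the Xpress problem now instead of waiting for the GC
    return results
end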
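
If building or solving throws, a release step placed just before `return` is skipped. The following is a minimal sketch of guarding the release with `try`/`finally`, assuming the solver wrapper registers a finalizer that frees its native problem. `MockSolver` and `solve_and_release` are hypothetical stand-ins so the snippet runs without Xpress.

```julia
# Mock stand-in for a solver handle that owns native memory (hypothetical type).
mutable struct MockSolver
    freed::Bool
    function MockSolver()
        s = new(false)
        # A real wrapper would free the C-side problem here.
        finalizer(x -> (x.freed = true), s)
        return s
    end
end

# Run `build_and_solve` on a fresh handle and always release the handle afterwards.
function solve_and_release(build_and_solve)
    solver = MockSolver()
    try
        return build_and_solve(solver)
    finally
        finalize(solver)  # run the finalizer now, even if an error was thrown
    end
end

handle = solve_and_release(identity)   # return the handle itself to inspect it
println("released: ", handle.freed)
```

With JuMP the same shape would wrap the model-building code, with `finalize(backend(model))` in the `finally` branch.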

I opened an issue for more technical discussion: "Memory leaks" (Issue #131, jump-dev/Xpress.jl on GitHub).

The memory growth with

for i in 1:500
    m = direct_model(Xpress.Optimizer(OUTPUTLOG = 0))
    r = run_reopt(m, "outage.json")
end

starts lower and is slower than anything tried so far, but still appears to grow:

I will contact FICO next week about the questions in "Memory leaks" (Issue #131, jump-dev/Xpress.jl on GitHub).

FWIW, I thought that there might be a plateau in the average memory usage, but I now suspect that the memory usage can continue to grow indefinitely: after we increased our server memory from 64 GB to 128 GB, the app filled it up within a few days.

Note that the models run on this server are not controlled (anyone can POST any scenario), so the relatively flat periods of memory use could also be periods of relative inactivity.

See "Does the GC respond based on memory pressure?". It seems that @ericphanson is having similar troubles.

@ericphanson mentioned running GC more often as a fix to managing memory usage in containerized Julia sessions; we are running GC.gc() after every JuMP model is solved but the memory usage continues to grow. Is there something else that we should be doing or do we need to wait for the issue mentioned by @jameson to be addressed?
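
One way to narrow this down is to log, after each forced collection, both the bytes the Julia GC considers live and the resident size of the whole process. The following is a minimal sketch under stated assumptions: `memory_snapshot` and `track_memory` are hypothetical helpers, and the workload is a stand-in for building and solving one JuMP model. Note that `Sys.maxrss()` reports the peak resident size, so it never decreases.

```julia
# Hypothetical helper: snapshot Julia-managed memory and process-level memory.
function memory_snapshot()
    GC.gc()  # full collection, so live bytes count only reachable Julia objects
    return (julia_live_mb = Base.gc_live_bytes() / 2^20,
            peak_rss_mb   = Sys.maxrss() / 2^20)
end

# Run `solve_job` `njobs` times and record a snapshot after each run.
function track_memory(solve_job, njobs)
    history = [memory_snapshot()]
    for _ in 1:njobs
        solve_job()
        push!(history, memory_snapshot())
    end
    return history
end

# Stand-in workload that allocates and discards Julia memory.
history = track_memory(() -> sum(rand(10^6)), 5)
for (i, snap) in enumerate(history)
    println(i - 1, ": live = ", round(snap.julia_live_mb, digits = 1),
            " MB, peak RSS = ", round(snap.peak_rss_mb, digits = 1), " MB")
end
```

If the live figure stays roughly flat while the resident size keeps climbing, that would suggest the growth is outside the Julia-managed heap, for example in memory held by the solver library, although allocator fragmentation could produce a similar picture.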

From your picture above (up three), the difference between the blue and red definitely seems like Julia isn’t finalizing the C backend aggressively. This probably explains why CPLEX had something similar.

So that’s part of the problem, but it doesn’t explain the gradual increase even in the red line.

Is this latest (bottom) picture with the manual finalizer? How many solves are called during this time? It really looks like Xpress is allocating something, and then not releasing it when finalized. Are you on the latest version of Xpress?

@nlaws from what I understand you are not using the free method (that also releases the license), did you check with FICO if that is necessary?

Yes, using finalize(backend(m)).

Over the last 7 days 14,160 models were solved.

We are on v8.0.4. It looks like v8.11 is the latest. I will see if we can get it.

I am working on contacting FICO. They don’t make it easy.

v8.0.4 or v8.4?
8.0 is from 2016

Yikes, yes, 8.0. I didn’t realize that it was that old! I’ll make this priority one.
