I agree, it seems that Julia is not being aggressive about freeing memory.
Maybe it's because Julia only knows about Xpress' pointer and not about what it points to? (I am about as far as you can get from a Julia memory management expert.)
I would add to Oscar's list:
- Calling GC periodically (a sketch of what I mean follows this list).
- Working out a solution to manually finalize Xpress (we would have to dig into Xpress' createprob issue; FICO might be of help here).
- Running mosel with Julia's run; I believe it will run smoothly (it just eliminates many of the tests and possibilities you raised).
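On the first item, here is a minimal sketch of the pattern I have in mind, forcing a collection after every solve. solve_one_job is a hypothetical placeholder for whatever builds and solves one model:

using JuMP, Xpress

# Hypothetical stand-in for whatever builds and solves one model for a given job.
function solve_one_job(inputs)
    m = Model(Xpress.Optimizer)
    @variable(m, x >= 0)
    @objective(m, Min, x)  # placeholder model; the real one would come from `inputs`
    optimize!(m)
    return objective_value(m)
end

# Solve a batch of jobs, forcing a garbage collection after every solve so that
# finalizers for the solver's C-side memory get a chance to run.
function solve_all(jobs)
    results = Any[]
    for inputs in jobs
        push!(results, solve_one_job(inputs))
        GC.gc()
    end
    return results
end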
Regarding using Base.run for solving models: the reason we switched from Python to Julia was JuMP. Is there a way to still use JuMP with Base.run?
One of the major drawbacks to using Base.run (or Python's subprocess) is that we have to write all inputs to disk to load them into mosel, as well as have mosel write outputs to disk and read them back into memory (using Julia or Python). At least, that is how we got inputs/outputs from mosel in the past. I suppose that there is a better way using the C API (?), but that is what JuMP does!
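For reference, the disk round-trip pattern I mean looks roughly like this. This is only a sketch: the model.mos script and the mosel command-line arguments are hypothetical placeholders, and it assumes the JSON.jl package is available:

using JSON  # assumes the JSON.jl package

# Hypothetical subprocess workflow: all inputs and outputs go through the filesystem.
function solve_via_mosel(inputs::Dict, scenario_path::String, results_path::String)
    # 1. Write the model inputs to disk so the external mosel process can read them.
    open(scenario_path, "w") do io
        JSON.print(io, inputs)
    end
    # 2. Run mosel as a subprocess. model.mos and the argument syntax are
    #    illustrative placeholders for however the mosel model is actually invoked.
    run(`mosel exec model.mos SCENARIO=$scenario_path RESULTS=$results_path`)
    # 3. Read the results back into Julia memory.
    return JSON.parsefile(results_path)
end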
I have also looked into using Kubernetes to restart the Julia pods based on some metric. Unfortunately, I have come up empty-handed so far (we are not iterating in the app, so there is no distinct counter of problems solved). The only idea that would work is using a cron job to restart the pods every day. However, this is just a band-aid and I would rather figure out the true problem and fix it.
I have not found a reason to contact FICO yet as I have not identified any issue with Xpress. The problem, as you noted, appears to be in Julia.
I will keep pursuing this avenue in parallel.
Just to note: the app is not iterating over problems and is running 12-20 Julia instances. We have no control over how complex any given problem is, nor how many problems are run in any given time interval (except by limiting user POSTs). We would like to process as many jobs in parallel as possible to give users the best experience. Unfortunately, all solutions so far (using 1 THREAD or limiting the solver's memory use) lead to slower jobs, which negates thousands of hours of work spent making the app faster.
My understanding is that in the Xpress "alone" case you were calling it as a subprocess, which opens a process and completely closes it right after it finishes.
In Julia terms this would be equivalent to something like:
# myscript.jl
using REoptLite, JuMP, Xpress
m = Model(Xpress.Optimizer)
r = run_reopt(m, "outage.json")
then called as:
for _ in 1:1000
    run(`julia myscript.jl`)
end
It doesn’t hurt to tell them you would be interested in commercial support for Xpress.jl. They may have suggestions for parameters to set to help with the memory usage.
Here is the chart with THREADS=1: what can we conclude from it?
That Xpress allocates less memory because it is not parallelizing the branch and bound.
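For reference, one way to set THREADS=1 from JuMP would be roughly the following. A sketch only: "THREADS" is the Xpress control being set, and set_optimizer_attribute is JuMP's generic parameter interface; I have not verified this exact spelling against the Xpress.jl docs.

using JuMP, Xpress

m = Model(Xpress.Optimizer)
# Limit Xpress to a single thread for the branch and bound.
# "THREADS" is the Xpress control; setting it by name through JuMP's generic
# attribute interface is the assumed mechanism here (sketch, not verified).
set_optimizer_attribute(m, "THREADS", 1)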
I think you need something like this:
function run_reopt(inputs)
    # Use direct mode so that `backend` returns the Xpress.Optimizer itself
    model = direct_model(Xpress.Optimizer())
    # ... do stuff
    results = get_results(model)
    # Manually finalize the solver so its C-side memory is freed right away,
    # instead of waiting for Julia's GC to run the finalizer
    xpress = backend(model)
    finalize(xpress)
    return results
end
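The calling pattern on the app side would then look roughly like this (a sketch; process_jobs is a hypothetical driver, and the GC frequency is arbitrary):

# Sketch: finalize (inside run_reopt above) frees the Xpress problem right away,
# and an occasional GC.gc() cleans up whatever Julia-side memory remains.
function process_jobs(jobs)
    all_results = Any[]
    for (i, inputs) in enumerate(jobs)
        push!(all_results, run_reopt(inputs))
        i % 10 == 0 && GC.gc()  # every 10 solves; tune the frequency as needed
    end
    return all_results
end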
FWIW, I thought that there might be a plateau in the average memory usage, but I now suspect that the memory usage can continue to grow indefinitely: after increasing our server memory from 64 GB to 128 GB, the app just filled it up after a few days.
Note that the models run on this server are not controlled (anyone can POST any scenario), so the relatively flat periods of memory use could also be periods of relative inactivity.
@ericphanson mentioned running GC more often as a fix for managing memory usage in containerized Julia sessions; we are running GC.gc() after every JuMP model is solved, but the memory usage continues to grow. Is there something else that we should be doing, or do we need to wait for the issue mentioned by @jameson to be addressed?
From your picture above (three up), the difference between the blue and red lines definitely suggests that Julia isn't finalizing the C backend aggressively. This probably explains why CPLEX showed something similar.
So that’s part of the problem, but it doesn’t explain the gradual increase even in the red line.
Is this latest (bottom) picture with the manual finalizer? How many solves happen during this time? It really looks like Xpress is allocating something and then not releasing it when finalized. Are you on the latest version of Xpress?