Memory consumption growth with many large MILPs in JuMP

Probably not helpful, since you have already tried many things, but… I had a problem with memory scaling up in a parallel application and, initially, I used something like this as a workaround:

        if istaskdone(t[ispawn])
          if options.GC && (Sys.free_memory() / Sys.total_memory() < options.GC_threshold)
            GC.gc() # why we need this anyway??? There should not be so much garbage.
          end
        end

meaning that I launched garbage collection after each thread was finished whenever the memory usage was too high (but I had complete control over the code).
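In case it helps, a self-contained version of that workaround could look roughly like the sketch below. The helper name `maybe_gc` and the default threshold are made up for illustration; only `Sys.free_memory`, `Sys.total_memory` and `GC.gc` are real Base functions.

```julia
# Hypothetical helper: run a full collection only when the fraction of
# free system memory drops below `threshold`.
function maybe_gc(threshold::Float64 = 0.1)
    free_fraction = Sys.free_memory() / Sys.total_memory()
    if free_fraction < threshold
        GC.gc()  # full collection; expensive, so keep it behind the check
        return true
    end
    return false
end

# Example: call it once per finished job.
for job in 1:3
    data = rand(10^6)   # stand-in for real work
    s = sum(data)
    maybe_gc(0.1)
end
```

The check keeps the cost of `GC.gc()` out of the common path, which matters because a forced full collection on every iteration can be very slow.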

That being said (since you already tried things like that), I finally found a type instability in my code which was the cause of that memory leak, and fixing it solved the overall problem (it was a tricky one given what I knew at the time). Thus, if you have not already exhausted this possibility, I would suggest carefully checking that everything is type stable where it should be.
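To give an idea of what I mean, here is a minimal sketch of how a type instability turns into allocations. It is not the code from my application; the function names and the non-const global `scale_factor` are invented for the example.

```julia
scale_factor = 2.0   # non-const global: the compiler cannot assume its type

# Type-unstable: reads the untyped global inside the hot loop, so every
# intermediate result is boxed on the heap.
function scaled_sum_unstable(xs)
    acc = 0.0
    for x in xs
        acc += scale_factor * x
    end
    return acc
end

# Type-stable: the factor is passed in as an argument with a known type.
function scaled_sum_stable(xs, s)
    acc = 0.0
    for x in xs
        acc += s * x
    end
    return acc
end

# Compare bytes allocated by the two versions (after a warm-up call so
# that compilation is not measured).
function compare_allocations(xs)
    scaled_sum_unstable(xs)
    scaled_sum_stable(xs, 2.0)
    a_unstable = @allocated scaled_sum_unstable(xs)
    a_stable = @allocated scaled_sum_stable(xs, 2.0)
    return a_unstable, a_stable
end

a_unstable, a_stable = compare_allocations(rand(10_000))
println("unstable: ", a_unstable, " bytes, stable: ", a_stable, " bytes")
```

Running `@code_warntype scaled_sum_unstable(rand(10))` should flag `acc` and the return value as non-concrete, which is the kind of output to look for.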


Just saw this in the Julia manual in Multi-Threading · The Julia Language

Compute-bound, non-memory-allocating tasks can prevent garbage collection from running in other threads that are allocating memory. In these cases it may be necessary to insert a manual call to GC.safepoint() to allow GC to run. This limitation will be removed in the future.
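As I read that passage, the manual call would go inside the compute-bound loop. A minimal sketch, where the function name `busy_sum` and the interval of 10,000 iterations are arbitrary choices of mine:

```julia
# Compute-bound loop that does not allocate. The periodic safepoint lets a
# collection requested by another thread proceed instead of waiting on this one.
function busy_sum(n::Int)
    acc = 0.0
    for i in 1:n
        acc += sin(i)          # pure arithmetic, no allocation
        if i % 10_000 == 0
            GC.safepoint()     # allow a pending GC to run here
        end
    end
    return acc
end

busy_sum(10^6)
```

Note that this only helps when Julia code is what occupies the threads; it would not affect threads owned by an external solver library.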

May be totally unrelated, but is it intended that all AxisArray fields in that struct are abstractly typed? What does @code_warntype of your job endpoint and reopt_run look like? I seem to remember there being problems with threaded type-unstable code leading to a lot of allocations…
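To illustrate what I mean by abstractly typed fields, here is a small sketch. The struct names are hypothetical and plain vectors stand in for the AxisArray fields.

```julia
# Abstractly typed field: the compiler only knows `data` is some AbstractVector.
struct LooseInputs
    data::AbstractVector
end

# Parametric version: the concrete array type becomes part of the struct's type.
struct TightInputs{V<:AbstractVector}
    data::V
end

total(x) = sum(x.data)

loose = LooseInputs(rand(10))
tight = TightInputs(rand(10))

# The field type tells you whether code using the field can be specialized.
println(isconcretetype(fieldtype(typeof(loose), :data)))  # false
println(isconcretetype(fieldtype(typeof(tight), :data)))  # true
```

Calling `@code_warntype total(loose)` and `@code_warntype total(tight)` should show the difference in the inferred return type.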

I will check that but we are seeing the memory leak in the knapsack.jl problem as well (which does not have any structs).

Thank you for the tip! However, our app is not doing the threading so we do not have control over the tasks. The solver (CPLEX or Xpress checked so far) is called by JuMP (or MathOptInterface?), and the solver uses multiple threads. Also, I do not think that it is a type instability leading to the memory leak because the memory leak occurs with the knapsack.jl problem (unless the type instability is in JuMP, MathOptInterface, or another dependency like MutableArithmetics).

The problem I had in my case was very similar to the one found by Sukera above (just to mention; I know that it is not related, at least not completely, to your problem).

Maybe you can try to track down whether there is a type instability somewhere by following the function calls down the code. I don’t know if you know this (I learned it not long ago), but you can use @code_warntype on the inner functions by calling Main.@code_warntype from anywhere. Something like:

function test(x)
   y = inner_function(x)
   return y
end

julia> function test(x)
          Main.@code_warntype inner_function(x) # check inner_function call
          y = inner_function(x)
          return y
       end
test (generic function with 1 method)

Ok thank you I will try Main.@code_warntype on our inner function calls and report back what I find. I have never used it before so thank you for the example!

Should I be concerned by the many Anys? Or is there something specific to look for regarding the type stability issue and multi-threading?

I don’t know much about how JuMP works, but there you have some function that returns an expression. I guess that during model building things are frequently not concretely typed. What matters is not having instabilities in the number-crunching routines.

You’re looking at the macro creation code, not the actual runtime code.

I found this issue that matches the description of my issue, but it looks like a fix was merged into master by following the issue links:

However, my issue is that a subprocess called by Julia is multithreading, and I can get the memory leak even with JULIA_NUM_THREADS=1. The only “fix” so far has been to set Xpress THREADS=1, but this causes our API to be unbearably slow. I guess I should raise an issue on the Julia repo? Anyone have thoughts on this? Is there a way to make a MWE of calling a multithreading subprocess without JuMP+solver?

BTW I tried GC.safepoint() in the global scope and function scope with no changes to the memory growth (using the knapsack.jl in a loop).

So if you set THREADS=1, all those plots with growing memory become very stable?
Maybe you can send them for reference here.

Or maybe it is just that the program is now so slow that it has no time to reach the memory growth ramp?

Any chance you can plot with a minimum number of iterations?

See this chart above in my comment with the header “3. JuMP+Xpress JULIA_NUM_THREADS=1 and Xpress THREADS=1”. Would it be informative to run more cases with THREADS=1?

I am using stress-ng and the command to recreate a single threaded Julia container calling a multithreaded subprocess. I will update here when I have some charts to share.

It is interesting that all the other examples have allocations done by Julia. In that case it makes sense that the problem is the GC not freeing memory.

What is weird about your problem is that the allocations are done by Xpress/CPLEX. Should they be freeing the memory?

There is no Julia code running on the threads, right? Perhaps callbacks?

Have you already opened issues with FICO/IBM?

When I ran a roughly equivalent problem in Mosel (knapsack.mos above) I did not see any memory growth. We also did not have any memory growth issues when we used Python to call Mosel (via subprocess). It seems like the memory issue is related to threading and garbage collection in Julia, but I am not a computer scientist so this is just a hunch based off of our problem and the others seeing the same weird memory issue. I think that it is not really a “leak” per se because the memory use seems to plateau (see my first post). And others have noted the plateau behavior, as well as some improvement by adding GC.gc() (see links above).

I have been trying to find a MWE that does not use any Julia packages, like this one:

Repeated just now with Julia 1.6.1:

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

julia> function foo(N)
         Threads.@threads for i in 1:N
foo (generic function with 1 method)

julia> @time foo(10^5)
166.957135 seconds (308.26 k allocations: 745.072 GiB, 54.93% gc time, 0.01% compilation time)

I also tried running foo with GC.gc() after the sum, but I killed it after ~15 minutes. However, this MWE does not lead to memory growth. (But maybe the slowdown with GC.gc() is another issue.)


for _ in range(1, stop=100)

where is (an attempt to mimic a solver):

stress-ng -t 30s --matrix 4 --mmap-bytes 6g --mmap 4 --copy-file 2 --copy-file-bytes 1g

I have been trying other combinations of stress-ng settings but I have not found one particular operation that leads to consistent memory growth.

Returning now to knapsack.jl if I run

using Random, JuMP, Xpress
m = Model(()->Xpress.Optimizer(THREADS=5))
@time knapsack(m, 5000, 5000)

I get

61.578035 seconds (344.06 M allocations: 13.692 GiB, 3.08% gc time)

Is the 13.692 GiB problematic @odow ? I don’t know what to expect for allocations and how they might relate to memory consumption growth over iterations.
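For context on that number: the figure reported by @time is the cumulative number of bytes allocated during the call, not the memory held at any one moment. The sketch below, which does not use JuMP or a solver, shows one way to look at both. The function name `churn` is made up, and `Base.gc_live_bytes` is an unexported Base function.

```julia
# Allocate many short-lived arrays: the running total is large, but almost
# nothing stays alive between iterations.
function churn(n::Int)
    total = 0.0
    for _ in 1:n
        total += sum(rand(10^5))   # roughly 0.8 MB temporary per iteration
    end
    return total
end

churn(1)                              # warm up so compilation is not counted
allocated = @allocated churn(200)     # cumulative bytes, like the @time figure
GC.gc()
live = Base.gc_live_bytes()           # bytes currently tracked by Julia's GC
rss = Sys.maxrss()                    # peak resident set size of the process
println("allocated: ", allocated, "  live after GC: ", live, "  peak RSS: ", rss)
```

Logging `Base.gc_live_bytes()` and `Sys.maxrss()` after each solve in the loop might help separate Julia-owned memory from memory allocated inside the solver library, since, as far as I understand, memory the solver allocates in C is generally not tracked by Julia's GC but still counts toward the process RSS.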

Calling GC.gc() can indeed lead to a large slowdown; there is even a warning about that in the docstring.

Since you tried Python calling Mosel, you could try Julia calling Mosel with run. What do you think?

It would be nice to see this graph:

Including a version with THREADS=1, so that we can compare with Xpress w/GC.

It might be that THREADS=1 simply allocates less than an unrestricted number of threads. Then the problem would be the GC only.

I’m confused as to what the actual problem is here. As far as I understand it:

  • You want to solve multiple calls to reopt in a single (serial) Julia instance.
  • Each solve uses Xpress, which parallelizes over Xpress.THREADS in the branch-and-bound.
  • Over time, the memory allocated by this Julia instance (which includes the memory allocated by Xpress) increases before plateauing.
  • Sometimes, it hits the docker memory limit and kills the job.

This could be caused by

  • A memory leak of Julia objects
  • A memory leak in Xpress
  • Julia not aggressively freeing memory after a solve

A memory leak in Xpress is unlikely because you saw similar results with Xpress and CPLEX. That they grew at similar rates suggests it is a Julia issue, but the fact that it plateaued suggests that it is not a memory leak in the sense that things are escaping the GC. So that leaves Julia not being aggressive.

The fact that is a problem, suggests that there is some interrelated aspect of the Xpress finalizer that persists across models. That is a good place to start looking.

There are also things you could try:

  • For hard MIPs, you should expect significant memory allocations due to branch-and-bound. Solvers have a variety of options to set if there is a hard upper-limit. For example;
  • Restart your docker worker every N iterations
  • Ask FICO for support (Xpress.jl is maintained on a voluntary basis by the community.)