I agree, it seems that Julia is not being aggressive about freeing memory.
Maybe it's because Julia only knows about Xpress' pointer and not about what it points to? (I am about as far as you can get from a Julia memory management expert.)
I would add to Oscar's list:
- Calling GC periodically (a sketch of what I mean follows this list).
- Working out a solution to manually finalize Xpress (we would have to dig into Xpress' createprob issue; FICO might be of help here).
- Running mosel with Julia's run; I believe it will run smoothly (it just eliminates many of the tests and possibilities you raised).
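On the first item, here is a minimal sketch of the pattern I have in mind, forcing a collection after every solve. solve_one_job is a hypothetical placeholder for whatever builds and solves one model:

using JuMP, Xpress

# Hypothetical stand-in for whatever builds and solves one model for a given job.
function solve_one_job(inputs)
    m = Model(Xpress.Optimizer)
    @variable(m, x >= 0)
    @objective(m, Min, x)  # placeholder model; the real one would come from `inputs`
    optimize!(m)
    return objective_value(m)
end

# Solve a batch of jobs, forcing a garbage collection after every solve so that
# finalizers for the solver's C-side memory get a chance to run.
function solve_all(jobs)
    results = Any[]
    for inputs in jobs
        push!(results, solve_one_job(inputs))
        GC.gc()
    end
    return results
end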
Regarding using Base.run for solving models: the reason we switched from Python to Julia was JuMP. Is there a way to still use JuMP with Base.run?
One of the major drawbacks to using Base.run (or Python's subprocess) is that we have to write all inputs to disk to load them into mosel, as well as have mosel write outputs to disk and read them back into memory (using Julia or Python). At least, that is how we got inputs/outputs from mosel in the past. I suppose that there is a better way using the C API (?), but that is what JuMP does!
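For reference, the disk round-trip pattern I mean looks roughly like this. This is only a sketch: the model.mos script and the mosel command-line arguments are hypothetical placeholders, and it assumes the JSON.jl package is available:

using JSON  # assumes the JSON.jl package

# Hypothetical subprocess workflow: all inputs and outputs go through the filesystem.
function solve_via_mosel(inputs::Dict, scenario_path::String, results_path::String)
    # 1. Write the model inputs to disk so the external mosel process can read them.
    open(scenario_path, "w") do io
        JSON.print(io, inputs)
    end
    # 2. Run mosel as a subprocess. model.mos and the argument syntax are
    #    illustrative placeholders for however the mosel model is actually invoked.
    run(`mosel exec model.mos SCENARIO=$scenario_path RESULTS=$results_path`)
    # 3. Read the results back into Julia memory.
    return JSON.parsefile(results_path)
end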
I have also looked into using Kubernetes to restart the Julia pods based on some metric. Unfortunately, I have come up empty-handed so far (we are not iterating in the app, so there is no distinct counter of problems solved). The only idea that would work is using a cron job to restart the pods every day. However, this is just a band-aid and I would rather figure out the true problem and fix it.
I have not found a reason to contact FICO yet as I have not identified any issue with Xpress. The problem, as you noted, appears to be in Julia.
I will keep pursuing this avenue in parallel.
Just to note: the app is not iterating over problems and is running 12-20 Julia instances. We have no control over how complex any given problem is, nor how many problems are run in any given time interval (except by limiting user POSTs). We would like to process as many jobs in parallel as possible to give users the best experience. Unfortunately, all solutions so far (using 1 THREAD or limiting the solver's memory use) lead to slower jobs, which negates thousands of hours of work spent making the app faster.
My understanding is that in the Xpress "alone" case you were calling it as a subprocess, which opens a process and completely closes it right after it finishes.
In Julia terms this would be equivalent to something like:
# myscript.jl
using REoptLite, JuMP, Xpress
m = Model(Xpress.Optimizer)
r = run_reopt(m, "outage.json")
then called as:
for _ in 1:1000
    run(`julia myscript.jl`)
end
It doesn’t hurt to tell them you would be interested in commercial support for Xpress.jl. They may have suggestions for parameters to set to help with the memory usage.
Here is the chart with THREADS=1: what can we conclude from it?
That Xpress allocates less memory because it is not parallelizing the branch and bound.
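For reference, one way to set THREADS=1 from JuMP would be roughly the following. A sketch only: "THREADS" is the Xpress control being set, and set_optimizer_attribute is JuMP's generic parameter interface; I have not verified this exact spelling against the Xpress.jl docs.

using JuMP, Xpress

m = Model(Xpress.Optimizer)
# Limit Xpress to a single thread for the branch and bound.
# "THREADS" is the Xpress control; setting it by name through JuMP's generic
# attribute interface is the assumed mechanism here (sketch, not verified).
set_optimizer_attribute(m, "THREADS", 1)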
I think you need something like this:
function run_reopt(inputs)
    # Use direct mode so that `backend` returns the Xpress.Optimizer itself
    model = direct_model(Xpress.Optimizer())
    # ... do stuff
    results = get_results(model)
    # Manually finalize the solver so its C-side memory is freed right away,
    # instead of waiting for Julia's GC to run the finalizer
    xpress = backend(model)
    finalize(xpress)
    return results
end
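The calling pattern on the app side would then look roughly like this (a sketch; process_jobs is a hypothetical driver, and the GC frequency is arbitrary):

# Sketch: finalize (inside run_reopt above) frees the Xpress problem right away,
# and an occasional GC.gc() cleans up whatever Julia-side memory remains.
function process_jobs(jobs)
    all_results = Any[]
    for (i, inputs) in enumerate(jobs)
        push!(all_results, run_reopt(inputs))
        i % 10 == 0 && GC.gc()  # every 10 solves; tune the frequency as needed
    end
    return all_results
end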
FWIW, I thought that there might be a plateau in the average memory usage, but I now suspect that the memory usage can continue to grow indefinitely: after increasing our server memory from 64 GB to 128 GB, the app just filled it up after a few days.
Note that the models run on this server are not controlled (anyone can POST any scenario), so the relatively flat periods of memory use could also be periods of relative inactivity.
@ericphanson mentioned running GC more often as a fix for managing memory usage in containerized Julia sessions; we are running GC.gc() after every JuMP model is solved, but the memory usage continues to grow. Is there something else that we should be doing, or do we need to wait for the issue mentioned by @jameson to be addressed?
From your picture above (three up), the difference between the blue and red lines definitely suggests that Julia isn't finalizing the C backend aggressively. This probably explains why CPLEX showed something similar.
So that’s part of the problem, but it doesn’t explain the gradual increase even in the red line.
Is this latest (bottom) picture with the manual finalizer? How many solves happen during this time? It really looks like Xpress is allocating something and then not releasing it when finalized. Are you on the latest version of Xpress?