JuMP optimize! call will very rarely violate model's time limit parameters by huge factors?

I am doing research that involves incrementally loosening the relative gap parameter of a MIP optimizer, in order to enable faster termination. Normally, this strategy works quite well. However, very rarely a call to optimize! will take several times longer than the time limit should allow. A small violation of the time limit, on the order of a few seconds, is expected behavior from both CPLEX and Gurobi, the two optimizers I have tried this with. However, this massive violation is unacceptable, and it has been observed with both optimizers.

JuliaOptHeuristics/new_trial.xlsx at 8180515770e990515b0e1162bea53de573448c66 · Antelling/JuliaOptHeuristics · GitHub is an Excel file containing data that demonstrates this problem.

I generate these results on a large university compute server while recording the server’s load. Although I previously had a problem with a memory leak in JuMP/Julia (this issue, which I solved by manually calling GC.gc()) filling up the server’s RAM, the server is now never at the limits of its specs. CPU usage is around 15% of capacity, RAM usage is around 50%, and nothing else is high.

This issue only seems to occur with high time limits (around five minutes per optimize! call). I get good, expected, error-free results with smaller time limits.

An Excel file containing data that demonstrates this problem.

I don’t understand the file. Try to make a reproducible example of a model that exceeds the time limit.

You can create an MPS file using

write_to_file(model, "model.mps")

Then we should see the problem if you run:

using JuMP, Gurobi
model = read_from_file("model.mps")
set_optimizer(model, optimizer_with_attributes(Gurobi.Optimizer, "TimeLimit" => 60.0))
optimize!(model)

However, this isn’t really a JuMP or Julia issue. Time limits are passed to the underlying solver (Gurobi or CPLEX). You should contact their support teams for more information, although they will probably ask you for the same reproducible example.


You should also read this post. Someone had a similar question recently: Setting Time Limits for JuMP (with CBC solver) - #6 by odow


Unless both of the underlying solvers have this same, bizarre bug, it is an issue with JuMP or Julia. As the issue occurs randomly with a very low probability, I have not been able to distill the situation into a repeatable example. Whenever an error occurs, I save the model and a copy of the model from before the optimize! call. Calling optimize! again on either of these saved models does not reproduce the problem.

I am aware of how to set the time limit parameters appropriately for the underlying solver.

JuMP doesn’t have any control over when solvers return from a time limit, so assuming we pass the time limit correctly to the solver (which it sounds like we do in the other cases), this is not a bug in JuMP.

It’s hard to say more without more information.

  • How are you measuring the time elapsed? (e.g., are you computing anything after optimize! that is also counted?)
  • What were the time limits you set?
  • What counts as over the time limit?
  • Does the solver’s log of a solve that exceeded the time limit offer any insight?
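
To separate those two measurements, it can help to compare the wall-clock time around optimize! with the time the solver itself reports. A minimal sketch, assuming the standard JuMP accessors:

```julia
using JuMP, Gurobi

model = Model(Gurobi.Optimizer)
set_time_limit_sec(model, 300.0)
# ... build the model ...

t_start = time()
optimize!(model)
wall_time = time() - t_start

# solve_time asks the solver how long *it* thinks the solve took; a large
# gap between the two numbers points at something outside the solver.
reported = solve_time(model)
println("wall = $wall_time s, solver-reported = $reported s")
```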

The most likely explanation is that this is not a bug, but expected behavior of both solvers. Setting a time limit asks the solver to terminate at the first available opportunity after the time limit. If they are in the middle of some expensive operation, such as solving the root node, they may decide to finish solving the root node before terminating.

Unfortunately, the saved model may be different from the model in memory in small ways that can make a difference. The best way to have a reproducible result is:

  1. Save the model before solving it, throw away the model in memory, load the saved model and solve it instead. In other words, guarantee that you are solving a model that already has the particularities introduced by saving and loading it back.
  2. Always set the solver seed to the same value.
  3. Do not use multiple threads.
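
Assuming the model is already built in `model`, those three steps might look like the following (the Gurobi parameter names below, and the CPLEX equivalents noted in the comment, are taken from the solvers’ parameter references):

```julia
using JuMP, Gurobi

# 1. Round-trip through a file so the solved model already carries any
#    particularities introduced by saving and loading it back.
write_to_file(model, "model.mps")
model = read_from_file("model.mps")

# 2-3. Fix the solver seed and disable multithreading. (For CPLEX, the
#      analogous parameters are "CPXPARAM_RandomSeed" and "CPXPARAM_Threads".)
set_optimizer(model, optimizer_with_attributes(
    Gurobi.Optimizer,
    "Seed" => 1234,
    "Threads" => 1,
    "TimeLimit" => 300.0,
))
optimize!(model)
```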

Even so, as you are dealing with time, small things like a different temperature can make your computer underclock (or not), giving different results.


Try applying some of Julia’s profiling tools to see where the time is spent. Are you querying the solve time by solve_time or some other means?

I am measuring time elapsed by saving time() into a variable, calling optimize!, saving time() into a variable again on the very next line, and then subtracting the two time values. The time limits I set vary from 120 seconds to 400 seconds. Any time the solver reaches the time limit, it will be a few seconds over, such as 305 seconds elapsed instead of the 300-second time limit. This is not a problem. The problematic time limit violations take thousands of seconds. 5000 seconds is not uncommon, and the worst I have seen is 9000.

I am trying henrique_becker’s suggestions now.


I have to say I have never had this kind of time limit violation, and I do execute CPLEX and Gurobi over hundreds of problems for each scientific paper I write (so I check closely whether the time limits were respected). However, I often use longer time limits, such as one to three hours. I suppose if you use the barrier algorithm for the root node, and each iteration is very slow, then maybe the solver does not stop in the middle of an iteration.

Do you have the solver logs? If I am not wrong, both CPLEX and Gurobi have a configuration in which they print which options have non-default values (so you could check which time limit they are getting from JuMP). More than that, the log often shows how much time was spent in each step (root node solving, heuristic search, tree search, etc.), and it is printed to the terminal even if the code stops at the time limit. This would help to check whether the problem is, for example, that the solver refuses to let go until it has solved the root node relaxation, or something like that.

An output log is here. It is a saved log of CPLEX’s output to the terminal.

Lines 37463, 42561, 53874, 74387, 113944, and more all contain examples of the issue.

Looking at the last instance you cite: if you just extract the “Elapsed time” rows, it’s running along at ~10k ticks per 10 seconds. (A tick is a deterministic CPLEX count of operations.)

Elapsed time = 20.57 sec. (22256.93 ticks, tree = 2183.26 MB, solutions = 15)
Elapsed time = 29.07 sec. (31794.49 ticks, tree = 2188.91 MB, solutions = 15)
Elapsed time = 37.56 sec. (41331.68 ticks, tree = 2195.69 MB, solutions = 15)
Elapsed time = 46.06 sec. (50869.27 ticks, tree = 2199.64 MB, solutions = 15)
Elapsed time = 54.49 sec. (60406.55 ticks, tree = 2204.01 MB, solutions = 15)
Elapsed time = 2121.55 sec. (69943.88 ticks, tree = 2211.33 MB, solutions = 15)

Then in the last row, we do another 10k ticks, but this time it takes 2000 seconds? That suggests it might be something with your computer. Do you have a job that runs periodically and consumes all the resources (e.g., a virus scan)? That would explain why it stalls both CPLEX and Gurobi.

Reduced MIP has 31 rows, 281 columns, and 7781 nonzeros.
Reduced MIP has 250 binaries, 31 generals, 0 SOSs, and 0 indicators.

These are also small problems. The solve times don’t really make sense.

Is the formulation bad? I think you need to provide the code of the model.


JuliaOptHeuristics/MDMKP.jl at f913fe28d026e045fb2f717457083999ff5a6b98 · Antelling/JuliaOptHeuristics · GitHub is the model creation code

I guess I really meant a minimal working example (or better yet, an MPS file of a problem that stalls).

What machine are you running on? Does it happen if you use a different computer? Do you have the Gurobi log as well?


The machine is my university’s compute server with an Intel(R) Xeon(R) CPU E5-2640 v3 processor, 32 GiB of RAM and CentOS Linux 7. I do not have access to another machine besides my laptop. I suppose I could modify the code to be able to run experiments incrementally, and then start and stop the code overnight, but it would take weeks to get one experiment to complete.

A Gurobi log is https://raw.githubusercontent.com/Antelling/JuliaOptHeuristics/892758eec4cf20b415a87e6e310b5b1c207a7c79/out.log
Line 10360 (ctrl-F “18643”) is, I think, the only example of the problem; I killed the Gurobi trial once it demonstrated the same problem.

If CPLEX’s tick metric does not have the same issue, can I simply do statistical analysis on that? What is the equivalent to MOI.get(model, MOI.SolveTime()) for ticks?

Hmm. This line is worrying:

User-callback calls 813366, time in user-callback 18118.75 sec

It suggests that the problem might be a Julia bug in the C -> Julia -> C callbacks. Are you adding any callbacks? Or is this just the default one we add so you can CTRL+C interrupt it?

I am not adding any callbacks.

So the Gurobi issue is likely the callback thing. I’ve opened a PR to allow disabling the default callback: Allow enable_interrupts=false by odow · Pull Request #403 · jump-dev/Gurobi.jl · GitHub
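
Once that PR lands, usage would presumably look something like the sketch below. The keyword name is taken from the PR title; treat the exact call signature as an assumption until the change is merged and documented:

```julia
using JuMP, Gurobi

env = Gurobi.Env()
# Hypothetical: skip installing the default CTRL+C callback that the
# "time in user-callback" log line implicates.
model = Model(() -> Gurobi.Optimizer(env; enable_interrupts = false))
```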

But CPLEX doesn’t have a callback, so it must be a different problem.

What is the equivalent to MOI.get(model, MOI.SolveTime()) for ticks?

There is no equivalent at the MOI level; you’d have to use the low-level C API:
https://www.ibm.com/docs/en/icos/20.1.0?topic=g-cpxxgetdettime-cpxgetdettime
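
Assuming CPLEX.jl exposes the generated C wrapper for that function, and that the low-level optimizer’s CPLEX environment is reachable via its `env` field (both assumptions), a sketch of querying the deterministic clock around a solve could look like:

```julia
using JuMP, CPLEX

model = Model(CPLEX.Optimizer)
# ... build the model ...

# Hypothetical low-level access: unsafe_backend returns the CPLEX.Optimizer,
# whose environment is what the C API call expects.
function dettime(model)
    cpx = unsafe_backend(model)
    p = Ref{Cdouble}()
    ret = CPLEX.CPXgetdettime(cpx.env, p)
    ret == 0 || error("CPXgetdettime failed with status $ret")
    return p[]
end

t0 = dettime(model)
optimize!(model)
println("Deterministic time: ", dettime(model) - t0, " ticks")
```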

If the error is caused by the machine I am using, couldn’t this also occur during the user callback?