I am trying to run a large nonlinear optimization problem on a HPC cluster, where I can only get 10 consecutive days of computing before my code is aborted due to the time limit in the cluster. As my problem involves 19 variables and an objective function that takes just over an hour to compute one value, this is not enough, so I print the trace of the optimization algorithm and try to get Optim to start where it left off in the previous run. The problem is, I believe this doesn’t preserve the gradient information computed by the algorithm.
Would autodiff solve this problem? Does autodiff need many different evaluations to estimate the initial gradient like finite differences seems to need (two per parameter judging from the output)? And if not, is there a way to load gradient information from the optimization trace?
To my knowledge there is no dedicated functionality to do this. However, I also had a similar issue and looking at the code you can see how the initial state is built (here for instance for the BFGS algorithm).
Then you can call Optimize with a modified initial state as so:
opt = Optim.optimize(fun, init, method, opts, oldstate)
There’s a few things that aren’t clear:
- You’re talking about a large problem, but it has only 19 variables. Are you running a local method (gradient-based) or a global method?
- Why are you trying to “start Optim where it left off”? Do you mean between 2 groups of 10 consecutive days?
This seems to be the right direction! So if I understand you correctly, I can try to print out the inverse Hessian
invH and the stepnorm
s and then supply these as intitial conditions? Or do you recommend creating a full BFGSState object? Do you know how I can obtain/save this state information at every iteration? And would you happen to know whether this all still functions within Fminbox?
The size of the problem does not hide in the number of variables (I agree that 19 variables isn’t ridiculously many) but in the number of simulations and approximations required to obtain the value of the objective function related to a single choice of variables, which means that every function evaluation takes just over one hour after parallelisation. Thus computing two function evaluations for every dimension to estimate a starting value of the gradient from finite differences is already two days of computing time.
I indeed mean 2 groups of 10 consecutive days, as the HPC cluster I’m using is a shared resource with a maximum runtime of 10 days for any script.
If it takes such a long time, are you sure Optim.jl is really the right approach?
I was working under the assumption that the optimization routines of Optim.jl were probably implemented faster and more robustly than anything I could cook up by myself. Would you instead recommend writing my own optimization routine, or else is there another package more suited for optimization with very slow objective functions?
Well it sounds to me like you are doing simulated method of moments or something similar. In that case, I would guess you would want to use a global optimizer instead of a local one unless you have guarantees that your initial point is “good enough” or that your problem is convex (but usually in SMM we do not have these guarantees — again guessing here).
1 hour is an awfully long time for a single objective function evaluation. I think it could be worthwhile for you to look into using GitHub - SciML/Surrogates.jl: Surrogate modeling and optimization for scientific machine learning (SciML) to build a surrogate (“approximating”) model to your objective function and then optimize surrogate. After that, you use that estimate as the initial point to optimize your true objective function.
Alternatively, you could look into something like the TikTak algorithm which was explicitly designed for this type of parallelism.
Oh wow this sounds like a really good find, thank you very much, I will look into it!
I’m indeed using a form of “quasi-SMM”, as my simulations are aggregated from quadratures and quasi-Monte Carlo low discrepancy grids (Sobol) which I am already computing in parallel, but if I understand correctly Surrogates can accommodate that type of structure.
I was using local optimization with box constraints because my parameter values have a rather strict interpretation, implying they should not wander too far from their “common” values reported in the literature in the first place, so this is why I’d actually prefer a decent local optimum in the neighbourhood of the intial point than an infeasible global optimum. But this can also be achieved through sufficiently tight constraints on a global optimizer.
Local optimization may be too slow to converge even with decent starting values for the parameters from Surrogates.jl due to the gradient computations taking so long. If I understand correctly I can then use TikTak for that final step of the process, correct?
No, TikTak is a global optimizer in its own right, you wouldn’t use it to polish anything off. The first step in the process is to compute your objective function at a large number of Sobol points! If you have 19 variables, you’d probably want at least 100 starting points I’d guess (I would have said 1000 but no way with 1hr per evaluation).
Given what you’ve said about your setup, I guess box constrained local is probably alright. However, I’m still wondering how feasible it is given that you have 19 variables. How many iterations do you expect it should take to reach an optimum? Even with a crude tolerance, I can’t imagine it being less than several hundred.
I have the time to estimate around 1000-2000 iterations in total before I need to deliver some results. I could try to use some of those (say 200) to construct a grid for estimating a surrogate polynomial and minimise this for a better starting value, and then start the local estimation from there, or I could continue with local optimization. This still leaves the question of how to save the “state information” for Optim between consecutive runs, but I suppose I can either try hand-cook the BFGS routine to store a backup of its state-information in an accessible location or hope @touste has some insights
This seems a bit more tricky with
fminbox as the initialization is hard coded in the optimize call: https://github.com/JuliaNLSolvers/Optim.jl/blob/master/src/multivariate/solvers/constrained/fminbox.jl#L322
So I don’t think
fminbox accepts initial states.
As for other algorithms that accept an initial state in the call function, what I would do is save all the relevant fields (gradients, hessians, …) of the initial state structure (using
JLD2.jl for instance), and then restore them in the next run.
Note that a dirty trick you may use is to overload the call to
initial_state to provide your own:
Optim.initial_state(_optimizer, options, dfbox, x) = load("oldstate.jld2")
These were exactly the dirty tricks I was thinking of, so I’m very happy to get your seal of approval for them