(It is also, of course, horrible to have to rewrite the function that way. That’s what we’re trying to solve using Julia.)
Sorry for the string of replies, but actually that function has lots of global minima. IntervalOptimisation is trying to find all of them. It is unfortunately a known property of these methods (the “cluster effect”) that they find it difficult to discard boxes that are close to a minimum.
My own way to use SA in real life is to run it twice or more, to verify that I get the same result. If different results are obtained, then the cooling is too fast.
There’s really no need for multiple starts with SA, I would say, because SA explores the initially specified box fairly thoroughly, if cooling is not too fast. I sometimes use a rapid SA with a low limit on function evaluations to obtain start values for MCMC, as a handy substitute for a more systematic way of initializing. With regard to combining algorithms, one could combine a more systematic grid over the box, and then switch to SA with faster cooling. The possibilities are endless! (which is part of the problem, of course).
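As a rough sketch of that grid-then-SA combination, assuming Optim.jl's SAMIN (the objective, box, grid resolution, and rt value here are placeholders, not recommendations):

using Optim

f(x) = sum(abs2, x .- 0.5)                    # placeholder objective
lb, ub = fill(-2.0, 2), fill(2.0, 2)          # the box to explore

# systematic coarse grid over the box
grids = [range(l, u; length = 11) for (l, u) in zip(lb, ub)]
candidates = vec([collect(t) for t in Iterators.product(grids...)])
x0 = candidates[argmin(map(f, candidates))]   # best grid point starts the SA

# then switch to SA with faster cooling (smaller rt), since we start near a basin
res = Optim.optimize(f, lb, ub, x0, SAMIN(rt = 0.5),
                     Optim.Options(iterations = 10^5))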
Are you sure about this? It was intended to have only one. Each f0 ∘ abs looks like [plot of the piecewise function omitted], so the only global minimum should be at 0. But, of course, it has 2 extra local minima for each axis.
Thanks for the figure. I had defined the interval version wrong.
Here’s take 2:
"Evaluate interval function with piecewise region--function pairs"
function piecewise(pieces::Vector{<:Pair}, X)
return reduce(∪, f(X ∩ region) for (region, f) in pieces)
end
X = 2..5
f2 = X -> piecewise([0..3 => x->x+1,
3..6 => x->x,
6..∞ => x->x+2], X)
f(x) = sum(f2 ∘ abs, x)
B = IntervalBox(-10..10, 10)
julia> @time minimum, minimisers = minimise(f, B, tol=1e-6);
0.011364 seconds (198.56 k allocations: 7.963 MiB)
julia> minimum
[10, 10]
julia> minimisers
1-element Array{IntervalBox{10,Float64},1}:
[-5.98024e-08, 5.12973e-07] × [-5.98024e-08, 5.12973e-07] × [-5.98024e-08, 5.12973e-07] × [-5.98024e-08, 5.12973e-07] × [-5.98024e-08, 5.12973e-07] × [-5.98024e-08, 5.12973e-07] × [-5.98024e-08, 5.12973e-07] × [-5.98024e-08, 5.12973e-07] × [-5.98024e-08, 5.12973e-07] × [-5.98024e-08, 5.12973e-07]
It wouldn’t (I think) be too difficult to extend this to piecewise functions where the range was not given explicitly as here, but rather by a condition like x^2 + y^2 \le 1.
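For instance, interval evaluation of the condition itself can classify a whole box at once. A small sketch, assuming IntervalArithmetic's set operations (inside_disc and outside_disc are hypothetical helper names, not part of any package):

using IntervalArithmetic

# For the condition x^2 + y^2 ≤ 1, evaluated on a box (X, Y):
inside_disc(X, Y)  = (X^2 + Y^2) ⊆ (0..1)            # box certainly satisfies the condition
outside_disc(X, Y) = isempty((X^2 + Y^2) ∩ (0..1))   # box certainly violates it
# boxes that are neither can be bisected, as in the usual branch-and-bound loop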
The conversion of ifs to piecewise functions like this is trickier…
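For reference, the pointwise function that the pieces above encode is just a chain of branches (f0 here is my reconstruction from the pieces, not copied from earlier in the thread):

# pointwise version of the pieces above; mechanically converting such
# branches into region => function pairs is the tricky part
f0(x) = x < 3 ? x + 1 :
        x < 6 ? x     :
                x + 2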
EDIT: Julia + intervals makes defining piecewise functions so beautiful!
Is there a way to limit the number of function evaluations in IntervalOptimisation?
Not yet, no - but if you mean just putting a maximum and bailing if you reach that maximum, that should be easy. Would you like to make a PR?
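Purely schematically, this is the shape such a bail-out would take in a generic interval branch-and-bound loop (not IntervalOptimisation's actual internals; all names are illustrative):

using IntervalArithmetic

function minimise_sketch(f, X; tol = 1e-3, maxeval = 10_000)
    working = [X]                       # boxes still to process
    best_upper = Inf                    # best verified upper bound so far
    minimisers = typeof(X)[]
    numeval = 0
    while !isempty(working)
        numeval ≥ maxeval && break      # bail out once the budget is spent
        Y = popfirst!(working)
        fY = f(Y); numeval += 1
        fY.lo > best_upper && continue  # box cannot contain the minimum
        best_upper = min(best_upper, fY.hi)
        if diam(Y) < tol
            push!(minimisers, Y)
        else
            append!(working, bisect(Y)) # split the box and keep going
        end
    end
    return best_upper, minimisers
end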
Possibly I was not clear: my point above was that it is (practically) impossible for simulated models that have branches and random numbers.
Consider a setup where 100,000 agents are simulated for 20 years, with one branch per agent per year. Moments are then obtained from the simulated data. That’s 2 \cdot 10^6 binary branch points, i.e. 2^{2 \cdot 10^6} combinations — you see where this is going.
Don’t get me wrong — I find interval methods amazing and elegant. I think they have a place and should be more widely known; e.g., I am now using them in some other code to solve for some equilibrium prices. But there are still a lot of problems where robust global solvers based on heuristics and randomization are the best choice.
If you lack differentiability and continuity, you really shouldn’t be using any algorithm that, at its heart, takes a derivative. This includes algorithms like NEWUOA, BOBYQA, and COBYLA, which internally fit data to a differentiable model.
If you are trying to solve an optimization problem that lacks differentiability, usually your best course of action is to think more deeply about the problem and see whether you can reformulate it in a differentiable way, e.g. by using an epigraph formulation.
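For example, the nonsmooth minimax problem \min_x \max_i f_i(x) has the equivalent epigraph form \min_{x,\,t} t subject to f_i(x) \le t for all i: if each f_i is differentiable, the reformulation is smooth and can be handed to any nonlinearly constrained solver.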
For global optimizations of smooth functions in < 10 dimensions, usually the MLSL multistart algorithm is my first choice, though like most multistart algorithms it requires some problem-specific experimentation to choose a good stopping criterion for the local optimizer (since you don’t want to waste too much time doing local optimization to high precision … you can always “polish” the best local optimum at the end). MLSL doesn’t handle nonlinear constraints, however.
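As a hedged sketch of that workflow with NLopt.jl (the 2-D Rosenbrock objective, bounds, tolerances, and budget below are placeholders, not recommendations):

using NLopt

# the objective must accept (x, grad); grad stays empty for derivative-free methods
rosenbrock(x, grad) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2

local_opt = Opt(:LN_BOBYQA, 2)     # local optimizer that MLSL restarts
xtol_rel!(local_opt, 1e-4)         # coarse local tolerance; polish the winner afterwards

opt = Opt(:G_MLSL_LDS, 2)          # MLSL with low-discrepancy starting points
lower_bounds!(opt, [-5.0, -5.0])
upper_bounds!(opt, [5.0, 5.0])
min_objective!(opt, rosenbrock)
local_optimizer!(opt, local_opt)
maxeval!(opt, 10_000)              # overall evaluation budget as the stopping criterion

minf, minx, ret = optimize(opt, [0.0, 0.0])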
@mcreel, what are the roles of nt and ns in your SAMIN() code? I didn’t find the documentation very clear here, so I am hoping you could help clarify, since you changed the default setting of ns (from 5 to 3) relative to the default SAMIN() signature posted below:
SAMIN(; nt::Int = 5,           # reduce temperature every nt*ns*dim(x_init) evaluations
        ns::Int = 5,           # adjust bounds every ns*dim(x_init) evaluations
        rt::T = 0.9,           # geometric temperature reduction factor: when temp changes, new temp is t = rt*t
        neps::Int = 5,         # number of previous best values the final result is compared to
        f_tol::T = 1e-12,      # the required tolerance level for function value comparisons
        x_tol::T = 1e-6,       # the required tolerance level for x
        coverage_ok::Bool = false, # if false, increase temperature until initial parameter space is covered
        verbosity::Int = 0)    # scalar: 0, 1, 2 or 3 (default = 0)
nt and ns jointly determine how many trials are done at a given temperature. ns controls how many trials are done, while the temperature is held at a given value, before the bounds are adjusted to keep the acceptance rate within limits. I used non-default values somewhat by accident; I was experimenting more with rt, which is a more important control. I recommend having a look at the papers cited in the help documentation to get a better feel for the different controls.
The default values are, in my opinion, fairly conservative for reasonably well-behaved objective functions, as might be encountered in economics, for example, where we don’t usually expect extremely nonlinear behavior. For very difficult problems, such as the Rastrigin function, the defaults don’t give good results.
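For concreteness, here is a minimal sketch of calling it through Optim.jl (the Rastrigin objective, box, starting point, and rt value are only illustrative):

using Optim

rastrigin(x) = 10length(x) + sum(xi^2 - 10cos(2π * xi) for xi in x)

lb, ub = fill(-5.12, 3), fill(5.12, 3)   # box constraints required by SAMIN
x0 = fill(4.0, 3)                        # starting point inside the box

# slower cooling (rt closer to 1) for a difficult multimodal objective
res = Optim.optimize(rastrigin, lb, ub, x0, SAMIN(rt = 0.98),
                     Optim.Options(iterations = 10^6))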
@mcreel, thanks, yes that makes sense. How does this play out with high-dimensional functions, though? Let’s say I have a function f: \mathbb{R}^n \rightarrow \mathbb{R}, where n = 100. Then setting nt = ns = 1 would mean that the temperature is adjusted once every nt*ns*dim(x_init) = 100 evaluations, so we will need to set the number of function evaluations to be somewhat large to get some idea of how quickly the SAMIN() algorithm is closing in on the optimizer. The bounds will also be adjusted once per 100 evaluations, but I thought that the implementation of SAMIN() in Julia now allows users to set upper and lower bounds for the parameter values. What does “bounds being adjusted to keep the acceptance rate within limits” mean, then, in a context where a user can pre-set the upper and lower bounds of the parameter values in SAMIN()?
I’ll definitely check out the original paper, thank you again for the helpful reply! Including a link to the original paper in case anyone is interested
https://www.sciencedirect.com/science/article/abs/pii/0304407694900388
This made me think again about an optimization problem I have been struggling with for a while.
The objective function has the form \displaystyle f(x) = \sum_{z_1,\ldots,z_n} g(x, z_1, \ldots, z_n) where the z_i are discrete random variables such that the sum would easily run over more than 10^{10} different states, and x \in \mathbb{R}^m with m around 100. The function g(x, z_1, \ldots, z_n) is not differentiable in x, because of the interactions with the discrete random variables. (In fact, x are the parameters of a stochastic dynamical system.) Since it is easy to get Monte Carlo estimates \hat f(x), I currently use these noisy estimates with CMA-ES to approximately find a local minimum; CMA-ES turned out to work better than many alternatives that I’ve tried.
But maybe there is a nice reformulation of this optimization problem that I’ve overlooked. Any ideas?
I don’t think you’d want to spend a lot of energy on variance-reduction techniques, since there’ll be noise in any estimate.
You’re better off using optimization methods that are “robust to noise”. There has been some progress on Stochastic Gradient Descent for discrete state spaces so I’d suggest checking out those papers.
Maybe the central limit theorem lets you replace these with continuous random variables in some way to get nearly the same average? However, even with continuous random variables, with m=100 you have a 100-dimensional integral inside your objective function, so there will need to be some very special property of g(x,z) for that integral to be tractable.
@stevengj Assuming the variables he is integrating over are independent, why would the dimension matter here if he is using something like Monte Carlo to simulate the integral? Wouldn’t the variance of the estimator be independent of the dimension and depend only on the number of simulations?
If I am not wrong, he is already using MC methods to compute \hat f(x).
First, Monte Carlo’s convergence rate of O(\sigma/\sqrt{n}) is terribly slow to begin with, and methods to accelerate it (e.g. quasi-Monte Carlo, importance sampling, …) often depend explicitly on the dimension (see e.g. this review). Second, Monte Carlo’s convergence rate depends on the standard deviation \sigma, and that is not necessarily independent of dimension — how do we know it is independent for the poster’s problem?
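A quick numerical illustration of that \sqrt{n} scaling (estimating E[z] = 0 for z \sim N(0,1); the details are purely illustrative):

using Statistics, Random

Random.seed!(1)
for n in (10^2, 10^4, 10^6)
    errs = [abs(mean(randn(n))) for _ in 1:100]   # 100 independent MC estimates
    println("n = $n:  mean |error| ≈ ", round(mean(errs), sigdigits = 2))
end

Each 100× increase in the number of samples buys only a 10× reduction in the error.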
Also, if you can exploit some property of the integrand to get a deterministic estimate (not Monte Carlo), so that the objective function is smooth in the parameters, it will make the optimization problem much easier.
I suspect that paying careful thought to how the objective function is formulated and computed will be the main way to make progress in @jbrea’s problem. In general, my experience is that few optimization specialists spend much time developing new algorithms per se. Most of the cleverness in optimization lies in finding a way to take a problem and reformulate or transform it in some way so as to exploit the most efficient existing algorithms.
@stevengj hmm, yes I was talking about a fixed \sigma.
I agree with your suggestion. Once you make the objective smooth in the parameters, you open yourself up to standard SGD methods which are robust to noise in the estimates and relatively easy to code. Hence, one can spend less energy on getting extremely precise estimates (variance-reduction) and focus on function optimization. One still needs to tune parameters like learning rates but there is substantial literature on it to get started.
Thanks a lot for the inputs @stevengj and @bashonubuntu! I will think about deterministic estimates, but I am skeptical that the central limit theorem can come to the rescue here…
@bashonubuntu: it would be nice if I could use SGD, but so far I have not seen a way to formulate the problem such that I can take gradients. Basically, g(x, z_1, \ldots, z_n) can be further decomposed using the recursion s_i = h(x, z_i, s_{i-1}), i.e. g(x, z_1, \ldots, z_n) = h\Big(x, z_n, h\big(x, z_{n-1}, h(x, z_{n-2}, \ldots)\big)\Big), but since s_i is discrete, I can’t just compute partial derivatives.
@dpsanders I made a quick benchmark of IntervalOptimisation (using BlackBoxOptimizationBenchmarking.jl) against one of the best algorithms in BlackBoxOptim. I used 12 test functions in 6 dimensions. IntervalOptimisation has some really good results (in 3D it was even better), but it’s also very slow, more than 100x slower than a simplex, which makes benchmarking it further on my poor laptop a bit painful.