Optimizing noisy objective

Tamas_Papp · June 13, 2019, 5:39pm

I have a f: \mathbb{R}^n \to \mathbb{R}^m function, m > n, and I am trying to minimize some norm of f(x). The norm itself is flexible, I would like to get something working.

The problem is that f is stochastic as it comes from a simulation. I have tried to reduce variance in the obvious ways (common random variates) where I could, but some remains. The original problem is an indirect inference exercise (in economics).

Any advice on methods and their tuning would be appreciated.

I boiled it down to the following MWE:

const σ = [0.000262; 0.000316; 0.000638; 0.00119; 0.00118;
           0.00126; 0.0014; 0.000233; 0.000243; 0.000291]

const J = [0.23 -0.194 0.672 1.12 -0.0157 -0.0287;
           0.225 -0.187 0.655 1.15 -0.00563 -0.0471;
           0.193 -0.107 0.615 1.24 0.0423 -0.0476;
           -0.0154 0.216 0.141 1.52 0.0216 -0.0649;
           -0.0195 0.186 0.0519 1.76 0.261 -0.0149;
           -0.011 0.184 0.0554 1.96 0.483 0.102;
           0.0042 0.174 0.0502 2.07 0.614 0.132;
           0.0223 0.000494 -0.0088 0.000412 -0.003 0.0623;
           -0.00222 0.0721 -0.0233 -0.00147 -0.00651 0.0187;
           0.0087 0.00169 -0.00492 0.000407 -0.00184 -0.045]

f(x) = J * x .+ (randn(10) .* σ)
g1(y) = sum(abs, y) # norm
g2(y) = sum(abs2, y) # other norm?
lo, hi = fill(-5.0, 6), fill(5.0, 6) # bounds if needed

This is an approximate problem derived from the real one, for the MWE. For the actual f, evaluation takes about 2s.

Here is what I tried so far:

BayesOpt

@jbrea’s excellent BayesOpt gives the best solution:

using BayesOpt
config = ConfigParameters()         # calls initialize_parameters_to_default of the C API
set_kernel!(config, "kMaternARD5")  # calls set_kernel of the C API
config.sc_type = SC_MAP
config.n_iterations = 500
config.init_method = 3
config.n_init_samples = 100
config.n_iter_relearn = 1
optimizer, optimum = bayes_optimization(g1 ∘ f, lo, hi, config)

The L_1 norm seems to prevent premature termination (L_2 sometimes leads to that).

BlackBoxOptim

using BlackBoxOptim
res = bboptimize(g1 ∘ f; SearchRange = tuple.(lo, hi), MaxFuncEvals = 200,
                 method = :dnes)
res = bboptimize(g2 ∘ f; SearchRange = tuple.(lo, hi), MaxFuncEvals = 200,
                 method = :dxnes)

Early termination with not very good values. Some papers suggest that DE and DXNES algorithms should work for these problems, am I tuning them wrong?

CMA-ES

using Evolutionary
cmaes(g1 ∘ f, 6; μ = 2, λ = 4, tol = 1e-5)

terminates somewhat early and does not find a good optimum.

Searching a Sobol grid

Using a (yet unregistered) package I wrote for this:

using SobolSearches
s = sobol_search(g1 ∘ f, lo, hi, 10, 1000)
s[1]

This is fairly robust but I am worried it will not scale with a larger dimension.

dawbarton · June 13, 2019, 5:56pm

How big is n (roughly)? (That is, is your MWE of a similar size since you mentioned being worried about scaling?)

Tamas_Papp · June 13, 2019, 6:30pm

n=6 comes from an actual problem. Eventually n=20 or n=30.

But larger models nest the smaller ones so I can provide reasonable starting values. That said, I am afraid quasirandom search will not scale.

mcreel · June 13, 2019, 7:04pm

This code using the simulated annealing from https://github.com/mcreel/Econometrics/blob/master/src/Optimization/samin.jl (which is the same algorithm as the one in Optim, but with my own interface)
solves the problem much faster than BayesOpt. So far, BayesOpt is still running, after several minutes, and is still considerably above the value that SA found in about one second.

const σ = [0.000262; 0.000316; 0.000638; 0.00119; 0.00118;
           0.00126; 0.0014; 0.000233; 0.000243; 0.000291]

const J = [0.23 -0.194 0.672 1.12 -0.0157 -0.0287;
           0.225 -0.187 0.655 1.15 -0.00563 -0.0471;
           0.193 -0.107 0.615 1.24 0.0423 -0.0476;
           -0.0154 0.216 0.141 1.52 0.0216 -0.0649;
           -0.0195 0.186 0.0519 1.76 0.261 -0.0149;
           -0.011 0.184 0.0554 1.96 0.483 0.102;
           0.0042 0.174 0.0502 2.07 0.614 0.132;
           0.0223 0.000494 -0.0088 0.000412 -0.003 0.0623;
           -0.00222 0.0721 -0.0233 -0.00147 -0.00651 0.0187;
           0.0087 0.00169 -0.00492 0.000407 -0.00184 -0.045]

f(x) = J * x .+ (randn(10) .* σ)
g1(y) = sum(abs, y) # norm
g2(y) = sum(abs2, y) # other norm?
lo, hi = fill(-5.0, 6), fill(5.0, 6) # bounds if needed
x = rand(size(lo,1),1).*(hi - lo) + lo
@time x_sa, f_sa, junk = samin(g1 ∘ f, x, lo, hi, verbosity=1; rt=0.9, ns=20)

using BayesOpt
config = ConfigParameters()         # calls initialize_parameters_to_default of the C API
set_kernel!(config, "kMaternARD5")  # calls set_kernel of the C API
config.sc_type = SC_MAP
config.n_iterations = 500
config.init_method = 3
config.n_init_samples = 100
config.n_iter_relearn = 1
x_bo, f_bo = bayes_optimization(g1 ∘ f, lo, hi, config)

x_sa
f_sa
x_bo
f_bo

aaowens · June 13, 2019, 7:13pm

Maybe something from NLopt? It has many algorithms to choose from.

using NLopt
opt = Opt(:GD_STOGO_RAND, 6)
opt.lower_bounds = lo
opt.upper_bounds = hi
opt.maxeval = 2000
function obj(x, grad)
return (g1 ∘ f)(x)
end 
opt.min_objective = obj
opt.xtol_rel = 1e-4

julia> res = optimize(opt, 0.5hi)
(0.009890825412729495, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], :SUCCESS)

mcreel · June 13, 2019, 8:18pm

Using both samin and the suggested algo in NLopt, samin is finding a better value of the objective, in repeated runs. I’m using the settings from aaowens message. Both are fast, though.

Here’s the output, both algos are solving exactly the same problem:


julia> include("problem.jl")
WARNING: redefining constant σ
WARNING: redefining constant J
************************************************************
SAMIN results
==> Normal convergence <==
total number of objective function evaluations: 65400

     Obj. value:      0.0018255799

       parameter      search width
         0.00264           0.00000 
         0.00208           0.00000 
         0.00124           0.00000 
        -0.00029           0.00000 
        -0.00205           0.00000 
         0.00700           0.00000 
************************************************************
  0.689499 seconds (944.40 k allocations: 72.578 MiB, 3.17% gc time)
optimize(opt, 0.5hi) = (0.00606729899738805, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], :SUCCESS)

aaowens · June 13, 2019, 8:43pm

It looks like STOGO is just settling on 0 actually. If you’re willing to try 20,000 evaluations, one of the other algorithms (controlled random search) in NLopt seems to do as well as yours. The estimates aren’t stable though. Maybe the objective is too random?

opt = Opt(:GN_CRS2_LM, 6)
opt.lower_bounds = lo
opt.upper_bounds = hi
opt.maxeval = 20000
function obj(x, grad)
return (g1 ∘ f)(x)
end 
opt.min_objective = obj
opt.xtol_rel = 1e-4

julia> res = optimize(opt, 0.5hi)
(0.0014034408112579906, [-0.00734686, -0.00323328, 0.0016716, -2.21004e-5, 0.0011075, 0.00270576], :MAXEVAL_REACHED)

ericphanson · June 13, 2019, 9:35pm

This is a very naive idea, but for algorithms that are expecting a deterministic function, maybe some averaging will help? E.g. you could try to minimize the norm of the following f2 instead of f:

using Statistics
f2(x) = mean(f, (x for _ = 1:5))

(I’m no expert though. Maybe this is just too wasteful of those expensive evals).

pistacliffcho · June 13, 2019, 9:37pm

Just a note…

Unless I’m missing something, I think you’re pretty much boxed into using something like BayesOpt because your function is stochastic. Methods like BlackBoxOptim, CMA-ES and Sobol Searches are all deterministic objective functions, unless there’s some variations on these I don’t know about.

As far as I can tell, nothing in the documentation that implies these implementations have been adopted for stochastic objectives.

stevengj · June 13, 2019, 9:48pm

You can implement the Adam algorithm in only a few lines of code, and it often works well for this kind of thing.

mohamed82008 · June 14, 2019, 12:14am

If the MWE is your actual problem, then the 2-norm minimiser of each sample is given by x = J \ b where b = -(randn(10) .* σ), and with some derivations it is possible to show that the expected 2-norm-squared E(norm(f(x), 2)^2) minimizer is given by x = J \ E(b).

Tamas_Papp · June 14, 2019, 7:59am

Unfortunately it isn’t It is a 4k-LOC monster, which is why I made an MWE.

This is of course true, but in practice these methods are also used for (mildly) stochastic objectives.

My understanding is that Adam requires gradients, which I don’t have (and, since it is a discontinuous simulation, I don’t think I can get them in any practical way).

c42f · June 14, 2019, 11:30am

How would you tweak the MWE to reflect this part of the problem?

Tamas_Papp · June 14, 2019, 11:51am

Pretend we don’t have gradients.

mohamed82008 · June 14, 2019, 12:06pm

Check https://github.com/timholy/QuadDIRECT.jl

mohamed82008 · June 14, 2019, 12:16pm

Btw, for black-box stuff 200 function evaluations is too little, I think you need 5k-100k at least to get decent results.

stevengj · June 14, 2019, 12:23pm

You should try to avoid this if you can. Even methods that don’t require gradients often assume “in their hearts” that a gradient exists, and convergence to a local optimum is typically much slower for non-smooth functions even if the method still converges. There are lots of tricks to take problems that are nominally discontinuous and reformulate them in a smooth way, e.g. epigraph formulations.

Tamas_Papp · June 14, 2019, 12:27pm

This is indirect inference using simulated data. Essentially

function objective(parameters, targets)
    data = f([rng], parameters)
    moments = g(data)
    sum(abs2, targets, moments)
end

with frills. f can have branches etc, depending on random draws. AFAIK methods to deal with this are WIP, eg

@article{bruins2017generalized,
  title={Generalized Indirect Inference for Discrete Choice Models},
  author={Bruins, Marianne and Duffy, James A and Keane, Michael P and Smith Jr, Anthony A},
  year={2017}
}

but I am always happy to hear suggestions.

jonathanBieler · June 14, 2019, 1:35pm

According to :

https://coco.gforge.inria.fr/doku.php?id=higlighted_results_2012_all_noisy_functions

The best algorithm on noisy functions is some variant of CMAES, with xNES being a bit behind. It’s a pretty limited benchmark, but I haven’t found anything better. The CMAES version in Evolutionary is the most basic one, so I don’t know how well it’s supposed to perform here. That said if it converges too fast you can increase the population size (e.g λ = 16, μ = 8).

robsmith11 · June 14, 2019, 2:15pm

I often have similar problems… calibration of expensive stochastic simulations (seconds to minutes per evaluation) with 5-15 parameters and highly noisy objectives.

Bayesian optimization is the most sample-efficient approach in theory, and it works pretty well for me when the dimensions aren’t too high and I can use reasonable priors for objective smoothness. With some objectives, the heterogeneity of the noise can be a problem. This R package helps in those situations:

The Cross Entropy Method worked pretty well for me as well when the dimensions get higher and noise lower:

I’ve also started looking at Hyperband/BOHB, but I think that may only be appropriate for situations where there is an expensive training step and a fast objective evaluation.

Topic		Replies	Views
Latest recommendations for global optimization Optimization (Mathematical) package	92	13192	August 23, 2022
Which optimization package should I use? Optimization (Mathematical) optimization , nonlinear	13	3146	May 1, 2021
MAP optimization in Turing with LBFGS doesn't move at all Probabilistic Programming optimization , turing , autodiff	22	268	August 23, 2024
Choosing an optimizer Optimization (Mathematical) question	10	454	January 28, 2024
Optim: What optimiser is best if your gradient computation is slow? Optimization (Mathematical) optim	13	3151	August 22, 2017

Optimizing noisy objective

BayesOpt

BlackBoxOptim

CMA-ES

Searching a Sobol grid

Related topics