Optimizing noisy objective

I have a f: \mathbb{R}^n \to \mathbb{R}^m function, m > n, and I am trying to minimize some norm of f(x). The norm itself is flexible, I would like to get something working.

The problem is that f is stochastic as it comes from a simulation. I have tried to reduce variance in the obvious ways (common random variates) where I could, but some remains. The original problem is an indirect inference exercise (in economics).

Any advice on methods and their tuning would be appreciated.

I boiled it down to the following MWE:

const σ = [0.000262; 0.000316; 0.000638; 0.00119; 0.00118;
           0.00126; 0.0014; 0.000233; 0.000243; 0.000291]

const J = [0.23 -0.194 0.672 1.12 -0.0157 -0.0287;
           0.225 -0.187 0.655 1.15 -0.00563 -0.0471;
           0.193 -0.107 0.615 1.24 0.0423 -0.0476;
           -0.0154 0.216 0.141 1.52 0.0216 -0.0649;
           -0.0195 0.186 0.0519 1.76 0.261 -0.0149;
           -0.011 0.184 0.0554 1.96 0.483 0.102;
           0.0042 0.174 0.0502 2.07 0.614 0.132;
           0.0223 0.000494 -0.0088 0.000412 -0.003 0.0623;
           -0.00222 0.0721 -0.0233 -0.00147 -0.00651 0.0187;
           0.0087 0.00169 -0.00492 0.000407 -0.00184 -0.045]

f(x) = J * x .+ (randn(10) .* σ)
g1(y) = sum(abs, y) # norm
g2(y) = sum(abs2, y) # other norm?
lo, hi = fill(-5.0, 6), fill(5.0, 6) # bounds if needed

This is an approximate problem derived from the real one, for the MWE. For the actual f, evaluation takes about 2s.

Here is what I tried so far:

BayesOpt

@jbrea’s excellent BayesOpt gives the best solution:

using BayesOpt
config = ConfigParameters()         # calls initialize_parameters_to_default of the C API
set_kernel!(config, "kMaternARD5")  # calls set_kernel of the C API
config.sc_type = SC_MAP
config.n_iterations = 500
config.init_method = 3
config.n_init_samples = 100
config.n_iter_relearn = 1
optimizer, optimum = bayes_optimization(g1 ∘ f, lo, hi, config)

The L_1 norm seems to prevent premature termination (L_2 sometimes leads to that).

BlackBoxOptim

using BlackBoxOptim
res = bboptimize(g1 ∘ f; SearchRange = tuple.(lo, hi), MaxFuncEvals = 200,
                 method = :dnes)
res = bboptimize(g2 ∘ f; SearchRange = tuple.(lo, hi), MaxFuncEvals = 200,
                 method = :dxnes)

Early termination with not very good values. Some papers suggest that DE and DXNES algorithms should work for these problems, am I tuning them wrong?

CMA-ES

using Evolutionary
cmaes(g1 ∘ f, 6; μ = 2, λ = 4, tol = 1e-5)

terminates somewhat early and does not find a good optimum.

Searching a Sobol grid

Using a (yet unregistered) package I wrote for this:

using SobolSearches
s = sobol_search(g1 ∘ f, lo, hi, 10, 1000)
s[1]

This is fairly robust but I am worried it will not scale with a larger dimension.

4 Likes

How big is n (roughly)? (That is, is your MWE of a similar size since you mentioned being worried about scaling?)

n=6 comes from an actual problem. Eventually n=20 or n=30.

But larger models nest the smaller ones so I can provide reasonable starting values. That said, I am afraid quasirandom search will not scale.

This code using the simulated annealing from https://github.com/mcreel/Econometrics/blob/master/src/Optimization/samin.jl (which is the same algorithm as the one in Optim, but with my own interface)
solves the problem much faster than BayesOpt. So far, BayesOpt is still running, after several minutes, and is still considerably above the value that SA found in about one second.

const σ = [0.000262; 0.000316; 0.000638; 0.00119; 0.00118;
           0.00126; 0.0014; 0.000233; 0.000243; 0.000291]

const J = [0.23 -0.194 0.672 1.12 -0.0157 -0.0287;
           0.225 -0.187 0.655 1.15 -0.00563 -0.0471;
           0.193 -0.107 0.615 1.24 0.0423 -0.0476;
           -0.0154 0.216 0.141 1.52 0.0216 -0.0649;
           -0.0195 0.186 0.0519 1.76 0.261 -0.0149;
           -0.011 0.184 0.0554 1.96 0.483 0.102;
           0.0042 0.174 0.0502 2.07 0.614 0.132;
           0.0223 0.000494 -0.0088 0.000412 -0.003 0.0623;
           -0.00222 0.0721 -0.0233 -0.00147 -0.00651 0.0187;
           0.0087 0.00169 -0.00492 0.000407 -0.00184 -0.045]

f(x) = J * x .+ (randn(10) .* σ)
g1(y) = sum(abs, y) # norm
g2(y) = sum(abs2, y) # other norm?
lo, hi = fill(-5.0, 6), fill(5.0, 6) # bounds if needed
x = rand(size(lo,1),1).*(hi - lo) + lo
@time x_sa, f_sa, junk = samin(g1 ∘ f, x, lo, hi, verbosity=1; rt=0.9, ns=20)

using BayesOpt
config = ConfigParameters()         # calls initialize_parameters_to_default of the C API
set_kernel!(config, "kMaternARD5")  # calls set_kernel of the C API
config.sc_type = SC_MAP
config.n_iterations = 500
config.init_method = 3
config.n_init_samples = 100
config.n_iter_relearn = 1
x_bo, f_bo = bayes_optimization(g1 ∘ f, lo, hi, config)

x_sa
f_sa
x_bo
f_bo

Maybe something from NLopt? It has many algorithms to choose from.

using NLopt
opt = Opt(:GD_STOGO_RAND, 6)
opt.lower_bounds = lo
opt.upper_bounds = hi
opt.maxeval = 2000
function obj(x, grad)
return (g1 ∘ f)(x)
end 
opt.min_objective = obj
opt.xtol_rel = 1e-4

julia> res = optimize(opt, 0.5hi)
(0.009890825412729495, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], :SUCCESS)

Using both samin and the suggested algo in NLopt, samin is finding a better value of the objective, in repeated runs. I’m using the settings from aaowens message. Both are fast, though.

Here’s the output, both algos are solving exactly the same problem:


julia> include("problem.jl")
WARNING: redefining constant σ
WARNING: redefining constant J
************************************************************
SAMIN results
==> Normal convergence <==
total number of objective function evaluations: 65400

     Obj. value:      0.0018255799

       parameter      search width
         0.00264           0.00000 
         0.00208           0.00000 
         0.00124           0.00000 
        -0.00029           0.00000 
        -0.00205           0.00000 
         0.00700           0.00000 
************************************************************
  0.689499 seconds (944.40 k allocations: 72.578 MiB, 3.17% gc time)
optimize(opt, 0.5hi) = (0.00606729899738805, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], :SUCCESS)

It looks like STOGO is just settling on 0 actually. If you’re willing to try 20,000 evaluations, one of the other algorithms (controlled random search) in NLopt seems to do as well as yours. The estimates aren’t stable though. Maybe the objective is too random?

opt = Opt(:GN_CRS2_LM, 6)
opt.lower_bounds = lo
opt.upper_bounds = hi
opt.maxeval = 20000
function obj(x, grad)
return (g1 ∘ f)(x)
end 
opt.min_objective = obj
opt.xtol_rel = 1e-4

julia> res = optimize(opt, 0.5hi)
(0.0014034408112579906, [-0.00734686, -0.00323328, 0.0016716, -2.21004e-5, 0.0011075, 0.00270576], :MAXEVAL_REACHED)

This is a very naive idea, but for algorithms that are expecting a deterministic function, maybe some averaging will help? E.g. you could try to minimize the norm of the following f2 instead of f:

using Statistics
f2(x) = mean(f, (x for _ = 1:5))

(I’m no expert though. Maybe this is just too wasteful of those expensive evals).

Just a note…

Unless I’m missing something, I think you’re pretty much boxed into using something like BayesOpt because your function is stochastic. Methods like BlackBoxOptim, CMA-ES and Sobol Searches are all deterministic objective functions, unless there’s some variations on these I don’t know about.

As far as I can tell, nothing in the documentation that implies these implementations have been adopted for stochastic objectives.

You can implement the Adam algorithm in only a few lines of code, and it often works well for this kind of thing.

1 Like

If the MWE is your actual problem, then the 2-norm minimiser of each sample is given by x = J \ b where b = -(randn(10) .* σ), and with some derivations it is possible to show that the expected 2-norm-squared E(norm(f(x), 2)^2) minimizer is given by x = J \ E(b).

Unfortunately it isn’t :wink: It is a 4k-LOC monster, which is why I made an MWE.

This is of course true, but in practice these methods are also used for (mildly) stochastic objectives.

My understanding is that Adam requires gradients, which I don’t have (and, since it is a discontinuous simulation, I don’t think I can get them in any practical way).

How would you tweak the MWE to reflect this part of the problem?

Pretend we don’t have gradients.

1 Like

Check https://github.com/timholy/QuadDIRECT.jl

Btw, for black-box stuff 200 function evaluations is too little, I think you need 5k-100k at least to get decent results.

You should try to avoid this if you can. Even methods that don’t require gradients often assume “in their hearts” that a gradient exists, and convergence to a local optimum is typically much slower for non-smooth functions even if the method still converges. There are lots of tricks to take problems that are nominally discontinuous and reformulate them in a smooth way, e.g. epigraph formulations.

This is indirect inference using simulated data. Essentially

function objective(parameters, targets)
    data = f([rng], parameters)
    moments = g(data)
    sum(abs2, targets, moments)
end

with frills. f can have branches etc, depending on random draws. AFAIK methods to deal with this are WIP, eg

@article{bruins2017generalized,
  title={Generalized Indirect Inference for Discrete Choice Models},
  author={Bruins, Marianne and Duffy, James A and Keane, Michael P and Smith Jr, Anthony A},
  year={2017}
}

but I am always happy to hear suggestions.

According to :

https://coco.gforge.inria.fr/doku.php?id=higlighted_results_2012_all_noisy_functions

The best algorithm on noisy functions is some variant of CMAES, with xNES being a bit behind. It’s a pretty limited benchmark, but I haven’t found anything better. The CMAES version in Evolutionary is the most basic one, so I don’t know how well it’s supposed to perform here. That said if it converges too fast you can increase the population size (e.g λ = 16, μ = 8).

1 Like

I often have similar problems… calibration of expensive stochastic simulations (seconds to minutes per evaluation) with 5-15 parameters and highly noisy objectives.

Bayesian optimization is the most sample-efficient approach in theory, and it works pretty well for me when the dimensions aren’t too high and I can use reasonable priors for objective smoothness. With some objectives, the heterogeneity of the noise can be a problem. This R package helps in those situations:
https://cran.r-project.org/web/packages/hetGP/index.html

The Cross Entropy Method worked pretty well for me as well when the dimensions get higher and noise lower:

I’ve also started looking at Hyperband/BOHB, but I think that may only be appropriate for situations where there is an expensive training step and a fast objective evaluation.

1 Like