Optimizing a noisy objective

An update: I have programmed a version of SPSA and indeed it is working fine (after some tuning). Thanks, @dlakelan! Of course I will make it public soon, after including some more tests.

I am wondering about the following: suppose that for a given parameter \theta, I am simulating N individuals to obtain some average statistic

\bar{h}(\theta) = \frac{1}{N} \sum_{i = 1}^{N} g_i(\theta)

where g_i(\theta) is the statistic from simulated individual i, and my objective is some discrepancy \bar{h}(\theta) - h_0.

Intuitively, it would make sense to use a low N in the initial phase for finding the right region, and a large N in the later phase, because at the beginning I care more about convergence than accuracy, and I can refine later. E.g., 10k vs 100k seems to work well. Is there any systematic, theoretically supported way of doing this, e.g. an N(k) schedule as a function of the iteration k?
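Something like the following toy sketch is what I have in mind (the simulator, the geometric growth rule, and all names here are made up, purely for illustration):

```julia
using Random, Statistics

# Placeholder noisy simulator: stands in for one simulated individual's statistic g_i(θ).
simulate_statistic(θ, rng) = θ^2 + randn(rng)

# \bar{h}(θ): average of N simulated statistics.
hbar(θ, N; rng = Random.default_rng()) =
    mean(simulate_statistic(θ, rng) for _ in 1:N)

# Grow N geometrically with the iteration count k, capped at N_max.
sample_size(k; N0 = 10_000, N_max = 100_000, growth = 1.2) =
    min(N_max, round(Int, N0 * growth^(k - 1)))

# Objective evaluated at iteration k: squared discrepancy from the target h0.
objective(θ, k; h0 = 1.0) = abs2(hbar(θ, sample_size(k)) - h0)
```

The question is whether there is a principled way to choose the schedule, rather than ad hoc constants like the ones above.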

2 Likes

Glad you got it working, it's a very clever algorithm. I don't know about the idea of N(k); it seems this would be very problem specific. For example, suppose your simulation is just this side of having infinite variance… then N will need to be very large. On the other hand, if your simulation is well behaved, it could be much smaller.

The Central Limit Theorem should give some idea of how large N needs to be in order to bound the error in \bar{h}(\theta) at a given level, especially if the variance of g_i(\theta) does not change much with \theta.
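A minimal sketch of that heuristic (the pilot size, tolerance, and names are all illustrative):

```julia
using Statistics

# Estimate Var(g_i(θ)) from a small pilot sample, then choose N so that the
# approximate CLT standard error  z * σ / sqrt(N)  is below a tolerance ε.
function clt_sample_size(simulate, θ; pilot = 1_000, ε = 0.01, z = 1.96)
    draws = [simulate(θ) for _ in 1:pilot]
    σ̂ = std(draws)
    # z * σ̂ / sqrt(N) ≤ ε   ⇔   N ≥ (z * σ̂ / ε)^2
    max(pilot, ceil(Int, (z * σ̂ / ε)^2))
end
```

You could re-run this occasionally as \theta moves, since the variance may change across the parameter space.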

1 Like

Are you getting better results with your SPSA than nlopt's CRS2?

I've been working on some higher-dimensional problems where CRS2 currently has the best convergence, so I'm curious to try your SPSA when it's ready.

It appears that https://github.com/robertfeldt/BlackBoxOptim.jl has an SPSA implementation.

4 Likes

Are there any new developments on this type of problem?

I am currently using @Tamas_Papp's MultistartOptimization.jl, which works like a charm, but I am always interested in trying new things.

Personally, 3 years after asking the original question, my lesson in optimizing noisy objectives is to avoid it at all costs.

It is so inefficient in larger dimensions (compared to alternative methods, especially anything that uses at least first derivatives) that even spending a month or two reformulating the problem with all the tricks I can think of is usually worth it. Coupled with multiple local modes, it is usually a nightmare for anything above 10–20 dimensions (depending on how nasty the multimodality is).

6 Likes

I have a similar problem in economics, where some model-implied moments need to be simulated to compute the likelihood.

Did you end up using AD on the simulator?

As a side note, Ken Judd recommended the POUNDERS algorithm for derivative-free optimization of noisy and non-smooth objective functions when the objective is a nonlinear least-squares problem. I haven't seen a Julia implementation, though. I'm not sure how different this is from DFO-LS.jl, mentioned above.

1 Like

Yes. I now try to code all simulations so that they are AD'able from the very beginning, and test for this in CI so as not to break it inadvertently, even if I don't use it at the moment.

How to do this is very model-specific. Usually it requires rethinking the moments you are targeting, from the very beginning. My standard bag of tricks includes mapping a [0,1] number to durations for models with (competing) Poisson processes (pretty much all labor market models), and, for multiple discrete outcomes, using the cumulative probability for each (which is differentiable).
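Stripped of all model detail, the flavor is something like this toy sketch (the functional forms here, e.g. the exponential hazard and the logit probabilities, are made up purely for illustration):

```julia
using ForwardDiff

# Trick 1: fix a uniform draw u ∈ (0,1) and map it to a duration via the
# inverse CDF. For an exponential hazard with rate λ(θ), this is
# T(θ) = -log(u) / λ(θ), which is smooth in θ for a fixed u.
hazard(θ) = exp(θ)                       # made-up parametrization of λ(θ)
duration(θ, u) = -log(u) / hazard(θ)

# Trick 2: for a discrete outcome, target moments through the smooth choice
# probabilities (or their cumulative sums) instead of the sampled outcome.
function expected_outcome(θ)
    w = exp.(θ .* (1:3))                 # made-up logit weights for 3 outcomes
    p = w ./ sum(w)
    sum(p .* (1:3))                      # expected outcome, differentiable in θ
end

# Both pieces go through ForwardDiff without any special handling.
ForwardDiff.derivative(θ -> duration(θ, 0.3), 0.5)
ForwardDiff.derivative(expected_outcome, 0.5)
```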

I could discuss these in a series of blog posts if there is interest.

3 Likes

We have a new NeurIPS paper and AD package coming out at the end of the week which handles this case :sweat_smile:

2 Likes

All of these tricks are known and have been around for a long time; I claim no originality. A standard reference is, e.g.,

@book{rubinstein1993discrete,
  title={Discrete event systems: sensitivity analysis and stochastic optimization by the score function method},
  author={Rubinstein, Reuven Y and Shapiro, Alexander},
  volume={13},
  year={1993},
  publisher={Wiley}
}

They also have a nice book about Monte Carlo.
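For context, the score-function (likelihood-ratio) identity these methods build on is

\nabla_\theta \, \mathbb{E}_{X \sim p_\theta}[g(X)] = \mathbb{E}_{X \sim p_\theta}\left[ g(X) \, \nabla_\theta \log p_\theta(X) \right],

which can be estimated by plain Monte Carlo, without differentiating through the simulator itself.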

The score function method is biased and has high variance; our new approach fixes this and makes it automatic.

Interesting. Can you please add a reference to the paper you mention?

Not until it's released later this week. I set a reminder to respond here. I think it goes up Wednesday?

I would be interested. And I think future economists who decide to adopt Julia would find your insights helpful!

The paper is out now, along with a Twitter thread summary.

The package is:

7 Likes

I just skimmed through it; this should enable cool stuff like Hamiltonian Monte Carlo with discrete parameters without having to rely on marginalization, right?

:+1: yeah, there's a lot of fun follow-ups like that. Tons and tons and tons of student projects haha.

1 Like

Cool abstract! I will try to read the paper later today.

I will say that at one point I played with the SPSA method for unbiased numerical differentiation, and it was able to drive an HMC-type calculation, but it only stayed stable with extremely small timesteps in my test case. Still, it's evidence that this kind of thing is workable!
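For reference, the gradient estimate I mean is the standard two-evaluation SPSA form; a toy sketch (the constants and names are illustrative):

```julia
using Random

# Two-sided simultaneous-perturbation estimate of ∇f(θ) from a noisy f:
# each coordinate is (f(θ + cΔ) - f(θ - cΔ)) / (2c Δᵢ), with Δ a random ±1 vector.
function spsa_gradient(f, θ; c = 1e-2, rng = Random.default_rng())
    Δ = rand(rng, [-1.0, 1.0], length(θ))
    (f(θ .+ c .* Δ) - f(θ .- c .* Δ)) ./ (2c .* Δ)
end
```

Only two evaluations of the noisy objective per gradient estimate, regardless of dimension, which is what makes it attractive for this kind of experiment.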

1 Like

Very cool! As someone who works with Markov-switching models I am very interested in trying this out.

Can the package handle mutations?