An update: I have programmed a version of SPSA and indeed it is working fine (after some tuning). Thanks, @dlakelan! Of course I will make it public soon, after including some more tests.
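For context, here is a minimal sketch of the standard SPSA iteration (Spall 1992) for a noisy scalar objective `f`; it is only an illustration, not the implementation mentioned above, and the gain constants `a`, `c`, `A` are exactly the kind of knobs that need tuning.

```julia
# Minimal sketch of the standard SPSA iteration (Spall 1992) for a noisy
# scalar objective f; not the implementation referred to above.
function spsa(f, θ0; iters = 1_000, a = 0.1, c = 0.1, A = 100, α = 0.602, γ = 0.101)
    θ = float.(θ0)                          # work on a float copy of the start point
    for k in 1:iters
        ak = a / (k + A)^α                  # step-size gain sequence
        ck = c / k^γ                        # perturbation-size gain sequence
        Δ  = rand([-1.0, 1.0], length(θ))   # simultaneous ±1 perturbation
        ĝ  = (f(θ .+ ck .* Δ) - f(θ .- ck .* Δ)) ./ (2ck .* Δ)  # two-evaluation gradient estimate
        θ .-= ak .* ĝ                       # gradient-descent style step
    end
    return θ
end
```

Usage would be e.g. `spsa(θ -> noisy_loss(θ), zeros(5))`, with `noisy_loss` a placeholder for the simulated objective.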
I am wondering about the following: suppose that for a given parameter \theta, I am simulating N individuals to obtain some average statistic \bar{h}(\theta), i.e. the mean of N noisy evaluations g(\theta).
Intuitively, it would make sense to use a low N for the initial phase of finding the right region, and a large N for the later phase, because in the beginning I care more about convergence than accuracy, and I can refine later. E.g. 10k vs 100k seems to work well. Is there any systematic, theoretically supported way of doing this, e.g. an N(k) schedule?
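Purely as an illustration of the kind of N(k) scheme I have in mind (arbitrary constants, no theoretical backing), something like a geometric ramp from 10k toward 100k:

```julia
# Illustration only: grow the number of simulated individuals geometrically
# from N0 toward Nmax as the iterations proceed; the constants are arbitrary.
N_schedule(k; N0 = 10_000, Nmax = 100_000, r = 1.05) =
    min(Nmax, round(Int, N0 * r^(k - 1)))
```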
Glad you got it working, it's a very clever algorithm. I don't know about the idea of N(k); it seems this would be very problem-specific. For example, suppose your simulation is just this side of having infinite variance… N will need to be very large. On the other hand, if your simulation is well behaved it could be much smaller.
The Central Limit Theorem should give some idea of what N should be in order to bound the error in \bar{h}(\theta) at a certain level, especially if the variance of g(\theta) does not change much with \theta.
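For example, under the CLT approximation se(\bar{h}) ≈ σ/√N, a pilot run can be used to size N for a target standard error. A rough sketch, where `pilot_draws` is a hypothetical vector of g(\theta) evaluations at a representative \theta:

```julia
using Statistics

# CLT-based sizing: se(h̄) ≈ σ/√N  ⇒  N ≈ (σ / target_se)^2,
# with σ estimated from a pilot sample of g(θ) draws.
# (Scale σ by a normal quantile, e.g. 1.96, to target a confidence bound instead.)
required_N(pilot_draws, target_se) = ceil(Int, (std(pilot_draws) / target_se)^2)
```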
Are you getting better results with your SPSA than nlopt's CRS2?
I've been working on some higher-dimensional problems where CRS2 currently has the best convergence, so I'm curious to try your SPSA when it's ready.
Personally, 3 years after asking the original question, my lesson in optimizing noisy objectives is to avoid it at all costs.
It is so inefficient in larger dimensions (compared to alternative methods, especially anything that uses at least first derivatives) that even spending a month or two reformulating a problem with all the tricks I can think of is usually worth it. Coupled with multiple local modes, it is usually a nightmare for anything above 10–20 dimensions (depending on how nasty the multimodality is).
I have a similar problem in economics, where some model-implied moments need to be simulated to compute the likelihood.
Did you end up using AD on the simulator?
As a side note, Ken Judd recommended the POUNDERS algorithm for derivative-free optimization of noisy and non-smooth objective functions when the objective is a non-linear least-squares problem. I haven't seen a Julia implementation though. I'm not sure how different this is from DFO-LS.jl, mentioned above.
Yes. I now try to code all simulations so that they are AD-able from the very beginning, and test for this in CI so as not to break it inadvertently even if I don't use it at the moment.
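A minimal sketch of such a CI check, assuming ForwardDiff as the AD backend and a hypothetical `simulate_moments(θ)` entry point:

```julia
using Test, ForwardDiff

# Guard against accidentally breaking AD-ability of the simulator.
# `simulate_moments` is a hypothetical entry point returning a vector of moments.
@testset "simulator is AD-able" begin
    θ0 = [0.5, 1.2]
    g  = ForwardDiff.gradient(θ -> sum(simulate_moments(θ)), θ0)
    @test all(isfinite, g)
end
```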
Making a simulation AD-able is very model-specific. Usually it requires rethinking the moments you are targeting, from the very beginning. My standard bag of tricks includes mapping a [0,1] number to durations for models with (competing) Poisson processes [pretty much all labor market models], and, for multiple discrete outcomes, using the cumulative probability of each (which is differentiable); see the sketch below.
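A hedged sketch of what these two tricks can look like, with an exponential duration (rate λ) standing in for the Poisson-process case; the names and details here are illustrative, not necessarily how the above is implemented:

```julia
# Trick 1: map a fixed uniform draw u ∈ [0,1] to a duration via the inverse CDF
# of an exponential with rate λ; with u held fixed across parameter values,
# the duration is smooth (AD-able) in λ.
duration(u, λ) = -log1p(-u) / λ

# Trick 2: instead of the non-differentiable indicator of which discrete outcome
# occurred, target the cumulative probabilities, which are smooth in the parameters.
cumulative_probs(p) = cumsum(p)
```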
I could discuss these in a series of blog posts if there is interest.
All of these tricks are known and have been around for a long time; I claim no originality. A standard reference is, e.g.,
@book{rubinstein1993discrete,
title={Discrete event systems: sensitivity analysis and stochastic optimization by the score function method},
author={Rubinstein, Reuven Y and Shapiro, Alexander},
volume={13},
year={1993},
publisher={Wiley}
}
Just skimmed through it; this should enable cool stuff like Hamiltonian Monte Carlo with discrete parameters without having to rely on marginalization, right?
Cool abstract, I will try to read the paper later today.
I will say that at one point I played with the SPSA method for unbiased numerical differentiation, and it was able to drive an HMC-type calculation, but it only stayed stable with extremely small timesteps in my test case. Still, it's evidence that this kind of thing is workable!