I don’t know if any Julia package implements it, but SPSA optimization is specifically made for this kind of thing. It gives you pseudo-gradients in O(1) function evaluations (two per iteration, regardless of dimension).
Easy to code up yourself if needed.
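For reference, here is a minimal sketch of basic first-order SPSA (the gain constants follow the usual textbook recommendations and are illustrative, not tuned):

```julia
# Minimal SPSA sketch: two function evaluations per gradient estimate,
# regardless of dimension. Gain constants (a, c, A, α, γ) are illustrative.
function spsa(f, θ0; iters = 1000, a = 0.1, c = 0.1, A = iters ÷ 10,
              α = 0.602, γ = 0.101)
    θ = float(copy(θ0))
    for k in 1:iters
        ak = a / (k + A)^α                 # step-size sequence
        ck = c / k^γ                       # perturbation-size sequence
        Δ = rand([-1.0, 1.0], length(θ))   # Rademacher perturbation
        # pseudo-gradient from just two evaluations
        ghat = (f(θ .+ ck .* Δ) - f(θ .- ck .* Δ)) ./ (2 .* ck .* Δ)
        θ .-= ak .* ghat
    end
    return θ
end

# e.g. spsa(θ -> sum(abs2, θ .- 3), zeros(5); iters = 5000)
```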
My version of CMAES gets down to 1e-6 on the abs2 case, but it takes about 4000 function evaluations to get there.
I did not check out your specific CMAES implementation, but in my experience it is usually rather inefficient.
Currently the most successful solution is NLopt’s :GN_CRS2_LM, suggested by @aaowens above. Simulated annealing works fine on the MWE but stalls on the real problems (I had to kill it after an hour; CRS2 gets me to a reasonable optimum in a minute).
I am now thinking about making this problem differentiable, but it will be tricky. CRS2 consistently finds optima in about 500-1000 iterations, which is fine for me at the moment.
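In case it is useful to others, a minimal sketch of calling CRS2 through NLopt.jl (the dimension, bounds, and loss below are placeholders, not my actual problem):

```julia
using NLopt

loss(θ) = sum(abs2, θ .- 1)               # placeholder for the simulation objective

opt = Opt(:GN_CRS2_LM, 3)                 # 3-dimensional problem
opt.lower_bounds = fill(-10.0, 3)         # NLopt's global methods need box bounds
opt.upper_bounds = fill(10.0, 3)
opt.maxeval = 1000
opt.min_objective = (θ, grad) -> loss(θ)  # grad is ignored by derivative-free methods
minf, minθ, ret = optimize(opt, zeros(3))
```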
Regarding the objective function: normalizing it with the covariance (like GMM) or using percentage difference between model and data may help.
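For instance, a minimal sketch of both variants (m, W, model, and data are placeholders):

```julia
using LinearAlgebra

# GMM-style quadratic form: m is the vector of model-minus-data moments,
# W a (hypothetical) estimate of the inverse covariance of the data moments,
# so each moment contributes in comparable units.
gmm_objective(m, W) = dot(m, W * m)

# Percentage-difference alternative: dimensionless relative deviations.
pct_objective(model, data) = sum(abs2, (model .- data) ./ data)
```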
I don’t have much experience with discrete choice models but I wonder why fixing the seed of the random number generator doesn’t kill the stochasticity.
We wrapped a Python-based algorithm for gradient-free nonlinear least squares a few months ago; see https://github.com/QuantEcon/DFOLS.jl
It turned out not to be useful for our problem, but if you think it matches your problem, give it a shot.
The behavior of SA is very much dependent on the rt control. You might try setting it to rt=0.5 if you’re inclined to experiment more. SA has the advantage that it is very robust, if rt is set high enough to avoid focusing on a local min. But what is “high enough” is problem-specific. Too high, and it’s slow; too low, and you may skip over the global min.
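Assuming the SA in question is the SAMIN implementation in Optim.jl (which exposes rt as a keyword), that would look something like this (loss, bounds, and start value are placeholders):

```julia
using Optim

loss(θ) = sum(abs2, θ)                       # placeholder objective
lb, ub, θ0 = fill(-10.0, 3), fill(10.0, 3), fill(5.0, 3)

# rt is the temperature reduction factor: closer to 1 cools slowly (robust but
# slow), smaller values cool fast and risk locking onto a local min.
res = optimize(loss, lb, ub, θ0, SAMIN(rt = 0.5), Optim.Options(iterations = 10^6))
```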
Given that this is an indirect inference problem, I don’t understand from the code where the number of simulation replications comes in. With more replications, the objective function becomes smoother and easier to optimize. Also, as you know, fixing the random seed so that the simulations use common random numbers across different objective function evaluations is needed to get convergence of parameters. I don’t see that in the MWE.
As I mentioned multiple times above, there are discrete outcomes (and branches) in the simulation. Common random numbers or seeding the RNG does not help with continuity at all, unfortunately.
If necessary, I can make an MWE that makes this feature prominent, but it is equally useful just to pretend that we can’t use CRN.
Sure, but essentially we are trading runtime for $O(1/\sqrt{n})$ noise. The question is, e.g., whether it is better to run 16 replications for 1/4 of the noise, or to just use an algorithm that can make use of 16× as many samples as is. CRS2 seems to be very good at this.
Common random numbers won’t help with continuity, but they will keep the global minimum at the same point in the parameter space across different trial parameter values. With random draws that change at every function evaluation, the global minimum moves at every iteration. This is the phenomenon known as “chatter”, and Daniel McFadden’s papers discussed the need to avoid it.
Smoothing the objective function is a possibility. I think that’s a secondary issue, though.
Increasing the number of simulations lowers the asymptotic variance of the parameter estimates. You can use antithetic random draws, and other tricks. This is unrelated to the number of iterations that the minimization algorithm uses. Being able to try many different parameter values on an objective function that is less informative than it could be will not lead to the best possible parameter estimates. Making the objective smoother (more simulations for each parameter value) and stable (common random numbers across trial parameter values) should help, I would think.
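Concretely, common random numbers amounts to something like the following sketch, where simulate_and_compare is a hypothetical stand-in for running the simulations and computing the distance to the data moments:

```julia
using Random

# Common random numbers: re-seed on every evaluation, so different trial values
# of θ are compared against identical draws and the minimizer doesn't chatter.
function objective_crn(θ; seed = 12345)
    rng = MersenneTwister(seed)
    simulate_and_compare(θ, rng)   # hypothetical: simulate, return distance to data
end
```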
I don’t think CRN are appropriate here. The goal is to optimize the expected value of the noisy objective function, not to find the minimum value out of all the evaluations (since the location of that point will be heavily influenced by the idiosyncratic noise), so it shouldn’t be a problem if the global minimum moves around as a result of the noise.
As the noise gets higher, I think it actually becomes increasingly important to explicitly model the noise with samples using the same parameter values (but different RNG seeds). Otherwise, how would one distinguish between variation in the objective caused by idiosyncratic noise vs variation caused by change in parameter values (what one is actually trying to optimize over)?
A couple references on the value of using replications with Bayesian Optimization:
If using CRN resulted in noise that was highly correlated between nearby points in parameter space, I could see some value in that approach, but that’s not the case, at least for my simulations, which also have discrete actions and a high amount of branching.
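To make that concrete, a sketch of the replication approach: estimate the expected objective at a given θ by averaging over different seeds (simulate_and_compare is again a hypothetical stand-in):

```julia
using Random, Statistics

# Average R independent replications at the same θ: variation across seeds is
# noise, variation across θ is signal, and averaging shrinks the noise ~ 1/√R.
function objective_avg(θ; R = 16)
    mean(simulate_and_compare(θ, MersenneTwister(r)) for r in 1:R)
end
```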
I agree with you, and Bayesian optimization was the first thing I started with. But in practice it is very slow, and makes no progress most of the time once somewhat near the optimum. Maybe I am tuning it wrong.
That said, I don’t think my noise is that large.
Did you try the SPSA method I linked above? It works very well for this exact kind of problem.
Not yet, it is on my TODO list. Is there a recommended Julia implementation?
Not that I know of, but if you point me to a how-to on registering a package I’d be happy to create one; it’s like 20-50 lines of code and is very useful.
Thanks, I want to understand it so I think I will just implement it.
Just some ideas:
Note that we are building out Surrogates.jl (https://github.com/SciML/Surrogates.jl) for this as part of JSoC.
Just want to echo @mcreel’s points: if you use different random number draws for each function evaluation, whatever optimizer you pick is going to have a really hard time convincing itself that it has found even a local optimum. Further, regardless of whether the function that generates outcomes at the “true” parameter value is smooth in the realizations of the random draws, what matters for asymptotics is whether its expectation is smooth, which it probably (???) is.
I experimented with this (setting the random seed before each simulation), and it makes no difference.
The simulation has about 10⁶ branches, each affecting the number of random draws needed, so in practice the draw sequence is effectively independent even for a small difference in inputs.
I really thought that this was the textbook use case for Bayesian optimization, and was surprised to see that CRS2 dominates it.
Interesting. I think if you were to stick to the textbook McFadden approach here, you’d really need to pre-draw every draw you could possibly need, for every possible path of branching. This would guarantee that if two different parameter values result in the same branching sequence for some part of your data, the contribution of that part of your data to the objective function would be the same. Maybe your code is written in such a way that fixing the seed does this, but if different parameter values result in the same branches happening at different points in time during the simulation, maybe not…
Random thought: would a splittable PRNG help here? For example, it looks like SplitMix might allow you to have the same random numbers on a given branch of the simulation.
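Even without a true splittable PRNG, something in that spirit can be sketched by deriving a deterministic stream per branch (branch_rng and branch_id are hypothetical names):

```julia
using Random

# Derive a reproducible RNG stream per branch from a base seed and a branch
# identifier, so identical branches see identical draws even when they are
# reached in a different order or after a different number of prior draws.
branch_rng(base_seed::Integer, branch_id) = MersenneTwister(hash(branch_id, UInt(base_seed)))
```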