Oh, interesting – I've noticed that occasionally as well.
I think it's to do with the noise being added to the initialisation in this line:
θ_flat_init + randn(length(θ_flat_init)),
Certainly, I’ve found that the problem disappears if you remove the randn term.
I added that noise to my example so that the algorithm didn't converge immediately and the example was a bit more interesting, but adding noise to the initialisation isn't generally a useful thing to do.
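Concretely, the fix being described is just to drop the perturbation (a minimal sketch, using θ_flat_init from the example above):

```julia
# Perturbed initialisation from the example, which can land the optimiser in a bad basin:
θ0_noisy = θ_flat_init + randn(length(θ_flat_init))

# Unperturbed initialisation, with which the problem described above goes away:
θ0 = θ_flat_init
```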
Right. So the reason for this is that the objective surface is non-convex, and the optimiser is getting stuck in a local optimum in which the model explains all of the data as noise.
If you remove the noise from the initialisation, you should be fine. The initialisation used above made the stretch (inverse-lengthscale) far too large, which meant that the model was initialised to a kernel that produces very quickly varying samples, so the optimiser didn't move far from there. Perhaps you intended to obtain a lengthscale of roughly 20, in which case initialising lambda to positive(1 / 20) would be what you wanted?
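For what it's worth, a minimal sketch of that kind of initialisation (assuming positive here is ParameterHandling.jl's constraint, as it appears to be in the example above; the parameter names are hypothetical):

```julia
using ParameterHandling

# Target a lengthscale of roughly 20 by initialising the stretch
# (inverse-lengthscale) to 1/20 rather than to a large value.
θ_init = (
    λ = positive(1 / 20),   # stretch ≈ 0.05, i.e. lengthscale ≈ 20
    σ² = positive(1.0),     # hypothetical kernel variance parameter
)
```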
p.s. there’s also a type instability in your data generating mechanism for y – changing the 0 to 0.0 resolves it!
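For anyone else running into this, here is a purely illustrative example of the kind of type instability meant here (a hypothetical data generator, not the original code):

```julia
# Starting the accumulator at the integer 0 means its type changes from Int to
# Float64 on the first iteration, which Julia's compiler treats as a type instability.
function simulate_unstable(n)
    y = 0                # Int
    for _ in 1:n
        y += randn()     # y is now a Float64, so its type is not stable
    end
    return y
end

# Writing 0.0 keeps y a Float64 throughout, so the function is type-stable.
function simulate_stable(n)
    y = 0.0
    for _ in 1:n
        y += randn()
    end
    return y
end
```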
If this isn't exactly your problem, you might also consider using the expected Fisher matrix for a sort of quasi-second-order optimization instead of BFGS, in the spirit of Fisher scoring. I've spent a lot of time fitting Matérn covariance functions and have definitely found investing in the computation of second-order information to be worth it.
Oh interesting – I can certainly believe that that's a valuable thing to do. Could I ask how you generally go about approximating the Fisher matrix in this case? (I would imagine there's not a nice closed-form expression, because the parameters of the covariance function aren't e.g. the natural parameters of an exponential family.)
There is actually a reasonably clean form in the Gaussian case: for a covariance \Sigma(\mathbf{\theta}) (and a mean that doesn't depend on \mathbf{\theta}), the expected Fisher matrix has entries

\mathcal{I}_{jk} = \frac{1}{2} \operatorname{tr}\!\left( \Sigma^{-1} \frac{\partial \Sigma}{\partial \theta_j} \, \Sigma^{-1} \frac{\partial \Sigma}{\partial \theta_k} \right),

where I'm leaving out the dependence of each \Sigma on all the parameters \mathbf{\theta}.

But it occurs to me, maybe you aren't ever actually forming the covariance matrix \Sigma and are using a Kalman filter or something to evaluate the likelihood. After an admittedly quick look at the source code, I see you have methods for build_Σs, but I'm totally unfamiliar with Zygote and so maybe this isn't what I think it is.

If you are assembling \Sigma, then seeing as you seem to get at worst quasilinear scaling, I would guess you're using the sparse precision induced by the Markov assumption or something similar to get fast matrix operations, in which case the above is probably computable in the same complexity. If it isn't, there are some shockingly effective Hutchinson-type trace estimators for \mathcal{I} that I can expand on if you're interested.
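To make the trace expression concrete, here is a rough sketch of computing \mathcal{I} when a dense \Sigma is assembled (build_Σ and ∂Σ are hypothetical stand-ins for whatever constructs the covariance and its parameter derivatives, not the package's actual build_Σs):

```julia
using LinearAlgebra

# Expected Fisher matrix for a zero-mean Gaussian likelihood with dense Σ(θ):
#   I[j, k] = 0.5 * tr(Σ⁻¹ ∂Σ/∂θⱼ Σ⁻¹ ∂Σ/∂θₖ)
function expected_fisher(build_Σ, ∂Σ, θ)
    Σ = build_Σ(θ)                      # n×n covariance at θ
    C = cholesky(Symmetric(Σ))
    d = length(θ)
    A = [C \ ∂Σ(θ, j) for j in 1:d]     # A[j] = Σ⁻¹ ∂Σ/∂θⱼ
    Iθ = Matrix{Float64}(undef, d, d)
    for j in 1:d, k in j:d
        Iθ[j, k] = 0.5 * tr(A[j] * A[k])
        Iθ[k, j] = Iθ[j, k]
    end
    return Iθ
end

# A Fisher-scoring step then uses Iθ in place of the Hessian of the negative
# log-likelihood: θ_next = θ - Iθ \ ∇nlml(θ).
```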
If you’re not assembling \Sigma, it would probably be easier to use the observed information, which I would think Zygote.hessian should provide. Maybe that’s too slow to be worth it in most cases, though, so I could be wasting your time here.
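And a correspondingly tiny sketch of the observed-information route (nlml is a hypothetical negative log marginal likelihood closure over the data, taking the flat parameter vector):

```julia
using Zygote

# Observed information: the Hessian of the negative log marginal likelihood at
# the fitted parameters. This can be slow when there are many parameters.
observed_information(nlml, θ_hat) = Zygote.hessian(nlml, θ_hat)
```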