Running TemporalGPs/Stheno sampling

Oh, interesting – I noticed that as well occasionally.

I think it’s to do with the noise being added to the initialisation in this line:

θ_flat_init + randn(length(θ_flat_init)),

Certainly, I’ve found that the problem disappears if you remove the randn term.

I had that in my example so that the algorithm didn’t converge immediately, which made it a bit more interesting to look at, but adding noise to the initialisation isn’t generally a useful thing to do.
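Concretely, that just means starting the optimiser from θ_flat_init itself, e.g.

results = Optim.optimize(
    nlml,
    θ -> gradient(nlml, θ)[1],
    θ_flat_init,    # rather than θ_flat_init + randn(length(θ_flat_init))
    BFGS(),
    Optim.Options(show_trace=true);
    inplace=false,
)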

@willtebbutt For me, deleting the + randn(length(θ_flat_init)) always produces the weird plot:

using Stheno
using Dates
using AbstractGPs
using KernelFunctions
using TemporalGPs
using Optim
using ParameterHandling
using Zygote
using Zygote: gradient
using ParameterHandling: value, flatten
using DataFrames

MAX_PERIODS = 50


# Simulated data: y is 0 for t ≤ 0 and ramps linearly from -1 towards 0 for t > 0, plus unit Gaussian noise
df_periods = let n = 10000; sort(DataFrame(t=Float64.(rand((-MAX_PERIODS:MAX_PERIODS),n)), e=randn(n)), :t) end
df_periods = DataFrames.transform(df_periods, :t => ByRow(t-> t > 0 ? t/50 - 1.0 : 0.0) => :y)
df_periods.y .+= df_periods.e

# Quick scatter plot of the raw data
import CairoMakie
fig = CairoMakie.Figure()
ax = CairoMakie.Axis(fig[1,1])
CairoMakie.scatter!(ax, df_periods.t, df_periods.y, markersize=2, strokewidth=0)
fig



# Split the data at t = 0; each half gets its own GP below
df_before = filter(x->x.t <= 0, df_periods)
df_after = filter(x->x.t > 0, df_periods)




# Map a flat parameter vector back to a named tuple of constrained parameter values
unpack(θ_flat::Vector{Float64}) = value(unflatten(θ_flat))

# Build the GP in its state-space (SDE) form so TemporalGPs can do fast inference
build_model(params) = to_sde(GP(kern(params)), SArrayStorage(Float64))



# Negative log marginal likelihood, summed over the two halves of the data
function nlml(θ_flat)
    θ = unpack(θ_flat)
    f_before = build_model(θ)
    f_after = build_model(θ)
    lp_before = -logpdf(f_before(df_before.t, θ.s_noise), df_before.y)
    lp_after = -logpdf(f_after(df_after.t, θ.s_noise), df_after.y)
    lp_before + lp_after
end


# Initial (constrained) parameter values
θ = (
    s_kernel = positive(.1),
    λ = positive(20.),
    s_noise = positive(3.0),
)
θ_flat_init, unflatten = flatten(θ);



# Scaled Matérn-3/2 kernel; λ is a stretch (inverse length scale)
kern(params) = params.s_kernel^2 * (Matern32Kernel() ∘ ScaleTransform(params.λ))


# Fit the parameters with BFGS, using Zygote for gradients
results = Optim.optimize(
    nlml,
    θ->gradient(nlml, θ)[1],
    θ_flat_init,
    BFGS(),
    Optim.Options(
        show_trace=true,
    );
    inplace=false,
)
θ_bfgs = unpack(results.minimizer);




# Condition each GP on its half of the data and predict on a fine grid
f_bfgs_before = build_model(θ_bfgs);
f_bfgs_after = build_model(θ_bfgs);
f_posterior_bfgs_before = posterior(f_bfgs_before(df_before.t, θ_bfgs.s_noise), df_before.y);
f_posterior_bfgs_after = posterior(f_bfgs_after(df_after.t, θ_bfgs.s_noise), df_after.y);
INCREMENT = 0.01
xpred_before = -MAX_PERIODS:INCREMENT:0.0 |> collect
xpred_after = 1.0:INCREMENT:MAX_PERIODS |> collect
f_posterior_bfgs_pred_before = f_posterior_bfgs_before(xpred_before)
f_posterior_bfgs_pred_after = f_posterior_bfgs_after(xpred_after)
ms_bfgs_pred_before = marginals(f_posterior_bfgs_pred_before);



# Plot the posterior over each half
using Plots
plotly()
p = Plots.plot()
Plots.plot!(p, f_posterior_bfgs_pred_before)
Plots.plot!(p, f_posterior_bfgs_pred_after)
p

Right. So the reason for this is that the objective surface is non-convex, and the optimiser is getting stuck in a local optimum that explains all of the data as observation noise.

If you initialise with something like

θ = (
    s_kernel = positive(1.0),
    λ = positive(1.0),
    s_noise = positive(1.0),
)

you should be fine. The initialisation used above made the stretch (inverse length scale) far too large, which meant that the model started from a kernel producing very quickly varying samples, and the optimiser didn’t move far from there. If you intended a length scale of roughly 20, initialising λ to positive(1 / 20) would do that.
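As a sketch, with λ initialised for a length scale of about 20 (reusing nlml, flatten, and the optimiser call from the script above):

θ = (
    s_kernel = positive(1.0),
    λ = positive(1 / 20),   # or positive(1.0); the key is not to start with a huge stretch
    s_noise = positive(1.0),
)
θ_flat_init, unflatten = flatten(θ)

results = Optim.optimize(
    nlml,
    θ -> gradient(nlml, θ)[1],
    θ_flat_init,
    BFGS(),
    Optim.Options(show_trace=true);
    inplace=false,
)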

P.S. there’s also a type instability in your data-generating mechanism for y – changing the 0 to 0.0 resolves it!
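That is, the difference is just the literal in the else branch of the y transform:

# Type-unstable: the branches return an Int and a Float64
t -> t > 0 ? t/50 - 1.0 : 0
# Type-stable: both branches return a Float64
t -> t > 0 ? t/50 - 1.0 : 0.0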


If this isn’t exactly your problem, you might also consider using the expected Fisher matrix for a sort of quasi-second-order optimisation instead of BFGS, along the lines of Fisher scoring. I’ve spent a lot of time fitting Matérn covariance functions and have definitely found investing in the computation of second-order information to be worth it.
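For reference, the Fisher scoring update this suggestion points at replaces the (approximate) Hessian in a Newton step with the expected information,

\theta^{(t+1)} = \theta^{(t)} + \mathcal{I}(\theta^{(t)})^{-1} \nabla_{\theta} \ell(\theta^{(t)}),

where \ell is the log marginal likelihood (equivalently, subtract \mathcal{I}^{-1} times the gradient of the nlml used above).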

Oh interesting – I can certainly believe that that’s a valuable thing to do. Could I ask how you generally go about approximating the Fisher matrix in this case? (I would imagine there’s not a nice closed-form expression, because the parameters of the covariance function aren’t e.g. the natural parameters of an exponential family.)

For covariance parameters of a Gaussian process, it actually has a nice closed form—

\mathcal{I}_{j,k} = \frac{1}{2} \text{tr}\left( \Sigma^{-1} \left[ \partial_{\theta_j} \Sigma \right] \Sigma^{-1} \left[ \partial_{\theta_k} \Sigma \right] \right),

where I’m leaving out the dependence of each \Sigma on all the parameters \mathbf{\theta}. But it occurs to me that maybe you aren’t ever actually forming the covariance matrix \Sigma, and are instead using a Kalman filter or something to evaluate the likelihood. After an admittedly quick look at the source code, I see you have methods for build_Σs, but I’m totally unfamiliar with Zygote, so maybe this isn’t what I think it is. If you are assembling \Sigma, then seeing as you seem to get at worst quasilinear scaling, I would guess you’re either using the sparse precision induced by the Markov assumption or something similar to get fast matrix operations, in which case the above is probably computable in the same complexity. If it isn’t, there are some shockingly effective Hutchinson-type trace estimators for \mathcal{I} that I can expand on if you’re interested.
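In case it’s useful, here’s a rough sketch of that formula for the dense-\Sigma case; Σ_of_θ is a hypothetical function returning the covariance matrix for a parameter vector, and the derivatives are done by finite differences just to keep the sketch self-contained:

using LinearAlgebra

# Expected Fisher information I[j, k] = ½ tr(Σ⁻¹ ∂ⱼΣ Σ⁻¹ ∂ₖΣ) for a dense Σ(θ)
function expected_fisher(Σ_of_θ, θ; h=1e-6)
    p = length(θ)
    # ∂Σ/∂θⱼ via central finite differences (AD would work equally well here)
    dΣ = map(1:p) do j
        e = zeros(p); e[j] = h
        (Σ_of_θ(θ .+ e) .- Σ_of_θ(θ .- e)) ./ (2h)
    end
    F = cholesky(Symmetric(Σ_of_θ(θ)))
    [0.5 * tr((F \ dΣ[j]) * (F \ dΣ[k])) for j in 1:p, k in 1:p]
end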

If you’re not assembling \Sigma, it would probably be easier to use the observed information, which I would think Zygote.hessian should provide. Maybe that’s too slow to be worth it in most cases, though, so I could be wasting your time here.
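A minimal sketch of that option, reusing nlml and results from the script above (whether it’s fast enough through TemporalGPs is a separate question):

# Observed information: Hessian of the negative log marginal likelihood at the optimum
H_observed = Zygote.hessian(nlml, results.minimizer)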
