Performance issues with Turing and NUTS

Hello,
I am new to Turing and am trying to do MCMC sampling from a simple linear forward model with a Gaussian prior.
I have the following model:

using Turing

A = randn(20, 1000)                 # linear forward operator: 20 observations, 1000 unknowns
m = size(A, 1)
n = size(A, 2)

x_true = randn(n)                   # true emission field
σ_prior = 0.1
x = x_true + σ_prior * randn(n)     # prior mean: a perturbed version of the truth
σ_obs = 0.001
y = A * x_true + σ_obs * randn(m)   # synthetic noisy observations

@model function gaussian_model(A::Matrix{T}, σ_noise::T, prior_emissions::Vector{T},
                               σ_prior::T, obs::Vector{T}, ::Type{T2}=Float64) where {T, T2}
    ## unknown emission field
    # σ_prior ~ truncated(Normal(0, 1), 0, Inf)
    X ~ MvNormal(prior_emissions, σ_prior)

    ## likelihood
    y = A * T2.(X)
    obs ~ MvNormal(y, σ_noise)
end

model = gaussian_model(A, σ_obs, x, σ_prior, y)

I am trying to use the NUTS sampler as follows:

using ReverseDiff   # the AutoReverseDiff() backend needs ReverseDiff loaded

chains = sample(model, NUTS(0.65), 20, progress=true, adtype=AutoReverseDiff())

I end up with a very small step size, and the NUTS sampler is fast for the first ~10 samples but slows down dramatically afterwards. I read that this can happen when there is a large mismatch between data and model, but I don’t think that is the case here.

I had to add the type annotations above to avoid errors when sampling from the prior; without them, A * X ended up being a Vector{Any} for some reason.

Did I do something wrong in defining the model or setting up NUTS? Or is NUTS not the right sampler for this problem?

Hi!

Ooooh… it seems like it’s because the model is highly overparameterized. You see, with 20 observations and 1000 unknowns the system A * X = y is underdetermined, so X’s posterior will be close to degenerate. That is, there are many more X that explain the same y, and the sampler has to explore all those Xs. Due to that geometry, NUTS will often hit the maximum tree depth, which means each step will take a long, long time. The small σ_obs doesn’t make things easier either, since it makes the posterior extremely narrow in the few directions that the data do constrain, which forces a smaller step size.
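
You can actually see the flat directions with a quick linear-algebra check (just a sketch reusing the shapes from your script, nothing Turing-specific):

using LinearAlgebra

A = randn(20, 1000)            # same shape as in your script
@show rank(A)                  # at most 20
@show size(nullspace(A), 2)    # ≈ 980 directions along which A * X does not change at all

If you want to confirm the tree-depth part, the NUTS chain stores per-iteration statistics in its internals (if I remember correctly there is a tree_depth field), so you can check how often it saturates.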

1 Like

Thanks, that makes sense. I guess none of the HMC samplers would work well then. I am interested in sampling from the Lasso and just started with this as a toy model, but I guess I would run into similar difficulties there. Do you have a recommendation for which sampler would work best in such a scenario?

@Red-Portal is spot on here; you’re trying to inform 1,000 parameters (X) using 20 observations (y). The difficulty in fitting the model is inherent to the model as written, and I don’t believe that any sampler will perform well. The best approach is probably to apply some strong regularization, as this will effectively rule out large parts of the parameter space. A Laplace prior (the lasso) is one approach, as it shrinks all of the parameters toward zero. However, when sampling it will not shrink parameters all the way to zero, and it will often shrink parameters that you do want to include too much (The king must die | andrewgelman.com). If you have a large number of parameters and only expect some small fraction of them to be non-zero, consider a prior like the horseshoe (Sparsity information and regularization in the horseshoe and other shrinkage priors | arXiv.org).
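
To make that concrete, here is a rough sketch of what a plain horseshoe prior could look like in Turing for the same linear model (horseshoe_model is just a name I made up, τ and λ follow the usual horseshoe notation, and this is the basic horseshoe rather than the regularized version from the linked paper):

using Turing, Distributions, LinearAlgebra

@model function horseshoe_model(A, σ_noise, obs)
    n = size(A, 2)
    τ ~ truncated(Cauchy(0, 1); lower=0)                # global shrinkage scale
    λ ~ filldist(truncated(Cauchy(0, 1); lower=0), n)   # local shrinkage scales
    X ~ MvNormal(zeros(n), Diagonal((τ .* λ) .^ 2))     # horseshoe prior on the coefficients
    obs ~ MvNormal(A * X, σ_noise^2 * I)                # Gaussian likelihood
end

Be aware that the funnel-like geometry of these shrinkage priors is itself difficult for HMC, as the next reply points out, so a non-centered parameterization is usually worth trying.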

2 Likes

Unfortunately, degenerate posteriors are… by definition degenerate. This is one of the biggest blind spots for pretty much all MCMC algorithms we currently have at our disposal. The same applies to shrinkage priors such as horseshoe-type priors. However, there are attempts to do better on these degenerate posteriors, both application-specific and general-purpose. For instance, Nishimura and Suchard propose a shrinkage prior with an accompanying provably convergent Gibbs sampler. Inverse-problem people, who pretty much always deal with degenerate posteriors like the one here, tend to use more specialized MCMC algorithms for the task. For instance, “proximal” samplers are currently popular. For more general-purpose solutions, check out Pigeons.jl, which provides parallel tempering MCMC. This is probably the biggest hammer we have in our toolbox.
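
To give an idea of what that looks like in practice, here is a minimal sketch of pushing the model from this thread through Pigeons.jl (the keyword values are placeholders, and the exact output-conversion call depends on your Pigeons version, so check its documentation):

using Pigeons, Turing, MCMCChains

model = gaussian_model(A, σ_obs, x, σ_prior, y)    # the model defined earlier in the thread

pt = pigeons(target = TuringLogPotential(model),   # wrap the Turing model as a Pigeons target
             n_rounds = 10,                        # number of tuning rounds (2^10 scans in the last one)
             record = [traces])                    # keep the samples

# Depending on the Pigeons version, conversion to MCMCChains is one of these:
chain = Chains(pt)
# chain = Chains(sample_array(pt), variable_names(pt))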

2 Likes