Numerical stability in Turing.jl

boogiewoogie · August 21, 2023, 2:40pm

I went through Computational Tricks with Turing (Non-Centered Parametrization and QR Decomposition), where it’s suggested to decorrelate the posterior in a Bayesian regression model by a QR decomposition of the data matrix.

There, Q is scaled by a factor of approximately 20 (the square root of the number of data points) and R by 1/20. It is stated that this is done for numerical stability. This is the first time I’ve heard about this, and not rescaling yields very different results, e.g. a posterior mean of ~2 for one parameter without the rescaling instead of ~6 with it.

Scaling the data by only x20 having such a big impact on the estimation results is a bit worrying to me. Is this a common issue in Turing.jl or have I stumbled upon an extreme edge case?

Also, is there any reference as to why that scaling factor improves numerical stability in this case?

ElOceanografo · August 21, 2023, 9:15pm

Is that difference in the posteriors taking into account the rescaling by \mathbf{R}^{*-1} (equation 3 in the page you linked)?

I haven’t run the example so I can’t say for sure, but one possible explanation is that gradient-based samplers like NUTS often have problems with numbers that are close to the floating-point epsilon: the gradients can get wonky and lose precision, leading to inaccurate sampling and bad posteriors. Perhaps using the unscaled Q matrix leads to the \beta values being too small?

sethaxen · August 22, 2023, 7:03am

You can read more about this transform here: 1.2 The QR reparameterization | Stan User’s Guide

boogiewoogie · August 22, 2023, 7:07am

Yes.

I thought something like that may be the culprit, but what makes me second-guess is the fact that the mentioned difference in posterior means (~2 vs ~6) is consistent across many different NUTS samples. If it were just the sampler hiccuping and becoming imprecise, I’d expect inconsistencies instead.

boogiewoogie · August 22, 2023, 7:10am

Thanks! I’ve just been reading that, but unfortunately there’s no explanation for the included rescaling (which also is N here, as opposed to \sqrt{N-1} in the first link) provided there.

sethaxen · August 22, 2023, 7:30am

I’m not certain if the reason is practical or principled.

I don’t follow. Both pages suggest rescaling by \sqrt{N-1}:

No, poor posterior geometry can result in consistent but wrong samples.

It’s very hard to help debug a model without seeing it. I recommend providing a complete example that shows the difference in posterior means you’re seeing.

boogiewoogie · August 22, 2023, 7:57am

I’m sorry! I meant The QR Decomposition For Regression Models, I just glanced over the URL. Unfortunately, there’s no explanation given in either of the posts.

You’re right, sorry. Here’s a script to reproduce, that’s pretty much the condensed procedure from my initial link:

using Turing, Random
using LinearAlgebra, Statistics  # qr and I; mean and std
using DataFrames, CSV, HTTP

Random.seed!(123)


url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/kidiq.csv"
kidiq = CSV.read(HTTP.get(url).body, DataFrame)

X = Matrix(select(kidiq, Not(:kid_score)))
y = kidiq[:, :kid_score]



@model function linreg(X, y; predictors=size(X, 2))
    #priors
    α ~ Normal(mean(y), 2.5 * std(y))
    β ~ filldist(TDist(3), predictors)
    σ ~ Exponential(1)
    #likelihood
    return y ~ MvNormal(α .+ X * β, σ^2 * I)
end;



Q, R = qr(X)
Q = Matrix(Q)

function qr_sample(k=20)
    Qₖ = Q*k
    Rₖ = R/k
    model = linreg(Qₖ, y)
    chain = sample(model, NUTS(1000, 0.65), MCMCThreads(), 1000, 4)

    betas = mapslices(x -> Rₖ^-1 * x, chain[:, namesingroup(chain, :β), :].value.data; dims=[2] )    
    return rescaled_chain = setrange(Chains(betas, ["β[$i]" for i in 1:size(Qₖ, 2)]), 1_001:1:2_000 )
end


# k is a positional argument, so call qr_sample(20) rather than qr_sample(k=20).
# qr_sample(20)  ---->  Factor suggested in linked posts. Posterior mean of β[1] is ~6.2, which is consistent at least with k >= 10.
# qr_sample(1)   ---->  No scaling. Posterior mean of β[1] is ~2.1 and quickly approaches the above ~6.2 for k = 2...7
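
As a sanity check on the rescaling itself, here is a minimal sketch that uses synthetic data in place of kidiq (the sizes `N` and `p`, the coefficient vector `β`, and the name `θ` for the coefficients on the scaled Q are all assumptions made for illustration). It only checks the algebra: multiplying Q by k and dividing R by k leaves the linear predictor unchanged, while the coefficients the sampler works with shrink by a factor of k.

```julia
using LinearAlgebra, Random

Random.seed!(1)
N, p = 400, 3
X = randn(N, p)          # synthetic stand-in for the predictor matrix

F = qr(X)
Q = Matrix(F.Q)          # thin Q, N × p
R = F.R                  # p × p upper-triangular factor

k = 20
Qₖ = Q * k
Rₖ = R / k

β = [6.0, -1.5, 0.3]     # arbitrary coefficients on the original scale
θ = Rₖ * β               # coefficients on the scaled-Q parameterization

# The linear predictor does not depend on k.
@assert Qₖ * θ ≈ X * β

# For the same β, θ is k times smaller than it would be without scaling.
@assert R * β ≈ k .* θ
```

So the likelihood is identical for every k; whatever differs between `qr_sample(1)` and `qr_sample(20)` has to come from something other than the likelihood.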

Ralph_Smith · August 23, 2023, 2:56am

You’re not incorporating the parameter transformation into the priors; IIUC this is done automatically for the example in the Stan reference.
Is the bulk of the samples in one case or the other in the tails of the prior? Are the data not very informative? If so, the transformation issue could account for significant posterior discrepancy.
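
To make the prior argument concrete, here is a sketch that assumes the script above is used unchanged, i.e. with a unit-scale TDist(3) prior placed on the coefficients of the scaled Q. The symbol \theta for those coefficients is introduced here for illustration and does not appear in the script.

```latex
\begin{aligned}
X\beta &= (kQ)\,\bigl(\tfrac{1}{k}R\bigr)\,\beta = Q_k\,\theta,
  &\qquad \theta &= R_k\,\beta = \tfrac{1}{k}\,R\,\beta, \\
\theta_i &\sim t_3(0, 1)
  &\qquad\Longrightarrow\qquad \beta &= R_k^{-1}\,\theta = k\,R^{-1}\,\theta .
\end{aligned}
```

The likelihood does not depend on k, but the implied prior on \beta is stretched by a factor of k. With k = 1 that prior is 20 times narrower than with k = 20, which would pull the posterior toward zero unless the data are strongly informative. For large k the prior becomes nearly flat over the relevant range, so the posterior mean should stop changing. That would be consistent with the reported ~2.1 at k = 1 converging to ~6.2 for k >= 10.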
