Numerical stability in Turing.jl

I went through Computational Tricks with Turing <br/> (Non-Centered Parametrization <br/> and QR Decomposition), where it’s suggested to decorrelate the posterior in a Bayesian regression model by a QR decomposition of the data matrix.

There, Q is scaled by (approximately) x20 (the sqrt of the number of data points) and R by x1/20. It is stated that this is done for numerical stability. This is the first time I’ve heard about this, and not rescaling yields very different results, e.g. a posterior mean of ~2 for one parameter without the rescaling instead of ~6 with.

Scaling the data by only x20 having such a big impact on the estimation results is a bit worrying to me. Is this a common issue in Turing.jl or have I stumbled upon an extreme edge case?

Also, is there any reference as to why that scaling factor improves numerical stability in this case?

Is that difference in the posteriors taking into account the rescaling by \mathbf{R}^{*-1} (equation 3 in the page you linked)?

I haven’t run the example so I can’t say for sure, but one possible explanation is that gradient-based samplers like NUTS often have problems with numbers that are close to the floating-point epsilon–the gradients can get wonky and lose precision, leading to inaccurate sampling and bad posteriors. Perhaps using the unscaled Q matrix leads to the \beta values being too small?

You can read more about this transform here: 1.2 The QR reparameterization | Stan User’s Guide


I thought something like that may be the culprit, but what makes me second guess is the fact that the mentioned difference of posterior means ~2 vs ~6 is consistent across many different NUTS samples. If it was just the sampler hicking up and becoming imprecise, I’d expect inconsistencies instead.

Thanks! I’ve just been reading that, but unfortunately there’s no explanation for the included rescaling (which also is N here, as opposed to \sqrt{N-1} in the first link) provided there.

I’m not certain if the reason is practical or principled. But I don’t

I don’t follow. Both pages suggest rescaling by \sqrt{N-1}:

No, poor posterior geometry can result in consistent but wrong samples.

It’s very hard to help debug a model without seeing it. I recommend providing a complete example that shows the difference in posterior means you’re seeing.

I’m sorry! I meant The QR Decomposition For Regression Models, just glanced over the URL. Unfortunately, there’s no explanation given in both of the posts.

You’re right, sorry. Here’s a script to reproduce, that’s pretty much the condensed procedure from my initial link:

using Turing, Random
using DataFrames, CSV, HTTP


url = ""
kidiq =, DataFrame)

X = Matrix(select(kidiq, Not(:kid_score)))
y = kidiq[:, :kid_score]

@model function linreg(X, y; predictors=size(X, 2))
    α ~ Normal(mean(y), 2.5 * std(y))
    β ~ filldist(TDist(3), predictors)
    σ ~ Exponential(1)
    return y ~ MvNormal(α .+ X * β, σ^2 * I)

Q, R = qr(X)
Q = Matrix(Q)

function qr_sample(k=20)
    Qₖ = Q*k
    Rₖ = R/k
    model = linreg(Qₖ, y)
    chain = sample(model, NUTS(1000, 0.65), MCMCThreads(), 1000, 4)

    betas = mapslices(x -> Rₖ^-1 * x, chain[:, namesingroup(chain, :β), :]; dims=[2] )    
    return rescaled_chain = setrange(Chains(betas, ["β[$i]" for i in 1:size(Qₖ, 2)]), 1_001:1:2_000 )

# qr_sample(k=20)  ---->  Factor suggested in linked posts. Posterior mean of β[1] is ~6.2, which is consistent at least with k >= 10.
# qr_sample(k=1)   ---->  No scaling. Posterior mean of β[1] is ~2.1 and quickly approaches the above ~6.2 for k = 2...7

You’re not incorporating the parameter transformation into the priors; IIUC this is done automatically for the example in the Stan reference.
Is the bulk of the samples in one case or the other in the tails of the prior? Are the data not very informative? If so, the transformation issue could account for significant posterior discrepancy.