Advice on speeding up a Turing model

I have a model in Turing that is exceptionally slow, and I'm hoping for some help speeding it up. I'm new to Turing, so any pointers are welcome.

My model is

using Turing, LinearAlgebra  # LinearAlgebra provides the I used in MvNormal

@model function many_linreg(X, y; predictors=size(X, 2))
    #priors
    α ~ Normal(0, sqrt(250))
    β ~ filldist(Uniform(0, 1e8), predictors)
    σ ~ Uniform(0, 1e4)
    N = length(y)
    #likelihood
    for i in 1:N
        y[i] ~ MvNormal(α .+ X * β, σ^2 * I)
    end
end

I sample it with

chain2 = sample(model2, NUTS(), MCMCThreads(), 1_000, 20, adtype=AutoReverseDiff(true))

I’d like to run this eventually on ~30k observation vectors, but I can’t make it run in a reasonable time on even 5k. Sampling just gets stuck at 0% after the initial step size is found.

I have tried the common tricks I know to speed things up: for instance, I am using a QR-decomposed matrix of predictors and reverse-mode differentiation with a compiled tape. I tried to vectorize the for loop but get errors when calling filldist on it. I think maybe I should be using arraydist over single Normals instead, but I’m not sure.

Do you really want each observation y[i] to be multivariate normal?
I.e., a linear regression model would have

y .~ Normal.(α .+ X * β, σ)
# or
y ~ MvNormal(α .+ X * β, σ^2 * I)

Each y[i] is a 64-element vector, each entry of which is subject to iid homoskedastic normal noise. I have a list of 30k such y[i] bundled into the vector y. So each y[i] should be multivariate normal then, shouldn’t it?

Ah, OK, that was not clear to me. So this means that each of the samples y[i] comes from the same distribution?
If that’s the case, you could aggregate the samples and compute the likelihood from the sufficient statistics. Further, those very wide priors can hinder sampling: try something more informed, like a reasonably constrained normal, or a wider-tailed distribution if less information is available. Here is an interesting post on the role/influence of priors.
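For example, something along these lines (an untested sketch, assuming every y[i] really does share the mean α .+ X * β; the model name and signature are just illustrative):

using Turing, LinearAlgebra

# outside the model: precompute the sufficient statistics once
Y  = reduce(hcat, y)      # M × N matrix of observations
N  = size(Y, 2)
S  = vec(sum(Y; dims=2))  # per-component sums Σᵢ y[i]
SS = sum(abs2, Y)         # total sum of squares Σᵢⱼ y[i][j]²

@model function many_linreg_ss(X, S, SS, N; M=size(X, 1), predictors=size(X, 2))
    α ~ Normal(0, sqrt(250))
    β ~ filldist(Uniform(0, 1e8), predictors)
    σ ~ Uniform(0, 1e4)
    μ = α .+ X * β
    # total squared error Σᵢⱼ (y[i][j] - μ[j])², expanded in terms of S and SS
    sqerr = SS - 2 * dot(μ, S) + N * sum(abs2, μ)
    # Gaussian log-likelihood of all N observation vectors, added as one term
    # instead of looping over the data
    Turing.@addlogprob! -0.5 * (N * M * log(2π * σ^2) + sqerr / σ^2)
end

The loop over 30k MvNormals then collapses into a single log-density term, which should also make reverse-mode AD much cheaper.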

Unfortunately I have very, very weak priors, which I don’t think can be improved beyond what I gave. The y values may come from one distribution or from a mixture of several; again, it’s hard to know. I’d like to accelerate the model as written, using whatever optimizations are available in Julia/Turing.

Maybe you could group all the y[i]s into a single matrix-variate distribution with a block-diagonal covariance?


Specifically, you could implement @gdalle’s suggestion like this:

using PDMats

...
# outside the model
N = length(y)
M = length(first(y))
Y = reduce(hcat, y)
# inside the model
Y ~ MatrixNormal(repeat(α .+ X * β, 1, N), ScalMat(M, σ^2), ScalMat(N, 1))

I’d be surprised if this massively improved sampling, though. Uniform priors and correlated predictors often cause problems here, and you probably have more prior information than you think. For example, this prior says it’s equally likely for σ to be 0, 1, or 10^4, but that 10^4 + 1e-12 is absolutely impossible. Is that what you actually believe? Often there is some range outside of which you would be surprised. At the very least, your surprise at seeing a high value probably increases smoothly as you approach 10^4 rather than suddenly kicking in at 10^4. So a heavy-tailed truncated prior like truncated(TDist(2)*1e3; lower=0) could make more sense and improve sampling.
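Concretely, the prior block could look something like this (a sketch only; the positivity constraint mirrors your Uniform(0, 1e8) prior on β, and the 1e7/1e3 scales are placeholders you would set to whatever range would genuinely surprise you):

α ~ Normal(0, sqrt(250))
β ~ filldist(truncated(TDist(3) * 1e7; lower=0), predictors)  # heavy-tailed, positive, still very wide
σ ~ truncated(TDist(2) * 1e3; lower=0)                        # instead of Uniform(0, 1e4)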

When you have genuinely uninformative priors and some of your predictors might be correlated, the QR reparameterization (see e.g. the Stan docs) often helps a lot. Here it would look like:

using LinearAlgebra

# thin QR decomposition, scaled as in the Stan docs, so that X == Q * R
function _qr(x)
    M = size(x, 1)
    F = qr(x)
    return sqrt(M - 1) * Matrix(F.Q), UpperTriangular(F.R) / sqrt(M - 1)
end

@model function many_linreg_qr(X, y; predictors=size(X, 2), QR=_qr(X), Q=QR[1], R=QR[2])
    #priors
    α ~ Normal(0, sqrt(250))
    θ ~ filldist(Uniform(0, 1e8), predictors)
    β := R \ θ
    σ ~ Uniform(0, 1e4)
    N = length(y)
    #likelihood
    ydist = MvNormal(α .+ (Q * θ), σ^2 * I)
    for i in 1:N
        y[i] ~ ydist
    end
end

Note that placing the prior on θ instead of β results in a completely different implicit prior on β (the resulting prior assumes a correlation structure similar to the one the predictors themselves have, which is what sampling will generally produce anyway), but that is fine here since you don’t know what your priors should be anyway.
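If it helps, with a likelihood like this I’d pass a compiled ReverseDiff tape to the sampler itself; in recent Turing versions that looks roughly like this (a sketch, adjust to your Turing version):

using Turing, ReverseDiff  # ReverseDiff must be loaded for the AutoReverseDiff backend

model = many_linreg_qr(X, y)
chain = sample(model, NUTS(; adtype=AutoReverseDiff(compile=true)), MCMCThreads(), 1_000, 20)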


Essentially there are two things to consider: how fast your model is, and how many times you have to run it. Classic Julia performance tricks only improve the first, and there’s a limit to how fast you can go. The second largely depends on how well your model fits your data (which is why unreasonable priors can genuinely hurt you).