Turing: truncated regression example

Hi all. I’ve found that there are no real definitive guides on truncated vs. censored regression in a Bayesian context. The topic gets some treatment in Bayesian texts, but I’ve not found it easy to find worked examples in different PPLs, and I have seen some examples which get truncation and censoring a bit muddled. I believe I’ve got a working example of truncated regression, and I’m hoping I can get some confirmation that this is the right way of implementing truncated regression in Turing.

First, let’s be clear about what truncation is: data outside some bounds are discarded entirely. The person doing the regression on real data knows nothing about the discarded observations, not even how many there were. In a simple regression context where we truncate our outcome variable y, it corresponds to this…

function truncate(x, y, bounds)
    # keep only the (x, y) pairs whose y falls strictly within the bounds
    L, U = bounds
    keep = (y .> L) .& (y .< U)
    return x[keep], y[keep]
end

Then we generate some data and truncate it

using Distributions

m, c, σ, N = 1, 0, 2, 300
x = rand(Uniform(-10, 10), N)
y = rand.(Normal.(m .* x .+ c, σ))
L, U = -5, 5
xt, yt = truncate(x, y, (L, U))
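
As an aside, it is worth knowing how much data the truncation throws away; with these settings roughly half the points fall outside the bounds

# optional check: how many points survived the truncation?
println("kept $(length(yt)) of $N points")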

The resulting truncated dataset looks like this


Now we can define a regular regression model

using Turing

@model function linear_regression(x, y)
    m ~ Normal(0, 2)
    c ~ Normal(0, 2)
    σ ~ truncated(Normal(0, 20), 0, Inf)
    for i in 1:length(x)
        y[i] ~ Normal(m*x[i]+c, σ)
    end
end

And a truncated regression model where the likelihood is truncated at the bounds

@model function truncated_regression(x, y, lims)
    L, U = lims
    m ~ Normal(0, 2)
    c ~ Normal(0, 2)
    σ ~ truncated(Normal(0, 20), 0, Inf)
    for i in 1:length(x)
        y[i] ~ truncated(Normal(m*x[i]+c, σ), L, U)
    end
end
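
The only change from the plain model is the likelihood: truncated(Normal(m*x[i]+c, σ), L, U) renormalises the normal density by the probability mass between the bounds, so observations near the bounds are no longer treated as if the full normal applied. A quick illustrative check of what truncated does (not part of the model)

d  = Normal(0, 2)
dt = truncated(d, -5, 5)
# the truncated log-density is the normal log-density minus the log of the mass between the bounds
logpdf(dt, 4) ≈ logpdf(d, 4) - log(cdf(d, 5) - cdf(d, -5))   # true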

Then we run parameter estimation for both models on the truncated data

chain_lin = sample(linear_regression(xt, yt), NUTS(), 5000)
chain_trunc = sample(truncated_regression(xt, yt, (L, U)), NUTS(), 5000)
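
To put rough numbers on the comparison (the exact values will vary from run to run), we can compare the posterior mean of the slope from each chain with the true value m = 1

println("linear regression slope:    ", mean(chain_lin[:m]))
println("truncated regression slope: ", mean(chain_trunc[:m]))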

And we can write some plot code to visualise the data together with regression lines drawn from the posterior

using Plots

function visualise(x, y, chain, subplot, true_params)
    m, c = true_params
    scatter!(x, y, color=:grey, markerstrokewidth=0, label=:none, subplot=subplot)
    xi = [minimum(x), maximum(x)]
    n_samples = length(chain[:m])
    samples_to_draw = 100
    samples_to_plot = sample([1:n_samples;], samples_to_draw; replace=false)
    for i in samples_to_plot
        plot!(xi, chain[:m][i].*xi.+chain[:c][i], color=:black, alpha=0.1, label=false, subplot=subplot)
    end
    plot!(xi, mean(chain[:m]).*xi.+mean(chain[:c]), color=:red, lw=2, label="posterior mean", ylim=[minimum(y)*1.5, maximum(y)*1.5], subplot=subplot)
    plot!(xi, m.*xi.+c, color=:black, lw=3, label="true", ylim=[minimum(y)*1.5, maximum(y)*1.5], subplot=subplot)
    plot!(tick_direction=:out)
end

plot(layout=(1,2), size=(1000, 500))
visualise(xt, yt, chain_lin, 1, (m, c))
plot!(title="truncated data\nlinear regression", subplot=1)
hline!([L, U], subplot=1, colour=:black, label=false)

visualise(xt, yt, chain_trunc, 2, (m, c))
plot!(title="truncated data\ntruncated regression", subplot=2)
hline!([L, U], subplot=2, colour=:black, label=false)

resulting in this

As we can see, the linear regression applied to the truncated data gives a biased (attenuated) estimate of the slope. Applying the truncated regression model to the same truncated data recovers the regression slope much more accurately.

Does all this look correct? If so, I’ll attempt to do an equivalent example for censored regression.


As far as I can tell, this seems correct. One way to verify is to compare truncated samples against the corresponding truncated PDF. What I found is that the PDF fits the generated data quite well.

using Distributions, Plots

x = rand(Normal(0, 1), 10^6)
xt = filter(v -> -2 <= v <= 2, x)

vals = -2:0.01:2
dens = pdf.(truncated(Normal(0, 1), -2, 2), vals)

histogram(xt, norm=true)
plot!(vals, dens)
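
Along the same lines, the fraction of draws that survives the truncation should match the normal probability mass between the bounds

# both should be approximately 0.954
length(xt) / length(x)
cdf(Normal(0, 1), 2) - cdf(Normal(0, 1), -2)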