Hi all. I’ve found that there are no real definitive guides on truncated vs censored regression in a Bayesian context. The topic gets some treatment in Bayesian texts, but I’ve not found it easy to find worked examples in different PPLs, and I have seen some examples which get truncation and censoring a bit muddled. I believe I’ve got a working example of truncated regression, and I’m hoping to get some confirmation that this is the right way of implementing it in Turing.
First, let’s be clear what truncation is. It is where data outside of some bounds are discarded. The person doing the linear regression on real data knows nothing about the data outside these bounds, nor how much of it there was. In a simple regression context where we truncate our outcome variable y, it would correspond to this…
function truncate(x, y, bounds)
    # keep only the (x, y) pairs whose y value lies strictly inside the bounds
    L, U = bounds
    keep = (y .> L) .& (y .< U)
    return x[keep], y[keep]
end
Then we generate some data and truncate it
using Distributions

# true parameters and sample size
m, c, σ, N = 1, 0, 2, 300
x = rand(Uniform(-10, 10), N)
y = rand.(Normal.(m .* x .+ c, σ))

# truncate the outcome variable
L, U = -5, 5
xt, yt = truncate(x, y, (L, U))
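As a quick sanity check (not strictly necessary for the example), we can see how much data the truncation step discarded:

# how many observations survived truncation (the exact count varies with the random draw)
println("kept $(length(yt)) of $N observations after truncating to ($L, $U)")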
resulting in a dataset looking like this
Now we can define a regular regression model
using Turing

@model function linear_regression(x, y)
    # priors
    m ~ Normal(0, 2)
    c ~ Normal(0, 2)
    σ ~ truncated(Normal(0, 20), 0, Inf)
    # likelihood: ordinary normal, ignoring the truncation
    for i in 1:length(x)
        y[i] ~ Normal(m * x[i] + c, σ)
    end
end
And a truncated regression model where the likelihood is truncated at the bounds
@model function truncated_regression(x, y, lims)
    L, U = lims
    # priors
    m ~ Normal(0, 2)
    c ~ Normal(0, 2)
    σ ~ truncated(Normal(0, 20), 0, Inf)
    # likelihood: a normal truncated to the same bounds used to truncate the data
    for i in 1:length(x)
        y[i] ~ truncated(Normal(m * x[i] + c, σ), L, U)
    end
end
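As I understand it, what truncated does to each likelihood term is renormalise the normal density by the probability mass that falls inside (L, U), which is how the model accounts for the data that was discarded. A minimal check of that, using only Distributions (already loaded above):

# logpdf of a truncated normal = logpdf of the untruncated normal minus the log of
# the probability mass inside the bounds (for a point that lies inside the bounds)
d = Normal(1.0, 2.0)
manual = logpdf(d, 3.0) - log(cdf(d, 5) - cdf(d, -5))
manual ≈ logpdf(truncated(d, -5, 5), 3.0)  # true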
Then we run parameter estimation for both of these models on the truncated data
chain_lin = sample(linear_regression(xt, yt), NUTS(), 5000)
chain_trunc = sample(truncated_regression(xt, yt, (L, U)), NUTS(), 5000)
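Before plotting, a quick look at the posterior mean slopes (the exact numbers will vary from run to run) already shows the difference between the two models:

println("linear regression slope estimate:    ", mean(chain_lin[:m]))
println("truncated regression slope estimate: ", mean(chain_trunc[:m]))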
And we can write some plot code to visualise the data and the posterior predictive fits
using Plots

function visualise(x, y, chain, subplot, true_params)
    m, c = true_params
    # data
    scatter!(x, y, color=:grey, markerstrokewidth=0, label=:none, subplot=subplot)
    # a random subset of posterior samples of the regression line
    xi = [minimum(x), maximum(x)]
    n_samples = length(chain[:m])
    samples_to_draw = 100
    samples_to_plot = sample([1:n_samples;], samples_to_draw; replace=false)
    for i in samples_to_plot
        plot!(xi, chain[:m][i] .* xi .+ chain[:c][i], color=:black, alpha=0.1, label=false, subplot=subplot)
    end
    # posterior mean and true regression lines
    plot!(xi, mean(chain[:m]) .* xi .+ mean(chain[:c]), color=:red, lw=2, label="posterior mean", ylim=[minimum(y) * 1.5, maximum(y) * 1.5], subplot=subplot)
    plot!(xi, m .* xi .+ c, color=:black, lw=3, label="true", ylim=[minimum(y) * 1.5, maximum(y) * 1.5], subplot=subplot)
    plot!(tick_direction=:out)
end
plot(layout=(1,2), size=(1000, 500))
visualise(xt, yt, chain_lin, 1, (m, c))
plot!(title="truncated data\nlinear regression", subplot=1)
hline!([L, U], subplot=1, colour=:black, label=false)
visualise(xt, yt, chain_trunc, 2, (m, c))
plot!(title="truncated data\ntruncated regression", subplot=2)
hline!([L, U], subplot=2, colour=:black, label=false)
resulting in this
As we can see, the linear regression applied to the truncated data gives a biased estimate of the slope. However, applying the truncated regression model to the same truncated data results in a much better estimate of the regression slope.
Does all this look correct? If so, I’ll attempt to do an equivalent example for censored regression.