Building upon my previous example of truncated regression, I thought I’d do a worked example of censored regression. To clarify, censoring is where observations below a lower bound and/or above an upper bound are recorded as being at those bounds. An example would be weighing something heavier than the maximum weight that a measuring device can handle, and recording it as the maximum weight on the dial.
If we think about that in the context of a simple linear regression model, we can be concrete about censoring by defining it in a function:
function censor_data(x, y, L, U)
    censor = fill(0, length(x))  # censoring indicator for each observation
    yc = copy(y)                 # censored copy of y
    for i in eachindex(x)
        if y[i] > U              # right/upper censored: record at the upper bound
            censor[i] = 1
            yc[i] = U
        elseif y[i] < L          # left/lower censored: record at the lower bound
            censor[i] = -1
            yc[i] = L
        else                     # uncensored: leave as-is
            censor[i] = 0
        end
    end
    return x, yc, censor
end
The vector censor has values 0 for uncensored data, -1 for left/lower censored data, and 1 for right/upper censored data. This is necessary for the censored regression model that we’ll construct below.
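As a quick sanity check, here is the function applied to a tiny hand-picked example (the values are mine, chosen purely for illustration):

_, yc_demo, censor_demo = censor_data([1.0, 2.0, 3.0], [-7.0, 0.5, 9.0], -5, 5)
# yc_demo     == [-5.0, 0.5, 5.0]
# censor_demo == [-1, 0, 1]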
We can then generate some fake data, censor it, and visualise the process of censoring:
using Distributions

# generate data from a straight line with Gaussian noise
m, c, σ, N = 1, 0, 2, 200
x = rand(Uniform(-10, 10), N)
y = rand.(Normal.(m .* x .+ c, σ))

# apply censoring
L, U = -5, 5
xc, yc, censor = censor_data(x, y, L, U)
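A minimal sketch of one way to visualise the censoring, assuming StatsPlots.jl is installed (the styling choices are arbitrary):

using StatsPlots

# original data faintly, censored observations on top, bounds as dashed lines
scatter(x, y, alpha=0.3, label="original")
scatter!(xc[censor .!= 0], yc[censor .!= 0], label="censored")
hline!([L, U], linestyle=:dash, color=:black, label="bounds")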
Our censored regression model can be defined as below. I borrowed from the suggestion by @joshualeond in this post. Because the censored observations contribute cdf/ccdf terms rather than density terms, we add each observation’s log-likelihood contribution to the joint log probability manually with Turing.@addlogprob! rather than using ~ statements.
using Turing

@model function censored_regression(x, y, censor)
    m ~ Normal(0, 2)
    c ~ Normal(0, 2)
    σ ~ truncated(Normal(0, 20), 0, Inf)
    for i in eachindex(x)
        if censor[i] == 0      # observed: normal log-density
            Turing.@addlogprob! logpdf(Normal(m * x[i] + c, σ), y[i])
        elseif censor[i] == 1  # right/upper censored: log P(Y > U)
            Turing.@addlogprob! logccdf(Normal(m * x[i] + c, σ), y[i])
        elseif censor[i] == -1 # left/lower censored: log P(Y < L)
            Turing.@addlogprob! logcdf(Normal(m * x[i] + c, σ), y[i])
        end
    end
end
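For reference, the three branches above implement the standard censored-data log-likelihood. With $\mu_i = m x_i + c$, and $f$ and $F$ the normal density and distribution function, each observation contributes

$$
\log \mathcal{L}_i =
\begin{cases}
\log f(y_i \mid \mu_i, \sigma) & \text{uncensored,} \\
\log F(L \mid \mu_i, \sigma) & \text{left/lower censored,} \\
\log\bigl(1 - F(U \mid \mu_i, \sigma)\bigr) & \text{right/upper censored.}
\end{cases}
$$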
We can then sample and check the chains:
using StatsPlots

chain = sample(censored_regression(xc, yc, censor), NUTS(), 2000)
plot(chain)
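To put numbers on the visual check, we can also print the summary table (describe is provided by MCMCChains, the package Turing uses for its chains):

describe(chain)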
Everything looks good. The posterior for m is pretty close to the true value of 1, and the posterior for c is slightly below the true value of 0, but nothing overly problematic.
Visualising the posterior predictive confirms that the model is a good fit to the data. By using the censored regression model, we have avoided the bias in the slope estimate (attenuation towards zero) that would have occurred had we run ordinary regression on the censored data alone.
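As a sketch of one way to draw that posterior predictive check, again assuming StatsPlots.jl (the 100 overlaid draws are an arbitrary choice):

# overlay regression lines from 100 random posterior draws on the censored data
scatter(xc, yc, alpha=0.3, label="censored data")
xs = [-10.0, 10.0]
for i in rand(1:size(chain, 1), 100)
    m_i, c_i = chain[:m][i], chain[:c][i]
    plot!(xs, m_i .* xs .+ c_i, color=:red, alpha=0.1, label=false)
end
hline!([L, U], linestyle=:dash, color=:black, label="bounds")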