Turing: censored regression example

Building upon my previous example of truncated regression, I thought I’d do a worked example of censored regression. To clarify, censoring is where observations below a lower bound and/or above an upper bound are recorded as being at those bounds. An example would be weighing something heavier than the maximum weight that a measuring device can handle, and recording it as the max weight on the dial.

If we think about that in the context of a simple linear regression model, we can be concrete about censoring by defining it in a function:

function censor_data(x, y, L, U)
    censor = fill(0, length(x))  # 0 = uncensored, -1 = left/lower censored, 1 = right/upper censored
    yc = copy(y)                 # censored copy of y
    for i in 1:length(x)
        if y[i] > U
            censor[i] = 1
            yc[i] = U
        elseif y[i] < L
            censor[i] = -1
            yc[i] = L
        else
            censor[i] = 0
        end
    end
    return x, yc, censor
end

The vector censor has values 0 for uncensored data, -1 for left/lower censored data, and 1 for right/upper censored data. This is necessary for the censored regression model that we’ll construct below.

We can then generate some fake data, censor it, and visualise the process of censoring:

using Distributions

# generate data
m, c, σ, N = 1, 0, 2, 200
x = rand(Uniform(-10, 10), N)
y = rand.(Normal.(m.*x.+c, σ))
# Apply censoring
L, U = -5, 5
xc, yc, censor = censor_data(x, y, L, U)
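
As a quick sanity check on the simulated data, something like the following counts how many observations ended up at each bound:

count(censor .== 1)   # right/upper censored points
count(censor .== -1)  # left/lower censored points
count(censor .== 0)   # uncensored points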

Our censored regression model can be defined as below. I borrowed from the suggestion by @joshualeond in this post.

using Turing

@model function censored_regression(x, y, censor)
    m ~ Normal(0, 2)
    c ~ Normal(0, 2)
    σ ~ truncated(Normal(0, 20), 0, Inf)
    for i in 1:length(x)
        if censor[i] == 0  # observed
            Turing.@addlogprob! logpdf(Normal(m*x[i]+c, σ), y[i])
        elseif censor[i] == 1 # right/upper censored
            Turing.@addlogprob! logccdf(Normal(m*x[i]+c, σ), y[i])
        elseif censor[i] == -1 # left/lower censored
            Turing.@addlogprob! logcdf(Normal(m*x[i]+c, σ), y[i])
        end
    end
end

For the censored observations we only know that the true value lies at or beyond the bound, so the likelihood contribution is the corresponding tail probability: logccdf gives log P(y ≥ U) for right/upper censored points and logcdf gives log P(y ≤ L) for left/lower censored points.

We can then sample and check the chains:

chain = sample(censored_regression(xc, yc, censor), NUTS(), 2000)

using StatsPlots
plot(chain)


Everything looks good. The posterior for m is pretty close to the true value, and the posterior for c is slightly low, but nothing overly problematic.
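
A quick way to check that numerically against the true values (m = 1, c = 0, σ = 2) is something like:

using Statistics
mean(chain[:m]), mean(chain[:c]), mean(chain[:σ])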

Visualising the posterior predictive confirms that the model is a good fit to the data. By using the censored regression model, we have avoided the bias in estimating the slope that would have occurred with regular regression alone.
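
For anyone wanting to reproduce that plot, here is a minimal sketch, assuming the chain, censored data, and bounds from above (the choice of 50 overlaid posterior draws is arbitrary):

# overlay posterior regression lines on the censored data
scatter(xc, yc, label="censored data", alpha=0.3)
ms, cs = vec(chain[:m]), vec(chain[:c])
xs = range(-10, 10, length=100)
for j in rand(1:length(ms), 50)
    plot!(xs, ms[j] .* xs .+ cs[j], color=:blue, alpha=0.1, label="")
end
hline!([L, U], linestyle=:dash, color=:red, label="censoring bounds")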


Very nice!

It’s interesting: I usually think of PPLs as a way to combine existing distributions (or measures). So I’d probably approach this by first defining a new Censored measure, and then using that. It’s especially fun that Censored would be singular wrt Lebesgue measure, so we’d need to represent it as a superposition of Lebesgue and Dirac measures.

I’ll need to see about getting this set up in MeasureTheory.jl, then we can compare that with your Turing implementation.

Thanks @cscherrer. I’m less of a mathematician/statistician than you, so I might not be following you exactly. But are you suggesting replicating truncated() as a new censored function that works with distributions? From the user / model builder perspective, being able to write something like

y[i] ~ censored(Normal(m*x[i]+c, σ), L, U)

would be pretty amazing.


Sorry if I was a little cryptic, but it sounds like you got the idea.

On the “singular wrt Lebesgue measure” thing… Say you took billions of samples from a censored distribution, and tried to build histograms. The samples at the limits of the distribution would tend to overwhelm the rest. The distribution is concentrated there, so in a sense the units don’t work out properly, and it’s a mess.
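
Concretely, for the normal likelihood in the example above (writing Y for the censored variable, and Φ and φ for the standard normal cdf and pdf), censoring Normal(μ, σ²) to [L, U] gives

$$
P(Y = L) = \Phi\!\left(\frac{L-\mu}{\sigma}\right), \qquad
P(Y = U) = 1 - \Phi\!\left(\frac{U-\mu}{\sigma}\right), \qquad
p(y) = \frac{1}{\sigma}\,\varphi\!\left(\frac{y-\mu}{\sigma}\right) \ \text{for } L < y < U,
$$

i.e. two Dirac point masses at the bounds superposed with a part that has a density with respect to Lebesgue measure.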

But this is the kind of use case we’re targeting for MeasureTheory, so it makes for a nice test.


The problem you’re describing is modelled with Tobit Regression in my world. That would be an interesting thing to add to one of the econometrics packages.

Yep, I believe (but don’t know for sure) that this is doing exactly the same thing as Tobit regression, but with priors over m, c, and σ.

The lines

for i in 1:length(x)
    if censor[i] == 0  # observed
        Turing.acclogp!(_varinfo, logpdf(Normal(m*x[i]+c, σ), y[i]))
    elseif censor[i] == 1 # right/upper censored
        Turing.acclogp!(_varinfo, logccdf(Normal(m*x[i]+c, σ), y[i]))
    elseif censor[i] == -1 # left/lower censored
        Turing.acclogp!(_varinfo, logcdf(Normal(m*x[i]+c, σ), y[i]))
    end
end

modify the likelihood in the same way as the Tobit model, I believe. But happy to be corrected on that.
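
If I have the standard Tobit likelihood right, then writing μᵢ = m·xᵢ + c, and Φ and φ for the standard normal cdf and pdf, it is

$$
\ell(m, c, \sigma) =
\sum_{\text{uncensored}} \log\!\left[\frac{1}{\sigma}\,\varphi\!\left(\frac{y_i - \mu_i}{\sigma}\right)\right]
+ \sum_{\text{right censored}} \log\!\left[1 - \Phi\!\left(\frac{U - \mu_i}{\sigma}\right)\right]
+ \sum_{\text{left censored}} \log\!\left[\Phi\!\left(\frac{L - \mu_i}{\sigma}\right)\right],
$$

which appears to match the logpdf, logccdf, and logcdf terms accumulated in the loop.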

You can use Turing.@addlogprob! instead of Turing.acclogp!:

Turing.@addlogprob! logpdf(Normal(m*x[i]+c, σ), y[i])

The macro just rewrites it to use Turing.acclogp!, but you don’t have to specify (and hence don’t get warnings about) the internal variable _varinfo.


That’s pretty nice. I’ll edit the post to use that approach.