Turing: censored regression example

Building upon my previous example of truncated regression, I thought I’d do a worked example of censored regression. To clarify, censoring is when observations below a lower bound and/or above an upper bound are recorded as being at those bounds. An example would be weighing something heavier than the maximum weight that a measuring device can handle, and recording it as the max weight on the dial.

If we think about that in the context of a simple linear regression model, we can make censoring concrete by defining it in a function:

function censor_data(x, y, L, U)
    censor = fill(0, length(x))  # censoring indicator for each observation
    yc = copy(y)                 # censored copy of the response
    for i in eachindex(y)
        if y[i] > U              # right/upper censored: record at the upper bound
            censor[i] = 1
            yc[i] = U
        elseif y[i] < L          # left/lower censored: record at the lower bound
            censor[i] = -1
            yc[i] = L
        else                     # uncensored
            censor[i] = 0
        end
    end
    return x, yc, censor
end

The vector censor has values 0 for uncensored data, -1 for left/lower censored data, and 1 for right/upper censored data. This is necessary for the censored regression model that we’ll construct below.

We can then generate some fake data, censor it, and visualise the process of censoring:

using Distributions

# generate data
m, c, σ, N = 1, 0, 2, 200
x = rand(Uniform(-10, 10), N)
y = rand.(Normal.(m .* x .+ c, σ))
# apply censoring
L, U = -5, 5
xc, yc, censor = censor_data(x, y, L, U)
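
As a quick sanity check (just bookkeeping, not part of the model), we can count how many observations were clipped:

using Statistics

count(==(1), censor)    # number of upper-censored observations
count(==(-1), censor)   # number of lower-censored observations
mean(censor .!= 0)      # overall fraction of censored observations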

Our censored regression model can be defined as follows. I borrowed from the suggestion by @joshualeond in this post.

using Turing

@model function censored_regression(x, y, censor)
    m ~ Normal(0, 2)
    c ~ Normal(0, 2)
    σ ~ truncated(Normal(0, 20), 0, Inf)
    for i in eachindex(x)
        if censor[i] == 0  # observed
            Turing.@addlogprob! logpdf(Normal(m*x[i]+c, σ), y[i])
        elseif censor[i] == 1 # right/upper censored
            Turing.@addlogprob! logccdf(Normal(m*x[i]+c, σ), y[i])
        elseif censor[i] == -1 # left/lower censored
            Turing.@addlogprob! logcdf(Normal(m*x[i]+c, σ), y[i])
        end
    end
end

We can then sample and check the chains:

using StatsPlots

chain = sample(censored_regression(xc, yc, censor), NUTS(), 2000)
plot(chain)
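
Numerical summaries are handy alongside the trace plots; if I remember right, describe is available after using Turing (via MCMCChains):

describe(chain)  # posterior means, quantiles, ESS, and R̂ for m, c, σ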


Everything looks good. The posterior for m is pretty close to the true value of 1, and the posterior for c is slightly below the true value of 0, but nothing overly problematic.

Visualising the posterior predictive confirms that the model is a good fit to the data. By using the censored regression model, we have avoided the bias in estimating the slope that would have occurred with regular regression alone.
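
In case it’s useful, here is roughly the code behind that plot (a sketch; the exact plotting calls are a matter of taste):

using StatsPlots

m_post = vec(chain[:m])  # posterior draws of the slope
c_post = vec(chain[:c])  # posterior draws of the intercept

scatter(xc, yc, label="censored data", markersize=2, alpha=0.5)
xs = [-10, 10]
for i in rand(1:length(m_post), 100)  # overlay 100 posterior regression lines
    plot!(xs, m_post[i] .* xs .+ c_post[i], color=:blue, alpha=0.05, label=false)
end
hline!([L, U], linestyle=:dash, color=:red, label="censoring bounds")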


Very nice!

It’s interesting, I usually think of PPLs as a way to combine existing distributions (or measures). So I’d probably approach this by first defining a new Censored measure, and then using that. It’s especially fun that Censored would be singular wrt Lebesgue measure, so we’d need to represent it as a superposition of Lebesgue and Dirac measures.

I’ll need to see about getting this set up in MeasureTheory.jl, then we can compare that with your Turing implementation.

Thanks @cscherrer. I’m less of a mathematician/statistician than you, so I might not be following you exactly. But are you suggesting replicating truncated() as a new censored() function that works with distributions? From the user / model builder perspective, being able to write something like

y[i] ~ censored(Normal(m*x[i]+c, σ), L, U)

would be pretty amazing.


Sorry if I was a little cryptic, but it sounds like you got the idea.

On the “singular wrt Lebesgue measure” thing… Say you took billions of samples from a censored distribution, and tried to build histograms. The samples at the limits of the distribution would tend to overwhelm the rest. The distribution is concentrated there, so in a sense the units don’t work out properly, and it’s a mess.
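
In symbols (a rough sketch; f and F are the density and CDF of the underlying distribution, λ is Lebesgue measure, and δ a Dirac measure):

\mu_{\mathrm{cens}} = F(L)\,\delta_L + f \cdot \lambda|_{(L,U)} + \bigl(1 - F(U)\bigr)\,\delta_U

The two Dirac terms are the atoms at the bounds, which is exactly where a density-only representation breaks down.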

But this is the kind of use case we’re targeting for MeasureTheory, so it makes for a nice test.


The problem you’re describing is modelled with Tobit regression in my world. That would be an interesting thing to add to one of the econometrics packages.

Yep, I believe (but don’t know for sure) that this is doing exactly the same as Tobit regression, but with priors over m, c, and σ.

The lines

for i in 1:length(x)
    if censor[i] == 0  # observed
        Turing.acclogp!(_varinfo, logpdf(Normal(m*x[i]+c, σ), y[i]))
    elseif censor[i] == 1 # right/upper censored
        Turing.acclogp!(_varinfo, logccdf(Normal(m*x[i]+c, σ), y[i]))
    elseif censor[i] == -1 # left/lower censored
        Turing.acclogp!(_varinfo, logcdf(Normal(m*x[i]+c, σ), y[i]))
    end
end

modify the likelihood in the same way as the Tobit model, I believe. But happy to be corrected on that.
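
For reference, the two-sided Tobit likelihood contribution of observation i is usually written as (standard textbook form, in the notation of this thread):

L_i(m, c, \sigma) =
\begin{cases}
\frac{1}{\sigma}\,\phi\!\left(\frac{y_i - (m x_i + c)}{\sigma}\right) & \text{uncensored} \\
\Phi\!\left(\frac{L - (m x_i + c)}{\sigma}\right) & \text{left-censored at } L \\
1 - \Phi\!\left(\frac{U - (m x_i + c)}{\sigma}\right) & \text{right-censored at } U
\end{cases}

Taking logs gives exactly the logpdf / logcdf / logccdf branches in the loop above.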

You can use Turing.@addlogprob! instead of Turing.acclogp!:

Turing.@addlogprob! logpdf(Normal(m*x[i]+c, σ), y[i])

The macro just rewrites it into a call to Turing.acclogp!, but you don’t have to spell out the internal variable _varinfo yourself, so your model doesn’t depend on that internal name.


That’s pretty nice. I’ll edit the post to use that approach.

Distributions.jl now supports censored distributions, which you can use in the likelihood of your Turing model: https://juliastats.org/Distributions.jl/stable/censored/
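
A quick sketch of what that buys you (behaviour as I read the docs, worth double-checking): the censored distribution puts point masses at the bounds, matching the logcdf / logccdf terms in the hand-rolled model above.

using Distributions

d = censored(Normal(0, 1), -5, 5)  # clamp draws from Normal(0, 1) to [-5, 5]

logpdf(d, 0.0)   # interior point: same as logpdf(Normal(0, 1), 0.0)
logpdf(d, 5.0)   # upper bound: log point mass, log P(X >= 5) = logccdf(Normal(0, 1), 5.0)
logpdf(d, -5.0)  # lower bound: log point mass, log P(X <= -5) = logcdf(Normal(0, 1), -5.0)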


Very nice

@model function censored_regression(x, y, L, U)
    m ~ Normal(0, 2)
    c ~ Normal(0, 2)
    σ ~ truncated(Normal(0, 20), lower=0)
    y .~ censored.(Normal.(m .* x .+ c, σ), lower=L, upper=U)
end