Conformal Predictive distributions

Note: this is updated code (corrects bugs in previous versions)

Recently, there has been a rise in interest in “prior-free posterior distributions”.
There is a neat new related package, ConformalPrediction.jl (by @pat-alt), which gives prediction intervals for supervised ML models.

Using training data D_{n} \equiv \{ (X_i, Y_i)\}_{i=1}^{n}, fit a model \hat{f} so that Y_{i} \approx \hat{f}(X_{i}).
Next, given a new X_{n+1}, a prediction interval is a set-valued function
C_{1-\alpha}= C(D_{n}, X_{n+1}, \alpha) = \left[ L_{1-\alpha}, U_{1-\alpha}\right]
such that P( Y_{n+1} \in C_{1-\alpha}) \geq 1-\alpha for miscoverage \alpha \in [0,1].

My goal is to recover the conditional CDF \hat{F}(Y_{n+1}|X).

My first guess: suppose we have a very simple, stylized environment, with symmetric prediction intervals, perfect coverage, intervals centered at the median, etc.
If P( Y_{n+1} \in C_{1-\alpha}) = 1-\alpha and C_{1-\alpha}= \left[ L_{1-\alpha}, U_{1-\alpha}\right]
then P( Y_{n+1} \leq L_{1-\alpha}) = \alpha/2 and P( Y_{n+1} \leq U_{1-\alpha}) = 1- \alpha/2,
then for quantile q\in [0,0.5], \hat{F}^{-1}(q)=L_{1-2q}, and for quantile q\in [0.5,1], \hat{F}^{-1}(q)=U_{2q-1}.
Does this make sense?
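As a sanity check of this inversion, here is a minimal sketch using a known distribution in place of the (unknown) conditional CDF; the Normal here is purely illustrative, not conformal output:

using Distributions
F = Normal(0.0, 1.0)                # stand-in for the unknown F(Y_{n+1}|X)
L(α) = quantile(F, α/2)             # lower end of a symmetric, perfectly calibrated 1-α interval
U(α) = quantile(F, 1 - α/2)         # upper end of that interval
q = 0.30;  L(2q)     ≈ quantile(F, q)   # L_{1-2q} recovers F⁻¹(q)  -> true
q = 0.80;  U(2 - 2q) ≈ quantile(F, q)   # U_{2q-1} recovers F⁻¹(q)  -> true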

It’s possible to imagine cases where this wouldn’t work, for example C_{90\%}=[L_{90\%},U_{90\%}] where P( Y_{n+1} \leq L_{90\%}) =2\% and P( Y_{n+1} \leq U_{90\%}) = 92\%.
However, we still have P(Y_{n+1}\in C_{90\%}) = 90\%.
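For a concrete instance of this caveat, take a skewed distribution in place of F(Y_{n+1}|X) (again purely illustrative, not conformal output):

using Distributions
Fskew = LogNormal(0.0, 1.0)
L90, U90 = quantile(Fskew, 0.02), quantile(Fskew, 0.92)   # an asymmetric 90% interval
cdf(Fskew, U90) - cdf(Fskew, L90)   # = 0.90 coverage, yet cdf(Fskew, L90) = 0.02 ≠ 0.05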

Here is an illustration:

using MLJ, Distributions, ConformalPrediction, Plots, Random, Tables;
LinearRegressor =  @load LinearRegressor  pkg=MLJLinearModels
################################################################################
# Make Data  #
################################################################################
n= 10_000; σ=1.0;
X = [ones(n) sin.(1:n) cos.(n:-1:1) exp.(sin.(1:n))]
θ = [9.0; 7.0; 3.0; 21]
Noise = randn(MersenneTwister(49), n);
cef(x;θ=θ)  = x*θ;      # Conditional Expectation Function 
skedastic(x;σ=σ) = σ;   # Skedastic Function
y = cef(X;θ=θ) + skedastic(X;σ=σ)*Noise
train, calibration, test = partition(eachindex(y), 0.4, 0.4, shuffle=true, rng=444)
################################################################################
# Fit model on training data  #
################################################################################
model = LinearRegressor(fit_intercept = false) 
mach = machine(model, Tables.table(X), y)
fit!(mach, rows=train)
pr_y = predict(mach)
### make sure predictions unbiased outside the training sample.
[mean(pr_y[jj] - y[jj]) for jj in [:, train, calibration, test] ]
# plot(pr_y - y) # plot(pr_y[test] - y[test]) # plot(pr_y[train] - y[train])
################################################################################
# Calibrate on calibration data  #
# Compute prediction intervals on test data #
################################################################################
conf_mach = conformal_machine(mach)
calibrate!(conf_mach, selectrows(X, calibration), y[calibration])

# 1: compare simulated coverage w/ theoretical coverage
grid_coverage = 0.00:0.01:1.00;
sim_coverage =[];
for jj in eachindex(grid_coverage)
    PI = predict(conf_mach, X[test,:]; coverage = grid_coverage[jj] )
    PI_LB = [PI[j][1][2][] for j in 1:length(test)]
    PI_UB = [PI[j][2][2][] for j in 1:length(test)]
    m = mean(PI_LB .<= y[test] .<= PI_UB)   # % of time ynew is in PI.
    push!(sim_coverage, m)
end
plot(legend=:topleft, xlabel = "Desired Coverage", ylabel = "Simulated Coverage",)
plot!(grid_coverage, sim_coverage, lab="Simulated Coverage")
plot!(grid_coverage, x->x, c=3, l=:dash, lab="Perfect Coverage") # , lw=3

# 2: Plot PI for coverage = 90% 
PI_90 = predict(conf_mach, X[test,:]; coverage = 0.90)
PI_90_LB = [PI_90[j][1][2][] for j in 1:length(test)]
PI_90_UB = [PI_90[j][2][2][] for j in 1:length(test)]

plot()
plot!(PI_90_LB)
plot!(PI_90_UB)
plot!(y[test])
plot!(pr_y[test])

# 3: Plot PI length for each X[test,:] for coverage = 90% 
plot( PI_90_UB .- PI_90_LB, ylims=(0,2*mean( PI_90_UB .- PI_90_LB)), lab="Length 90% CI for each X")

# 4: recover CDF F̂(Y_{n+1} | X) at one point. Compare w/ true CDF
taz = 14; xt = [X[test[taz],:] ;;]'
grid_coverage = 0.00:0.01:1.00; LB = []; UB = [];
for jj in eachindex(grid_coverage)
    PI = predict(conf_mach, xt; coverage = grid_coverage[jj] )
    push!(LB, PI[1][1][2][])
    push!(UB, PI[1][2][2][])
end
#
sim_quantiles = [sort(LB); UB]   # lower bounds (sorted ascending), then upper bounds
probs         = [0.5*grid_coverage; 0.5 .+ 0.5*grid_coverage]   # probability levels matching [sort(LB); UB] (works because the coverage grid is uniform & symmetric on [0,1])
point_predict = pr_y[test[taz]]
CDF_true      = Normal((xt*θ)[], σ)
#
plot(legend=:topleft)
plot!(sim_quantiles, probs, lab="Empirical quantiles")
plot!([point_predict], seriestype = :vline, lab="y prediction", color="red") 
plot!(sim_quantiles, x -> cdf(CDF_true, x)  , lab="True CDF, N(Xθ,σ)",l=:dash, lw=3)

1: compare the simulated coverage w/ perfect coverage [figure]

2: plot the PI length for 90% coverage at each X in the test sample [figure]
Note: the constant interval length means these PIs cannot be heteroskedastic, unlike the PIs in Chernozhukov et al. (PNAS, 2021).
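For reference, here is a hypothetical tweak to the DGP above (the names skedastic_het and y_het are my own, not part of the code above) that makes the noise scale depend on X; with the simple split-conformal procedure used here, the 90% PI would still have constant length and could not adapt to it:

skedastic_het(x; σ=σ) = σ .* (0.5 .+ x[:, 4])          # noise scale varies with the 4th regressor
y_het = cef(X; θ=θ) .+ skedastic_het(X; σ=σ) .* Noise  # heteroskedastic outcome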

3: attempt to recover the CDF from the conformal PIs & compare w/ the true CDF [figure]

Here is code to compare 6 models (linear, EvoTrees, LGBM, XGB, NuSVR, EpsilonSVR):

# Fit multiple models 
# Load the extra model types first (assumes the corresponding interface packages
# EvoTrees.jl, LightGBM.jl, XGBoost.jl, and LIBSVM.jl are installed):
EvoTreeRegressor = @load EvoTreeRegressor pkg=EvoTrees
LGBMRegressor    = @load LGBMRegressor    pkg=LightGBM
XGBoostRegressor = @load XGBoostRegressor pkg=XGBoost
NuSVR            = @load NuSVR            pkg=LIBSVM
EpsilonSVR       = @load EpsilonSVR       pkg=LIBSVM
#
mod_set =[LinearRegressor(fit_intercept = false)
    EvoTreeRegressor(nrounds = 100)
    LGBMRegressor(num_iterations = 100) 
    XGBoostRegressor(num_round = 100) 
    NuSVR()
    EpsilonSVR()
    ];
#

grid_coverage = 0.00:0.01:1.00;
#
m_pr_y = zeros(n, length(mod_set))
m_sim_coverage = zeros(length(grid_coverage),length(mod_set))
m_sim_quantiles = zeros(2*length(grid_coverage),length(mod_set)) 
m_probs = zeros(2*length(grid_coverage),length(mod_set)) 
m_point_predict = zeros(length(mod_set))

# plot(legend=:topleft)
for mm in eachindex(mod_set)
    # mm=3
    model = mod_set[mm]
    println("$model")
    #
    ### Fit on training data 
    #
    mach = machine(model, Tables.table(X), y)
    fit!(mach, rows=train)
    pr_y = predict(mach)
    m_pr_y[:,mm] = pr_y
    #
    # [mean(pr_y[jj] - y[jj]) for jj in [:, train, calibration, test] ] |> display
    # plot!(pr_y - y, lab="$model") |> display # plot(pr_y[test] - y[test]) # plot(pr_y[train] - y[train])
    #
    ### Calibrate on calibration data/Compute prediction intervals on test data
    #
    conf_mach = conformal_machine(mach)
    calibrate!(conf_mach, selectrows(X, calibration), y[calibration])
    # 1: compare simulated coverage w/ theoretical coverage
    # grid_coverage = 0.00:0.01:1.00;
    sim_coverage =[];
    for jj in eachindex(grid_coverage)
        PI = predict(conf_mach, X[test,:]; coverage = grid_coverage[jj] )
        PI_LB = [PI[j][1][2][] for j in 1:length(test)]
        PI_UB = [PI[j][2][2][] for j in 1:length(test)]
        m = mean(PI_LB .<= y[test] .<= PI_UB)   # % of time ynew is in PI.
        push!(sim_coverage, m)
    end
    m_sim_coverage[:,mm] = sim_coverage
    #
    # 4: recover CDF F̂(Y_{n+1} | X) at one point. Compare w/ true CDF
    taz = 14; 
    # xt = [X[test[taz],:] ;;]'
    xt = selectrows(X, test[taz])
    LB = []; UB = [];   # grid_coverage is already defined above
    for jj in eachindex(grid_coverage)
        PI = predict(conf_mach, xt; coverage = grid_coverage[jj] )
        push!(LB, PI[1][1][2][])
        push!(UB, PI[1][2][2][])
    end
    #
    sim_quantiles = [sort(LB);UB]
    probs         = [0.5*grid_coverage; 0.5 .+ 0.5*grid_coverage]
    point_predict = pr_y[test[taz]]
    # CDF_true      = Normal((xt*θ)[], σ)
    m_sim_quantiles[:,mm] = sim_quantiles
    m_probs[:,mm] =probs 
    m_point_predict[mm] = point_predict
end     

plot(legend=:topleft, xlabel = "Desired Coverage", ylabel = "Simulated Coverage",)
plot!(grid_coverage, x->x, c=3, l=:dash, lw=3, lab="Perfect Coverage") # 
plot!(grid_coverage, m_sim_coverage[:,1], lab="Linear")
plot!(grid_coverage, m_sim_coverage[:,2], lab="EvoTreeRegressor")
plot!(grid_coverage, m_sim_coverage[:,3], lab="LGBMRegressor")
plot!(grid_coverage, m_sim_coverage[:,4], lab="XGBoostRegressor")
plot!(grid_coverage, m_sim_coverage[:,5], lab="NuSVR")
plot!(grid_coverage, m_sim_coverage[:,6], lab="EpsilonSVR")
##########
plot(legend=:topleft, xlabel = "Desired Coverage", ylabel = "Coverage Error",)
plot!(grid_coverage, grid_coverage .- m_sim_coverage[:,1], lab="Linear")
plot!(grid_coverage, grid_coverage .- m_sim_coverage[:,2], lab="EvoTreeRegressor")
plot!(grid_coverage, grid_coverage .- m_sim_coverage[:,3], lab="LGBMRegressor")
plot!(grid_coverage, grid_coverage .- m_sim_coverage[:,4], lab="XGBoostRegressor")
plot!(grid_coverage, grid_coverage .- m_sim_coverage[:,5], lab="NuSVR")
plot!(grid_coverage, grid_coverage .- m_sim_coverage[:,6], lab="EpsilonSVR")
plot!([0.0], seriestype = :hline, l=:dash, lab="perfect", color="red") 
##########################
plot(legend=:topleft)
#
plot!(m_sim_quantiles[:,1], m_probs[:,1], lab="Linear Regression")
plot!(m_sim_quantiles[:,2], m_probs[:,2], lab="EvoTreeRegressor")
plot!(m_sim_quantiles[:,3], m_probs[:,3], lab="LGBMRegressor")
plot!(m_sim_quantiles[:,4], m_probs[:,4], lab="XGBoostRegressor")
plot!(m_sim_quantiles[:,5], m_probs[:,5], lab="NuSVR")
plot!(m_sim_quantiles[:,6], m_probs[:,6], lab="EpsilonSVR")
# plot!([m_point_predict], seriestype = :vline, lab="y point predictions", color="red") 
plot!(range(minimum(m_sim_quantiles), maximum(m_sim_quantiles), length=202), x -> cdf(CDF_true, x)  , lab="True CDF, N(Xθ,σ)",l=:dash, lw=3)

First compare coverage across models: [figure]
Or the coverage error: [figure]
Next compare the simulated conditional CDFs \hat{F}(Y_{n+1}|X) with the truth: [figure]

Can anyone point me to theory about the distribution of the statistic \hat{F}(Y_{n+1}|X)?

Another simple example: suppose the error \varepsilon \sim U(0,1). Then y= X\theta +\sigma \varepsilon \sim U(X\theta, X\theta +\sigma), hence the true conditional CDF is F(y|x) = \frac{y-x\theta}{\sigma} for y \in [x\theta, x\theta+\sigma].
Repeat the code above, except:

Noise = rand(MersenneTwister(49), n); # Uniform[0,1] noise
μ = (xt*θ)[];                         # conditional mean at the test point xt
CDF_true = Uniform(μ, μ + σ)

[figure]

2 Likes

If I understood you correctly, in Bayesian statistics this comes under the “non-informative improper prior”. This is pretty standard in Bayesian statistics; you may want to look at Andrew Gelman’s book Bayesian Data Analysis. I am writing this from memory, but it is fairly straightforward.

  • The standard non-informative prior distribution is uniform on (\beta,\log \sigma^2), which is equivalent to:
p(\beta,\sigma^2)\propto p(\beta)\, p(\sigma^2) = \sigma^{-2}
  • By Bayes’ rule, the posterior distribution is:
p(\beta_1,\dots,\beta_k,\sigma^2\mid y,X)\propto \prod_{i=1}^n p(y_i\mid \mu_i,\sigma^2)\, p(\beta_1,\dots,\beta_k,\sigma^2), \quad \text{where } \mu_i = x_i^T\beta
  • This prior is a good choice for statistical models when you have a lot of data points and only a few parameters.

  • The Bayesian solution is equivalent to OLS or MLE, and you enjoy the advantage of making direct Bayesian probabilistic statements.

  • The posterior distribution of \sigma^2 can be written as:

\sigma^2 \mid y,X \sim \text{Scaled-Inv-}\chi^2(n-k,\, s^2)

where s^2=(y-X\hat{\beta})^T(y-X\hat{\beta})/(n-k)

  • The marginal posterior distribution of \beta follows a multivariate t-distribution, i.e.,
\beta \mid y,X \sim t_{n-k}(\hat{\beta},\, s^2(X^TX)^{-1})
  • Notice the close comparison with the classical results. The key difference would be the interpretation of the standard errors.

  • To find the predictive distribution, we require more information. Suppose X_0 is the test point and we want to estimate y_0.

  • So we assume a prior on y_0 of the form y_0\sim N(X_0\beta, \Sigma_{00}); then we have

\begin{pmatrix} y\\ y_0 \end{pmatrix}\sim N\left( \begin{bmatrix} X\beta\\ X_0\beta \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{00} \end{bmatrix} \right), \qquad \Sigma_{11} = s^2(X^TX)^{-1}.
  • By the standard formula for conditioning in a multivariate normal, we can express
y_0 \mid y \sim N\left(X_0\hat{\beta}+\Sigma_{21}\Sigma_{11}^{-1}(y-X\hat{\beta}),\; \Sigma_{00} -\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\right).
  • Hence the posterior estimate of y_0 for a given X_0 is
\mathbb{E}(y_0\mid y,X,X_0) = X_0\hat{\beta}+\Sigma_{21}\Sigma_{11}^{-1}(y-X\hat{\beta}).
  • Note that the correction term \Sigma_{21}\Sigma_{11}^{-1}(y-X\hat{\beta}) is often known as a Kalman update (a small numerical sketch follows below).
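To make this concrete, here is a minimal sketch of sampling that posterior predictive under the noninformative prior, reusing X, y, train, and test from the code earlier in the thread (the helper names and the choice of test point are my own assumptions, not part of the original post):

using Distributions, LinearAlgebra
Xtr, ytr = X[train, :], y[train]
k   = size(Xtr, 2)
β̂   = Xtr \ ytr                                      # OLS fit = posterior mean of β
s²  = sum(abs2, ytr - Xtr*β̂) / (length(ytr) - k)     # classical s²
V   = Symmetric(inv(Xtr' * Xtr))
x0  = X[test[14], :]                                  # same test point as taz = 14 above
ν   = length(ytr) - k
ydraws = map(1:10_000) do _
    σ² = rand(InverseGamma(ν/2, ν*s²/2))              # Scaled-Inv-χ²(ν, s²) draw
    β  = β̂ + cholesky(σ² * V).L * randn(k)            # β | σ², y ~ N(β̂, σ² (XᵀX)⁻¹)
    rand(Normal(dot(x0, β), sqrt(σ²)))                # posterior predictive draw of y₀
end
# The empirical CDF of ydraws is the Bayesian predictive distribution at x0,
# comparable to the conformal F̂(Y_{n+1}|X) recovered above.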

I hope this was helpful.
Thanks and regards,
Sourish

3 Likes

So it’s not Bayes, and also not Bayes with non-informative priors. Here are notes by Larry Wasserman, who is involved in this work: https://www.stat.cmu.edu/~larry/=sml/Conformal

1 Like

First of all, wow @Albert_Zevelev, it’s really cool that you’ve already engaged with this package so much - thank you! I’m sorry for the delayed response; I was away for the past couple of days and have come down w/ Covid.

Just some high-level thoughts on this from my side:

I’m finding it difficult to think about conformal predictive distributions in the same way that I’m used to thinking about posterior predictive distributions (in the Bayesian setting as outlined by @Sourish). In particular, since CP is inherently free of prior distributional assumptions, it does not follow the standard Bayesian paradigm where we update prior beliefs about parameter distributions using data/evidence to then form posterior beliefs. If your goal is to derive a functional form for the predictive distribution, I think the Bayesian context is perhaps more useful.

As for theoretical properties of the resulting predictive distribution in CP, the literature seems to talk about validity guarantees, but not so much about the functional properties of the resulting distribution. The paper I linked here and related papers (see here and here) appear to derive empirical predictive distributions along the lines of what you’ve done here. This is useful because:

Their advantage over the usual conformal prediction intervals is that a conformal predictive distribution Q_n contains more information and can produce a plethora of prediction intervals.

Vovk et al. (2017)

But again, not sure how much (if anything) can really be said about the functional form of Q_n.

At this point, for my own research I am primarily interested in instance-based post-hoc uncertainty estimates (for which conformal predictors are sufficient it seems). Nonetheless, recovering predictive distributions is of course very interesting, so I’ll definitely keep the issue open.

Sorry that I can’t be of more help and thanks again for this!

1 Like

I’m always skeptical of efforts to do stuff in a Bayes-like way without all the Bayes machinery. Remember that Bayes is the unique formalism that satisfies Cox’s theorem. If conformal prediction isn’t Bayes, then it does something that makes it not satisfy Cox’s theorem, and this pretty much guarantees that thing is illogical and weird. However, it might not be an issue in practice in some circumstances.

1 Like

Interesting, thanks. It gives me the idea of connecting Larry’s note on “Conformal Prediction” with standard “Bayesian updating”. I will sharpen the argument and come back. I think there is a one-to-one relationship between “Conformal Prediction” and “Bayesian updating”. Thanks for the link again.

Also, on a second look at Larry’s note, it looks somewhat like Fisher’s fiducial inference or the frequentist pivot idea. I must read the papers in Larry’s note to understand the philosophy completely.

If you can share more literature (links) on this topic that will be a great help.

2 Likes

My fav paper on this:
https://www.pnas.org/doi/abs/10.1073/pnas.2107794118

1 Like

Aah wonderful … thanks a lot

Here is the most relevant paper, which can implement CPDs for any underlying regressor, whether statistical or machine learning: Cross-conformal predictive distributions. There is also a toy Python package, pysloth, on PyPI.

2 Likes

@Sourish Conformal Prediction has nothing to do with Bayesian inference. Bayesian methods do not have validity guarantees, whilst CP always has validity guarantees in finite samples of any size, regardless of the underlying regressor and the data distribution. https://arxiv.org/abs/2107.00363

1 Like

Yes, I figured that out: the “Conformal Prediction” philosophy is very different from the Bayesian one.

  • I don’t understand what you mean by “validity guarantees”.

  • Anyway, we will figure out the difference between apples and oranges :)

As long as you accept whatever bogus definition of “validity” was used. This is one of those cases where a common word is given a technical meaning and then traded on its common meaning… like “significant”

All I have to do is say “valid means corresponds to having the properties of agreeing with boolean logic in the limit as in Cox’s theorem” and then 100% Bayesian has a validity guarantee, and Conformal Prediction is guaranteed to be invalid.

It seems from the Wikipedia article on Conformal prediction that Conformal Prediction is about converting the point estimates of ML models into set-valued estimates, in such a manner that the frequency with which the truth is not in the set-valued prediction is no more than some fixed frequency \alpha.
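A minimal split-conformal sketch of that idea (generic, not tied to any package’s API; predict_fn, Xcal, ycal, xnew are placeholder names I am assuming):

using Statistics
# Split conformal regression: calibrate absolute residuals of any point predictor,
# take the conformal (1-α) quantile of the scores, and pad the prediction by that amount.
function conformal_interval(predict_fn, Xcal, ycal, xnew; α = 0.10)
    scores = abs.(ycal .- [predict_fn(x) for x in eachrow(Xcal)])     # nonconformity scores
    n      = length(scores)
    q̂      = quantile(scores, min(1.0, ceil((n + 1) * (1 - α)) / n))  # conformal quantile
    ŷ      = predict_fn(xnew)
    return (ŷ - q̂, ŷ + q̂)        # marginal coverage ≥ 1-α under exchangeability
end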

According to Wald’s complete class theorem (An Essentially Complete Class of Admissible Decision Functions), Conformal Prediction is strictly dominated by some element of the class of Bayesian decision rules, even in terms of the frequency of getting the right answer.

@dlakelan I hate to be the “tone police”, but there is no need to use the adjective “bogus” here.
I appreciate anyone taking the time to provide feedback, including you.

More generally, in statistics and other technical fields, authors often define a “desirable property” X of an estimator and construct an estimator w/ that property (under certain assumptions P).
Other authors can disagree on whether X is in fact “desirable”, or on whether the assumptions P under which the estimator is “desirable” are realistic enough for certain purposes…

1 Like

Worth adding that Bayes and CP are not mutually exclusive ideas. As Anastasios and Bates put it in Section 2.4 of their tutorial on CP:

“[…] we should incorporate any prior information we have into our prediction sets.”

2 Likes

Yes, I considered whether that was necessary or not. In the end, I think the problem is that taking a technical definition and then naming it after a common and loaded word like “valid” is a very serious problem that produces a lot of downstream social, political, and scientific issues. So in this instance, in some sense, I’m calling out what I consider an actual bogus practice.

If, for example, the technical property is that “when the assumptions are true, the intervals are guaranteed to contain the true value precisely 95% of the time in the long run under repeated random sampling…”, then call this “precise frequency guarantees” or something, not “validity”.

1 Like

In the tutorial @pat-alt linked, they define:
a prediction set C(X_{\text{test}}) is valid in the following sense:
1 -\alpha \leq P\left( Y_{\text{test}} \in C(X_{\text{test}}) \right) \leq 1 -\alpha + \frac{1}{n+1}
where (X_{\text{test}}, Y_{\text{test}}) is a fresh test point from the same distribution, and \alpha \in [0,1] is a user-chosen error rate.

Also, they call:
  • 1 -\alpha \leq P\left( Y_{\text{test}} \in C(X_{\text{test}}) \mid X_{\text{test}} \right) the conditional coverage property,
  • 1 -\alpha \leq P\left( Y_{\text{test}} \in C(X_{\text{test}}) \right) \leq 1 -\alpha + \frac{1}{n+1} the marginal coverage property (aka valid PIs).
They cite research that argues that, in general, conditional coverage is impossible to achieve…
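As a quick check of the marginal bound against the simulation earlier in the thread (this reuses grid_coverage, sim_coverage, and calibration from the first code block; the slack term is my own rough allowance for finite-test-set Monte-Carlo noise):

n_cal = length(calibration)
slack = 0.02                                   # rough allowance for test-set noise
in_band = [c - slack <= s <= c + 1/(n_cal + 1) + slack
           for (c, s) in zip(grid_coverage, sim_coverage)]
mean(in_band)                                  # fraction of coverage levels inside the marginal-coverage band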

These don’t seem like unreasonable definitions to me…

This is a “Frequency is the definition of probability and everything else is invalid” assumption. It is quite offensive and antagonistic. It’s basically designed to let you say things like “Bayesian intervals are invalid”.

Let’s not make confidence intervals more frequentist than they are: I’d put it as, under the assumptions, the full procedure of sampling a single time and computing a confidence interval will, with >= 95% probability, produce an interval that contains the true parameter.

2 Likes

@mschauer I’m not quite sure what your statement means? The thing you’re quoting was not written with any kind of careful attention to detail, I admit.

Also, to all: I don’t really want to derail this thread. Conformal Prediction is probably useful to some people some of the time, and is probably compatible with Bayesian intervals under some conditions. I just really dislike overloading the meaning of common words in ways that harm critical thinking. I think “significant” is one of those examples. How often do doctors doing research discover that a treatment is “not significant” and therefore conclude it obviously has no effect, is meaningless, no one should study it, etc.? Also, I once had a dentist tell me that an expensive treatment produced a “significant difference in gum pocket depth”. Reading carefully, the brochure said the difference was on average approximately the diameter of a single epithelial cell… Bayes measures a different thing than frequency in the long run, but the thing it measures is “valid”.

2 Likes

Hmm… I am pretty sure similar results are also available for Bayesian predictive regression; I think similar results were given by our prof as an assignment. I may need to pull out my old notebooks from my graduate-study days. The only thing I cannot tell is whether the bound is sharper or not. Let me work on it, and I will get back to you. Thanks again.

1 Like