Note: this is updated code (corrects bugs in previous versions)
Recently, there has been a rise in interest in “prior-free posterior distributions”.
There is a related neat new package ConformalPrediction.jl (by @pat-alt)
This package gives prediction intervals for supervised ML models.
Using training data D_{n} \equiv \{ (X_i, Y_i)\}_{i=1}^{n}, fit a model Y_{i} = f(X_{i}).
Next, given a new X_{n+1}, a prediction interval is a set-valued function
C_{1-\alpha}= C(D_{n}, X_{n+1}, \alpha) = \left[ L_{1-\alpha}, U_{1-\alpha}\right]
such that P( Y_{n+1} \in C_{1-\alpha}) \geq 1-\alpha for miscoverage \alpha \in [0,1].
My goal is to recover the conditional CDF \hat{F}(Y_{n+1}|X).
My first guess.
Suppose we have a very simple stylized environment. W/ symmetric prediction intervals, w/ perfect coverage, centered at the median etc.
If P( Y_{n+1} \in C_{1-\alpha}) = 1-\alpha and C_{1-\alpha}= \left[ L_{1-\alpha}, U_{1-\alpha}\right]
then P( Y_{n+1} \leq L_{1-\alpha}) = \alpha/2 and P( Y_{n+1} \leq U_{1-\alpha}) = 1- \alpha/2,
then for quantile q\in [0,0.5], \hat{F}^{-1}(q)=L_{1-2q}, and for quantile q\in [0.5,1], \hat{F}^{-1}(q)=U_{2q-1}.
Does this make sense?
It’s possible to imagine cases where this wouldn’t work, for example C_{90\%}=[L_{90\%},U_{90\%}] where P( Y_{n+1} \leq L_{90\%}) =2\% and P( Y_{n+1} \leq U_{90\%}) = 92\%.
However we still have P(Y_{n+1}\in C_{90\%}) = 90\%.
Here is an illustration:
using MLJ, Distributions, ConformalPrediction, Plots, Random, Tables;
LinearRegressor = @load LinearRegressor pkg=MLJLinearModels
################################################################################
# Make Data #
################################################################################
n= 10_000; σ=1.0;
X = [ones(n) sin.(1:n) cos.(n:-1:1) exp.(sin.(1:n))]
θ = [9.0; 7.0; 3.0; 21]
Noise = randn(MersenneTwister(49), n);
cef(x;θ=θ) = x*θ; # Conditional Expectation Function
skedastic(x;σ=σ) = σ; # Skedastic Function
y = cef(X;θ=θ) + skedastic(X;σ=σ)*Noise
train, calibration, test = partition(eachindex(y), 0.4, 0.4, shuffle=true, rng=444)
################################################################################
# Fit model on training data #
################################################################################
model = LinearRegressor(fit_intercept = false)
mach = machine(model, Tables.table(X), y)
fit!(mach, rows=train)
pr_y = predict(mach)
### make sure predictions unbiased outside the training sample.
[mean(pr_y[jj] - y[jj]) for jj in [:, train, calibration, test] ]
# plot(pr_y - y) # plot(pr_y[test] - y[test]) # plot(pr_y[train] - y[train])
################################################################################
# Calibrate on calibration data #
# Compute prediction intervals on test data #
################################################################################
conf_mach = conformal_machine(mach)
calibrate!(conf_mach, selectrows(X, calibration), y[calibration])
# 1: compare simulated coverage w/ theoretical coverage
grid_coverage = 0.00:0.01:1.00;
sim_coverage =[];
for jj in eachindex(grid_coverage)
PI = predict(conf_mach, X[test,:]; coverage = grid_coverage[jj] )
PI_LB = [PI[j][1][2][] for j in 1:length(test)]
PI_UB = [PI[j][2][2][] for j in 1:length(test)]
m =mean(PI_LB .<= y[test] .<= PI_UB) # % of time ynew is in PI.
push!(sim_coverage, m)
end
plot(legend=:topleft, xlabel = "Desired Coverage", ylabel = "Simulated Coverage",)
plot!(grid_coverage, sim_coverage, lab="Simulated Coverage")
plot!(grid_coverage, x->x, c=3, l=:dash, lab="Perfect Coverage") # , lw=3
# 2: Plot PI for coverage = 90%
PI_90 = predict(conf_mach, X[test,:]; coverage = 0.90)
PI_90_LB = [PI_90[j][1][2][] for j in 1:length(test)]
PI_90_UB = [PI_90[j][2][2][] for j in 1:length(test)]
plot()
plot!(PI_90_LB)
plot!(PI_90_UB)
plot!(y[test])
plot!(pr_y[test])
# 3: Plot PI length for each X[test,:] for coverage = 90%
plot( PI_90_UB .- PI_90_LB, ylims=(0,2*mean( PI_90_UB .- PI_90_LB)), lab="Length 90% CI for each X")
# 4: recover CDF F̂(Y_{n+1} | X) at one point. Compare w/ true CDF
taz = 14; xt = [X[test[taz],:] ;;]'
grid_coverage = 0.00:0.01:1.00; LB = []; UB = [];
for jj in eachindex(grid_coverage)
PI = predict(conf_mach, xt; coverage = grid_coverage[jj] )
push!(LB, PI[1][1][2][])
push!(UB, PI[1][2][2][])
end
#
sim_quantiles = [sort(LB);UB]
probs = [0.5*grid_coverage; 0.5 .+ 0.5*grid_coverage]
point_predict = pr_y[test[taz]]
CDF_true = Normal((xt*θ)[], σ)
#
plot(legend=:topleft)
plot!(sim_quantiles, probs, lab="Empirical quantiles")
plot!([point_predict], seriestype = :vline, lab="y prediction", color="red")
plot!(sim_quantiles, x -> cdf(CDF_true, x) , lab="True CDF, N(Xθ,σ)",l=:dash, lw=3)
1: compare the simulated coverage w/ perfect coverage
2: plot the PI length for 90% coverage at each X in the test sample
Note: Constant length doesn’t allow for heteroskedastic PIs as in Chernozhukov et al PNAS 2021.
3: attempt to recover the CDF from the conformal PIs & compare w/ the true CDF
Here is code to compare 6 models (linear, EvoTrees, LGBM, XGB, NuSVR, EpsilonSVR):
# Fit multiple models
mod_set =[LinearRegressor(fit_intercept = false)
EvoTreeRegressor(nrounds = 100)
LGBMRegressor(num_iterations = 100)
XGBoostRegressor(num_round = 100)
NuSVR()
EpsilonSVR()
];
#
grid_coverage = 0.00:0.01:1.00;
#
m_pr_y = zeros(n, length(mod_set))
m_sim_coverage = zeros(length(grid_coverage),length(mod_set))
m_sim_quantiles = zeros(2*length(grid_coverage),length(mod_set))
m_probs = zeros(2*length(grid_coverage),length(mod_set))
m_point_predict = zeros(length(mod_set))
# plot(legend=:topleft)
for mm in eachindex(mod_set)
# mm=3
model = mod_set[mm]
println("$model")
#
### Fit on training data
#
mach = machine(model, Tables.table(X), y)
fit!(mach, rows=train)
pr_y = predict(mach)
m_pr_y[:,mm] = pr_y
#
# [mean(pr_y[jj] - y[jj]) for jj in [:, train, calibration, test] ] |> display
# plot!(pr_y - y, lab="$model") |> display # plot(pr_y[test] - y[test]) # plot(pr_y[train] - y[train])
#
### Calibrate on calibration data/Compute prediction intervals on test data
#
conf_mach = conformal_machine(mach)
calibrate!(conf_mach, selectrows(X, calibration), y[calibration])
# 1: compare simulated coverage w/ theoretical coverage
# grid_coverage = 0.00:0.01:1.00;
sim_coverage =[];
for jj in eachindex(grid_coverage)
PI = predict(conf_mach, X[test,:]; coverage = grid_coverage[jj] )
PI_LB = [PI[j][1][2][] for j in 1:length(test)]
PI_UB = [PI[j][2][2][] for j in 1:length(test)]
m =mean(PI_LB .<= y[test] .<= PI_UB) # % of time ynew is in PI.
push!(sim_coverage, m)
end
m_sim_coverage[:,mm] = sim_coverage
#
# 4: recover CDF F̂(Y_{n+1} | X) at one point. Compare w/ true CDF
taz = 14;
# xt = [X[test[taz],:] ;;]'
xt = selectrows(X, test[taz])
grid_coverage = 0.00:0.01:1.00; LB = []; UB = [];
for jj in eachindex(grid_coverage)
PI = predict(conf_mach, xt; coverage = grid_coverage[jj] )
push!(LB, PI[1][1][2][])
push!(UB, PI[1][2][2][])
end
#
sim_quantiles = [sort(LB);UB]
probs = [0.5*grid_coverage; 0.5 .+ 0.5*grid_coverage]
point_predict = pr_y[test[taz]]
# CDF_true = Normal((xt*θ)[], σ)
m_sim_quantiles[:,mm] = sim_quantiles
m_probs[:,mm] =probs
m_point_predict[mm] = point_predict
end
plot(legend=:topleft, xlabel = "Desired Coverage", ylabel = "Simulated Coverage",)
plot!(grid_coverage, x->x, c=3, l=:dash, lw=3, lab="Perfect Coverage") #
plot!(grid_coverage, m_sim_coverage[:,1], lab="Linear")
plot!(grid_coverage, m_sim_coverage[:,2], lab="EvoTreeRegressor")
plot!(grid_coverage, m_sim_coverage[:,3], lab="LGBMRegressor")
plot!(grid_coverage, m_sim_coverage[:,4], lab="XGBoostRegressor")
plot!(grid_coverage, m_sim_coverage[:,5], lab="NuSVR")
plot!(grid_coverage, m_sim_coverage[:,6], lab="EpsilonSVR")
##########
plot(legend=:topleft, xlabel = "Desired Coverage", ylabel = "Coverage Error",)
plot!(grid_coverage, grid_coverage .- m_sim_coverage[:,1], lab="Linear")
plot!(grid_coverage, grid_coverage .- m_sim_coverage[:,2], lab="EvoTreeRegressor")
plot!(grid_coverage, grid_coverage .- m_sim_coverage[:,3], lab="LGBMRegressor")
plot!(grid_coverage, grid_coverage .- m_sim_coverage[:,4], lab="XGBoostRegressor")
plot!(grid_coverage, grid_coverage .- m_sim_coverage[:,5], lab="NuSVR")
plot!(grid_coverage, grid_coverage .- m_sim_coverage[:,6], lab="EpsilonSVR")
plot!([0.0], seriestype = :hline, l=:dash, lab="perfect", color="red")
##########################
plot(legend=:topleft)
#
plot!(m_sim_quantiles[:,1], m_probs[:,1], lab="Linear Regression")
plot!(m_sim_quantiles[:,2], m_probs[:,2], lab="EvoTreeRegressor")
plot!(m_sim_quantiles[:,3], m_probs[:,3], lab="LGBMRegressor")
plot!(m_sim_quantiles[:,4], m_probs[:,4], lab="XGBoostRegressor")
plot!(m_sim_quantiles[:,5], m_probs[:,5], lab="NuSVR")
plot!(m_sim_quantiles[:,6], m_probs[:,6], lab="EpsilonSVR")
# plot!([m_point_predict], seriestype = :vline, lab="y point predictions", color="red")
plot!(range(minimum(m_sim_quantiles), maximum(m_sim_quantiles), length=202), x -> cdf(CDF_true, x) , lab="True CDF, N(Xθ,σ)",l=:dash, lw=3)
First compare coverage across models:
Or coverage error
Next compare the simulated conditional CDFs \hat{F}(Y_{n+1}|X) with the truth
Can anyone point me to theory about the distribution of the statistic \hat{F}(Y_{n+1}|X)?
Another simple example: suppose the error \varepsilon \sim U(0,1), then y= X\theta +\sigma \varepsilon \sim U(X\theta, X\theta +\sigma), hence the true conditional CDF F(y|x) = \frac{y-x\theta}{\sigma}.
Repeat code above except:
Noise = rand(MersenneTwister(49), n); # Uniform[0,1]
CDF_true = Uniform(μ,μ +σ)