Weighted linear regression with confidence interval fitted to error bars

Hi,
I’m pretty new to data science.
I have some data with measurement errors, and I want to fit a linear model to it.

using DataFrames
using GLM
using StatsPlots

df = DataFrame(x = [0.0, 0.0669873, 0.25, 0.5, 0.75, 0.933013, 1.0],
y = [0.223, 0.291, 0.393, 0.549, 0.73, 0.85, 0.896],
u_y = [0.023, 0.024, 0.027, 0.031, 0.037, 0.041, 0.043])

# Weighted Least Squares
wlr = glm(@formula(y~x), df, Normal(), wts = 1 ./ df.u_y.^2)

x_interval = DataFrame(x = 0:0.01:1);
pred = predict(wlr, x_interval, interval = :confidence)
@df df scatter(:x, :y, yerror = :u_y)
plot!(x_interval.x, pred.prediction,
ribbon = (pred.prediction .- pred.lower, pred.upper .- pred.prediction))


I get this result:

The confidence interval is plotted, but it is so narrow that you can’t even see it.
Shouldn’t it be as wide as the measurement errors?
How do I fit the model accordingly?


Your question is hopefully answered here: “Plot the confidence interval for a model fit”.


GLM.jl’s glm does not interpret the weights wts as inverse variances but as prior frequency (case) weights. From the docstring:

    wts::Vector=similar(y,0): Prior frequency (a.k.a. case) weights of observations.
    Such weights are equivalent to repeating each observation a number of times equal
    to its weight. Do note that this interpretation gives equal point estimates but
    different standard errors from analytical (a.k.a. inverse variance) weights and
    from probability (a.k.a. sampling) weights which are the default in some other
    software. Can be length 0 to indicate no weighting (default).

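To see what “equivalent to repeating each observation” means in practice, here is a minimal sketch (the toy data below is made up for illustration): fitting with an integer frequency weight of 2 on one row should give the same point estimates as duplicating that row outright.

```julia
using DataFrames, GLM

toy = DataFrame(x = [0.0, 1.0, 2.0, 3.0], y = [0.1, 1.1, 1.9, 3.2])

# Frequency weight of 2 on the first observation ...
m_w = lm(@formula(y ~ x), toy, wts = [2.0, 1.0, 1.0, 1.0])

# ... versus literally duplicating that row.
toy2 = vcat(toy[1:1, :], toy)
m_d = lm(@formula(y ~ x), toy2)

coef(m_w) ≈ coef(m_d)   # point estimates agree; standard errors may differ
```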

There are two different confidence intervals to consider, the confidence interval for the posterior mean, and the interval for a new measurement. You are talking about the interval for a new measurement, which contains uncertainty about both the mean and the measurement.
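To illustrate the difference between the two intervals, here is a sketch using plain OLS on the same data (unweighted, just to show the two interval types; for lm models, predict accepts interval = :prediction as well as :confidence):

```julia
using DataFrames, GLM

df = DataFrame(x = [0.0, 0.0669873, 0.25, 0.5, 0.75, 0.933013, 1.0],
               y = [0.223, 0.291, 0.393, 0.549, 0.73, 0.85, 0.896])

ols  = lm(@formula(y ~ x), df)
newx = DataFrame(x = 0:0.25:1)

pred_ci = predict(ols, newx, interval = :confidence)   # interval for the mean
pred_pi = predict(ols, newx, interval = :prediction)   # interval for a new measurement

# The prediction interval contains the confidence interval at every point.
all(pred_pi.lower .<= pred_ci.lower) && all(pred_ci.upper .<= pred_pi.upper)
```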


So to get the right confidence interval, do I have to add the mean of the measurement errors to the confidence interval I get from glm?

As indicated in GLM.jl’s doc above, glm does not handle inverse-variance weighting.

For this purpose, you may use LsqFit.jl.

using DataFrames, LsqFit, Printf, Plots; gr()

df = DataFrame(x = [0.0, 0.0669873, 0.25, 0.5, 0.75, 0.933013, 1.0],
y = [0.223, 0.291, 0.393, 0.549, 0.73, 0.85, 0.896],
u_y = [0.023, 0.024, 0.027, 0.031, 0.037, 0.041, 0.043])

x, y = df.x, df.y
wt = 1 ./ df.u_y .^2

p0 = [0.5, 0.5]
m(x, p) = p[1] .+ p[2] * x         # p: model parameters
fit = curve_fit(m, x, y, wt, p0)

cf = coef(fit)
ci = confidence_interval(fit, 0.05)    # 5% significance level

str = @sprintf("Y = (%.2f +/- %.2f) + (%.2f +/- %.2f)*X",
cf[1],diff([ci[1]...])[1]/2, cf[2],diff([ci[2]...])[1]/2)

# Envelope of the lines built from the parameter CI extremes, as ribbon half-widths:
tl, bl = ci[1][1] .+ ci[2][2]*x,  ci[1][2] .+ ci[2][1]*x
σp, σm = maximum([tl bl], dims=2) .- m(x,cf),  m(x,cf) .- minimum([tl bl], dims=2)

plot(x, cf[1] .+ cf[2]*x, color=:lightblue, ribbon=(σp,σm), label=str)
plot!(x, cf[1] .+ cf[2]*x, color=:blues, lw=1, label=false, xlabel="X",ylabel="Y")
scatter!(x,y, ms=3,label=false,mc=:blue, yerror=df.u_y, legend=:topleft)
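As a side note, LsqFit also exposes the parameter standard errors directly via stderror, which gives another way to quote the ± values without unpacking the confidence-interval tuples. A minimal sketch with the same data:

```julia
using LsqFit

x   = [0.0, 0.0669873, 0.25, 0.5, 0.75, 0.933013, 1.0]
y   = [0.223, 0.291, 0.393, 0.549, 0.73, 0.85, 0.896]
u_y = [0.023, 0.024, 0.027, 0.031, 0.037, 0.041, 0.043]

m(x, p) = p[1] .+ p[2] * x
fit = curve_fit(m, x, y, 1 ./ u_y .^ 2, [0.5, 0.5])

stderror(fit)   # standard errors of the fitted parameters p[1], p[2]
```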



Maybe, but probably not. It really depends on what question you want to answer. Usually, you use measurements to learn about some underlying system, and you postulate that the system behaves according to the model you have identified. You are then typically interested in learning about the posterior over the model parameters after having seen the data. The confidence interval for a new measurement is usually not very interesting, since you often do not care about the measurement itself; you care about the thing you are trying to measure. It therefore makes sense to plot the confidence bounds like @rafael.guerra did, with a shaded ribbon around the posterior mean of the model prediction in the figure above.
