Meaning of `GLM.lm` results (`t` and `Pr(>|t|)`)

Hi all!

This is a very noobish question, and I was unsure whether I should have put it in the “New to Julia” section, so sorry about that.

Anyway, when I fit a linear model using the GLM package, like this:

```julia
using DataFrames
using CSV
using GLM

df = DataFrame(CSV.File("raw_planar_data.csv"));

fm = @formula(z ~ x + y)
@time(model = lm(fm, df))
```

Julia prints this pretty table:

```
julia> @time(model = lm(fm, df))
  0.048781 seconds (28.79 k allocations: 2.329 MiB, 40.01% gc time, 99.00% compilation time: 100% of which was recompilation)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

z ~ 1 + x + y

Coefficients:
──────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error       t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)  0.0413819  0.36563       0.11    0.9099  -0.675407   0.758171
x            1.99483    0.00960031  207.79    <1e-99   1.97601    2.01365
y            0.994769   0.00950755  104.63    <1e-99   0.97613    1.01341
──────────────────────────────────────────────────────────────────────────
```

From what I understand, Julia performs a t-test for each parameter \beta_i, testing

\begin{align} \mathbb{H}_0 &: \beta_i = 0 \\ \mathbb{H}_1 &: \beta_i \neq 0 \end{align}

Now, please tell me if I got it right:

  • t is the value of the t-statistic for each test
  • Pr(>|t|) is the p-value for each test
  • Lower 95% and Upper 95% are the bounds of the 95% confidence interval (that is, significance level \alpha = 0.05) for each parameter

Is this all correct, or did I interpret something wrongly?

Thanks!

P.S., side question (if you like): how is the standard error calculated in a linear regression test?

You got it all right. By the way, `@time` doesn’t require parentheses; you can use it like this:

```julia
@time model = lm(fm, df)
```
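If you want to see where each column of the table comes from, here is a minimal sketch that recomputes them from the fitted model. Since `raw_planar_data.csv` isn't available here, it uses made-up data (the data and the names `t`, `p`, `lower`, `upper` are my own assumptions), and it assumes Distributions.jl is installed for the t distribution:

```julia
using DataFrames, GLM, Distributions, Random

# Synthetic stand-in for raw_planar_data.csv (an assumption, not the real data)
Random.seed!(1)
n = 200
df = DataFrame(x = randn(n), y = randn(n))
df.z = 2 .* df.x .+ df.y .+ 0.1 .* randn(n)

model = lm(@formula(z ~ x + y), df)

b   = coef(model)          # "Coef." column
se  = stderror(model)      # "Std. Error" column
dof = dof_residual(model)  # residual degrees of freedom, n - 3 here

# t statistic for H0: beta_i = 0
t = b ./ se

# two-sided p-value, i.e. Pr(>|t|) under a t distribution with dof degrees of freedom
p = 2 .* ccdf.(TDist(dof), abs.(t))

# 95% confidence interval: estimate ± t quantile × standard error
half = quantile(TDist(dof), 0.975) .* se
lower, upper = b .- half, b .+ half
```

Comparing `hcat(lower, upper)` with `confint(model)`, and `t` and `p` with the printed table, should show the same numbers up to rounding.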

This sounds suspiciously like a homework problem. There are plenty of resources online to learn about how standard errors are calculated.

It isn’t, but I see why it looked that way, sorry. The more complete question should have been: “what is the statistic used to assess the mean of a coefficient in a LM, from which I can then calculate the standard error?”. I have taken a stats course at university, but it didn’t cover linear regression. Anyway, I have probably found the answer, and it’s already too advanced for my curiosity-motivated study, so I think I’ll just trust the software library and use the results.

As a resource, Stock and Watson’s Introduction to Econometrics has a very good description of OLS.
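For the side question about the standard error: under the classical OLS assumptions (homoskedastic errors), the usual textbook estimate for a model with n observations, p coefficients, design matrix X and residuals r_j is

\begin{align} \hat\sigma^2 &= \frac{\sum_j r_j^2}{n - p} \\ \operatorname{se}(\hat\beta_i) &= \sqrt{\hat\sigma^2 \left[(X^\top X)^{-1}\right]_{ii}} \end{align}

and the t statistic is then \hat\beta_i / \operatorname{se}(\hat\beta_i). A rough sketch using only the standard library, on made-up data (the data and all variable names are my own assumptions, not from the original post):

```julia
using LinearAlgebra, Random

# Synthetic plane z = 2x + y plus unit-variance noise (illustrative data only)
Random.seed!(42)
n = 500
x = 10 .* rand(n)
y = 10 .* rand(n)
z = 2 .* x .+ y .+ randn(n)

X = hcat(ones(n), x, y)       # design matrix with an intercept column
beta_hat = X \ z              # least-squares estimate of the coefficients
r = z .- X * beta_hat         # residuals
p = size(X, 2)                # number of coefficients

sigma2 = sum(abs2, r) / (n - p)              # residual variance estimate
se = sqrt.(diag(sigma2 .* inv(X' * X)))      # standard error of each coefficient
t = beta_hat ./ se                           # t statistic for H0: beta_i = 0
```

As far as I can tell this is also what `lm` reports by default, but treat that as an assumption and compare against `stderror(model)` on your own data. Heteroskedasticity-robust standard errors use a different formula.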


If you move quickly away from Frequentist stats and towards Bayesian stats then the answer is always very simple: everything is derived from the posterior distribution.

The frequentist tests for regression stuff can mostly be seen as approximations to Bayes under some improper prior distribution.

I just always do Bayes, but sometimes do GLM type stuff and interpret as convenient quick approximation of Bayes.

To @pdeffebach : thanks, but my question was about the test performed: are you suggesting that book because the test is part of the standard OLS procedure?

To @dlakelan : I still haven’t grasped the difference between frequentist and Bayesian statistics. Is it important for conducting a multiple linear regression? Or can I “just trust Julia” (and, let’s say, accept the parameter when Pr(>|t|) < 0.05 as usual)?

This accept-and-reject stuff is definitely what’s wrong with much of Frequentist stats. For example, if you have Pr(>|t|) = 0.07, will you “accept that the slope really is zero”? That is a very poor way to do things. The proper interpretation is rather that you have insufficient information to determine what sign the slope has. In the real world almost nothing is exactly 0, and a small sample size is no reason to conclude strongly that a parameter is actually 0. Similarly, if p < 0.05 in one dataset and p > 0.05 in another, it is very wrong to say that in condition one the parameter is nonzero and approximately equal to the estimated value, while in condition two the parameter is exactly 0, and that therefore the estimate of the difference of the effects is such and such…

It is worth it to avoid falling into the many many logical fallacies that are committed by the nonspecialist using the usual rituals of Null Hypothesis Significance Testing.

If you have not already had too much standard stats education, you are in a good position to avoid making these mistakes :sweat_smile: Perhaps look into Kruschke’s “Doing Bayesian Data Analysis” or some other similar very introductory book, mainly to build up a proper intuition for valid inferences rather than the many fallacies.

See also “Scientists rise up against statistical significance”.

Thanks for the suggestion! :+1: Yes I know that “not rejecting the null hypothesis” doesn’t mean “the null hypothesis is true”, but you’re right, knowing myself I would have got distracted and assumed it :woozy_face:

My question (a bit too pragmatic, I admit) was “is this statistic sufficiently solid to trust the usual significance level (0.05) in a normal regression problem?”. The answer, I get now, is “it depends”.

One minor thing: what do you mean by being “in a good position to avoid making these mistakes”?

Because I would have said, “since I have only a basic stats education, I am especially prone to error”.

I mean, it will be easier for you to unlearn the wrong thinking you were taught in 1 semester than the wrong thinking you have developed over several years of a stats masters etc.
