Understanding Linear Regression and finding R-Squared

I entered this dataframe and the GLM function below:

DEmployment = [0.036327933,0.034107058,0.030691709,0.029333333,-0.008635579,0.010452962,0.017672414,0.024565862,0.030177759,0.01565008,0.02094034,0.011609907,0.00841622,-0.012139605,0.006184067,0.006869125,0.008617594,-0.004271983,0.002860207,0.01426025,-0.006678383,-0.009200283,-0.003928571,0.000358551,0.006810036,0.005695977,-0.001769912,-0.007446809,0.009646302,0.009907997,0.001751927,0.036327933]

DBirths = [-0.014116817,-0.028458922,0.00036846,-0.010865562,-0.025507354,-0.007260222,-0.011932256,-0.002337359,-0.012495119,0,-0.012653223,0.057669203,0.007194245,0.030075188,-0.018691589,0.00952381,-0.009433962,0.00952381,0.028301887,0.027522936,0.008928571,0,-0.017699115,-0.009009009,0,-0.018181818,0,-0.009259259,-0.009345794,-0.028301887,-0.019417476,-0.028458922]

using DataFrames, GLM

df = DataFrame(A=DEmployment, B=DBirths)
ols = lm(@formula(A ~ B), df)

I received this output:

A ~ 1 + B

Coefficients:
──────────────────────────────────────────────────────────────────────────
             Estimate  Std. Error  t value  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)       2.5     1.11803  2.23607    0.1548   -2.31051    7.31051
B: M              0.0     1.58114  0.0        1.0000   -6.80309    6.80309
──────────────────────────────────────────────────────────────────────────

Does B: M refer to the probability of the slope? Where do I find R^2 for the relationship between A and B?

?r2

I don’t think the output you’re showing is produced by the inputs you’ve posted above. Both DBirths and DEmployment are just vectors of floating-point numbers, but your output B: M suggests that you have a categorical variable where one of the levels is M.
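For illustration, here’s a minimal sketch (with made-up toy data, not your vectors) of how a string column gets dummy-coded and produces a coefficient row labelled B: M:

```julia
using DataFrames, GLM

# Toy data: B is a string column with levels "F" and "M"
toy = DataFrame(A = [1.0, 2.0, 3.0, 4.0], B = ["F", "M", "F", "M"])

# GLM dummy-codes the categorical column, so the coefficient table
# shows a row for the indicator of level "M", labelled "B: M"
toy_ols = lm(@formula(A ~ B), toy)
coefnames(toy_ols)
```

So seeing B: M in your output usually means B was read in as strings (or some other categorical type) rather than as numbers.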

As @Mattriks says above, the R-squared of the model can be obtained by calling

julia> r2(ols)
0.0685632968228097

This is also explained in the docs here, which I recommend you read if you want to work with the GLM package.

Your question about the interpretation of the B: M coefficient suggests that it would also be helpful to consult introductory-level statistics or econometrics textbooks. A popular one that I’ve used myself to teach first-year undergrads is Wooldridge’s Introductory Econometrics.


I agree with @nilshg that something’s not right here. I ran your code, exactly as posted, and I got the following output:

julia> ols = lm(@formula(A  ~ B), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

A ~ 1 + B

Coefficients:
────────────────────────────────────────────────────────────────────────────────
                Estimate  Std. Error   t value  Pr(>|t|)    Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────────────
(Intercept)   0.00946811  0.00249931   3.7883     0.0007   0.00436385  0.0145724
B            -0.194064    0.130591    -1.48604    0.1477  -0.460767    0.0726395
────────────────────────────────────────────────────────────────────────────────

I’m not an expert in this arena, but I’ve spent a lot of time re-teaching myself linear regression, and recently I’ve been trying to learn generalized linear models at a deeper level, so I think I can contribute, having been in your shoes.

Regarding the interpretation of the estimates for the coefficients, there are loads of great resources online for learning about linear regression. I usually don’t like telling people to “just Google it”, but I think this is one of those cases where you should just Google it and start reading. You need to not only understand what the coefficients mean, but also the rest of the statistics in the coefficient table. For example, it’s really important in this case that you know how to interpret the p-value for your coefficient B, reported in the table under the column Pr(>|t|) (hint: can you say with confidence that the coefficient for B is not 0?).
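If it helps, the numbers in that printed table can also be pulled out programmatically via GLM’s coeftable. A sketch with a hypothetical toy regression (the same calls work on your ols object):

```julia
using DataFrames, GLM

# Hypothetical toy regression, just to show how to read the table
toy = DataFrame(x = [1.0, 2.0, 3.0, 4.0, 5.0],
                y = [1.1, 1.9, 3.2, 3.9, 5.1])
m = lm(@formula(y ~ x), toy)

ct = coeftable(m)                            # the table you see printed
pcol = findfirst(==("Pr(>|t|)"), ct.colnms)  # locate the p-value column
pvals = ct.cols[pcol]                        # p-values for (Intercept) and x
```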

Regarding R2 specifically, there is the r2() function but you can also compute this yourself quite easily in this case, which I think will help you gain a deeper understanding of what R2 is actually measuring:

julia> r2(ols)
0.0685632968228097 # this is the GLM r2 function

To understand how to compute this manually, I would recommend reading this. Here’s the code to do it:

using Statistics

# Total sum of squares
SSₜ = sum((DEmployment .- mean(DEmployment)).^2)

# Regression sum of squares
SSᵣ = sum((predict(ols) .- mean(DEmployment)).^2)

julia> R² = SSᵣ / SSₜ
0.06856329682280965 # this matches the r2 value above
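For completeness, you can reach the same R² from the residual side, since with an intercept the total sum of squares splits into regression plus residual sums of squares. A self-contained sketch with made-up toy data (your df works the same way):

```julia
using DataFrames, GLM, Statistics

toy = DataFrame(A = [0.01, 0.03, -0.02, 0.00, 0.02],
                B = [0.02, -0.01, 0.03, 0.00, -0.02])
m = lm(@formula(A ~ B), toy)

SST = sum((toy.A .- mean(toy.A)).^2)       # total sum of squares
SSR = sum((predict(m) .- mean(toy.A)).^2)  # regression sum of squares
SSE = sum(residuals(m).^2)                 # residual sum of squares

R2_from_SSR = SSR / SST      # the approach above
R2_from_SSE = 1 - SSE / SST  # equivalent, via residuals
```

Both agree with r2(m), which is a nice sanity check that the decomposition holds.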

Thanks, that explained it, although I’m getting 0.00, whereas your answer looks like what I got in Excel.

What package did you use to find the mean?

Statistics:

using Statistics

It’s a standard Julia library so you shouldn’t have to Pkg.add it.


They must be, but I don’t know how. I ran the lm formula again, and got the same output. I also get the same r2 value. So that’s good, although I’d like to know what I did wrong the first time.

OK, thanks. Would doing a signed-rank test mess up the regression?

No. It’s not clear how you got B: M, but that wouldn’t have caused your problem.

It works! And I should have known to add using Statistics; that’s why it said that mean was undefined.

Do you mean B = {b|M}?