Simple Linear Regression: Domain Error with 0.0

@mcreel – thanks for your explanation here.

The Imputation step I applied does not drop any
missing values, instead it replaces the records
with adjacent values (assuming the observations
are based on the same individual). After the
imputation, there are 6 rows.

With this replacement strategy, the columns are presumably no longer linearly independent.

How might you troubleshoot this?

There is a large literature, just search for “rank deficient regression”. There’s no clear best solution to the problem.
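As a first troubleshooting step, one can compare the numerical rank of the design matrix with its number of columns. The sketch below uses only the LinearAlgebra standard library; the data and the collinear third column are made up for illustration and are not the data from this thread.

```julia
using LinearAlgebra

# Hypothetical design matrix: an intercept plus two predictors, where the
# third column is exactly twice the second, so the columns are collinear.
x1 = [4.15, 4.33, 4.40, 4.21, 4.05, 4.50]
X = hcat(ones(6), x1, 2 .* x1)

r = rank(X)                      # numerical rank from the singular values
ncols = size(X, 2)

println("columns = ", ncols, ", rank = ", r)
println("rank deficient: ", r < ncols)
```

If the rank is smaller than the number of columns, at least one coefficient cannot be identified from the data.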


When I run this, it seems to work fine. Can you try running this in a fresh environment to make sure there isn’t something else messing this up?

On a side note, you don’t typically need to normalize in a linear regression like this.

Thank you for your note @junder873

Since each of the columns is linearly
independent, I thought normalization
would not confound the regression
model. Are you saying that, without
this processing step, I could still
generate a sensible model?

Here is a Stack Overflow answer that does a far better job than I could.

The short version is that an OLS regression really doesn’t care about scale: you could multiply all your values by a billion and the slope coefficients would stay the same. You can also multiply a single column by any nonzero number and its t-statistic will remain the same; the coefficient is simply scaled by the inverse of what you multiplied by.
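The scaling claim can be checked numerically. This is a minimal sketch on simulated data using only the standard library; the helper names `ols` and `tstats` are hypothetical, and the standard errors use the textbook formula rather than anything taken from GLM.jl.

```julia
using LinearAlgebra, Random

Random.seed!(1)
n = 50
x = randn(n)
y = 1.0 .+ 2.0 .* x .+ 0.1 .* randn(n)

# Least-squares coefficients via the backslash operator.
ols(X, y) = X \ y

# Classical t-statistics: coefficient divided by its standard error.
function tstats(X, y)
    β = ols(X, y)
    σ2 = sum(abs2, y - X * β) / (size(X, 1) - size(X, 2))
    return β ./ sqrt.(σ2 .* diag(inv(X' * X)))
end

X = hcat(ones(n), x)
c = 1000.0
Xc = hcat(ones(n), c .* x)       # same predictor, multiplied by c

β, βc = ols(X, y), ols(Xc, y)
t, tc = tstats(X, y), tstats(Xc, y)

println(β[2] ≈ c * βc[2])        # slope is scaled by 1/c
println(t[2] ≈ tc[2])            # t-statistic is unchanged
println(X * β ≈ Xc * βc)         # fitted values are unchanged
```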


Thank you @junder873

What I took away from the Stack Overflow stream was
normalization can help with printability for presentations
but is not altogether necessary, especially on modern
machines that perform some standardization by design.

For a general decision-making reference, OLS is scale-invariant:
rescaling the predictors rescales the coefficients but does not
change the fit. Penalized estimators like ridge or lasso are not
scale-invariant, so normalization is encouraged for those and
similar methods.
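To illustrate the contrast, here is a rough sketch comparing OLS with a closed-form ridge estimator on simulated data. The `ridge` helper, the penalty `λ = 10`, and the column scaling are all assumptions chosen for illustration; no intercept is included to keep the example short.

```julia
using LinearAlgebra, Random

Random.seed!(2)
n = 100
X = randn(n, 2)
y = X * [1.0, -1.0] .+ 0.1 .* randn(n)

# Closed-form ridge estimator: (X'X + λI)^(-1) X'y.
ridge(X, y, λ) = (X' * X + λ * I) \ (X' * y)

λ = 10.0
S = Diagonal([1.0, 0.01])        # shrink the scale of the second column
Xs = X * S

# Fitted values before and after rescaling the second predictor.
ridge_gap = norm(X * ridge(X, y, λ) - Xs * ridge(Xs, y, λ))
ols_gap = norm(X * (X \ y) - Xs * (Xs \ y))

println("ridge fit changes by: ", ridge_gap)
println("OLS fit changes by:   ", ols_gap)
```

The ridge fit moves noticeably because the penalty acts on the coefficients in whatever units the columns happen to be in, while the OLS fit is the same up to rounding.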

In response to

Can you try running this in a fresh environment

I restarted the Julia session, started a new
environment, did not normalize, and am
still getting the same error as above:

DomainError with 0.0:
FDist: the condition ν2 > zero(ν2) is not satisfied.

Can you run this in a fresh session:

using DataFrames, Impute, GLM, LinearAlgebra

df = DataFrame(x1 = [missing, 4.15, 4.33, missing, 4.4, missing], 
   x2 = [missing, 58.57, 56.94, missing, 49.4, missing], 
   x3 = [3.0, 4.45, 3.71, 2.6, 3.41, missing])

df = Impute.interp(df) |> Impute.locf() |> Impute.nocb()

df_matrix = Matrix(df)

df = DataFrame(normalize(df_matrix, 1000), :auto)

lm(@formula(x3 ~ x1 + x2), df)

With this I get:

julia> lm(@formula(x3 ~ x1 + x2), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

x3 ~ 1 + x1 + x2

Coefficients:
────────────────────────────────────────────────────────────────────────────
                   Coef.  Std. Error      t  Pr(>|t|)   Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)   0.21061       0.592283   0.36    0.7457   -1.6743     2.09552
x1           -2.04389       6.12876   -0.33    0.7607  -21.5483    17.4606
x2           -0.00233653    0.168541  -0.01    0.9898   -0.538709   0.534036
────────────────────────────────────────────────────────────────────────────

So the likeliest explanation is that you’re not actually running the code you’ve posted above.


The issue I identified, prompted by
your response, was that one should
not use too many predictors when
fitting the OLS model. In my case the
original DF had 13 columns, and I
attempted to use all of them in
the lm(@formula) call.
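A rough sketch of why this produces the error: with 6 rows, the residual degrees of freedom (observations minus estimable coefficients) drop to zero once the model has 6 or more coefficients, and ν2 in the error message is presumably that quantity. The random data and the helper `dof_resid` below are hypothetical; only the row count and the rough number of columns come from the thread.

```julia
using LinearAlgebra, Random

Random.seed!(3)
n = 6                                    # rows left after imputation

X_small = hcat(ones(n), randn(n, 2))     # intercept + 2 predictors
X_big = hcat(ones(n), randn(n, 12))      # intercept + 12 predictors (assumed)

# Residual degrees of freedom: observations minus estimable coefficients.
dof_resid(X) = size(X, 1) - rank(X)

println("small model: ", dof_resid(X_small))
println("big model:   ", dof_resid(X_big))
```

With zero residual degrees of freedom the model fits the sample exactly, so there is nothing left over to estimate the error variance, and the F distribution used for the p-values is not defined.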

This post was temporarily hidden by the community for possibly being off-topic, inappropriate, or spammy.

@huang_min

I normalized for presentation purposes,
not to change the coefficient outputs from
the OLS.

Could you clearly explain 1000 in normalize()? I suspect that you do not ever read the help file.

1000 is simply a scaling parameter. Since each
column in my df is linearly independent, it is less
presentable when I plot the outputs. The value
itself is arbitrary, but ultimately it is intended to
standardize the dataset across the attributes.
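For reference, the second positional argument of `LinearAlgebra.normalize` is the order `p` of the norm, and the whole array is divided by a single number, `norm(A, p)`. The small matrix below is made up for illustration.

```julia
using LinearAlgebra

A = [3.0 4.0; 1.0 2.0]

# normalize(A, p) divides every entry by norm(A, p), so that the
# p-norm of the result equals 1. It does not rescale each column.
B = normalize(A, 1000)

println(norm(A, 1000))           # for large p this approaches maximum(abs, A)
println(norm(B, 1000))
println(B ≈ A ./ norm(A, 1000))
```

Because every entry is divided by the same constant, this leaves the relative scale of the columns untouched.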


I think there’s no need to get personal here - I would tend to agree that it would be helpful for OP to consult some introductory statistics/econometrics textbooks to get a better understanding of the methodologies he’s using, but that’s not all that relevant to the concrete question at hand.

What I would humbly ask of you, @YummyPampers2, though, is to be respectful of other people’s time and effort spent helping you. Practically, this means reading “Please read: make it easier to help you” and following the advice given there, in particular making sure that a question is backed up by an example which produces the actual error you’re seeing.

In this case four people have tried helping you with a problem that could not actually be reproduced from the code you posted, and only 24 hours and 17 posts into the thread did you reveal that actually you were running an entirely different regression on different data when you got the error you were asking about. In the event you were lucky that an econometrics professor from one of the leading departments in Europe was on hand to correctly guess what your problem is despite the inadequate MWE, but it’s easy to see how in a slightly different situation the wild goose chase could have gone on for quite some time.



@nilshg – thanks for your note.

For this thread, as I expressed, I used
too many attributes for the OLS in my
original formulation. I was able to
adapt what @mcreel had suggested
about degrees of freedom and sample
size, and reduce the dimensions used
to inform the OLS model.

The issue was resolved.


FWIW I think this kind of error will be fixed by https://github.com/JuliaStats/GLM.jl/pull/458. Then you will get a coefficients table with infinite standard errors and confidence intervals and p-values equal to 1 for the problematic coefficients. Hopefully this will make it a bit easier to understand what is going on.
