Simple Linear Regression: Domain Error with 0.0

Greetings Julians:

I have produced a data frame whose columns
are all of eltype Float64, as follows:

Col1 = rand(1:0.01:500,6)
Col2 = rand(1:0.01:500,6)
Col3 = rand(1:0.01:500,6)
Matr = hcat(Col1,Col2,Col3)

I normalized the Matrix as:

using LinearAlgebra
matr_norm = normalize(Matr, 1000)

Converted to a dataframe as:

MetroDF = DataFrame(matr_norm, :auto)

I am encountering an issue when attempting
to generate a regression model with:

using GLM
ols = lm(@formula(Col3~Col1+Col2), MetroDF)

The error reads:

Failed to show value:
DomainError with 0.0:
FDist: the condition ν2 > zero(ν2) is not satisfied

Might anyone have an idea how to
address this error?

When you create the data frame, the names are x1, x2, and x3:

julia> MetroDF = DataFrame(matr_norm, :auto)
6×3 DataFrame
 Row │ x1         x2        x3       
     │ Float64    Float64   Float64  
─────┼───────────────────────────────
   1 │ 0.136877   0.223588  0.466813
   2 │ 1.0        0.755881  0.545698
   3 │ 0.534322   0.523622  0.177383
   4 │ 0.566349   0.40899   0.710739
   5 │ 0.674702   0.317932  0.329718
   6 │ 0.0358678  0.567678  0.43628

so you need to call the linear fit as
ols = lm(@formula(x1~x2+x3), MetroDF)
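Alternatively, you can keep your original names by passing them to the DataFrame constructor instead of :auto. A minimal sketch of that (the random matrix here is just a stand-in for matr_norm):

using DataFrames, GLM

# stand-in for matr_norm from the post above
matr_norm = rand(6, 3)

# name the columns explicitly instead of letting :auto generate x1, x2, x3
MetroDF = DataFrame(matr_norm, [:Col1, :Col2, :Col3])

# the original formula now matches the column names
ols = lm(@formula(Col3 ~ Col1 + Col2), MetroDF)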

Notice how you construct MetroDF, yet use Metro_DF in the regression. Perhaps you constructed Metro_DF another way that makes something (perhaps a component in the F-test?) go to 0.0, which causes the error. Also see the post above by @mcreel.

Thank you – @mcreel

I followed your approach; however, I am wondering
whether an imputation I performed to fill missing
records, or a renaming step, had some impact.

MetroDF is about the same; the only differences
are the column names and some row values that
mirror those of nearby rows. I used:

Impute.interp(OriginalDF) |> Impute.locf() |> Impute.nocb()

There were no missing values after this. However, is
there a chance the lm(@formula…) operation is treating
some value as NaN or 0?
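For reference, one way to check for leftover missing or NaN values would be something like the following sketch (the small data frame here is a made-up stand-in for MetroDF):

using DataFrames

# made-up stand-in for the imputed data frame
MetroDF = DataFrame(x1 = [0.14, 1.0, 0.53], x2 = [0.22, 0.76, 0.52], x3 = [0.47, 0.55, 0.18])

# number of missing values per column
describe(MetroDF, :nmissing)

# true if any floating-point entry is NaN
any(x -> x isa AbstractFloat && isnan(x), Matrix(MetroDF))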

Thank you @amrods – it was a transcription error;
the original workspace did not have that mismatch.
Do you think the steps I described to @mcreel
above could have had some impact? Perhaps the
scaling (1000) I used during the normalization step?

Sorry, can’t tell from this information. You need to provide an MWE as described in “Please read: make it easier to help you”.


@mcreel

The original DF (metro) had the following values:

[screenshot of the original metro data frame]

I converted all columns to Float64 as a general
quality-assurance check, using the broadcasted
float function:

metro[!, [1,2,3]] = float.(metro[!, [1,2,3]])

From here I applied the imputation I described as:

METRO = Impute.interp(metro) |> Impute.locf() |> Impute.nocb()

I converted this METRO to a Matrix as:

METRO_matr = Matrix(METRO)

Followed by normalization as:

using LinearAlgebra
METRO_norm = normalize(METRO_matr, 1000)

Then, I converted the matrix to a dataframe as:

METRO_DF = DataFrame(METRO_norm, :auto)

From here, I applied the GLM commands you
expressed before as:

ols = lm(@formula(x3~x1+x2), METRO_DF)

Which is returning the error I expressed before
as:

DomainError with 0.0:
FDist: the condition ν2 > zero(ν2) is not satisfied

That is saying that the second degrees-of-freedom parameter of the F(q, n-k) test is not positive. Here n is the number of observations and k is the number of regressors, including the constant (3 in your case). So it seems that your number of observations is 3 or less. What is the number of rows of the data frame after dropping missings? If the screenshot is the entire sample, it is 3, which agrees with these comments.
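That specific error can be reproduced outside of GLM. A minimal sketch, assuming Distributions.jl, of what happens when the second degrees-of-freedom parameter n - k reaches zero:

using Distributions

n = 3               # observations
k = 3               # regressors, including the constant
ν2 = float(n - k)   # second degrees of freedom of the F-test, 0.0 here

# FDist requires ν2 > 0, so this throws the DomainError quoted above
FDist(2.0, ν2)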

@mcreel – thanks for your explanation here.

The imputation step I applied does not drop any
missing values; instead it replaces the missing
records with adjacent values (assuming the
observations are based on the same individual).
After the imputation, there are 6 rows.

The columns must not be linearly independent with this replacement strategy.

How might you troubleshoot this?

There is a large literature; just search for “rank deficient regression”. There’s no clear best solution to the problem.
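As a first diagnostic, you can check whether the design matrix still has full column rank after the imputation. A minimal sketch (the data frame here is made up, with x2 an exact multiple of x1 to force the problem):

using DataFrames, LinearAlgebra

df = DataFrame(x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
               x2 = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],   # exactly 2 * x1
               x3 = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9])

# design matrix with an intercept column, as lm would construct it
X = hcat(ones(nrow(df)), df.x1, df.x2)

rank(X)   # 2, fewer than the 3 columns, so the regression is rank deficient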


When I run this, it seems to work fine. Can you try running this in a fresh environment to make sure there isn’t something else messing this up?

On a side note, you don’t typically need to normalize in a linear regression like this.

Thank you for your note @junder873

Since each of the columns is linearly
independent, I thought normalization
would not confound the regression
model. Are you saying that, without this
step, I could still generate a sensible model?

Here is a Stack Overflow answer that does a far better job than I could.

The short version is that an OLS regression really doesn’t care: you could multiply all your values by a billion and the coefficients would stay the same. You can also multiply a single column by any number and the t-stat will remain the same; the coefficient will be scaled by the inverse of what you multiplied it by.
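A quick way to see this is to scale one regressor and compare the two fits. A minimal sketch with made-up data:

using DataFrames, GLM

df = DataFrame(x1 = [1.2, 3.4, 2.2, 5.1, 4.0, 2.8],
               x2 = [0.5, 1.0, 0.7, 1.8, 1.4, 0.9],
               y  = [2.0, 5.5, 3.4, 8.1, 6.3, 4.2])

m1 = lm(@formula(y ~ x1 + x2), df)

# rescale x1 by 1000: its coefficient shrinks by a factor of 1000,
# but its t-statistic and p-value stay the same
df.x1_big = 1000 .* df.x1
m2 = lm(@formula(y ~ x1_big + x2), df)

coef(m1)
coef(m2)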


Thank you @junder873

What I took away from the Stack Overflow thread was
that normalization can help with the readability of
results for presentations but is not strictly necessary,
especially since modern implementations handle the
scaling well by design.

As a general rule, OLS is scale-invariant, so
normalization will not meaningfully change the fit or
the test statistics (the coefficients simply rescale).
By contrast, methods like ridge or lasso are
scale-sensitive, so normalization is encouraged for
those and similar settings.

In response to

Can you try running this in a fresh environment

I restarted the Julia session, started a new
environment, did not normalize, and am
getting the same issue as above

DomainError with 0.0:
FDist: the condition ν2 > zero(ν2) is not satisfied.

Can you run this in a fresh session:

using DataFrames, Impute, GLM, LinearAlgebra

df = DataFrame(x1 = [missing, 4.15, 4.33, missing, 4.4, missing], 
   x2 = [missing, 58.57, 56.94, missing, 49.4, missing], 
   x3 = [3.0, 4.45, 3.71, 2.6, 3.41, missing])

df = Impute.interp(df) |> Impute.locf() |> Impute.nocb()

df_matrix = Matrix(df)

df = DataFrame(normalize(df_matrix, 1000), :auto)

lm(@formula(x3 ~ x1 + x2), df)

With this I get:

julia> lm(@formula(x3 ~ x1 + x2), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

x3 ~ 1 + x1 + x2

Coefficients:
────────────────────────────────────────────────────────────────────────────
                   Coef.  Std. Error      t  Pr(>|t|)   Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)   0.21061       0.592283   0.36    0.7457   -1.6743     2.09552
x1           -2.04389       6.12876   -0.33    0.7607  -21.5483    17.4606
x2           -0.00233653    0.168541  -0.01    0.9898   -0.538709   0.534036
────────────────────────────────────────────────────────────────────────────

So the likeliest explanation is that you’re not actually running the code you’ve posted above.


The issue I identified, prompted by
your response, was that one should not
use too many predictors relative to the
number of observations when fitting the
OLS. In my case the original DF had 13
columns, and I attempted to include all
of them in the lm(@formula) call.
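That matches the degrees-of-freedom explanation above. A minimal sketch of the arithmetic, assuming one of the 13 columns is the response, leaving 12 predictors plus an intercept:

n = 6           # observations after imputation
k = 12 + 1      # 12 predictors plus the intercept
ν2 = n - k      # -7, so the F-test's FDist(q, n - k) cannot be constructed

So with only 6 rows there is room for just a few regressors before this error appears.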

@huang_min

I normalized for presentation purposes,
not to change the coefficient outputs from
the OLS.