Simple Linear Regression: Domain Error with 0.0

YummyPampers2 · March 30, 2022, 3:38am

Greetings Julians:

I have produced a dataframe whose
columns are all eltype ‘Float64’ as:

Col1 = rand(1:0.01:500,6)
Col2 = rand(1:0.01:500,6)
Col3 = rand(1:0.01:500,6)
Matr = hcat(Col1,Col2,Col3)

I normalized the Matrix as:

using LinearAlgebra
matr_norm = normalize(Matr, 1000)

Converted to a dataframe as:

using DataFrames
MetroDF = DataFrame(matr_norm, :auto)

I encounter an issue when I attempt
to generate a regression model with

using GLM
ols = lm(@formula(Col3~Col1+Col2), MetroDF)

The error reads:

Failed to show value:
DomainError with 0.0:
FDist: the condition ν2 > zero(ν2) is not satisfied

Might anyone have an idea how to
address this error?

mcreel · March 30, 2022, 4:54am

When you create the data frame, the names are x1, x2, and x3:

julia> MetroDF = DataFrame(matr_norm, :auto)
6×3 DataFrame
 Row │ x1         x2        x3       
     │ Float64    Float64   Float64  
─────┼───────────────────────────────
   1 │ 0.136877   0.223588  0.466813
   2 │ 1.0        0.755881  0.545698
   3 │ 0.534322   0.523622  0.177383
   4 │ 0.566349   0.40899   0.710739
   5 │ 0.674702   0.317932  0.329718
   6 │ 0.0358678  0.567678  0.43628

so you need to call the linear fit as
ols = lm(@formula(x1~x2+x3), MetroDF)

amrods · March 30, 2022, 5:00am

Notice how you construct MetroDF, yet use Metro_DF in the regression. Perhaps you constructed Metro_DF another way that makes something (perhaps a component in the F-test?) go to 0.0, which causes the error. Also see the post above by @mcreel.

YummyPampers2 · March 30, 2022, 10:38am

Thank you – @mcreel

I followed your approach, but I am wondering
whether an imputation I performed to fill missing data
records, or a renaming, had some impact.

The MetroDF is about the same, with the only
difference being, the column names and some of
the row values mirroring those nearby. I used:

Impute.interp(OriginalDF) |> Impute.locf() |> Impute.nocb()

There were no missing values after this. However, is
there a chance the lm(@formula…) operation is treating
some value as NaN or 0?

YummyPampers2 · March 30, 2022, 10:40am

Thank you @amrods – it was a transcription error
but the original workspace did not have this error.
Do you think the methods I addressed to @mcreel
above potentially had some impact? Perhaps the
scaling (1000) I used during the normalization step?

mcreel · March 30, 2022, 11:19am

Sorry, can’t tell from this information. You need to provide an MWE, as described in the pinned post “Please read: make it easier to help you”.

YummyPampers2 · March 30, 2022, 5:04pm

@mcreel

The original DF (metro) had value set:

I converted all columns to float, as a general quality
assurance check using the broadcast float function as:

metro[!, [1,2,3]] = float.(metro[!, [1,2,3]])

From here I applied the imputation I described as:

METRO = Impute.interp(metro) |> Impute.locf() |> Impute.nocb()

I converted this METRO to a Matrix as:

METRO_matr = Matrix(METRO)

Followed by normalization as:

using LinearAlgebra
METRO_norm = normalize(METRO_matr, 1000)

Then, I converted the matrix to a dataframe as:

METRO_DF = DataFrame(METRO_norm, :auto)

From here, I applied the GLM commands you
expressed before as:

ols = lm(@formula(x3~x1+x2), METRO_DF)

Which is returning the error I expressed before
as:

DomainError with 0.0:
FDist: the condition ν2 > zero(ν2) is not satisfied

mcreel · March 30, 2022, 5:29pm

That is saying that the second degrees of freedom of the F(q,n-k) test is not positive. n is the number of observations, and k is the number of regressors, including the constant, 3 in your case. So, it seems that your number of observations is 3 or less. What’s the number of rows of the data frame, after dropping missings? If the screenshot is the entire sample, it is 3, which is in agreement with these comments.
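
As a rough sketch of that arithmetic (the helper name `resid_dof` is made up for illustration and is not part of GLM):

```julia
# Hypothetical helper: second degrees of freedom of the F(q, n - k) test.
# n = number of observations
# k = number of estimated coefficients, constant included
resid_dof(n, k) = n - k

resid_dof(6, 3)  # 3: positive, so the F distribution is well defined
resid_dof(3, 3)  # 0: not positive, the situation the DomainError describes
```

So with x3 ~ x1 + x2 you need at least 4 usable rows for the F-test to be defined.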

YummyPampers2 · March 30, 2022, 5:38pm

@mcreel – thanks for your explanation here.

The Imputation step I applied does not drop any
missing values, instead it replaces the records
with adjacent values (assuming the observations
are based on the same individual). After the
imputation, there are 6 rows.

mcreel · March 30, 2022, 5:47pm

Then the columns are probably not linearly independent after this replacement strategy.

YummyPampers2 · March 30, 2022, 5:56pm

How might you troubleshoot this?

mcreel · March 30, 2022, 6:07pm

There is a large literature, just search for “rank deficient regression”. There’s no clear best solution to the problem.
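
One way to check for it, as a sketch with made-up numbers rather than your data: build the design matrix yourself and compare its rank with its number of columns. Here `x2` is deliberately an exact multiple of `x1`, so the check should flag a problem.

```julia
using LinearAlgebra

# Made-up predictors: x2 is an exact multiple of x1, so the two are collinear.
x1 = [4.15, 4.15, 4.33, 4.33, 4.40, 4.40]
x2 = 2 .* x1

# Design matrix with a constant column, like the one a formula
# such as x3 ~ x1 + x2 implies.
X = hcat(ones(length(x1)), x1, x2)

k = size(X, 2)          # number of coefficients requested
r = rank(X)             # number of linearly independent columns
rank_deficient = r < k  # true when some column is redundant
```

If `rank_deficient` is true on your real data, some column can be written as a combination of the others and has to be dropped or handled separately.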

junder873 · March 30, 2022, 6:12pm

When I run this, it seems to work fine. Can you try running this in a fresh environment to make sure there isn’t something else messing this up?

On a side note, you don’t typically need to normalize in a linear regression like this.

YummyPampers2 · March 30, 2022, 6:22pm

Thank you for your note @junder873

Since each of the columns are linearly
independent, I thought normalization
would not confound the regression
model. You are saying, without this
process step, given this knowledge,
I could generate a sensible model?

junder873 · March 30, 2022, 6:40pm

Here is a stack overflow answer that does a far better job than I could.

The short version is that an OLS regression really doesn’t care: you could multiply all your values by a billion and the slope coefficients would stay the same. You can also multiply a single column by any number and its t-statistic will remain the same; the coefficient will be scaled by the inverse of what you multiplied it by.
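
A small sketch of this with made-up numbers, using plain least squares (`X \ y`) instead of GLM so that it only needs the standard library:

```julia
using LinearAlgebra

# Made-up data, only to illustrate the scaling behaviour.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

X = hcat(ones(length(x)), x)   # constant + one regressor
b = X \ y                      # least-squares coefficients [intercept, slope]

c = 1000.0

# Scale only the regressor column: its coefficient is divided by c.
b_col = hcat(ones(length(x)), c .* x) \ y
slope_rescaled = isapprox(b_col[2] * c, b[2]; atol = 1e-8)
intercept_same = isapprox(b_col[1], b[1]; atol = 1e-8)

# Scale both x and y by c: the slope is unchanged.
b_all = hcat(ones(length(x)), c .* x) \ (c .* y)
slope_same = isapprox(b_all[2], b[2]; atol = 1e-8)
```

All three checks should come out `true`; the fit itself is the same, only the units of the coefficients change.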

YummyPampers2 · March 30, 2022, 6:52pm

Thank you @junder873

What I took away from the Stack Overflow stream was
normalization can help with printability for presentations
but is not altogether necessary, especially on modern
machines that perform some standardization by design.

As a general rule of thumb, OLS is scale-invariant:
normalization does not change the fit, it only rescales the
coefficients. Penalized methods like Ridge or Lasso are not
scale-invariant, so normalization is encouraged for those and
similar methods.

In response to “Can you try running this in a fresh environment”:

I restarted the Julia session, started a new
environment, did not normalize, and am
getting the same issue as above

DomainError with 0.0:
FDist: the condition ν2 > zero(ν2) is not satisfied.

nilshg · March 30, 2022, 7:58pm

Can you run this in a fresh session:

using DataFrames, Impute, GLM, LinearAlgebra

df = DataFrame(x1 = [missing, 4.15, 4.33, missing, 4.4, missing], 
   x2 = [missing, 58.57, 56.94, missing, 49.4, missing], 
   x3 = [3.0, 4.45, 3.71, 2.6, 3.41, missing])

df = Impute.interp(df) |> Impute.locf() |> Impute.nocb()

df_matrix = Matrix(df)

df = DataFrame(normalize(df_matrix, 1000), :auto)

lm(@formula(x3 ~ x1 + x2), df)

With this I get:

julia> lm(@formula(x3 ~ x1 + x2), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

x3 ~ 1 + x1 + x2

Coefficients:
────────────────────────────────────────────────────────────────────────────
                   Coef.  Std. Error      t  Pr(>|t|)   Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)   0.21061       0.592283   0.36    0.7457   -1.6743     2.09552
x1           -2.04389       6.12876   -0.33    0.7607  -21.5483    17.4606
x2           -0.00233653    0.168541  -0.01    0.9898   -0.538709   0.534036
────────────────────────────────────────────────────────────────────────────

So the likeliest explanation is that you’re not actually running the code you’ve posted above.

YummyPampers2 · March 31, 2022, 1:00am

The issue I identified, prompted by
your response, was that one should
not use too many predictors when
fitting the OLS model. In my case the
original DF had 13 columns, and I
attempted to use all of them in
the lm(@formula) call. With only 6
rows, that leaves no residual degrees
of freedom, which is consistent with
the 0.0 in the error.

YummyPampers2 · March 31, 2022, 1:40am

@huang_min

I normalized for presentation purposes
not to change the coefficient outputs from
the ols.
