Prevent GLM from dropping rows with missings

croberts · January 5, 2023, 7:16pm

The @formula macro allows the construction of variables when creating a model matrix.

This is great! I don’t have to add various transformations of variables to a DataFrame simply because I want to try alternative regression specifications. I can write:
@formula(y/x ~ w)
or
@formula(y - z ~ w + w^2)

But what if I want to run the regression:

GLM.lm(@formula(ismissing(y) ~ x + z), df)

Uh oh. It doesn’t matter that ismissing(y) is never missing: GLM recognizes that y itself has missing values and drops those rows from the model matrix.

Obviously, I could construct a new variable outside the formula macro:

df[!, :y_is_missing] = ismissing.(df.y)

and run my regression using this. But I don’t want to. Is there any way to turn off the feature that drops missings from the model matrix? A long-term solution would probably have GLM.jl check if the transformed variables are missing, rather than if the variables themselves are missing. In the interim…
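
A minimal sketch of the workaround described above, using made-up data (the values in `df` are purely illustrative and not from the original post). The indicator is built outside the formula and converted to `Float64` here, on the assumption that a plain numeric response is the safest choice for `lm`:

```julia
using DataFrames, GLM

# Illustrative data only: y has missing values, x and z do not.
df = DataFrame(
    y = [1.0, missing, 3.0, missing, 5.0, 6.0],
    x = [0.5, 1.2, 0.3, 2.1, 1.7, 0.9],
    z = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
)

# Build the indicator outside the formula, so y itself never appears in it.
df.y_is_missing = Float64.(ismissing.(df.y))

# Linear probability model for "y is missing" on x and z.
m = lm(@formula(y_is_missing ~ x + z), df)

# Compare the number of observations used with the number of rows.
println(nobs(m), " of ", nrow(df), " rows used")
```

Because `y` no longer appears in the formula, its missing values should not cause any rows to be dropped, so `nobs(m)` is expected to match `nrow(df)`.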

tbeason · January 5, 2023, 7:25pm

This seems unlikely to fly based on prior discussions. But

maybe that could work?
