PSA: breaking changes in StatsModels v0.6.0 (Terms 2.0: Son of Terms)

I don’t think there’s really a “right” way of going about this. If all you want is a boolean vector marking the rows of the model matrix that would contain Infs, that’s certainly one way to do it.

But at a broader level, I’m not sure how useful doing it this way would be. Unless I’m missing something, you’ll end up needing to generate and index the full model matrix before that vector of bools becomes useful. That’s because it’s up to the modeling side of things to decide which rows get used. If you pass a matrix with Infs to GLM, it will happily try to use all the rows and give nonsense results:

julia> df
10×2 DataFrame
│ Row │ y         │ x        │
│     │ Float64   │ Float64  │
├─────┼───────────┼──────────┤
│ 1   │ 0.209279  │ 0.0      │
│ 2   │ 0.424313  │ 0.0      │
│ 3   │ 0.355565  │ 0.372775 │
│ 4   │ 0.298203  │ 0.0      │
│ 5   │ 0.322313  │ 0.0      │
│ 6   │ 0.0742158 │ 0.49498  │
│ 7   │ 0.55074   │ 0.0      │
│ 8   │ 0.747198  │ 0.990947 │
│ 9   │ 0.225106  │ 0.21548  │
│ 10  │ 0.280559  │ 0.363834 │

julia> lm(@formula(y ~ 1 + x + log(x)), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + x + :(log(x))

Coefficients:
──────────────────────────────────────────────────────────────────────────
             Estimate  Std. Error  t value  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)       NaN         NaN      NaN       NaN        NaN        NaN
x                 NaN         NaN      NaN       NaN        NaN        NaN
log(x)            NaN         NaN      NaN       NaN        NaN        NaN
──────────────────────────────────────────────────────────────────────────

A boolean vector of “good rows” is only useful in such a situation if you use it to index the model matrix and response vector and get “clean” versions of both. In that case you’re probably better off with some kind of mapslices approach, since you’re going to need to generate the model matrix anyway.
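
To make that concrete, here is a rough sketch of what I mean by a mapslices approach. It’s not the only way to do it, and a few things are my own assumptions: the data is retyped as a NamedTuple of vectors instead of a DataFrame (just to keep the dependencies down), the names good, X_clean, y_clean and beta are made up for illustration, and the final fit uses plain least squares via `\` rather than GLM.

```julia
using StatsModels

# Same values as the DataFrame above, as a column table
# (a NamedTuple of vectors; this stands in for df and is an assumption).
data = (
    y = [0.209279, 0.424313, 0.355565, 0.298203, 0.322313,
         0.0742158, 0.55074, 0.747198, 0.225106, 0.280559],
    x = [0.0, 0.0, 0.372775, 0.0, 0.0,
         0.49498, 0.0, 0.990947, 0.21548, 0.363834],
)

f = @formula(y ~ 1 + x + log(x))

# Resolve the formula against the data, then generate the response
# vector and the full model matrix (log(0.0) gives -Inf here).
f_applied = apply_schema(f, schema(f, data))
y, X = modelcols(f_applied, data)

# One Bool per row: true when every entry in that row is finite.
good = vec(mapslices(row -> all(isfinite, row), X; dims=2))

# Index both the model matrix and the response with the same mask.
X_clean = X[good, :]
y_clean = y[good]

# Plain least squares on the cleaned data (a stand-in for the GLM fit).
beta = X_clean \ y_clean
```

With the data above, only the five rows where x is nonzero survive. If you’d rather have the usual coefficient table, the cleaned X_clean and y_clean should be usable with GLM’s matrix/vector method, lm(X_clean, y_clean), but you lose the coefficient names that the formula interface gives you.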

But I’m afraid I’m not clear on what the overall context of the problem is and can’t really give good advice without that. So maybe better to move this to another thread?