PSA: breaking changes in StatsModels v0.6.0 (Terms 2.0: Son of Terms)

It’s finally time to release version 0.6.0 of StatsModels.jl, which includes the radically re-designed internals introduced in Terms 2.0: Son of Terms (announcement here).

Most of these changes should not affect end users, and the API for interacting with modeling functions (e.g., those provided by GLM.jl) has changed as little as possible.

Package maintainers who rely on any of the pre-Terms 2.0 internals (that is, anyone who hasn’t been working with StatsModels master for the last few months) should upper bound StatsModels. And anyone who’s been waiting to tag a release of their own package can go ahead and lower-bound StatsModels at 0.6.
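For example (illustrative version numbers, assuming a Project.toml-based package; adjust to your setup):

```toml
# Hypothetical [compat] entries in your package's Project.toml.
# Before porting to Terms 2.0: stay below the breaking release.
[compat]
StatsModels = "0.5"

# After porting your code to the new internals, require 0.6 instead:
# StatsModels = "0.6"
```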

17 Likes

Of note to users, this version will introduce a few interesting user-facing features, including

  • lead and lag terms (thanks to @oxinabox)
  • support for arbitrary Julia code (with some restrictions) within the @formula
  • easier programmatic construction of formulae.

See the docs or the announcement post here for more information on the last two…
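As a quick taste, programmatic construction works by building terms directly and combining them with + and ~ (a minimal sketch; the column names y, x1, x2 are made up):

```julia
using StatsModels

# Build the equivalent of @formula(y ~ x1 + x2) without the macro:
# term(:x) makes a Term from a symbol, + combines terms into a tuple,
# and ~ pairs the response with the predictors.
f = term(:y) ~ term(:x1) + term(:x2)
```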

4 Likes

Mine broke.

What’s the command to revert to the prior version?

I’ve posted comments in the GitHub issue thread for Bootstrap.jl, where the stack trace points.

using StatsKit
┌ Info: Precompiling StatsKit [2cb19f9e-ec4d-5c53-8573-a4542a68d3f0]
└ @ Base loading.jl:1186
ERROR: LoadError: LoadError: UndefVarError: Formula not defined
using RCall;
┌ Info: Precompiling RCall [6f49c342-dc21-5d91-9882-a32aef131414]
└ @ Base loading.jl:1186
WARNING: could not import StatsModels.Formula into RCall
ERROR: LoadError: LoadError: UndefVarError: Formula not defined

Hi. Thanks for your work! I have a few questions about the new system.

  1. How can I create a ModelMatrix using only the rows that do not contain an Inf introduced by a transformation? For instance, in the following code, the ModelMatrix gets an Inf due to the log transformation:

    using DataFrames
    df = DataFrame( y = [2, 2], x = [0, 1])
    ModelMatrix(ModelFrame(@formula(y~log(x)), df))
    #> ModelMatrix{Array{Float64,2}}([1.0 -Inf; 1.0 0.0], [1, 1])
    

    One way to get a ModelMatrix without the Inf would be to subset the initial ModelMatrix, but this would require the user to have enough RAM to hold 3 different copies of the data. Is it possible to avoid this?

  2. How can I create a Term/tuple of Terms from an Expr like expr = :(x1 + x2 + x3 + x4)? For now I do @eval(@formula(nothing ~ $expr)).rhs, but maybe there is a more elegant way to do this?

1 Like

Your second question is a bit easier so I’ll answer it first. You can combine Terms with + (or sum), like

julia> sum(term.([:x1, :x2, :x3, :x4])) |> string
"x1 + x2 + x3 + x4"

If you really need to use an Expr then you’ll have to do some parsing of it yourself, or use @formula.
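If you do go the parsing route, a minimal sketch could look like this (the helper name totermsum is made up, and it only handles top-level + calls):

```julia
using StatsModels

# Hypothetical helper: walk an Expr like :(x1 + x2 + x3 + x4),
# turn each symbol into a Term, and combine them with sum.
totermsum(s::Symbol) = term(s)
function totermsum(ex::Expr)
    ex.head === :call && ex.args[1] === :+ ||
        error("this sketch only handles top-level + expressions")
    # Julia parses n-ary + as one call, so args[2:end] are the operands.
    sum(totermsum, ex.args[2:end])
end

rhs = totermsum(:(x1 + x2 + x3 + x4))
```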

For 1, you’ll have to do the filtering yourself. You can convert those entries to missing in the underlying data source and they’ll be dropped automatically, or you can post-filter the generated model matrix (but as you said that will require holding separate copies of the data). If you’re really worried about memory usage, you could always stream your data row-wise, call modelcols one row at a time, and re-assemble them on the other side. Something like:

julia> using StatsModels, Tables, DataFrames

julia> df = DataFrame(y = rand(10), x = rand(Bool, 10) .* rand(10))
10×2 DataFrame
│ Row │ y         │ x        │
│     │ Float64   │ Float64  │
├─────┼───────────┼──────────┤
│ 1   │ 0.209279  │ 0.0      │
│ 2   │ 0.424313  │ 0.0      │
│ 3   │ 0.355565  │ 0.372775 │
│ 4   │ 0.298203  │ 0.0      │
│ 5   │ 0.322313  │ 0.0      │
│ 6   │ 0.0742158 │ 0.49498  │
│ 7   │ 0.55074   │ 0.0      │
│ 8   │ 0.747198  │ 0.990947 │
│ 9   │ 0.225106  │ 0.21548  │
│ 10  │ 0.280559  │ 0.363834 │

julia> f = @formula(y ~ x + log(x))
FormulaTerm
Response:
  y(unknown)
Predictors:
  x(unknown)
  (x)->log(x)

julia> # use StatisticalModel as the context to get the intercept term
       f = apply_schema(f, schema(f, df), StatisticalModel)
FormulaTerm
Response:
  y(continuous)
Predictors:
  1
  x(continuous)
  (x)->log(x)

julia> y, X = modelcols(f, df)
([0.209279, 0.424313, 0.355565, 0.298203, 0.322313, 0.0742158, 0.55074, 0.747198, 0.225106, 0.280559], [1.0 0.0 -Inf; 1.0 0.0 -Inf; … ; 1.0 0.21548 -1.53489; 1.0 0.363834 -1.01106])

julia> X
10×3 Array{Float64,2}:
 1.0  0.0       -Inf         
 1.0  0.0       -Inf         
 1.0  0.372775    -0.986781  
 1.0  0.0       -Inf         
 1.0  0.0       -Inf         
 1.0  0.49498     -0.703238  
 1.0  0.0       -Inf         
 1.0  0.990947    -0.00909459
 1.0  0.21548     -1.53489   
 1.0  0.363834    -1.01106   

julia> # method 1: using Iterators and reduce:
       allfinite(x) = all(isfinite, Iterators.flatten(x))
allfinite (generic function with 1 method)

julia> y3, X3 = reduce((xy, xyr) -> (append!(xy[1], xyr[1]), append!(xy[2], xyr[2])),
                       Iterators.filter(allfinite, modelcols(f, row) for row in rowtable(df)),
                       init = (Float64[], Float64[]))
([0.355565, 0.0742158, 0.747198, 0.225106, 0.280559], [1.0, 0.372775, -0.986781, 1.0, 0.49498, -0.703238, 1.0, 0.990947, -0.00909459, 1.0, 0.21548, -1.53489, 1.0, 0.363834, -1.01106])

julia> X3 = reshape(X3, 3, :)'
5×3 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
 1.0  0.372775  -0.986781  
 1.0  0.49498   -0.703238  
 1.0  0.990947  -0.00909459
 1.0  0.21548   -1.53489   
 1.0  0.363834  -1.01106   

julia> # method 2: using a for loop:
       y2, X2 = zeros(0), zeros(0)
(Float64[], Float64[])

julia> for row in rowtable(df)
           yr, Xr = modelcols(f, row)
           if allfinite((yr, Xr))
               append!(y2, yr)
               append!(X2, Xr)
           end
       end

julia> X2 = reshape(X2, 3, :)'
5×3 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
 1.0  0.372775  -0.986781  
 1.0  0.49498   -0.703238  
 1.0  0.990947  -0.00909459
 1.0  0.21548   -1.53489   
 1.0  0.363834  -1.01106   

Unfortunately, as @oxinabox pointed out on Slack, with this approach you’re stuck either using the lazy Adjoint or using permutedims (which copies): because of how matrices are laid out in memory, you can’t build up an observations-as-rows matrix one row at a time unless you know the size ahead of time.
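To make that trade-off concrete (plain LinearAlgebra, with stand-in data for the flat vector that append! builds up):

```julia
using LinearAlgebra

flat = collect(1.0:12.0)      # stand-in for the flat vector built by append!
Xcols = reshape(flat, 3, :)   # 3×4: each column holds one observation's row

Xlazy  = Xcols'               # lazy Adjoint: no copy, shares memory with flat
Xdense = permutedims(Xcols)   # dense 4×3 Matrix: copies into row-order layout
```

Both give the same values; the choice is between the zero-copy wrapper and a materialized matrix with observation rows contiguous in memory.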

Thanks for answering these questions — I really appreciate it.

For 1, to be clearer, I simply want to obtain a boolean vector marking the rows that are going to be used in the estimation. So, adapting your solution, I get

[all(isfinite, Iterators.flatten(modelcols(f, row))) for row in rowtable(df)]

Could you confirm this is the right way to go about this?

I don’t think there’s really a “right” way of going about this. If all you want is a way of getting a boolean vector of the rows of the model matrix that would have infs, that’s certainly one way to go about it.

But at a broader level, I’m not sure how useful doing it this way would be, since unless I’m missing something you’ll end up needing to generate and index the full model matrix before that vector of bools becomes useful. That’s because it’s up to the modeling side of things to decide which rows get used. If you pass a matrix with Infs to GLM, it will happily try to use all the rows and give nonsense results:

julia> df
10×2 DataFrame
│ Row │ y         │ x        │
│     │ Float64   │ Float64  │
├─────┼───────────┼──────────┤
│ 1   │ 0.209279  │ 0.0      │
│ 2   │ 0.424313  │ 0.0      │
│ 3   │ 0.355565  │ 0.372775 │
│ 4   │ 0.298203  │ 0.0      │
│ 5   │ 0.322313  │ 0.0      │
│ 6   │ 0.0742158 │ 0.49498  │
│ 7   │ 0.55074   │ 0.0      │
│ 8   │ 0.747198  │ 0.990947 │
│ 9   │ 0.225106  │ 0.21548  │
│ 10  │ 0.280559  │ 0.363834 │

julia> lm(@formula(y ~ 1 + x + log(x)), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + x + :(log(x))

Coefficients:
──────────────────────────────────────────────────────────────────────────
             Estimate  Std. Error  t value  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)       NaN         NaN      NaN       NaN        NaN        NaN
x                 NaN         NaN      NaN       NaN        NaN        NaN
log(x)            NaN         NaN      NaN       NaN        NaN        NaN
──────────────────────────────────────────────────────────────────────────

A boolean vector of “good rows” is only going to be useful in such a situation if you use it to create a “clean” model matrix and response vector by indexing the model matrix, in which case you’re probably better off using some kind of mapslices approach since you’re going to need to generate the model matrix anyway.
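For instance, the mask-then-index step might look like this (synthetic y and X standing in for the output of modelcols):

```julia
# Synthetic response and model matrix with one bad row.
y = [1.0, 2.0, 3.0]
X = [1.0  0.5;
     1.0 -Inf;
     1.0  0.2]

# Mask of rows where the response and all predictors are finite.
good = [isfinite(y[i]) && all(isfinite, @view X[i, :]) for i in 1:size(X, 1)]

yclean, Xclean = y[good], X[good, :]   # drops row 2
```

Note that X has to exist in full before `good` can be applied, which is why the mask alone doesn't save you the memory.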

But I’m afraid I’m not clear on what the overall context of the problem is and can’t really give good advice without that. So maybe better to move this to another thread?