StatsModels.jl: upcoming breaking changes to captured functions calls

dave.f.kleinschmidt · January 24, 2023, 5:32pm

We recently merged a long-standing PR #183 (which some of you may remember from a juliacon talk in 2020 ) which is a fairly deep re-thinking of how we handle captured function calls, like the log in @formula(log(y) ~ 1 + x).

For users

Protect/unprotect

This PR introduces a feature first proposed and prototyped by @oxinabox, which allows for designating sub-trees of a @formula expression as “special” or “not special”. This is useful if you want to do something like use & to mean ‘bitwise AND’ or (more commonly ) * to mean ‘multiply’:

julia> @formula(y ~ 1 + protect(a * b))

julia> modelmatrix(f, (a=1:3, b=2:4))
3×2 Matrix{Float64}:
 1.0   2.0
 1.0   6.0
 1.0  12.0

The opposite can be done with unprotect to “restore” the @formula-specific interpretation to operators in a sub-tree. To take an extremely contrived example:

julia> f = @formula(y ~ a + b + log(1 + unprotect(a & b)))
FormulaTerm
Response:
  y(unknown)
Predictors:
  a(unknown)
  b(unknown)
  (a,b)->log(1 + unprotect(a & b))

julia> modelmatrix(f, (a=["a", "a", "b"], b=2:4))
3×3 Matrix{Float64}:
 0.0  2.0  0.0
 0.0  3.0  0.0
 1.0  4.0  1.60944

Here, the last column is the log of one plus the interaction between a and b columns. This reveals a few rough edges in how we’re handling multiple columns being produced (with more columns/levels in a, well, Here Be Dragons…) but provides at least the syntactic foundations to support these kinds of specials later.

Down the road, we may introduce more convenient syntax than protect() and unprotect() within the @formula macro itself (they are kiiiinda like interpolate/quote, so using $ for protect and : for unprotect might make sense).

Run-time `FunctionTerm` construction.

Likely the biggest change for users is that it is now at least possible to create FunctionTerms at run-time. Borrowing an example from the docs:

julia> log_term(t::AbstractTerm) = FunctionTerm(log, [t], :(log($(t))))
log_term (generic function with 1 method)

julia> log_term(term(:y))
(y)->log(y)

julia> f = log_term(term(:y)) ~ sum(ts)
FormulaTerm
Response:
  (y)->log(y)
Predictors:
  1
  a(unknown)
  b(unknown)

julia> response(f, df)
9-element Vector{Float64}:
 -0.5358107653592508
 -2.5595706990153952
 -0.3331980664948834
 -1.1383191195688154
 -0.4260357285735626
 -1.4412188661761132
 -0.34293563140185523
 -0.5837776723176953
 -2.980055366491228
 
julia> lm(f, df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

:(log(y)) ~ 1 + a + b

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   0.0698025    0.928295   0.08    0.9425  -2.20165    2.34126
a            -0.105669     0.128107  -0.82    0.4410  -0.419136   0.207797
b            -1.63199      1.12678   -1.45    0.1977  -4.38911    1.12513
──────────────────────────────────────────────────────────────────────────

Which is equivalent to the @formula version:

julia> lm(@formula(log(y) ~ 1 + a + b), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

:(log(y)) ~ 1 + a + b

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   0.0698025    0.928295   0.08    0.9425  -2.20165    2.34126
a            -0.105669     0.128107  -0.82    0.4410  -0.419136   0.207797
b            -1.63199      1.12678   -1.45    0.1977  -4.38911    1.12513
──────────────────────────────────────────────────────────────────────────

The one major gotcha here is that you have to provide a exorig (as the third argument to the constructor) in order to get nice coefficient names. We may in the future provide some convenience functions to do this in a sensible way but for now if you really need to create custom FunctionTerms it’s at least possible

For developers

There are substantial internal changes to how FunctionTerms are structured. In the current/previous 0.6.x series, there is a single function term for each non-special call (the whole sub-expression is pulled into teh body of an anonymous function). To take a concrete example, log(1 + a + b) is converted to (a, b) -> log(1 + a + b)

In this PR, we create a whole tree of FunctionTerms. For log(1 + a + b), we generate the following struct:

julia> @formula(y ~ log(1 + a + b)).rhs |> dump
FunctionTerm{typeof(log), Vector{FunctionTerm{typeof(+), Vector{AbstractTerm}}}}
  f: log (function of type typeof(log))
  args: Array{FunctionTerm{typeof(+), Vector{AbstractTerm}}}((1,))
    1: FunctionTerm{typeof(+), Vector{AbstractTerm}}
      f: + (function of type typeof(+))
      args: Array{AbstractTerm}((3,))
        1: ConstantTerm{Int64}
          n: Int64 1
        2: Term
          sym: Symbol a
        3: Term
          sym: Symbol b
      exorig: Expr
        head: Symbol call
        args: Array{Any}((4,))
          1: Symbol +
          2: Int64 1
          3: Symbol a
          4: Symbol b
  exorig: Expr
    head: Symbol call
    args: Array{Any}((2,))
      1: Symbol log
      2: Expr
        head: Symbol call
        args: Array{Any}((4,))
          1: Symbol +
          2: Int64 1
          3: Symbol a
          4: Symbol b

That is, our new FunctionTerm captures three things:

f: the function called (log)
args: the vector arguments to this call, captured as AbstractTerms (FunctionTerm, Term, ConstantTerm, etc.)
exorig: the original expression (used for generating coefnames, printing the term, etc.)

The diff in #183 for the extension tests shows the recommended extension pattern for custom syntax:

have a run-time method for the syntax, poly(::AbstractTerm, ::AbstractTerm) -> PolyTerm which creates the necessary special term
in the apply_schema(ft::FunctionTerm{typeof(poly)}, ...) method, recursively call apply_schema on ft.args, and pass the results to poly.

Implementing this pattern may require some internal changes in downstream dependencies but I’ve found it leads to more seamless and robust support for special syntax BOTH within @formula and at run-time as well.

Release plan

We plan to tag 0.7 soon, and then tag 1.0 a month after that incorporating any additional breaking changes we can get in before then. The current road map is here: roadmap to 1.0 release · Issue #271 · JuliaStats/StatsModels.jl · GitHub

I’ll try to open issues on packages that depend on StatsModels and use FunctionTerm to indicate that somethign more than a simple compat bump may be required. As a starting point, I’ll go through all the matches in JuliaHub for FunctionTerm

Topic		Replies	Views
[ANN] Terms 2.0: son of Terms (new `@formula` implementation in StatsModels.jl Package Announcements statistics , data	0	1566	March 10, 2019
Extending the `@formula` syntax to generate functions Statistics	3	529	July 20, 2020
PSA: breaking changes in StatsModels v0.6.0 (Terms 2.0: Son of Terms) Statistics	6	1592	July 2, 2019
Using all independent variables with @formula in a multiple linear model New to Julia glm	18	4424	January 29, 2023
Designing Custom Syntax for @formula Statistics question , statsmodels	1	398	May 13, 2022