# StatsModels.jl: upcoming breaking changes to captured functions calls

We recently merged a long-standing PR #183 (which some of you may remember from a juliacon talk in 2020 ) which is a fairly deep re-thinking of how we handle captured function calls, like the `log` in `@formula(log(y) ~ 1 + x)`.

## For users

### Protect/unprotect

This PR introduces a feature first proposed and prototyped by @oxinabox, which allows for designating sub-trees of a `@formula` expression as “special” or “not special”. This is useful if you want to do something like use `&` to mean ‘bitwise AND’ or (more commonly ) `*` to mean ‘multiply’:

``````julia> @formula(y ~ 1 + protect(a * b))

julia> modelmatrix(f, (a=1:3, b=2:4))
3×2 Matrix{Float64}:
1.0   2.0
1.0   6.0
1.0  12.0
``````

The opposite can be done with `unprotect` to “restore” the `@formula`-specific interpretation to operators in a sub-tree. To take an extremely contrived example:

``````julia> f = @formula(y ~ a + b + log(1 + unprotect(a & b)))
FormulaTerm
Response:
y(unknown)
Predictors:
a(unknown)
b(unknown)
(a,b)->log(1 + unprotect(a & b))

julia> modelmatrix(f, (a=["a", "a", "b"], b=2:4))
3×3 Matrix{Float64}:
0.0  2.0  0.0
0.0  3.0  0.0
1.0  4.0  1.60944
``````

Here, the last column is the log of one plus the interaction between `a` and `b` columns. This reveals a few rough edges in how we’re handling multiple columns being produced (with more columns/levels in `a`, well, Here Be Dragons…) but provides at least the syntactic foundations to support these kinds of specials later.

Down the road, we may introduce more convenient syntax than `protect()` and `unprotect()` within the `@formula` macro itself (they are kiiiinda like interpolate/quote, so using `\$` for `protect` and `:` for `unprotect` might make sense).

### Run-time `FunctionTerm` construction.

Likely the biggest change for users is that it is now at least possible to create `FunctionTerm`s at run-time. Borrowing an example from the docs:

``````julia> log_term(t::AbstractTerm) = FunctionTerm(log, [t], :(log(\$(t))))
log_term (generic function with 1 method)

julia> log_term(term(:y))
(y)->log(y)

julia> f = log_term(term(:y)) ~ sum(ts)
FormulaTerm
Response:
(y)->log(y)
Predictors:
1
a(unknown)
b(unknown)

julia> response(f, df)
9-element Vector{Float64}:
-0.5358107653592508
-2.5595706990153952
-0.3331980664948834
-1.1383191195688154
-0.4260357285735626
-1.4412188661761132
-0.34293563140185523
-0.5837776723176953
-2.980055366491228

julia> lm(f, df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

:(log(y)) ~ 1 + a + b

Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   0.0698025    0.928295   0.08    0.9425  -2.20165    2.34126
a            -0.105669     0.128107  -0.82    0.4410  -0.419136   0.207797
b            -1.63199      1.12678   -1.45    0.1977  -4.38911    1.12513
──────────────────────────────────────────────────────────────────────────
``````

Which is equivalent to the `@formula` version:

``````julia> lm(@formula(log(y) ~ 1 + a + b), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

:(log(y)) ~ 1 + a + b

Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   0.0698025    0.928295   0.08    0.9425  -2.20165    2.34126
a            -0.105669     0.128107  -0.82    0.4410  -0.419136   0.207797
b            -1.63199      1.12678   -1.45    0.1977  -4.38911    1.12513
──────────────────────────────────────────────────────────────────────────
``````

The one major gotcha here is that you have to provide a `exorig` (as the third argument to the constructor) in order to get nice coefficient names. We may in the future provide some convenience functions to do this in a sensible way but for now if you really need to create custom `FunctionTerm`s it’s at least possible ## For developers

There are substantial internal changes to how `FunctionTerm`s are structured. In the current/previous 0.6.x series, there is a single function term for each non-special call (the whole sub-expression is pulled into teh body of an anonymous function). To take a concrete example, `log(1 + a + b)` is converted to `(a, b) -> log(1 + a + b)`

In this PR, we create a whole tree of `FunctionTerm`s. For `log(1 + a + b)`, we generate the following struct:

``````julia> @formula(y ~ log(1 + a + b)).rhs |> dump
FunctionTerm{typeof(log), Vector{FunctionTerm{typeof(+), Vector{AbstractTerm}}}}
f: log (function of type typeof(log))
args: Array{FunctionTerm{typeof(+), Vector{AbstractTerm}}}((1,))
1: FunctionTerm{typeof(+), Vector{AbstractTerm}}
f: + (function of type typeof(+))
args: Array{AbstractTerm}((3,))
1: ConstantTerm{Int64}
n: Int64 1
2: Term
sym: Symbol a
3: Term
sym: Symbol b
exorig: Expr
args: Array{Any}((4,))
1: Symbol +
2: Int64 1
3: Symbol a
4: Symbol b
exorig: Expr
args: Array{Any}((2,))
1: Symbol log
2: Expr
args: Array{Any}((4,))
1: Symbol +
2: Int64 1
3: Symbol a
4: Symbol b
``````

That is, our new `FunctionTerm` captures three things:

• `f`: the function called (`log`)
• `args`: the vector arguments to this call, captured as `AbstractTerm`s (`FunctionTerm`, `Term`, `ConstantTerm`, etc.)
• `exorig`: the original expression (used for generating coefnames, printing the term, etc.)

The diff in #183 for the extension tests shows the recommended extension pattern for custom syntax:

• have a run-time method for the syntax, `poly(::AbstractTerm, ::AbstractTerm) -> PolyTerm` which creates the necessary special term
• in the `apply_schema(ft::FunctionTerm{typeof(poly)}, ...)` method, recursively call `apply_schema` on `ft.args`, and pass the results to `poly`.

Implementing this pattern may require some internal changes in downstream dependencies but I’ve found it leads to more seamless and robust support for special syntax BOTH within `@formula` and at run-time as well.

### Release plan

We plan to tag 0.7 soon, and then tag 1.0 a month after that incorporating any additional breaking changes we can get in before then. The current road map is here: roadmap to 1.0 release · Issue #271 · JuliaStats/StatsModels.jl · GitHub

I’ll try to open issues on packages that depend on StatsModels and use `FunctionTerm` to indicate that somethign more than a simple compat bump may be required. As a starting point, I’ll go through all the matches in JuliaHub for `FunctionTerm`

8 Likes