We recently merged a long-standing PR #183 (which some of you may remember from a juliacon talk in 2020 ) which is a fairly deep re-thinking of how we handle captured function calls, like the log
in @formula(log(y) ~ 1 + x)
.
For users
Protect/unprotect
This PR introduces a feature first proposed and prototyped by @oxinabox, which allows for designating sub-trees of a @formula
expression as βspecialβ or βnot specialβ. This is useful if you want to do something like use &
to mean βbitwise ANDβ or (more commonly ) *
to mean βmultiplyβ:
julia> @formula(y ~ 1 + protect(a * b))
julia> modelmatrix(f, (a=1:3, b=2:4))
3Γ2 Matrix{Float64}:
1.0 2.0
1.0 6.0
1.0 12.0
The opposite can be done with unprotect
to βrestoreβ the @formula
-specific interpretation to operators in a sub-tree. To take an extremely contrived example:
julia> f = @formula(y ~ a + b + log(1 + unprotect(a & b)))
FormulaTerm
Response:
y(unknown)
Predictors:
a(unknown)
b(unknown)
(a,b)->log(1 + unprotect(a & b))
julia> modelmatrix(f, (a=["a", "a", "b"], b=2:4))
3Γ3 Matrix{Float64}:
0.0 2.0 0.0
0.0 3.0 0.0
1.0 4.0 1.60944
Here, the last column is the log of one plus the interaction between a
and b
columns. This reveals a few rough edges in how weβre handling multiple columns being produced (with more columns/levels in a
, well, Here Be Dragonsβ¦) but provides at least the syntactic foundations to support these kinds of specials later.
Down the road, we may introduce more convenient syntax than protect()
and unprotect()
within the @formula
macro itself (they are kiiiinda like interpolate/quote, so using $
for protect
and :
for unprotect
might make sense).
Run-time FunctionTerm
construction.
Likely the biggest change for users is that it is now at least possible to create FunctionTerm
s at run-time. Borrowing an example from the docs:
julia> log_term(t::AbstractTerm) = FunctionTerm(log, [t], :(log($(t))))
log_term (generic function with 1 method)
julia> log_term(term(:y))
(y)->log(y)
julia> f = log_term(term(:y)) ~ sum(ts)
FormulaTerm
Response:
(y)->log(y)
Predictors:
1
a(unknown)
b(unknown)
julia> response(f, df)
9-element Vector{Float64}:
-0.5358107653592508
-2.5595706990153952
-0.3331980664948834
-1.1383191195688154
-0.4260357285735626
-1.4412188661761132
-0.34293563140185523
-0.5837776723176953
-2.980055366491228
julia> lm(f, df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
:(log(y)) ~ 1 + a + b
Coefficients:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
(Intercept) 0.0698025 0.928295 0.08 0.9425 -2.20165 2.34126
a -0.105669 0.128107 -0.82 0.4410 -0.419136 0.207797
b -1.63199 1.12678 -1.45 0.1977 -4.38911 1.12513
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Which is equivalent to the @formula
version:
julia> lm(@formula(log(y) ~ 1 + a + b), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
:(log(y)) ~ 1 + a + b
Coefficients:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
(Intercept) 0.0698025 0.928295 0.08 0.9425 -2.20165 2.34126
a -0.105669 0.128107 -0.82 0.4410 -0.419136 0.207797
b -1.63199 1.12678 -1.45 0.1977 -4.38911 1.12513
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
The one major gotcha here is that you have to provide a exorig
(as the third argument to the constructor) in order to get nice coefficient names. We may in the future provide some convenience functions to do this in a sensible way but for now if you really need to create custom FunctionTerm
s itβs at least possible
For developers
There are substantial internal changes to how FunctionTerm
s are structured. In the current/previous 0.6.x series, there is a single function term for each non-special call (the whole sub-expression is pulled into teh body of an anonymous function). To take a concrete example, log(1 + a + b)
is converted to (a, b) -> log(1 + a + b)
In this PR, we create a whole tree of FunctionTerm
s. For log(1 + a + b)
, we generate the following struct:
julia> @formula(y ~ log(1 + a + b)).rhs |> dump
FunctionTerm{typeof(log), Vector{FunctionTerm{typeof(+), Vector{AbstractTerm}}}}
f: log (function of type typeof(log))
args: Array{FunctionTerm{typeof(+), Vector{AbstractTerm}}}((1,))
1: FunctionTerm{typeof(+), Vector{AbstractTerm}}
f: + (function of type typeof(+))
args: Array{AbstractTerm}((3,))
1: ConstantTerm{Int64}
n: Int64 1
2: Term
sym: Symbol a
3: Term
sym: Symbol b
exorig: Expr
head: Symbol call
args: Array{Any}((4,))
1: Symbol +
2: Int64 1
3: Symbol a
4: Symbol b
exorig: Expr
head: Symbol call
args: Array{Any}((2,))
1: Symbol log
2: Expr
head: Symbol call
args: Array{Any}((4,))
1: Symbol +
2: Int64 1
3: Symbol a
4: Symbol b
That is, our new FunctionTerm
captures three things:
-
f
: the function called (log
) -
args
: the vector arguments to this call, captured asAbstractTerm
s (FunctionTerm
,Term
,ConstantTerm
, etc.) -
exorig
: the original expression (used for generating coefnames, printing the term, etc.)
The diff in #183 for the extension tests shows the recommended extension pattern for custom syntax:
- have a run-time method for the syntax,
poly(::AbstractTerm, ::AbstractTerm) -> PolyTerm
which creates the necessary special term - in the
apply_schema(ft::FunctionTerm{typeof(poly)}, ...)
method, recursively callapply_schema
onft.args
, and pass the results topoly
.
Implementing this pattern may require some internal changes in downstream dependencies but Iβve found it leads to more seamless and robust support for special syntax BOTH within @formula
and at run-time as well.
Release plan
We plan to tag 0.7 soon, and then tag 1.0 a month after that incorporating any additional breaking changes we can get in before then. The current road map is here: roadmap to 1.0 release Β· Issue #271 Β· JuliaStats/StatsModels.jl Β· GitHub
Iβll try to open issues on packages that depend on StatsModels and use FunctionTerm
to indicate that somethign more than a simple compat bump may be required. As a starting point, Iβll go through all the matches in JuliaHub for FunctionTerm