We recently merged a long-standing PR #183 (which some of you may remember from a juliacon talk in 2020
) which is a fairly deep re-thinking of how we handle captured function calls, like the log in @formula(log(y) ~ 1 + x).
For users
Protect/unprotect
This PR introduces a feature first proposed and prototyped by @oxinabox, which allows for designating sub-trees of a @formula expression as βspecialβ or βnot specialβ. This is useful if you want to do something like use & to mean βbitwise ANDβ or (more commonly
) * to mean βmultiplyβ:
julia> @formula(y ~ 1 + protect(a * b))
julia> modelmatrix(f, (a=1:3, b=2:4))
3Γ2 Matrix{Float64}:
1.0 2.0
1.0 6.0
1.0 12.0
The opposite can be done with unprotect to βrestoreβ the @formula-specific interpretation to operators in a sub-tree. To take an extremely contrived example:
julia> f = @formula(y ~ a + b + log(1 + unprotect(a & b)))
FormulaTerm
Response:
y(unknown)
Predictors:
a(unknown)
b(unknown)
(a,b)->log(1 + unprotect(a & b))
julia> modelmatrix(f, (a=["a", "a", "b"], b=2:4))
3Γ3 Matrix{Float64}:
0.0 2.0 0.0
0.0 3.0 0.0
1.0 4.0 1.60944
Here, the last column is the log of one plus the interaction between a and b columns. This reveals a few rough edges in how weβre handling multiple columns being produced (with more columns/levels in a, well, Here Be Dragonsβ¦) but provides at least the syntactic foundations to support these kinds of specials later.
Down the road, we may introduce more convenient syntax than protect() and unprotect() within the @formula macro itself (they are kiiiinda like interpolate/quote, so using $ for protect and : for unprotect might make sense).
Run-time FunctionTerm construction.
Likely the biggest change for users is that it is now at least possible to create FunctionTerms at run-time. Borrowing an example from the docs:
julia> log_term(t::AbstractTerm) = FunctionTerm(log, [t], :(log($(t))))
log_term (generic function with 1 method)
julia> log_term(term(:y))
(y)->log(y)
julia> f = log_term(term(:y)) ~ sum(ts)
FormulaTerm
Response:
(y)->log(y)
Predictors:
1
a(unknown)
b(unknown)
julia> response(f, df)
9-element Vector{Float64}:
-0.5358107653592508
-2.5595706990153952
-0.3331980664948834
-1.1383191195688154
-0.4260357285735626
-1.4412188661761132
-0.34293563140185523
-0.5837776723176953
-2.980055366491228
julia> lm(f, df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
:(log(y)) ~ 1 + a + b
Coefficients:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
(Intercept) 0.0698025 0.928295 0.08 0.9425 -2.20165 2.34126
a -0.105669 0.128107 -0.82 0.4410 -0.419136 0.207797
b -1.63199 1.12678 -1.45 0.1977 -4.38911 1.12513
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Which is equivalent to the @formula version:
julia> lm(@formula(log(y) ~ 1 + a + b), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
:(log(y)) ~ 1 + a + b
Coefficients:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
(Intercept) 0.0698025 0.928295 0.08 0.9425 -2.20165 2.34126
a -0.105669 0.128107 -0.82 0.4410 -0.419136 0.207797
b -1.63199 1.12678 -1.45 0.1977 -4.38911 1.12513
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
The one major gotcha here is that you have to provide a exorig (as the third argument to the constructor) in order to get nice coefficient names. We may in the future provide some convenience functions to do this in a sensible way but for now if you really need to create custom FunctionTerms itβs at least possible ![]()
For developers
There are substantial internal changes to how FunctionTerms are structured. In the current/previous 0.6.x series, there is a single function term for each non-special call (the whole sub-expression is pulled into teh body of an anonymous function). To take a concrete example, log(1 + a + b) is converted to (a, b) -> log(1 + a + b)
In this PR, we create a whole tree of FunctionTerms. For log(1 + a + b), we generate the following struct:
julia> @formula(y ~ log(1 + a + b)).rhs |> dump
FunctionTerm{typeof(log), Vector{FunctionTerm{typeof(+), Vector{AbstractTerm}}}}
f: log (function of type typeof(log))
args: Array{FunctionTerm{typeof(+), Vector{AbstractTerm}}}((1,))
1: FunctionTerm{typeof(+), Vector{AbstractTerm}}
f: + (function of type typeof(+))
args: Array{AbstractTerm}((3,))
1: ConstantTerm{Int64}
n: Int64 1
2: Term
sym: Symbol a
3: Term
sym: Symbol b
exorig: Expr
head: Symbol call
args: Array{Any}((4,))
1: Symbol +
2: Int64 1
3: Symbol a
4: Symbol b
exorig: Expr
head: Symbol call
args: Array{Any}((2,))
1: Symbol log
2: Expr
head: Symbol call
args: Array{Any}((4,))
1: Symbol +
2: Int64 1
3: Symbol a
4: Symbol b
That is, our new FunctionTerm captures three things:
-
f: the function called (log) -
args: the vector arguments to this call, captured asAbstractTerms (FunctionTerm,Term,ConstantTerm, etc.) -
exorig: the original expression (used for generating coefnames, printing the term, etc.)
The diff in #183 for the extension tests shows the recommended extension pattern for custom syntax:
- have a run-time method for the syntax,
poly(::AbstractTerm, ::AbstractTerm) -> PolyTermwhich creates the necessary special term - in the
apply_schema(ft::FunctionTerm{typeof(poly)}, ...)method, recursively callapply_schemaonft.args, and pass the results topoly.
Implementing this pattern may require some internal changes in downstream dependencies but Iβve found it leads to more seamless and robust support for special syntax BOTH within @formula and at run-time as well.
Release plan
We plan to tag 0.7 soon, and then tag 1.0 a month after that incorporating any additional breaking changes we can get in before then. The current road map is here: roadmap to 1.0 release Β· Issue #271 Β· JuliaStats/StatsModels.jl Β· GitHub
Iβll try to open issues on packages that depend on StatsModels and use FunctionTerm to indicate that somethign more than a simple compat bump may be required. As a starting point, Iβll go through all the matches in JuliaHub for FunctionTerm