[ANN] Terms 2.0: son of Terms (new `@formula` implementation in StatsModels.jl

statistics
data

#1

After a lengthy development process, StatsModels.jl#71 Terms 2.0: son of Terms has been merged into master. This is a big change in the StatsModels @formula, which describes how to transform tabular data (with heterogenously typed columns) into numerical forms suitable for statistical or other sorts of modeling. The design of Terms 2.0 is more flexible, extensible, and Julian than the original implementation, which was based heavily on R.

While the internal changes are quite substantial, the user-facing interface should be pretty consistent, and any package that didn’t rely on the internals (including GLM.jl) is likely to continue to work, so feel free to check out the master branch (] dev StatsModels) and give it a try.

My JuliaCon 2018 talk summarizes the problem, goals, and proposed implementation, which is largely (but not entirely) what we ended up with. The “latest” docs describe the new @formula implementation, both from the perspective of users and package developers (or anyone interested in the internals), but I’ll summarize the changes here too.

For users

The major change from the perspective of users is that you can now do things like

glm(@formula(log(y) ~ 1 + x), data)

and it will Just Work™. What happens under the hood is that the log(y) is transformed into an anonymous function (y) -> log(y) which is applied elementwise to the y column of the data when constructing the model response vector. Functions of multiple variables (log(a+b) and nested functions (log(abs2(a))) are also supported.

The other, subtler change is that because the terms of a @formula are normal Julia structs, it’s now easier to construct a @formula programatically, without the metaprogramming magic that used to be necessary. From the docs:

julia> Term(:y) ~ ConstantTerm(1) + Term(:a) + Term(:b) + Term(:a) & Term(:b)
FormulaTerm
Response:
  y(unknown)
Predictors:
  1
  a(unknown)
  b(unknown)
  a(unknown) & b(unknown)
julia> ts = term.((1, :a, :b))
1
a(unknown)
b(unknown)

julia> f2 = term(:y) ~ sum(ts)
FormulaTerm
Response:
  y(unknown)
Predictors:
  1
  a(unknown)
  b(unknown)

julia> f2 == @formula(y ~ 1 + a + b)
true

(The (unknown) reflects the fact that the schema for these terms is not known until data is provided, see the docs on the lifecycle of a @formula for more information.)

For developers

While this design provides more flexibility for users, the main motivation was to provide a better foundation for building modeling packages that can consume tabular or other heterogeneously typed data. To that end, Terms 2.0 was designed from the ground up to support extensibility using normal Julian mechanisms of polymorphism and multiple dispatch. For another example from the docs, here’s how you could add syntax for doing polynomial regression:

poly(x, n) = x^n

# type of model where syntax applies: here this applies to any model type
const POLY_CONTEXT = Any

# struct for behavior
struct PolyTerm <: AbstractTerm
    term::ContinuousTerm
    deg::Int
end

# transform calls to poly into PolyTerms
function StatsModels.apply_schema(t::FunctionTerm{typeof(poly)}, sch, 
                                  Mod::Type{<:POLY_CONTEXT})
    term = apply_schema(t.args_parsed[1], sch, Mod)
    isa(term, ContinuousTerm) ||
        throw(ArgumentError("PolyTerm only works with continuous terms (got $term)"))
    deg = t.args_parsed[2]
    isa(deg, ConstantTerm) ||
        throw(ArgumentError("PolyTerm degree must be a number (got $deg)"))
    PolyTerm(term, deg.n)
end

function StatsModels.modelcols(p::PolyTerm, d::NamedTuple)
    col = modelcols(p.term, d)
    reduce(hcat, [col.^n for n in 1:p.deg])
end

By dispatching in apply_schema on both the surface syntax (typeof(poly)) and the modeling context, Terms 2.0 allows for package authors to control how and when different kinds of special syntax are interpreted.

If you want to see this in action, see the work that @dmbates is doing with MixedModels.jl (which uses a special syntax like (1 + a | subject) to represent varying intercept/slope for a for each level of subject).

Light dependency

The other major change, from a package developer’s perspective, is that Terms 2.0 is a much lighter dependency. This is the REQUIRES file:

julia 1.0
CategoricalArrays
Tables
StatsBase 0.22.0
DataStructures
Missings

Most importantly, Terms 2.0 drops the DataFrames dependency that was a legacy of StatsModels’ origins in DataFrames. StatsModels now uses Tables.jl as a lightweight table standard, and thus is itself a reasonably “light” dependency that can be added to modeling packages.

As a practical matter, what this means is that package authors are encouraged to use the new Terms 2.0 API (schema, apply_schema, and modelcols) directly, instead of relying on the old API (where StatsModels defines a general fallback method for fit(::FormulaTerm, data)).

Streaming and distributed data (planned)

The other major design goal of Terms 2.0 was to make it easier (or at least possible) to implement support for streaming and/or distributed modeling workflows. That is done in part by separating the logic of creating a table-to-matrix transformer into three separate steps, based on what information is available and needed:

  1. Syntax time (only surface syntax
  2. Schema time (only types/invariants like unique values for categorical variables)
  3. Data time (actual data values)

This, in principle, allows you to have fine-grained control over how you extract a schema from your data, and how you apply the transformation represented by each AbstractTerm to your data.

I say “in principle” because at the moment we’ve been careful not to do anything that prevents supporting streaming/distributed data, even though we don’t widely support it yet. Currently, “data time” methods are defined for named tuples, representing either an entire table (if the elements are AbstractVectors, i.e. a Tables.ColumnTable) or a single row. In the future, we hope to add in-place “data time” methods, and better support for extracting schema from streaming data (e.g., an iterable of named tuples). PRs for this are, of course, enthusiastically welcomed.