After a lengthy development process, StatsModels.jl#71 Terms 2.0: son of Terms has been merged into master. This is a big change in the StatsModels `@formula`

, which describes how to transform tabular data (with heterogenously typed columns) into numerical forms suitable for statistical or other sorts of modeling. The design of Terms 2.0 is more flexible, extensible, and Julian than the original implementation, which was based heavily on R.

While the internal changes are quite substantial, the user-facing interface should be pretty consistent, and any package that didnâ€™t rely on the internals (including GLM.jl) is likely to continue to work, so feel free to check out the master branch (`] dev StatsModels`

) and give it a try.

My JuliaCon 2018 talk summarizes the problem, goals, and proposed implementation, which is largely (but not entirely) what we ended up with. The â€ślatestâ€ť docs describe the new `@formula`

implementation, both from the perspective of users and package developers (or anyone interested in the internals), but Iâ€™ll summarize the changes here too.

# For users

The major change from the perspective of users is that you can now do things like

```
glm(@formula(log(y) ~ 1 + x), data)
```

and it will Just Workâ„˘. What happens under the hood is that the `log(y)`

is transformed into an anonymous function `(y) -> log(y)`

which is applied elementwise to the `y`

column of the data when constructing the model response vector. Functions of multiple variables (`log(a+b)`

and nested functions (`log(abs2(a))`

) are also supported.

The other, subtler change is that because the terms of a `@formula`

are normal Julia structs, itâ€™s now easier to construct a `@formula`

programatically, without the metaprogramming magic that used to be necessary. From the docs:

```
julia> Term(:y) ~ ConstantTerm(1) + Term(:a) + Term(:b) + Term(:a) & Term(:b)
FormulaTerm
Response:
y(unknown)
Predictors:
1
a(unknown)
b(unknown)
a(unknown) & b(unknown)
```

```
julia> ts = term.((1, :a, :b))
1
a(unknown)
b(unknown)
julia> f2 = term(:y) ~ sum(ts)
FormulaTerm
Response:
y(unknown)
Predictors:
1
a(unknown)
b(unknown)
julia> f2 == @formula(y ~ 1 + a + b)
true
```

(The `(unknown)`

reflects the fact that the schema for these terms is not known until data is provided, see the docs on the lifecycle of a `@formula`

for more information.)

# For developers

While this design provides more flexibility for users, the main motivation was to provide a better foundation for building modeling packages that can consume tabular or other heterogeneously typed data. To that end, Terms 2.0 was designed from the ground up to support extensibility using normal Julian mechanisms of polymorphism and multiple dispatch. For another example from the docs, hereâ€™s how you could add syntax for doing polynomial regression:

```
poly(x, n) = x^n
# type of model where syntax applies: here this applies to any model type
const POLY_CONTEXT = Any
# struct for behavior
struct PolyTerm <: AbstractTerm
term::ContinuousTerm
deg::Int
end
# transform calls to poly into PolyTerms
function StatsModels.apply_schema(t::FunctionTerm{typeof(poly)}, sch,
Mod::Type{<:POLY_CONTEXT})
term = apply_schema(t.args_parsed[1], sch, Mod)
isa(term, ContinuousTerm) ||
throw(ArgumentError("PolyTerm only works with continuous terms (got $term)"))
deg = t.args_parsed[2]
isa(deg, ConstantTerm) ||
throw(ArgumentError("PolyTerm degree must be a number (got $deg)"))
PolyTerm(term, deg.n)
end
function StatsModels.modelcols(p::PolyTerm, d::NamedTuple)
col = modelcols(p.term, d)
reduce(hcat, [col.^n for n in 1:p.deg])
end
```

By dispatching in `apply_schema`

on *both* the surface syntax (`typeof(poly)`

) *and* the modeling context, Terms 2.0 allows for package authors to control how and when different kinds of special syntax are interpreted.

If you want to see this in action, see the work that @dmbates is doing with MixedModels.jl (which uses a special syntax like `(1 + a | subject)`

to represent varying intercept/slope for `a`

for each level of `subject`

).

## Light dependency

The other major change, from a package developerâ€™s perspective, is that Terms 2.0 is a much lighter dependency. This is the `REQUIRES`

file:

```
julia 1.0
CategoricalArrays
Tables
StatsBase 0.22.0
DataStructures
Missings
```

Most importantly, Terms 2.0 **drops the DataFrames dependency** that was a legacy of StatsModelsâ€™ origins in DataFrames. StatsModels now uses Tables.jl as a lightweight table standard, and thus is itself a reasonably â€ślightâ€ť dependency that can be added to modeling packages.

As a practical matter, what this means is that package authors are encouraged to use the new Terms 2.0 API (`schema`

, `apply_schema`

, and `modelcols`

) directly, instead of relying on the old API (where StatsModels defines a general fallback method for `fit(::FormulaTerm, data)`

).

## Streaming and distributed data (planned)

The other major design goal of Terms 2.0 was to make it easier (or at least possible) to implement support for streaming and/or distributed modeling workflows. That is done in part by separating the logic of creating a table-to-matrix transformer into three separate steps, based on what information is available and needed:

- Syntax time (only surface syntax
- Schema time (only types/invariants like unique values for categorical variables)
- Data time (actual data values)

This, in principle, allows you to have fine-grained control over how you extract a schema from your data, and how you apply the transformation represented by each `AbstractTerm`

to your data.

I say â€śin principleâ€ť because at the moment weâ€™ve been careful not to do anything that *prevents* supporting streaming/distributed data, even though we donâ€™t widely *support* it yet. Currently, â€śdata timeâ€ť methods are defined for named tuples, representing either an entire table (if the elements are `AbstractVector`

s, i.e. a Tables.ColumnTable) or a single row. In the future, we hope to add in-place â€śdata timeâ€ť methods, and better support for extracting schema from streaming data (e.g., an iterable of named tuples). PRs for this are, of course, enthusiastically welcomed.