Designing Custom Syntax for @formula

I’m trying to understand the differences between the relatively compact implementation of a custom interpretation of ^ for @formula and the general advice on extending @formula, in particular the example here of how to implement a custom interpretation for poly().

The main source of the difference is that the latter creates a custom Term type, PolyTerm, to hold the expression, and then implements a set of methods to deal with that type. In contrast, the code for ^ expands the terms out immediately in apply_schema without introducing any new term type.

Why the differences, and which is a better model to use?

Perhaps without something like a PowerToTerm for ^ it would be harder to construct formulae programmatically?


I actually go back and forth on this issue. Originally in StatsModels.jl, ALL of the special syntax stuff happened at parse time, inside the macro. So the formula that comes out at the end of a * b has a + b + a&b and no memory of a * b. Recently we’ve started to move more of that into run time by adding methods for things like Base.:*(a::Term, b::Term) = a + b + a & b, rather than having a transformation that works on the Expr that the macro sees. IIRC all that stuff is languishing in Pull Request #183 on JuliaStats/StatsModels.jl ("FunctionTerm is dead, long live FunctionTerm", by kleinschmidt), and I’ve had some second thoughts in the intervening time. Adding all those methods puts an even bigger burden on the compiler, but using a very different approach would require even more dramatic internal (and possibly external) changes (e.g., it could be hard to have stuff like term(:a) * term(:b) work without defining those methods).
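To make the "no memory of a * b" point concrete, here is a minimal sketch that compares the macro route with building the same right-hand side by hand from Term objects. It assumes only that StatsModels is installed and uses term, + and & on terms; the variable names y, a and b are placeholders, and nothing here depends on the * methods from the pull request mentioned above.

```julia
using StatsModels

# Macro route: by the time we see the formula, `a * b` is already
# main effects plus the interaction.
f = @formula(y ~ a * b)

# Run-time route: build the same right-hand side from Term objects,
# spelling out the expansion ourselves.
a, b = term(:a), term(:b)
rhs = a + b + a & b   # `&` binds tighter than `+`, so this is a + b + (a & b)

println(f.rhs)   # terms produced via the macro
println(rhs)     # terms produced programmatically
```

Either way the result is a plain collection of terms, and nothing in it records that the user originally wrote a * b. That is the bookkeeping a dedicated term type would preserve.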

So, all of which is to say, it’s a design decision, and there’s no obviously correct choice. I’d say that the best starting point is PROBABLY a PolyTerm-like approach, rather than the ^ approach. It’s a lot simpler to get the bookkeeping right if you have a 1-1 match between the input FunctionTerm and the output terms. It’s possible to handle 1-to-many transforms, but it can get fiddly (see some of the stuff around / for instance, or how / is handled on the RHS of random effects in MixedModels.jl).