Is there a better DSL (domain-specific language) for defining a formula in linear models?

I have been an R user, and I can see where StatsModels’ @formula macro came from: R! I found @formula(x ~ y - 1), where the -1 fits the model without an intercept, unintuitive. Also, a*b is the same as a + b + a&b, i.e. individual and interaction effects. That was unintuitive too, because I actually just wanted the product a*b. Is there a better language for formulas somewhere that someone can point me to? Maybe it’s implemented in Julia.
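
For concreteness, here is a small sketch of the two spellings I mean. It assumes a StatsModels version with the term-based @formula, where the right-hand side is stored in the rhs field; the variable names y, a and b are only placeholders.

```julia
using StatsModels

# Shorthand: `*` inside @formula means main effects plus interaction.
f_star = @formula(y ~ a * b)

# The same model written out explicitly.
f_long = @formula(y ~ a + b + a & b)

# Inspect the right-hand sides; both should list a, b and a & b.
display(f_star.rhs)
display(f_long.rhs)
```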


There is no need to rely on your intuition; just read the docs.

https://juliastats.github.io/StatsModels.jl/latest/formula/#Modeling-tabular-data-1


If you’re on master (post #71 Terms 2.0) you can wrap things in identity to block the special syntax (so y ~ identity(a*b - 1) will regress y against one less than the product of a and b).
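
A minimal sketch of what that looks like in practice, assuming a StatsModels version that includes the Terms 2.0 changes. The toy data is made up for illustration, and the schema/apply_schema/modelcols pipeline is just one way to evaluate the formula without fitting a model.

```julia
using StatsModels

# Hypothetical toy data as a column table (NamedTuple of vectors).
data = (y = [1.0, 2.0, 3.0], a = [1.0, 2.0, 3.0], b = [2.0, 3.0, 4.0])

# Wrapping the expression in identity blocks the special formula syntax,
# so `*` and `- 1` are ordinary arithmetic here.
f = @formula(y ~ identity(a * b - 1))

# Resolve the terms against the data, then generate the model columns.
f = apply_schema(f, schema(f, data))
resp, pred = modelcols(f, data)

# The predictor column holds a .* b .- 1 evaluated row by row.
display(pred)
```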

As a more general comment, there’s always a tension in supporting DSL features that people who are super familiar with R expect but that others find really unintuitive. I personally hate the “include intercept by default, use 0 or -1 to block” thing, but many people would be surprised if that was removed. In #71 the compromise I came up with is that subtypes of AbstractStatisticalModel use the “classical” behavior, but others use the “obvious” behavior (you only get what you explicitly ask for, at least as far as an intercept/constant column goes).
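
Here is a rough sketch of that compromise. It assumes the model context is passed as a third argument to apply_schema and that the StatisticalModel type is available after using StatsModels (re-exported from StatsBase); both are assumptions about the version you have installed, and the data is made up.

```julia
using StatsModels

# Hypothetical toy data as a column table.
data = (y = [1.0, 2.0, 3.0, 4.0], a = [0.5, 1.5, 2.5, 3.5])

f = @formula(y ~ a)
sch = schema(f, data)

# No model type given: "obvious" behavior, only the columns asked for.
_, X_plain = modelcols(apply_schema(f, sch), data)

# Statistical model context: "classical" behavior with an implicit intercept.
_, X_classic = modelcols(apply_schema(f, sch, StatisticalModel), data)

# Compare the number of predictor columns in each case.
@show size(X_plain, 2) size(X_classic, 2)
```

If the behavior is as described above, the second matrix has one more column than the first: the constant column.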


And at an even more general level, one of the primary goals of #71 Terms 2.0 Son of Terms was to make the formula DSL something people could customize and build on top of, instead of a straight clone of the R formula DSL. You could even go as far as writing your own macro that doesn’t do any of the special syntax (so * is actually just *; the special syntax is all handled at the macro level) but still returns terms, and get all the other benefits of the DSL.


@dave.f.kleinschmidt thank you for your understanding and reasonable suggestions. Developing an alternative macro sounds like an interesting option.

I’ve come to appreciate having a simple way to do things that does not require macros, and in that respect I think a big upgrade to StatsModels has been the possibility to cleanly construct a formula in an explicit, programmatic way: Modeling tabular data · StatsModels.jl
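
As a quick illustration of that macro-free style, here is a minimal sketch using term and the overloaded ~, + and & operators. The column names y, a and b are placeholders, not from any real dataset.

```julia
using StatsModels

# Spell out every term explicitly; no special syntax is involved.
f = term(:y) ~ term(1) + term(:a) + term(:b) + term(:a) & term(:b)

# Because these are ordinary function calls, the predictors can also
# come from a runtime collection of column names.
predictors = [:a, :b]
g = term(:y) ~ foldl(+, term.(predictors))

display(f)
display(g)
```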

In terms of simplifying the “macro-free” API, maybe it’d be useful to have a vararg interaction function to create interaction terms, and a fullinteraction function that creates the full interaction together with all the lower-order (partial) interactions (the names could probably be improved). Something like:

julia> using StatsModels, IterTools

julia> interaction() = ConstantTerm(1) # product of 0 terms
interaction (generic function with 1 method)

julia> interaction(args...) = mapreduce(term, &, args)
interaction (generic function with 2 methods)

julia> function fullinteraction(args...)
           itr = Iterators.filter(!isempty, IterTools.subsets(args))
           mapreduce(v -> interaction(v...), +, itr)
       end
fullinteraction (generic function with 1 method)

julia> interaction(:a, :b, :c)
a(unknown) & b(unknown) & c(unknown)

julia> fullinteraction(:a, :b, :c)
a(unknown)
b(unknown)
a(unknown) & b(unknown)
c(unknown)
a(unknown) & c(unknown)
b(unknown) & c(unknown)
a(unknown) & b(unknown) & c(unknown)
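
The same idea can be sketched without the IterTools dependency by enumerating subsets with a bit mask, and the result can go straight into a formula. This is only an illustrative variant of the functions above, not something StatsModels provides; the names interaction and fullinteraction are the hypothetical ones from this post.

```julia
using StatsModels

# Interaction of all the given column names (a single name is just a term).
interaction(args...) = mapreduce(term, &, args)

function fullinteraction(args...)
    n = length(args)
    # Each mask from 1 to 2^n - 1 selects one non-empty subset of args.
    subsets = [[args[i] for i in 1:n if (mask >> (i - 1)) & 1 == 1]
               for mask in 1:(2^n - 1)]
    mapreduce(s -> interaction(s...), +, subsets)
end

# Use the helper on the right-hand side of a programmatic formula.
f = term(:y) ~ fullinteraction(:a, :b)
display(f)
```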