I find R's DSL for defining regression a bit mind-blowing. GLM.jl and others have tried to "adopt" it. What would an ideal DSL for defining regression look like in Julia?

xiaodai · October 1, 2024, 2:50pm

In R, glm(x ~ -1 + y + z) fits a GLM without an intercept, using y and z as predictors, so the model fit yields two coefficients.

I think GLM.jl has adopted the ~ DSL. But I’ve always found R’s regression DSL to be wacky. Another example: y ~ x^2 asks for interaction effects of x rather than fitting a coefficient for x squared. To do that you need y ~ I(x^2).
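For readers puzzled by why y ~ x^2 does not square x: in this notation `^` is a formula operator, not arithmetic. A minimal Python sketch of that expansion rule follows. It is not R's or GLM.jl's implementation, and `expand_power` is a hypothetical helper written only for illustration.

```python
from itertools import combinations

def expand_power(variables, degree):
    """Expand (v1 + v2 + ...)^degree the way formula notation reads it:
    every main effect plus every interaction of up to `degree` distinct
    variables. Interactions are written with R's `:` operator."""
    unique = list(dict.fromkeys(variables))  # drop repeats, keep order
    terms = []
    for k in range(1, degree + 1):
        for combo in combinations(unique, k):
            terms.append(":".join(combo))
    return terms

print(expand_power(["x"], 2))            # a single variable has nothing to interact with
print(expand_power(["a", "b", "c"], 2))  # main effects plus pairwise interactions
```

With a single variable there is no second variable to interact with, so x^2 collapses to x alone, which is why I(x^2) is needed to get an actual squared predictor.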

I don’t know the origin of R’s regression DSL but I find it strange that Julia’s main package has adopted it since it feels so unintuitive.

Are there modern alternatives? I don’t imagine there is a lot of research on the DSL.

pdeffebach · October 1, 2024, 2:53pm

I don’t have any answer from the Julia side, but I, and all the other economists I know, use the excellent fixest package, which has a much more robust DSL and also allows programmatic formula construction. Ideally Julia would look to that better package for a DSL when designing an API.

dlakelan · October 1, 2024, 3:16pm

A lot of people use and know R’s syntax so I’m guessing that’s why it was used in the GLM / @formula macro.

I find the formula macro to be fine for basic models, fitting a line and such, but if I want anything that’s even slightly beyond that, I immediately turn to Turing.jl

TheCedarPrince · October 1, 2024, 4:13pm

I know that @rimhajal and @lrnv’s work on NetSurvival.jl (JuliaSurv/NetSurvival.jl on GitHub, a pure-Julia take on standard net survival routines) uses similar syntax; just another data point to add to the discussion here.

lrnv · October 1, 2024, 4:32pm

Thanks for the ping @TheCedarPrince. Basically, the implementation of @formula in Julia is far superior to R’s old DSL, while allowing the same syntax to be kept and extended. Yes, the specifics of the syntax might feel odd to a new user, and no, I do not know where it originated.

But do read its docs: they are well written and highlight certain features such as programmatic construction of formulas (mentioned in this thread for the competition too) and extensibility.

In the JuliaSurv org, we use it to model survival analysis regressions such as hazard regressions, matching the old R syntax once more. Allowing people to switch easily was one of the strengths of this decision.

ericphanson · October 1, 2024, 4:50pm

This notation is called Wilkinson notation, and it isn’t only used by R.

nilshg · October 1, 2024, 5:27pm

Thank you! I was googling like a madman trying to find the name, even asking LLMs, but couldn’t find it!

nilshg · October 1, 2024, 5:30pm

Peter knows his way around the regression ecosystem in Julia and R (and I assume Stata as well). Without putting words into his mouth, what he might have been referring to is that programmatically constructing things like fixed effects and instrumental variables is easier in fixest than it is in FixedEffectModels.

xiaodai · October 4, 2024, 2:35am

It seems to be just an extension of the existing syntax.

pdeffebach · October 4, 2024, 2:31pm

Yeah, basically. It just has more features and great documentation. I like its syntax for interaction terms and variable slopes.

I don’t actually construct things programmatically in a fancy way though… I just put a lot of effort into string interpolation.

dave.f.kleinschmidt · October 4, 2024, 5:44pm

we actually intentionally do not include this in StatsModels.jl basic formula DSL. that really is just ~, +, &, and (much to many folks’ chagrin), *.
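As a rough illustration of how small that base operator set is, here is a Python sketch of what `*` stands for in this notation: both main effects plus their `&` interaction. This is not StatsModels.jl code; `expand_star` and `render` are hypothetical names used only for this example.

```python
def expand_star(left, right):
    """Expand `left * right` into both main effects plus their
    interaction, written with `&` as in the formula DSL."""
    return [left, right, f"{left}&{right}"]

def render(response, terms):
    """Join a response name and a list of terms into a formula string."""
    return f"{response} ~ " + " + ".join(terms)

print(render("y", expand_star("a", "b")))  # y ~ a + b + a&b
```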

the design of StatsModels.jl is such that you don’t need to add a bit of syntax to the base DSL in order to be able to use it in a specific package. that is, you have to opt in to any additional syntax by loading a package that defines it. see for instance RegressionFormulae.jl, a tiny package that adds the / syntax for “nested interactions” and the ^n for “all interactions up to the nth degree” (the example you called out as wacky).

the syntax extensions can also be scoped by the kind of model that’s been fit, so for instance MixedModels.jl can define | to mean one thing when you’re fitting a mixed model without “squatting” on that syntax for users of other packages, even when they’re loaded simultaneously.

also, like others have said, the notation is pretty universal (there’s support for it via libraries like patsy in python and uh some rust crate that I dug up recently but can’t remember now, although that one was pretty primitive, just main effects).

last thing I’d add is that StatsModels.jl via the “programmatic interface” provides very basic building blocks that anyone can build a better DSL on top of if they wish! the DSL is really just a convenience layer, not something that is in any way required or fundamental to the table-to-matrix transformations that are necessary to fit regression and other models.
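To make the "table-to-matrix" idea concrete, here is a language-neutral Python sketch of what any formula front end ultimately has to produce. It assumes purely numeric columns and ignores categorical coding; `model_matrix` is a hypothetical helper, not part of StatsModels.jl or its programmatic interface.

```python
def model_matrix(rows, terms, intercept=True):
    """Turn a table (list of dicts of numeric columns) into a design
    matrix (list of lists). A term like "a&b" becomes the product of
    columns a and b in each row; categorical coding is left out."""
    matrix = []
    for row in rows:
        values = [1.0] if intercept else []
        for term in terms:
            product = 1.0
            for name in term.split("&"):
                product *= row[name]
            values.append(product)
        matrix.append(values)
    return matrix

table = [{"a": 1.0, "b": 2.0}, {"a": 3.0, "b": 4.0}]
X = model_matrix(table, ["a", "b", "a&b"])
print(X)  # one row per observation: intercept, a, b, a*b
```

Any DSL, Wilkinson-style or otherwise, only needs to produce the list of terms; the matrix-building step stays the same.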
