Wrappers for GLM

Based on Chris’s comment here, this is something I have been wondering about myself. It would be nice to get some discussion going about the regression ecosystem.

We have StatsModels.jl, which is used to create formulas and is centered around the formula object, and then GLM.jl, which is used to run regressions and is centered around the model object.

If you make a new MLE estimator, say, a censored tobit model, is it standard practice right now to make sure that

  1. Your function always takes the form censoredtobit(f::StatsModels.Formula, data::AbstractDataFrame; kwargs...), and
  2. The output is always a model object, like GLM’s, so that coef, stderror, etc. always work (see the sketch after this list)?
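A minimal sketch of what that convention could look like, assuming StatsBase’s RegressionModel abstraction; censoredtobit and CensoredTobitModel are hypothetical names, and I write FormulaTerm, the formula type in current StatsModels (it was called Formula in earlier versions):

    using StatsBase, StatsModels, DataFrames

    # Hypothetical result type; the fields are whatever the estimator needs.
    struct CensoredTobitModel <: RegressionModel
        coefs::Vector{Float64}
        vc::Matrix{Float64}
    end

    StatsBase.coef(m::CensoredTobitModel) = m.coefs
    StatsBase.vcov(m::CensoredTobitModel) = m.vc
    # stderror(m) then works via StatsBase's default sqrt.(diag(vcov(m)))

    function censoredtobit(f::FormulaTerm, data::AbstractDataFrame; kwargs...)
        # ... build the model matrix from `f` and `data`, maximize the likelihood ...
        # return CensoredTobitModel(estimates, covariance)
    end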

Is it desirable to enforce this kind of behavior? Would PRs be welcome to help standardize this across other packages?

This may make sense for a large family of models. See the glm function in R, for example.

With good standards early on, we can improve considerably on R, where glm, felm, etc. are so different that the broom package exists to tie them all together. But the broom maintainers have the difficult job of writing a method for every disparate model output.

I guess this is controversial, however.

FixedEffectModels.jl seems to export a lighter-weight object, while GLM’s output stores the full dataset in the model. I presume this distinction has been discussed extensively before w.r.t. GLM.jl.

Possibly, but there is a trade-off here:

  1. have an interface that more or less does what other languages do, relatively quickly,
  2. come up with something better, which may take years to polish (similar packages in other languages have a lot of time invested in them).

It is more exciting to do 2., of course, but that does not address the original point you linked from the other thread. Personally, I would go for 2. too, but that means accepting that Julia won’t be as polished as other languages for a while (I am fine with this).


We already have common abstractions in StatsBase: StatisticalModel and RegressionModel. It’s probably incomplete and we can add things progressively as new kinds of models are implemented. There also seems to be a convention to provide convenience functions like glm, which call fit(GeneralizedLinearModel, ...).
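Concretely, the pattern looks roughly like this; the equivalence below is the convention as GLM.jl follows it (the data here are made up for illustration):

    using GLM, DataFrames

    df = DataFrame(y = [0, 1, 0, 1], x = [1.0, 2.0, 3.0, 4.0])

    # These two calls are equivalent; glm is a thin wrapper that
    # forwards its arguments to fit(GeneralizedLinearModel, ...).
    m1 = fit(GeneralizedLinearModel, @formula(y ~ x), df, Binomial())
    m2 = glm(@formula(y ~ x), df, Binomial())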

Help is always welcome to increase the consistency of the ecosystem, but of course that depends on package maintainers agreeing on the API. Sometimes it takes some work to find the system that best suits everyone’s needs. See for example this StatsModels issue.

This one about the DataFrameRegressionModel wrapper is also relevant.


+1 for using StatisticalModel and RegressionModel from StatsBase. StatsBase is already widely used, and its layout is fairly similar to the scikit-learn interface, which I think it’s safe to say has proven itself.

It would be really nice if all regressions could implement this interface and we could expand the standard if necessary.

I’m less fond of the StatsModels approach as it seems to depend on data formats being tabular.


Thanks for the links. This is exactly what I was looking for. It looks like I am a long way from being able to make a PR, and this is a complicated topic without easy solutions. I’ll retreat into the woodwork and try out the ecosystem for a while before getting ahead of myself.

Side note: you know what would be cool? A glm function that takes a DataFrame as its first argument, so you could do

    # with Lazy's @> threading macro and DataFramesMeta's @where;
    # this needs an lm method that takes the data as its first argument
    out = @> df begin
        @where(:x4 .> 0.5)
        lm(m)
    end

Yes, that’s the whole point, but it’s not incompatible with the more basic approach based on matrices. On the contrary, they are complementary. Ideally new model implementations wouldn’t have to depend on DataFrames (and maybe not even on StatsModels) at all. We’re not there yet but that’s doable.
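Schematically, the two layers already exist in GLM.jl, and the formula method is just a front end over the matrix one:

    # matrix-based core: no dependency on tables at all
    fit(LinearModel, X, y)                       # X::Matrix, y::Vector

    # formula front end layered on top via StatsModels
    fit(LinearModel, @formula(y ~ x1 + x2), df)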

I see the point of this, but it seems to me that formulas usually come first… I guess providing two equivalent methods with the DataFrame in two different positions wouldn’t hurt since types are unambiguous.
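A sketch of the two orders, using a hypothetical mylm and the current StatsModels FormulaTerm type; the argument types keep dispatch unambiguous:

    using DataFrames, GLM, StatsModels

    # formula first (the conventional order)
    mylm(f::FormulaTerm, df::AbstractDataFrame; kwargs...) =
        fit(LinearModel, f, df; kwargs...)

    # data first (convenient for piping); just forwards to the other method
    mylm(df::AbstractDataFrame, f::FormulaTerm; kwargs...) = mylm(f, df; kwargs...)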


I have quite a bit of experience with this issue… my take has been to use the StatsBase abstraction and API fully. StatsModels is a great tool and I rely on it heavily (you can support any tabular format with IterableTables). In the end, I decided not to wrap GLM and instead implemented a different, optimized Fisher scoring algorithm (GLM uses the Cholesky variant, while I prefer the QR variant; my package also only does canonical links, dispatching on the type of the outcome variable). For a sad story about depending on GLM, see CovarianceMatrices: a great package, but it was implemented against GLM and so was of limited benefit (the author is currently working on generalizing it to the StatsBase abstraction).
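For concreteness, here is what the two factorization choices look like for one weighted least squares step of Fisher scoring (an illustrative sketch, not either package’s actual code):

    using LinearAlgebra

    # X: model matrix, w: working weights, z: working response.
    # Cholesky variant: factor the normal equations X'WX.
    step_cholesky(X, w, z) = cholesky(Symmetric(X' * (w .* X))) \ (X' * (w .* z))

    # QR variant: factor the weighted design matrix directly,
    # which avoids squaring the condition number.
    function step_qr(X, w, z)
        s = sqrt.(w)
        qr(s .* X) \ (s .* z)
    end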

From a developer standpoint, many models must be aware of issues GLM does not keep track of, for example the correct residual degrees of freedom when absorbing fixed effects. GLM is basically a convenient package for vanilla models; for more complicated ones I would use dedicated packages… shameless advertisement: one chapter of my dissertation is Econometrics.jl (should be ready by Julia 0.7), which includes many of these models.


Wow, this is an incredible piece of work! I’ll definitely use it as a resource in grad school. All of these packages are reminding me of the gap between applied economics and econometrics.

From an end user perspective, I suppose all they care about is whether they can use coef(output) and stderror(output), and whether the input formula at least looks somewhat like @formula(...). But I can see how the myriad complexities would make a single API difficult to use for everything.

Not only those. It’s often useful to be able to use e.g. nobs, deviance, loglikelihood, and so on.
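All of these are generic functions defined in StatsBase, so a model only needs to add methods for them to join the common API:

    using StatsBase

    # given some fitted `model`, the shared vocabulary includes:
    nobs(model)             # number of observations
    deviance(model)         # model deviance
    loglikelihood(model)    # log-likelihood at the fitted parameters
    aic(model); bic(model)  # information criteria, derived from loglikelihood
    confint(model)          # confidence intervals for the coefficients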
