Wrappers for GLM

pdeffebach · June 18, 2018, 4:26pm

Based on Chris’s comment here, this is something I have been wondering about myself. It would be nice to get some discussion going about the regression ecosystem.

We have StatsModels.jl, which is used to create formulas and is centered around the formula object, and then GLM, which is used to run regressions and is centered around the model object.

If you make a new MLE estimator, say, a censored tobit model, is it standard practice right now to make sure that

1. your function always takes the form censoredtobit(f::StatsModels.Formula, data::AbstractDataFrame; args...), and
2. the output is always a GLM.model object such that coef, stderror etc. always work?

Is it desirable to enforce this kind of behavior? Would PRs be welcome to help standardize this across other packages?

Tamas_Papp · June 18, 2018, 5:28pm

This may make sense for a large family of models. See, e.g., the glm function in R for an example.

pdeffebach · June 18, 2018, 5:42pm

With good standards early on, we can improve considerably on R. There, glm, felm etc. are so different that the broom package exists to tie them all together. But the broom maintainers have the difficult job of writing a function for every disparate model output.

I guess this is controversial, however.

FixedEffectModels.jl seems to export a lighter-weight object, while GLM output stores the full dataset in the model. I presume this distinction has been discussed extensively before w.r.t. GLM.jl.

Tamas_Papp · June 18, 2018, 5:57pm

Possibly, but there is a trade-off here:

1. have an interface that more or less does what other languages do, relatively quickly, or
2. come up with something better, which may take years to polish (similar packages in other languages have a lot of time invested in them).

It is more exciting to do 2. of course, but it does not address the original point you linked from the other thread. Personally, I would go for 2. also, but that means we accept the fact that Julia won’t be as polished for a while as other languages (I am fine with this).

nalimilan · June 18, 2018, 6:56pm

We already have common abstractions in StatsBase: StatisticalModel and RegressionModel. They are probably incomplete, and we can add things progressively as new kinds of models are implemented. There also seems to be a convention of providing convenience functions like glm which call fit(GeneralizedLinearModel, ...).

Help is always welcome to increase the consistency of the ecosystem, but of course that depends on package maintainers agreeing on the API. Sometimes it takes some work to find what’s the best system to suit everyone’s needs. See for example this StatsModels issue.

This one about the DataFrameRegressionModel wrapper is also relevant.
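A minimal sketch of what implementing these abstractions can look like for a new estimator. The type ToyOLS and the helper toyols are made up for illustration and are not part of StatsBase or GLM; only the accessor functions being extended (fit, coef, vcov, stderror, nobs, dof_residual, residuals) are assumed to come from StatsBase.

```julia
using LinearAlgebra, StatsBase

# Hypothetical toy estimator; the name ToyOLS exists only in this sketch.
struct ToyOLS <: StatsBase.RegressionModel
    X::Matrix{Float64}
    y::Vector{Float64}
    β::Vector{Float64}
end

# Matrix-based entry point: no DataFrames or formulas required.
function StatsBase.fit(::Type{ToyOLS}, X::AbstractMatrix, y::AbstractVector)
    return ToyOLS(X, y, X \ y)   # least-squares coefficients
end

# Extend the shared accessors so generic code can query the model.
StatsBase.coef(m::ToyOLS) = m.β
StatsBase.nobs(m::ToyOLS) = length(m.y)
StatsBase.dof_residual(m::ToyOLS) = nobs(m) - length(m.β)
StatsBase.residuals(m::ToyOLS) = m.y - m.X * m.β

function StatsBase.vcov(m::ToyOLS)
    σ² = sum(abs2, residuals(m)) / dof_residual(m)
    return σ² * inv(m.X' * m.X)
end

StatsBase.stderror(m::ToyOLS) = sqrt.(diag(vcov(m)))

# Convenience function in the spirit of glm(...) forwarding to fit(...).
toyols(X, y) = fit(ToyOLS, X, y)

X = [ones(5) [1.0, 2.0, 3.0, 4.0, 5.0]]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
model = toyols(X, y)
```

With this in place, coef(model) and stderror(model) work the same way they do for a GLM fit, without the estimator depending on GLM itself.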

ExpandingMan · June 18, 2018, 7:05pm

+1 for using StatisticalModel and RegressionModel from StatsBase. StatsBase is already widely used, and this layout is already fairly similar to the scikit-learn interface which I think it’s safe to say has proven itself.

It would be really nice if all regressions could implement this interface and we could expand the standard if necessary.

I’m less fond of the StatsModels approach as it seems to depend on data formats being tabular.

pdeffebach · June 18, 2018, 7:07pm

Thanks for the links. This is exactly what I was looking for. It looks like I am a long way from being able to make a PR, and this is a complicated topic that doesn’t have easy solutions. I’ll retreat into the woodwork and try out the ecosystem more for a while before getting ahead of myself.

pdeffebach · June 18, 2018, 7:23pm

Side note: you know what would be cool? A glm function that takes a dataframe as its first input, so you could do

    out = @> df begin
        @where(:x4 .> 0.5)
        lm(m)
    end

nalimilan · June 18, 2018, 7:33pm

Yes, that’s the whole point, but it’s not incompatible with the more basic approach based on matrices. On the contrary, they are complementary. Ideally new model implementations wouldn’t have to depend on DataFrames (and maybe not even on StatsModels) at all. We’re not there yet but that’s doable.

I see the point of this, but it seems to me that formulas usually come first… I guess providing two equivalent methods with the DataFrame in two different positions wouldn’t hurt since types are unambiguous.
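A sketch of the two-orderings idea, assuming current DataFrames, GLM and StatsModels. The wrapper name fitlm is hypothetical (adding a data-first method to GLM's own lm from outside the package would be type piracy), and Base's |> with anonymous functions stands in for the @> macro used in the earlier post.

```julia
using DataFrames, GLM, StatsModels

# Hypothetical wrapper, not part of GLM: the table may come first or second.
# The argument types differ, so dispatch is unambiguous.
fitlm(f::FormulaTerm, df::AbstractDataFrame) = lm(f, df)
fitlm(df::AbstractDataFrame, f::FormulaTerm) = fitlm(f, df)

df = DataFrame(x4 = [0.1, 0.4, 0.6, 0.7, 0.8, 0.9, 1.0],
               y  = [1.0, 1.3, 2.1, 2.0, 2.6, 2.7, 3.2])
m = @formula(y ~ x4)

# The data-first order is what makes the pipeline style possible.
out = df |>
    (d -> subset(d, :x4 => ByRow(>(0.5)))) |>
    (d -> fitlm(d, m))

# The formula-first order gives the same fit.
same = fitlm(m, subset(df, :x4 => ByRow(>(0.5))))
```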

Nosferican · June 19, 2018, 2:27am

I have quite a bit of experience on this issue… my take has been to use the StatsBase abstraction and API fully. StatsModels is a great tool and I rely on it heavily (you can support any tabular format with IterableTables). In the end, I decided not to wrap GLM and instead implemented a different, optimized Fisher scoring algorithm (GLM uses the Cholesky variant and I prefer the QR three triangular variant… also, my package only does canonical links that dispatch on the type of the outcome variable). For a sad story about depending on GLM, see CovarianceMatrices, which is a great package but was implemented for GLM and so was of limited benefit (the author is currently working on generalizing it to work with the StatsBase abstraction).
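To make the Cholesky-versus-QR distinction concrete, here is a minimal sketch of Fisher scoring (IRLS) for a logistic model with the canonical logit link, where the weighted least-squares step can be solved either way. It is not the code of GLM or of the package described above, the QR solve is a plain least-squares QR rather than the specific triangular variant mentioned, and every function name is made up for illustration.

```julia
using LinearAlgebra

logistic(η) = 1 / (1 + exp(-η))

# Weighted least-squares step via the normal equations and Cholesky.
wls_cholesky(X, z, w) = cholesky(Symmetric(X' * (w .* X))) \ (X' * (w .* z))

# The same step via a QR factorization of the row-scaled design matrix.
function wls_qr(X, z, w)
    sw = sqrt.(w)
    return qr(sw .* X) \ (sw .* z)
end

# Fisher scoring for logistic regression; `solve` picks the linear algebra.
function fisher_scoring(X, y; solve = wls_qr, maxiter = 50, tol = 1e-10)
    β = zeros(size(X, 2))
    for _ in 1:maxiter
        η = X * β
        μ = logistic.(η)
        w = μ .* (1 .- μ)          # working weights (variance function)
        z = η .+ (y .- μ) ./ w     # working response
        βnew = solve(X, z, w)
        done = norm(βnew - β) <= tol * (1 + norm(β))
        β = βnew
        done && break
    end
    return β
end

X = [ones(8) [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]]
y = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0]

β_qr   = fisher_scoring(X, y; solve = wls_qr)
β_chol = fisher_scoring(X, y; solve = wls_cholesky)
```

Both solvers reach the same estimate on well-conditioned data like this; the QR route avoids forming X'WX explicitly, which tends to matter when the design matrix is poorly conditioned.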

From a developer standpoint, many models must be aware of several issues that GLM does not keep track of, for example the correct residual degrees of freedom when absorbing fixed effects. GLM is basically a convenient package for vanilla models. For more complicated models I would use dedicated packages… shameless advertisement: one chapter of my dissertation is Econometrics.jl (will be developed by 0.7), which includes many of these models.
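A small sketch of the degrees-of-freedom issue, using made-up data and only the standard library. Absorbing group fixed effects by demeaning gives the same slope as including one dummy per group, but a routine that only sees the demeaned data would count n - k residual degrees of freedom instead of n - k - G.

```julia
using LinearAlgebra, Statistics

# Made-up data: 3 groups, 3 observations each.
g = [1, 1, 1, 2, 2, 2, 3, 3, 3]
x = Float64[1, 2, 4, 2, 3, 7, 1, 5, 6]
y = Float64[2, 3, 6, 5, 5, 11, 0, 7, 9]

# Within transformation: subtract each observation's group mean.
demean(v, g) = v .- [mean(v[g .== gi]) for gi in g]

xd, yd = demean(x, g), demean(y, g)
β_within = dot(xd, yd) / dot(xd, xd)

# Equivalent dummy-variable regression: x plus one indicator per group.
D = Float64.(g .== permutedims(unique(g)))   # n×G indicator matrix
β_lsdv = ([x D] \ y)[1]

n, k, G = length(y), 1, length(unique(g))
naive_dof   = n - k        # what a fit on the demeaned data alone would report
correct_dof = n - k - G    # accounts for the G absorbed group means
```

Standard errors computed with naive_dof would be too small here, which is the kind of bookkeeping a dedicated fixed-effects package has to carry through.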

pdeffebach · June 19, 2018, 2:52am

Wow, this is an incredible piece of work! I’ll definitely use it as a resource in grad school. All of these packages are making me remember the gap between applied economics and econometrics.

From an end user perspective, I suppose all they care about is if they can use coef(output) and stderror(output), and that the input formula at least looks somewhat like @formula(...). But I can see how the myriad of complexities would make a single API difficult to use for everything.

nalimilan · June 19, 2018, 7:32am

Not only these. It’s often useful to be able to use e.g. nobs, deviance, loglikelihood and so on.
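As a sketch of why these shared accessors matter to end users, the helper below (quicksummary, a hypothetical name) is written only against the StatsBase functions and therefore works unchanged on two different GLM.jl fits. The data are made up.

```julia
using DataFrames, GLM, StatsBase

# Hypothetical helper: relies only on the shared accessor functions,
# so it does not care which model type it is given.
function quicksummary(m::StatsBase.StatisticalModel)
    return (coef = coef(m), stderror = stderror(m), nobs = nobs(m),
            deviance = deviance(m), loglikelihood = loglikelihood(m))
end

df = DataFrame(x = 1:10, y = [1, 2, 2, 4, 5, 7, 9, 12, 15, 20])

ols     = lm(@formula(y ~ x), df)
poisson = glm(@formula(y ~ x), df, Poisson(), LogLink())

s_ols     = quicksummary(ols)
s_poisson = quicksummary(poisson)

for s in (s_ols, s_poisson)
    println(round.(s.coef; digits = 3), "  n = ", s.nobs)
end
```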
