Julia Ecosystem (respecting hierarchy and common API) - Statistical Models

I am developing a longitudinal econometrics package (unobserved effects models). The package works well as it is, but I would like to better integrate it with the Julia ecosystem. However, I have encountered a few issues and would like to point them out:

Regression models in Julia tend to follow this type hierarchy:

  • Any {Abstract}
  • StatsBase.StatisticalModel {Abstract}
  • StatsBase.RegressionModel {Abstract} | DataFrames.DataFrameStatisticalModel {Concrete}
  • DataFrames.DataFrameRegressionModel {Concrete} | GLM.LinPredModel {Abstract}

  1. StatsBase works as a great common framework for the most part.
  2. DataFrames.DataFrameRegressionModel seems a bit redundant when it could be handled at StatsBase.RegressionModel. The only benefit I can think of would be for predicting with new data as a DataFrames.DataFrame, but that could be a minor tweak and probably best handled at StatsBase.RegressionModel.
  3. GLM is surprisingly rigid in its schemes. I attempted to place my structs as children of GLM.LinPredModel, but it seems to be more of a struggle than a benefit.
  4. Packages such as CovarianceMatrices have coded their implementations mostly as extensions of DataFrames.DataFrameRegressionModel and GLM.LinPredModel. Most of the functions are overly restrictive, for example by calling getfield directly rather than using a common API.

The best alternative would be to use a common API for better development and integration, following the StatsBase design. For example, if one needs to get the response variable of a regression model, DO NOT use getfield; use StatsBase.model_response instead. That way, any new child of StatsBase.RegressionModel can be seamlessly integrated. It also helps to keep the definition of terms consistent, such as dof or dof_residual, which are sometimes referred to as mdf and rdf and which may or may not include the intercept depending on the preference of the author.
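To make the getfield point concrete, here is a minimal sketch. Everything in it is a local stand-in so that it runs without packages: `AbstractRegression` plays the role of StatsBase.RegressionModel, `response` plays the role of StatsBase.model_response, and `OLSFit`, `WithinFit`, `fragile_mean` and `robust_mean` are hypothetical names.

```julia
# Local stand-ins so the sketch runs without any packages; in practice the
# abstract type would be StatsBase.RegressionModel and the accessor
# StatsBase.model_response.
abstract type AbstractRegression end

# Generic accessor with no methods yet: each model type adds its own.
function response end

struct OLSFit <: AbstractRegression
    y::Vector{Float64}
end
response(m::OLSFit) = m.y

# A panel model that stores its transformed response under another field name.
struct WithinFit <: AbstractRegression
    ytilde::Vector{Float64}
end
response(m::WithinFit) = m.ytilde

# Fragile: assumes every model has a field literally called `y`.
fragile_mean(m) = sum(getfield(m, :y)) / length(getfield(m, :y))

# Robust: works for any subtype that implements `response`.
robust_mean(m::AbstractRegression) = sum(response(m)) / length(response(m))

robust_mean(OLSFit([1.0, 2.0, 3.0]))       # 2.0
robust_mean(WithinFit([-1.0, 0.0, 1.0]))   # 0.0
# fragile_mean(WithinFit([-1.0, 0.0, 1.0])) throws: WithinFit has no field `y`
```

A downstream package written against the accessor keeps working when a new model type appears, while the getfield version has to be patched for every new struct layout.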

@nosferican there are several problems with the overly restrictive typing of StatsBase and GLM. In ‘CovarianceMatrices’ I have tried to get around those, which resulted in other restrictive choices. However, if you need to interface your package with CovarianceMatrices, we can work together and make it work (giving me the incentive to widen the typing).

I think a MetricsBase package that establishes a common API for commonly used econometrics models could be useful.

Hi @gragusa, I think that would be a welcome addition. I am familiar with the source code for CovarianceMatrices and with my own implementation of the HC and CRVE estimators, so I think we can probably get a solution that allows good integration among StatsBase, DataFrames, GLM, and similar packages. I will set up my package to play as well as possible as a StatsBase.RegressionModel, and that way we can work on making CovarianceMatrices play nice with mine.

As for a potential MetricsBase, I could see it as a child of StatsBase.RegressionModel, similar to how StatsBase.RegressionModel is a child of StatsBase.StatisticalModel that adds functionality specific to regression models but not shared by all statistical models.

The trick is to be flexible enough to encompass most, if not all, possibilities. For example, my package has endogenous models, which are estimated with instrumental variables (2SLS for now). Holding an extra two matrices for the endogenous variables and the instruments can be worked into a similar structure, but Rsq is no longer valid for those models, and they should be identified as such. The issue of nobs is one recognized throughout; in my case, for example, it is different from N := number of cross-sectional units, n := number of panels when estimating the between estimator. Likewise, those values might change in the first-difference estimator. A very important one is how to calculate the residual degrees of freedom (rdf) for the within estimator with fixed effects models, since it must account for the n - 1 additional degrees of freedom, which one would ignore by assuming rdf = nobs - dof (dof = model degrees of freedom, mdf). The biggest limitations of the GLM structs are that they lack any IV or longitudinal elements and that they are so tight. I believe DataFrames.DataFrameRegressionModel is just not that useful because it is so restrictive.
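As a small worked example of the degrees-of-freedom point, assuming that mdf counts the intercept and that n is the number of panels whose effects are absorbed (`naive_rdf` and `within_rdf` are hypothetical helper names, not StatsBase functions):

```julia
# Hypothetical helpers, not part of StatsBase. `mdf` is assumed to count the
# intercept, and `n` is the number of panels whose effects are absorbed.
naive_rdf(nobs, mdf) = nobs - mdf
within_rdf(nobs, mdf, n) = nobs - mdf - (n - 1)

# 100 panels observed over 5 periods, 3 slopes plus an intercept.
naive_rdf(500, 4)         # 496
within_rdf(500, 4, 100)   # 397
```

With these numbers the naive rule overstates the residual degrees of freedom by 99, which would understate the standard errors.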

Will make a few comments on CovarianceMatrices.jl to get the ball rolling.

The only reason DataFrameRegressionModel exists is to store the information about the formula to support predict with a data frame, and to print the names of coefficients from coeftable. Normally you shouldn’t have to deal with it: you just need to make your model type a subtype of StatsBase.StatisticalModel, and everything should work. This system is going to change anyway with the new StatsModels or StreamModels packages. But why do you find it restrictive? If nobs does not apply to your model, feel free to leave that method undefined, or make it return a tuple with (N, n). You can also override the r²/r2 function as you need.

OTOH GLM.jl is rigid because it probably hasn’t been designed with the idea that it could be extended by external packages in mind. But if you have identified things that should be changed, PRs would be welcome.

Aye. I suspected that DataFrames.DataFrameRegressionModel included those for StatsBase.predict. However, an issue I had was that in some cases the resulting design matrix is not full rank, so some variables are dropped from the regression. I therefore needed to keep track of which variables and estimates are in the final model, since these differ from those in the DataFrames.ModelFrame. Rather than saving the DataFrames.ModelFrame and DataFrames.ModelMatrix, why not make a function for StatsBase.predict that takes a struct of coefficient names and values (maybe even a dictionary), a DataFrames.Formula and a DataFrames.DataFrame? It would probably be more memory efficient and flexible. For my package, the DataFrames.ModelMatrix is not useful, since the whole workhorse is transforming the design matrix. One difference from the GLM family and link framework, which divides the model into LinPred and Response, is that I have to make the transformations based on multiple LinPred objects and fitted models.
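A minimal sketch of that predict idea, with loud assumptions: `predict_from_coefs` is a hypothetical helper, the new data is passed as a plain Dict of columns instead of a DataFrames.DataFrame, and the formula step is skipped entirely, so the columns are assumed to already be in their final numeric form.

```julia
# Hypothetical helper: predicts from a name => value mapping of the
# coefficients that survived estimation. Columns are passed as a plain
# Dict here; a real version would build them from a Formula and a DataFrame.
function predict_from_coefs(coefs::Dict{String,Float64},
                            columns::Dict{String,Vector{Float64}},
                            nrows::Int)
    yhat = fill(get(coefs, "(Intercept)", 0.0), nrows)
    for (name, beta) in coefs
        name == "(Intercept)" && continue
        yhat .+= beta .* columns[name]
    end
    return yhat
end

coefs = Dict("(Intercept)" => 1.0, "x" => 2.0)            # "z" was dropped
newdata = Dict("x" => [0.0, 1.0, 2.0], "z" => [5.0, 5.0, 5.0])
predict_from_coefs(coefs, newdata, 3)                      # [1.0, 3.0, 5.0]
```

Because only the surviving coefficients are stored, a variable dropped for collinearity (here "z") is simply ignored at prediction time, and nothing the size of the original model matrix has to be kept around.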

While GLM can just throw an error if the design matrix is not full rank, in longitudinal analysis some estimators consider only time-variant variables and others consider all variables. Hence, in my case a formula that leads to a non-full-rank design matrix at some steps is actually valid (one needs to drop the redundant variables and then estimate the model) and must be handled differently.
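One way the drop-redundant-variables step could look is sketched below. This is a simple greedy check built on `rank` from the LinearAlgebra standard library, not the method used by my package or by GLM; `independent_columns` is a hypothetical name, and a pivoted factorization would be the more efficient choice for large matrices.

```julia
using LinearAlgebra

# Greedy sketch: keep a column only if it raises the rank of the columns
# kept so far. Returns the indices of the retained columns.
function independent_columns(X::AbstractMatrix{<:Real})
    keep = Int[]
    for j in axes(X, 2)
        if rank(X[:, vcat(keep, j)]) > length(keep)
            push!(keep, j)
        end
    end
    return keep
end

# Columns after a within (demeaning) transformation for one panel:
# 1 = time-varying regressor, 2 = time-invariant regressor (all zeros
# once demeaned), 3 = exact multiple of column 1.
X = [-1.0 0.0 -2.0;
      0.0 0.0  0.0;
      1.0 0.0  2.0]
independent_columns(X)   # [1]
```

The returned indices are what has to be tracked alongside the coefficient names, since they no longer line up with the columns implied by the original formula.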

The approach I chose was to make my structs children of StatsBase.RegressionModel which inherits from StatsBase.StatisticalModel and implement all the methods for both. I found the StatsBase.RegressionModel and its parent to be very useful.

I am familiar with the StatsModels package as originally I had coded my package using DataTables and StatsModels until the decision to keep development in DataFrames after the rethinking of NullableArrays for DataTables.

In the case of DataFrames.DataFrameRegressionModel, I wasn’t a fan of having the constructor call StatsBase.coeftable for output. I much preferred the alternative approach of changing the Base.show method of the fitted model to display:
Model has been successfully fitted. To display results use `coeftable(model)`.
This approach allows the user to select which variance-covariance estimator they want to use for the standard errors, the t-distribution, the Wald test, \alpha for confint, etc. This might be a better approach for GLM as well, now that CovarianceMatrices is available for GLM and GLM has implemented F-tests (which could be expanded to a Wald test with CovarianceMatrices).
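The display approach described above can be sketched as follows. `PanelFit` and the local `coeftable` function with its `robust` keyword are hypothetical stand-ins written for this post; they are not the StatsBase.coeftable signature, and the two stored matrices just represent whichever estimators the user may choose between.

```julia
# Hypothetical model type and table function, for illustration only.
struct PanelFit
    coef::Vector{Float64}
    vcov_ols::Matrix{Float64}      # classical estimator
    vcov_robust::Matrix{Float64}   # e.g. a heteroskedasticity-robust estimator
end

# Printing the fitted model only reports success; no table is computed here.
Base.show(io::IO, ::PanelFit) =
    print(io, "Model has been successfully fitted. To display results use `coeftable(model)`.")

# The user picks the variance-covariance estimator when asking for results.
function coeftable(m::PanelFit; robust::Bool = false)
    V = robust ? m.vcov_robust : m.vcov_ols
    se = sqrt.([V[i, i] for i in eachindex(m.coef)])
    return (estimate = m.coef, stderror = se, t = m.coef ./ se)
end

m = PanelFit([2.0], fill(4.0, 1, 1), fill(16.0, 1, 1))
coeftable(m).stderror                  # [2.0]
coeftable(m; robust = true).stderror   # [4.0]
```

Keeping the table out of show means the fit itself stays agnostic about inference, and the same fitted object can be summarized under several covariance estimators without refitting.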

The GLM code shows that it wasn’t designed to be extended, which made sense given that it was one of the earlier packages in the ecosystem. However, if GLM could be opened up to focus on the transformations based on Family and Link and to provide some solvers, it could be used by several other packages. I could imagine regularized regression packages such as Lasso or Elastic Net playing very nicely with GLM.