I am developing a longitudinal econometrics package (unobserved effects models). The package works well as it is, but I would like to better integrate it with the Julia ecosystem. However, I have encountered a few issues and would like to point them out:
Regression models in Julia tend to follow this structure:
- Any {Abstract}
  - StatsBase.StatisticalModel {Abstract}
    - StatsBase.RegressionModel {Abstract} | DataFrames.DataFrameStatisticalModel {Concrete}
      - DataFrames.DataFrameRegressionModel {Concrete} | GLM.LinPredModel {Abstract}
- `StatsBase` works as a great common framework for the most part.
- `DataFrames.DataFrameRegressionModel` seems a bit redundant when it could be handled at `StatsBase.RegressionModel`. The only benefit I can think of would be predicting with new data as a `DataFrames.DataFrame`, but that could be a minor tweak and is probably best handled at `StatsBase.RegressionModel`.
- `GLM` is surprisingly rigid in its schemes. I attempted to place my structs as children of `GLM.LinPredModel`, but it seems more of a struggle than a benefit.
- Packages such as `CovarianceMatrices` have coded their implementations as extensions of mostly `DataFrames.DataFrameRegressionModel` and `GLM.LinPredModel`. Most of the functions are overly restrictive, for example by calling `getfield` directly rather than using a common API.
The best alternative would be to use a common API, following the `StatsBase` design, for better development and integration. For example, if one needs to get the response variable of a regression model, DO NOT use `getfield`; use `StatsBase.model_response` instead. That way, if a new development occurs for any child of `StatsBase.RegressionModel`, it can be seamlessly integrated. It also helps with a consistent definition of terms such as `dof` or `dof_residual`, which could be referred to as `mdf` and `rdf` and which in some instances include the intercept or not, depending on the preference of the author.
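To make the accessor idea concrete, here is a minimal sketch. It assumes nothing is installed: `RegressionModel` and `model_response` below are local stand-ins for the `StatsBase` names, and `PooledOLS`, `WithinEstimator`, and `mean_response` are hypothetical names made up for illustration.

```julia
# Local stand-ins for StatsBase.RegressionModel and StatsBase.model_response,
# so the sketch runs without any package. Real code would extend the
# StatsBase definitions instead of declaring its own.
abstract type RegressionModel end
function model_response end

# Two hypothetical model types that store the response under different field names.
struct PooledOLS <: RegressionModel
    y::Vector{Float64}
end

struct WithinEstimator <: RegressionModel
    response::Vector{Float64}
end

# Each type implements the accessor; field names stay private to the type.
model_response(m::PooledOLS) = m.y
model_response(m::WithinEstimator) = m.response

# Downstream code written against the accessor works for any subtype,
# whereas getfield(m, :y) would fail for WithinEstimator.
function mean_response(m::RegressionModel)
    y = model_response(m)
    return sum(y) / length(y)
end
```

A downstream package only needs `mean_response(PooledOLS([1.0, 2.0, 3.0]))` or `mean_response(WithinEstimator([2.0, 4.0]))`; it never has to know how either struct is laid out.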
@nosferican there are several problems with the overly restrictive typing of StatsBase and GLM. In "CovarianceMatrices" I have tried to get around those, which resulted in other restrictive choices. However, if you need to interface your package with CovarianceMatrices, we can work together and make it work (giving me the incentive to widen the typing).
I think a MetricsBase package that establishes a common API for commonly used econometrics models could be useful.
Hi @gragusa, I think that would be a welcome addition. I am familiar with the source code for `CovarianceMatrices` and with my own implementation of the HC and CRVE estimators, so I think we can probably get a solution that allows good integration among `StatsBase`, `DataFrames`, `GLM`, and similar packages. I will set up my package to play as well as possible as a `StatsBase.RegressionModel`, and that way we can work on making `CovarianceMatrices` play nicely with mine.
As for a potential `MetricsBase`, I could see it as a child of `StatsBase.RegressionModel`, similar to how `StatsBase.RegressionModel` is a child of `StatsBase.StatisticalModel`, adding certain functionality that is proper to regression models but not to all statistical models.
The trick is to be flexible enough to encompass most if not all possibilities. For example, my package has endogenous models which are estimated by instrumental variables (2SLS for now). Holding an extra two matrices for the endogenous variables and instruments can be made to fit a similar structure, but `Rsq` is no longer valid for those models, and that should be flagged. The issue of `nobs` is a widely recognized one; in my case, for example, it differs from `N := number of cross-sectional units` and `n := number of panels` when estimating the between estimator. Likewise, those values might change in the first-difference estimator. A very important one is how to calculate the residual_dof (rdf) for the within estimator in fixed effects models, since it must account for the n - 1 additional degrees of freedom, which one would ignore by assuming rdf = nobs - dof (where dof is the model degrees of freedom, mdf). The biggest limitations of the `GLM` structs are that they lack any IV or longitudinal elements and how tight they are. I believe `DataFrames.DataFrameRegressionModel` is just not that useful because it is so restrictive.
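The degrees-of-freedom point can be written down as a small sketch. The function names are hypothetical, and it assumes that `dof` counts the intercept plus the slope coefficients and that `n` is the number of panels absorbed as fixed effects.

```julia
# Naive residual degrees of freedom: ignores the absorbed fixed effects.
naive_dof_residual(nobs, dof) = nobs - dof

# Within estimator: the n panel effects cost n - 1 further degrees of freedom
# beyond the intercept already counted in `dof` (assumption stated above).
within_dof_residual(nobs, dof, n) = nobs - dof - (n - 1)

# Illustrative numbers only: 100 observations, 10 panels,
# intercept plus 2 slopes (dof = 3).
rdf_naive  = naive_dof_residual(100, 3)
rdf_within = within_dof_residual(100, 3, 10)
```

With these illustrative numbers the naive count is 97 while the within count is 88, so standard errors computed from the naive value would be too small.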
I will make a few comments on `CovarianceMatrices.jl` to get the ball rolling.
The only reason `DataFrameRegressionModel` exists is to store the information about the formula to support `predict` with a data frame, and to print the names of coefficients from `coeftable`. Normally you shouldn't have to deal with it; you just need to make your model type a subtype of `StatsBase.StatisticalModel`, and everything should work. This system is going to change anyway with the new `StatsModels` or `StreamModels` packages. But why do you find it restrictive? If `nobs` does not apply to your model, feel free to leave that method undefined, or make it return a tuple with `(N, n)`. You can also override the `r²`/`r2` function as you need.
OTOH GLM.jl is rigid, probably because it wasn't designed with the idea that it could be extended by external packages. But if you have identified things that should be changed, PRs would be welcome.
Aye. I suspected that `DataFrames.DataFrameRegressionModel` included those for `StatsBase.predict`. However, one issue I had was that in some cases the resulting design matrix is not full rank, so some variables are dropped from the regression. I therefore needed to keep track of which variables and estimators are in the final model, which differ from those in the `DataFrames.ModelFrame`. Rather than saving the `DataFrames.ModelFrame` and `DataFrames.ModelMatrix`, why not make a function for `StatsBase.predict` that takes a struct of coefficient names and values (maybe even a dictionary), a `DataFrames.Formula`, and a `DataFrames.DataFrame`? It would probably be more memory efficient and flexible. For my package, the `DataFrames.ModelMatrix` is not useful, since the whole workhorse is transforming the design matrix. One difference from the `GLM` family and link framework, which separates the LinPred and Response, is that I have to make the transformations based on multiple LinPred and fitted models.
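A rough sketch of that kind of `predict` follows. It assumes a purely additive model with numeric columns, leaves the formula argument out entirely, and uses a `NamedTuple` of vectors as a stand-in for a `DataFrames.DataFrame`; `predict_from_coefs` is a hypothetical name.

```julia
# Hypothetical helper: predict from named coefficients and a column table.
# `data` is a NamedTuple of equal-length vectors standing in for a DataFrame.
# Only variables that survived estimation appear in `coefs`, so columns that
# were dropped as redundant are simply never touched.
function predict_from_coefs(coefs::Dict{Symbol,Float64}, data::NamedTuple)
    intercept = Symbol("(Intercept)")
    n = length(first(data))
    yhat = fill(get(coefs, intercept, 0.0), n)   # start from the intercept, if any
    for (name, beta) in coefs
        name == intercept && continue
        yhat .+= beta .* data[name]              # look the column up by coefficient name
    end
    return yhat
end

# :x2 is present in the data but absent from the fitted coefficients.
coefs   = Dict(Symbol("(Intercept)") => 1.0, :x1 => 2.0)
newdata = (x1 = [1.0, 2.0], x2 = [5.0, 6.0])
yhat    = predict_from_coefs(coefs, newdata)
```

Because the lookup is driven by the coefficient names, nothing has to be stored from the original model frame; handling interactions or categorical terms would still need the formula.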
While `GLM` can just throw an error if the design matrix is not invertible, for longitudinal analysis some estimators will only consider time-variant variables and others all variables. Hence, in my case, having a formula that leads to a non-full-rank design matrix at some steps is actually valid (one needs to drop the redundant variables and then estimate the model) and must be handled differently.
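One simple way to drop redundant variables is sketched below. This is not how the package does it; it is a greedy illustration using `rank` from the standard `LinearAlgebra` library, and `independent_columns` is a hypothetical name.

```julia
using LinearAlgebra

# Greedy selection: keep a column only if adding it increases the rank of the
# columns kept so far. Returns the indices of the retained columns, which can
# also be used to keep the coefficient names in sync with the final model.
function independent_columns(X::AbstractMatrix)
    keep = Int[]
    for j in axes(X, 2)
        candidate = X[:, vcat(keep, j)]
        if rank(candidate) == length(keep) + 1
            push!(keep, j)
        end
    end
    return keep
end

# Column 3 duplicates column 2, so the full matrix is rank deficient.
X = [1.0 2.0 2.0 0.0;
     1.0 3.0 3.0 1.0;
     1.0 5.0 5.0 0.0;
     1.0 7.0 7.0 1.0]
keep = independent_columns(X)
X_reduced = X[:, keep]
```

Repeated rank computations are wasteful for wide matrices; a pivoted QR factorization would be the more usual tool, but the greedy loop keeps the left-to-right ordering of variables explicit.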
The approach I chose was to make my structs children of `StatsBase.RegressionModel`, which inherits from `StatsBase.StatisticalModel`, and implement all the methods for both. I found `StatsBase.RegressionModel` and its parent to be very useful.
I am familiar with the `StatsModels` package, as I had originally coded my package using `DataTables` and `StatsModels`, until the decision to keep development in `DataFrames` after the rethinking of `NullableArrays` for `DataTables`.
In the case of `DataFrames.DataFrameRegressionModel`, I wasn't a fan of having the constructor call `StatsBase.coeftable` for output. I much preferred the alternative approach of changing the `Base.show` method of the model to display:
Model has been successfully fitted. To display results use `coeftable(model)`.
This approach allows the user to select which variance-covariance estimator they want to use for the standard errors, the t-distribution, the Wald test, the \alpha for `confint`, etc. This might be a better approach for `GLM` as well, now that `CovarianceMatrices` is available for `GLM` and `GLM` has implemented F-tests (which could be expanded to a Wald test with `CovarianceMatrices`).
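As a minimal sketch of that display approach: `FittedModel` and `standard_errors` are hypothetical names, and the variance-covariance matrix is passed in by the caller to stand in for whichever estimator the user picks.

```julia
using LinearAlgebra

# Hypothetical fitted model that stores only the coefficient estimates.
struct FittedModel
    coef::Vector{Float64}
end

# Printing the model shows a short notice instead of a full coefficient table.
function Base.show(io::IO, ::FittedModel)
    print(io, "Model has been successfully fitted. To display results use `coeftable(model)`.")
end

# Standard errors are computed later, from whichever variance-covariance
# estimate the user chooses (classical, HC, cluster-robust, ...).
standard_errors(vcov_matrix::AbstractMatrix) = sqrt.(diag(vcov_matrix))

model = FittedModel([0.5, 1.5])
notice = sprint(show, model)
se = standard_errors([4.0 0.0; 0.0 9.0])
```

Keeping the table out of `show` means fitting and inference stay separate steps, so the same fitted model can be summarized under several covariance estimators without refitting.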
The `GLM` code shows that it wasn't designed to be extended, which made sense for one of the earlier packages in the ecosystem. However, if `GLM` could be opened up to focus on the transformations based on `Family` and `Link` and provide some solvers, it could be used by several other packages. I could think of regularized regression packages like `Lasso` or elastic net playing very nicely with `GLM`.