Google Summer of Code Proposal: Econometrics.jl

Greetings y’all.

I am applying for the Google Summer of Code with Julia.

My proposal draft is here: Proposal. Any comments or suggestions are welcome.

Thanks!

  • Bayoán

PS: The latest draft version can be accessed here.


I gave some specific comments, but in general the grammar is a little rocky at times: you may want to have it double-checked before submitting.

Also, always use specifics. If you say that an API is good for …, show it. Don’t expect the reader to already know about the packages you’re referring to. If you say something will be done in “early summer”, commit to a concrete milestone. Give a general timeline (and justify it).

I hope this helps!


Thanks for the comments. I read over them and they seem quite helpful. It is 04:00 local time, so I will work on addressing them after getting some shut-eye. I will be working on a more detailed proposal based on the journal article for the plm package. Most of the article explains why econometricians need a package that differs from the mixed-models framework, which is the whole raison d’être of the proposal. As an aside, as an economist it took me a couple of readings of that article to finally understand the differences between the mixed-models and econometrics frameworks. MixedModels.jl was the mixed-models R package ported to Julia. The proposal is essentially, at its core, porting that econometrics R package (plm) to Julia.

Sounds interesting (sociologist here). I agree with @ChrisRackauckas’s comments. It would be good to point to code you wrote in Julia, or code in other languages that is close to what you’ll do in the project (i.e., a relatively complex structure with several interacting types, methods, tests…).

Another remark is that you should take into account that the Julia and R communities and package ecosystems are not organized in the same way. There is usually very little (or even no) collaboration and coordination between R package authors, and the base modeling features are frozen. On the contrary, in Julia the goal is to ensure packages share the same framework as much as possible, and the base modeling features are still being developed (first in DataFrames, now in StatsModels, as well as in StatsBase).

So your project should not consist only of copying the design of the plm R package in isolation. You should rather work to integrate missing generic features into base modeling packages and develop frameworks which allow combining packages. For example, robust standard errors are needed for many different model families, so they could be implemented in a generic fashion so that all packages can use them (see for example the CovarianceMatrices package). Also, plm seems to be using a custom pdata.frame type, which is a data frame with an individual identifier column. In Julia, this should be a more generic wrapper type which could be useful for other packages (or an addition to DataFrames/DataTables), e.g. so that diff/lag give correct results automatically, as in the sketch below. Formulas might need some additions (to DataFrames/StatsModels) to support the complex syntax used by plm.
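A very rough sketch of such a wrapper, just to illustrate the idea (the `PanelVector` type and `panellag` function are hypothetical names, not existing API):

```julia
# Hypothetical group-aware wrapper: values plus an individual identifier,
# assumed sorted by (id, time), so that lag/diff respect group boundaries.
struct PanelVector{T}
    values::Vector{T}
    id::Vector{Int}
end

# Lag within each individual: the first observation of each group gets `missing`.
function panellag(p::PanelVector)
    out = Vector{Union{Missing, eltype(p.values)}}(undef, length(p.values))
    for i in eachindex(p.values)
        out[i] = (i == 1 || p.id[i] != p.id[i - 1]) ? missing : p.values[i - 1]
    end
    return out
end

panellag(PanelVector([1.0, 2.0, 3.0, 10.0, 20.0], [1, 1, 1, 2, 2]))
# missing, 1.0, 2.0, missing, 10.0
```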

I think your proposal would be more appealing if you can give concrete details about what the Julia implementation will look like, and show that you understand what will need to be changed in plm’s design to suit Julia. For example, how will the API allow choosing which kind of model to estimate? Maybe create a type for each family, and pass it as the first argument to fit?
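For instance, something like the following sketch (the estimator names and the bare matrix/vector method are only illustrative):

```julia
using StatsBase

# Hypothetical estimator types, one per model family.
abstract type PanelEstimator end
struct PooledOLS    <: PanelEstimator end
struct FixedEffects <: PanelEstimator end

# The family is chosen by dispatching on the type passed as the first argument to fit.
StatsBase.fit(::Type{PooledOLS}, X::AbstractMatrix, y::AbstractVector) = X \ y

fit(PooledOLS, [ones(3) [1.0, 2.0, 3.0]], [1.0, 2.1, 2.9])  # OLS coefficients
```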

Finally, it’s not clear to me how much your project would be able to re-use estimation methods provided by GLM/MixedModels. For a summer, you won’t be able to implement many estimation algorithms on your own. Designing the interfaces would already be a lot of work.


I don’t have Julia code for projects that close to this one. The most similar experience would be the Python code I developed in the Machine Learning Coursera Specialization from the University of Washington (for example, stochastic gradient descent, logistic regression, clustering algorithms, etc.). However, I shouldn’t post the implementations, as there are still students taking those courses and alumni are discouraged from posting solutions to the material online. I might check how refined the Wikipedia pages for those algorithms are and decide whether some could be shared.

I completely agree with your point. Any API I develop would benefit greatly from sharing the same ecosystem and interacting properly with it. For tabular data it would use the DataFrames structure, GLM objects and methods, @formula from StatsModels, etc. CovarianceMatrices.jl is a good example. As an aside, I have worked with my own tapply-like function in Julia, as it is a common tool in R which I haven’t found in Julia. If you do know a good implementation of it, let me know, as it is core to panel data procedures, as you pointed out with diff/lag. Even if developed originally for this package, it should eventually find a better home and be made general enough for general use (i.e., vector, index, function). I would argue it is worth considering eventually adding it to Base, as it is pretty common and general.
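For concreteness, the sort of signature I have in mind is roughly the following (a naive sketch; `tapply` here is my own illustrative name, not an existing Julia function):

```julia
# Apply `f` to the elements of `x` within each group defined by `index`.
function tapply(f, x::AbstractVector, index::AbstractVector)
    groups = Dict{eltype(index), Vector{eltype(x)}}()
    for (v, g) in zip(x, index)
        push!(get!(groups, g, eltype(x)[]), v)
    end
    return Dict(g => f(v) for (g, v) in groups)
end

tapply(sum, [1, 2, 3, 4], ["a", "b", "a", "b"])  # Dict("a" => 4, "b" => 6)
```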

As for the estimation methods, the approach taken by plm is to apply transformations to the model matrix and then use OLS on it. With this approach I don’t have to implement the estimation techniques themselves, but rather just the transformation procedures for the model matrix, and I can use the basic functions already provided by GLM.
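A rough sketch of the idea for the fixed-effects (within) case; the `within` helper is purely illustrative and ignores degrees-of-freedom corrections:

```julia
using GLM, Statistics

# Within (fixed-effects) transformation: subtract each group's mean column-wise.
function within(X::AbstractMatrix, id::AbstractVector)
    out = float.(copy(X))
    for g in unique(id)
        rows = findall(==(g), id)
        out[rows, :] .-= mean(out[rows, :], dims = 1)
    end
    return out
end

X  = [1.0 0.5; 2.0 1.5; 3.0 1.0; 1.0 2.0; 2.0 2.5; 4.0 3.5]
y  = [1.0, 2.5, 3.0, 2.0, 3.0, 5.5]
id = [1, 1, 1, 2, 2, 2]

yw = y .- [mean(y[id .== g]) for g in id]  # demean the response within groups too
coef(lm(within(X, id), yw))                # plain OLS from GLM on transformed data
```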

Hi,

this sounds like a useful applied project. Just a few remarks (from a finance prof teaching financial econometrics):

(1) It is mostly about panel data methods, so maybe another name would fit better? (PanelEconometrics.jl?)
(2) The project seems to be a mix of providing an interface to existing code and bringing in truly new stuff. A more detailed discussion of this balance would be helpful.
(3) Most of the methods are some sort of linear regression, so the coding should be relatively straightforward. The main challenge is probably to “economize” on memory (panel data sets are often huge) while handling missing data efficiently. Some more details on that would strengthen the proposal.

Good Luck,
Paul S


It would be good if you also clarified how this relates to https://github.com/matthieugomez/FixedEffectModels.jl.

Good point. I had just finished doing so before I saw the comment. Thanks for pointing it out.

This kind of code could be interesting. You could give a link to it, e.g. in a secret gist, so that you can remove it after a few weeks. Chances are very small that students would find it, since it wouldn’t be indexed by search engines.

I think the issue with tapply is that it encourages a vectorization mindset, which is the only way to write (semi-)efficient code in R but is suboptimal in Julia, since it creates lots of temporary arrays (one for each group). Depending on what you need to do, it can be much faster to go over the vector and apply the function on the fly: for example, to compute a sum by groups, for each element see in what group it should go and add it to the accumulator for that group. That could make your package quite a bit faster and less memory-hungry than plm.
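For example, a grouped sum can be written as a single pass with accumulators, never materializing a temporary array per group (illustrative sketch only):

```julia
# Single-pass grouped sum: one accumulator per group, no temporary arrays.
function groupsum(x::AbstractVector, index::AbstractVector)
    acc = Dict{eltype(index), eltype(x)}()
    for (v, g) in zip(x, index)
        acc[g] = get(acc, g, zero(eltype(x))) + v
    end
    return acc
end

groupsum([1, 2, 3, 4], ["a", "b", "a", "b"])  # Dict("a" => 4, "b" => 6)
```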


I just realized that I lost my code for those when my previous computer died (and my external hard drive was also erased). I learned my lesson and now everything on my computer is backed up to the cloud. Thanks for the suggestion. What I did instead was explain the algorithms I had implemented during the courses.

The closest implementation I have seen is DataFrames’ by(), which is based on the split-apply-combine strategy. However, it is only implemented for data frames and not arrays. I believe I explored implementing new structures to hold the group values, using map() with group-aware functions, and a few other approaches.

I think that before going deep into implementing new tools, a serious effort should be made to design a comprehensive interface that all the econometrics models can build upon. This can be done either by extending StatsBase.jl or by creating a new interface package (e.g., MetricsBase.jl).

For instance, it is true that the cited CovarianceMatrices.jl can be extended to different estimators, but this would be much easier if the interface to do so lived in a thin separate package that other packages could import.

I believe the current framework so far is DataFrames → StatsModels → GLM.
Beyond GLM, the ecosystem’s design could use more work:

  • Machine learning (Lasso.jl adds regularized regressions such as ridge, lasso, and elastic net)
  • MixedModels.jl adds mixed models
  • CovarianceMatrices.jl is one that has a nice niche in the ecosystem
  • Others I would like to see in the future are: LinearRegressionDiagnostics, SpatialRegression, TimeSeries (there are a couple of packages, but not the package), SurveyData, etc.

At least with the proposed package, the framework beyond GLM can start being adapted to make development within the ecosystem easier, by integrating CovarianceMatrices.jl and, in the future, maybe Lasso.jl and the like. I would like to work on a LinearRegressionDiagnostics package eventually and adapt Econometrics.jl to play well with it depending on the data structure.

Hey @Nosferican, did you ever start work on this package? Would be nice to have in the Julia ecosystem.

@joshualeond, I am currently beta testing it: https://github.com/JuliaEconometrics/UEM.jl. Any feedback, suggestions or bug reporting is appreciated!

How does your package interact with or support FixedEffectModels.jl?

https://github.com/matthieugomez/FixedEffectModels.jl

I am finalizing the latest iteration of the package. FixedEffectModels is similar to Stata’s reghdfe or R’s lfe. This package allows fixed effects in the same way and is also similar in that it uses an implementation of the method of alternating projections (initially it used Baltagi’s two-way fixed effects N × N matrix). The difference is that this package has a few other estimators for panel data (e.g., between means, between periods, and random effects GLS Swamy–Arora). Other than that, there are a few differences in the implementation of the multi-way clustered vcov, the statistics provided, etc. FixedEffectModels returns a small results object for memory efficiency; this package holds most of the data in its struct so it may be used later on by a package dedicated to implementing various tests (consistency, spherical errors, strength of instruments, etc.).

Is there any discussion of implementing some aspects of your package with FixedEffectModels, to avoid fracturing the ecosystem the way we currently have in R?

A lot of the discussion is in StatsModels; you can take a look at issue21. A lot of it won’t be solidified until JuliaData/DataFrames and JuliaStats/StatsModels get a stable version that works again. There are a few packages with somewhat similar aspects, but hopefully everyone will move to a standard routine, with JuliaStats/StatsBase and JuliaStats/StatsModels working with JuliaData/DataFrames. The standard routine will make sure that the package structs are <: StatsBase.RegressionModel, use StatsBase.fit!, have all relevant functions defined for the abstract type, and can be summarized with StatsBase.coeftable, as in the sketch below. Other integration, such as for variance–covariance matrices, will hopefully be done as well. For the moment CovarianceMatrices only works for GLM, and the DataFrames statistical model is being deprecated. There will still be some differences, such as with MixedModels.jl, since that is targeted at a different crowd. The Econometrics.jl suite will focus on the regression models we use and hopefully integrate with some machine learning eventually, but will be kept to the things that econometricians need/want/use.
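A minimal sketch of that routine (the `ToyOLS` type is made up purely for illustration):

```julia
using StatsBase

# Hypothetical estimator struct that hooks into the generic StatsBase API.
struct ToyOLS <: StatsBase.RegressionModel
    coefs::Vector{Float64}
    n::Int
end

StatsBase.fit(::Type{ToyOLS}, X::AbstractMatrix, y::AbstractVector) =
    ToyOLS(X \ y, length(y))
StatsBase.coef(m::ToyOLS) = m.coefs
StatsBase.nobs(m::ToyOLS) = m.n

m = fit(ToyOLS, [ones(4) [1.0, 2.0, 3.0, 4.0]], [1.0, 2.2, 2.9, 4.1])
coef(m), nobs(m)
```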

Current development is happening at Econometrics.jl.


Could you outline your current thinking of what bits of the econometrics world Econometrics.jl is aimed at covering, and how it interacts with the rest of the ecosystem as it stands now?

I have to say I find the name slightly unhelpful, as it conveys a slightly misleading impression of the scope of the package. As I understand it, it currently focuses on panel methods and doesn’t attempt to incorporate other significant subfields (e.g. time series, nonparametrics, structural, quasi-experimental stuff like synthetic control [shameless plug]). Don’t get me wrong, the package looks very useful, but I wonder whether a more specific name might be helpful!?