My experience as a Julia and R user

Hi! I’m a relatively new user of Julia, and I thought that maybe contributors to the DataFrames and econometrics packages could find my feedback useful.

I’ve been primarily using Julia for DataFrames, which is a package I love. Recently, I’ve started to use it for econometrics. I was originally using Stata and wanted to move to an open-source alternative. First, I tried R and didn’t like it at all. So, given my nice experience with Julia, I decided to give it a try for econometrics.

Given that my experience so far has not been good, I wanted to share it with you. Please don’t take this post as criticism. On the contrary, I want to stick with Julia in the future, since I love how things are evolving. I think the problem is more a matter of expectations. Julia is a young language, and it’s completely understandable that its packages are not yet mature. However, people sometimes oversell the current state of Julia in some areas. This can be detrimental: users can feel deceived and never come back. In fact, I think discussions about whether Julia is as fast as C, or the comparisons with Python, are more damaging than helpful.

Let me be more concrete. First, I explored GLM. It still lacks some basic functionality. For example, adding robust standard errors to a simple OLS requires adding CovarianceMatrices, which becomes a problem when you want to integrate the result with TexTables.
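To make the point concrete, here is a minimal sketch of that two-package workflow, assuming CovarianceMatrices’ HC1 estimator and its stderror(estimator, model) method (the argument order has changed across versions of that package, so check the one you have installed):

using DataFrames, GLM, CovarianceMatrices

df = DataFrame(x = randn(100), y = randn(100))
ols = lm(@formula(y ~ x), df)      # plain OLS; reports IID standard errors

# Robust (heteroskedasticity-consistent) standard errors need the second package:
se_robust = stderror(HC1(), ols)   # assumption: recent CovarianceMatrices API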

So, then I moved to FixedEffectModels. I was convinced by the documentation, which shows benchmarks where the package is faster than R and claims that “Performances are roughly similar to the newer R function feols. The main difference is that FixedEffectModels can also run the demeaning operation on a GPU (with method = :gpu).” It gave me the feeling that it was as good as or better than R.

I tried FixedEffectModels and was surprised that, despite being at version 1.7, I ran into issues like this. Additionally, it couldn’t handle some computationally intensive regressions. Given the problems I was facing, I tried feols in R, prompted by the comment quoted above from the documentation. I was surprised to see that feols was much faster and could run these computationally intensive regressions, even though it’s only at version 0.10.4.

I’m pretty sure that the key here is that feols uses all the machine’s threads for these computations. However, the reason is beside the point. My point is that comparisons of Julia against alternatives can be dangerous: you can always choose some problem in which one program excels. In fact, the documentation of R’s feols shows benchmarks claiming that it’s faster than FixedEffectModels!

So, to make this post a positive contribution, let me share some thoughts with you. Again, I don’t know if all of these points are feasible (or even valid!), but maybe some of them are relevant.

  1. First, let me provide some feedback regarding FixedEffectModels vs feols in R. The following is an MWE that could help identify where the difference between the packages becomes most noticeable.
using DataFrames, FixedEffectModels, RCall

# FAKE DATA: 10 firms in 10 industries, observed over 2010-2020
nr = 10_000
data(year) = DataFrame(firm = repeat(1:10, nr ÷ 10),
                       industry = repeat(1:10, nr ÷ 10),
                       sales = rand(nr),
                       year = fill(year, nr))

dff = data(2010)
for yr in 2011:2020
    append!(dff, data(yr))
end

# REGRESSION IN R
R"""
library(fixest)
rdff = $(dff)
model1_R = feols(sales ~ factor(firm)*factor(year) | industry, vcov = "hetero", data = rdff)
"""

# REGRESSION IN JULIA (firm and year treated as categorical via DummyCoding)
model1_julia = reg(dff, @formula(sales ~ firm*year + fe(industry)),
                   Vcov.robust();
                   contrasts = Dict(:firm => DummyCoding(), :year => DummyCoding()));

You can play with nr to see how the differences grow. I think this arises because R multithreads the task, since only when I set the number of threads to 1 in both programs does Julia perform better (see the sketch below). I also tried using PooledArrays instead of contrasts (I don’t know how FixedEffectModels internally turns strings into dummy variables), and it didn’t solve the problem (only when using CUDA with method = :gpu, double_precision = false did Julia perform better).
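For reference, here is a minimal sketch of one way to pin both sides to a single thread for a fairer comparison (fixest’s setFixest_nthreads is a documented fixest function; on the Julia side the thread count is fixed at startup):

# Start Julia with a single thread: julia -t 1
using RCall
Threads.nthreads()                  # confirm Julia is running on one thread
R"fixest::setFixest_nthreads(1)"    # limit fixest to one thread as well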

  2. I read that some contributors are worried about features of Julia that new users don’t like. Since I was one of those users, let me tell you that I now appreciate some of these features. For instance, I appreciate that packages are more verbose than in R, since it makes them much clearer. And, at first, I hated the handling of missing values, but then I realized its importance. It taught me to make explicit choices about how to handle data, rather than letting a package silently decide what’s best (e.g., automatically skipping values).
    So, I think the best approach here is to insist on why being more verbose, not automatically skipping missing values, etc., is important. Over-adapting methods to make packages feel like R or Stata will be detrimental in the long run. As an economist, and I guess this goes for non-programmers in general, we don’t have a solid background in programming, and we end up learning the hard way to avoid certain habits. This is even worse when you come from programs like Stata or Matlab. It’s not until you have a bad experience (e.g., an undetected bug) that you understand this. The snippet below illustrates the point about missing values.
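    For example, here is the behavior I initially disliked and now value: Julia propagates missing instead of silently dropping it, so skipping has to be an explicit, visible choice:

using Statistics

x = [1.0, 2.0, missing]
mean(x)                # returns missing: Julia refuses to guess for you
mean(skipmissing(x))   # 1.5: skipping values is an explicit decision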

  3. Sometimes I’m a little worried that Julia could run into the same issue as R, where there are 20 thousand packages each solving part of the same problem, so you need to learn a little of each. I don’t know if it’s possible, but maybe there could be some unification of packages (for example, Julia officially recommending a specific package for regressions?). That way, people would focus on improving one package, rather than starting from scratch and proposing their own solution. For example, right now, if you start with GLM and suddenly want a simple OLS with robust errors, you have to add another package or move to FixedEffectModels. So, it’d be ideal if there were only one package doing this (I can imagine that this is really hard, almost impossible, to achieve).

Again, I want to emphasize that this isn’t criticism directed at any of the contributors to these packages. Rather, my feedback as a user is my way of thanking you!
Keep up the good work!


One minor point about benchmarking: the results might be problem-dependent, in which case they may not generalize to your specific problem. I don’t know enough about the benchmarks, but it looks like they might be using different problems. Do you know if they use the same problems with the same package versions?

In either case, FixedEffectModels is a close second in the benchmarks I found.

That’s the second post in short succession that asks about intercepts in a fixed effects model - does feols provide this as a default? plm doesn’t afaict, which is just as well given that in the standard FE formulation an intercept would be perfectly collinear with the fixed effect.

As for the rest of your post, I think you misunderstand how SemVer and open-source development work: a “1.0” version tag simply means that there are no breaking changes to the API in any “1.x” release; it does not indicate that a package is feature-complete or bug-free (hardly any software is).

GLM is in fact an older package than FixedEffectModels, and is currently at version 1.8 (not sure why you’re using 0.11?). The workflow for specifying different estimators for standard errors has been the subject of debate, so I wouldn’t call this a “basic missing feature” but rather a design decision; cf.:

https://github.com/JuliaStats/GLM.jl/issues/42


I’m not sure about their benchmarks. I just noticed that when there are categorical variables, FixedEffectModels becomes really slow. But if you don’t work with categorical variables, FixedEffectModels is faster, from what I could see.

The MWE was meant to point out that FixedEffectModels becomes really slow when you have categorical variables (this occurs even if you have no fixed effects, for instance).

The GLM part was a typo; I’m using 1.8 (I just corrected it). This was related to my 3rd point, and I don’t know if there’s a solution. My point is that it’s not obvious to a user how to run the simplest regression possible, OLS with robust errors, and that this could lead to a lot of people trying to write their own packages.

At least my preference would be to have everything people normally need when estimating GLMs available in GLM.jl.

As for DataFrames.jl (and the related DataFramesMeta.jl package): if you have any feature requests, please open an issue.

Thanks for the feedback! I can’t comment on FixedEffectModels.jl performance (maybe Matthieu Gomez will chime in). But regarding robust standard errors with GLM.jl, it seems that TexTables.jl supports them using CovarianceMatrices.jl too: Regression API · TexTables.jl.

I agree it’s a bit less convenient than in Stata, though. Maybe we could allow passing a covariance estimator to lm/glm/fit so that we use it when printing standard errors, CIs, and p-values (as discussed at Allow sandwich/bootstrap/jackknife standard errors as options · Issue #42 · JuliaStats/GLM.jl · GitHub).
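To make the idea concrete, here is a purely hypothetical sketch: the vcov keyword below does not exist in GLM.jl today, it is just the shape of the proposal, reusing CovarianceMatrices’ HC1 type for illustration:

using DataFrames, GLM, CovarianceMatrices

df = DataFrame(x = randn(100), y = randn(100))

# Hypothetical keyword, not part of the current GLM.jl API:
model = lm(@formula(y ~ x), df; vcov = HC1())
coeftable(model)   # would then report robust SEs, CIs and p-values directly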

Regarding the fact that we have both GLM.jl and FixedEffectModels.jl, I don’t think it would be a good idea to merge them, as each is intended for a particular use case. Even Stata uses different commands to fit “classic” models and fixed-effects models. But what we should do is ensure they use consistent APIs, which isn’t the case for covariance matrices due to the split between CovarianceMatrices.jl and Vcov.jl (Cc: @ragusa).


As an aside to this, people in the R camp have long favored clustering to happen at the summary stage, rather than the fitting stage. I was pleasantly surprised to learn that fixest supports both: you can pass vcov = "white" as a keyword argument to feols, as well as to summary. I am very much on board with having vcov be a keyword argument in GLM.
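A minimal sketch of the two entry points, run through RCall as in the MWE above, using R’s built-in mtcars data (I write vcov = "hetero" as in the earlier example; note that older fixest versions spelled this argument se rather than vcov):

using RCall
R"""
library(fixest)
m <- feols(mpg ~ wt, data = mtcars)                      # fit with default IID errors
print(summary(m, vcov = "hetero"))                       # robust SEs chosen at the summary stage
m2 <- feols(mpg ~ wt, vcov = "hetero", data = mtcars)    # or chosen at the fitting stage
"""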

I agree that it’s natural to keep GLM and FixedEffectModels separate. The problem is that right now you can’t adjust standard errors in GLM, while FixedEffectModels is still having issues retrieving residuals and predictions (it’d be good to confirm that the issue I reported wasn’t specific to my computer; I was using FixedEffectModels v1.6.6). From a user’s point of view, I didn’t want to do so many hacks and, more importantly, I was going to confuse my students. So I ended up using RCall.

I also think that FixedEffectModels could become the standard regression package for economists. It’s a great package and an ideal building block for users to add more functionality and work on improvements. In fact, the name of the package can be misleading, because it’s not only about fixed effects: you can use it for instrumental variables and other features. There’s another package called LinearRegressionKit, but I haven’t explored it.

I think that some people like me would be eager to contribute to one package and build on it, but right now it’s not easy for a user to know what to use. I think now is the right time to make a decision about this, because otherwise you risk people moving to other software or writing their own packages. In the latter case you end up with R’s problem, where there are a million packages with huge overlap between them (e.g., R has a million packages for printing regressions in LaTeX, with minor differences between them), when I think it’s better to have a single package, with coordinated efforts, when the task is the same.

Thanks for your comments!
