Choosing a numerical programming language for economic research: Julia,

Stata is completely specialized to the exact workflow of applied linear (and perhaps constrained variations of generalized linear) estimation: linear regression, causal inference, event studies, etc. It would be insane to do things like solve differential equations in it, even more so than in R. The Stata language itself is absurd and inconsistent in ways you couldn’t imagine (e.g. in practice, you only have one (unnamed) table available at any time and no other data structures; variable names implicitly reference columns in that table). The presenter of https://www.youtube.com/watch?v=vcFBwt1nu2U&ab_channel=NDCConferences should study Stata for inspiration.

But… it is a great case study of how an objectively awful and inconsistent language can develop an environment, workflow, network of users, and set of packages that does exactly what a group of people need. And this isn’t just a question of switching costs and complementarities from network effects: focus often dominates elegance when productivity is concerned.

Unlike using R (let alone Stata) to solve a complicated differential equation, it isn’t insane to use Julia for data analysis and linear regressions. Johannes, Mathieu, and many others have created excellent packages. If you and your coauthors/RAs already know Julia well, or those tasks are just one small piece of your project, then those packages are a great choice. What I think economists should be careful about, though, is suggesting that people switch to Julia from their more specialized language for just those tasks. That isn’t where Julia has its main advantages, and if people evaluate Julia primarily on those criteria it will lose, and they may not come back to evaluate it again in the cases where it would win.

I am not sure what the myths are at this point.

Certainly the Julia compiler and language still have usability quirks, but they are much better, with far less latency, than in 2019. The development environment is better than it used to be, but not in the same ballpark as the stability of RStudio, MATLAB, or even the VS Code Python support. Given the massive investment in those platforms, that shouldn’t be a surprise.

Sadly, the package ecosystem hasn’t consistently kept pace. Some packages (e.g. DifferentialEquations and ApproxFun, to throw out a few) are state-of-the-art and well maintained. But many more basic ones have stagnated, and when new users find a package they often find it has bugs or holes, or that they need to switch between similar packages for small variations on the same task. Python/R also have packages that slowly decay, but they have a core set of packages that never will. Meanwhile, investment in Python (numpy/jax/pytorch) and MATLAB packages has accelerated, and their performance and usability have changed substantially. numba/jax/pytorch can be at least as fast as Julia in many cases (or at least until careful performance tweaking happens… all compiled languages end up about the same), and a lot of algorithms are dominated by linear algebra, so performance comes down to BLAS code. Performance isn’t everything, of course, but it is a dimension Julia cannot assume it will win on.

I feel like there is a degree of complacency in the Julia community, where people think that Julia is catching up to its competitors and it is just a question of time. In some cases there is catching up to do, in others Julia is already the state-of-the-art, but the package ecosystems of its competitors are also moving targets and have huge $$$ invested in them.

So to summarize my position here, and then I will shut up: the biggest myths impeding Julia’s success might be in the Julia community itself: (1) that being the best language is enough to win; (2) that Julia’s packages are catching up to the competition in all the cases that matter; and (3) that because Julia is a general-purpose language that can do many things, it can’t backfire to evangelize it to those currently using more specialized languages for those tasks.

As always, I am leaving off all of the wonderful things about Julia. I am only pointing this out publicly because I think there are things that can be done to remedy these issues, and they won’t magically disappear without something changing.

As for actionable items: besides spreading the perspective that having the best language is neither necessary nor sufficient for success (hopefully the success of Stata/MATLAB/R is enough of a counterexample) and that the entire workflow/environment/ecosystem is the key, I think the most important step is to ensure that Julia remains the best at its core competencies, like scientific computing.

Specifically, if I could make one suggestion to anyone who wants to address the holes here, it is to look at the SciML project: https://docs.sciml.ai/dev/, which intends to consolidate Julia scientific package documentation, fill in holes in the ecosystem, and provide a common interface. There are still many holes, missing unit tests and examples, bugs in downstream packages, and lots of polish required, but I think it is Julia’s best hope for an easy-to-navigate and dependable ecosystem.

Even better: if any economists are really interested in filling in those holes, apply for some significant open-source grants to do so. Scipy/numpy/Stan/etc. got where they are today because of investment of time and money.

9 Likes

I want to go farther than this and say it is actually more pleasant to do the data analysis tasks I’ve been doing in Julia than it is in R. The kinds of things I do are read in publicly available data in CSV, XLSX, or very occasionally some other format, filter, join, parse, calculate statistics, plot, fit simple linear models, fit full Bayesian models in Turing, and build entire documents explaining the workflow in Weave.

I usually sit down to work on a problem by making a .jmd file for Weave to operate on, then just start cranking on writing code and making plots. Between code groups I interweave a little discussion motivating what I’m going to do next, “Let’s see what happens if …” kind of text. Doing it in VS Code is acceptably convenient and has nice graph capture etc., and when I have something I like I can weave() it into an HTML or PDF output as a single document.
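A minimal sketch of that last rendering step, assuming a hypothetical file name:

```julia
# analysis.jmd holds markdown prose interleaved with julia code chunks;
# rendering it to a standalone document is a single call.
using Weave

weave("analysis.jmd", doctype = "md2html")  # or doctype = "md2pdf" for PDF output
```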

Working with data manipulation using DataFramesMeta is quite pleasant, and I know 100% what I’m getting when I use it, unlike the tidyverse with its tons of nonstandard evaluation.
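For readers who haven’t used DataFramesMeta, a small sketch of the style (the data here is invented for illustration):

```julia
using DataFrames, DataFramesMeta, Statistics

df = DataFrame(region = ["N", "N", "S", "S"], income = [10, 20, 30, 40])

# Columns are referenced as plain symbols; there are no nonstandard
# evaluation rules to reason about.
result = @chain df begin
    @subset(:income .> 10)                  # keep rows with income above 10
    groupby(:region)                        # group by region
    @combine(:mean_income = mean(:income))  # mean income per region
end
```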

When I have to do something a little unusual, it is immediately easy to code and performant. For example, I wanted to read through the ACS data and, whenever I read a particular kind of household record, create an entry for each person in the household, filed under a key in a dictionary based on some of their characteristics, and then compute statistics on the people grouped under the same key. This kind of stuff would be absolutely annoying as hell in R.
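A sketch of that pattern with an invented record format (the real ACS layout is different):

```julia
using Statistics

# hypothetical person records
people = [(age = 34, state = "NY", income = 50_000),
          (age = 61, state = "NY", income = 72_000),
          (age = 29, state = "CA", income = 48_000)]

# file each person's income under a key built from their characteristics
groups = Dict{Tuple{String, Bool}, Vector{Int}}()
for p in people
    key = (p.state, p.age >= 60)
    push!(get!(groups, key, Int[]), p.income)
end

# then compute statistics per group -- an ordinary Dict and loop, no special syntax
stats = Dict(k => mean(v) for (k, v) in groups)
```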

So I’m mostly pushing back against the idea that Julia is only “reasonably ok” for data munging. It is actually the best thing I’ve seen because I have complete access to a full programming language and can do whatever I want, not just a few ritualized things that there’s special syntax for.

If what you do is handled entirely by a special-purpose tool like Stata, then sticking to that is maybe OK (though I have strong feelings about the “search for statistical significance” style of work that Stata seems designed to enable). But if what you want to do involves a variety of activities at all… like maybe you plan to munge data and then use the results of survey data to populate an agent-based model, or to calculate information about migration patterns and display it on maps, or to estimate the rate of change of environmental variables in different locations and use that to inform coefficients in an ODE model that predicts population crashes among amphibians or whatever… then you should absolutely be looking hard at Julia, and do not think that its data munging abilities are second rate.

EDIT: all that being said, I still agree with you that there’s no room for complacency in the Julia ecosystem. We need more documentation, we need package maintainers, and we need forward progress on supporting more data formats, more interactivity, more online data sources, etc.

9 Likes

Thanks for updating your post and the reference.
I have a few remarks though.

  1. @turbo is not a decorator (that’s a Python term). In Julia it is a macro.
  2. I disagree with your assessment that it is unfair to use SIMD, especially since you don’t know what Numba’s @jit is doing in the background. As stated by the authors, it may be using LLVM optimizations that could also include SIMD. What I do agree with you on is that in order to use it you need to rewrite your algorithm a bit (see next point). But if you just throw an extra keyword in front and it makes your algorithm 2-3X faster, who cares whether it uses SIMD or not? To my knowledge, the other numerical languages don’t have this kind of access to SIMD, which gives Julia an edge that I think should not be disregarded so easily as “unfair”. I would also argue that it is “unfair” to compare such a young language (10+ years) with others that have been around for 30+ years, or a commercial language against open-source ones. They all have their advantages and disadvantages, and SIMD is one that Julia has and the others don’t.
  3. Currently the @turbo macro from LoopVectorization requires an algorithmic rewrite for the likelihood() calculation, which is not ideal. However, the LoopVectorization package is currently being reorganized/rewritten, and I was told by the package author (Chris Elrod) that this case is going to be included in its test cases.
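For readers who haven’t seen the macro, a minimal sketch of how it is applied (the function name is made up):

```julia
using LoopVectorization

# Prefixing the loop with @turbo asks LoopVectorization to emit SIMD code for it.
function dot_turbo(x, y)
    s = zero(eltype(x))
    @turbo for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    return s
end
```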
3 Likes

my two cents: if one splits the loop in two, as in the Julia @turbo implementation, vectorizing that loop in C requires just finding the appropriate compiler flag (-march=native, maybe).

Here are a few data analysis pieces that are available in other environments but not (or not yet fully) in Julia. Some of these will likely be more developed in a few years, and some won’t.

In no particular order:

  • Pandas has hierarchical indexing on both axes; DataFrames.jl doesn’t have it and won’t any time soon.
  • DataFrames.jl 1.3 is missing a number of features that are in development (e.g. #2215, #3116)
  • TSx.jl (time series) is unreleased/immature
  • VCov.jl isn’t integrated into all of the modeling packages
  • brms is very well developed; TuringGLM.jl is new this year and likely immature.
  • There is no plm (panel regression) in Julia
4 Likes

Have a look at my blog post: Working with Julia projects | Julia programming notes
Does that clarify things for you, or is there something you still don’t understand?

2 Likes

Maybe my recent perspective as a new user with no background in computation could be of some help. I used to exclusively use Stata until about half a year ago (I was only performing simple regressions). Since I only have Stata at my office, I decided to move to an open-source alternative. So, I had to face the choice between R, Python, and Julia.

I want to emphasize a comment by @jlperla: economists do not (yet) have formal training in computation. In the end, I feel this is like learning to play guitar by yourself, where bad habits compound over time; you only learn ex post what you shouldn’t have done. So far, there’s no systematic approach to choosing these tools or being instructed in how to learn them.

Given this, let me tell you about two aspects of my experience. I call them aspects rather than lessons, because they just describe my experience.

  1. First, I wanted some software to perform regressions and simple statistical analysis.
    You can read more about my experience in this regard here. I started with R and got confused by the plethora of packages. So then I moved to Julia. I never tried Python, because I read many times on the net that Julia was the future, and I bought that.

My conclusion? I would’ve advised myself to stick with R for these purposes. Why? For purely practical reasons: the packages are more mature and you have more options. Also, when people develop new tools, they do it in Stata or R. For regressions or really simple analysis, you don’t care about speed or the other advantages Julia has.
Nonetheless, there’s a catch here: DataFrames is excellent. In fact, my motive for keeping on with Julia was that DataFrames is superb when I have to clean and organize data. But knowing the eventual price, maybe I should have stuck with R. But, as I try to emphasize, I don’t know. Maybe once I went to R, I would’ve switched back to Julia.
What am I doing now? Like a guitar player with bad habits, I’m using Julia plus RCall when I perform regression analysis. I don’t know if this is optimal, since using RCall for some purposes is not seamless.

Lesson from this → as a new user, I was confused. I wasted a lot of time without finding an answer to what I should use, and quite likely ended up with a suboptimal method.

  2. Just this week, I’ve started using Julia for things other than pure regression work: mainly calibrating and solving models.
    Here I only tried Julia, since I thought Julia would be the right choice (I still think it is). But I’m facing issues, because the documentation/posts are not well suited for beginners. Let me give you an example.
    I want to solve a simple nonlinear system of equations (think of a typical 3x3 system arising from CES demand and Pareto distributions of productivities; I mainly do international trade). I was able to show mathematically that the solution exists, is unique, and is interior. Great! Now how can I find the solution? I’m still not sure which package I should use for such simple stuff. All the documentation/posts I find start from contrived examples or focus on tricks to increase speed. Those are eventually important, but as a beginner I still feel like crying out “let me know where to start!”
    Right now I’m using NLsolve.jl, because it was the most mentioned. At first it wasn’t converging, because my model requires non-negative variables. This could be fixed by changing the initial conditions, but I have to solve the system for 100 industries. So I’m still asking myself: should I change the method used by NLsolve? Is that possible? Should I use another package where I can set bounds? So far, I’ve noticed that if you use abs(x) on the solution x, everything goes well. But is this robust? Then I read a post about solving nonlinear systems as a minimization problem. Should I try this? None of these questions is really the point. Rather, my point is that all of this started because I didn’t find the documentation/posts useful as a beginner.

  3. Related to 2), I’m still wondering whether I should define a struct, or simply define parameters and use functions so that the types can be inferred. Should I use modules more? I always read the blog by @bkamins and have learned a lot from it, with super simple advice like “always write functions, and if a function is too long, ask yourself whether you shouldn’t be splitting it into two functions”.

Overall lesson: as an economist, I’d like to have notes/documentation written as if I had an RA and needed to tell him step by step what to do to get results. I’d even assume that the RA is sloppy, so that I need to carefully specify what to do! You can always find a list of packages for optimizing a function, but which should you pick for simple stuff? Sometimes it’s not about the package, but about the tricks.

Right now, if an RA comes and asks me what he should learn, I’d answer: “I don’t really know. In my experience, I’d use R for regressions. However, I can’t tell you what packages to focus on, because I took a different path. For computational stuff, Julia seems good, but I’m still figuring out what packages/tricks are common”.

We economists always tend to be afraid of asking basic questions, as if it were embarrassing not to know. In maths, and from what I can see in computation, the simplest matters tend to be the most important ones, and so they need to be made explicit in documentation. We could write “x>y” and it’s obvious what that means… until someone asks “in which space? what order relation are you using?” and then you realize that you forgot to “document” the basics of what you were doing.

17 Likes

It could also be interesting to compare

  • the amount of memory used
  • the number of lines (or characters) of code to do the same

This post was outstanding. Thank you for the details.

Yes, and there is another side to your problem: maybe there are no especially good and reliable packages at this point that are as robust as the Python/MATLAB alternatives?

Everyone on these forums seems to suggest it is purely a matter of discovering the magic package and filling in its documentation, but you can’t document what doesn’t exist or what is buggy and half-maintained. The hope for documentation, in my mind, is common interfaces in SciML. In this case there is a meta-package for nonlinear solvers: https://docs.sciml.ai/dev/modules/NonlinearSolve/ The idea is that everyone can standardize on a common interface, algorithms can be swapped easily, and gradients or Jacobians can be generated with ease.

Of course, that is only as good as the solvers that are available, and not everything is wrapped. This style of wrapping everything in a common interface is relatively new, but if everyone coordinated their resources to fill in the docs/examples/unit tests, and then fixed the features of the downstream packages that seem broken, it would help a great deal.

2 Likes

NLsolve.jl is actually very good; I’ve used it a lot. I have a test suite of 70+ problems from different fields which are often used to test new solvers and NLsolve.jl is able to find a solution to the vast majority. Its documentation is perhaps a little too slanted toward obtaining performance. Sometimes we just need to see a simple easy example to get us started.

Take a look at NLboxsolve.jl, which implements methods to solve systems of nonlinear equations subject to box constraints. If the examples on the documentation page aren’t informative enough, then look inside the test folder to get working code. These examples should translate pretty well back to NLsolve.jl.

By the way, trust-region methods are based on reformulating the root-finding problem as a minimization problem. So you should be able to use the trust_region method in NLsolve.jl, rather than reformulating your problem and passing it into something like Optim.jl.
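A minimal sketch, with a made-up 2x2 system:

```julia
using NLsolve

# residuals of a hypothetical system: x1^2 + x2^2 = 2 and x1 = x2
function f!(F, x)
    F[1] = x[1]^2 + x[2]^2 - 2.0
    F[2] = x[1] - x[2]
end

# the trust-region method is also NLsolve's default
res = nlsolve(f!, [0.5, 0.5], method = :trust_region)
res.zero   # the root, near [1.0, 1.0]
```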

1 Like

Maybe for your specific optimization problem there is a definitive, known answer to all these questions, which maps to a specific package that solves all your problems, in Julia or another language. But maybe there isn’t. Having worked with some variety of optimization problems, I always expect to have to figure out how to guess good initial points, devise globalization heuristics, and check different parameters and methods. Maybe what is frustrating is that some problems just don’t have a simple recipe.

4 Likes

I share the sentiment that Julia’s documents are uninviting. Regarding the specific problem:

Right now I’m using NLsolve.jl, because it was the most mentioned. At first it wasn’t converging, because my model requires non-negative variables. This could be fixed by changing the initial conditions, but I have to solve the system for 100 industries. So I’m still asking myself: should I change the method used by NLsolve? Is that possible? Should I use another package where I can set bounds? So far, I’ve noticed that if you use abs(x) on the solution x, everything goes well. But is this robust?

I feel that this is not a Julia issue. You wouldn’t be able to avoid the problem using other languages either; they don’t have an automatic fix for non-negative variables.

For a non-negative parameter beta, I would suggest parameterizing it as beta = exp(c) and solving the model w.r.t. c as an unconstrained problem. It is robust. The estimate beta_hat is easily recovered as exp(c_hat). The standard error can be obtained using the delta method; in the case of an exponential function, it is simply exp(c_hat) multiplied by the standard error of c_hat.
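A sketch of the idea on a made-up one-equation model (the equation and names are hypothetical):

```julia
using NLsolve

residual(beta) = log(beta) + beta - 1.0   # hypothetical equation with a root at beta = 1

# solve in c, where beta = exp(c) is positive by construction
f!(F, c) = (F[1] = residual(exp(c[1])))

res = nlsolve(f!, [0.5])
beta_hat = exp(res.zero[1])   # recover beta_hat = exp(c_hat)
```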

2 Likes

There’s actually a really good package for this: TransformVariables.jl. It lets you keep track of those transformations super easily.

I’m not sure the MATLAB or Python packages have this, actually (though I don’t know that space well). It’s convenient and might be considered a “win” for Julia.

3 Likes

But how is a new user (or even an experienced one) supposed to figure out that they can use that package? From the README there’s no clue what the package does. But it does say

Work in progress. API will change rapidly, without warning.

4 Likes

This is great! But I think it is a little bit simplified. It would be even better to clarify how dependencies are managed in the test environment. A test file has two kinds of dependencies: 1) those imported by the main package, and 2) those used solely by the tests (not imported or used in the main package), like Test.jl.
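With the standard Pkg layout, that split can be expressed in the package’s Project.toml (the package name and deps here are hypothetical; the UUIDs shown for DataFrames and Test are the registered ones):

```toml
name = "MyPackage"

[deps]    # 1) dependencies imported by the package itself
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"

[extras]  # 2) dependencies used only by the tests
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Test"]
```

(Alternatively, newer Pkg versions also support a separate test/Project.toml.)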

An excellent point. While we can pin runs to specific version numbers, that means we don’t benefit from bug fixes.

A key benefit of so many R packages is that they are seen to do the job correctly, so there is no need to update them. That brings stability. I got curious and looked at a library I had just included in my R code and saw it had not been updated in almost 20 years. Now that is stability.

2 Likes

In reading all these economics comments, I get the impression that economists are much less well trained in computation than we are. And since they seem to be such a large group, they would be very useful to have in the Julia camp.

So, a humble proposal. There are excellent comments above. How about some knowledgeable person making an honest Julia documentation page on GitHub on “How to use Julia as an Economist” (with no overselling), and prevailing on the powers that be to make it prominent on the main Julia pages to help with visibility and Google searches?

Then, as part of that, have a list of Julia packages and what they do (and what is missing/incomplete), along with their Stata and R counterparts, and an honest assessment of the Julia packages.

@jlperla would be an excellent person to be in charge based on the comments above.

Any Julia powers that are willing and able to make this happen?

8 Likes

I am happy to contribute, as I am an academic economist and have used Julia for most of my programming, sometimes succeeding and sometimes failing to replicate typical workflows from R.

9 Likes

I am an academic economist using Julia. I came from an R+Matlab+SAS background. I still use SAS and MATLAB for a few coauthored projects, but the majority of my work is done in Julia. I would say for the most part, I am able to find packages that do what I need them to do. I’m a decent enough programmer + Googler that I can write packages myself, but it is nice when you can just take things off the shelf.

@tyleransom (another academic economist using Julia named Tyler) has slides that could be helpful in this regard: https://tyleransom.github.io/research/JuliaPresentation.pdf It has aged a bit so could use some updates.

7 Likes

I can contribute as well!

The source code for my slides is here. Everything on my site is MIT licensed. Pull requests welcome!

I want to point out that the one linked by @tbeason is over 6 years old now and goes back to version 0.6. The newer version (source code linked above) is here: Reference Material (githack.com)

9 Likes