Choosing a numerical programming language for economic research: Julia,

A nice collection of tutorials is GitHub - PaulSoderlind/JuliaTutorial: Julia Tutorial for Finance and Econometrics Students

8 Likes

Fair enough. I agree with @jlperla that there is a massive benefit to a centralized econ organization which provides these useful tools with good documentation and tutorials (though SciML benefits from lots of pharma money, and I don't think there is a demand for niche performance tools in the private sector of economics).

It's still a nice package that hopefully can help out OP.

1 Like

Not sure about that. But I will do one last list of issues for those who might care.

People primarily driven by research (which might not even use Julia if they change the research direction) probably arenā€™t the key here.

I think what would make a huge difference would be someone with professional coding experience and a big open-source software grant who can go through and fill in the ecosystem. The SciML crew is doing their best, but it is an enormous undertaking and they have papers to ship. On that note, it seems to me that the backbone of many open-source projects out there is people who are not pure researchers (e.g. researchers from organizations that appreciate this sort of service work, teaching faculty, research software engineers) who apply for big grants to do the work.

If anyone out there loves this kind of stuff, they should apply for grants (I suggest working with the SciML organization in doing so - and if you are relatively junior they could even be a PI). Unless a big tech company requires your platform for its research and operations (e.g. pytorch or jax), somebody needs to pay for professional software development. Look up the funding of jupyter+binder/jupyterbook, Stan, R, numpy/scipy, etc. sometime to see the resources that have been poured in.

Any problems in the julia ecosystem are about resources (and ensuring there is focus given limited resources), not the language itself.

First, a shout out to all of the people (e.g. @tbeason and @mcreel and many others) who have created amazing tutorials. I think those are all necessary but not sufficient for success.

As I said, my perspective here is that the source of the issue is not training, discovery, or (directly) documentation. Instead, it is about inconsistent code quality and the lack of uniform code interfaces/testing/benchmarking/maintenance of key packages. Economists may have less training in general, but even the ones who have plenty of experience (and are even told which package to try!) are having trouble. That certainly includes me.

Of course, people need to use different algorithms in different circumstances and there is no way to avoid teaching them those tradeoffs, but with julia there is a signal extraction problem. Is it failing because you chose the wrong package, because there is a bug in that package, a bug in an upstream package you didn't know existed, because you used the package wrong, or because it wasn't the ideal method for your problem? It is hard to triage that - even for people who know what they are doing.

In something like matlab, choosing between different algorithms just amounts to changing flags in the optimizer and/or supplying gradients. It won't have bugs and you won't be calling the interface wrong, so you can focus on trying out different algorithms. No signal extraction problem or triage process required. Similarly, in matlab or python, if I want to try various types of interpolation I don't need to search around between 5 different packages - each with their own missing features, quirks, and possible performance regressions.

Not to say that these problems happen all the time and with every package, of course. But they are a lot more frequent than in other environments.

And just to be clear: key packages here could just call out to existing C/Fortran code like NLopt or Dierckx if that was found to be the most robust. Julia long ago fixed the binary-dependency installation woes, so there is no reason to fear binary dependencies and plenty to love about them. It means someone is maintaining a separate implementation, which probably means more testing! Pure Julia implementations sometimes have benefits (e.g., better inlining, working with static arrays, handling generic types), but the baseline Float64 vector versions can all call out to C or Fortran and a user wouldn't care. Even auto-differentiation often works fine with external packages because you want standardized custom AD rules (e.g. the https://sensitivity.sciml.ai/dev/ calculations don't require the primal calculation to be in julia).
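To make that concrete, here is a minimal sketch (mine, not from the original post) of leaning on wrapped C/Fortran libraries from Julia; it assumes the Dierckx.jl and NLopt.jl packages, whose binaries install automatically through the artifact system:

using Dierckx, NLopt

# cubic spline on an irregular grid, backed by the Fortran DIERCKX routines
x = sort(rand(20)); y = sin.(2π .* x)
spl = Spline1D(x, y)               # interpolating cubic spline (k = 3 by default)
spl(0.5)                           # evaluate the interpolant

# derivative-free minimization via the C NLopt library
opt = Opt(:LN_NELDERMEAD, 2)
opt.min_objective = (v, grad) -> (v[1] - 1.0)^2 + (v[2] - 2.0)^2
opt.xtol_rel = 1e-8
minf, minx, ret = NLopt.optimize(opt, [0.0, 0.0])

From the user's side this is indistinguishable from a pure-Julia implementation, which is exactly the point.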

What matters, then, is having a consolidated interface to swap algorithms and good coverage/maintenance of essential algorithms. And to deprecate packages from that common interface when they fall into disrepair.

I think the first step is to focus. As far as I am concerned, what everyone is doing with DataFrames/FixedEffectModels/etc. is all awesome. I don't see any reason to push hard on that because it isn't Julia's bread and butter. If all someone is doing is data work where specialized estimation packages already exist and they know R, they probably should just use R rather than switch to Julia.

If someone knows julia really well or the data stuff is a small part of a project, then I think those packages are all on the right track. Just make sure they don't break (e.g. the CSV woes earlier this year).

For bread-and-butter scientific computing there is no reason that Julia should be more frustrating than matlab/scipy. From my side, the high level list and coordination device for this is mostly there: https://docs.sciml.ai/dev/

The problem is that this is a lot more than just listing out packages and adding in documentation.

  • Some of the things listed (e.g. Interpolation) are deceptive because the interface doesn't handle the diversity of the tasks.
    • The DataInterpolations package does what it does well, but it doesn't cover multi-dimensional interpolation. And it isn't like you can just switch to Interpolations.jl, because that doesn't support irregular-grid cubic splines, etc. (a sketch contrasting the two interfaces is included after this list).
    • So you need to go through and ensure all of the interfaces support the variations that are required without loss of structure (e.g. you need to distinguish between a regular and an irregular grid), then ensure all of the backend packages actually work, then benchmark and document the algorithm choice in different circumstances. Hopefully documentation + benchmarking + downstream tests will keep the ecosystem in check without monorepos.
  • Others might not implement the algorithms that economists really need, so additional backend algorithms or features are required.
  • The https://docs.sciml.ai/dev/modules/Optimization/ wrapper is essential, but it is just getting started. Finding a consistent call interface and solution return type is tricky in that domain (a hedged sketch of the consolidated call pattern is included after this list).
    • As one small example, it doesn't have a good way to swap the optimization sense (min vs. max) consistently across all of the optimizers.
    • It doesn't have interfaces to deal with constrained optimization especially well - especially if you want to use AD to get a sparse Jacobian of the constraints and the objective separately. There is no support for complementarity conditions - which are essential in a lot of economics applications.
    • People solving "real" problems often find the commercial optimizers significantly better. We need to show those examples and test things like accessing Knitro (and add in software if the Optimization.jl-to-MOI interface doesn't work).
    • This needs tonnes of tests and examples to ensure that it works seamlessly - which is a lot of work because every optimizer has different arguments and return types.
    • It is necessary to reconcile it with other great ecosystems such as JuliaSmoothOptimizers, and it isn't obvious how to achieve that.
  • Showing consistent patterns of new things that are available by using Julia is a lot of work.
    • For example, when and how can you use symbolic expressions, and how does that generate jacobians/gradients/etc.? How to use LabelledArrays in each of them (if appropriate)?
    • How can we get gradients of solutions with respect to parameters, and what does that open up?
  • Also, let me point out the importance of gradients.
    • The biggest benefit economists can get is to start providing gradients to their algorithms, which lets them use new classes of algorithms. In principle, with things like Symbolics and upcoming projects like Enzyme, it will be far easier to do that than anything in matlab or scipy.
    • Furthermore, people may realize that if they can differentiate the components of their model with respect to the parameters themselves through smart use of AD rules, then Julia can enable whole new classes of algorithms and methods.
  • On that note, cross your fingers that Enzyme.jl works out, and provide as many bug reports and use-cases to that project as you can when it is ready to be released into the wild. Zygote is coming towards its end-of-life and ForwardDiff is limited. Without good AD, the best Julia could hope for is to match the stability of python/matlab packages. With good AD, and the combination of symbolics+AD, it can vastly surpass them.
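As a concrete illustration of the interpolation point above, here is a hedged sketch (mine) of how two packages expose different call patterns and different coverage; it assumes DataInterpolations.jl for 1-D irregular grids and Interpolations.jl for regular multi-dimensional grids, with APIs as in recent releases:

using DataInterpolations, Interpolations

# 1-D cubic spline on an irregular grid (DataInterpolations.jl): constructor takes (values, knots)
t = sort(rand(30)); u = exp.(-t)
itp1 = DataInterpolations.CubicSpline(u, t)
itp1(0.4)

# 2-D cubic B-spline, but only on a regular grid (Interpolations.jl): a different interface entirely
xs, ys = 1:10, 1:10
A = [sin(x) * cos(y) for x in xs, y in ys]
itp2 = Interpolations.scale(Interpolations.interpolate(A, BSpline(Cubic(Line(OnGrid())))), xs, ys)
itp2(2.5, 3.5)

Neither package covers the other's case, and nothing in the call signatures hints at that, which is the signal extraction problem in miniature.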
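And here is a hedged sketch of what the consolidated Optimization.jl call pattern already buys you (names follow the current SciML docs and assume the OptimizationOptimJL backend; treat it as illustrative, not definitive):

using Optimization, OptimizationOptimJL   # OptimizationOptimJL wraps the Optim.jl solvers

rosenbrock(u, p) = (p[1] - u[1])^2 + p[2] * (u[2] - u[1]^2)^2
optf = OptimizationFunction(rosenbrock, Optimization.AutoForwardDiff())  # gradient generated by AD
prob = OptimizationProblem(optf, zeros(2), [1.0, 100.0])

sol1 = solve(prob, NelderMead())   # derivative-free
sol2 = solve(prob, BFGS())         # gradient-based, reusing the AD gradient
sol1.u, sol2.u                     # same problem object, swapped algorithms

The same solve(prob, alg) call is supposed to reach other backends (NLopt, MOI solvers such as Ipopt or Knitro) through their wrapper packages, and AutoForwardDiff is exactly the "provide gradients and unlock new algorithm classes" point made above; what is still missing is the sense-swapping, constraint, and complementarity support described in the sub-items.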

So my list is basically Chris's (and I am purposefully leaving out things like linear regressions, which seem on track for now). Do things like pick up a copy of CompEcon – Paul L. Fackler and ensure that all of the key general-purpose algorithms are covered, etc. I think it starts with getting the downstream packages solid and maintained for all of the things people need, and then the next step is filling in the documentation/unit tests/benchmarking. In that process people will find a lot of bugs.

For sure. But SciML is an amazing organization that is willing to shoulder the burden of consolidating the scientific computing ecosystem here - and there are many complementarities and spillovers from pharma applications. It just isn't a perfect fit, which is why it won't happen without our involvement.

Just to repeat my statement at the top: package quality/documentation happened with Matlab because academics paid $$$ to them to write/test/document/maintain packages, and it happened with scipy/jupyter/etc. because someone went to ask grant agencies for a lot of money. My understanding is that these didn't require huge numbers of people, but rather a few talented professional developers and enough time to focus rather than writing papers. If that is of interest to anyone who wants to take on components of the ecosystem and can dedicate the time, then ask agencies for money (within the SciML umbrella)!

Sorry for the long list here. It is my last because I think it fully summarizes my feelings on where the real issues lie and the best approaches to fix them. Listen to other economists' opinions as well (especially economists giving an "I tried Julia" report), since I may have a limited perspective and already know julia too well.

8 Likes

What do you have in mind? At least both R and Julia are 1-based (and column-major), so they seem like a good match. I've just tried to use RCall.jl but don't use R regularly. I also use Python (through PyCall at the time; I would now use PythonCall), and with Python (and e.g. C) being 0-based, that could be a source of bugs (as with OffsetArrays.jl). Python and C/C++ being row-major, however, is not a source of bugs when using or converting code to/from Julia, only a performance pitfall to know about.

I would have thought RCall, with its R REPL mode, is pretty sweet for those who know and sometimes prefer that language. [Is something similar also needed in PythonCall?]
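For anyone who has not tried it, a small sketch of the RCall.jl workflow being referred to (it assumes an R installation that RCall can find, or R_HOME = "*" for the Conda-managed install):

using RCall

x = randn(100)
@rput x                      # copy the Julia vector into the embedded R session
R"m <- mean(x); s <- sd(x)"  # run arbitrary R code with the R"" string macro
@rget m                      # pull results back as Julia values
@rget s
# at an empty julia> prompt, pressing $ switches into the interactive R REPL mode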

I see (this is non-default, unlike PythonCall, which downloads Python for you by default):

When R_HOME is set to "*", RCall.jl will automatically install R for you using Conda.

@cjdoris PythonCall is pretty sweet (compared to PyCall), in the sense that it takes care of Python package dependencies for you (directly or indirectly with Conda/Mamba). I suppose RCall could do something similar, and even reuse the same infrastructure, to make installation transparent. I don't know what most R users use: CRAN, Conda (or RStudio Package Manager, which also supports PyPI)?

I'm a big believer in using Julia together with other languages, e.g. Python or R (for their libraries), and also in using Julia from other languages, if people can be persuaded to do so. Possibly we need more wrappers for their code?

@y.lin can you clarify:

The reason is that while Stata offers much better algorithms than any of the four languages, users engage with algorithms written by Stata rather than writing their own.

Which algorithm or algorithms do you have in mind (just curious what could be improved in e.g. Julia)? I'm a bit surprised that they would be slower or somehow worse elsewhere (in other, more established languages, e.g. R). Or did you have something else in mind than speed, something workflow-related, so that "algorithm" isn't quite the right term, or is misleading?

Thanks for all the advice!!! I'll definitely explore all the suggestions you made in the next few days, when I return to the computation of the model.

Coming back to the topic here, I think there are two matters:

  1. how to improve packages, which @jlperla talks about and which I can't say anything meaningful about.
  2. how a new user could learn these tools more smoothly. I'll refer exclusively to economists (rather than giving general advice for any type of user), since others could have different needs.

I'll refer only to 2). I cannot give a solution, but I can provide an example of a path I think would be worth exploring.

First, let me emphasize some aspects. I imagine anyone who is reading this post is interested in the topic. By this, I mean that all of us will invest enough time until we learn the tools. However, it's not that type of user that I'm concerned about. Even I know I'll spend weeks and months until I learn all the tricks.

Rather, I'm thinking of a new user, who doesn't necessarily have time or who won't use Julia/R every day. And even for people who have to learn the tool, it's about making their lives easier while not picking up bad habits. Reaching a broad audience is what you eventually need to support Julia, since it creates the necessary scale to attract financing. So, in this sense, it's also related to what @jlperla was saying.

Second, my perspective is not about defending Julia. As in this post's title, I'm thinking of whether people should use Julia/Python/R.

For the example, imagine that someone asks me whether they should learn LaTeX, and how to do so. Given my experience with LaTeX, this is how I'd tackle the matter.

First, I'd ask them "what will be your use?" If you're aiming at industry or will use it only sporadically, don't learn LaTeX. I don't think you'll need it, and quite likely everyone around you will be using Office/LibreOffice. Learning a new tool is always good, but time is finite, so I don't think it's optimal.

Second, assume you'll work in academia or any sector that uses a lot of math. Then LaTeX is a valuable tool to learn. So, the question here is: how do you spend your time learning without wasting too much of it on finding an answer? There's no unique answer, but it's really valuable to know different people's experiences.

What would I recommend? I use a hybrid method between LyX and pure LaTeX. I use pure LaTeX for the preamble and some specific functions. LyX, in turn, allows me to use keyboard shortcuts. Right now, I press ctrl+o for all the operators I commonly use, ctrl+g for all Greek letters, etc. You can even modify its menu and call it with keyboard shortcuts. For instance, if I press alt+1, I get the menu "Decor" I created for my version of LyX.

You can also press ctrl+d and get another custom menu.
Why am I recommending this? Because it will increase your productivity. For instance, I do all my algebra in LaTeX. Yesterday, I literally wrote 60 pages of pure algebra.

And if you need pure LaTeX, e.g. to create an environment, you can do it. Moreover, LyX then allows you to create a keyboard shortcut for that environment (the "Remark" menu in my case).

Point of the example: this way of using LaTeX worked for me. It is not the one true way, just an approach that suited me and could work for others. Having notes with different opinions is more valuable than discussions about whether Julia is faster than Python. And I think laying out the motives behind a recommendation is key: it lets you know whether the approach suits your needs. For example, people used to tell me that I should learn pure LaTeX. Why? Just because. Until I found LyX, nothing was answering my concerns about productivity.

Right now, regarding the choice of Julia/R/Python, I feel like I'm spending more time finding those opinions. You tend to find only discussions oriented to package creators/experienced programmers, which are not key for new users from economics. I think I started using Julia for the wrong reasons, but kept using it for good reasons (e.g., I chose it because of its speed, but stayed because it has a more natural syntax for scientific matters).

2 Likes

Maybe I'm stating the obvious, but it's generally a good idea for new researchers (=PhD students) to ask their advisors about choice of software, as the advisors will have a better understanding of the problems that the student will be facing (compared to, say, people on an internet forum, or the authors of a VoxEU column).

2 Likes

Although there's also a good chance you then end up programming in FORTRAN or GAUSS :slight_smile:

7 Likes

Yes, that thought crossed my mind as well, but I wouldn't feel it's right to recommend that a student disregard the instructions of their advisors even in that case. I've seen advisors (in structural labor, where Fortran is still used a lot) go over their advisees' Fortran code to an extent that probably would not have happened if they had chosen a more modern language.

Bottom line (and then I'm out): when you decide which language to learn to do economics research, I'd advise speaking to people who understand what you want to do (in the medium term and in the long term), what your background is, and what your situation is (including the support you can get from advisors, colleagues, friends, documentation, and the internet).

1 Like

Actually, I always wondered why there weren't more structural people around in the Julia community; it seems their use cases would be well addressed by a language that allows coding high-performance models from scratch (not unlike quant macro)? I'm aware of the pyBLP project; something like that feels like it would have had a more natural home in Julia?

Macro, macrofinance, and corporate finance structural models are beginning to pop up in Julia. I think you are right that micro models like demand estimation that you mentioned have not come around as much. I would consider it only a matter of time.

I get a 2.85x speedup on top of your 2.6x, for 7.4x over @y.lin's, and yet 2x more should be possible. I was trying out different things that didn't pan out (e.g. log/exp to get rid of the division; scroll to the right to see), and also just lower precision (which did work but uncovered a minor bug in LoopVectorization):

julia> @time ( function likelihood4(o, a, b, h₀, y2)
           h = zeros(Float32, length(y2))  # h = similar(y2)
           lik = Float16(0.0)
           h[begin] = h₀; my_range = eachindex(y2)[begin+1:end]
           @inbounds for i in my_range
               h[i] = o + a * y2[i-1] + b * h[i-1]
           end
           @turbo for i in my_range
               lik += log(convert(Float16, h[i])) + convert(Float16, y2[i]) / convert(Float16, h[i]) # log_h = log(h[i]); lik += log_h + ℯ^(log(y2[i]) - log_h)
           end
           return lik
       end )
  0.000485 seconds (54 allocations: 9.308 KiB)  # that timing is bogus!

julia> @btime lik2 = likelihood4($o, $a, $b, $v, $y2)
  22.606 Ī¼s (1 allocation: 11.88 KiB)
26360.732f0

julia> @btime lik2 = likelihood2($o, $a, $b, $v, $y2)
  64.436 Ī¼s (1 allocation: 6.00 KiB)
26360.52849330753

@elrod, for some reason I get a Float32 number back, but I should get Float16 back, and I do if @tturbo or @turbo aren't used, so that's the minor bug (it could lead to type-instability elsewhere; not a correctness bug, just an inconvenience).

I see you updated the appendix:

Post publication p.s.

After our article came out, Jorge Vieyra suggested here some ways to speed up the Julia calculation. It makes use of the @turbo macro […] While we think this proposal is interesting, it is not a fair comparison with the calculations above because they depend on specialist knowledge […]

Note how @turbo requires rewriting the likelihood function by summing the likelihood separately

I did first:

y2 = rand(Float16, 3000); y = y2; # the rest is the same; I didn't have "data.dat" or sensible data (not sure it matters, or the size)

@time using LoopVectorization
  3.918259 seconds (5.64 M allocations: 372.844 MiB, 36.23% compilation time: 80% of which was recompilation)

To be fair, that time alone swamps the runtime of all the actual likelihood functions, but you should be able to eliminate that fixed small startup cost with a sysimage. Actually compiling the code also took almost 10 sec, for a combined 12.44 sec, while compiling the original likelihood function was immediate. That compilation overhead should also be possible to reduce to zero. It's unclear to me if

Note, for the original, @time function likelihood ... is immediate, and LoopVectorization is missing some precompilation statements; at least if the code were in a package it would be precompiled (precompilation isn't yet full for packages, a TODO for Julia, but I think it already would be if that package were also in the sysimage).
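For completeness, a hedged sketch of the sysimage route with PackageCompiler.jl, which should remove the load (and most of the compile) cost discussed above; warmup.jl is a hypothetical script that runs the likelihood once so that the compiled code gets baked in:

using PackageCompiler
create_sysimage(["LoopVectorization"];
                sysimage_path = "sys_lv.so",                 # .dll / .dylib on Windows / macOS
                precompile_execution_file = "warmup.jl")
# then start Julia with:  julia --sysimage sys_lv.so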

Is all this "specialist knowledge"? I suppose so for the sysimage, for how to reduce latency, and for reducing precision. But you only need to do it once in your code (for the h array). For the y2 input array, it's best if it's already in the reduced precision, or you're just shifting work around. For some code that is not O(n) it might still be worthwhile to make a lower-precision copy (e.g. in the function); for other stuff, it would be best to handle it with something like:

readdlm("data.dat", Float16) # DelimitedFiles' readdlm accepts an element type as the second argument; Float64 is a sensible default otherwise, and something like CSV.jl can likewise fix column types

"Half" (aka Float16) isn't defined in the C standard (so is it unfair to use? I don't think so, Julia is just better here). It is available as a GCC extension, at least.

Thank you all for the thoughtful discussion. If we can insert a link into the published VoxEU piece with a curated list of computational resources for economists, what would you recommend? If no such list exists, we would be happy to host it on our server if you send us recommendations, but it would be better if there was a GitHub page we all could contribute to.

2 Likes

We are here! I think that a lot of this is inertia. More senior people tend to still use Fortran, but there's a solid contingent of younger researchers who mostly tend to use julia. Almost all of the structural people in my PhD cohort (who just went on the market this past year) use julia, for instance...

3 Likes

That is very fair. I will try to draft some language for that, although, after thinking about it, I think it would be better suited for the documentation in Pkg.jl as a separate PR. Will post back later.

Also, to refresh my memory: if you are using a Project.toml to manage the test environment of some package (I'll call it mypkg), do you not have to specify dependent packages in the test Project.toml even if they are used in the mypkg environment, per your point in 1? My understanding was that, yes, you can have conditional test dependencies that are only available in a test environment, but that the activated test environment could not "pull in" packages from the mypkg environment without explicitly adding them (except for mypkg itself). I may be misunderstanding your first point, but I do understand your second.

Thanks!

I am very confused about the test environment, especially after reading Pkg.jl's documentation, which, if I remember correctly, says the test environment support is experimental. Therefore, the tutorial you have drafted should help me a lot. I never actually have faith that I am using the test functionality correctly.

I tend to run include("any_test_scripts.jl") in the mypkg environment instead of using the test command.

1 Like

Hopefully relevant enough to necropost: I just came across this

https://nittanylion.github.io/Grumps.jl/dev/

which implements a cutting-edge estimator for BLP-style models described in this paper:

so it seems that yes, structural IO people are using Julia after all!

3 Likes