A Comparison of Programming Languages in Economics

I agree that the sweet spot for many users is simple, easy-to-understand code that may be a bit slower than C but does not lose one or two orders of magnitude of performance. In my first experience with Julia (back in Julia 0.2), the Julia code I was working with was 2x slower than the Cython code that was available, but much more concise. I’m sure that with Julia 0.6 and more optimization I could match that timing, but that’s not the point: Julia was still 30x faster than the simple native Python code I could’ve written, and that’s what mattered.

I felt that, while I had to rely on somebody else to write the Cython code, the Julia code was something that I could write myself and this feeling was very encouraging: I believe it is one of the key selling points of the language.

3 Likes

Yes, very true.
The one thing I found early on that made a big difference (many times faster, not just 10%-style increases), and that is easy to explain to people, is the use of @inbounds (along with explaining why in C/C++ you don’t have the choice, but in Julia you can choose between the safety net and performance).
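For example, a minimal sketch of the sort of change I mean (the function names are just made up for illustration):

# Bounds-checked by default: every x[i] and y[i] access is verified.
function axpy_checked!(y, a, x)
    for i in 1:length(x)
        y[i] += a * x[i]
    end
    return y
end

# Opting out of the safety net: @inbounds skips the bounds checks,
# which removes branches from the inner loop and lets it vectorize.
function axpy_fast!(y, a, x)
    @inbounds for i in 1:length(x)
        y[i] += a * x[i]
    end
    return y
end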

1 Like

For me, the problem was not that I couldn’t write the low-level C/C++ code (when I first started with Julia, I had 35 years of experience writing performance-oriented C code, and had to be convinced that I could get similar performance writing pure Julia). Rather, I was able to write one generic Julia function that ended up generating 3 optimized methods, which I would have had to write by hand as 3 separate C functions.
Julia was a huge step forward for me in performance, my own performance (and maintaining 1 function instead of 3 similar ones is a big improvement as well).
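A toy sketch of the pattern (not the actual functions I wrote):

# One generic definition...
function dot_generic(x, y)
    s = zero(eltype(x)) * zero(eltype(y))
    @inbounds for i in 1:length(x)
        s += x[i] * y[i]
    end
    return s
end

# ...and Julia compiles a separate optimized native version for each argument type,
# which in C I would have written (and maintained) as three separate functions:
dot_generic(rand(Float64, 100), rand(Float64, 100))   # Vector{Float64}
dot_generic(rand(Float32, 100), rand(Float32, 100))   # Vector{Float32}
dot_generic(rand(1:100, 100), rand(1:100, 100))       # Vector{Int}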

That’s one of the reasons why I think Julia should be looked at as much more than just a niche technical/numeric/scientific language: it also has many features that are very appealing to C / C++ / C# / Java programmers who would love to improve their own productivity at little or no cost to the final performance of the code.

4 Likes

There’s a big difference just from broadcasting the minus:

julia> a = randn(1000000);

julia> b = similar(a);

julia> f1(v, w) = maximum(abs.(v - w))
f1 (generic function with 1 method)

julia> @benchmark f1($a,$b)
BenchmarkTools.Trial: 
  memory estimate:  15.26 MiB
  allocs estimate:  4
  --------------
  minimum time:     4.000 ms (0.00% GC)
  median time:      6.902 ms (27.11% GC)
  mean time:        6.271 ms (25.37% GC)
  maximum time:     8.152 ms (39.82% GC)
  --------------
  samples:          797
  evals/sample:     1

julia> f2(v, w) = maximum(abs.(v .- w))
f2 (generic function with 1 method)

julia> @benchmark f2($a,$b)
BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     2.817 ms (0.00% GC)
  median time:      2.950 ms (0.00% GC)
  mean time:        3.953 ms (18.75% GC)
  maximum time:     6.492 ms (29.83% GC)
  --------------
  samples:          1263
  evals/sample:     1

julia> f3(v, w) = maximum(abs, v .- w)
f3 (generic function with 1 method)

julia> @benchmark f3($a,$b)
BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     2.886 ms (0.00% GC)
  median time:      3.009 ms (0.00% GC)
  mean time:        4.011 ms (18.57% GC)
  maximum time:     6.773 ms (33.35% GC)
  --------------
  samples:          1244
  evals/sample:     1

julia> f4(v1, v2) = maximum(x -> abs(x[1]-x[2]), zip(v1, v2))
f4 (generic function with 1 method)

julia> @benchmark f4($a,$b)
BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  4
  --------------
  minimum time:     2.238 ms (0.00% GC)
  median time:      2.299 ms (0.00% GC)
  mean time:        2.340 ms (0.00% GC)
  maximum time:     3.307 ms (0.00% GC)
  --------------
  samples:          2132
  evals/sample:     1

julia> f5(v1,v2) = maximum(abs(a-b) for (a, b) in zip(v1, v2))
f5 (generic function with 1 method)

julia> @benchmark f5($a,$b)
BenchmarkTools.Trial: 
  memory estimate:  112 bytes
  allocs estimate:  5
  --------------
  minimum time:     4.389 ms (0.00% GC)
  median time:      4.443 ms (0.00% GC)
  mean time:        4.497 ms (0.00% GC)
  maximum time:     5.847 ms (0.00% GC)
  --------------
  samples:          1111
  evals/sample:     1

“Premature optimization is the root of all evil”

but it is totally fun to do. :popcorn: :smile:

6 Likes

As are most things we shouldn’t be doing…

Better, though, to take Donald Knuth’s full quote to heart:

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

(Emphasis mine :nerd_face:)

Too often over the years I’ve seen the short form of that quote used to excuse inefficient code, or to justify not thinking at all about the performance aspects of the data structures and algorithms in a design until it is too late.

  1. Think about performance from the beginning, when deciding on the architecture.
  2. Write code that is robust and maintainable.
  3. Benchmark thoroughly, and understand the bottlenecks with real-world data (see the sketch after this list).
  4. Then, at a point when it is no longer “premature”, optimize the hell out of that critical 3% of your application.
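For point 3, a rough sketch of what I mean on current Julia (the kernel here is only a placeholder for your real hot path):

using BenchmarkTools, Profile

my_kernel(v) = sum(abs2, v)      # placeholder for the real hot path
data = rand(10^6)                # stand-in for real-world data

# Benchmark the candidate bottleneck with realistic inputs...
@benchmark my_kernel($data)

# ...and profile a full run to see where the time actually goes.
Profile.clear()
@profile for _ in 1:100
    my_kernel(data)
end
Profile.print()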
11 Likes

Did you get it to be faster than Numba?

I think we all know Julia is a fast language.
There is no question about it.
It is also known to give the user full control, which means an advanced user can generate efficient code.

Now I think the challenge is to make it fast while still being simple to use.
Namely, using the paradigms of people coming from other languages (Python / MATLAB).
The challenge is the performance in the hands of the less-than-average user.

A good linter could probably find this…

Amen. To spur the discussion, a quote you sent me recently:

“A language that doesn’t affect the way you think about programming is not worth knowing”, Alan Perlis

Having a heavily optimized version is very important because it gives a sense of the asymptotic performance of each language (which you sometimes need), but it isn’t how users would want to write code if they don’t have to. The point of this benchmark was to have everything self-contained so that the speed comparisons are consistent.

But for solving real problems, if we are just writing autarkic code that looks like Fortran, we might as well stick with Fortran (or numpy and numba, which are good enough for that programming style). In Julia, a new version of these algorithms that embraces clean and elegant abstractions (and uses existing libraries, auto-differentiation, etc.) is now possible without leading to first-order performance problems.

So what would be more useful than micro-optimizations is the most elegant version of these algorithms, using as many libraries, functional programming patterns, and high level abstractions as appropriate (while keeping performance in maybe a 50-100% band?). Maybe economists don’t understand how to write that kind of code and would have to change how they think, but that is the point…

BigFloats just work. I can try algorithms without worrying about numerical stability, and only then, once I know everything works as intended, worry about making things stable.
The peace of mind, and the ease of only worrying about one problem at a time, is nice.

The ability to just say “what if everything was BigFloats?” and then have everything work is lovely.

Or autodiff. I can use any package someone wrote in Julia, and odds are good it’ll work with zero effort.
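A minimal sketch of what “just works” means here, assuming ForwardDiff is installed (the Newton iteration is only a toy example):

using ForwardDiff

# Written once, with no eye towards precision or differentiation:
function newton_sqrt(c, x0; iters = 20)
    x = x0
    for _ in 1:iters
        x -= (x^2 - c) / (2x)
    end
    return x
end

newton_sqrt(2.0, 1.0)               # Float64, as originally intended
newton_sqrt(big"2.0", big"1.0")     # "what if everything was BigFloats?" It just works.
ForwardDiff.derivative(c -> newton_sqrt(c, 1.0), 2.0)   # autodiff through it, zero extra effort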

Julia is a big box of Legos. They all just work and fit together to let you build things, whether or not they came from the same set or were designed with other pieces in mind.

Do we have that, do we offer that to others, when we write in Fortran or Numba?

EDIT:
Overall, I definitely agree. The big selling point is productivity, making life easy.
I just mean that it’s great we can optimize code, and that there is still a lot of value in that over Fortran.

5 Likes

Where is the like button for this?

Well also, at some point, in order to get top-notch performance you do have to write “the good algorithm”: the one that’s non-allocating, that runs on the GPU, or that uses distributed parallelism. As you start to control more details, it will invariably start to look more like C++ or Fortran, just because you’re taking control of more and more. But even at that level, it’s much cleaner than actual C++ or Fortran. Most importantly, the transition is smooth. Since it’s all Julia code, you can still broadcast a few things if you want, and you can replace broadcasts one by one with hyper-optimized mutating LLVM calls. This is very different from writing a C kernel and needing to “get it right”. Additionally, you still get all of the abstractions and genericity of Julia, so your code can still work with BigFloats and can still be written in a type-based style even at this level.
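A minimal sketch of that smooth transition (a toy update step, not from any particular package):

# Step 1: the easy, allocating version; each call allocates a new array.
update(x, v, dt) = x .+ dt .* v

# Step 2: the same expression fused into a preallocated output; zero allocations per call.
update!(out, x, v, dt) = (out .= x .+ dt .* v; out)

# Step 3, when profiling says it matters: take full control of the loop.
function update_loop!(out, x, v, dt)
    @inbounds @simd for i in eachindex(out, x, v)
        out[i] = x[i] + dt * v[i]
    end
    return out
end

All three versions stay generic, so the same code keeps working with BigFloat (or dual-number) element types.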

I understand the importance of getting “first code” to be fast, but what a language also needs is a high upper limit on code speed, since that determines the speed at which packages and other optimized projects will be written. In addition, because of various interprocedural optimizations and small-function inlining, there’s also a lot more vertical optimization available in larger integrated codes that are “Julia all the way down”. This is the stuff that truly matters for large projects, and it’s what makes things like Celeste.jl possible in pure Julia.

It’s a nice dream to want the compiler to do everything, but it is good to point out the reality of being able to collaboratively get good code in a single language, because that “productivity language” allows someone to tweak the code all the way down to the bone.

5 Likes

I love that quote! My three all-time favorite languages (Scheme, CLU, and Julia) were all very good in that respect.

It would be very cool (at least for a beginner like me) to see a blog post (or doc chapter) with examples showing this gradually increasing optimization from skin to bone!

1 Like

Amen to that. Can we get Scott’s 4 Laws carved on a stone tablet, please?
Joking aside, those are good principles.

2 Likes

Okay. TL;DR of this comment: I switched to an Intel computer, and still can’t reproduce. Julia is still about 50% faster than Numba, whether or not Julia is using libimf or MKL.

My previous benchmarks were on a Ryzen (AMD) chip.
While I did build Julia from source, Julia does not actually fully benefit: LLVM 3.9.1 does not support Ryzen, in the sense that versioninfo() reports

 LLVM: libLLVM-3.9.1 (ORCJIT, generic)

instead of

LLVM: libLLVM-3.9.1 (ORCJIT, zenver1)

Additionally, on master, I get spammed with “zenver1 not recognized” warnings.
But other than that, I’m not sure what the implications actually are. When I start Julia with -O3, StaticArrays do get heavy SIMD (e.g., an SMatrix{3,3} * SVector{3} product takes around 2 ns with -O3, vs. around 6 ns with -O2).
I’ll update these with actual copy + pastes when I get home (and set up ssh again…).
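In the meantime, here’s a sketch of the kind of micro-benchmark I mean, assuming StaticArrays and BenchmarkTools (the exact numbers obviously depend on the -O level and the CPU):

using StaticArrays, BenchmarkTools

A = @SMatrix rand(3, 3)
x = @SVector rand(3)

@benchmark $A * $x    # roughly 2 ns with -O3 here, closer to 6 ns with -O2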

I tried setting LLVM_VER = 4.0.0 in Make.user, but it wouldn’t build. It just hangs on a patch. Maybe I can try going through the patches to see which are backports that are no longer necessary for 4.0.0…but unless someone tells me either “it’s easy, just do…” or that the benefit of having a supported processor is stupendous, I’ll just wait until Julia officially supports LLVM > 4.0.0.

Anyway, bringing up the processors again because I’m currently at my university computer, which is equipped with:

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40* (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  WORD_SIZE: 64
  BLAS: libmkl_rt
  LAPACK: libmkl_rt
  LIBM: libimf
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

Julia was built from source, and linked to MKL.
I just downloaded and installed Anaconda. Figuring I can trust Intel when it comes to performance, I ran

conda config --add channels intel

before installing numba.

Results of running your scripts, julia with MKL:

$ julia RBC.jl
2018-02-13T11:19:29.693
2018-02-13T11:19:39.048
0.9355
$ julia RBC.jl
2018-02-13T11:19:42.185
2018-02-13T11:19:51.531
0.9346
$ julia RBC.jl
2018-02-13T11:19:56.862
2018-02-13T11:20:06.229
0.9367000000000001
$ julia RBC.jl
2018-02-13T11:20:11.23
2018-02-13T11:20:20.589
0.9359
$ julia -O3 RBC.jl
2018-02-13T11:20:41.661
2018-02-13T11:20:51.045
0.9384
$ julia -O3 RBC.jl
2018-02-13T11:20:58.442
2018-02-13T11:21:08.098
0.9656
$ julia -O3 RBC.jl
2018-02-13T11:21:15.534
2018-02-13T11:21:25.519
0.9985
$ julia RBC.jl
2018-02-13T11:21:42.455
2018-02-13T11:21:51.993
0.9538
$ julia RBC.jl
2018-02-13T11:22:12.554
2018-02-13T11:22:21.909
0.9355

I also have:

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40 (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

OpenBLAS and libopenlibm instead of mkl and libimf. The result?

$ /home/celrod/Documents/julia-OB/usr/bin/julia RBC.jl
2018-02-13T11:58:45.152
2018-02-13T11:58:54.44
0.9288
$ /home/celrod/Documents/julia-OB/usr/bin/julia RBC.jl
2018-02-13T11:59:00.121
2018-02-13T11:59:09.579
0.9458
$ /home/celrod/Documents/julia-OB/usr/bin/julia RBC.jl
2018-02-13T11:59:21.31
2018-02-13T11:59:30.652
0.9342
$ /home/celrod/Documents/julia-OB/usr/bin/julia RBC.jl
2018-02-13T11:59:35.405
2018-02-13T11:59:44.695
0.929
$ /home/celrod/Documents/julia-OB/usr/bin/julia RBC.jl
2018-02-13T11:59:48.826
2018-02-13T11:59:58.182
0.9356

No major difference based on whether we’re using an entirely free stack or MKL.

Python with intel’s channel:

$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:25:14.169929
2018-02-13 11:25:29.403137
1.5232970679178834
$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:25:34.466773
2018-02-13 11:25:49.443639
1.4976639692205935
$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:25:55.528475
2018-02-13 11:26:10.788838
1.5260133791249246
$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:26:16.699890
2018-02-13 11:26:32.467552
1.5767432948108762
$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:26:42.291484
2018-02-13 11:26:57.238478
1.494676438672468

Python without intel’s channel:

$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:55:01.450319
2018-02-13 11:55:16.455309
1.5004762928001583
$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:55:23.097814
2018-02-13 11:55:38.407803
1.5309765067882837
$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:55:43.886989
2018-02-13 11:55:59.518951
1.5631739235017448
$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:56:39.370372
2018-02-13 11:56:54.729220
1.5358620896935462
$ /home/celrod/anaconda3/bin/python RBC.py
2018-02-13 11:56:58.620716
2018-02-13 11:57:13.877899
1.5256956543307751

No major difference depending on whether or not we’re using Intel’s channel.

RBC_CPP, g++7.2

$ g++-7 -Ofast -march=native RBC_CPP.cpp -o rbc_cpp && ./rbc_cpp 
 My check = 0.146549
 Elapsed time is   = 1.32139
$ g++-7 -fprofile-generate -Ofast -march=native RBC_CPP.cpp -o rbc_cpp && ./rbc_cpp 
 My check = 0.146549
 Elapsed time is   = 1.2594
$ ./rbc_cpp 
 My check = 0.146549
 Elapsed time is   = 1.2647
$ g++-7 -fprofile-use -Ofast -march=native RBC_CPP.cpp -o rbc_cpp && ./rbc_cpp 
 My check = 0.146549
 Elapsed time is   = 1.23953
$ ./rbc_cpp 
 My check = 0.146549 
 Elapsed time is   = 1.2377

RBC_CPP, icc 18.0.0

$ icc -xHost -Ofast RBC_CPP.cpp -o irbc_cpp && ./irbc_cpp 
 My check = 0.146549
 Elapsed time is   = 0.665514
$ icc -prof-gen -xHost -Ofast RBC_CPP.cpp -o irbc_cpp && ./irbc_cpp
 My check = 0.146549
 Elapsed time is   = 0.727978
$ ./irbc_cpp 
 My check = 0.146549
 Elapsed time is   = 0.734366
$ icc -prof-use -xHost -Ofast RBC_CPP.cpp -o irbc_cpp && ./irbc_cpp 
 My check = 0.146549
 Elapsed time is   = 0.636588
$ ./irbc_cpp 
 My check = 0.146549
 Elapsed time is   = 0.641704

RBC_CPP_2, g++7.2

$ g++-7 -Ofast -march=native RBC_CPP_2.cpp -o rbc_cpp_2 && ./rbc_cpp_2 
 My check = 0.146549
 Elapsed time is   = 1.30672 seconds.
$ g++-7 -fprofile-generate -Ofast -march=native RBC_CPP_2.cpp -o rbc_cpp_2 && ./rbc_cpp_2 
 My check = 0.146549
 Elapsed time is   = 1.23079 seconds.
$ ./rbc_cpp_2
 My check = 0.146549
 Elapsed time is   = 1.22811 seconds.
$ g++-7 -fprofile-use -Ofast -march=native RBC_CPP_2.cpp -o rbc_cpp_2 && ./rbc_cpp_2 
 My check = 0.146549
 Elapsed time is   = 1.30294 seconds.
$ ./rbc_cpp_2
 My check = 0.146549
 Elapsed time is   = 1.29535 seconds.

RBC_CPP_2, icc 18.0.0

$ icc -xHost -Ofast -std=c++11 RBC_CPP_2.cpp -o irbc_cpp_2 && ./irbc_cpp_2
 My check = 0.146549
 Elapsed time is   = 2.31978 seconds.
$ icc -prof-gen -xHost -Ofast -std=c++11 RBC_CPP_2.cpp -o irbc_cpp_2 && ./irbc_cpp_2
My check = 0.146549
Elapsed time is   = 4.41832 seconds.
$ ./irbc_cpp_2
My check = 0.146549
Elapsed time is   = 4.41458 seconds.
$ icc -prof-use -xHost -Ofast -std=c++11 RBC_CPP_2.cpp -o irbc_cpp_2 && ./irbc_cpp_2
My check = 0.146549
Elapsed time is   = 2.20696 seconds.
$ ./irbc_cpp_2
My check = 0.146549
Elapsed time is   = 2.21374 seconds.

RBC_F90, gfortran-7.2

$ gfortran-7 -Ofast -march=native RBC_F90.f90 -o rbc_f && ./rbc_f
 My check:  0.14654914390886931     
 Elapsed time is    1.32013893  
$ gfortran-7 -fprofile-generate -Ofast -march=native RBC_F90.f90 -o rbc_f && ./rbc_f
 My check:  0.14654914390886931     
 Elapsed time is    1.32421708    
$ ./rbc_f
 My check:  0.14654914390886931     
 Elapsed time is    1.36759603    
 $ gfortran-7 -fprofile-use -Ofast -march=native RBC_F90.f90 -o rbc_f && ./rbc_f
 My check:  0.14654914390886931     
 Elapsed time is    1.27381504    
$ ./rbc_f
 My check:  0.14654914390886931     
 Elapsed time is    1.29750502      

RBC_F90, ifort 18.0.0

$ ifort -xHost -Ofast RBC_F90.f90 -o irbc_f && ./irbc_f
 My check:  0.146549143908869     
 Elapsed time is   0.6529130    
$ ifort -prof-gen -xHost -Ofast RBC_F90.f90 -o irbc_f && ./irbc_f
 My check:  0.146549143908869     
 Elapsed time is   0.9747370    
$ ./irbc_f
 My check:  0.146549143908869     
 Elapsed time is   0.8385390    
$ ifort -prof-use -xHost -Ofast RBC_F90.f90 -o irbc_f && ./irbc_f
 My check:  0.146549143908869     
 Elapsed time is   0.6932410    
$ ./irbc_f
 My check:  0.146549143908869     
 Elapsed time is   0.6723280    

Intel was the clear winner for C++98 and Fortran, but lagged far behind for C++11 for some reason.

A rough hierarchy of results on this computer:
Intel’s compilers (minus C++11) > Julia > gcc > Numba > Intel’s C++11

Julia was equally fast whether built with OpenBLAS + openlibm or MKL + libimf, and in both cases roughly 50% faster than Numba. This is consistent with my results on the AMD computer.

However, Numba on my 3.60 GHz i7 was slower than on your 2.40 GHz i5.
FWIW, all my tests used Anaconda for python3.

2 Likes

Nice analysis!

I tried to test again, but the results on my Mac are still consistent with before. So I went ahead and tested it on Linux servers and found very different (opposite) results.

It’s documented here if anyone’s interested.

2 Likes

Interesting! The speed of Numba on Macs is impressive. There definitely seems to be something wrong with Anaconda on Linux.