Julia vs Fortran complaint

I know nothing about the source of the quote in question, but let me give you a real-world example in which I’ve often found that Julia code can be faster than comparable algorithms in production-quality Fortran: special-function implementations.

For example, I implemented an erfinv function in Julia (https://github.com/JuliaLang/julia/pull/2987), and it was about 3x faster than Matlab’s or SciPy’s erfinv function, both of which are taken from standard Fortran libraries. (This is benchmarking single-threaded vectorized calls on large arrays, where the Matlab/Python overhead should be negligible.) The underlying algorithm is similar to those used in the Fortran routines (in Matlab’s case this is only a guess), because almost everyone uses the same rational-function approximations published in the 1970s.

I have found similar gains (compared to Fortran code called in SciPy) for other special functions, e.g. polygamma functions (https://github.com/JuliaLang/julia/pull/7125) and exponential integrals (https://github.com/JuliaMath/SpecialFunctions.jl/issues/19).

The reason Julia can beat the Fortran code is that metaprogramming makes it easy to apply performance optimizations that are awkward in Fortran. We have metaprogramming macros (@evalpoly) that can easily inline polynomial evaluations, whereas the Fortran code makes function calls that loop over look-up tables of polynomial coefficients. Even greater speedups are possible for evaluating polynomials of complex arguments, where there is a fancy recurrence from Knuth that is almost impossible to use effectively without code generation. In principle, the Fortran authors could have done the same inlining and gotten similar performance, but the code would have been much more painful to write by hand. (They could even have written a program to generate Fortran code, but that is even more painful.)
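For the curious, here is a rough sketch (in Python, not the actual Julia @evalpoly code) of the kind of Knuth recurrence being described: for a polynomial with real coefficients evaluated at a complex z, you divide synthetically by z² − 2·Re(z)·z + |z|², so the loop body uses only real arithmetic and a single complex multiply remains at the end.

```python
# Sketch (not the actual Julia implementation): evaluating a polynomial with
# real coefficients at a complex point z.  Plain Horner needs one complex
# multiply per term (4 real multiplies each); the Knuth scheme (TAOCP
# vol. 2, sec. 4.6.4) divides synthetically by z**2 - 2*Re(z)*z + |z|**2,
# so the loop body uses only real arithmetic.

def horner(coeffs, z):
    """coeffs[k] is the coefficient of z**k."""
    acc = 0.0
    for a in reversed(coeffs):
        acc = acc * z + a
    return acc

def knuth_complex(coeffs, z):
    """Same polynomial, real coeffs, complex z; the loop is all-real."""
    n = len(coeffs) - 1
    if n < 2:
        return horner(coeffs, z)
    r = 2.0 * z.real                 # z satisfies z**2 = r*z - s ...
    s = z.real ** 2 + z.imag ** 2    # ... with these real r and s
    u, v = coeffs[n], coeffs[n - 1]
    for k in range(n - 2, -1, -1):
        u, v = v + r * u, coeffs[k] - s * u
    return u * z + v                 # one complex multiply at the very end

coeffs = [1.0, -3.0, 0.5, 2.0, 4.0]  # 1 - 3z + 0.5z**2 + 2z**3 + 4z**4
z = 0.7 + 1.3j
assert abs(horner(coeffs, z) - knuth_complex(coeffs, z)) < 1e-9
```

To exploit this effectively you want the loop fully unrolled for a fixed coefficient table, which is exactly what code generation buys you and what is painful to write by hand.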


For what it’s worth, I think that the explanation you just gave is infinitely preferable to what appears in that post.

I don’t think the original quote was that problematic if read in a reasonable way. Obviously, it was a situation in which someone ported some low-level kernel from Fortran to Julia, discovered some additional optimizations along the way (possibly made easier by Julia), and ended up with a 30% speedup. This kind of thing is not unreasonable — it happens all the time if you have a language in which you can write high-performance code. Whereas in Matlab/R/Python it is simply not possible to port low-level kernels from Fortran and get speedups unless you find truly fantastic algorithmic improvements [like going from O(n²) to O(n)] or find a new library routine (usually written in C or Fortran or similar!) that is perfectly suited to your problem. The fact that it is possible to do meaningful performance optimization of low-level kernels in Julia is the point here.

The NAG comment that it is impossible for Julia code to be faster than Fortran because they are both calling the same LAPACK/BLAS is just not reasonable, in my opinion. Dense linear algebra in every language has basically the same performance for this reason — obviously, this sort of code is not what any of the quotes was referring to.


It could also be a case where Julia was able to apply optimizations based on type specialization that didn’t happen automatically in the original language (Fortran, or in my case, C).
The moment that I really fell in love with the Julia language was when I saw it take some string handling code and optimize it (as I might have done by hand in C, writing 3 or 4 separate functions), based on whether the type passed to my generic function was ASCIIString, UTF8String, UTF16String or UTF32String.
That was an example of saving both human time (mine) and run time, which I feel is Julia’s forte.


Also, the BLAS and LAPACK kernels that Julia uses are, in the important cases, not the reference BLAS written in Fortran but rather the multi-threaded BLAS/LAPACK in OpenBLAS and MKL. Indeed the big problem with reference BLAS and LAPACK is that they need to be written in Fortran 77.

I can say from experience that the biggest problem in porting R, and before that S, to new architectures was often the need to find a freely available Fortran compiler that was compatible with the local C compiler so that the BLAS/LAPACK code could be compiled.

I would encourage people interested in this topic to attend the Celeste keynote at JuliaCon on Thursday.
I’ll be talking a bit about how julia enables really high-performance applications. I think there are two major
aspects to performance here that are easily missed:

  • How efficient is the code that it generates for your target architecture

This is of course an on-going battle, but julia is getting VERY good at generating high-performance native code
for well-typed julia code (and there’s a number of changes yet in the pipeline). A lot of people get hung
up on this point, because for many languages (especially dynamic ones), this is the major point of
differentiation with high-performance languages (C/C++/Fortran, etc.). The high-performance languages
generate good code, the low-performance ones don’t, end of story. For julia, however, that’s not the
concern. Once your julia code is well-typed, the code that comes out is essentially the same
that would come out of a C/C++/Fortran compiler for the same algorithm (unsurprisingly of course,
because everyone uses LLVM nowadays). Nevertheless, that’s not the end of the performance story.

  • Data Layout, Cache Efficiency, etc.

Once the code generation story is settled, there’s still a significant amount of performance to be had
by optimizing for the hardware. Modern architectures are amazingly complicated, and accommodating
their quirks (esp. with respect to cache efficiency and vectorization) is absolutely required for high performance
(10-100x performance increases are possible here). People who do HPC in C/C++/Fortran know this
of course, and HPC apps written in those languages get heavily optimized with that in mind. However,
as Celeste demonstrates, the same is absolutely possible to do in Julia (and necessary for high performance).
The only remaining question then is how difficult this is to do. Personally, I find it orders of magnitude easier
to do in Julia. Static arrays, SoA transformations, and even fancier program transformations (some
of which I’ll talk about at the Celeste keynote) are essentially one-line changes in julia, but significantly
harder in other languages.
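To make the SoA point concrete, here is a toy illustration of the two layouts (in Python rather than Julia; in Julia, packages such as StructArrays.jl make this switch roughly a one-line change):

```python
# Array-of-structs vs. struct-of-arrays, in miniature.
from array import array

# Array of structs: each particle is its own object, and the same field for
# different particles is scattered across the heap, so a reduction over one
# field strides through unrelated data.
aos = [{"x": float(i), "y": 2.0 * i, "m": 1.0} for i in range(4)]
total_x_aos = sum(p["x"] for p in aos)

# Struct of arrays: one contiguous buffer per field, so a reduction over a
# field is a linear scan, which is what caches and SIMD units want.
soa = {
    "x": array("d", (float(i) for i in range(4))),
    "y": array("d", (2.0 * i for i in range(4))),
    "m": array("d", (1.0 for _ in range(4))),
}
total_x_soa = sum(soa["x"])

assert total_x_aos == total_x_soa == 6.0
```

The point is not the Python code itself but that both layouts expose the same element-wise interface, so a compiler (or a library) can swap one for the other without touching the algorithm.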

To summarize, once you get into performance comparisons of HPC apps, whether it’s written in Julia, C, C++ or Fortran
doesn’t really matter for the performance. What matters is how well the app takes advantage of the hardware.
I’d argue that’s easier to do in julia. I would absolutely not be surprised to see a carefully optimized julia application
outperform unoptimized Fortran by 100x (and the reverse is also true of course, but for some reason, people
have an easier time believing that). The reason julia calls out to C and Fortran libraries (including BLAS), is that people
there have done the hard work of taking optimal advantage of the hardware, not because those languages generate better
code. In the future we may see people calling out to julia libraries from other languages.


As for the performance claims in the blog post: I understand where the uneasiness with those quotes comes from. I’ve
had the same discussion with our marketing folks. However, they are real quotes from actual customers
who have replaced their legacy code bases with julia implementations. That’s of course not a “language speed
benchmark” and part of the improvement certainly comes from redoing the implementation and incorporating
learnings from the legacy solutions. So I’d take it for what it is. It’s testimonials from some
people who’ve switched to julia and saw huge benefits. I think we’d be remiss not to advertise that they
were able to do that.


I think changing “Julia provides” to “Julia has provided” fixes most of the problem here. The quotes from customers are of course valuable.


Sharing another experience. Cuba.jl is a package for numerical integration, a wrapper around the C library Cuba. Here is a comparison between the runtimes of Julia, C, and Fortran programs calling the same library and integrating the same 11-component vector function in three dimensions with four different algorithms (Vegas, Suave, Divonne, Cuhre):

INFO: Performance of Cuba.jl:
  0.271304 seconds (Vegas)
  0.579783 seconds (Suave)
  0.329504 seconds (Divonne)
  0.238852 seconds (Cuhre)
INFO: Performance of Cuba Library in C:
  0.319799 seconds (Vegas)
  0.619774 seconds (Suave)
  0.340317 seconds (Divonne)
  0.266906 seconds (Cuhre)
INFO: Performance of Cuba Library in Fortran:
  0.272000 seconds (Vegas)
  0.584000 seconds (Suave)
  0.308000 seconds (Divonne)
  0.232000 seconds (Cuhre)

The Julia program is always faster than the C one, and in two cases faster than the Fortran one (in one case very close to it). The difference between the three programs is the computation of the integrand function, which is passed to the same library. In addition, the Julia program has less than 50 lines of code (and could have been even 40), the C program about 100 lines of code, and the Fortran program about 130 lines of code.

To summarize: with less than half the lines of code, in Julia you have similar or better performance than programs written in C and Fortran ultimately calling the same library for numerical integration.

Unfortunately, Cuba.jl cannot take advantage of the parallelization capability of the Cuba library, because Julia doesn’t support fork. It would be cool to see how Cuba.jl performs in that case.


OT: Cubature.jl (similar to the Cuhre algorithm in Cuba) can give you a vector of points at which to evaluate your integrand function, instead of just one point at a time, so you could farm these out to parallel processes in Julia (with a work queue implemented in e.g. Julia RPC or MPI) or use threads etc.
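A minimal sketch of that idea (in Python, with a toy integrand; the function names here are made up for illustration and are not the Cubature.jl API): the library hands back a whole batch of points, and you split the batch across a pool of workers.

```python
# Batch ("vectorized") integrand evaluation farmed out to worker threads.
from concurrent.futures import ThreadPoolExecutor

def integrand(point):
    x, y = point
    return x * x + y * y          # toy integrand

def evaluate_batch(points, nworkers=4):
    """Evaluate the integrand at every point, splitting the batch into
    chunks and handing each chunk to a worker."""
    chunk = max(1, len(points) // nworkers)
    chunks = [points[i:i + chunk] for i in range(0, len(points), chunk)]
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        results = list(pool.map(lambda ch: [integrand(p) for p in ch], chunks))
    # Flatten the per-chunk results back into one list, in order.
    return [v for ch in results for v in ch]

pts = [(0.1 * i, 0.2 * i) for i in range(10)]
assert evaluate_batch(pts) == [integrand(p) for p in pts]
```

The same pattern works with processes or MPI ranks instead of threads; the integration routine only ever sees one flat array of values come back.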


Also, Cuba supports vectorization (see https://cubajl.readthedocs.io/en/stable/#vectorization) in addition to parallelization. Your suggestion is interesting, thanks!

Slightly OT too: NIntegration.jl (which I started writing recently, and is still a WIP) implements the same algorithm as Cuba and Cubature (Cuhre) in pure Julia and is 2x faster than the Cuba version (in 3D).

It can also give you the weights and points if you want to use them in a parallel process.


This is a really interesting discussion, and I’m looking forward to the upcoming talks getting posted on YouTube so I can watch them and later use them to proselytize my friends :wink:.

I’m curious, does any of this have to do with the fact that Julia’s compiler is basically a static compiler and does not rely on run-time optimizations? I know that most other JIT’s rely on run-time optimizations in some way but I’m assuming there are fundamental limitations on how much benefit they can provide. Also, are the things that are being said of Julia generating efficient, platform-specific code things that can also be said of, for instance, Rust and Go?

Compared to, e.g., Python and R, one key difference is that well-written Julia code makes it possible to have good static compilation without losing genericity. Other dynamic languages have no choice but to do some kind of tracing JIT (unless you add type hints that make the code much less generic, à la Cython), because they need additional runtime information about which types are present in order to produce reasonable compiled code, and tracing JITs inherently make it difficult to predictably get good performance.

Sure. But those are statically typed languages that are poorly suited to interactive computation. Lots of static languages have good compilers.


I certainly see that this would be the case with Python and R since they were designed as interpreted languages and give a would-be compiler few clues as to what it’s actually expected to do. I guess my question is, aren’t most of the things that are being said about performance in this thread things that can equally apply to Java or Scala?

A year ago, when I would tell people I use Julia the first question they’d ask me is “Why not use Python?” and the second question they’d ask me is “Why not use Scala?” Of course I’m sold on Julia because I like the language design much better than Scala, but I’m wondering if the performance advantages are anything more than incidental. (Interestingly, Scala use seems to have fallen off a cliff and it seems to be kept alive mainly by Spark lately.)



Like Go and Rust, those are also statically typed languages. Everyone knows that you can get good performance from any decent static language. But static languages aren’t well suited to interactive usage/exploration. That’s why the usual points of comparison for Julia are other dynamic languages like Matlab, Python, and R.


For what it’s worth, https://github.com/ParRes/Kernels has ports of interesting HPC kernels to Julia, Python, Rust, Octave/Matlab, C++ (+many threading models), Fortran 2008 (+multiple parallel models), C89-ish (+many parallel models, both shared and distributed), and Chapel (in a branch at the moment).

I found that Julia was significantly better than Python for wavefront parallelism, because unlike independent data parallelism, Numpy has no intrinsic for this (that I know of). All other performance comparisons are left as an exercise for the reader.
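For readers unfamiliar with the term, here is a toy Python version of the kind of wavefront stencil in the PRKs: each cell depends on its north and west neighbors, so the cells along an anti-diagonal are mutually independent and can be computed in parallel, but there is no single NumPy primitive that expresses the sweep as a whole.

```python
# Wavefront (pipeline) dependence pattern: g[i][j] needs g[i-1][j],
# g[i][j-1], and g[i-1][j-1], so the grid must be swept anti-diagonal by
# anti-diagonal; within one anti-diagonal, all cells are independent.

def wavefront(n):
    g = [[0.0] * n for _ in range(n)]
    g[0] = [1.0] * n              # boundary: first row ...
    for row in g:
        row[0] = 1.0              # ... and first column are fixed
    for d in range(2, 2 * n - 1):  # sweep anti-diagonals i + j = d in order
        for i in range(max(1, d - n + 1), min(d, n)):
            j = d - i
            if j < 1 or j >= n:
                continue
            # every (i, j) with i + j == d is independent of the others
            g[i][j] = g[i - 1][j] + g[i][j - 1] - g[i - 1][j - 1]
    return g

g = wavefront(4)
assert g[3][3] == 1.0  # with these boundary values the interior stays 1.0
```

A data-parallel framework that only offers whole-array operations has no natural way to express the inner loop running in parallel while the outer loop stays sequential, which is (I believe) why this kernel separates the languages so sharply.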

Most of the shared-memory ports of the PRKs took me a few hours, so you should have no trouble implementing new languages that you want to compare to Julia (if you create these, please submit a pull request).


I believe that nowadays there is no reason to talk about such tremendous speedups without taking software and hardware details into account. On the other hand, one of the Julia authors’ goals, to be only 10–20% slower than the compiled workhorses on most common numerical calculations in practice, seems quite feasible.

I implemented a pure-Julia arbitrary-dimensional version of the Genz-Malik algorithm at:

It should be fully functional, and a big advantage of being pure Julia is that it is type-generic. (e.g. your integrand function can be Float32, Complex{BigFloat}, Matrix{Float64}, etcetera and it will all work.)