Julia vs Fortran complaint

A member of NAG made the following comments, and I thought it might be of interest if someone from the Julia community more knowledgeable than me could respond. (The comments were made in response to a recent Julia Computing announcement on seed funding.)

I would definitely question those speed ups of 1000x, 225x and the 1.3 times faster than Fortran. I really think they are very misleading because they don’t give any detail. Nvidia used to massage some of their speed ups, which they have stopped doing after a flurry of complaints…
What I object to is misleading and false advertising. Julia calls a lot of Fortran linear algebra kernels via BLAS and LAPACK, so it’s contradictory to say that Julia is 1.3 times faster than Fortran. It is like saying Fortran is 1.3 times faster than Fortran. If people want to make such outlandish claims that something is 1000x faster, then they should provide the details of their methodology.

Outsider view:

I would point out that the “1.3 times faster than Fortran” isn’t a claim by Julia Computing, but a blurb from a bank that uses Julia, and is presumably referring to a specific use case.

The 1000x faster is also a quote, and explicitly refers to a specific piece of software.

I do think the way the quotes are used looks a bit strange, and that this paragraph:

Julia is lightning fast. Julia provides speed improvements up to 1,000x for insurance model estimation, 225x for parallel supercomputing image analysis and 11x for macroeconomic modeling.

comes off as a bit “fast and loose” and context-free. But the criticism from the NAG member misses the mark a bit; possibly the claims were read out of context.

The previous paragraph could be made more specific and credible if it said: “In real use cases Julia has provided speed improvements up to 1000x, etc.” or something like that. As it stands, it reads as if Julia routinely and reliably provides speedups of several orders of magnitude(!)

Performance comparisons are always problematic, because they are always far more complicated than they appear. To be meaningful, such comparisons must be ultra specific, which this is not. That said, I would agree that the statement about Fortran is probably misleading. Furthermore that paragraph strikes me as “business-speak” (also known to Neal Stephenson fans as “bullshit”) and all “business-speak” should be ignored. To be fair, it doesn’t appear to be coming from the Julia devs, and they are probably in a difficult position, as “business-speak” is usually a requisite for advertising to business people.

Yes, that snippet has been noted before, I think in GitHub issues and on Discourse. It does not read well to developers because it’s obviously missing details. It should really say that it’s from real experiences, and not some blanket statement about “speed”.

But there is truth in there that is being ignored by this comment. Sure, if someone wrote a very large piece of software in Fortran that is perfectly optimal, Julia probably can’t optimize it as well as a Fortran compiler can, because Julia allows pointers to alias (among other things), which can disable some compiler optimizations. However, without any pain, Julia will get you within 2x of Fortran. Using some macros to create little local environments that are aliasing-free, and things like that, Julia can probably get stupidly close to Fortran (within 1.2x) with minimal work (and probably 10x-100x less code…).

So Fortran is faster? Not so fast… the person who wrote the comment is biased by their experience building software in Fortran. Most people cannot write “optimal Fortran”. People point to BLAS and LAPACK as examples of Fortran, when BLAS and LAPACK are actually quite exceptional. Most Fortran code is not that optimal, and most Fortran code is a nightmare to use, understand, and contribute to. The suboptimal algorithms that come out of having a larger code base more than make up the difference, especially when you factor in that most mathematicians and scientists who are trying to write these things are not software engineers. “Software bloat” and maintenance issues quickly come up in this area.

But there is a secondary effect. In my field, numerical differential equations, the algorithm trumps optimizations. Sure, you want to get your implementation fast, but having the right algorithm makes a huge difference. For example, the newest methods for SDEs that I recently published get about 100x over the “simple standard methods” on easy problems, and I give a real-world example (a problem which someone was trying to solve in my lab) where the adaptivity algorithm gives about a 1e6 speedup. That’s huge! And no 1.2x matters at that point. The same holds even in the oldest part of the field (ODEs), where there is still new research: using things like Verner’s newest algorithms with lazy adding of extra interpolation stages, I have found that these high-order RK methods can improve things by something like 10x-100x. However, since it takes so much more code in Fortran, it’s harder to maintain and update, which makes the codebase less agile and less able to add every new algorithm that’s available (the proof is very visible: no one has added these Verner algorithms to Fortran code even though they’ve been available for years, while DifferentialEquations.jl was a year-long side project (turned somewhat “main project” of course :smile: )). And that’s to the detriment of performance because, again, the gains from using specialized and improved mathematical algorithms are orders of magnitude larger than any Julia vs Fortran difference.

So sure, someone from a highly trained group of experienced Fortran/C++ programmers may look at this and say “but I can make it faster”, but I look at this and go “sure, but almost no one else can, almost no one else can help maintain it, and when researchers want to add new algorithms, it will be a nightmare to extend”. There’s an engineering tradeoff, and for me Julia is at a very optimal point for people who don’t want to devote their whole lives to software engineering for an extra 20% of speed, and instead want to work on new algorithms and share their advances as usable software.

I think Chris’ point about real business use case reports is the most germane, although they don’t always refer to legacy Fortran.

Once again, Chris hits the nail on the head.
Another thing, Fortran might be fine if you only need 64-bit binary floats, but Julia can let you use so much more (ArbFloats, anybody?)

I know nothing about the source of the quote in question, but let me give you a real-world example in which I’ve often found that Julia code can be faster than comparable algorithms in production-quality Fortran: special-function implementations.

For example, I implemented an erfinv function in Julia (add inverse error functions erfinv and erfcinv by stevengj · Pull Request #2987 · JuliaLang/julia · GitHub), and it was about 3x faster than Matlab or SciPy’s erfinv function, both of which are taken from standard Fortran libraries. (This is benchmarking single-threaded vectorized calls on large arrays where the Matlab/Python overhead should be negligible.) The underlying algorithm is similar to those used in the Fortran routines (in Matlab’s case this is only a guess), because almost everyone uses the same rational-function approximations published in the 1970s.

I have found similar gains (compared to Fortran code called in SciPy) for other special functions, e.g. polygamma functions (RFC: add complex polygamma and Hurwitz zeta functions by stevengj · Pull Request #7125 · JuliaLang/julia · GitHub) and exponential integrals (exponential integral (Ei, E₁, Eₙ...) function · Issue #19 · JuliaMath/SpecialFunctions.jl · GitHub).

The reason Julia can beat the Fortran code is that metaprogramming makes it easy to apply performance optimizations that are awkward in Fortran. We have metaprogramming macros (@evalpoly) that can easily inline polynomial evaluations, whereas the Fortran code makes function calls that loop over look-up tables of polynomial coefficients. Even greater speedups are possible for evaluating polynomials of complex arguments, where there is a fancy recurrence from Knuth that is almost impossible to use effectively without code generation. In principle, the Fortran authors could have done the same inlining and gotten similar performance, but the code would have been much more painful to write by hand. (They could even have written a program to generate Fortran code, but that is even more painful.)
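
A minimal sketch of the contrast described above, under stated assumptions: the coefficients are made up for illustration and are not taken from any special-function implementation, and `p_inline`, `p_table`, and `COEFFS` are hypothetical names. `@evalpoly` expands into a fixed chain of multiply-adds (Horner's rule for real arguments), while the table version loops over stored coefficients at run time, which is roughly the shape of the look-up-table approach mentioned for the Fortran code.

```julia
# Hypothetical polynomial, for illustration only: p(x) = 1 + 2x + 3x^2
# @evalpoly takes coefficients in ascending order of power and
# unrolls the evaluation into inline code at macro-expansion time.
p_inline(x) = @evalpoly(x, 1.0, 2.0, 3.0)

# Table-driven version: same polynomial, coefficients read in a loop.
const COEFFS = [1.0, 2.0, 3.0]

function p_table(x)
    acc = zero(x)
    # Horner's rule, walking the table from the highest power down
    for i in length(COEFFS):-1:1
        acc = acc * x + COEFFS[i]
    end
    return acc
end

println(p_inline(2.0), " ", p_table(2.0))
```

Both functions compute the same value; any speed difference depends on the compiler and the polynomial, so it would need to be measured (for example with BenchmarkTools.jl) rather than assumed.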

For what it’s worth, I think that the explanation you just gave is infinitely preferable to what appears in that post.

I don’t think the original quote was that problematic if read in a reasonable way. Obviously, it was a situation in which someone ported some low-level kernel from Fortran to Julia, discovered some additional optimizations along the way (possibly made easier by Julia), and ended up with a 30% speedup. This kind of thing is not unreasonable — it happens all the time if you have a language in which you can write high-performance code. Whereas in Matlab/R/Python it is simply not possible to port low-level kernels from Fortran and get speedups unless you find truly fantastic algorithmic improvements [like going from O(n²) to O(n)] or find a new library routine (usually written in C or Fortran or similar!) that is perfectly suited to your problem. The fact that it is possible to do meaningful performance optimization of low-level kernels in Julia is the point here.

The NAG comment that it is impossible for Julia code to be faster than Fortran because they are both calling the same LAPACK/BLAS is just not reasonable, in my opinion. Dense linear algebra in every language has basically the same performance for this reason — obviously, this sort of code is not what any of the quotes was referring to.

It could also be a case where Julia was able to figure out optimizations based on specializing on types, that didn’t happen automatically in the original language (Fortran, or in my case, C).
The moment that I really fell in love with the Julia language was when I saw it take some string handling code and optimize it (as I might have done by hand in C, writing 3 or 4 separate functions), based on whether the type passed to my generic function was ASCIIString, UTF8String, UTF16String or UTF32String.
That was an example of saving both human time (mine) and run-time, which I feel is Julia’s forte.
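
A minimal sketch of that kind of type specialization. The string types named above come from an older Julia version, so this sketch assumes `String` and `SubString{String}` as stand-ins, and `count_spaces` is a hypothetical function, not the code referred to in the post. One generic method is written once, and the compiler generates a separate specialized version for each concrete argument type it is called with.

```julia
# One generic definition; no per-type copies written by hand.
function count_spaces(s::AbstractString)
    n = 0
    for c in s
        n += (c == ' ')   # Bool is added as 0 or 1
    end
    return n
end

full = "a b c"                 # a String
part = SubString(full, 1, 3)   # a SubString{String} viewing "a b"

# Each call below is compiled for the concrete type of its argument.
println(count_spaces(full))
println(count_spaces(part))
```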

Also, the BLAS and LAPACK kernels that Julia uses are, in the important cases, not the reference BLAS written in Fortran but rather the multi-threaded BLAS/LAPACK in OpenBLAS and MKL. Indeed the big problem with reference BLAS and LAPACK is that they need to be written in Fortran 77.

I can say from experience that the biggest problem in porting R, and before that S, to new architectures was often the need to find a freely available Fortran compiler that was compatible with the local C compiler so that the BLAS/LAPACK code could be compiled.

I would encourage people interested in this topic to attend the Celeste keynote at JuliaCon on Thursday.
I’ll be talking a bit about how julia enables really high performance applications. I think there are two major
aspects to performance here that are easily missed:

  • How efficient is the code that it generates for your target architecture

This is of course an on-going battle, but julia is getting VERY good at generating high-performance native code
for well-typed julia code (and there’s a number of changes yet in the pipeline). A lot of people get hung
up on this point, because for many languages (especially dynamic ones), this is the major point of
differentiation with high-performance languages (C/C++/Fortran, etc). The high performance languages
generate good code, the low performance ones don’t, end of story. For julia however, that’s not the
concern. Once your julia code is well-typed, the code that comes out is essentially the same
that would come out of a C/C++/Fortran compiler for the same algorithm (unsurprisingly of course,
because everyone uses LLVM nowadays). Nevertheless, that’s not the end of the performance story.

  • Data Layout, Cache Efficiency, etc.

Once the code generation story is settled, there’s still a significant amount of performance to be had
by optimizing for the hardware. Modern architectures are amazingly complicated and accommodating
their quirks (esp. with respect to cache efficiency and vectorization) is absolutely required for high performance
(10-100x performance increases are possible here). People who do HPC in C/C++/Fortran know this
of course, and HPC apps written in those languages get heavily optimized with that in mind. However,
as Celeste demonstrates, the same is absolutely possible to do in Julia (and necessary for high performance).
The only remaining question then is how difficult this is to do. Personally, I find it orders of magnitude easier
to do in Julia. Static arrays, SoA transformations and even more fancy program transformations (some
of which I’ll talk about at the Celeste keynote), are essentially one line changes in julia - but significantly
harder in other languages.
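
A minimal sketch of what an SoA (struct-of-arrays) transformation means, in plain Julia with no packages. The `Particle` type and the helper names are hypothetical, and the post does not say which tools were used in Celeste, so this only illustrates the data-layout idea: the same values stored either interleaved per element or contiguously per field.

```julia
# Hypothetical element type, for illustration only.
struct Particle
    x::Float64
    y::Float64
end

# Array of structs (AoS): x and y are interleaved in memory.
aos = [Particle(i, 2i) for i in 1:4]

# Struct of arrays (SoA): each field lives in its own contiguous vector,
# which tends to suit cache use and vectorization when only one field is read.
soa = (x = [p.x for p in aos], y = [p.y for p in aos])

sum_x_aos(ps) = sum(p -> p.x, ps)   # strides over the unused y values
sum_x_soa(s)  = sum(s.x)            # reads only the x vector

println(sum_x_aos(aos), " ", sum_x_soa(soa))
```

Whether the SoA layout is actually faster depends on the access pattern and the hardware, so it is worth benchmarking on the real workload.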

To summarize, once you get into performance comparisons of HPC apps, whether it’s written in Julia, C, C++ or Fortran
doesn’t really matter for the performance. What matters is how well the app takes advantage of the hardware.
I’d argue that’s easier to do in julia. I would absolutely not be surprised to see a carefully optimized julia application
outperform unoptimized Fortran by 100x (and the reverse is also true of course, but for some reason, people
have an easier time believing that). The reason julia calls out to C and Fortran libraries (including BLAS), is that people
there have done the hard work of taking optimal advantage of the hardware, not because those languages generate better
code. In the future we may see people calling out to julia libraries from other languages.

As for the performance claims in the blog post, I understand where the uneasiness with those quotes comes from. I’ve
had the same discussion with our marketing folks. However, they are real quotes from our actual customers
that have replaced their legacy code bases with julia implementations. That’s of course not a “language speed
benchmark” and part of the improvement certainly comes from redoing the implementation and incorporating
learnings from the legacy solutions. So I’d take it for what it is: testimonials from some
people who’ve switched to julia and saw huge benefits. I think we’d be remiss not to advertise that they
were able to do that.

I think changing “Julia provides” to “Julia has provided” fixes most of the problem here. The quotes from customers are of course valuable.

Sharing another experience. Cuba.jl is a package for numerical integration, a wrapper around the C library Cuba. Here is a comparison between runtimes of Julia, C, and Fortran programs calling the same library and integrating the same 11-element vectorial function in three dimensions with four different algorithms (Vegas, Suave, Divonne, Cuhre):

INFO: Performance of Cuba.jl:
  0.271304 seconds (Vegas)
  0.579783 seconds (Suave)
  0.329504 seconds (Divonne)
  0.238852 seconds (Cuhre)
INFO: Performance of Cuba Library in C:
  0.319799 seconds (Vegas)
  0.619774 seconds (Suave)
  0.340317 seconds (Divonne)
  0.266906 seconds (Cuhre)
INFO: Performance of Cuba Library in Fortran:
  0.272000 seconds (Vegas)
  0.584000 seconds (Suave)
  0.308000 seconds (Divonne)
  0.232000 seconds (Cuhre)

The Julia program is always faster than the C one, and in two cases (Vegas and Suave) slightly faster than the Fortran one; in the other two it is within about 7% of it. The difference between the three programs is the computation of the integrand function, which is passed to the same library. In addition, the Julia program has fewer than 50 lines of code (and could have been as few as 40), the C program about 100 lines of code, and the Fortran program about 130 lines of code.

To summarize: with less than half the lines of code, in Julia you have similar or better performance than programs written in C and Fortran ultimately calling the same library for numerical integration.
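
The comparison can be checked directly from the timings listed above. This is a small sketch that only re-uses those numbers; the variable names are hypothetical.

```julia
# Timings in seconds, copied from the benchmark output above.
algs    = ["Vegas", "Suave", "Divonne", "Cuhre"]
julia   = [0.271304, 0.579783, 0.329504, 0.238852]
c       = [0.319799, 0.619774, 0.340317, 0.266906]
fortran = [0.272000, 0.584000, 0.308000, 0.232000]

# Ratio below 1 means the Julia program was faster for that algorithm.
for (i, a) in enumerate(algs)
    println(rpad(a, 8),
            " Julia/C = ", round(julia[i] / c[i], digits = 3),
            "  Julia/Fortran = ", round(julia[i] / fortran[i], digits = 3))
end

julia_beats_c       = all(julia .< c)          # faster than C in every case?
julia_beats_fortran = count(julia .< fortran)  # number of cases faster than Fortran
println(julia_beats_c, " ", julia_beats_fortran)
```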

Unfortunately, Cuba.jl cannot take advantage of the parallelization capability of the Cuba library because Julia doesn’t support fork. It would be cool to see how Cuba.jl performs in that case.

OT: Cubature.jl (similar to the Cuhre algorithm in Cuba) can give you a vector of points at which to evaluate your integrand function, instead of just one point at a time, so you could farm these out to parallel processes in Julia (with a work queue implemented in e.g. Julia RPC or MPI) or use threads etc.

Cuba also supports vectorization (see Welcome to Read the Docs — Cuba.jl stable documentation), in addition to parallelization. Your suggestion is interesting, thanks!

Slightly OT too: NIntegration.jl (which I started writing recently, and is still a WIP) implements the same algorithm as Cuba and Cubature (Cuhre) in pure Julia and is 2x faster than the Cuba version (in 3D).

It can also give you the weights and points if you want to use them in a parallel process.

This is a really interesting discussion, and I’m looking forward to the upcoming talks getting posted on YouTube so I can watch them and later use them to proselytize my friends :wink:.

I’m curious, does any of this have to do with the fact that Julia’s compiler is basically a static compiler and does not rely on run-time optimizations? I know that most other JITs rely on run-time optimizations in some way, but I’m assuming there are fundamental limitations on how much benefit they can provide. Also, can the things being said about Julia generating efficient, platform-specific code also be said of, for instance, Rust and Go?

Compared to, e.g., Python and R, one key difference is that well-written Julia code makes it possible to have good static compilation without losing genericity. Other dynamic languages have no choice but to do some kind of tracing JIT (unless you add type hints that make the code much less generic, à la Cython), because they need additional runtime information about which types are present in order to produce reasonable compiled code, and tracing JITs inherently make it difficult to predictably get good performance.
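
A minimal sketch of "static compilation without losing genericity". The function name `total` is hypothetical. It carries no type annotations, yet each call is compiled for the concrete argument type it receives.

```julia
# Generic: works for any iterable collection whose elements support + and zero.
function total(xs)
    acc = zero(eltype(xs))   # accumulator has the element type, so the loop is type-stable
    for x in xs
        acc += x
    end
    return acc
end

# The same source is specialized separately for each element type.
println(total([1, 2, 3]))        # Vector{Int}
println(total([1.5, 2.5]))       # Vector{Float64}
println(total([1//2, 1//3]))     # Vector{Rational{Int}}

# In the REPL, `@code_typed total([1.5, 2.5])` shows the Float64-specialized code.
```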

Sure. But those are statically typed languages that are poorly suited to interactive computation. Lots of static languages have good compilers.
