Greetings, Julia programmers and enthusiasts. I am a long-time C/Fortran/MATLAB/Python/R programmer, an old-time Julia enthusiast, and a recent follower of Julia news and events. However, I have no more than a few hundred lines of programming experience in Julia. I stumbled upon this Medium post a few days ago, recounting the story of a team of researchers converting an old Fortran code to Julia and achieving a 3X speedup. I asked the author of this blog post, Erik Engheim, about this report, but he did not respond. So, I wanted to ask the broader Julia community if they have any information about the circumstances around this report. Erik’s report gives me the impression that Julia is, on average, 3X faster than equivalent Fortran code. If true, this is huge, and we should encourage everyone to port their C/Fortran codebases to Julia. But I suspect much of the speedup has been due to better algorithmic efficiency.
So, to keep scientific objectivity in research decision-making, I need more information about this particular performance report. I would appreciate your help if you can shed more light on this. Example questions of interest: Did the original Fortran authors also try to revise the Fortran code to match the (possibly) new algorithms used in the Julia version for this benchmark? Did the authors use some hardware accelerator to achieve the speedup? (that may seem naive, but I have read refereed research articles that compare the performance of a GPU-enabled Python code to the serial C/Fortran equivalent). Thank you for your help.
I’m not quite sure where or how you got that impression. In that post, @Erik_Engheim included an anecdote from Professor Alan Edelman:
> At the SC19, the International Conference for Supercomputing in 2019, one of the Julia creators Alan Edelman recounts how a group at the Massachusetts Institute of Technology (MIT) rewrote part of their Fortran climate model into Julia. They determined ahead of time that they would tolerate a 3x slowdown of their code. That was an acceptable tradeoff to get access to a high-level language with higher productivity in their view. Instead they got a 3x speed boost by going over to Julia.
I don’t think Erik or Alan are making any claims that the algorithms are exactly the same or that Julia has 3x better performance than Fortran on average. It’s one case study. In fact, I’d expect the algorithm isn’t exactly the same; given the context I’d bet they’re using smarter and more efficient DiffEq solvers than they were using in Fortran (but I have no inside knowledge here). Indeed, Erik’s post is pretty great because it explicitly highlights the fact that Julia Performance is NOT Magic.
We do have a suite of microbenchmarks that try very hard to ensure the same algorithm is measured across multiple languages. You can check those out here: Julia Micro-Benchmarks. In general, top-tier performant Julia should be on roughly the same order of magnitude as top-tier performant static languages. There are cases where we do better and some where we do worse (but the latter are often considered bugs if they’re reported).
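To illustrate what "same algorithm across languages" means in that suite: the benchmarks fix a deliberately naive algorithm and implement it identically everywhere, so the comparison measures the compiler/runtime rather than algorithmic cleverness. A minimal sketch in the spirit of that suite (not the actual harness):

```julia
# Deliberately naive recursive Fibonacci: every language in the
# benchmark suite implements this exact recursion, so differences
# in timing reflect the language implementation, not the algorithm.
fib(n) = n < 2 ? n : fib(n - 1) + fib(n - 2)

fib(20)  # 6765 in every conforming implementation
```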
> I don’t think Erik or Alan are making any claims that the algorithms are exactly the same
That is the point of my post here: to make such reports more accurate by presenting a complete picture with enough detail to avoid follow-up questions like mine, or wrong impressions. If someone wrote naive Julia code (as I would likely do) and reported a 3X speedup from porting it to Python without mentioning details of both sides, I am sure that would also raise a lot of questions for Julia programmers.
> We do have a suite of microbenchmarks that try very hard to ensure the same algorithm is measured across multiple languages. You can check those out here: Julia Micro-Benchmarks
Thanks for the link. I do not personally factor microbenchmarks into decision-making as long as the results are within a few percent of each other.
In my (admittedly) limited experience, Julia allows for code that can be as fast as C/Fortran code, but with complexity similar to higher-level languages such as Python. It also allows for easier SIMD and multithreading than I have experienced in other languages.
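For instance (my own minimal sketch, not taken from any of the posts above), turning on SIMD hints or threading in Julia is typically a one-line annotation on an ordinary loop:

```julia
# Serial reduction with an explicit SIMD hint; @simd and @inbounds
# are standard Base annotations that let the compiler vectorise.
function simd_sum(x)
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end

# Multithreaded element-wise loop: iterations are split across the
# available threads (start Julia with e.g. `julia -t 4` to benefit).
function threaded_square!(y, x)
    Threads.@threads for i in eachindex(x)
        y[i] = x[i]^2
    end
    return y
end

x = collect(1.0:4.0)
simd_sum(x)                      # 10.0
threaded_square!(similar(x), x)  # [1.0, 4.0, 9.0, 16.0]
```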
Obviously, it was a situation in which someone ported some low-level kernel from Fortran to Julia, discovered some additional optimizations along the way (possibly made easier by Julia), and ended up with a 3x speedup. This kind of thing is not unreasonable; it happens all the time if you have a language in which you can write high-performance code.
Moreover, that thread goes into several examples of ways in which Julia often makes it particularly easy to implement performance optimizations that are awkward (but of course not impossible) in Fortran.
Personally, I think having motivating anecdotes like this is almost more powerful than same-algorithm comparisons. It’s the real world. Sure, sometimes you are under constraints to do a bit-by-bit and branch-by-branch exact re-implementation in a new language, but that’s excruciating and only needed in the most exacting of circumstances. Most of the time, rewrites happen because it’s a living product that needs to grow and improve beyond the constraints of the existing system. These real-world anecdotes aren’t coming from folks trying to convince others to use Julia or slam Fortran; they’re just trying to get their work done faster and easier.
One of the best examples of this I’ve seen is the NY Fed’s DSGE project: MATLAB to Julia Transition: Estimation · DSGE.jl. Ignore the language; the part that I like about it is that they separate out portions that they did more verbatim (like the Kalman filter — 33% faster) vs parts that they wholly redesigned (like the highly important Metropolis Hastings sampling step — nearly 10x faster).
I’d say having a language that gives space for such a beneficial redesign to happen is just as important — if not more so — than the exact performance comparison. Of course your upper limit will always be dependent upon how much blood, sweat, and tears you’ve poured into your previous work.
I fully agree with your point. But in this particular case, I think we are talking about a FORTRAN 66 or FORTRAN 77 codebase (again, this is part of the question here: the details). A fair comparison in such a case would measure the development time in both languages, each written to its latest standard (in Fortran’s case, F2018), and not compare against a codebase that belongs to the era of punch cards and hardware that now exists only in museums. That is the missing part of this report that I am looking for.
To be honest, the expectation of getting a 5x-100x speedup compared to highly optimised NumPy code was overly optimistic, to say the least. They were disappointed to get only a 20% speedup with little Julia experience.
I wouldn’t expect large speedups in these cases. The 3x of the anecdote reported by Alan Edelman came as a surprise rather than an expectation: 3x was the accepted slowdown, in exchange for higher productivity.
I don’t think the author was expecting that magnitude of improvement, only testing to evaluate how much improvement would result from writing it in Julia.
I also don’t think it’s “overly optimistic” to think a 5x improvement is possible given that there are many examples of 5x+ improvements when replacing a system (in Julia or otherwise). Julia Computing itself quotes testimonials of Julia offering 100x and 1000x performance increases. Remember too that the comparison isn’t to a pure-Fortran application but to a Python-with-Numpy application.
I would be wary of falling into the trap of “Julia can get x-fold improvements and any proposed improvements larger than x are overly optimistic”.
No, at least not in the sense that one is doing O(n) and the other O(sqrt(n)).
And 100% not this
This is the issue. With C/assembly you will, theoretically, never be slower than any other language. But a *useful* comparison is one with similar code flow, where both versions are still idiomatic and flexible in a useful way within the language. When you write a for loop in C and a for loop in Python, nobody would say: “well, Python is not really slower than C because you can call a C for loop”. People are comparing “how would you write a for loop, normally, in language X”.
Similarly, if you write 3x more lines of code in C/Fortran, I’m sure you can micro-optimize it to surpass Julia’s performance for a moderately complicated application. But that’s not the point.
@shahmoradi I am curious how you want to use this information. For example, are you in the situation where you’re about to start a large new project and you want to know whether to do it in modern Fortran or Julia? If so, what aspects of the languages are included in your “fair comparison” criteria (idiomatic performance, ecosystem size, etc)?
I am going to throw my hat in the ring and I know I am going to say stupid things. Please correct me.
At a low level, the machine (CPU or GPU) is executing instructions, and these take the same time whatever the source language, since the machine ultimately runs assembly (machine) code. That is not completely true on modern CPUs, as the cores dynamically clock up and down (see my next post). Indeed, there has been a recent debate over on the Beowulf list, summarised as: is AVX512 worth it, since the CPU clocks down when AVX512 instructions are in use?
At a level up from that, better performance comes from the compiler - with Julia you are using the LLVM compiler. So any ‘dusty deck’ Fortran performance improvement may be coming from the compiler.
In my mind, what Julia does is let the compiler ‘shine’ or ‘go to town’ by having multiple dispatch.
Does Julia’s excellent type system help with compiler optimisation too?
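My (possibly naive) understanding is yes: because each method is compiled separately for the concrete argument types it is called with, the compiler can emit specialised machine code per type. A toy sketch of my own (not from any benchmark):

```julia
# One generic definition...
double(x) = x + x

# ...but Julia compiles a separate specialisation for each concrete
# argument type: a native integer add for Int, a native floating-point
# add for Float64, with no runtime type checks in the hot path.
double(21)    # 42
double(1.5)   # 3.0

# You can inspect the inferred, specialised code for each type with:
# @code_typed double(21)
# @code_typed double(1.5)
```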
(*) I just had a bizarre thought. Could some smart folks write code which would exploit the varying frequencies of a modern CPU? Maybe you run some parts of the code on a single core, as this will max out the frequency, and other parts multithreaded. I guess we actually do that in codes anyway - a serial part to prepare data, then multithreaded array processing.
Exploiting CPU frequency shifts would be completely non-deterministic. The CPU cores clock up and down due to thermal limits and these are subtly different per individual CPU.
I run HPL benchmarks, and I believe Intel admits there is a several-percent variation between individual CPUs - this is no secret and is something we have to take into account.
Running with my theme here: we often see microbenchmarks run on personal or cloud systems in this forum. On HPC clusters you see bigger effects influencing your performance - cooling, for instance. I saw one presentation where HPL performance on a cluster was increased by water cooling.
Also, the BIOS settings have a huge effect - you need to set the performance profile and accept that your CPU will use more energy.
Also, the AMD CPUs have ‘NUMA per Socket’ (NPS) settings - we recommend NPS=4 for Rome and Milan.
Regarding CPU frequency, one more tale: the Formula 1 teams have rules capping the amount of CPU power they can use for CFD studies, so they have to consider the CPU frequencies. See section 9.3.4
FWIW, I think the lack of memory bandwidth on CPUs is a much bigger problem for AVX512 than downclocking is. Using AVX512 well typically gets dramatically better flops than AVX(2), regardless of any downclocking.
That is conditional on memory bandwidth being sufficient, which for many workloads it isn’t - including the recent Julia vs Fortran and NumPy blog post involving complex exp (i.e., exp and sincos). Once optimized and using equal numbers of threads, that computation takes just as long as simply assigning 1 to every element.
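A rough way to see this for yourself (my own sketch using only Base timing; all numbers are machine-dependent, and the first run includes compilation time): once the transcendental math is fast enough, the element-wise complex-exp loop approaches the time of a plain fill, because both are limited by how fast memory can be written.

```julia
# Compare computing exp(i*x) element-wise against merely writing a
# constant to every element. On a bandwidth-limited machine the two
# timings converge once the math itself is no longer the bottleneck.
n = 1_000_000
x = rand(n)
y = similar(x, ComplexF64)

t_cis  = @elapsed (y .= cis.(x))         # cis(x) = cos(x) + i*sin(x)
t_fill = @elapsed fill!(y, 1.0 + 0.0im)  # pure memory writes
```

Run each timing a second time (or use a benchmarking package) to exclude compilation overhead before comparing.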
With the release of Apple’s M1 chip, maybe we need to update our knowledge about performance.
I’m wondering whether, in the near future, individual researchers who cannot afford HPC clusters can do some high-performance computation on a laptop. I think the M1 chip has some advantages over x86-platform GPU computation.
Since it is not a discrete GPU on PCIe and uses unified memory, there is no cost to copy data from host to device, which allows us to write code in a more dynamic way without losing performance.
This is only a small example; there should be more differences in what is fast and what is slow between x86 and advanced ARM chips, and there may be potential for ARM chips to achieve high-performance parallel computation.
Could this be the future?