I made a simple benchmark to try and convince colleagues to switch to Julia. It might be of interest to people, so here it is : https://github.com/antoine-levitt/benchmark_heat. It’s the most simple PDE code I could cook up (1D, explicit euler, finite differences), and it’s designed to highlight differences between languages for simple kernels, and in particular loops vs vectorized code. I compare C (gcc and clang with various flags), julia (loops and broadcast, different annotations/versions), and numpy/scilab/octave/matlab (loops and vector)
Some highlights of my very rudimentary benchmarking (best of 3 on my laptop):
There’s still a penalty for writing vector code in julia (doesn’t do SIMD?)
Julia vectorized is the fastest of any vectorized versions, and in particular has a factor of three speedup between 0.6 and 0.7!
Julia loops beats everybody, even the C version when compiled with clang (which I don’t understand)
Loops in matlab have gotten fast in the latest versions, to the point where they even beat the vectorized version. That must have been huge work, kudos to them!
Yup, these numbers are quite similar to the ones I’ve been seeing after writing tons of these things . You even get the Julia beating C part as well. I’ll have to start sharing this.
Not sure how to apply simd to a broadcast, isn’t it supposed to be before a loop?
@ChrisRackauckas how do you explain beating C? I don’t see what julia knows that C doesn’t here. Could it be that clang as a front-end does not produce as good a llvm IR as julia? gcc was about the same as julia in my example.
Here? No idea. Basically, our benchmarks routinely show we beat a lot of the C/Fortran methods by about 8x-10x. In this post:
I explain how most of that is explained by late binding (compiling the solver with the ODE function in that. A package which provides a solver can’t do this). However, there’s still this curious ~2x I don’t understand. In your example, you essentially do the “late binding” by writing it in there, but you also re-create the unexplained difference (at least when compiling with Clang). I said the numbers lined up with what I’ve seen (vs MATLAB/Python/etc as well), but there are some pieces of it that I don’t quite understand why they exist yet.
I am a bit confused: there is only one source file for each language. How did you construct the table? Shouldn’t there be multiple versions (vectorized, loops)?
If I may make a suggestion: generating multiple versions iin separate files might be preferable if it is to be generally useful to the readers of the discourse. Right now, at least for some of the codes, it is ambiguous how to derive any particular version. What to leave out and what to keep is obviously going to affect the timing (and possibly the correctness of the result as well).
I just made this to figure out roughly what the timings were, uploaded this to github to save it in case I want to look at it in a year, and posted it here in case people were interested. Anybody is very welcome to take the code and put it in a benchmarking suite if they wish, but I have no interest in making it systematic.