Small benchmark

I’m new to Julia, and was curious to test its famed performance, so today I made a small benchmark comparing it to C, optimized Python, and Scala. I got some pretty interesting results, with Julia falling a mere 3% behind the best C implementation of the code. I’m impressed!

My test was just calculating the exponential function using the same method found in glibc (“math.h”). Key to achieving the highest speeds was using @fastmath, but interestingly the C was also pretty slow without the equivalent flag; I don’t know what optimizations I might be missing there. I’m still writing a blog post about it (usually I write too much), but I would like to share the code and results in the forum right away. Any opinions and comments are appreciated.
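
For readers following along, here is a minimal sketch of the kind of kernel under discussion (myexp and the i/n timing loop are referenced later in the thread; the truncated Taylor series is a stand-in I’m assuming for illustration, not the actual glibc-style method):

```julia
# Sketch only: the real myexp follows the glibc method, which is not shown
# here; a truncated Taylor series stands in for it.
function myexp(x::Float64)
    term = 1.0
    s = 1.0
    for k in 1:13          # 13 terms is plenty for x in (0, 1]
        term *= x / k
        s += term
    end
    return s
end

# Timing loop in the shape the thread discusses: myexp of i/n for i = 1..n,
# with @fastmath annotating the whole loop.
function sumexp(n::Int)
    s = 0.0
    @fastmath for i in 1:n
        s += myexp(i / n)
    end
    return s
end

@time sumexp(10^8)
```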

9 Likes

I have already published the results from this experiment in a blog post.

9 Likes

Nice article! Some feedback on your benchmark:

  • You’re including a call to the built-in exp function in the timing, which presumably can vary a bit in performance between the different languages. Is that intentional? I think it’d be more interesting to just benchmark the myexp code.
  • You’re dividing by n for each iteration, which is quite slow. At least in Julia, you can save some time by instead multiplying by a pre-calculated 1/n (see the sketch after this list). That could explain some of the anomalies you’re seeing with Scala, since the code looks a bit different there (you have 0.6/n instead of i/n, so perhaps the compiler is clever enough to do this optimization for you?).
  • How many times did you run this to produce the benchmark numbers? When I run your script over and over again, timings vary by a few percent between runs. So if you’re interested in performance differences as small as 3%, you should probably run your experiment many times to ensure that the results are statistically sound. Perhaps you accounted for this already.
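
To make the division point concrete, here is a minimal sketch of the reciprocal trick (the function names are illustrative, not from the original benchmark):

```julia
function sum_div(n::Int)
    s = 0.0
    for i in 1:n
        s += i / n          # one floating-point division per iteration
    end
    return s
end

function sum_mul(n::Int)
    inv_n = 1.0 / n         # hoisted: a single division in total
    s = 0.0
    for i in 1:n
        s += i * inv_n      # multiplication is several times cheaper
    end
    return s
end
```

Note that i / n and i * inv_n are not guaranteed to be bitwise identical in IEEE arithmetic, which is why a compiler may only make this substitution for you under fast-math-style flags; doing it by hand keeps the comparison fair across languages.
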
1 Like

Thanks for reading! I think leaving the system exp in shouldn’t matter much, because in the best case all implementations would be using the best, well-optimized version available, and in the worst case one of them might not, but that should count as a demerit for that language. In the end it makes the benchmark a little more “broad-spectrum”, perhaps. About the division, I would also hope the compiler can pick that up, and if a language requires that care, it should also count as a demerit. But it would definitely be interesting to test whether this was the case in any of them.

The tests were done with only a single run, so not too carefully. Some of the numbers were quite consistent, though. The only care I took was to run the test starting with larger batches of numbers to be fairer to JIT languages — we’re not interested in things like start-up time, after all. Once I have a better idea of how I might improve this benchmark in other ways, I’ll definitely make more careful measurements, and then we can look closer at that 3% difference.

Unoptimized Julia (without @fastmath) is faster than C with -O2. If that’s a sound result, it’s pretty impressive.
Edit: Did you try the plain for loop, without @simd?

1 Like

Even better than -O3! It is pretty much the same as “-O2 -finline-functions” there, and similar performance was attained by Numba and Scala… Maybe there’s just some secret option I am missing that might also make -Ofast even faster, who knows, but that’s what I got here… It would be great to hear if anyone can reproduce this result. I also have not tried Clang yet.

1 Like

The @simd actually made practically no difference, so I should remove it from that code.

1 Like

Sure, but it adds an unknown. For example, in Julia you enable fast math with a macro on the specific test method; in C you pass a flag to the compiler. Does that mean that you’re enabling fast math for the built-in exp in the C version, but not the Julia version? If so, it’s not a fair comparison. (And even if not, it still leaves me as a reader wondering.)
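
A small sketch of that scoping difference (f and g are illustrative names):

```julia
# @fastmath is lexically scoped: it rewrites operators and known calls like
# exp inside the annotated expression only; it does not propagate into
# callees or the rest of the program the way a compiler-wide -ffast-math does.
f(x) = exp(x) + 1.0             # untouched, even when called from fast-math code

g(x) = @fastmath exp(x) + 1.0   # exp here becomes Base.FastMath.exp_fast
```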

But you’ve coded it differently in the different languages. As I wrote, in Julia you have i/n and in some other languages you have 0.6/n. These are two very different expressions; the latter is a constant within the loop and can easily be optimized by a compiler. The former is not. In fact, if I change the Julia implementation to use 0.6/n, the run time for 1e9 iterations (for your method only, not the built-in exp) goes from 10 seconds to 6.7 seconds(!). So this likely explains why it seemed like Scala was faster than the other languages.
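
In loop form, the difference looks like this (a sketch, using the built-in exp as a stand-in for the benchmarked function):

```julia
function loop_varying(n::Int)   # i/n is different every iteration: n divisions
    s = 0.0
    for i in 1:n
        s += exp(i / n)
    end
    return s
end

function loop_invariant(n::Int) # 0.6/n never changes, so the compiler may
    s = 0.0                     # legally hoist it out of the loop entirely
    for i in 1:n
        s += exp(0.6 / n)
    end
    return s
end
```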

For Julia, take a look at BenchmarkTools. Or for something reproducible between various languages, consider running your test 100 times and reporting the minimum and median times.
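
For example, a minimal BenchmarkTools session might look like this (myexp_sum is an illustrative stand-in for the benchmarked function; the package must be installed first):

```julia
using BenchmarkTools

myexp_sum(n) = sum(exp(i / n) for i in 1:n)

# @benchmark runs the expression many times and reports minimum, median, and
# mean; interpolate arguments with $ so setup cost isn't measured.
@benchmark myexp_sum($(10^6))

# or, for a one-line report of the minimum time:
@btime myexp_sum($(10^6))
```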

1 Like

Indeed, the Scala loop is not consistent with the others, I’ll fix that in the next iteration, thanks!

I use functions like exp a lot in my code, so I’m definitely interested in knowing how fast my code is with that, regardless of the reasons. It would certainly be interesting to know all these details for sure, but we would need multiple specific tests to understand that, and I only had time for one at the moment.

Edit: Thinking a bit more about this, maybe a different exp is precisely the reason gcc -O3 was slower. I’ll try to benchmark just some pure functions like that later.

Including fast math? Enabling fast math is not a sign that one compiler or language is better than another. Among other things, it breaks IEEE compliance and only supports finite math. Yes, it can vastly speed up code, but I rarely find that I can actually use it in real world applications. Benchmarking one language with fast math enabled and one with it disabled (or semi-enabled), and drawing conclusions about general language performance, is nonsense IMO. If I take the Julia code in your article and replace the word @simd with @fastmath, the 1e9 test case goes from 17 seconds to 12 seconds on my system, which would completely obliterate any other benchmark in your article (assuming that you’re seeing the same timings; I haven’t tried your other implementations).

A few other notes:

  • Julia can also be started with the -O flag to control optimization level. It defaults to 2; for optimum performance, consider setting this to 3 (doesn’t affect your code on my system, but it’s a good habit).
  • Your Julia implementation seems to include a call to abs which is missing in the other languages?
  • Prefer System.nanoTime() over System.currentTimeMillis() for benchmarking in JVM-based languages.
  • The Python implementation using %timeit is unfair since it runs the code many times and selects the best time, while other languages just run it once. (I think it also disables GC, although that shouldn’t be an issue in your case since you’re not allocating memory.)

Hope I’m not coming across as too critical :slight_smile: I like your article, but accurate benchmarking is very difficult, so be careful making assumptions about the results you get. Unexplainable results and anomalies are often caused by something wrong in the experiment itself, not by something at the language level.

1 Like

I appreciate the scrutiny, and I hope you realize I just cooked up all of this code during the weekend, looking for a first clue about whether it is true that we can get high performance with Julia. There are certainly many ways it can be improved; this is by no means a mature benchmark! I am well aware of how difficult benchmarking is, but we need to start somewhere, and I am sharing my results as soon as possible to get feedback early on. Please think of this more as a collaborative effort looking for contributors than as a final external report that must be contradicted.

The experiments are completely clear about where fastmath was used, and I definitely intend to stay in complete control of that at all times. When I say I don’t care why it is faster, I don’t mean things like fastmath or even SIMD parallelism; I mean a possibly faster implementation with a different method for exp that is equally accurate. As a good scientist should expect.

I have actually already heard from the C team that maybe Julia, Scala, and NumPy could all be using fastmath implicitly for exp(), and that this is the reason they might have been faster than “C -O3”. A pretty serious accusation, yet hard to prove. But we’ll get to the bottom of this eventually.

Julia does not implicitly use fastmath. In fact, it uses its own Julia-based implementation in order to achieve ~1ulp since the system libraries do not always do so. So it has actually been shown that the Julia exp is more consistently accurate than the C stdlib versions, unless you add the @fastmath macro to change exp to Base.FastMath.exp_fast.
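
You can check that substitution directly in the REPL (the exact printed form may vary slightly between Julia versions):

```julia
julia> @macroexpand @fastmath exp(x)
:(Base.FastMath.exp_fast(x))
```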

As native Julia code in an open source project, there is no “get to the bottom of this eventually”: you can get to the bottom of this right now by looking at the code yourself in base/special/exp.jl, which is just standard Julia code. By clicking on the history for this file (History for base/special/exp.jl - JuliaLang/julia · GitHub) you can see every edit and discussion that has gone into its development. As you can see from the PRs, this functionality is from @musm and comes from Amal.jl, his libm testing ground. In that repository you can see and run the benchmark which was used to verify the accuracy to 1ulp.

This same setup was applied to research other libms, like SLEEF (rewritten as Sleef.jl: GitHub - musm/SLEEF.jl: A pure Julia port of the SLEEF math library), and was able to uncover inaccuracies in other libms, such as Log for small subnormals not accurate to within 1ulp · Issue #2 · musm/SLEEF.jl · GitHub.

So, knowing that all of this is online, what evidence did the C team give to state that Julia is implicitly using something equivalent to their ~3ulp fastmath?

5 Likes

Thanks a lot for your great answer. Indeed, looking into the Julia code has been the easiest and most enjoyable part of the investigation so far.

Reading the code of Julia is almost always a pretty good idea, since it’s almost 70% pure Julia (as of today).

So for learning how to do stuff in Julia it’s a good reference if you’re comfortable reading other people’s code.
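
If you want to jump straight from the REPL into that source, the standard introspection macros help (a sketch of a typical session):

```julia
julia> @which exp(1.0)   # which method is dispatched to, and where it's defined

julia> @less exp(1.0)    # view the source of that method in the pager

julia> @edit exp(1.0)    # open the same source location in your editor
```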

1 Like

Even that is a low estimate. The C and C++ code is the Julia runtime, and the Scheme code is the Julia parser. In terms of Julia itself, what’s not implemented in Julia code is essentially the type system and its types, like DataType, the Array type, and the expression types. The rest is all defined in Julia.

To show this, look at the top of boot.jl. This is the first file run at Julia startup, and it lists in a comment at the top everything that exists in Julia before the Julia-defined Base code; this comes to 141 lines (though it leaves out a few primitive functions like eval). The rest is all defined in Julia, starting with integers and numbers.

The only caveat is the later portions of the stdlib defined by bindings to things like BLAS and SuiteSparse, or any packages you use which bind to binaries.

So seeing that, I think it’s fair to say that the Julia one actually interacts with is almost entirely written and defined in Julia, likely >95%. The only pieces that people regularly touch that are defined pre-Julia are the Array, Union, and Expr types. Even then, all of the functions on them, including the normal constructors, are defined in Julia.
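
As a quick check of that claim in the REPL (output abbreviated; the exact file and line vary by Julia version):

```julia
julia> @which 1 + 1    # even integer addition is an ordinary Julia method
+(x::T, y::T) where T<:Union{Int128, Int16, ...} in Base at int.jl
```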

6 Likes