Small benchmark

I’m new to Julia, and was curious to test its famed performance, so today I made a small benchmark comparing it to C, optimized Python, and Scala. I got some pretty interesting results, with Julia falling a mere 3% behind the best C implementation of the code. I’m impressed!

My test was just calculating the exponential function using the same method found in glibc (“math.h”). Key to achieving the highest speeds was using @fastmath, but interestingly the C was also pretty slow without the equivalent flag; I don’t know what optimizations I might be missing there. I’m still writing a blog post about it (usually I write too much), but I would like to share the code and results in the forum right away. Any opinions and comments are appreciated.
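
For readers following along, here is a minimal sketch of the kind of kernel under discussion (myexp and the i/n timing loop are referenced later in the thread; the truncated Taylor series is a stand-in I’m assuming for illustration, not the actual glibc-style method):

```julia
# Sketch only: the real myexp follows the glibc method, which is not shown
# here; a truncated Taylor series stands in for it.
function myexp(x::Float64)
    term = 1.0
    s = 1.0
    for k in 1:13          # 13 terms is plenty for x in (0, 1]
        term *= x / k
        s += term
    end
    return s
end

# Timing loop in the shape the thread discusses: myexp of i/n for i = 1..n,
# with @fastmath annotating the whole loop.
function sumexp(n::Int)
    s = 0.0
    @fastmath for i in 1:n
        s += myexp(i / n)
    end
    return s
end

@time sumexp(10^8)
```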

9 Likes

I have already published the results from this experiment in a blog post.

9 Likes

Nice article! Some feedback on your benchmark:

  • You’re including a call to the built-in exp function in the timing, which presumably can vary a bit in performance between the different languages. Is that intentional? I think it’d be more interesting to just benchmark the myexp code.
  • You’re dividing by n for each iteration, which is quite slow. At least in Julia, you can save some time by instead multiplying by a pre-calculated 1/n (see the sketch after this list). That could explain some of the anomalies you’re seeing with Scala, since the code looks a bit different there (you have 0.6/n instead of i/n, so perhaps the compiler is clever enough to do this optimization for you?).
  • How many times did you run this to produce the benchmark numbers? When I run your script over and over again, timings vary by a few percent between runs. So if you’re interested in performance differences as small as 3%, you should probably run your experiment many times to ensure that the results are statistically sound. Perhaps you accounted for this already.
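
To make the division point concrete, here is a minimal sketch of the reciprocal trick (the function names are illustrative, not from the original benchmark):

```julia
function sum_div(n::Int)
    s = 0.0
    for i in 1:n
        s += i / n          # one floating-point division per iteration
    end
    return s
end

function sum_mul(n::Int)
    inv_n = 1.0 / n         # hoisted: a single division in total
    s = 0.0
    for i in 1:n
        s += i * inv_n      # multiplication is several times cheaper
    end
    return s
end
```

Note that i / n and i * inv_n are not guaranteed to be bitwise identical in IEEE arithmetic, which is why a compiler may only make this substitution for you under fast-math-style flags; doing it by hand keeps the comparison fair across languages.
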
1 Like

Thanks for reading! I think leaving the system exp in shouldn’t matter much, because in the best case all implementations would be using the best, well-optimized version available, and in the worst case one of them might not, but that should count as a demerit for that language. In the end it makes the benchmark a little more “broad-spectrum”, perhaps. About the division, I would also hope the compiler can pick that up, and if a language requires that care, it should also count as a demerit. But it would definitely be interesting to test whether this was the case in any of them.

The tests were done with only a single run, so not too carefully. Some of the numbers were quite consistent, though. The only care I took was to run the test starting with larger batches of numbers to be fairer to JIT languages — we’re not interested in things like start-up time, after all. Once I have a better idea of how I might improve this benchmark in other ways, I’ll definitely make more careful measurements, and then we can look closer at that 3% difference.

Unoptimized Julia (without @fastmath) is faster than C with -O2. If that’s a sound result, it’s pretty impressive.
Edit: Did you try the plain for loop, without @simd?

1 Like

Even better than -O3! It is pretty much the same as “-O2 -finline-functions” there, and similar performance was attained by Numba and Scala… Maybe there’s just some secret option I am missing that might also make -Ofast even faster, who knows, but that’s what I got here… It would be great to hear if anyone can reproduce this result. I also have not tried Clang yet.

1 Like

The @simd actually made practically no difference, so I should remove it from that code.

1 Like

Sure, but it adds an unknown. For example, in Julia you enable fast math with a macro on the specific test method; in C you pass a flag to the compiler. Does that mean that you’re enabling fast math for the built-in exp in the C version, but not the Julia version? If so, it’s not a fair comparison. (And even if not, it still leaves me as a reader wondering.)
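
A small sketch of that scoping difference (f and g are illustrative names):

```julia
# @fastmath is lexically scoped: it rewrites operators and known calls like
# exp inside the annotated expression only; it does not propagate into
# callees or the rest of the program the way a compiler-wide -ffast-math does.
f(x) = exp(x) + 1.0             # untouched, even when called from fast-math code

g(x) = @fastmath exp(x) + 1.0   # exp here becomes Base.FastMath.exp_fast
```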

But you’ve coded it differently in the different languages. As I wrote, in Julia you have i/n and in some other languages you have 0.6/n. These are two very different expressions; the latter is a constant within the loop and can easily be optimized by a compiler. The former is not. In fact, if I change the Julia implementation to use 0.6/n, the run time for 1e9 iterations (for your method only, not the built-in exp) goes from 10 seconds to 6.7 seconds(!). So this likely explains why it seemed like Scala was faster than the other languages.
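
In loop form, the difference looks like this (a sketch, using the built-in exp as a stand-in for the benchmarked function):

```julia
function loop_varying(n::Int)   # i/n is different every iteration: n divisions
    s = 0.0
    for i in 1:n
        s += exp(i / n)
    end
    return s
end

function loop_invariant(n::Int) # 0.6/n never changes, so the compiler may
    s = 0.0                     # legally hoist it out of the loop entirely
    for i in 1:n
        s += exp(0.6 / n)
    end
    return s
end
```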

For Julia, take a look at BenchmarkTools. Or for something reproducible between various languages, consider running your test 100 times and reporting the minimum and median times.
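
For example, a minimal BenchmarkTools session might look like this (myexp_sum is an illustrative stand-in for the benchmarked function; the package must be installed first):

```julia
using BenchmarkTools

myexp_sum(n) = sum(exp(i / n) for i in 1:n)

# @benchmark runs the expression many times and reports minimum, median, and
# mean; interpolate arguments with $ so setup cost isn't measured.
@benchmark myexp_sum($(10^6))

# or, for a one-line report of the minimum time:
@btime myexp_sum($(10^6))
```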

1 Like

Indeed, the Scala loop is not consistent with the others, I’ll fix that in the next iteration, thanks!

I use functions like exp a lot in my code, so I’m definitely interested in knowing how fast my code is with that, regardless of the reasons. It would certainly be interesting to know all these details for sure, but we would need multiple specific tests to understand that, and I only had time for one at the moment.

Edit: Thinking a bit more about this, maybe a different exp is precisely the reason gcc -O3 was slower. I’ll try to benchmark just some pure functions like that later.

Including fast math? Enabling fast math is not a sign that one compiler or language is better than another. Among other things, it breaks IEEE compliance and only supports finite math. Yes, it can vastly speed up code, but I rarely find that I can actually use it in real world applications. Benchmarking one language with fast math enabled and one with it disabled (or semi-enabled), and drawing conclusions about general language performance, is nonsense IMO. If I take the Julia code in your article and replace the word @simd with @fastmath, the 1e9 test case goes from 17 seconds to 12 seconds on my system, which would completely obliterate any other benchmark in your article (assuming that you’re seeing the same timings; I haven’t tried your other implementations).

A few other notes:

  • Julia can also be started with the -O flag to control optimization level. It defaults to 2; for optimum performance, consider setting this to 3 (doesn’t affect your code on my system, but it’s a good habit).
  • Your Julia implementation seems to include a call to abs which is missing in the other languages?
  • Prefer System.nanoTime() over System.currentTimeMillis() for benchmarking in JVM-based languages.
  • The Python implementation using %timeit is unfair since it runs the code many times and selects the best time, while other languages just run it once. (I think it also disables GC, although that shouldn’t be an issue in your case since you’re not allocating memory.)

Hope I’m not coming across as too critical :slight_smile: I like your article, but accurate benchmarking is very difficult, so be careful making assumptions about the results you get. Unexplainable results and anomalies are often caused by something wrong in the experiment itself, not by something at the language level.

1 Like

I appreciate the scrutiny, and I hope you realize I just cooked up all of this code during the weekend, looking for a first clue about whether it is true that we can get high performance with Julia. There are certainly many ways it can be improved; this is by no means a mature benchmark! I am well aware of how difficult benchmarking is, but we need to start somewhere, and I am sharing my results as soon as possible to get feedback early on. Please think of this more as a collaborative effort looking for contributors than as a final external report that must be contradicted.

The experiments are completely clear about where fastmath was used, and I definitely intend to stay in complete control of that at all times. When I say I don’t care why it is faster, I don’t mean things like fastmath or even SIMD parallelism; I mean a possibly faster implementation with a different method for exp that is equally accurate. As a good scientist should expect.

I have actually already heard from the C team that maybe Julia, Scala, and NumPy could all be using fastmath implicitly for exp(), and that this is the reason they might have been faster than “C -O3”. A pretty serious accusation, yet hard to prove. But we’ll get to the bottom of this eventually.

Julia does not implicitly use fastmath. In fact, it uses its own Julia-based implementation in order to achieve ~1ulp since the system libraries do not always do so. So it has actually been shown that the Julia exp is more consistently accurate than the C stdlib versions, unless you add the @fastmath macro to change exp to Base.FastMath.exp_fast.
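
You can check that substitution directly in the REPL (the exact printed form may vary slightly between Julia versions):

```julia
julia> @macroexpand @fastmath exp(x)
:(Base.FastMath.exp_fast(x))
```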

As native Julia code in an open source project, there is no “get to the bottom of this eventually”: you can get to the bottom of this right now by looking at the code yourself in base/special/exp.jl, which is just standard Julia code. By clicking on the history for this file (History for base/special/exp.jl - JuliaLang/julia · GitHub) you can see every edit and discussion that has gone into its development. As you can see from the PRs, this functionality is from @musm and comes from Amal.jl, his libm testing ground. In that repository you can see and run the benchmark which was used to verify the accuracy to 1ulp.

This same setup was applied to research other libms, like SLEEF (rewritten as Sleef.jl: GitHub - musm/SLEEF.jl: A pure Julia port of the SLEEF math library), and was able to uncover inaccuracies in other libms, such as Log for small subnormals not accurate to within 1ulp · Issue #2 · musm/SLEEF.jl · GitHub.

So, knowing that all of this is online, what evidence did the C team give to state that Julia is implicitly using something equivalent to their ~3ulp fastmath?

5 Likes

Thanks a lot for your great answer. Indeed, looking into the Julia code has been the easiest and most enjoyable part of the investigation so far.

Reading the code of Julia is almost always a pretty good idea, since it’s almost 70% pure Julia (as of today).

So for learning how to do stuff in Julia it’s a good reference if you’re comfortable reading other people’s code.
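
If you want to jump straight from the REPL into that source, the standard introspection macros help (a sketch of a typical session):

```julia
julia> @which exp(1.0)   # which method is dispatched to, and where it's defined

julia> @less exp(1.0)    # view the source of that method in the pager

julia> @edit exp(1.0)    # open the same source location in your editor
```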

1 Like

Even that is a low estimate. The C and C++ code is the Julia runtime, and the Scheme code is the Julia parser. In terms of Julia itself, what’s not implemented in Julia code is essentially the type system and its types, like DataType, the Array type, and the expression types. The rest is all defined in Julia.

To show this, look at the top of boot.jl. This is the first file run at Julia startup, and it lists in a comment at the top everything that exists in Julia before the Julia-defined Base code; this comes to 141 lines (though it leaves out a few primitive functions like eval). The rest is all defined in Julia, starting with integers and numbers.

The only caveat is the later portions of the stdlib defined by bindings to things like BLAS and SuiteSparse, or any packages you use which bind to binaries.

So seeing that, I think it’s fair to say that the Julia one actually interacts with is almost entirely written and defined in Julia, likely >95%. The only pieces that people regularly touch that are defined pre-Julia are the Array, Union, and Expr types. Even then, all of the functions on them, including the normal constructors, are defined in Julia.
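
As a quick check of that claim in the REPL (output abbreviated; the exact file and line vary by Julia version):

```julia
julia> @which 1 + 1    # even integer addition is an ordinary Julia method
+(x::T, y::T) where T<:Union{Int128, Int16, ...} in Base at int.jl
```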

6 Likes