Julia Performance vs. C


#1

I wonder if someone has taken a closer look at the Julia microbenchmarks or has some better understanding of what causes the performance difference between Julia and C.

I am guessing that Julia has a higher overhead for function calls, which is why Julia is 1.96X slower in recursion_fibbnacci which does almost no computation other than calling functions. This is probably part of the reason why matrix_statistics is slower too (note that the matrix size is just 5x5).

But what makes Julia faster than C in some cases?

Thanks!


#2

In a Nim vs D thread, they said gcc is better at optimizing recursive functions than LLVM.
https://forum.nim-lang.org/t/1779#19088

This recursion unpacking/unrolling trick that gcc does (at call-site if insulated by call via volatile function ptr, and always inside the recursive impl) is, in my experience, a rare compiler optimization, but maybe it will catch on. clang does neither. If you objdump -D the executable (or disassemble in gdb/etc.) you will see the single callq to Fibonacci at the entry with the full N and a pair of callq inside the impl. So with clang the full 1.618**n work happens. On my i7-6700K Linux with gcc-7.1.0&clang-4.0.1, I get a time ratio of about 15.3 to 1 (53.5s/3.5s). -cblake

Nim compiles into C or C++, and then you can choose the C/C++ compiler from there. D uses LLVM.
Nim with gcc is far faster at fibonacci than D or Nim + Clang.
Another user reported (to which “cblake” was responding) :

first, for cc = gcc in nim.cfg


$ time ./a_gcc.exe
12586269025.0

real    0m5.164s
user    0m0.015s
sys     0m0.015s

Then use cc = clang in nim.cfg

$ time ./a_clang.exe
12586269025.0

real    2m2.055s
user    0m0.000s
sys     0m0.015s

Thats 5 seconds for gcc vs 2 minutes for Clang.
The difference we see with Julia is much smaller.

Odds are C would look more similar to Julia if Clang were used instead of gcc.

FWIW, in my experience, LLVM is far better at auto-vectorizing with avx than gcc.
This often makes it easier to get performance out of Julia when just crunching numbers.


#3

No.

JITs allow you to make a lot more assumptions than AOT, that’s the explanation so far.

It’s likely due to some optimization with the call stack? It can just be differences in compilers.


#4

A comparison of gcc versus clang on the microbenchmarks. These are raw execution times, with gcc-7.31, clang-5.0.1, julia-0.7.0-DEV.

C gcc C clang Julia
iteration_pi_sum 27.4 0.054 27.5
matrix_multiply 71.9 72.0 70.2
matrix_statistics 4.70 4.34 8.08
parse_integers 0.099 0.098 0.22
print_to_file 9.95 9.96 11.13
recursion_fibonacci 0.023 0.048 0.030
recursion_quicksort 0.26 0.25 0.26
userfunc_mandelbrot 0.077 0.064 0.054

clang beating gcc on iteration_pi_sum by almost three orders of magnitude is suspicious. Is clang noting that the sum is invariant and pulling it outside the loop over j?

double pisum() {
    double sum = 0.0;
    for (int j=0; j<500; ++j) {
        sum = 0.0;
        for (int k=1; k<=10000; ++k) {
            sum += 1.0/(k*k);
        }
    }
    return sum;
}

EDIT: The ratio of gcc time to clang time on iteration_pi_sum is 27/0.054 == 500. Seems pretty likely that clang is optimizing out the j=0; j<500 loop.


#5

I see. Thanks, everybody for your answers!


#6

There’s a delicate and ever-changing balancing act of allowing compilers to optimize but not too much in benchmarks. Since benchmarks tend to be constant, it’s often possible to completely constant fold them, but that’s not interesting to measure, so you want to prevent that; on the other hand it is a valid optimization. Bottom line: you should take microbenchmarks like this with a very large grain of salt—beyond seeing whether a runtime is in the “fast pack” or the “slow pack” there’s not that much you can conclude, especially between the faster end of the fast pack.


#7

I recently saw a video of a talk from CppCon 2015 on how to benchmark C++ code (and code in general, I guess). Here’s a part of that video where methods for ‘defeating the optimizer’ without incurring overhead are discussed: https://youtu.be/nXaxk27zwlk?t=40m40s.


#8

Try the code snippet above. The optimizers are way too smart for the original version.


#9

It seems Clang is optimizing out the 499 first inner loops, and only doing one of them, which makes sense since the 499 first loops don’t contribute to the final answer. If there was an other accumulator accumulating sum such that the result from each loop was used it would not get optimized away.