Julia Performance vs. C

I wonder if someone has taken a closer look at the Julia microbenchmarks or has some better understanding of what causes the performance difference between Julia and C.

I am guessing that Julia has higher overhead for function calls, which is why Julia is 1.96x slower on recursion_fibonacci, a benchmark that does almost no computation other than calling functions. This is probably part of the reason matrix_statistics is slower too (note that the matrix size is just 5x5).

But what makes Julia faster than C in some cases?

Thanks!

In a Nim vs D thread, they said gcc is better at optimizing recursive functions than LLVM.
https://forum.nim-lang.org/t/1779#19088

This recursion unpacking/unrolling trick that gcc does (at call-site if insulated by call via volatile function ptr, and always inside the recursive impl) is, in my experience, a rare compiler optimization, but maybe it will catch on. clang does neither. If you objdump -D the executable (or disassemble in gdb/etc.) you will see the single callq to Fibonacci at the entry with the full N and a pair of callq inside the impl. So with clang the full 1.618**n work happens. On my i7-6700K Linux with gcc-7.1.0&clang-4.0.1, I get a time ratio of about 15.3 to 1 (53.5s/3.5s). -cblake

Nim compiles to C or C++, and from there you can choose the C/C++ compiler. The D compiler used in that comparison (LDC) is LLVM-based.
Nim with gcc is far faster at Fibonacci than D or Nim + clang.
Another user (to whom "cblake" was responding) reported:

First, with cc = gcc in nim.cfg:


$ time ./a_gcc.exe
12586269025.0

real    0m5.164s
user    0m0.015s
sys     0m0.015s

Then with cc = clang in nim.cfg:

$ time ./a_clang.exe
12586269025.0

real    2m2.055s
user    0m0.000s
sys     0m0.015s

That's 5 seconds for gcc vs. 2 minutes for clang.
The difference we see with Julia is much smaller.

Odds are C would look more similar to Julia if Clang were used instead of gcc.

FWIW, in my experience, LLVM is far better at auto-vectorizing with avx than gcc.
This often makes it easier to get performance out of Julia when just crunching numbers.
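A typical shape that LLVM auto-vectorizes well is a simple reduction over an array. A minimal sketch (note: for a floating-point reduction like this, clang needs -O3 plus -ffast-math to reassociate the sum into packed SIMD adds; with AVX on x86 that means processing several doubles per instruction):

```c
#include <stddef.h>

// Sum an array of doubles. Compiled with `clang -O3 -ffast-math -mavx2`,
// the loop is typically turned into vectorized (packed) additions;
// without -ffast-math the strict left-to-right FP order blocks this.
double sum_array(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}
```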

2 Likes

No.

JITs allow you to make a lot more assumptions than AOT compilation; that's the explanation so far.

It's likely due to some optimization of the call stack? It could just be differences between compilers.

1 Like

A comparison of gcc versus clang on the microbenchmarks. These are raw execution times, with gcc-7.3.1, clang-5.0.1, julia-0.7.0-DEV.

Benchmark             C (gcc)   C (clang)   Julia
iteration_pi_sum      27.4      0.054       27.5
matrix_multiply       71.9      72.0        70.2
matrix_statistics     4.70      4.34        8.08
parse_integers        0.099     0.098       0.22
print_to_file         9.95      9.96        11.13
recursion_fibonacci   0.023     0.048       0.030
recursion_quicksort   0.26      0.25        0.26
userfunc_mandelbrot   0.077     0.064       0.054

clang beating gcc on iteration_pi_sum by almost three orders of magnitude is suspicious. Is clang noting that the sum is invariant and pulling it outside the loop over j?

double pisum() {
    double sum = 0.0;
    for (int j=0; j<500; ++j) {
        sum = 0.0;
        for (int k=1; k<=10000; ++k) {
            sum += 1.0/(k*k);
        }
    }
    return sum;
}

EDIT: The ratio of gcc time to clang time on iteration_pi_sum is 27.4/0.054, roughly 500, which matches the outer loop's 500 iterations. Seems pretty likely that clang is running the j loop body only once.

6 Likes

I see. Thanks, everybody, for your answers!

There's a delicate and ever-changing balancing act in benchmarking: you want to let compilers optimize, but not too much. Since benchmarks tend to be constant, it's often possible to completely constant-fold them, which isn't interesting to measure, so you want to prevent that; on the other hand, it is a valid optimization. Bottom line: you should take microbenchmarks like this with a very large grain of salt. Beyond seeing whether a runtime is in the "fast pack" or the "slow pack", there's not much you can conclude, especially between runtimes at the faster end of the fast pack.

4 Likes

I recently saw a video of a talk from CppCon 2015 on how to benchmark C++ code (and code in general, I guess). Here's a part of that video where methods for "defeating the optimizer" without incurring overhead are discussed: CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!" - YouTube.

1 Like

Try the code snippet above. The optimizers are way too smart for the original version.

2 Likes

It seems clang is optimizing out the first 499 iterations of the outer loop and only executing the last one, which makes sense since those 499 iterations don't contribute to the final answer. If there were another accumulator collecting the result of each outer iteration, so that every iteration's sum was actually used, the loop would not get optimized away.
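One way to apply that fix is to feed every outer iteration's sum into the returned value. A sketch, assuming we may change the benchmark's return value as long as it stays numerically equivalent (the actual benchmark suite may use a different fix):

```c
// Variant of pisum() where each outer iteration contributes to the
// result, so the compiler cannot discard the first 499 iterations.
double pisum_fixed(void) {
    double total = 0.0;
    for (int j = 0; j < 500; ++j) {
        double sum = 0.0;
        for (int k = 1; k <= 10000; ++k)
            sum += 1.0 / (k * k);
        total += sum;        // every outer iteration is now live
    }
    return total / 500.0;    // same value as the original pisum()
}
```

Dividing by 500 at the end keeps the returned value identical to the original, so the benchmark still verifies, while the dependence of `total` on all 500 iterations defeats the dead-loop elimination.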