Julia Performance vs. C

jinliangwei · June 14, 2018, 3:40pm

I wonder if someone has taken a closer look at the Julia microbenchmarks or has some better understanding of what causes the performance difference between Julia and C.

I am guessing that Julia has a higher overhead for function calls, which is why Julia is 1.96X slower in recursion_fibbnacci which does almost no computation other than calling functions. This is probably part of the reason why matrix_statistics is slower too (note that the matrix size is just 5x5).

But what makes Julia faster than C in some cases?

Thanks!

Elrod · June 14, 2018, 4:14pm

In a Nim vs D thread, they said gcc is better at optimizing recursive functions than LLVM.
https://forum.nim-lang.org/t/1779#19088

This recursion unpacking/unrolling trick that gcc does (at call-site if insulated by call via volatile function ptr, and always inside the recursive impl) is, in my experience, a rare compiler optimization, but maybe it will catch on. clang does neither. If you objdump -D the executable (or disassemble in gdb/etc.) you will see the single callq to Fibonacci at the entry with the full N and a pair of callq inside the impl. So with clang the full 1.618**n work happens. On my i7-6700K Linux with gcc-7.1.0&clang-4.0.1, I get a time ratio of about 15.3 to 1 (53.5s/3.5s). -cblake

Nim compiles into C or C++, and then you can choose the C/C++ compiler from there. D uses LLVM.
Nim with gcc is far faster at fibonacci than D or Nim + Clang.
Another user reported (to which “cblake” was responding) :

first, for cc = gcc in nim.cfg


$ time ./a_gcc.exe
12586269025.0

real    0m5.164s
user    0m0.015s
sys     0m0.015s

Then use cc = clang in nim.cfg

$ time ./a_clang.exe
12586269025.0

real    2m2.055s
user    0m0.000s
sys     0m0.015s

Thats 5 seconds for gcc vs 2 minutes for Clang.
The difference we see with Julia is much smaller.

Odds are C would look more similar to Julia if Clang were used instead of gcc.

FWIW, in my experience, LLVM is far better at auto-vectorizing with avx than gcc.
This often makes it easier to get performance out of Julia when just crunching numbers.

ChrisRackauckas · June 14, 2018, 4:17pm

No.

JITs allow you to make a lot more assumptions than AOT, that’s the explanation so far.

It’s likely due to some optimization with the call stack? It can just be differences in compilers.

John_Gibson · June 14, 2018, 5:28pm

A comparison of gcc versus clang on the microbenchmarks. These are raw execution times, with gcc-7.31, clang-5.0.1, julia-0.7.0-DEV.

	C gcc	C clang	Julia
iteration_pi_sum	27.4	0.054	27.5
matrix_multiply	71.9	72.0	70.2
matrix_statistics	4.70	4.34	8.08
parse_integers	0.099	0.098	0.22
print_to_file	9.95	9.96	11.13
recursion_fibonacci	0.023	0.048	0.030
recursion_quicksort	0.26	0.25	0.26
userfunc_mandelbrot	0.077	0.064	0.054

clang beating gcc on iteration_pi_sum by almost three orders of magnitude is suspicious. Is clang noting that the sum is invariant and pulling it outside the loop over j?

double pisum() {
    double sum = 0.0;
    for (int j=0; j<500; ++j) {
        sum = 0.0;
        for (int k=1; k<=10000; ++k) {
            sum += 1.0/(k*k);
        }
    }
    return sum;
}

EDIT: The ratio of gcc time to clang time on iteration_pi_sum is 27/0.054 == 500. Seems pretty likely that clang is optimizing out the j=0; j<500 loop.

jinliangwei · June 14, 2018, 8:04pm

I see. Thanks, everybody for your answers!

StefanKarpinski · June 14, 2018, 9:04pm

There’s a delicate and ever-changing balancing act of allowing compilers to optimize but not too much in benchmarks. Since benchmarks tend to be constant, it’s often possible to completely constant fold them, but that’s not interesting to measure, so you want to prevent that; on the other hand it is a valid optimization. Bottom line: you should take microbenchmarks like this with a very large grain of salt—beyond seeing whether a runtime is in the “fast pack” or the “slow pack” there’s not that much you can conclude, especially between the faster end of the fast pack.

tkoolen · June 14, 2018, 10:45pm

I recently saw a video of a talk from CppCon 2015 on how to benchmark C++ code (and code in general, I guess). Here’s a part of that video where methods for ‘defeating the optimizer’ without incurring overhead are discussed: CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!" - YouTube.

pauljurczak · June 15, 2018, 7:42am

John_Gibson:

double pisum() {
    double total = 0.0;
    for (int j=0; j<500; ++j) {
        double sum = 0.0;
        for (int k=1; k<=10000; ++k) {
            sum += 1.0/(k*k);
        }
        total += sum;
    }
    return total;
}

Try the code snippet above. The optimizers are way too smart for the original version.

jebej · June 15, 2018, 1:29pm

It seems Clang is optimizing out the 499 first inner loops, and only doing one of them, which makes sense since the 499 first loops don’t contribute to the final answer. If there was an other accumulator accumulating sum such that the result from each loop was used it would not get optimized away.

Topic		Replies	Views
Recursive Fibonacci Benchmark using top languages on Github Performance recursion	11	2847	September 29, 2018
Simple recursive Fibonaci example. How to make it faster? Performance	41	6930	December 5, 2024
Performance comparison of a custom gcd() function in Julia, C, and Rust Performance question , ffi	7	320	October 23, 2024
Intel C/C++ compiler performance versus Julia Offtopic	20	6239	August 11, 2021
Recursive call vs while loop Performance recursion	21	12260	April 12, 2022

Julia Performance vs. C

Related topics