I wonder if someone has taken a closer look at the Julia microbenchmarks or has some better understanding of what causes the performance difference between Julia and C.
I am guessing that Julia has higher overhead for function calls, which is why Julia is 1.96X slower on recursion_fibonacci, a benchmark that does almost no computation other than calling functions. This is probably part of the reason why matrix_statistics is slower too (note that the matrix size is just 5x5).
This recursion unpacking/unrolling trick that gcc does (at the call site if insulated by a call via a volatile function pointer, and always inside the recursive implementation) is, in my experience, a rare compiler optimization, but maybe it will catch on. clang does neither. If you objdump -D the executable (or disassemble in gdb/etc.) you will see a single callq to Fibonacci at the entry with the full N and a pair of callq inside the implementation. So with clang the full 1.618**n work happens. On my i7-6700K Linux box with gcc-7.1.0 & clang-4.0.1, I get a time ratio of about 15.3 to 1 (53.5s/3.5s). -cblake
Nim compiles into C or C++, and then you can choose the C/C++ compiler from there. D uses LLVM.
Nim with gcc is far faster at fibonacci than D or Nim + Clang.
Another user reported (to which "cblake" was responding):
First, with `cc = gcc` in nim.cfg:

```
$ time ./a_gcc.exe
12586269025.0

real    0m5.164s
user    0m0.015s
sys     0m0.015s
```

Then with `cc = clang` in nim.cfg:

```
$ time ./a_clang.exe
12586269025.0

real    2m2.055s
user    0m0.000s
sys     0m0.015s
```
That's 5 seconds for gcc vs 2 minutes for clang.
The difference we see with Julia is much smaller.
Odds are C would look more similar to Julia if Clang were used instead of gcc.
FWIW, in my experience, LLVM is far better at auto-vectorizing with AVX than gcc.
This often makes it easier to get performance out of Julia when just crunching numbers.
A comparison of gcc versus clang on the microbenchmarks. These are raw execution times, with gcc-7.31, clang-5.0.1, julia-0.7.0-DEV.

| benchmark | C gcc | C clang | Julia |
| --- | --- | --- | --- |
| iteration_pi_sum | 27.4 | 0.054 | 27.5 |
| matrix_multiply | 71.9 | 72.0 | 70.2 |
| matrix_statistics | 4.70 | 4.34 | 8.08 |
| parse_integers | 0.099 | 0.098 | 0.22 |
| print_to_file | 9.95 | 9.96 | 11.13 |
| recursion_fibonacci | 0.023 | 0.048 | 0.030 |
| recursion_quicksort | 0.26 | 0.25 | 0.26 |
| userfunc_mandelbrot | 0.077 | 0.064 | 0.054 |
clang beating gcc on iteration_pi_sum by almost three orders of magnitude is suspicious. Is clang noting that the sum is invariant and pulling it outside the loop over j?
```c
double pisum() {
    double sum = 0.0;
    for (int j = 0; j < 500; ++j) {
        sum = 0.0;                       /* reset discards the previous pass */
        for (int k = 1; k <= 10000; ++k) {
            sum += 1.0 / (k * k);
        }
    }
    return sum;
}
```
EDIT: The ratio of gcc time to clang time on iteration_pi_sum is 27/0.054 == 500. Seems pretty likely that clang is optimizing out the j=0; j<500 loop.
There's a delicate and ever-changing balancing act of allowing compilers to optimize but not too much in benchmarks. Since benchmarks tend to be constant, it's often possible to completely constant fold them, but that's not interesting to measure, so you want to prevent that; on the other hand it is a valid optimization. Bottom line: you should take microbenchmarks like this with a very large grain of salt. Beyond seeing whether a runtime is in the "fast pack" or the "slow pack" there's not that much you can conclude, especially within the faster end of the fast pack.
It seems clang is optimizing out the first 499 inner loops and only doing the last one, which makes sense since the first 499 loops don't contribute to the final answer. If there were another accumulator consuming sum, so that the result from each loop were used, it would not get optimized away.