Other languages (such as Java) have several entries on the benchmarks.
We could also have several versions: plain Julia, Julia with some package, Optimized Julia…
I like the idea of using packages. Especially StaticArrays because it is always repeated that packages are not second grade citizens in the julia community.
Also, the vast majority who look at the source code of Julia programs on the benchmarks game website will be unfamiliar with the language — so inline comments which explain language elements particular to Julia can be very helpful.
While I think it would be nice to show how easy it is to re-implement the limited functionality needed for these benchmarks without using external packages, there is precedent for using non-standard packages: Java uses it.unimi.dsi.fastutil.longs.Long2IntOpenHashMap
in the k-nucleotide benchmark: k-nucleotide Java program (Benchmarks Game).
By the way, BioJulia people, step up your game. Java looks beatable for k-nucleotide in terms of both readability and performance, currently way behind in the latter category.
This community is amazing! I cloned the repo today to try to improve some tests and I found lots of commits with lots of improvements just 3 days after! Amazing!
@kristoffer.carlsson any idea about what is left to do?
Mandelbrot, binary trees nbodies, fannkuchredux should be in relatively good shape now… If i didn’t miss anything, the rest should still leave quite a bit of room for improvement!
The bottleneck in fannkuch-redux is the function count_flips
which makes lots of array accesses. I guess the fastest implementations use loop-unrolling techniques to utilize out-of-order execution stuff more, specifically on the loop for k = 1:15
whose maximum number of iterations is (made) guaranteed at compile-time. I tried @nexprs
but could not get an improvement maybe someone can take a look at there.
fyi Programs with manual loop unrolling are generally not accepted — that can become a grey area when there are vector operations.
fyi Programs that use fastmath are not accepted.
What is meant by manual loop unrolling, typing what each iteration should run one-by-one? Is letting a built-in macro (if any) do this kind of optimization accepted as manual loop unrolling, given that it is not done for a specific input?
Some compilers unrolls loops at compile time which I think in the same way such a macro would do and some implementations explicitly or implicitly use this feature.
Does it show what a clever programmer could do, or does it show what the language implementation does?
It’s kind of both, isn’t it? Please take a look at this C implementation. As noted in the code, low_index<16
is put there violating the algorithm purposefully to utilize loop-unrolling (since otherwise the compiler cannot unroll a loop that can run an unknown number of times), which is, I would argue, something that a clever programmer could do. Other implementations also make this 16
trick.
I can also argue that an implementation in Julia using macros can show what the language implementation does.
An implementation in Julia doesn’t need that. It can just use StaticArrays which should do that.
Another point to merge StaticArrays into stdlib
The fastest C implementation makes use of this trick distance = _mm_cvtps_pd(_mm_rsqrt_ps(_mm_cvtpd_ps(dsquared)))
to drop the precision to Float32
so that the SSE instruction for approximate reciprocal square root can be used, which doesn’t exist for Float64
until the AVX-512 extension I gather looking at Intel intrinsics guide.
This is interesting. If julia unrolls loops with StaticArrays, should not it also unroll loops like for i = 1:16
? Perhaps, it already is unrolling.
Also, it was said we avoid using packages.
Afaik @fastmath
won’t emit this; see here if you want to use that trick. I wouldn’t call this idiomatic julia, though.
Is this an official rule or the benchmarks game?
Honestly, I have no idea, but people so far avoided it. I referred to this post above.
I don’t think these are unrolled by default, but Unrolled.jl has a macro which does this: Unrolled.@unroll_loop for i=1:16 ...
will work, and in some quick tests can be faster. (It’s a one-line macro, btw, so would be trivial to include without having to call the package.)
Edit: then I read the rest of the thread, sorry! Does this count as manual or automated unrolling?