Julia programs now shown on benchmarks game website

It’s kind of both, isn’t it? Please take a look at this C implementation. As noted in the code, low_index<16 is put there, purposefully violating the algorithm, to enable loop unrolling (since otherwise the compiler cannot unroll a loop that runs an unknown number of times), which is, I would argue, something a clever programmer could do. Other implementations also use this 16 trick.

I could also argue that an implementation in Julia using macros would show off what the language can do.

An implementation in Julia doesn’t need that trick. It can just use StaticArrays, which should take care of that.
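
For example (a minimal sketch, assuming the n-body distance computation; the function name is mine): with StaticArrays the vector length is part of the type, so the compiler sees a fixed trip count and unrolls on its own.

using StaticArrays

# The length 3 is encoded in the type, so the subtraction and the
# reduction compile to straight-line (unrolled) code.
distance_squared(a::SVector{3,Float64}, b::SVector{3,Float64}) = sum(abs2, a - b)

# e.g. distance_squared(SVector(0.0, 0.0, 0.0), SVector(1.0, 2.0, 2.0)) == 9.0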

Another point in favor of merging StaticArrays into the stdlib 🙂

The fastest C implementation makes use of this trick: distance = _mm_cvtps_pd(_mm_rsqrt_ps(_mm_cvtpd_ps(dsquared))). It drops the precision to Float32 so that the SSE instruction for an approximate reciprocal square root can be used; from the Intel intrinsics guide, I gather no such instruction exists for Float64 until the AVX-512 extension.
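
In Julia terms, the idea looks roughly like this (a sketch of the math only, with names of my choosing; plain Julia won’t emit the approximate instruction by itself, as noted below):

# Drop to Float32 for a cheap reciprocal-square-root estimate, then
# recover accuracy with a Newton-Raphson step: x <- x*(1.5 - 0.5*d*x^2).
function approx_rsqrt(d::Float64)
    x = Float64(inv(sqrt(Float32(d))))   # low-precision starting estimate
    x *= 1.5 - 0.5 * d * x * x           # one refinement step
    return x
end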

This is interesting. If Julia unrolls loops with StaticArrays, shouldn’t it also unroll loops like for i = 1:16? Perhaps it already does.

Also, it was said that we should avoid using packages.

Afaik @fastmath won’t emit this; see here if you want to use that trick. I wouldn’t call it idiomatic Julia, though.

Is this an official rule of the benchmarks game?

Honestly, I have no idea, but people have avoided it so far. I was referring to the post above.

I don’t think these are unrolled by default, but Unrolled.jl has a macro which does this: Unrolled.@unroll_loop for i = 1:16 ... will work, and in some quick tests it can be faster. (It’s a one-line macro, by the way, so it would be trivial to include without having to pull in the package.)
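
If you’d rather not depend on the package at all, a related trick (my suggestion, not from the posts above) is Base.Cartesian.@nexprs from the standard library, which pastes the loop body N times at macro-expansion time:

using Base.Cartesian

function sum16(xs)
    s = 0.0
    # expands to s += xs[1]; s += xs[2]; ...; s += xs[16]
    @nexprs 16 i -> (s += xs[i])
    return s
end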

Edit: then I read the rest of the thread, sorry! Does this count as manual or automated unrolling?

With regards to using packages: the benchmarks are so simple that there isn’t much point to packages, imo, and there is value in having completely self-contained examples that can be copy-pasted.

Of course one could use e.g. StaticArrays.jl for the n-body benchmark, but showing how one can create their own static vector in a few lines shows off Julia more than hiding it behind a package, imo.
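
Something along these lines, for instance (a minimal sketch of the idea; the type and function names are mine, not from any submitted program):

# A hand-rolled static 3-vector: an immutable struct of three Float64
# fields, which the compiler can keep in registers like a tuple.
struct Vec3
    x::Float64
    y::Float64
    z::Float64
end

Base.:+(a::Vec3, b::Vec3) = Vec3(a.x + b.x, a.y + b.y, a.z + b.z)
Base.:-(a::Vec3, b::Vec3) = Vec3(a.x - b.x, a.y - b.y, a.z - b.z)
Base.:*(s::Real, v::Vec3) = Vec3(s * v.x, s * v.y, s * v.z)
norm2(v::Vec3) = v.x^2 + v.y^2 + v.z^2   # squared Euclidean norm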

See also my thread: Julia needs work at "Benchmark Game": numbers seem off up to 100x slower, maybe "Julia AOT" entries needed?

Aren’t the programs in Kristofer’s repo in good shape? I believe some or all of them should be submitted. I guess I could do it (but should I?). I don’t want to take credit for anyone else’s work.

I decided to look into one of the programs that didn’t have a Julia implementation (though I see now it’s in Kristofer’s repo), and to translate this fastest [C++] non-multi-threaded version:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/spectralnorm-gpp-2.html

It uses the Eigen library, and since that’s allowed, I guess we could use similar [Julia] libraries, or even the Julia Eigen wrapper.

However, I just started by translating the commented-out “EQUIVALENT MATLAB IMPLEMENTATION”. It’s trivial to do by hand, but I opted to try the http://sciencecow.mit.edu/matlab-to-julia/ translator anyway. I ended up with this (maybe it just isn’t “EQUIVALENT”; anyone know what’s wrong?), which is fast on my very old laptop, but wrong:

julia> function approximate(n)
         A = zeros(n,n)
         for i=1:n
           for j=1:n
             A[i,j] = 1.0/((i+j)*(i+j+1)/2.0 + i+1.0)
           end
         end
         u = ones(n,1)
         v = zeros(n,1)
         for i=1:10
           v = A'*(A*u)
           u = A'*(A*v)
         end
         sqrt((u'*v)/(v'*v))
       end
approximate (generic function with 1 method)

julia> @time approximate(5500)
  5.461645 seconds (9.13 k allocations: 232.940 MB, 0.72% gc time)
1×1 Array{Float64,2}:
 0.361967
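
One guess, for what it’s worth (an assumption on my part, not verified): the benchmark defines the matrix with 0-based indices, a(i,j) = 1/((i+j)(i+j+1)/2 + i + 1) for i, j ≥ 0, so a 1-based Julia loop needs both indices shifted down by one, which simplifies to:

# hypothetical fix: substitute i-1, j-1 into the 0-based formula
A[i,j] = 1.0/((i+j-2)*(i+j-1)/2 + i)

If the commented-out MATLAB code kept the 0-based formula unchanged, that would explain a wrong value at unchanged speed.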

The translator got this line wrong (anyone know why? I should probably file a bug):

sqrt((u"*v)/(v"*v))

and I changed integers to floats here: A[i,j] = 1.0/((i+j)*(i+j+1)/2.0 + i+1.0)

I got the same result as expected, and it seems to be the same speed (I was just checking).

I opened up my MATLAB clone, Octave, just to make sure, and I get the same number eventually, probably after minutes, ten[s?] of them. At least Julia is way faster; can anyone compare real MATLAB to Julia for me?

You’re accessing the matrix in a suboptimal pattern. Try switching the order of the for loops.
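
In Julia that would look like this (same formula as in the post above, only the loop order changed; Julia arrays are column-major, so the first index should vary fastest):

function fill_A!(A)
    n = size(A, 1)
    for j = 1:n       # columns in the outer loop
        for i = 1:n   # rows in the inner loop: walks memory contiguously
            A[i,j] = 1.0/((i+j)*(i+j+1)/2.0 + i+1.0)
        end
    end
    return A
end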

Thanks, I assumed they had good MATLAB code, and I didn’t look into row- vs. column-major order (I assumed they did it differently, despite it being labeled “EQUIVALENT”).

Still, of course, I get the same wrong result.

This doesn’t change the timing for Julia, at least. And for Octave, I gave up waiting.

For a lower number (quicker to run), n = 550, I get about 6 to 7 seconds in Octave either way (sometimes the “better” version is faster, sometimes the other).

While in Julia I get:

julia> @time approximate(550)
  0.049661 seconds (70 allocations: 2.491 MB)
1×1 Array{Float64,2}:
 0.361967

There is also a “Recursive Fibonacci Benchmark using top languages on Github”.

It would be nice to do better there.
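
For reference, that whole benchmark is essentially just this (a sketch; the repo’s exact base case may differ):

# Naive recursive Fibonacci: deliberately not memoized, since the point
# of the benchmark is raw recursive-call overhead.
fib(n) = n <= 1 ? 1 : fib(n - 1) + fib(n - 2)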

From HN: "ok, owner of the repo here. So this project was purely to show the macro differences between interpreted ruby and compiled crystal to beginner ruby devs at a meet up.

I didn’t expect to get a memoized version of every language and really don’t think comparing them from a performance benchmark makes much sense."

FYI, the same comparison was removed from the benchmarks game and replaced with tasks that were still toys, but more than a dozen lines long.

Maybe this can be used for further improvement of the binary-trees benchmark.

It uses a TypedArena allocator and shows that it’s about four times faster than the default allocator.

Probably not: the current solution is about 2x slower than the fastest solution, while the linked version is ~5x slower 😉

Please don’t implement your own custom “arena” or “memory pool” or “free list” - they will not be accepted.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/description/binarytrees.html#binarytrees

If such a typed pool were part of the General registry, we could use it. The fastest C++ versions use either a Boost memory pool or some pool from Apache. You won’t beat the @sdanisch version, but you’d also have something that could be used for more than filling memory with tree nodes and then deallocating all of them as efficiently as possible.
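
For illustration, here is a minimal sketch of what such a typed pool could look like in Julia (my own sketch, not a registered package, and per the rule quoted above it could not be submitted as-is):

# Nodes are isbits and stored contiguously; children are pool indices.
# "Allocation" bumps an integer, and "freeing" every node at once just
# resets it.
struct Node
    left::Int32
    right::Int32   # 0 means leaf
end

mutable struct Pool
    nodes::Vector{Node}
    top::Int32
end

Pool(capacity::Integer) = Pool(Vector{Node}(undef, capacity), Int32(0))

@inline function alloc!(p::Pool, l::Int32, r::Int32)
    p.top += Int32(1)
    @inbounds p.nodes[p.top] = Node(l, r)
    return p.top
end

# Build a perfect binary tree of the given depth; returns the root index.
function build!(p::Pool, depth::Int)
    depth == 0 && return alloc!(p, Int32(0), Int32(0))
    l = build!(p, depth - 1)
    r = build!(p, depth - 1)
    return alloc!(p, l, r)
end

reset!(p::Pool) = (p.top = Int32(0); p)

# Usage: p = Pool(2^21 - 1); root = build!(p, 20); reset!(p)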