First, the good news: at the Debian Benchmarks Game, clicking “summary stats”, the top graph shows Julia is best for “Geometric Mean - Time plus Source Code size”. [Further down there is a similar graph, at first confusing because Julia is missing, but it seems to be a continuation, with some overlap, covering the slower languages.]
I was thinking, with 1.8 out any day now, how about a benchmarking challenge, at least looking more at the outliers there, since, on speed alone, Julia isn’t always fastest? Also a different challenge: who can make the smallest (by bytes, or by “gz” size) Julia programs? I like code-golfing, and if those programs also happen to be fast, or fastest, then great!
I beat the fastest “pidigits” Julia program there by 16.9% for speed (or 18.9% if using -O0, which the Benchmarks Game disallows) [EDIT: I had missed that there is a faster Julia program in the new “hand-written vector instructions | “unsafe” | naked ffi” category below (only “naked ffi” applies to the Julia program), which is still slower than the PHP program, which is not in that category]. I did it by porting from the PHP program, which uses a different algorithm, so I can’t take full credit. That seems allowed, if not too different, but it still feels like cheating. We are beaten 2x by that PHP program (and are even a bit slower than the Perl program, which is also a bit smaller). The original fastest Julia program is beaten 2.6x by the fastest programs (Rust and C++ are tied).
Maybe someone wants to look into porting the Perl program, to see if something faster or shorter is possible with that algorithm.
Julia startup is 0.13 sec., so taking that into account and running from the REPL, I get up to a 21.3% speedup (compared to the original, including its startup overhead), though nowhere close to the 2x of PHP…
julia> @time @suppress pidigits(10000)
1.718259 seconds (1.18 M allocations: 9.165 GiB, 4.02% gc time)
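Since a first `@time` call includes JIT compilation, here is a minimal sketch (with a made-up function `f`, not the benchmark code) of separating the one-time compile overhead from steady-state time by simply calling twice:

```julia
# First call compiles the method; later calls reuse the compiled code.
f(n) = sum(sqrt(i) for i in 1:n)

t_first  = @elapsed f(10^6)  # includes one-time compilation of f
t_second = @elapsed f(10^6)  # steady-state runtime only
println("first: ", t_first, " s, second: ", t_second, " s")
```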
Note, the C++ program uses the following (and the PHP code uses ob_implicit_flush(1);, which I believe has the same effect):
std::ios_base::sync_with_stdio(false);
https://en.cppreference.com/w/cpp/io/ios_base/sync_with_stdio
If the synchronization is turned off, the C++ standard streams are allowed to buffer their I/O independently, which may be considerably faster in some cases.
How would I replicate this in Julia (and is it only relevant when mixing with C code)? It seems not too important here, though, since the C program is only slightly slower (by 0.02 sec., or 2.8%), and I only see it using printf. This might matter for other programs, and also (what I see from a different Julia program, the one mentioned below):
# The expanded form of Printf.@printf macro takes a significant time to compile,
# accounting for 10% of the total program runtime. Base.Ryu.writefixed(::Float64, ::Int)
# should already be compiled into the default system image.
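As a rough Julia counterpart to unsynchronized, independently buffered streams (a sketch of the general batching idea, not necessarily what the Benchmarks Game rules intend): accumulate output in an `IOBuffer` and write it to stdout in one go, so each small print doesn’t trigger its own write:

```julia
# Sketch: batch many small prints into a single write to stdout.
buf = IOBuffer()
for i in 1:3
    print(buf, i, '\n')   # cheap in-memory append, no I/O yet
end
out = take!(buf)          # grab the accumulated bytes
write(stdout, out)        # one write instead of many
```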
About “box plot IQRs overlap”: the “lower fence” is higher for Julia than for most languages (at least of those in the former “How many more times CPU seconds?” graph); I’m guessing it reflects the extra startup (and/or compilation) overhead of Julia:
https://blog.bioturing.com/2018/05/22/how-to-compare-box-plots/
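For reference, the usual fence definitions (Q1 − 1.5·IQR and Q3 + 1.5·IQR, as in the linked post), sketched with made-up timing numbers:

```julia
using Statistics

xs = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 10.0]   # made-up timings
q1, q3 = quantile(xs, 0.25), quantile(xs, 0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5iqr   # points below this count as low outliers
upper_fence = q3 + 1.5iqr   # 10.0 lies above this, so it is an outlier
```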
The fastest “n-body” C++ program is 40% slower than Julia (which is fastest, tied with “Classic Fortran” for speed while much smaller), and the fastest C program is 60% slower than Julia, so looking into the VRSQRTPS “legacy” assembly instruction isn’t a high priority (it is likely already used in the Julia program, enabled by @fastmath, with much simpler code that also happens to be portable, unlike the C/C++ code). I had written the notes below to look into, which I found intriguing, thinking we needed to translate that function (Julia’s code uses Newton–Raphson, not Goldschmidt’s algorithm):
https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/nbody-gcc-9.html
#include <x86intrin.h>
[..]
// utilize vrsqrtps to compute an approximation of 1/sqrt(x) with float,
// cast back to double and refine using a variation of
// Goldschmidt’s algorithm to get < 1e-9 error
static inline __m256d _mm256_rsqrt_pd(__m256d s) {
https://reviews.llvm.org/D46498
The legacy VRCPPS/VRSQRTPS instructions aren’t available in 512-bit versions. The new increased precision versions are. So we can use those to implement v16f32 reciprocal estimates.
For KNL CPUs we can probably use VRCP28PS/VRSQRT28PS and avoid the NR step altogether, but I leave that for a future patch.
https://robert.ocallahan.org/2021/09/emulating-amd-rsqrtss-etc-on-intel.html
I recently discovered that the approximate arithmetic instructions RSQRTSS, RCPSS and friends do not produce identical results on Zen vs Intel. […] we can insert code to emulate the AMD behavior of these instructions. I just needed to figure out a good way to implement that emulation.
Reverse engineering AMD’s exact algorithm and reimplementing it with Intel’s instructions seemed like it would be a lot of work and tricky to reimplement correctly.
https://clang.llvm.org/doxygen/avxintrin_8h.html
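For comparison, the Newton–Raphson refinement can be sketched in Julia as follows (a hedged illustration, not the benchmark code; the `Float32` sqrt here merely stands in for the low-precision hardware estimate that VRSQRTPS would give):

```julia
# One Newton–Raphson step for y ≈ 1/sqrt(x):  y ← y * (1.5 - 0.5*x*y^2).
# Each step roughly doubles the number of correct bits.
function rsqrt_nr(x::Float64)
    y = Float64(1.0f0 / sqrt(Float32(x)))  # stand-in for the hardware estimate
    for _ in 1:2                           # two steps reach Float64 accuracy
        y = y * (1.5 - 0.5 * x * y * y)
    end
    return y
end
```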
#GC.enable(false)
function pidigits(N::Int)
    #ob_implicit_flush(1);
    #ob_start(NULL, 4096);
    w = big"0"
    k = 1
    n1 = big"4"
    n2 = big"3"
    d = big"1"
    i = 0
    while true
        # digit
        u = n1 ÷ d
        v = n2 ÷ d
        if u == v
            print(u) # echo gmp_strval($u);
            i += 1
            if i % 10 == 0
                print("\t:", i, '\n') # echo "\t:" , $i , "\n";
                #GC.gc()
            end
            if i == N
                break
            end
            # extract
            u = d * (-10 * u)
            n1 = n1 * 10
            n1 = n1 + u
            n2 = n2 * 10
            n2 = n2 + u
        else
            # produce
            k2 = k << 1
            u = n1 * (k2 - 1)
            v = n2 + n2
            w = n1 * (k - 1)
            n1 = u + v
            u = n2 * (k + 2)
            n2 = w + u
            d = d * (k2 + 1)
            k += 1
        end
    end
    if i % 10 != 0
        # TODO: finish this line, not needed for N = 10000 (nor even sure it's needed in the original PHP code):
        # print(" "^(10 - N % 10), "\t:", N, '\n') # echo str_repeat(' ', 10 - $N % 10), "\t:", $N, "\n";
    end
end

n = parse(Int, ARGS[1])
pidigits(n)