Benchmark game challenge and some optimization questions

First the good news: at the Debian benchmark game, clicking “summary stats”:

I see the top graph showing Julia is best for “Geometric Mean - Time plus Source Code size”. [Further down there is a similar graph, at first confusing since Julia is missing, but it seems to be a continuation, with some overlap, covering the slower languages.]

I was thinking, with 1.8 out any day now, how about a benchmarking challenge, at least looking more at the outliers there, since on speed alone Julia isn’t always fastest. Also a different challenge: who can make the smallest Julia programs (by source size, or by “gz”)? I like code-golfing, and if those programs also happen to be fast, or fastest, then great!

I beat the fastest “pidigits” Julia program there [EDIT: I had missed that there is a faster Julia program in the new “hand-written vector instructions | “unsafe” | naked ffi” category below (only “naked ffi” applies to the Julia program), which is still slower than the PHP program that is not in that category] by 16.9% (for speed; 18.9% if using -O0, which the Benchmark game disallows), by instead porting from the PHP program. It uses a different algorithm, so I can’t take credit for it; that seems allowed, if not too different, but still feels like cheating. We are beaten by 2x by that PHP program (and are even a bit slower than the Perl program, which is also a bit smaller). The original fastest Julia program is beaten by 2.6x by the fastest programs (Rust and C++ are tied).

Maybe someone wants to look into porting the Perl program, to see if something faster or shorter is possible with that algorithm.

Julia startup is 0.13 sec., so taking that into account and running from the REPL, I get up to a 21.3% speedup (compared to the original, including its startup overhead), though nowhere near the 2x of PHP…

julia> @time @suppress pidigits(10000)
  1.718259 seconds (1.18 M allocations: 9.165 GiB, 4.02% gc time)

Note, the C++ program uses the following (and the PHP code uses ob_implicit_flush(1);, which I believe amounts to the same):

std::ios_base::sync_with_stdio(false);

https://en.cppreference.com/w/cpp/io/ios_base/sync_with_stdio

If the synchronization is turned off, the C++ standard streams are allowed to buffer their I/O independently, which may be considerably faster in some cases.

How would I replicate this (and is it only relevant when mixing with C code)? It seems not too important here, though, since the C program is only slightly slower (by 0.02 sec., or 2.8%) and I only see it using printf. This might be important for other programs, and there is also this (from a different Julia program, the one mentioned below):

    # The expanded form of Printf.@printf macro takes a significant time to compile,
    # accounting for 10% of the total program runtime. Base.Ryu.writefixed(::Float64, ::Int)
    # should already be compiled into the default system image.
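
The closest equivalent I can think of in Julia (just a sketch of my own, not from any benchmark entry) is to batch the many small prints into one buffered write, which is roughly what ob_start() and sync_with_stdio(false) buy the PHP and C++ programs:

# Sketch: collect output in an in-memory buffer and write it out once,
# instead of paying per-print I/O overhead.
function print_batched(io::IO, lines)
    buf = IOBuffer()
    for l in lines
        print(buf, l, '\n')
    end
    write(io, take!(buf))   # one large write instead of many small ones
end

print_batched(stdout, ("line $i" for i in 1:3))

Whether the benchmark rules would consider that the “same algorithm” I don’t know.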

About “box plot IQRs overlap”: the “lower fence” is higher for Julia than for most languages (at least those in the former “How many more times CPU seconds?” graph); I’m guessing it reflects the extra startup (and/or compilation) overhead of Julia.

The fastest “n-body” C++ program is 40% slower than Julia (which is fastest, tied with “Classic Fortran” for speed while much smaller), and the fastest C program is 60% slower than Julia, so looking into the VRSQRTPS “legacy” assembly instruction isn’t a high priority (it is likely already used by the Julia program, enabled by @fastmath and much simpler code that also happens to be portable, unlike the C/C++ code). I had written the notes below to look into, which I found intriguing, thinking we needed to translate that function (Julia’s code uses Newton–Raphson, not Goldschmidt’s algorithm):

https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/nbody-gcc-9.html

#include <x86intrin.h>
[..]
// utilize vrsqrtps to compute an approximation of 1/sqrt(x) with float,
// cast back to double and refine using a variation of
// Goldschmidt’s algorithm to get < 1e-9 error
static inline __m256d _mm256_rsqrt_pd(__m256d s) {

https://reviews.llvm.org/D46498

The legacy VRCPPS/VRSQRTPS instructions aren’t available in 512-bit versions. The new increased precision versions are. So we can use those to implement v16f32 reciprocal estimates.

For KNL CPUs we can probably use VRCP28PS/VRSQRT28PS and avoid the NR step altogether, but I leave that for a future patch.

I recently discovered that the approximate arithmetic instructions RSQRTSS, RCPSS and friends do not produce identical results on Zen vs Intel. […] we can insert code to emulate the AMD behavior of these instructions. I just needed to figure out a good way to implement that emulation.

Reverse engineering AMD’s exact algorithm and reimplementing it with Intel’s instructions seemed like it would be a lot of work and tricky to reimplement correctly.

https://clang.llvm.org/doxygen/avxintrin_8h.html
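
For reference, the Newton–Raphson refinement mentioned above (as opposed to Goldschmidt’s algorithm) is just one extra step on top of a cheap hardware estimate; a minimal sketch of that step:

# One Newton-Raphson step for y ≈ 1/sqrt(s):  y' = y * (1.5 - 0.5 * s * y^2)
rsqrt_step(s, y) = y * (1.5 - 0.5 * s * y * y)

s  = 2.0
y0 = Float64(Float32(1 / sqrt(Float32(s))))  # stand-in for a low-precision estimate (e.g. from VRSQRTPS)
y1 = rsqrt_step(s, y0)                       # one step roughly doubles the number of correct bits
abs(y1 - 1/sqrt(s)) < abs(y0 - 1/sqrt(s))    # true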

Here is my port of the PHP pidigits program:

#GC.enable(false)
function pidigits(N::Int)
    #ob_implicit_flush(1);
    #ob_start(NULL, 4096);

    w = big"0"

    k = 1
    n1 = big"4"
    n2 = big"3"
    d = big"1"

    i = 0
    while true
        #digit
        u = n1 ÷ d
        v = n2 ÷ d
        if u == v
            print(u) # echo gmp_strval($u);
            i += 1
            if i % 10 == 0
                print("\t:", i, '\n') # echo "\t:" , $i , "\n";
                #GC.gc()
            end
            if i == N
                break
            end
            #extract
            u = d * (-10 * u)
            n1 = n1 * 10
            n1 = n1 + u
            n2 = n2 * 10
            n2 = n2 + u
        else
            #produce
            k2 = k << 1
            u = n1 * (k2 - 1)
            v = n2 + n2
            w = n1 * (k - 1)
            n1 = u + v
            u = n2 * (k + 2)
            n2 = w + u
            d = d * (k2 + 1)
            k += 1
        end
    end
    if i % 10 != 0
        # TODO: finish this line, not needed for N = 10000 (nor even sure it's needed in the original PHP code):
        # print(' '^(10 - N % 10), "\t:", N, '\n') # echo str_repeat(' ', 10 - $N % 10), "\t:", $N, "\n";
    end
end

n = parse(Int, ARGS[1])
pidigits(n)

Programs that need help (besides the 2.6x slower pidigits already mentioned):

7.7x slower, despite being multi-threaded. Tree code seems difficult for Julia; would it be allowed (by the Debian Benchmark game, since non-idiomatic), and could using Libc.malloc and free instead help?

It seems very unfair that Julia isn’t allowed to use external packages (which would also be precompiled, helping with startup), while the fastest program, in C++, uses a non-std library:

#include <boost/iterator/counting_iterator.hpp>

and:

using MemoryPool = std::pmr::monotonic_buffer_resource;

Another is 6x slower, since it’s not multi-threaded; the 4.2x slower Lisp SBCL #5 isn’t either, so Julia is only 42% slower than that fastest single-threaded one (and in a sense only 50% slower than the fastest multi-threaded one).

4.6x slower, because not multi-threaded; also beaten by PHP and Lua 5.4.4 (the latter single-threaded), and that’s not even LuaJIT.

Another is 2.6x slower, despite its multi-threading.

Another is 80% slower; multi-threading would likely help. It is only 0.03 sec. slower than the only faster [C] single-threaded program.

Another is 60% slower, in part because 3 of the 4 cores are less loaded than ideal. This may be an issue in Julia, and maybe already solved in Julia 1.8 or 1.9?

Another is again 60% slower (while beating C, though not C++ or Rust, the fastest).

Another is 50% slower than the fastest, which also happens to not be multi-threaded (all the programs in between are), but that fastest C code has some voodoo (the next fastest [C] code is 10% slower, even though multi-threaded) which seems a bit shady, but could be ported (a sketch follows the quote below):

 * This benchmark uses a lookup table for the symbolic codes from the
 * output space of the random number generator, so in the case of an
 * random number generator with a modest number of outputs (the modulo
 * IM in this case), the table even fits in CPU cache. [..]
 *  Adding random number sequence fast skipping would also allow a threading
 * speedup, though it my be limited by memory bandwidth.
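
A rough sketch of that lookup-table idea in Julia (my own guess at it, untested, with made-up probabilities and an example modulus):

# Hypothetical sketch: if the RNG has only IM distinct outputs, precompute which
# symbol each output maps to, so the inner loop becomes a single table lookup.
function build_lookup(symbols::Vector{Char}, probs::Vector{Float64}, IM::Int)
    table = Vector{Char}(undef, IM)
    cum = cumsum(probs)
    for r in 0:IM-1
        p = r / IM                       # RNG output scaled into [0, 1)
        idx = searchsortedfirst(cum, p)  # first symbol whose cumulative probability covers p
        table[r + 1] = symbols[min(idx, length(symbols))]
    end
    return table
end

# Example with made-up probabilities; 139968 is just an illustrative modulus:
table = build_lookup(['a', 'c', 'g', 't'], [0.3, 0.2, 0.3, 0.2], 139968)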

Oh @palli, what are you doing;)

Classic nerd sniping: if you are going to start a thread with ‘This piece of PHP is faster than Julia’, I’ll bite…


One thing I recently learned was using

mutable struct Node{T}
    data::T
    l::Node
    r::Node
    Node(i::T, l, r) where {T} = new{T}(i, l, r)
    function Node(i::T) where {T}
        n = new{T}()  # incomplete initialization; l and r are filled in below
        n.data = i
        n.l = n       # a leaf points to itself, so l === r marks a leaf
        n.r = n
        n
    end
end

as a base tree structure. If l === r, we’re at a leaf, else we’re not. Saves us from dynamic dispatch, since we’re always checking a Node! Not sure if this can be used here though.
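
For example, a make/check pair for the benchmark using this layout could look like the following sketch (my illustration, not an actual entry):

# A leaf's children point back at itself, so l === r detects it with no dispatch.
make(depth) = depth == 0 ? Node(0) : Node(0, make(depth - 1), make(depth - 1))

function check(n::Node)
    n.l === n.r && return 1             # leaf
    return 1 + check(n.l) + check(n.r)  # internal node
end

t = make(4)
check(t)   # 2^5 - 1 = 31 nodes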

I was also playing with the thought of having an array backed tree, to keep all elements close together for a cache optimization. I haven’t tried implementing it yet though.


BenchmarksGame is often not that representative - I think in the past there were some quite weird things done with regard to which program versions were allowed in one language but not in another. There was also a period where PRs just weren’t merged at all? Not sure if that’s still the case.


So ‘binary-trees’ it is then for now? Anyone care to start a provocative new thread?

I have no intention of trying to improve BenchmarkGame in the foreseeable future - I was merely suggesting a trick for these trees in particular on how to avoid dispatch :slight_smile: If someone wants to implement it, they’re welcome to do so.

Admittedly, I also haven’t looked at the other benchmarks - I don’t find BenchmarkGame to be particularly representative of either performance (considering cache effects in context) or well-maintainable code, as some entries are optimized to an almost ridiculous degree.


I was curious enough to have a look. I played around with this entry and see for

@time binary_trees_serial(stdout, 21)
@time binary_trees_serial(stdout, 21)

@time (GC.enable(false); binary_trees_serial(stdout, 21); GC.enable(true); GC.gc())
@time (GC.enable(false); binary_trees_serial(stdout, 21); GC.enable(true); GC.gc())

@time binary_trees_parallel(stdout, 21)
@time binary_trees_parallel(stdout, 21) 

@time (GC.enable(false); binary_trees_parallel(stdout, 21); GC.enable(true); GC.gc())
@time (GC.enable(false); binary_trees_parallel(stdout, 21); GC.enable(true); GC.gc())

@time binary_trees_parallel_opt(stdout, 21)
@time binary_trees_parallel_opt(stdout, 21) 

@time (GC.enable(false); binary_trees_parallel_opt(stdout, 21); GC.enable(true); GC.gc())
@time (GC.enable(false); binary_trees_parallel_opt(stdout, 21); GC.enable(true); GC.gc())

the following benchmark results

# serial version of the benchmark entry
 12.323571 seconds (305.54 M allocations: 9.108 GiB, 18.83% gc time, 0.39% compilation time)
 12.018895 seconds (305.49 M allocations: 9.104 GiB, 16.40% gc time)
# serial version of the benchmark entry, GC delayed to end of run
 11.442474 seconds (305.49 M allocations: 9.104 GiB, 11.72% gc time)
 11.242221 seconds (305.49 M allocations: 9.104 GiB, 10.70% gc time)
# benchmark entry
  7.832903 seconds (308.76 M allocations: 9.164 GiB, 49.68% gc time, 2.25% compilation time)
  7.553655 seconds (308.61 M allocations: 9.156 GiB, 49.83% gc time)
# benchmark entry, GC delayed to end of run
  4.925564 seconds (308.56 M allocations: 9.154 GiB, 26.80% gc time)
  5.600125 seconds (308.57 M allocations: 9.155 GiB, 22.83% gc time)
# small improvements
  7.489242 seconds (305.54 M allocations: 9.129 GiB, 58.96% gc time, 1.16% compilation time)
  7.337273 seconds (305.49 M allocations: 9.125 GiB, 56.08% gc time)
# small improvements, GC delayed to end of run
  4.890634 seconds (305.49 M allocations: 9.125 GiB, 28.99% gc time)
  4.442699 seconds (305.49 M allocations: 9.125 GiB, 27.33% gc time)

So small improvements seem possible (although eliminating allocations seems to be explicitly prohibited by the benchmark rules) and this benchmark checks Julia’s runtime allocation mechanisms and garbage collector AFAIU.

Yes, I recall some such great result. But if you yourself use Libc.malloc and free (not idiomatic for Julia code, but who’s to say it’s disallowed, since it ships with Julia), you would never run out of memory or have such a bug. I guess the rationale for the rule about disabling the GC is that if you do it and get away with it in the benchmark, the same code with different defaults might not, but that wouldn’t happen with the malloc/free alternative, even if you never disabled the GC. I’m not sure if the GC would still slow things down; that could be tested. I would be interested in the fastest version of this, without bugs, even if disallowed by the rules.
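
For concreteness, a rough sketch of what I mean by using Libc.malloc and free for the tree nodes (my own illustration, definitely not idiomatic Julia and not an actual entry):

# Hypothetical sketch: nodes live in malloc'd memory, completely outside the GC.
struct CNode                 # plain bits, so it can be stored through a raw pointer
    left::Ptr{CNode}
    right::Ptr{CNode}
end

function make_c(depth::Int)::Ptr{CNode}
    p = convert(Ptr{CNode}, Libc.malloc(sizeof(CNode)))
    if depth == 0
        unsafe_store!(p, CNode(C_NULL, C_NULL))            # leaf
    else
        unsafe_store!(p, CNode(make_c(depth - 1), make_c(depth - 1)))
    end
    return p
end

function check_c(p::Ptr{CNode})::Int
    node = unsafe_load(p)
    node.left == C_NULL && return 1
    return 1 + check_c(node.left) + check_c(node.right)
end

function free_c(p::Ptr{CNode})
    node = unsafe_load(p)
    if node.left != C_NULL
        free_c(node.left)
        free_c(node.right)
    end
    Libc.free(p)
end

t = make_c(10)
check_c(t)   # 2^11 - 1 = 2047
free_c(t)

With explicit free the GC never sees these nodes at all, which is exactly what makes me unsure whether the rules would allow it.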

The rules are, however, unfair in allowing C++ to use the following (which IS in std, though maybe not very idiomatic?):

using MemoryPool = std::pmr::monotonic_buffer_resource;

while Julia is not allowed to use Blobs.jl, which seems to be meant for exactly this. Yes, that’s an external library, but so is Boost, which the C++ program uses.

[EDIT: The program I had in this post was wrong, and the speedup I had went away when corrected, keeping here otherwise relevant info.]

I also see that this fastest [Rust] program has something related (and I don’t yet do anything similar in Julia), either to make multi-threading faster or just to make it work at all: mandelbrot Rust #7 program (Benchmarks Game)

    let stdout_unlocked = std::io::stdout();
    // Main thread only can print to stdout
    let mut stdout = stdout_unlocked.lock();

[.. then the only line using the above is the last line in main:
    let _ = stdout.write_all(&rows);
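
If I understand the Rust code right, the Julia analogue (a sketch of my own, untested against the benchmark) would be to let worker threads only fill per-row byte buffers, with a single task doing all the writing:

# Sketch: threads compute rows into preallocated buffers; only the main task writes.
function render_rows(io::IO, nrows::Int, rowbytes::Int)
    rows = [Vector{UInt8}(undef, rowbytes) for _ in 1:nrows]
    Threads.@threads for y in 1:nrows
        fill!(rows[y], 0x00)   # stand-in for the per-row pixel computation
    end
    for r in rows              # main task is the only writer to io
        write(io, r)
    end
end

render_rows(devnull, 16, 16)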

I did find this in printf() and stdio in the Julia runtime · The Julia Language:

In special cases, like signal handlers, where the full libuv infrastructure is too heavy, jl_safe_printf() can be used to write(2) directly to STDERR_FILENO:
[…]
Use of ios.c in these modules is mostly self-contained and separated from the libuv I/O system. However, there is one place where femtolisp calls through to jl_printf() with a legacy ios_t stream.
[…]
This allows uv_stream_t pointers to point to ios_t streams.

Sometimes speeding up is simple [EDIT: this was for the invalid program]: 0m0,882s vs. 0m1,568s, i.e. 43.8% faster. If it holds up on the testing machine, Julia would be at 0.80 sec., or 12% faster than the fastest [Rust] program (which is at 0.91 sec.).

I had hopes the known bug was trivial, with a large prefix of the file correct:

$ cmp mandel_new.out mandel_org.out
mandel_new.out mandel_org.out differ: byte 1516, line 3

[the fix should be easy, <s>at least keeping the speed advantage</s> (the speedup was mainly from moving the division, which is slow, out of the inner-most loop):]

[deleted code]

So you think something like my current best from here

function binary_trees_parallel_opt_gc(io, n::Int)
    write(io, "stretch tree of depth $(n+1)\t check: $(check(make(n+1)))\n")

    long_tree::Node = make(n)  # the long-lived tree has depth n

    minDepth::Int = 4
    resultSize::Int = trunc(Int,(n - minDepth) / 2) + 1
    results = Vector{String}(undef,resultSize)
    for depth::Int = minDepth:2:n
        niter::Int = 1 << (n - depth + minDepth)
        cs = Vector{Int}(undef, niter)
        GC.enable(false)
        Threads.@threads for i = 1:niter
            cs[i] = check(make(depth))
        end#for
        GC.enable(true)
        GC.gc(false)
        c::Int = sum(cs)
        index::Int = trunc(Int,(depth - minDepth)/2) + 1
        results[index] = "$niter\t trees of depth $depth\t check: $c\n"
    end
    for i in results
        write(io,i)
    end

    write(io, "long lived tree of depth $n\t check: $(check(long_tree))\n")
end#function

is non-conformant? Note: I disable the GC only temporarily and start an incremental collection afterwards.

Without doing anything special (just suppressing some useless instructions while not changing the algorithm) I got a faster (if my computer is similar to yours) pidigits:

# pidigits benchmark
using Printf
function pidigits(N::Int, printout::Bool=true)
  i=k=ns=0
  k1=1
  a=BigInt(0)
  n=d=BigInt(1)
  while i<N
    a+=2n
    a*=k1+=2
    d*=k1
    n*=k+=1
    if a<n continue end
    t,u=divrem(a+3n,d)
    if d<=u+n continue end
    a=10(a-d*t)
    n*=10
    ns=10ns+Int(t)
    i+=1
    if mod(i,10)==0
      if printout
        @printf("%10.10d\t:%d\n",ns,i)
      end
      ns=0
    end 
  end
  if printout && mod(N,10)!=0
    @printf("%-10d\t:%d\n",ns,N)
  end
end

which gives

@time pidigits(10000)
...
  0.853739 seconds (859.27 k allocations: 8.306 GiB, 6.57% gc time)

suppressing output does not make it appreciably faster.


In your code the GC is disabled for a stretch bounded by niter, which it seems could be arbitrarily large; that seems shady, since you call make that many times, and with too many you would run out of memory. How much faster is this? And more importantly, why? I thought allocations should be essentially free until you run out of memory and trigger the GC… If you could disable the GC for a bounded, fixed number of allocations at a time, and it helps…, it seems defensible. Since there are no deallocations/pruning of the tree (am I wrong?), you could say a program without a GC would also fill memory, so it’s defensible to disable the GC indefinitely, but I think the point there is to check how fast allocations are and the potential overhead of the GC (in real-world code you would also deallocate, make garbage, and relying on that not happening is maybe not in the spirit of the benchmark).

FYI: what looks to be a comment in the OCaml code, and is one, seems to actually be code using a language extension (so it is just a comment on older/other versions without the extension) to control the GC:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-ocaml-2.html

let () =
  (* Gc.set { (Gc.get()) with Gc.minor_heap_size = 1024 * 1024; max_overhead = -1; }; *)
  let c = check (make stretch_depth) in
  Printf.printf "stretch tree of depth %i\t check: %i\n" stretch_depth c

https://ocaml.org/manual/doccomments.html

Not only do all the fast programs use memory pools, including the Fortran one, but (I’m not sure) I think this Fortran code may actually be calling a C library for that(?):

https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-ifc-2.html

module apr
  use iso_c_binding
  implicit none

  interface

    integer(c_int) function apr_initialize() bind(C)
      import
    end function

    type(c_ptr) function apr_palloc(p,size) bind(C)

I would argue we should also be allowed to call C, since we have the ccall keyword and it’s very idiomatic to use it. Is use iso_c_binding common in Fortran (I know it’s now standardized, previously a GNU extension), and is it maybe the only way to do such pointer/tree code?

Measured with @btime, without IO

  7.476 s (304360009 allocations: 9.03 GiB) # original
  4.176 s (301292435 allocations: 9.00 GiB) # current best

and with IO

[...]
  8.563 s (308611548 allocations: 9.16 GiB)
[...]
  4.382 s (305485960 allocations: 9.13 GiB)

I can only guess: no interaction of (stop the world?) GC with the hot loop.

See also Julia’s own: Julia Micro-Benchmarks

It really needs to be rerun and updated (not done since 1.0), e.g. before 1.8 is released. And if anything in Julia can be fixed for those or Debian’s benchmarks to make them better, that would be great! Note, both have a pi benchmark, though very different ones.

Note also this discussion (and adding memory pools as a package; I’m not sure Debian will allow that, maybe if it were in the standard library?):