Benchmark game challenge and some optimization questions

First the good news, at the Debian benchmark game, clicking “summary stats”:

I see the top graph showing Julia is best for “Geometric Mean - Time plus Source Code size”. [Then further down there is a similar graph, at first confusing since Julia is missing, but it seems to be a continuation, with some overlap, covering the slower languages.]

I was thinking, with 1.8 out any day now, how about a benchmarking challenge, at least looking more at the outliers there, since, speed-wise only, Julia isn’t always fastest. Also a different challenge: who can make the smallest (by source size, or by “gz”) Julia programs? I like code-golfing, and if those programs also happen to be fast, or fastest, then great!

I beat the fastest “pidigits” Julia program there [EDIT: I missed that there was a faster Julia program in the new “hand-written vector instructions | “unsafe” | naked ffi” category below (only “naked ffi” applies for the Julia program), which is still slower than the PHP program not in that category] by 16.9% (for speed, or 18.9% if using -O0, which the Benchmarks Game disallows), by instead porting from the PHP program, which has a different algorithm, so I can’t take credit for it. That seems allowed, if not too different, but still feels like cheating. We are beaten by 2x by that PHP program (and are even a bit slower than the Perl program, which is also a bit smaller). The original fastest Julia program is beaten by 2.6x by the fastest programs (Rust and C++ are tied).

Maybe someone wants to look into porting the Perl program, to see whether a faster or shorter program is possible with that algorithm.

Julia startup is 0.13 sec., so taking that into account, running from the REPL, I get up to a 21.3% speedup (compared to the original including startup overhead), though nowhere close to the 2x of PHP…

julia> @time @suppress pidigits(10000)
  1.718259 seconds (1.18 M allocations: 9.165 GiB, 4.02% gc time)

Note, the C++ program uses the following (and the PHP code uses ob_implicit_flush(1);, which I believe has the same effect):

std::ios_base::sync_with_stdio(false);

https://en.cppreference.com/w/cpp/io/ios_base/sync_with_stdio

If the synchronization is turned off, the C++ standard streams are allowed to buffer their I/O independently, which may be considerably faster in some cases.

How would I replicate this in Julia (and does it only matter when mixing with C code)? It seems not too important, though, with the C program only slightly slower (by 0.02 sec., or 2.8%), and I only see it using printf. This might be important for other programs, and also (what I see from a different Julia program, the one mentioned below):

    # The expanded form of Printf.@printf macro takes a significant time to compile,
    # accounting for 10% of the total program runtime. Base.Ryu.writefixed(::Float64, ::Int)
    # should already be compiled into the default system image.
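I don’t know of a direct Julia equivalent of `sync_with_stdio(false)`, but here is a hedged sketch of one way to get a similar effect: batch many small prints into one write (`buffered_print` is my own illustrative name, not an existing API):

```julia
# Sketch, under the assumption that fewer, larger writes to stdout are
# cheaper than many small ones (similar to what the C++ setting enables).
function buffered_print(io::IO, lines)
    buf = IOBuffer()                 # accumulate output in memory
    for (i, line) in enumerate(lines)
        print(buf, line, "\t:", i, '\n')
    end
    write(io, take!(buf))            # flush everything in one write call
end
```

For pidigits this would mean printing each 10-digit line into the buffer and writing to stdout once at the end (or every so many lines).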

About “box plot IQRs overlap”: the “lower fence” is higher for Julia than for most languages (at least of those in the former “How many more times CPU seconds?” graph); I’m guessing it reflects Julia’s extra startup (and/or compilation) overhead:
https://blog.bioturing.com/2018/05/22/how-to-compare-box-plots/

The fastest “n-body” C++ program is 40% slower than Julia (which is fastest, tied with “Classic Fortran” for speed while much smaller), and the fastest C program is 60% slower than Julia, so looking into the VRSQRTPS “legacy” assembly instruction isn’t a high priority (it is likely already used in the Julia program, enabled by @fastmath and much simpler code, which also happens to be portable, unlike the C/C++ code). I had written the material below to look into, which I found intriguing, thinking we needed to translate that function (Julia’s code uses Newton–Raphson, not Goldschmidt’s algorithm):

https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/nbody-gcc-9.html

#include <x86intrin.h>
[..]
// utilize vrsqrtps to compute an approximation of 1/sqrt(x) with float,
// cast back to double and refine using a variation of
// Goldschmidt’s algorithm to get < 1e-9 error
static inline __m256d _mm256_rsqrt_pd(__m256d s) {
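For comparison, a hedged sketch (my own illustration, not the benchmark code) of the Newton–Raphson scheme, as opposed to Goldschmidt’s: seed with a low-precision estimate, then one refinement step already brings the relative error well under 1e-9:

```julia
# One Newton–Raphson step for y ≈ 1/sqrt(x): y ← y * (1.5 - 0.5*x*y^2).
# The Float32 round-trip stands in for the hardware estimate here;
# VRSQRTPS gives a coarser ~12-bit seed, hence the extra refinement
# needed in the C code above.
function rsqrt_nr(x::Float64)
    y = Float64(1.0f0 / sqrt(Float32(x)))   # coarse seed, ~1e-7 relative error
    return y * (1.5 - 0.5 * x * y * y)      # each step roughly squares the error
end
```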

https://reviews.llvm.org/D46498

The legacy VRCPPS/VRSQRTPS instructions aren’t available in 512-bit versions. The new increased precision versions are. So we can use those to implement v16f32 reciprocal estimates.

For KNL CPUs we can probably use VRCP28PS/VRSQRT28PS and avoid the NR step altogether, but I leave that for a future patch.

https://robert.ocallahan.org/2021/09/emulating-amd-rsqrtss-etc-on-intel.html

I recently discovered that the approximate arithmetic instructions RSQRTSS, RCPSS and friends do not produce identical results on Zen vs Intel. […] we can insert code to emulate the AMD behavior of these instructions. I just needed to figure out a good way to implement that emulation.

Reverse engineering AMD’s exact algorithm and reimplementing it with Intel’s instructions seemed like it would be a lot of work and tricky to reimplement correctly.

https://clang.llvm.org/doxygen/avxintrin_8h.html

#GC.enable(false)
function pidigits(N::Int)

# (the PHP original used: ob_implicit_flush(1); ob_start(NULL, 4096);)

w = big"0"

k = 1
n1 = big"4"
n2 = big"3"
d = big"1"

i = 0
while true
   #digit
   u = n1 ÷ d;
   v = n2 ÷ d;
   if u == v
      print(u) # echo gmp_strval($u);
      i+=1
      if i % 10 == 0
         print("\t:", i, '\n') # echo "\t:" , $i , "\n";
#GC.gc()
      end
      if i == N
         break
      end
      #extract
      u = d * (-10 * u);
      n1 = n1 * 10;
      n1 = n1 + u;
      n2 = n2 * 10;
      n2 = n2 + u;
   else
      #produce
      k2 = k << 1;
      u = n1 * (k2 - 1);
      v = n2 + n2;
      w = n1 * (k - 1);
      n1 = u + v;
      u = n2 * (k + 2);
      n2 = w + u;
      d = d * (k2 + 1);
      k+=1;
   end
end
if i % 10 != 0
   # pad the final partial line; mirrors the PHP: echo str_repeat(' ', 10 - $N % 10), "\t:", $N, "\n";
   # (not exercised for N = 10000, since 10000 % 10 == 0)
   print(' '^(10 - N % 10), "\t:", N, '\n')
end
end

n = parse(Int,ARGS[1])
pidigits(n)
2 Likes

Programs that need help (besides 2.6x slower pidigits already mentioned):

7.7x slower, despite being multi-threaded. Tree code seems difficult for Julia; would it be allowed by the Benchmarks Game (since it's non-idiomatic), and could using Libc.malloc and free instead help?

It seems very unfair that Julia isn’t allowed to use external packages (which would also be precompiled, helping with startup), while the fastest program, in C++, uses a non-std lib:

#include <boost/iterator/counting_iterator.hpp>

and:

using MemoryPool = std::pmr::monotonic_buffer_resource;

is 6x slower, since not multi-threaded; the 4.2x-slower Lisp SBCL #5 isn’t either, so Julia is only 42% slower than that fastest single-threaded one (and in a sense only 50% slower than the fastest multi-threaded one).

4.6x slower, because not multi-threaded; also beaten by PHP and by Lua 5.4.4 (the latter single-threaded), and that's not even LuaJIT?

2.6x slower, despite its multi-threading.

80% slower; multi-threading would likely help. It is only 0.03 sec. slower than the only faster single-threaded [C] program.

is 60% slower, (in part) because 3 of the 4 cores are less loaded than ideal. This may be an issue in Julia, and maybe already solved in Julia 1.8 or 1.9?

again 60% slower (while beating C, though not C++ or Rust, the fastest code).

50% slower than the fastest, which also happens to not be multi-threaded (all the programs in between are). That fastest C code has some voodoo (and the next fastest [C] code is 10% slower, even though multi-threaded), which seems a bit shady, but could be ported:

 * This benchmark uses a lookup table for the symbolic codes from the
 * output space of the random number generator, so in the case of an
 * random number generator with a modest number of outputs (the modulo
 * IM in this case), the table even fits in CPU cache. [..]
 *  Adding random number sequence fast skipping would also allow a threading
 * speedup, though it my be limited by memory bandwidth.
1 Like

Oh @palli, what are you doing;)

Classical nerd sniping, if you are going to start a thread with ‘This piece of PHP is faster than Julia’ I’ll bite…

5 Likes

One thing I recently learned was using

mutable struct Node{T}
    data::T
    l::Node
    r::Node
    Node(i, l, r) = new(i, l, r)
    function Node(i)
        n = new()
        n.data = i
        n.l = n
        n.r = n
        n
    end
end

as a base tree structure. If l === r, we’re at a leaf, else we’re not. Saves us from dynamic dispatch, since we’re always checking a Node! Not sure if this can be used here though.

I was also playing with the thought of having an array backed tree, to keep all elements close together for a cache optimization. I haven’t tried implementing it yet though.
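For what it's worth, a hedged sketch of that array-backed idea (my own, untested against the benchmark and quite possibly against its rules): store nodes contiguously in a Vector and reference children by index, with 0 marking a leaf:

```julia
# Hedged sketch (my own, not a benchmark entry): nodes live contiguously
# in a Vector and children are referenced by index; 0 marks a leaf.
struct FlatNode
    l::Int32   # index of the left child, 0 for a leaf
    r::Int32   # index of the right child
end

function build!(nodes::Vector{FlatNode}, nxt::Ref{Int}, depth::Int)
    i = (nxt[] += 1)                 # claim the next free slot
    if depth == 0
        nodes[i] = FlatNode(0, 0)
    else
        l = build!(nodes, nxt, depth - 1)
        r = build!(nodes, nxt, depth - 1)
        nodes[i] = FlatNode(l, r)
    end
    return i
end

function make_flat(depth::Int)
    nodes = Vector{FlatNode}(undef, 2^(depth + 1) - 1)  # perfect-tree size
    build!(nodes, Ref(0), depth)
    return nodes
end

# The benchmark's "check" sum, walking indices instead of chasing pointers.
check_flat(nodes::Vector{FlatNode}, i::Integer = 1) =
    nodes[i].l == 0 ? 1 : 1 + check_flat(nodes, nodes[i].l) + check_flat(nodes, nodes[i].r)
```

All nodes of one tree then sit in a single contiguous allocation, which should be friendlier to the cache than pointer-chasing, at the cost of needing the tree size up front.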


The Benchmarks Game is often not that representative - I think in the past there were some quite weird things done with regard to which library versions were allowed in one language while not in others. There was also a period where PRs just weren’t merged at all? Not sure if that’s still the case.

3 Likes

So ‘binary-trees’ it is then for now? Anyone care to start a provocative new thread?

I have no intention of trying to improve BenchmarkGame in the foreseeable future - I was merely suggesting a trick for these trees in particular on how to avoid dispatch :slight_smile: If someone wants to implement it, they’re welcome to do so.

Admittedly, I also haven’t looked at the other benchmarks - I don’t find the Benchmarks Game particularly representative of either performance (considering cache effects in context) or well-maintainable code, as some entries are optimized to an almost ridiculous degree.

4 Likes

I was curious enough to have a look. I played around with this entry and see for

@time binary_trees_serial(stdout, 21)
@time binary_trees_serial(stdout, 21)

@time (GC.enable(false); binary_trees_serial(stdout, 21); GC.enable(true); GC.gc())
@time (GC.enable(false); binary_trees_serial(stdout, 21); GC.enable(true); GC.gc())

@time binary_trees_parallel(stdout, 21)
@time binary_trees_parallel(stdout, 21) 

@time (GC.enable(false); binary_trees_parallel(stdout, 21); GC.enable(true); GC.gc())
@time (GC.enable(false); binary_trees_parallel(stdout, 21); GC.enable(true); GC.gc())

@time binary_trees_parallel_opt(stdout, 21)
@time binary_trees_parallel_opt(stdout, 21) 

@time (GC.enable(false); binary_trees_parallel_opt(stdout, 21); GC.enable(true); GC.gc())
@time (GC.enable(false); binary_trees_parallel_opt(stdout, 21); GC.enable(true); GC.gc())

the following benchmark results

# serial version of the benchmark entry
 12.323571 seconds (305.54 M allocations: 9.108 GiB, 18.83% gc time, 0.39% compilation time)
 12.018895 seconds (305.49 M allocations: 9.104 GiB, 16.40% gc time)
# serial version of the benchmark entry, GC delayed to end of run
 11.442474 seconds (305.49 M allocations: 9.104 GiB, 11.72% gc time)
 11.242221 seconds (305.49 M allocations: 9.104 GiB, 10.70% gc time)
# benchmark entry
  7.832903 seconds (308.76 M allocations: 9.164 GiB, 49.68% gc time, 2.25% compilation time)
  7.553655 seconds (308.61 M allocations: 9.156 GiB, 49.83% gc time)
# benchmark entry, GC delayed to end of run
  4.925564 seconds (308.56 M allocations: 9.154 GiB, 26.80% gc time)
  5.600125 seconds (308.57 M allocations: 9.155 GiB, 22.83% gc time)
# small improvements
  7.489242 seconds (305.54 M allocations: 9.129 GiB, 58.96% gc time, 1.16% compilation time)
  7.337273 seconds (305.49 M allocations: 9.125 GiB, 56.08% gc time)
# small improvements, GC delayed to end of run
  4.890634 seconds (305.49 M allocations: 9.125 GiB, 28.99% gc time)
  4.442699 seconds (305.49 M allocations: 9.125 GiB, 27.33% gc time)

So small improvements seem possible (although eliminating allocations seems to be explicitly prohibited by the benchmark rules) and this benchmark checks Julia’s runtime allocation mechanisms and garbage collector AFAIU.

Yes, I recall some such great result. But by using Libc.malloc and free yourself (not idiomatic for Julia code, but who's to say it's disallowed, since it ships with Julia), you would never run out of memory or have such a bug. I guess the rule about disabling GC exists because, if you disable it and get away with it in the benchmark, the same code with different defaults might not work; that wouldn’t happen with the malloc alternative, even without disabling the GC. I’m not sure if the GC would still slow things down; that could be tested. I would be interested in the fastest version of this, without bugs, even if disallowed by the rules.
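To make that concrete, a hedged sketch (my own, certainly not a benchmark entry, and with no error handling) of GC-free tree nodes via Libc.malloc, with C_NULL children marking a leaf:

```julia
# Hedged sketch (my own, not a benchmark entry; no error handling): tree
# nodes allocated with Libc.malloc, invisible to the GC. C_NULL children
# mark a leaf, and everything must be freed manually.
struct CNode
    l::Ptr{CNode}
    r::Ptr{CNode}
end

function make_c(depth::Int)
    p = convert(Ptr{CNode}, Libc.malloc(sizeof(CNode)))
    if depth == 0
        unsafe_store!(p, CNode(C_NULL, C_NULL))
    else
        unsafe_store!(p, CNode(make_c(depth - 1), make_c(depth - 1)))
    end
    return p
end

function check_c(p::Ptr{CNode})
    n = unsafe_load(p)
    return n.l == C_NULL ? 1 : 1 + check_c(n.l) + check_c(n.r)
end

function free_c(p::Ptr{CNode})
    n = unsafe_load(p)
    if n.l != C_NULL
        free_c(n.l)
        free_c(n.r)
    end
    Libc.free(p)
end
```

Leaks or double-frees become the programmer's problem, exactly as in the C entries, which is presumably why the benchmark treats this kind of thing as a separate category.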

The rules are, however, unfair in allowing C++ to use the following (which IS in std, while maybe not very idiomatic?):

using MemoryPool = std::pmr::monotonic_buffer_resource;

and Julia not to use Blobs.jl, which seems intended for exactly this. Yes, that’s an external library, but so is the Boost library the C++ program uses.

[EDIT: The program I had in this post was wrong, and the speedup I had went away when corrected, keeping here otherwise relevant info.]

I also see that this fastest [Rust] program has something related (and I don’t yet do anything similar in Julia), to make multi-threading faster, or maybe just for it to work at all: mandelbrot Rust #7 program (Benchmarks Game)

    let stdout_unlocked = std::io::stdout();
    // Main thread only can print to stdout
    let mut stdout = stdout_unlocked.lock();

[.. then the only line using the above is the last line in main:
    let _ = stdout.write_all(&rows);

I did find this in “printf() and stdio in the Julia runtime” in the Julia docs:

In special cases, like signal handlers, where the full libuv infrastructure is too heavy, jl_safe_printf() can be used to write(2) directly to STDERR_FILENO:
[…]
Use of ios.c in these modules is mostly self-contained and separated from the libuv I/O system. However, there is one place where femtolisp calls through to jl_printf() with a legacy ios_t stream.
[…]
This allows uv_stream_t pointers to point to ios_t streams.

Sometimes speeding up is simple, [EDIT: the invalid program was:] 0m0,882s / 0m1,568s = 43.8% faster; if it holds up on the testing machine, Julia will be 0.80 sec., or 12%, faster than the fastest [Rust] program (which is 0.91 sec.).

I had hopes the known bug was trivial, with a large prefix of the file correct:

$ cmp mandel_new.out mandel_org.out
mandel_new.out mandel_org.out differ: byte 1516, line 3

[the fix should be easy, <s>at least keeping the speed advantage</s> (mainly since the division, which is slow, is moved out of the inner-most loop):]

[deleted code]
1 Like

So you think something like my current best from here

function binary_trees_parallel_opt_gc(io, n::Int)
    write(io, "stretch tree of depth $(n+1)\t check: $(check(make(n+1)))\n")

    long_tree::Node = make(21)

    minDepth::Int = 4
    resultSize::Int = trunc(Int,(n - minDepth) / 2) + 1
    results = Vector{String}(undef,resultSize)
    for depth::Int = minDepth:2:n
        niter::Int = 1 << (n - depth + minDepth)
        cs = Vector{Int}(undef, niter)
        GC.enable(false)
        Threads.@threads for i = 1:niter
            cs[i] = check(make(depth))
        end#for
        GC.enable(true)
        GC.gc(false)
        c::Int = sum(cs)
        index::Int = trunc(Int,(depth - minDepth)/2) + 1
        results[index] = "$niter\t trees of depth $depth\t check: $c\n"
    end
    for i in results
        write(io,i)
    end

    write(io, "long lived tree of depth $n\t check: $(check(long_tree))\n")
end#function

is non-conformant? Note: I disable the GC only temporarily and start an incremental collection afterwards.

Without doing anything special (just suppressing some useless instructions while not changing the algorithm) I got a faster (if my computer is similar to yours) pidigits:

# pidigits benchmark
using Printf
function pidigits(N::Int, printout::Bool=true)
  i=k=ns=0
  k1=1
  a=BigInt(0)
  n=d=BigInt(1)
  while i<N
    a+=2n
    a*=k1+=2
    d*=k1
    n*=k+=1
    if a<n continue end
    t,u=divrem(a+3n,d)
    if d<=u+n continue end
    a=10(a-d*t)
    n*=10
    ns=10ns+Int(t)
    i+=1
    if mod(i,10)==0
      if printout
        @printf("%10.10d\t:%d\n",ns,i)
      end
      ns=0
    end 
  end
  if printout && mod(N,10)!=0
    @printf("%-10d\t:%d\n",ns,N)
  end
end

which gives

@time pidigits(10000)
...
  0.853739 seconds (859.27 k allocations: 8.306 GiB, 6.57% gc time)

suppressing output does not make it appreciably faster.

3 Likes

In your code the GC is disabled for a stretch bounded by niter, which can be arbitrarily large; that seems shady, since you call make that many times, and with too many iterations you would run out of memory. How much faster is this? And more importantly, why? I thought allocations should be essentially free until you run out of memory and trigger the GC… If you could disable the GC for a fixed, bounded number of allocations at a time, and it helps…, that seems defensible. Since there are no deallocations/pruning of the tree (am I wrong?), you could argue the program would also fill memory with the GC enabled, so it is defensible to disable the GC indefinitely; but I think the point there is to check how fast allocations are and the potential overhead of the GC (in real-world code you would also deallocate, making garbage, and relying on that not happening is maybe not in the spirit of the benchmark).
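A hedged sketch of what I’d consider the defensible version (`chunked` is my own illustrative helper, not benchmark code): disable the collector only for a fixed number of iterations at a time, so the amount of uncollected garbage stays bounded no matter how large niter gets:

```julia
# Run f(j) for j in 1:niter, but toggle the GC so it is only ever
# disabled for at most `chunk` iterations' worth of allocations.
function chunked(f, niter::Int; chunk::Int = 10_000)
    i = 1
    while i <= niter
        hi = min(i + chunk - 1, niter)
        GC.enable(false)             # no collections inside the hot stretch
        for j in i:hi
            f(j)
        end
        GC.enable(true)
        GC.gc(false)                 # incremental collection between chunks
        i = hi + 1
    end
end
```

Whether this still wins anything would need measuring; the pause per chunk is bounded, which seems closer to the spirit of the rules than disabling the GC for the whole depth loop.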

FYI: what looks to be a comment in the OCaml code (and is one) seems to actually act as code, via a language extension (so it is just a comment on older/other versions without the extension), to control the GC:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-ocaml-2.html

let () =
  (* Gc.set { (Gc.get()) with Gc.minor_heap_size = 1024 * 1024; max_overhead = -1; }; *)
  let c = check (make stretch_depth) in
  Printf.printf "stretch tree of depth %i\t check: %i\n" stretch_depth c

https://ocaml.org/manual/doccomments.html

Not only do all the fast programs use memory pools, including Fortran, but I think (though I’m not sure) this Fortran code may actually be calling a C library for that(?):

https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-ifc-2.html

module apr
  use iso_c_binding
  implicit none

  interface

    integer(c_int) function apr_initialize() bind(C)
      import
    end function

    type(c_ptr) function apr_palloc(p,size) bind(C)

I would argue we are also allowed to call C, since we have the ccall keyword, and it’s very idiomatic to use it. Is use iso_c_binding common in Fortran (I know it’s now standardized, formerly a GNU extension), and maybe the only way to do such pointer/tree code?

Measured with @btime, without IO

  7.476 s (304360009 allocations: 9.03 GiB) # original
  4.176 s (301292435 allocations: 9.00 GiB) # current best

and with IO

[...]
  8.563 s (308611548 allocations: 9.16 GiB)
[...]
  4.382 s (305485960 allocations: 9.13 GiB)

I can only guess: no interaction of (stop the world?) GC with the hot loop.

See also Julia’s own: Julia Micro-Benchmarks

It really needs to be rerun and updated (not done since 1.0), e.g. before 1.8 is released. And if anything in Julia can be fixed for those, or for Debian’s benchmarks, to make them better, that would be great! Note, both have a pi benchmark, while very different ones.

Note, also this discussion (and adding memory pools as a package; I’m not sure Debian will allow that, maybe if it’s in the standard library?):

I want renewed interest in the Benchmark game (and please post progress here, also showing 1.9-DEV compared to 1.8.2).

An additional bonus to starting twice as fast, and making Julia fastest on some benchmarks:

$ /usr/bin/time --format "%M kB" julia -J BenchmarkingSysimage.so --startup-file=no --history-file=no -e0
95804 kB

compared to:

that’s not enough for all benchmarks:

fannkuch-redux - Which programs are fastest? (Benchmarks Game) has single-threaded Lisp 44% faster than Julia, or 13 sec. ahead, and we’re 5.8 times slower than the fastest (multi-threaded) program.

Here single-threaded Julia could beat multi-threaded C++:

1 Like

Not to denigrate the benchmark game etc., but those excited about benchmarking should take a look at my post re the Pure Julia HINT benchmark

HINT was designed as a different kind of benchmark: first, it measures the convergence rate of a numerical calculation, and second, it does so over a range of different memory usages, so it measures memory-bandwidth issues and shows a certain 1D “structure” of a computer’s performance.

Hoping to get some interest in reviving this great benchmark from benchmark fans. I would be happy to contribute after recovering from COVID… :pensive::face_with_thermometer::sneezing_face:

Some benchmarks at the Debian benchmark game are still really bad for Julia. At least 1 or 2.

E.g. for binary trees. So I looked into what others are doing or not doing.

Rust uses:

Maybe Julia should use Bumper.jl, but I thought packages were forbidden… is that a stdlib for Rust? Or Libc.malloc/free?

C++ uses:
https://en.cppreference.com/w/cpp/memory/monotonic_buffer_resource

https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-java-7.html

I think Java doesn’t use any such tricks, so why is it beating Julia? Chapel is actually slower than Julia, and it doesn’t have a GC.

Maybe simply running in 1.10 is better to close the gap with at least Java.

GC languages have a disadvantage there, or even just the heap malloc/free.

1 Like

It’s more of a trolling game than a benchmarks game. But sadly many people think it’s relevant (for something), even though both the concept and the implementation were bunk last I checked.

1 Like

Can you explain why? In my opinion, crowdsourcing a benchmark is the only way to get something decent in every language, precisely by triggering aficionados like us.

4 Likes

A few examples:

  1. the rules of the benchmark were unfair/simplistic last time I checked, for example with how the compilation time was counted together with the run time for Julia

  2. it’s not clear what the benchmark is even supposed to be representative of, seeing as some solutions are optimized to an absurd degree, while others are just clean code, some use standard libraries while others use user packages, some are single-threaded while others are multi-threaded, some use FFI while others are pure, …

  3. To reduce the above to absurdity, any language in the benchmark could/would ideally be “represented” by an implementation that just calls a hand-written assembly implementation. Since the hand-written assembly implementation might as well be the same for all languages, the benchmark doesn’t compare anything, in the limit.

  4. the benchmarks are purposefully run on outdated hardware

2 Likes