Help with binary trees benchmark games example

Hey guys, I happened to look at the Julia benchmarks on the Benchmarks Game website recently. Overall, the Julia results are extremely impressive. I dare say a lot of the Julia examples, aside from being blazing fast, are written in comparatively few lines of code. If that isn’t a good advertisement for Julia, I don’t know what is.

Anyway, I noticed that the results for the binary-trees benchmark lag very far behind the others. I figure it’s probably possible to fix this. For one, the current program uses Distributed, and I was wondering whether using Base.Threads would be faster.

First of all, for me the existing Distributed code runs way faster than the 20 s reported on the site:

julia> @btime binary_trees(io, 21);
  3.977 s (12586662 allocations: 384.16 MiB)

I’m not sure what the discrepancy is (I’m using 4 processes just like on the site). By the way, 3.97 s is only slightly slower than the C implementation.

I started messing around with a single-process threaded version of this and quickly realized I have no clue what I’m doing. The biggest problem in translating this over to threads is that there’s no obvious threaded alternative to the @distributed map-reduce. The parallel map-reduce from Transducers is incredibly fast on smaller examples, but it wouldn’t be allowed in the benchmarks game, so I’d have to reverse-engineer what that map-reduce is doing in those cases. In my testing it was, however, a lot slower than using @distributed:

julia> @btime binary_trees($io, $21);
  10.184 s (613767861 allocations: 18.29 GiB)

but again, I’m already seeing the original example be way faster than the 20 s reported on the site.

So, I have the following questions, if anyone can help:

  • Why am I seeing such a big discrepancy with the results on the site?
  • Is it likely to be possible to make the program any faster using Threads or is this a misconception? My thinking is that any tiny amount of IPC is an unnecessary overhead.
  • Can anyone suggest a simple, succinct threaded map-reduce that gets good results?
5 Likes

The biggest problem is that this particular benchmark is rigged against GC languages. It’s supposed to test GC performance, but languages like C++ don’t have a GC, so they can use a memory pool instead.

You aren’t allowed to write a custom pool for this benchmark either.

I reckon if there was a fast memory pool in General you could use that.
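For illustration, here is a rough sketch of what such a pool could look like: an index-based arena where child links are integer indices into preallocated vectors, so building a tree does no per-node GC allocation. All the names here (`Arena`, `alloc!`, `make!`) are hypothetical, and per the rules above nothing like this would be accepted for the Julia entry anyway:

```julia
# Hypothetical index-based arena: nodes live in preallocated vectors and
# child "pointers" are Int32 indices (0 marks a leaf's missing child).
struct Arena
    l::Vector{Int32}        # left-child index of node i
    r::Vector{Int32}        # right-child index of node i
    len::Base.RefValue{Int} # number of nodes handed out so far
end
Arena(cap::Int) = Arena(Vector{Int32}(undef, cap), Vector{Int32}(undef, cap), Ref(0))

# Hand out the next free slot; no GC allocation happens here.
function alloc!(a::Arena, l::Integer, r::Integer)
    i = (a.len[] += 1)
    a.l[i] = l
    a.r[i] = r
    return Int32(i)
end

# Same shape as the benchmark's make/check, but against the arena.
make!(a::Arena, n) = n == 0 ? alloc!(a, 0, 0) : alloc!(a, make!(a, n-1), make!(a, n-1))
check(a::Arena, i) = a.l[i] == 0 ? 1 : 1 + check(a, a.l[i]) + check(a, a.r[i])

a = Arena(2^11)          # a depth-10 tree has 2^11 - 1 = 2047 nodes
root = make!(a, 10)
println(check(a, root))  # → 2047
```

The arena can be reset by zeroing `len`, which is the whole appeal: "freeing" every tree of a given depth is a single integer store instead of GC work.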

3 Likes

Actually, looking at the Lisp implementation, it uses a concurrent queue and is faster than Julia (8 s vs 21 s). Using channels might get Julia closer without “cheating”.
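I haven’t studied the Lisp code, so this is only a minimal, untested sketch of the channel idea in Julia (`pooled_sum` is a made-up name; `Node`/`make`/`check` mirror the benchmark code): one task per thread drains a shared Channel of depths and keeps a local partial checksum.

```julia
struct Node
    l::Union{Node,Nothing}
    r::Union{Node,Nothing}
end

make(n) = n == 0 ? Node(nothing, nothing) : Node(make(n-1), make(n-1))
check(node) = node.l === nothing ? 1 : 1 + check(node.l) + check(node.r)

# One worker task per thread; each drains the shared job queue and keeps a
# local partial sum, so the only synchronization is on the Channel itself.
function pooled_sum(niter, d)
    jobs = Channel{Int}(niter)
    foreach(_ -> put!(jobs, d), 1:niter)
    close(jobs)
    workers = [Threads.@spawn begin
        s = 0
        for depth in jobs  # iterating take!s until the channel is closed and drained
            s += check(make(depth))
        end
        s
    end for _ in 1:Threads.nthreads()]
    return sum(fetch, workers)
end

println(pooled_sum(64, 4))  # 64 trees of depth 4, 31 nodes each → 1984
```

Whether the Channel contention beats the plain chunked `tmapreduce` here is exactly the kind of thing that would need measuring.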

2 Likes

This is interesting. I tried translating the code to use Threads, but I’m seeing the same thing: the Distributed-based version is much faster (~5 s) than the Threads-based version (~10 s), measuring with @time binary_trees(devnull, 21).

Code
struct Node
    l::Union{Node,Nothing}
    r::Union{Node,Nothing}
end

make(n) =
    n === 0 ? Node(nothing, nothing) : Node(make(n-1), make(n-1))

check(node) =
    node.l === nothing ? 1 : 1 + check(node.l) + check(node.r)

function tmapreduce(f, op, xs)
    tasks = [
        Threads.@spawn(mapreduce(f, op, ys)) for
        ys in Iterators.partition(xs, length(xs) ÷ Threads.nthreads())
    ]
    return mapfoldl(fetch, op, tasks)
end

function binary_trees(io, n)
    write(io, "stretch tree of depth $(n+1)\t check: $(check(make(n+1)))\n")

    long_tree = make(n)

    d = 4
    while d <= n
        niter = 1 << (n - d + 4)
        c = tmapreduce(+, 1:niter) do _
            check(make(d))
        end
        write(io, "$niter\t trees of depth $d\t check: $c\n")
        d += 2
    end

    write(io, "long lived tree of depth $n\t check: $(check(long_tree))\n")
end

isinteractive() || binary_trees(stdout, parse(Int, ARGS[1]))

I first thought it might be because (IIUC) they measure the whole program runtime. But even running the whole script only takes ~8 sec on my machine. Maybe they have using DifferentialEquations in ~/.julia/config/startup.jl? :slight_smile:

The function is sending and receiving 9 * nworkers() = 36 ints. Since the whole execution takes a few seconds, I think the IPC overhead is negligible. I think this benchmark is very Distributed-friendly; it’s especially nice that the GC of each worker is completely independent. Though it’s still true that the initial startup cost hurts the result.

2 Likes

Please take the Fortran results with a grain of salt. By my estimation, they haven’t had nearly as much optimization effort put into them as the Rust/C/C++/Julia implementations.

Note that all the benchmarks on the site are run on a 2007-era core2 cpu. This is by no means the only discrepancy you’ll see when running the benchmarks on modern hardware.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/how-programs-are-measured.html

I and others have tried to get a more efficient implementation using threads instead of processes, and my results mirror yours. I think it’s a limitation of either the way Julia does allocations or its garbage collection? Possibly related to https://github.com/JuliaLang/julia/issues/33033.

There is one, but when I tried to submit it, the site’s maintainer clarified that only languages without builtin garbage collection are allowed to use an external memory pool.

2 Likes

If you wouldn’t mind taking a look at some of the other benchmarks on the site: some use an extremely naive threading strategy and spawn way too many tasks, since I’m not super familiar with how best to partition these things. I feel like we could squeeze out a bit more performance by being smarter about it, and you have better domain knowledge than I do on how to accomplish that.

For example, https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/mandelbrot-julia-7.html spawns 16,000 tasks.

1 Like

Is there a reason not to use @threads for y=1:n?

1 Like

There are some early exit conditions for the Mandelbrot calc, so the amount of time each loop iteration takes differs quite a bit. Experimentally, this way performs better.

1 Like

Oh wow, it is in the comment just below :smiley:

1 Like

Ok, so sounds like Distributed works quite well here and we’re just not sure why they are reporting it as so slow.

The fortran results do seem suspiciously slow compared to C, but Julia looks quite good even relative to C (though it’s of course always a bit slower), so I think the conclusion is the same.

On my (Intel i5-5300U from 2015) laptop I get timing similar to the website. I think in general making a lot of allocations and garbage-collecting them just isn’t something Julia excels at yet.

Shouldn’t there be an addprocs(n) call to create the extra processes? If I add:

if nprocs() == 1
    addprocs(7)
end

before Node is defined, the function’s benchmark time drops by 3x.

Wait, never mind: I read the whole page, and they are starting julia with -p4, which creates the other processes…

Your message made me curious, so I did some benchmarking, and it turns out adding addprocs() to the top of the script gives a 3% speedup versus invoking the script with -p4 (no idea why this is or why it should be the case). So I guess there is room for improvement on this benchmark after all. :wink:

EDIT: created a github issue in case there’s something that can be done about this: https://github.com/JuliaLang/julia/issues/35425

3 Likes

Actually, I only know basic things about parallel programming :slight_smile: Anyway, the only thing I can come up with is simple partitioning:

@@ -66,12 +66,12 @@
     # For each row (each imaginary coordinate), spawn a thread to fill
     # out values. At small values of n, this is too fine-grained of
     # parallelism to really be efficient, but it works well for large n.
-    @sync for y=1:n
+    @sync for ys in Iterators.partition(1:n, 200)
         # Threads.@spawn allows dynamic scheduling instead of static scheduling
         # of Threads.@threads macro. See
         # https://github.com/JuliaLang/julia/issues/21017 . On some
         # computers this is faster, on others not.
-        Threads.@spawn @inbounds begin
+        Threads.@spawn @inbounds for y in ys
             ci = yvals[y]
             startofrow = (y - 1) * n ÷ 8
             # The first iteration within a row will generally return 0x00

Before:

julia> @benchmark mandelbrot(devnull, 16000)
BenchmarkTools.Trial:
  memory estimate:  42.73 MiB
  allocs estimate:  96032
  --------------
  minimum time:     1.105 s (0.21% GC)
  median time:      1.225 s (0.11% GC)
  mean time:        1.202 s (0.09% GC)
  maximum time:     1.236 s (0.11% GC)
  --------------
  samples:          5
  evals/sample:     1

After:

julia> @benchmark mandelbrot(devnull, 16000)
BenchmarkTools.Trial:
  memory estimate:  30.90 MiB
  allocs estimate:  165
  --------------
  minimum time:     1.027 s (0.00% GC)
  median time:      1.033 s (0.00% GC)
  mean time:        1.054 s (0.02% GC)
  maximum time:     1.146 s (0.00% GC)
  --------------
  samples:          5
  evals/sample:     1

It’s a bit… better. Not sure if this is going to give us measurable change in the game, though.

I think optimizing the serial part of the code would be more beneficial. For example, using Baselet.all instead of all from Base gives me 830 ms. I guess you can get a similar/more speedup by using @nexprs.

2 Likes

I’ve been unable to get a @nexprs based solution to be any faster, and it seems a bit silly to pull in a dependency for something like this. Any suggestions?

@@ -30,7 +30,8 @@
             for _=1:5
                 r, i, t = calc_sum(r, i, cr, ci)
             end
-            all(>(4.0), t) && return 0x00
+            @nexprs 8 k-> t[k] > 4.0 || continue
+            return 0x00
         end
     else
         for _=1:50

Actually, I’m not seeing any difference on my computer from using Baselet.all either:

--- mandelbrot.julia-7.jl	2020-03-03 15:35:30.625960909 -0500
+++ mandelbrot.julia-7.2.jl	2020-04-17 09:51:39.914201367 -0400
@@ -8,7 +8,7 @@
  tweaked for performance by maltezfaria and Adam Beckmeyer
 =#
 
-using Base.Cartesian
+using Base.Cartesian, Baselet
 
 # 0b01111111, 0b10111111, 0b11011111, 0b11101111, etc.
 const masks = (0x7f, 0xbf, 0xdf, 0xef, 0xf7, 0xfb, 0xfd, 0xfe)
@@ -30,7 +30,7 @@
             for _=1:5
                 r, i, t = calc_sum(r, i, cr, ci)
             end
-            all(>(4.0), t) && return 0x00
+            Baselet.all(>(4.0), t) && return 0x00
         end
     else
         for _=1:50

Any thoughts on why you’re seeing a better result? Maybe an AVX2 instruction is being used on your CPU?

julia> using BenchmarkTools

julia> module M7 include("mandelbrot.julia-7.jl") end
Main.M7

julia> module M72 include("mandelbrot.julia-7.2.jl") end
Main.M72

julia> @benchmark M7.mandelbrot($devnull, $16000)
BenchmarkTools.Trial: 
  memory estimate:  42.73 MiB
  allocs estimate:  96031
  --------------
  minimum time:     1.213 s (0.06% GC)
  median time:      1.216 s (0.08% GC)
  mean time:        1.218 s (0.34% GC)
  maximum time:     1.229 s (1.29% GC)
  --------------
  samples:          5
  evals/sample:     1

julia> @benchmark M72.mandelbrot($devnull, $16000)
BenchmarkTools.Trial: 
  memory estimate:  42.73 MiB
  allocs estimate:  96031
  --------------
  minimum time:     1.242 s (0.06% GC)
  median time:      1.246 s (0.08% GC)
  mean time:        1.248 s (0.35% GC)
  maximum time:     1.259 s (1.30% GC)
  --------------
  samples:          5
  evals/sample:     1

(@v1) pkg> st
Status `~/.julia/environments/v1/Project.toml`
  [9718e550] Baselet v0.1.0
  [6e4b80f9] BenchmarkTools v0.5.0

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, ivybridge)
Environment:
  JULIA_NUM_THREADS = 4

By using @nexprs I meant something like this:

@@ -20,7 +20,9 @@
             for _=1:5
                 r, i, t = calc_sum(r, i, cr, ci)
             end
-            all(>(4.0), t) && return 0x00
+            done = true
+            @nexprs 8 k -> done &= t[k] > 4.0
+            done && return 0x00
         end
     else
         for _=1:50

It’s often better to avoid branches for a tight loop like this. I see a speedup (~10%) equivalent to using Baselet.all.

Ah, so I guess I should be using --cpu-target=core2 when optimizing the benchmark game code, as done in that web site? I can confirm that I get no speedup with this flag.

1 Like

It seems you should now use --cpu-target=ivybridge; maybe they upgraded the machine, or someone realized this is faster and that core2 restricts the generated instructions too much. I see this option also works in Julia 1.0, but maybe it’s only effective in Julia 1.5; I only see it used in code updated for that version. Some code still hasn’t been updated: I see the tree code needs updating (threads via -t are probably better than -p), and there is other code where PHP beats Julia…

1 Like

I was stumbling across that benchmark too - the fastest Julia implementation takes 17.4 s, while e.g. Java takes only 2.47 s. Being 7x slower than Java is quite substantial, especially since Julia tends to be faster than Java, and even Ruby is closing in at 23.8 seconds.

Another extremely slow benchmark for Julia is the “mandelbrot” benchmark here: GitHub - kostya/benchmarks: Some benchmarks of different languages

Java ~20s, Crystal ~23s, Go ~16s, Julia ~78s

Another extremely slow benchmark for Julia is the “mandelbrot” benchmark here

Lol, I think this is a benchmark that tests brainfuck implementations - so this isn’t actually testing Julia, but a brainfuck interpreter written in Julia.
I think @otde implemented a Julia brainfuck interpreter which was beating all the others in that benchmark by a significant margin.

The Benchmarks Game seems a bit biased… I think at some point I had versions of all the benchmarks that were on par with C, but not all solutions were accepted, even though the code had no dependencies and was algorithmically identical to the C++ solution…
At that point I stopped caring about that benchmark :smiley:

7 Likes

Hm… that’s a weird argument - following that train of thought, you could also claim that none of the other benchmarks test Julia either - they just test the task the code was written to do. In this case, interpreting Brainfuck. In other cases, like the Julia microbenchmarks, multiplying matrices or parsing integers: “It’s not testing Julia, it’s testing matrix multiplication written in Julia”.

The same holds true for the main benchmark discussed here: it’s 8x slower than Java, while it seems idiomatically written in both languages. And the last posts from 2020 did a lot of experimentation but apparently never found the root cause, just that it might be related to the CPU architecture.

I’ve now played around with Julia and compared it to Python and Rust - mainly to see how the syntax feels. However, I immediately hit a performance wall: Julia was ~2x slower than Python and ~2.5x slower than Rust for the random stuff I tried. The point is not that the code is random, pointless and very unoptimized (which it is; it was just to play around with the syntax), but that it’s the same code in all three languages, and Python was twice as fast for the same code. That said, it does contain some elements you might encounter in real life (even though you don’t really want to): appending to arrays/lists/vectors of unknown size, and string interpolation (e.g. for extensive logging).

I have no interest in talking Julia down - Julia in general is a very fast language and certainly faster than Python on average - but for some cases (like those benchmarks or this one here) it seems arbitrarily slow compared to other languages, and I can’t really see why. That’s an issue for me, since I don’t want to hit arbitrary performance walls in real projects (I am looking to rewrite my current Python project in Julia, Rust or maybe C++). Maybe I am doing something wrong that explains the time loss in each iteration (please bear in mind I am more or less a novice in all three languages)?

using BenchmarkTools

function stupidloop(num_iterations)
    str_list = []
    for i in 0:num_iterations-1
        push!(str_list, "str_list, test_$(i + 3)_somethingsomething")
    end

    total_length = 0

    for line in str_list
        total_length += length(line)
    end

    println(total_length)
end

@benchmark stupidloop(1234567)

Result:

BenchmarkTools.Trial:
  memory estimate:  393.75 MiB
  allocs estimate:  8641476
  --------------
  minimum time:     1.125 s (35.64% GC)
  median time:      1.191 s (42.28% GC)
  mean time:        1.205 s (41.09% GC)
  maximum time:     1.291 s (44.98% GC)
  --------------
  samples:          5
  evals/sample:     1

→ mean: 1.205s

code = """
def stupidloop(num_iterations):
    str_list = []
    for i in range(num_iterations):
        str_list.append(f"str_list, test_{i + 3}_somethingsomething")

    total_length = 0

    for line in str_list:
        total_length += len(line)


    print(total_length)

stupidloop(1234567)
"""

import timeit
num = 30
print(f"mean: {timeit.timeit(stmt=code, number=num) / num}")

→ mean: 0.574s

Both loops print the same result (49506155), but the Python loop is twice as fast. I tried adding sizehint! for the Vector, but it did not improve performance.
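For what it’s worth, `str_list = []` creates a `Vector{Any}`, so every element is stored as a boxed reference behind an abstract type. A variant with a concrete element type plus the `sizehint!` (just a sketch - I haven’t rerun the timings above with it, and `stupidloop_typed` is my own name) would be:

```julia
function stupidloop_typed(num_iterations)
    str_list = String[]                  # concrete eltype instead of Vector{Any}
    sizehint!(str_list, num_iterations)  # reserve capacity up front
    for i in 0:num_iterations-1
        push!(str_list, "str_list, test_$(i + 3)_somethingsomething")
    end

    total_length = 0
    for line in str_list
        total_length += length(line)
    end

    println(total_length)
    return total_length                  # returned as well, for easy checking
end

stupidloop_typed(1234567)  # → prints 49506155, same as the untyped version
```

The allocation cost of building 1.2 million interpolated strings stays the same either way; the typed vector only removes the boxing of the references, so whether this closes the gap against Python here is something I’d have to measure.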

For completeness, same thing in Rust, being a bit faster:

use std::time::SystemTime;

fn stupid_loop(num_iterations: i32){
    //let mut result_num = Vec::new();
    let mut str_list = Vec::new();
    for i in 0..num_iterations{
        str_list.push(format!("str_list, test_{}_somethingsomething", i + 3));
    }

    let mut total_length = 0;

    for line in str_list{
        total_length += line.len();
    }

    println!("{}", total_length);

}


fn main() {
    let now = SystemTime::now();
    let num = 30;
    for _i in 0..num{
        stupid_loop(1234567);
    }

    match now.elapsed() {
        Ok(elapsed) => {
            println!("mean: {:.3}", (elapsed.as_millis() as f64) / (num as f64) / 1000.0);
        }
        Err(e) => {
            println!("Error: {:?}", e);
        }
    }
}

→ mean: 0.473s

1 Like