Benchmarks game

Hi, this blog post shows bad results for julia in benchmarks game: https://benchmarksgame-team.pages.debian.net/benchmarksgame/

In comments the author said “One problem with Julia benchmarks on the Benchmarks Game is that improved Julia programs have not yet been contributed. How the programs are written really does matter.”

Performance is not a topic of my interest, so I report here in case some of you are attracted by the challenge.
Have fun.

1 Like

That site includes bad code for Julia (wrong types, unneded allocations, wrong functions…).

Many users have tried to contact the administrator to include optimized versions of that code, up to one hundred times faster.

There is already another thread about this:

I think many of the benchmarks on that site have been updated and the Julia results are pretty good now.

See https://benchmarksgame-team.pages.debian.net/benchmarksgame/which-programs-are-fastest.html#fastest for example.

3 Likes

About “bad results”, I stopped looking into tuning the slowest Julia programs there, when I realized Fasta, there unchanged [EDIT: results below are for my own old slightly changed unpublished version, that’s a bit faster], was actually fastest (beating C on my machine), more than 2x as fast as reported!

$ julia -O3 --cpu-target=core2 -t4   # -t is new in Julia 1.5

julia> using BenchmarkTools; include("fasta.jl")

julia> @btime fasta(25000000, devnull)
  991.491 ms (41601 allocations: 956.91 MiB)

The just over one sec. overhead is startup cost of Julia plus, compilation time for the program:

$ time julia -O3 --cpu-target=core2 -t4 fasta.jl 25000000 >/dev/null

real	0m2,003s   # 0m1,973s without --cpu-target=core2 used at BenchmarkGame test setup
user	0m5,761s
sys	0m0,619s

My understanding is that they do not want (some) tuned programs, such as precompiled with PackageCompiler.jl, which did reduce the startup a bit, but not to C levels.

Even without doing so, Julia is good for long-running programs:

$ time julia -O3 --cpu-target=core2 -t4 fasta.jl 2500000000 >/dev/null

real	1m50,248s
user	6m19,835s
sys	0m3,092s

Since it doing 100 times the work, I divide the time by 100, and get 1.1 sec, showing the startup cost amortized.

Whatever the optimization level we use (lower, makes for less compilation overhead), we can’t can’t compete with C and similar languages, where the compilation, is an extra step, not counted, if we must include it. And also, we have the startup overhead:

$ time julia -e "" -O0

real	0m0,217s
user	0m0,191s
sys	0m0,102s

Doesn’t mean scripting needs to be so bad, with e.g. -O1 option:

$ time julia -O1 --cpu-target=core2 fasta.jl 1
>ONE Homo sapiens alu
GG
>TWO IUB ambiguity codes
ctt
>THREE Homo sapiens frequency
tgagc

real	0m0,663s  # 0m0,838s on default opt. and 0m0,810s on highest -03
user	0m1,009s
sys	0m0,528s

He can’t be saying that in general, and while he refuses some submissions, I would want to be more clear about what’s allowed and what isn’t. E.g. .NET/C# is usually considered a JIT, and I noticed on some submission that the C# code gets compiled to .dll file. I assume AOT compiled. I understand they have that now, so our PackageCompiler.jl compiled code should also be allowed, to be fair. Is it only because it’s an extra step, not the standard compiler/environment?

I’m sorry.
I’ve just rechecked and found they have updated some of the Julia code. When I contacted them several months ago they were reluctant to do so.

@kristoffer.carlsson thanks for the link.
The latest benchmarks put Julia 5th, only behind C++/C/Rust/Fortran. Wow!
(note how much tighter the IQR is for Julia)
Toy benchmark program busy seconds compared to the fastest for selected programming language implementations.

Toy benchmark program busy seconds compared to the fastest for more programming language implementations.

Juan, yesterday you posted accusations to that blog which were not true:

“The Julia code used for that benchmarks was deliberately selected to be slow… but the administrator refuses to update it.”

We can all see from the benchmarks game public code repo that the Julia programs have been repeatedly re-worked and optimized.

We can all see that the new Julia programs are reviewed, accepted and shown on the public website without delay.

Please correct your comments on that blog.

5 Likes

I noticed (on my machine, not typical?), not only does Julia startup penalize our benchmark numbers (compared to e.g. C or Rust) but -p4 (while not for -t) adds 2.2 sec to startup (even -p0 adds time, and scales from bad to worse to e.g 11 sec), thus just this extra time is double the 1.112s I get for running the actual competing binary-tree C program!

So we really need to thread that program, or we never beat C.

$ time ~/julia-1.6.0-DEV/bin/julia -O3 --cpu-target=core2 -p4 -e ""

real	0m2,433s  # vs 0m0,222s without -p4
user	0m7,402s
sys	0m2,034s

With low or no optimization: -O0, can mitigate this, to the point with it running the actual program, not an empty one, is faster, but even then we could never beat C on [that] short running-program.

That said I get:

$ ~/julia-1.6.0-DEV/julia -O3 --cpu-target=core2 -p4
julia> @btime (GC.enable(false); binary_trees(devnull, 21); GC.enable(true););
  6.059 s (12586773 allocations: 384.16 MiB)

while I managed also lower (more correct, as machine not idle?):

julia> @timev (GC.enable(false); binary_trees(devnull, 21); GC.enable(true););
  5.793472 seconds (12.59 M allocations: 384.164 MiB)
elapsed time (ns): 5793472393
bytes allocated:   402825488
pool allocs:       12586781

julia> GC.gc(true)

julia> @time binary_trees(stdout, 21);
stretch tree of depth 22	 check: 8388607
2097152	 trees of depth 4	 check: 65011712
524288	 trees of depth 6	 check: 66584576
131072	 trees of depth 8	 check: 66977792
32768	 trees of depth 10	 check: 67076096
8192	 trees of depth 12	 check: 67100672
2048	 trees of depth 14	 check: 67106816
512	 trees of depth 16	 check: 67108352
128	 trees of depth 18	 check: 67108736
32	 trees of depth 20	 check: 67108832
long lived tree of depth 21	 check: 4194303
  6.921278 seconds (12.59 M allocations: 384.165 MiB, 11.29% gc time)
1 Like

With my timing I beat the threaded C version, and likely would match the fastest (Haskell) program, but that’s with excluding startup penalty at 1.95 sec. both would beat Julia:

$ time ~/fasta.gcc-7.gcc_run 25000000 >/dev/null

real	0m1,217s
user	0m4,648s
sys	0m0,016s

It seems to be using 4 threads (or 16 my machine has?), by dividing user/real = 3.8 (or scales badly for 16?)

I'm not sure I get an error by trying to limit to 4, or even very high:

$ time prlimit --nproc=320 ~/fasta.gcc-7.gcc_run 25000000 >/dev/null

libgomp: Thread creation failed: Resource temporarily unavailable

julia> @timev fasta(25000000, devnull);
  1.181311 seconds (79.99 k allocations: 960.094 MiB, 0.81% gc time)
elapsed time (ns): 1181311305
gc time (ns):      9618430
bytes allocated:   1006731712
pool allocs:       66969
non-pool GC allocs:1
malloc() calls:    13021
GC pauses:         18

Yes, slower with startup overhead:

$ time ~/julia-1.6.0-DEV-8f512f3f6d/bin/julia -O3 --cpu-target=core2 -t4 fasta.jl 25000000 >/dev/null

real	0m1,946s
user	0m6,419s
sys	0m0,608s

There is variablility on my machine (I tried to report lowest numbers above):

$ time ~/julia-1.6.0-DEV-8f512f3f6d/bin/julia -O3 --cpu-target=core2 -t4 ~/fasta.jl 25000000 >/dev/null

real	0m2,112s
user	0m6,044s
sys	0m0,657s

The best time I get so far (might be coincident) is with different opt/cpu-target (would also be allowed?):

$ time ~/julia-1.6.0-DEV-8f512f3f6d/bin/julia -O2 --cpu-target=amdfam10 -t4 fasta.jl 25000000 >/dev/null

real	0m1,974s
user	0m6,303s
sys	0m0,615s

amdfam10 probably uses instructions that aren’t present on the cpu the benchmarks are being run on. If the -O2 result holds up by itself, it might be worth opening an issue for. In fact, you should probably try running all the benchmarks under -O2; I have a suspicion that none of them are really benefitting from -O3, so the switch to -O2 might be fine universally.

1 Like

The lowest I get so far (while probably luck as I easily get 1.25 sec. on same settings), with MEMDEBUG build of Julia, and I guess despite that, and if anything because of doing GC.gc(true) first:

$ LD_PRELOAD=/home/pharaldsson_sym/mimalloc-master/out/release/libmimalloc.so ./julia -O3 -t4

julia> GC.gc(true); @timev (fasta(25000000, devnull);)
  1.179317 seconds (79.84 k allocations: 966.605 MiB, 0.45% gc time)
elapsed time (ns): 1179316777
gc time (ns):      5298347
bytes allocated:   1013558656
non-pool GC allocs:66823
malloc() calls:    13021
GC pauses:         17

GC.enable(false); program(); GC.enable(true) is helpful for some programs, but not that one.

For mandelbrot, at the website the difference between Julia and the fastest program (C++) is 1.13 sec. and I can shave off 0.5 sec (on my machine), of the Julia time, while more than half of that is the startup time of Julia:

I get 1.809575 seconds (and 2.187s with startup overhead, with -O3 2.257s), but a 6.5% better with disabling GC:

julia> GC.gc(true); @timev (GC.enable(false); mandelbrot(devnull, 16000); GC.enable(true););
  1.698622 seconds (96.04 k allocations: 42.732 MiB)
elapsed time (ns): 1698622180
bytes allocated:   44807664
pool allocs:       96029
non-pool GC allocs:7
malloc() calls:    3

Julia has 224,160 in the mem column at the website, and C++ has 26,256 (and Free Pascal has 8). I guess we should try to bring that number down, and I guess the Julia runtime itself accounts for most of it, thus the startup cost.

1 Like

I’m not sure this would get accepted:

When possible, use default GC; […] As a practical matter, the myriad ways to tune GC will not be accepted.

I guess switching it off and on might count as “tuning”. For the other programs this might be fine, however, if we have GCable code, it might be better to prevent it from allocating in the first place or reuse that memory.

I think we should just ignore that specific benchmark…
It’s pretty ridiculous, that you can’t use the same optimization as in C++ just because Julia has a GC - while the Julia code copying what C++ does is much simpler, has the same performance and doesn’t even rely on any dependency :smiley:

If you need to put up arbitrary rules like that, which are not the same between languages, it’s a strong indication that the benchmark is ill designed, since it’s not the benchmarking problem itself anymore that is shaping what one can do, but some arbitrary rules instead.

7 Likes

I agree, but to me it reads like it was the other way round: the benchmark was supposed to measure GC performance (sounds like a good idea, right?) and then languages without GC … well, just do whatever, I guess? :smiley: Hence the mismatch.

3 Likes

And then you put all the language on the same performance chart… Wat.

2 Likes

Yeah I know… But if you want to measure GC performance, make a benchmark that can’t be improved easily with a very common optimization :smiley: Or, yeah, don’t put them on the same chart, or at the very least also publish the result using the simple optimization.

But the current situation is just wrong. it gives the impression of “Julia can’t really be on the same level for this task because it has a GC”, which is obviously not true.

3 Likes

Yes, I googled, and it was tried (while GC never enabled again) in a Python program, and PR closed (ok for other?). I assume it’s the main guy commenting and closing:

It’s however a valid thing to do (or why have it in the standard library?), sometimes, when you know there are not going to be any frees for a while (I would like a macro that does, “prev = GC.enable(false); runs my code; GC.enabled(prev)” usable with with). I also posted this as a hint to a possible (fixable?) problem in tha code, or more likely other code.

EDIT: Having said that, the quote you (and he) gives, is from the description for the binary-tree code, that is meant to exercise the GC (an adaptation of “GCBench”), so doing this may be ok for other microbenchmarks on the page (at least mandelbrot, I didn’t see anything on disallowing such in its description or elsewhere I looked).

For spectnorm, I get 0.457363 seconds for 16 threads, only 11% slower than the C program (assuming it uses all 16 threads I have).

For 4 threads I get 64% slower, not 4x slower, so not scaling well (neither I guess the C program):

$ time julia -t4 -O2 --cpu-target=core2
julia> @timev main(5500);
  0.753225 seconds (979 allocations: 246.938 KiB)
elapsed time (ns): 753225326
bytes allocated:   252864
pool allocs:       976
malloc() calls:    3

Looking at the profile of the program, assuming I’m reading correctly, it seems the single “line” taking the most time is line 8:

Base.:+(a::m128d, b::m128d) =
    Base.llvmcall("""
        %res = fadd <2 x double> %0, %1
        ret <2 x double> %res
    """, m128d, Tuple{m128d, m128d}, a, b)

but “line 13”, just a small amount:

Base.:/(a::m128d, b::m128d) =
    Base.llvmcall("""
        %res = fdiv <2 x double> %0, %1
        ret <2 x double> %res
    """, m128d, Tuple{m128d, m128d}, a, b)

That seems very odd, wouldn’t division be slower? [Probably the profiler is inaccurate, with the latency long, but CPU getting to issue other instructions quickly after, before getting answer out.]

[EDIT:

It’s helpful to disable threading, to see what’s going on, otherwise, threading seems to take the single most time, also good no know about:
ProfileView.view(C=false, fontsize=40)

The rest of my post is answered, thanks, I assume, yes, jut not for the machine he benchmark runs on.]

And can you vectorize to do 4, 8 etc. additions (or divisions) at a time?

I get no error changing “<2 x double>” to “<4 x double>”, or even trying 160, but I didn’t run either using those definitions.