Benchmark game challenge and some optimization questions

While I understand your other points, this could apply to any cross-language benchmark I guess?


I’m imagining an alternative benchmark game concept where the code would have to be simple/clean/beautiful, instead of being mercilessly optimized. This would be subjective and prone to bias, of course, but I think the results of such a benchmark would at least stand for something, in contrast to the Debian benchmarks game.

I think that is why the benchmarks game also takes code size into account, using compression. And when you plot the Pareto front, you get a nice curve on which Julia rather shines.


Source of the plot: my PhD thesis, “Machine learning and combinatorial optimization algorithms, with applications to railway planning” (PASTEL - Thèses en ligne de ParisTech)

Code to generate it
cd(@__DIR__)
using Pkg
Pkg.activate(@__DIR__)

using CSV
using DataFrames
using Downloads
using Plots
using StatsBase

#=
The Computer Language Benchmarks Game

Sources:
- <https://salsa.debian.org/benchmarksgame-team/benchmarksgame>
- <https://benchmarksgame-team.pages.debian.net/benchmarksgame/>

Official data
- <https://salsa.debian.org/benchmarksgame-team/benchmarksgame/-/blob/master/public/data/README.md>
- <https://twitter.com/ChapelLanguage/status/1484581096604016647>

Energy consumption
- <https://github.com/greensoftwarelab/Energy-Languages>
- <https://sites.google.com/view/energy-efficiency-languages>
- <https://dl.acm.org/doi/10.1145/3136014.3136031>
=#

plot_path = joinpath(@__DIR__, "..", "..", "..", "images", "autodiff")

data_path = Downloads.download(
    "https://salsa.debian.org/benchmarksgame-team/benchmarksgame/-/raw/master/public/data/data.csv",
)

data = CSV.read(data_path, DataFrame)
rename!(
    data,
    "name" => "task",
    "size(B)" => "size",
    "cpu(s)" => "cpu",
    "mem(KB)" => "mem",
    "elapsed(s)" => "elapsed",
    "busy(s)" => "busy",
)
data_rows_to_keep = (data[!, :status] .>= 0) .& (data[!, :task] .!= "simple")
data = data[data_rows_to_keep, :]

languages = unique(data[!, :lang])

tasks = unique(data[!, :task])

# Select run (id) with min cpu time
best_data_unfiltered = combine(
    g -> first(first(groupby(g, :cpu; sort=true))), groupby(data, [:lang, :task])
);

# Keep only languages where all tasks were successful
lang_tasks = combine(groupby(best_data_unfiltered, :lang), nrow => :lang_tasks)
best_data = leftjoin(best_data_unfiltered, lang_tasks; on=:lang)
best_data_rows_to_keep = best_data[!, :lang_tasks] .== length(tasks)  # all 10 tasks succeeded
best_data = best_data[best_data_rows_to_keep, :]

best_data_geomean = combine(groupby(best_data, :lang), :cpu => geomean, :size => geomean)
DataFrames.sort!(best_data_geomean, :cpu_geomean)


colors = distinguishable_colors(8, [RGB(1, 1, 1), RGB(0, 0, 0)]; dropseed=true)

markers = [:circle, :octagon, :hexagon, :pentagon, :square, :diamond, :utriangle]

pl = plot()
for (i, row) in enumerate(eachrow(best_data_geomean))
    lang = row[:lang]
    cpu = row[:cpu_geomean] / minimum(best_data_geomean[!, :cpu_geomean])
    size = row[:size_geomean] / minimum(best_data_geomean[!, :size_geomean])
    color = colors[mod1(i, length(colors))]
    markershape = markers[mod1(i, length(markers))]
    scatter!(pl, [cpu], [size]; label=lang, color=color, markershape=markershape)
    if lang == "julia"
        annotate!(pl, [cpu * 1.05], [size], (lang, 15, color, :left))
    else
        annotate!(pl, [cpu * 1.05], [size], (lang, 7, color, :left))
    end
end
plot!(
    pl;
    ylabel="Compressed code size (normalized)",
    xlabel="CPU time (normalized)",
    xscale=:log10,
    yscale=:log10,
    legend=false,
    margin=5Plots.mm,
)
savefig(pl, joinpath(plot_path, "clbg.pdf"))

I think to some degree at least this is mitigated in the Benchmarks Game by keeping multiple implementations per language around, and by the fact that the maintainer goes to some lengths to emphasize that the game compares implementations, not languages. A while ago I looked at a detailed comparison between C and Julia, and if I remember correctly, for most benchmarks the clean C version and the Julia version had very similar runtimes.


Your benchmark concept can be purely simple/clean/beautiful code because it is imaginary.

In contrast, the benchmarks game has been available in an imperfect world, year after year.

What I’m really taking away from this plot is that I should learn OCaml. :sweat_smile: I definitely didn’t expect that to be the fastest language.

(What compression algorithm is being used? I feel like a tokenizer followed by Huffman coding might do a better job of measuring what we’d call “readability”: repetitive code is easily compressible, but not at all easy to read!)
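As a rough illustration of that compressibility point (my sketch, not the game’s actual compression pipeline): repeated lines cost almost nothing after compression, while equally long but distinct lines do not.

```shell
# 100 identical lines vs. 100 distinct lines of similar length:
# gzip collapses the repetition, so the first byte count is much smaller.
yes 'total = total + values[i]' | head -n 100 | gzip -c | wc -c
seq 1000 1099 | gzip -c | wc -c
```

So a compressed-size metric does penalize verbosity, but copy-pasted boilerplate is nearly free, which is exactly the mismatch with human readability.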

I’m wondering how OCaml leads in performance here; it doesn’t at all on the Benchmarks Game site?

That might have to do with the fact that I’m applying some filtering and averaging steps which are not used on the website. I don’t guarantee my code is flawless; I posted it in the response above in case anyone wants to play around with it.

I’m looking at the benchmarks once again, and I noticed:

#define MAX_THREADS 64 [EDIT: it’s there in the program, but reading carefully, it always ends up using 4 threads… I still think oversubscribing might sometimes help, but it isn’t proven for C or Julia by my testing below.]

in the C program, which is the fastest implementation: fannkuch-redux C gcc #6 program (Benchmarks Game)

But the processor used for the game has only 4 cores. I believe this oversubscription can be useful (not just left in from running on a different computer), and it is not disallowed. The Julia program, however (like all of them there), is effectively run with -t 4, and might benefit from a higher count. And I think -t auto would do the same.

I believe the benchmark owner insists on the same setting being used by all the Julia programs.

The thread count should be in the hands of the programmer (overridable by a command-line setting), set in the program itself, as is possible for C and other languages. I don’t think that is currently possible in Julia: the count is fixed at startup. (In newer versions of Julia you can adopt foreign threads when C code calls in, but I believe that’s the only exception, unless some undocumented hack exists.)

I have 16 cores (can I limit the run, with cgroups or something, to 4 cores, and still try -t 16?), so the following does not have the same effect:
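One Linux answer to the “can I limit to 4 cores?” question is CPU-affinity pinning with taskset from util-linux (the EDIT further down lands on the same tool); a minimal check, assuming a Linux box with taskset and nproc available:

```shell
# nproc reports the CPUs visible to the current process, so pinning the
# child to the first 4 logical CPUs makes the second count print 4.
nproc
taskset --cpu-list 0-3 nproc
```

cgroups (the cpuset controller) would work too, but taskset needs no special privileges and applies per command.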

$ time julia -t 16 -- f.jl 12
3968050
Pfannkuchen(12) = 65

real	0m4,927s
user	0m52,183s
sys	0m0,540s

$ time julia -t 4 -- f.jl 12
3968050
Pfannkuchen(12) = 65

real	0m9,510s
user	0m35,418s
sys	0m0,473s

$ time julia -t 32 -- f.jl 12
3968050
Pfannkuchen(12) = 65

real	0m4,633s
user	0m58,399s
sys	0m0,593s

$ time julia -t 64 -- f.jl 12
3968050
Pfannkuchen(12) = 65

real	0m5,149s
user	1m6,331s
sys	0m0,570s

I get somewhat better timings with 1.10, but interestingly, consistently slightly worse for 9 threads than for 8, and then better again with higher counts, up to -t 12, or again at 16:

$ time julia +1.10 -t 8 -- f.jl 12
The latest version of Julia in the `1.10` channel is 1.10.0-beta3+0.x64.linux.gnu. You currently have `1.10.0-beta2+0.x64.linux.gnu` installed.
real	0m5,247s
user	0m36,933s
sys	0m0,502s

$ time julia +1.10 -t 9 -- f.jl 12
real	0m6,117s
user	0m40,552s
sys	0m0,522s

$ time julia +1.10 -t 16 -- f.jl 12
real	0m4,594s
user	0m50,897s
sys	0m0,819s

$ time julia +1.10 -t 4 -- f.jl 12
real	0m9,323s
user	0m35,318s
sys	0m0,509s

EDIT:
taskset seems to do exactly what I want: limit the run, here to 4 cores. Performance then scales up to -t 4, which seems optimal, at least for this program, and gets slightly slower beyond that (unlike for C?):

$ time taskset --cpu-list 0-3 julia +1.10 -t 4 -- f.jl 12
real	0m9,255s
user	0m34,501s
sys	0m0,352s

$ time taskset --cpu-list 0-3 julia +1.10 -t 8 -- f.jl 12
3968050
real	0m9,506s
user	0m35,599s
sys	0m0,337s
$ time taskset --cpu-list 0-3 ./fannkuchredux.gcc-6.gcc_run 12
3968050
Pfannkuchen(12) = 65

real	0m2,713s
user	0m10,755s
sys	0m0,004s


$ time taskset --cpu-list 0-3 ./fannkuchredux.gcc-6.gcc_run 12
3968050
Pfannkuchen(12) = 65

real	0m2,789s
user	0m10,893s
sys	0m0,004s

but if I change to:

I got 3,101s at first, then:
$ time taskset --cpu-list 0-3 ./fannkuchredux.gcc-6.gcc_run 12
3968050
Pfannkuchen(12) = 65

real	0m2,750s
user	0m10,776s
sys	0m0,000s

My machine is a bit loaded now; running repeatedly I get higher numbers that overlap, so it’s unclear whether the high MAX_THREADS define helps for C at all (unlike a similar setting for Julia).

On my (6-day-old) Julia master, on a loaded machine, scalability was interesting and non-default settings were needed. On 1.10, however, 5 or 6 (non-GC) threads, giving 6.867 sec, might be optimal: faster than 4, 5, 6, or seemingly any number of threads on master:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-julia-3.html

$ time julia-0cb5a0ecd4/bin/julia -O3 -t auto bt.jl 21

[was often over 10 sec]

real	0m9,355s
user	1m21,641s
sys	0m5,126s

vs 1.10:

real	0m8,281s
user	1m18,219s
sys	0m7,359s


vs. better with fewer threads:

$ time julia-0cb5a0ecd4/bin/julia -O3 -t 4 bt.jl 21

real	0m9,312s
user	0m22,536s
sys	0m2,981s

[the default is half as many GC threads as compute threads, I believe, e.g. 4 above; but at least here, more GC threads, though not more regular threads, help.]

$ time julia-0cb5a0ecd4/bin/julia -O3 -t8 --gcthreads=4 bt.jl 21
stretch tree of depth 22	 check: 8388607
real	0m11,879s
user	0m59,262s
sys	0m4,751s


$ time julia-0cb5a0ecd4/bin/julia -O3 -t4 --gcthreads=4 bt.jl 21
real	0m8,207s
user	0m25,817s
sys	0m2,456s

$ time julia-0cb5a0ecd4/bin/julia -O3 -t4 --gcthreads=8 bt.jl 21
real	0m7,401s
user	0m29,006s
sys	0m2,284s


11 or maybe 9 GC threads seemed optimal (though lower on 1.10 below):

$ time julia-0cb5a0ecd4/bin/julia -O3 -t4 --gcthreads=9 bt.jl 21
real	0m7,258s
user	0m28,938s
sys	0m2,551s


$ time julia-0cb5a0ecd4/bin/julia -O3 -t4 --gcthreads=10 bt.jl 21
real	0m7,416s
user	0m30,421s
sys	0m2,429s

$ time julia-0cb5a0ecd4/bin/julia -O3 -t4 --gcthreads=11 bt.jl 21

[I often got higher]

real	0m7,242s
user	0m30,358s
sys	0m2,537s

with 1.10:
$ time julia -O3 -t4 --gcthreads=11 bt.jl 21
real	0m7,131s
user	0m17,553s
sys	0m4,323s

though also low on 1.10 with fewer GC threads:
$ time julia -O3 -t4 --gcthreads=3 bt.jl 21
real	0m7,178s
user	0m16,927s
sys	0m4,430s


$ time julia-0cb5a0ecd4/bin/julia -O3 -t4 --gcthreads=16 bt.jl 21

real	0m7,281s
user	0m33,586s
sys	0m2,531s


$ time julia-0cb5a0ecd4/bin/julia -O3 -t4 --gcthreads=17 bt.jl 21
WARNING: running Julia with 17 GC threads on 16 CPU cores

real	0m7,896s
user	0m37,081s
sys	0m2,459s


$ time julia-0cb5a0ecd4/bin/julia -O3 -t4 --gcthreads=64 bt.jl 21
WARNING: running Julia with 64 GC threads on 16 CPU cores

real	0m7,763s
user	0m33,859s
sys	0m2,870s

The objective is to get “real” (wall-clock time) down (though for some, also “user” down, or even up, if that helps get “real” down).
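To make the real/user distinction concrete (my sketch, not from the thread above): “user” sums CPU seconds across all cores, so a well-parallelized program shows user well above real, and more threads can raise “user” while still lowering the wall-clock “real”.

```shell
# Two CPU-bound loops run in parallel: "real" stays near one loop's
# duration, while "user" accumulates the CPU time of both (roughly
# 2x "real" on a machine with at least 2 free cores).
time ( for n in 1 2; do
    sh -c 'i=0; while [ "$i" -lt 200000 ]; do i=$((i+1)); done' &
done; wait )
```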

While this applies to my machine, the one used in the benchmarks game has only 4 cores. Can anyone test on their machine, ideally one with 4 (or 8) cores?