While I understand your other points, this could apply to any cross-language benchmark I guess?
I’m imagining an alternative benchmark game concept where the code would have to be simple/clean/beautiful, instead of being mercilessly optimized. This would be subjective and prone to bias, of course, but I think the results of such a benchmark would at least stand for something, in contrast to the Debian benchmarks game.
I think that is why the benchmarks game also takes code size into account, using compression. And when you plot the Pareto front, you get a nice curve where Julia shines.
Source of the plot: my PhD thesis, “Machine learning and combinatorial optimization algorithms, with applications to railway planning” (PASTEL - Thèses en ligne de ParisTech).
cd(@__DIR__)
using Pkg
Pkg.activate(@__DIR__)
using CSV
using DataFrames
using Downloads
using Plots
using StatsBase
#=
The Computer Language Benchmarks Game
Sources:
- <https://salsa.debian.org/benchmarksgame-team/benchmarksgame>
- <https://benchmarksgame-team.pages.debian.net/benchmarksgame/>
Official data
- <https://salsa.debian.org/benchmarksgame-team/benchmarksgame/-/blob/master/public/data/README.md>
- <https://twitter.com/ChapelLanguage/status/1484581096604016647>
Energy consumption
- <https://github.com/greensoftwarelab/Energy-Languages>
- <https://sites.google.com/view/energy-efficiency-languages>
- <https://dl.acm.org/doi/10.1145/3136014.3136031>
"""
=#
plot_path = joinpath(@__DIR__, "..", "..", "..", "images", "autodiff")
data_path = Downloads.download(
"https://salsa.debian.org/benchmarksgame-team/benchmarksgame/-/raw/master/public/data/data.csv",
)
data = CSV.read(data_path, DataFrame)
rename!(
data,
"name" => "task",
"size(B)" => "size",
"cpu(s)" => "cpu",
"mem(KB)" => "mem",
"elapsed(s)" => "elapsed",
"busy(s)" => "busy",
)
data_rows_to_keep = (data[!, :status] .>= 0) .& (data[!, :task] .!= "simple")
data = data[data_rows_to_keep, :]
languages = unique(data[!, :lang])
tasks = unique(data[!, :task])
# Select run (id) with min cpu time
best_data_unfiltered = combine(
g -> first(first(groupby(g, :cpu; sort=true))), groupby(data, [:lang, :task])
);
# Keep only languages where all tasks were successful
lang_tasks = combine(groupby(best_data_unfiltered, :lang), nrow => :lang_tasks)
best_data = leftjoin(best_data_unfiltered, lang_tasks; on=:lang)
best_data_rows_to_keep = best_data[!, :lang_tasks] .== length(tasks)  # language succeeded on every task
best_data = best_data[best_data_rows_to_keep, :]
best_data_geomean = combine(groupby(best_data, :lang), :cpu => geomean, :size => geomean)
DataFrames.sort!(best_data_geomean, :cpu_geomean)
sort!(languages)
colors = distinguishable_colors(8, [RGB(1, 1, 1), RGB(0, 0, 0)]; dropseed=true)
markers = [:circle, :octagon, :hexagon, :pentagon, :square, :diamond, :utriangle]
pl = plot()
for (i, row) in enumerate(eachrow(best_data_geomean))
lang = row[:lang]
cpu = row[:cpu_geomean] / minimum(best_data_geomean[!, :cpu_geomean])
size = row[:size_geomean] / minimum(best_data_geomean[!, :size_geomean])
color = colors[i % length(colors) + 1]
markershape = markers[i % length(markers) + 1]
scatter!(pl, [cpu], [size]; label=lang, color=color, markershape=markershape)
if lang == "julia"
annotate!(pl, [cpu * 1.05], [size], (lang, 15, color, :left))
else
annotate!(pl, [cpu * 1.05], [size], (lang, 7, color, :left))
end
end
plot!(
pl;
ylabel="Compressed code size (normalized)",
xlabel="CPU time (normalized)",
xscale=:log10,
yscale=:log10,
legend=false,
margin=5Plots.mm,
)
savefig(pl, joinpath(plot_path, "clbg.pdf"))
I think this is mitigated, to some degree at least, in the Benchmarks Game by keeping multiple implementations per language around, and by the fact that the maintainer goes to some lengths to emphasize that the game compares implementations, not languages. A while ago I looked at a detailed comparison between C and Julia and, if I remember correctly, for most benchmarks the clean C version and the Julia version had very similar runtimes.
Your benchmark concept can be purely simple/clean/beautiful code because it is imaginary.
In contrast, the benchmarks game has been available in an imperfect world, year after year.
What I’m really taking away from this plot is that I should learn OCaml. I definitely didn’t expect that to be the fastest language.
(What compression algorithm is being used? I feel like a tokenizer followed by Huffman coding might do a better job of measuring what we’d call “readability”–repetitive code is easily compressible, but not at all easy to read!)
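On the compression point: gzip's LZ77 stage is what rewards repetition, whereas a pure byte-level Huffman code only sees the symbol distribution. As a rough, stdlib-only illustration of that zeroth-order view (this is my own sketch, not the measure the benchmarks game actually uses), one can compute the Shannon entropy of the byte frequencies, which lower-bounds the bits per byte a byte-Huffman coder could reach:

```julia
# Zeroth-order proxy for compressibility: Shannon entropy of the byte
# distribution, in bits per byte. Note it ignores repetition structure
# entirely (50 copies of a line score the same as 1 copy), which is
# exactly the gap between Huffman coding alone and gzip's LZ77 stage.
function bits_per_byte(s::AbstractString)
    bytes = codeunits(s)
    counts = Dict{UInt8,Int}()
    for b in bytes
        counts[b] = get(counts, b, 0) + 1
    end
    n = length(bytes)
    return -sum(c / n * log2(c / n) for c in values(counts))
end

repetitive = repeat("foo(x) = x + 1\n", 50)     # small alphabet, very repetitive
varied = String(rand('a':'z', 750))             # same length, more uniform bytes
println(bits_per_byte(repetitive))               # lower
println(bits_per_byte(varied))                   # higher
```

So a tokenizer followed by Huffman coding would indeed measure something closer to "vocabulary size" than to the repetitiveness that gzip rewards.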
I’m wondering how OCaml leads in performance here; it doesn’t at all on the Benchmarks Game site?
That might have to do with the fact that I’m taking some filtering and averaging steps which are not done on the website. I don’t guarantee my code is flawless; I posted it in the response above in case anyone wants to play around.
I’m looking at the benchmarks once again, and I noticed:
#define MAX_THREADS 64
[EDIT: it’s there in the program, but reading carefully it always sets 4 threads… I still think oversubscribing might sometimes help, but the testing below doesn’t prove it for C or Julia.]
in the C program, which is the fastest implementation: fannkuch-redux C gcc #6 program (Benchmarks Game). But the processor has only 4 cores. Oversubscribing can be useful (I believe it’s intentional, not just left in from running on a different computer), and it isn’t disallowed. The Julia program, however (like all the Julia programs), is effectively run with -t 4, and might benefit from a higher count; I think -t auto would do the same.
I believe the benchmark owner insists on the same setting being used by all the Julia programs.
The value chosen should be in the hands of the programmer (overridable by a command-line setting) and set in the program, as is possible for C and other languages, but I don’t think that is possible in Julia currently: the thread count is fixed at startup. In newer versions of Julia you can take over threads that are added when foreign code calls in, but I believe that’s the only exception (maybe it’s possible with some undocumented hack?).
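To make the startup point concrete: as far as I know a running program can only inspect the thread count, not raise it. A minimal check (behavior as of recent 1.x releases):

```julia
using Base.Threads: nthreads

# The worker-thread count is fixed when the runtime boots, via `-t`/`--threads`
# or the JULIA_NUM_THREADS environment variable; code can read it but not
# increase it afterwards.
println("worker threads: ", nthreads())

# `-t auto` defaults to the logical core count the OS reports:
println("logical cores:  ", Sys.CPU_THREADS)
```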
I have 16 cores (can I limit them, with cgroups or something, to 4 cores, and still try running with -t 16?), so this does not have the same effect:
$ time julia -t 16 -- f.jl 12
3968050
Pfannkuchen(12) = 65
real 0m4,927s
user 0m52,183s
sys 0m0,540s
$ time julia -t 4 -- f.jl 12
3968050
Pfannkuchen(12) = 65
real 0m9,510s
user 0m35,418s
sys 0m0,473s
$ time julia -t 32 -- f.jl 12
3968050
Pfannkuchen(12) = 65
real 0m4,633s
user 0m58,399s
sys 0m0,593s
$ time julia -t 64 -- f.jl 12
3968050
Pfannkuchen(12) = 65
real 0m5,149s
user 1m6,331s
sys 0m0,570s
I get somewhat better timings with 1.10, but interestingly they are consistently slightly worse for 9 threads than for 8, and better again with higher counts (up to -t 12, or again for 16):
$ time julia +1.10 -t 8 -- f.jl 12
The latest version of Julia in the `1.10` channel is 1.10.0-beta3+0.x64.linux.gnu. You currently have `1.10.0-beta2+0.x64.linux.gnu` installed.
real 0m5,247s
user 0m36,933s
sys 0m0,502s
$ time julia +1.10 -t 9 -- f.jl 12
real 0m6,117s
user 0m40,552s
sys 0m0,522s
$ time julia +1.10 -t 16 -- f.jl 12
real 0m4,594s
user 0m50,897s
sys 0m0,819s
$ time julia +1.10 -t 4 -- f.jl 12
real 0m9,323s
user 0m35,318s
sys 0m0,509s
EDIT:
taskset seems to do exactly what I want: limit the run, here to 4 cores. It then scales up to -t 4, which seems optimal, at least for this program, and gets slightly slower after that (unlike for C?):
$ time taskset --cpu-list 0-3 julia +1.10 -t 4 -- f.jl 12
real 0m9,255s
user 0m34,501s
sys 0m0,352s
$ time taskset --cpu-list 0-3 julia +1.10 -t 8 -- f.jl 12
3968050
real 0m9,506s
user 0m35,599s
sys 0m0,337s
$ time taskset --cpu-list 0-3 ./fannkuchredux.gcc-6.gcc_run 12
3968050
Pfannkuchen(12) = 65
real 0m2,713s
user 0m10,755s
sys 0m0,004s
$ time taskset --cpu-list 0-3 ./fannkuchredux.gcc-6.gcc_run 12
3968050
Pfannkuchen(12) = 65
real 0m2,789s
user 0m10,893s
sys 0m0,004s
But if I change that define: I got 3,101s at first, then:
$ time taskset --cpu-list 0-3 ./fannkuchredux.gcc-6.gcc_run 12
3968050
Pfannkuchen(12) = 65
real 0m2,750s
user 0m10,776s
sys 0m0,000s
My machine is a bit loaded now; running repeatedly I get higher numbers that overlap, so it’s unclear whether the high define helps at all, or hurts (unlike a similar setting for Julia).
On my (6-day-old) Julia master (with a loaded machine, scalability was interesting and non-default settings were needed; on 1.10, however, 5 (non-GC) threads, giving 6.867 sec, might be optimal: faster than 4, 5, 6 or seemingly any number of threads on master):
https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-julia-3.html
$ time julia-0cb5a0ecd4/bin/julia -O3 -t auto bt.jl 21
[was often over 10 sec]
real 0m9,355s
user 1m21,641s
sys 0m5,126s
vs 1.10:
real 0m8,281s
user 1m18,219s
sys 0m7,359s
vs. better with fewer threads:
$ time julia-0cb5a0ecd4/bin/julia -O3 -t 4 bt.jl 21
real 0m9,312s
user 0m22,536s
sys 0m2,981s
[The default is half as many GC threads as worker threads, I believe, e.g. 4 above; but at least here more GC threads, though not more regular threads, help.]
$ time julia-0cb5a0ecd4/bin/julia -O3 -t8 --gcthreads=4 bt.jl 21
stretch tree of depth 22 check: 8388607
real 0m11,879s
user 0m59,262s
sys 0m4,751s
$ time julia-0cb5a0ecd4/bin/julia -O3 -t4 --gcthreads=4 bt.jl 21
real 0m8,207s
user 0m25,817s
sys 0m2,456s
$ time julia-0cb5a0ecd4/bin/julia -O3 -t4 --gcthreads=8 bt.jl 21
real 0m7,401s
user 0m29,006s
sys 0m2,284s
11 or maybe 9 GC threads seemed optimal (though lower on 1.10 below):
$ time julia-0cb5a0ecd4/bin/julia -O3 -t4 --gcthreads=9 bt.jl 21
real 0m7,258s
user 0m28,938s
sys 0m2,551s
$ time julia-0cb5a0ecd4/bin/julia -O3 -t4 --gcthreads=10 bt.jl 21
real 0m7,416s
user 0m30,421s
sys 0m2,429s
$ time julia-0cb5a0ecd4/bin/julia -O3 -t4 --gcthreads=11 bt.jl 21
[I often got higher]
real 0m7,242s
user 0m30,358s
sys 0m2,537s
with 1.10:
$ time julia -O3 -t4 --gcthreads=11 bt.jl 21
real 0m7,131s
user 0m17,553s
sys 0m4,323s
though also low on 1.10 with fewer GC threads:
$ time julia -O3 -t4 --gcthreads=3 bt.jl 21
real 0m7,178s
user 0m16,927s
sys 0m4,430s
$ time julia-0cb5a0ecd4/bin/julia -O3 -t4 --gcthreads=16 bt.jl 21
real 0m7,281s
user 0m33,586s
sys 0m2,531s
$ time julia-0cb5a0ecd4/bin/julia -O3 -t4 --gcthreads=17 bt.jl 21
WARNING: running Julia with 17 GC threads on 16 CPU cores
real 0m7,896s
user 0m37,081s
sys 0m2,459s
$ time julia-0cb5a0ecd4/bin/julia -O3 -t4 --gcthreads=64 bt.jl 21
WARNING: running Julia with 64 GC threads on 16 CPU cores
real 0m7,763s
user 0m33,859s
sys 0m2,870s
The objective is to get “real” down (though for some also “user” down, or up if it helps get “real” down).
While this applies to my machine, the one used in the benchmarks game has only 4 cores. Can anyone test on their machine, especially those with 4 (or 8) cores?
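If anyone wants to try, here is a sketch to automate the thread-count sweep. It assumes Linux with taskset available and that the benchmark program is saved as `f.jl` (a placeholder name); `--cpu-list 0-3` pins the run to 4 cores to mimic the benchmark machine, and both the core list and the counts should be adjusted for your hardware:

```julia
# Build the pinned command for a given worker-thread count. `script` and
# `arg` are placeholders for the benchmark file and its input size.
threadcmd(script, arg, n) = `taskset --cpu-list 0-3 julia -t $n -- $script $arg`

# Run the script once per thread count and report wall-clock ("real") time,
# which is the number the benchmarks game ranks by.
function sweep(script, arg; counts=(1, 2, 4, 8, 16))
    for n in counts
        t = @elapsed run(pipeline(threadcmd(script, arg, n); stdout=devnull))
        println("-t $n: ", round(t; digits=2), " s")
    end
end

# sweep("f.jl", 12)
```

One run per count is noisy, of course; on a loaded machine you would want to take the minimum over several repetitions.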