Does Debian's BenchmarkGames show representative performance?

  1. Source?
  2. I just measured that integrate(sin, 0, 1, 100000000) takes 960 ms in Julia, 930 ms in Rust and 73 seconds in Python. My point is that hot loops are fast in Julia and this is certainly true.
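The exact integrate isn't shown here, but a minimal sketch of the kind of hot loop I have in mind (assuming a simple midpoint rule; the real implementation may differ) would be:

function integrate(f, a, b, n)
    # midpoint rule over n subintervals: one tight floating-point loop
    h = (b - a) / n
    s = 0.0
    for i in 1:n
        s += f(a + (i - 0.5) * h)
    end
    return s * h
end

integrate(sin, 0, 1, 100_000_000)   # ≈ 1 - cos(1) ≈ 0.4597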
1 Like

E.g. here - the benchmark description got changed to allow some languages to use an objectively better allocation scheme for this specific benchmark, while denying others that same luxury. I also don’t buy their argument that “it’s all about GC” - if it is, then why include implementations in languages without GC in the first place? Also, why is this not prominently explained in the task description?

I don’t even have to disable GC to achieve similar performance to memory pooled implementations, I’m just not allowed to use them because “we have GC anyway”.

I don’t dispute that! In fact, I regularly optimize julia code way beyond the reasonable limit of what’s required or appropriate for the problem at hand :slight_smile:

4 Likes

One thing I disagree with is…

That doesn’t seem to be something said by Volker Weissmann, just your strawman.

… benchmark description got changed to allow some languages…

No, the description was changed to make what was intended more obvious.

I also don’t buy their argument that…

If you are determined not to understand then you will most likely succeed.

The one major issue with the benchmarks game is that it’s inconsistent in its rules around compilation. For C, Fortran, etc. it doesn’t measure the compilation time; with Julia it does. These codes are simple enough that most are even compatible with StaticCompiler.jl, and all are compatible with PackageCompiler.jl, so they can be built into .so binaries and run just like C or Fortran, but for some reason that’s forbidden for just Julia. Julia does pretty well on the benchmarks regardless, but the difference that remains relative to C/Fortran is precisely that more of the process is measured for the Julia runs than for the others, so take that as you will. It would be better to be consistent: precompile for every language that can, always include compilation as part of the benchmark, or just show both sets of results. As it stands, this benchmark has a big * on it, because it takes an explanation as to why the measurements are not really a 1-to-1 comparison between languages. Honestly, always timing GCC would be an interesting option to really see the workflow impact, but :person_shrugging:
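For illustration, a rough sketch of how ahead-of-time compilation could be done with PackageCompiler.jl (the file names here are hypothetical, and this is not how the benchmarks game actually runs Julia):

using PackageCompiler

# Build a custom system image that bakes the compiled benchmark code in.
# "bt_precompile.jl" is a hypothetical script that runs the benchmark once
# so its methods get compiled into the image.
create_sysimage(String[];
                sysimage_path = "bt_sys.so",
                precompile_execution_file = "bt_precompile.jl")

# Afterwards the benchmark starts without JIT cost, much like a C binary:
#   julia --sysimage bt_sys.so binarytrees.jl 21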

So does it show representative performance? Yes, though for some languages that’s representative of compilation + run time, and for others it’s run time, and it’s unstated in the chart which is which.

13 Likes

When Julia is presented like this?

“Julia features optional typing, multiple dispatch, and good performance, achieved using type inference and just-in-time (JIT) compilation, implemented using LLVM.”

Julia 1.7 Documentation, Introduction

That is one way to use it. Why not use the JIT for Fortran?

LFortran is a modern open-source (BSD licensed) interactive Fortran compiler built on top of LLVM. It can execute user’s code interactively to allow exploratory work (much like Python, MATLAB or Julia) as well as compile to binaries with the goal to run user’s code on modern architectures such as multi-core CPUs and GPUs.

When there are multiple ways to compile, why not benchmark them all and share the information? Wouldn’t it be interesting to see all of the different cases?

4 Likes

Julia’s next generation of static compilation is coming, so in a year or two it could be worth another attempt.

As for gc, I think it’s good that some benchmarks show other forms of memory management outperforming tracing gc, since that does happen in practice.

1 Like

I totally encourage you to do that!

1 Like

Seems like I hit a nerve :sweat_smile: I was referring to this part of the original post:

Julia is advertised as a fast language. Whether Julia fulfills this promise depends on how one defines “performance”. One way would be to do what the debian benchmarksgame did and implement a simple problem with a hot loop in this language.

I don’t know about you, but to me that sounds like saying that BenchmarksGame is representative when it comes to performance :person_shrugging: Either way, it’s just my opinion and since the opinion of the community was asked in the original thread, that’s what I gave :slight_smile:

But I want to understand! That’s why the very next sentence is a question about why languages with GC and languages without GC are compared using a benchmark that (by your own words) is meant to compare GC! I’m fine with Julia lagging behind Java in this one, because truth be told our GC is not optimized for this workload. What I don’t agree with is having languages that don’t even have a GC lead a benchmark that’s supposedly comparing GC performance, and allowing them to implement whatever strategy they feel is best for this specific benchmark, instead of having to optimize for the general case the language is intended for, as a GC has to. This same argument can be made for any GC vs. no-GC language in this benchmark; it’s not exclusive to Julia.

So let me ask directly: What exactly is that benchmark supposed to show?

4 Likes

… this part of the original post…

Thank you for confirming that you put your words into his mouth.

So disagreeing with “One way to do that would be to do what BenchmarksGame did” is putting words in someone’s mouth, gotcha :sweat_smile:

1 Like

I think it’s better if the discussion stays productive…

3 Likes

The rules of binary-trees say

When possible, use default GC; otherwise use per node allocation or use a library memory pool.

For example, Rust doesn’t have a default GC so the Rust binary-trees implementation uses an arena allocator from a package (“bumpalo”).

In this spirit, perhaps it would make sense to have both GC and arena package implementations for Julia. That way it would be possible to see how Julia’s default GC compares to an arena, within the same language. @igouy would that work for you?

Well, this depends on your definition of “GC”. If you take it as “automatic memory management”, i.e. you don’t have to manually allocate and deallocate memory, then Rust very much has a default “garbage collector” that manages memory for you. You never have to allocate memory yourself (unlike in C here! You can’t allocate the tree on the stack, so the only option left is to call malloc, or to use a library that calls malloc for you). The compiler inserts alloc and dealloc calls for you, based on lifetime inference from Rust’s semantics (aside from the lifetimes and the eager dealloc, this is very similar to how GC-based languages do it - they often forgo immediate deallocation in favor of periodically cleaning up allocated memory, based on indicators like reference counting or tracing). EDIT: Not to mention, Rust has Rc, which does reference counting (an accepted form of GC), as part of its standard library as well. Why reach for an external library when you have a perfectly good form of GC at hand already?

It just turns out that this is absolutely terrible for performance here, so every language that successfully argued it doesn’t really have a GC (or truly doesn’t) got to use a library-implemented memory pool. This is what I’m objecting to - if the goal is to compare GC implementations, why have languages that argue they don’t have a GC in that benchmark at all? All it ends up showing is that for this task, using a memory pool as your allocation scheme is (probably) near optimal. It only adds noise to an otherwise useful benchmark comparing different general-purpose GC implementations.
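To make that concrete, here is a purely illustrative sketch of what an index-based memory pool for this task could look like in Julia (names and layout are my own, not any submitted entry): all nodes live in one growable vector, and the whole tree is freed at once by discarding the arena.

struct ArenaNode
    l::Int32   # index of the left child within the arena; 0 marks a leaf
    r::Int32   # index of the right child
end

struct Arena
    nodes::Vector{ArenaNode}
end
Arena(depth::Integer) = Arena(sizehint!(ArenaNode[], 2^(depth + 1)))

function make!(a::Arena, depth)
    if depth == 0
        push!(a.nodes, ArenaNode(0, 0))
    else
        l = make!(a, depth - 1)
        r = make!(a, depth - 1)
        push!(a.nodes, ArenaNode(l, r))
    end
    return Int32(length(a.nodes))   # index of the node just pushed
end

function check_arena(a::Arena, i)
    n = a.nodes[i]
    n.l == 0 ? 1 : 1 + check_arena(a, n.l) + check_arena(a, n.r)
end

a = Arena(10)
root = make!(a, 10)
check_arena(a, root)   # a full tree of depth 10 has 2^11 - 1 == 2047 nodes

Discarding the whole Arena releases every node at once, which is exactly the bulk-deallocation advantage the GC-only entries are denied.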

4 Likes

As before, I encourage Chris Rackauckas or anyone else who’s interested to make those measurements and share those measurements.

I’m asking in particular whether an arena package implementation would be accepted into Benchmarks Game. I think it would be a useful benchmark comparison to see the difference between a GC implementation and an arena implementation within the same language. That would help users interpret the benchmark results. Setting up benchmarking infrastructure is a significant job (as you know better than most), and it’s useful to have a trustworthy platform for evaluating these questions.

@palli and I recently did some binary tree tests measuring the influence of GC and startup times. With the MWE

using BenchmarkTools   # for @btime, used interactively for the timings below

struct Node
    l::Union{Node,Nothing}
    r::Union{Node,Nothing}
end

function make(n)
    n === 0 ? Node(nothing, nothing) : Node(make(n - 1), make(n - 1))
end

function check(node)
    node.l === nothing ? 1 : 1 + check(node.l) + check(node.r)
end

function binary_trees(io, n)
    write(io, "stretch tree of depth $(n+1)\t check: $(check(make(n+1)))\n")

    long_tree::Node = make(n)

    d = 4
    while d <= n
        niter = 1 << (n - d + 4)
        ct = Vector{Int}(undef, niter)
        GC.enable(false)            # defer garbage collection during the hot loop
        let d = d                   # rebind d so the threaded closure captures a local copy
            Threads.@threads for i = 1:niter
                ct[i] = check(make(d))
            end
        end
        GC.enable(true)
        GC.gc(false)                # one incremental collection after the loop
        c = sum(ct)
        write(io, "$niter\t trees of depth $d\t check: $c\n")
        d += 2
    end

    write(io, "long lived tree of depth $n\t check: $(check(long_tree))\n")
end#function

isinteractive() || @time binary_trees(stdout, parse(Int, ARGS[1]))

we get something like

  3.159 s (302689310 allocations: 9.04 GiB) # no GC, measured in VS Code via @btime
TotalMilliseconds : 4656,7325 # no GC, incl. startup & compilation measured in PS
  4.545 s (302689301 allocations: 9.04 GiB) # delayed GC, measured in VS Code via @btime
TotalMilliseconds : 5303,0138 # delayed GC,  incl. startup & compilation measured in PS
  6.321 s (305666479 allocations: 9.07 GiB) # current recordholder, measured in VS Code
TotalMilliseconds : 7163,3355 # current recordholder, incl. startup & compilation measured in PS

We were asking ourselves: does delaying GC like this conform to the benchmark rules?

1 Like

I love seeing you all so passionate about something I don’t understand! It is nice being introduced to “new” topics :blush:

I am a bit curious, though: when you speak about performance, most of us think about speed of execution. Do we happen to have any measures of performance with regard to writability of code, number of lines, etc.? I often feel that a lot of the time that is a huge factor in the performance of a language, since my understanding is that at the end of the day “all languages can achieve the same performance”.

Kind regards

2 Likes