Julia’s position in the Debian Benchmarks Game can be improved, and the categorization of some Julia entries there is unfair

I don’t think this is fair. I don’t necessarily agree that the rules Isaac (@igouy in this thread) has set out for the benchmarks game are the most interesting ones, and I have argued with him a little on this exact issue, but he has always been consistent in how he applies them. Just above the bit you quote, the binary-trees directions say:

When possible, use default GC; otherwise use per node allocation or use a library memory pool.

He’s clearly drawing a line between “library” memory pools and “custom” (inline) memory pools. Languages without a default GC are explicitly allowed to use the former but not the latter, while languages with a default GC are allowed neither “library” nor “custom” memory pools.

Not sure if there was a different effort I wasn’t aware of, but I don’t think this is generally true of the ones in the BenchmarksGame.jl repo. As far as I know, all the fastest Julia implementations in the benchmarks game are now as fast as or faster than those in that repo, with two exceptions:

  1. Binary trees, which implemented a custom memory pool rather than using the default GC. The reason this isn’t allowed, and whether it should be, has been discussed several times, both on this forum and in the benchmarks game issue tracker.
  2. Mandelbrot, which used 32-bit floats while all the other implementations used 64-bit floats, and so couldn’t be accepted since it gave wrong results.

Unfortunately, at the time I was contributing to the benchmarks game, the extra load time from using an external package had a huge effect on total runtime, since the game explicitly includes JIT compilation, so it was never worthwhile. This might no longer be the case with recent Julia versions.

:eyes:!

8 Likes

I think you meant “with” here, not “without”.

In any case, the point stands that some languages (those with a default GC) are handed a different problem to solve than other languages (those without a default GC).

1 Like

Oops. Thanks. I’ll edit so later readers don’t get confused.

1 Like

To add to that, I think it’s pretty unique that Julia can so easily avoid the GC and build a memory pool in so few lines.
I have no idea what goal is served by putting a penalty on that, especially since there are already tons of different versions for the same language and problem (with threads, with SIMD, without, etc.), so why not permit one version with GC and one without?
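For the curious, here is a minimal sketch of that kind of pool (my own illustration, not the actual rejected submission): nodes are stored inline in one `Vector`, so the GC never sees them individually, and the whole tree is freed by resetting a counter.

```julia
# Minimal sketch of a per-node memory pool for binary-trees
# (illustrative only; not the actual rejected submission).
struct Node
    left::Int32    # index into the pool; 0 means leaf
    right::Int32
end

mutable struct Pool
    nodes::Vector{Node}
    top::Int
end
Pool(capacity::Int) = Pool(Vector{Node}(undef, capacity), 0)

function alloc!(pool::Pool, left::Int32, right::Int32)
    pool.top += 1
    pool.nodes[pool.top] = Node(left, right)
    return Int32(pool.top)
end

# Build a perfect tree of the given depth, returning its root index.
make!(pool::Pool, depth::Int) =
    depth == 0 ? alloc!(pool, Int32(0), Int32(0)) :
                 alloc!(pool, make!(pool, depth - 1), make!(pool, depth - 1))

# Count the nodes, as the benchmark's check pass does.
check(pool::Pool, i::Int32) =
    pool.nodes[i].left == 0 ? 1 :
        1 + check(pool, pool.nodes[i].left) + check(pool, pool.nodes[i].right)

pool = Pool(2^21)           # room for a depth-20 perfect tree
root = make!(pool, 20)
@assert check(pool, root) == 2^21 - 1
pool.top = 0                # "free" the whole tree in O(1)
```

Because `Node` is an isbits struct, nothing on the allocation path touches the GC at all.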

Anyways, don’t get me wrong, the rules are clear and not unfairly applied, but I do wonder why the Julia community should put any time into such a comparison, where there’s no chance to break even with the fast languages whatsoever, even though Julia can mostly reach the same performance :person_shrugging:

Juliac will be interesting, to at least get rid of the compilation penalty, but even then it still stands that Julia isn’t allowed to do the same optimisations as other languages.

8 Likes

For that specific benchmark, I think most languages can make an adequate memory pool within a reasonably sized file. I can only hazard a guess that the intent of differentiating “library” vs “custom” allocation strategies is that libraries are available to anyone and are designed to handle more problems than one benchmark.

Could at least compete with the other GC languages. I agree on separate benchmarks, but it’s not too difficult a caveat to filter out the non-GC-so-import-whatever languages. Earlier a Haskell entry was mentioned as using GHC.Compact to evade its GC; it’s interesting to note that it’s not much of an improvement over the next-lower Haskell entry, which is documented to predate that optimization. So after wading through the caveats of a GC benchmark, Julia has room to improve.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-java-7.html

Why is Java here 2.7x faster than Julia? It doesn’t seem to be cheating, i.e. it is not using memory pools (it does use a thread pool, which is OK; Julia also uses threads). All the faster programs, i.e. the non-GC languages, use memory pools.

(Lisp and) OCaml aren’t far behind, and the OCaml program has this comment:

(* Gc.set { (Gc.get()) with Gc.minor_heap_size = 1024 * 1024; max_overhead = -1; }; *)

It’s actually a hint to the GC (commented out in the submitted source), and maybe Java is similarly good without needing a hint. Can Julia match this (i.e. without a memory pool)? Feel free to also implement a version with a pool.
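For what it’s worth, the closest Julia analogs to that hint that I know of are the `--heap-size-hint` startup flag (available since Julia 1.9) and toggling collection around an allocation-heavy phase; a sketch:

```julia
# Closest Julia analogs to OCaml's Gc.set hint, as far as I know.
#
# From the command line (Julia 1.9+), ask the GC to aim for a bigger heap:
#   julia --heap-size-hint=2G binarytrees.jl 21
#
# Or, inside the program, postpone collection across a hot phase:
GC.enable(false)   # disable collection...
# ...allocation-heavy work here...
GC.enable(true)    # ...then re-enable it,
GC.gc()            # and collect once at a moment of our choosing
```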

I’m unsure: does Java (and seemingly C#) do heap compaction (and is this still used in practice, or only in old GC implementations)? Does it make things faster? I can’t actually see how it can be faster if it moves memory around (that’s overhead), and it isn’t possible in Julia (since Julia must be able to call C?). But why was it ever done, historically at least, if not for performance reasons?

https://www.oracle.com/webfolder/technetwork/tutorials/obe/java/gc01/index.html

Step 2a: Deletion with Compacting
To further improve performance, in addition to deleting unreferenced objects, you can also compact the remaining referenced objects. By moving referenced objects together, this makes new memory allocation much easier and faster.

I’m unclear on how this can work at all without changing all pointers into the still-used regions, and indirection doesn’t seem like it would help performance. Pointers aren’t used much in Julia, but you can get pointers to heap objects, e.g. for calling C, so that rules this out. Is there a way to support it when you know pointers aren’t needed?
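To make the constraint concrete, any Julia code can take the raw address of a heap object and hand it to C, which is exactly what a moving collector would invalidate (`pointer` and `GC.@preserve` are the standard Base tools; the snippet itself is just an illustration):

```julia
# Julia hands out raw addresses into GC-managed memory.
a = zeros(UInt8, 64)
p = pointer(a)              # raw Ptr{UInt8} to the array's data
GC.@preserve a begin
    # Inside this block `a` is kept alive, and because Julia's GC is
    # non-moving, `p` is guaranteed to stay valid, e.g. for a ccall.
    unsafe_store!(p, 0x7f)
end
@assert a[1] == 0x7f
```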

All the faster GC languages there seem to have or support a compacting GC (Java is using its default GC, one of many; the default since Java 9, G1, does compact).

Haskell is even faster, 3.4x faster than Julia, but it opts into some non-GC-tracked memory; I’m unsure if it’s compacting, it seems more like a memory pool.

https://www.reddit.com/r/haskell/comments/fxr4oj/ghccompact_and_the_new_gc/

We’re 47% slower than even Erlang, and it compacts (unless it’s no longer using such a GC).

Go is NOT compacting, and is 2x slower than Julia, which kind of supports my theory.

Go trades the complexity of a compacting GC for forcing you to restart the app regularly if you have a workload with plenty of heap allocation.

Lisp’s SBCL (2.36x faster) probably compacts; i.e., this paper is about an alternative collector that doesn’t(?), and isn’t good enough yet:
Parallel garbage collection for SBCL https://applied-langua.ge/~hayley/swcl-gc.pdf

The collector reclaims memory and allows for bump allocation without the collector needing to move objects, using a mark-region heap based on Immix … The parallel garbage collector using one core usually is slower than the copying collector of SBCL, outperforms copying with two cores, and continues to scale with more cores. …
The collector is not ready to be used in production yet, lacking support for the immobile space of SBCL, and lacking any kind of compaction.

Go moving to a compacting/copying GC (but not yet?):

Three garbage collectors: Java, Python, and Julia https://indico.cern.ch/event/1329685/contributions/5597296/attachments/2777423/4840807/hsf-accelerator-pivarski-gc.pdf

physicists are more interested in Julia than, say, Rust or Lua

Python’s GC has 3 generations (it is also not compacting), and Julia’s has “1 bit = 2 generations”.

1 Like

The big advantage of heap compaction is in allocation: smallish allocations can be done with a bump allocator, which is very fast, because all new allocations come out of an entirely contiguous region of free memory. Hence, heap allocation in Java is about as fast as stack allocation in C.
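To show how cheap that fast path is, here is a bump allocator in a handful of lines (a sketch over a malloc’d arena, not any particular runtime’s implementation):

```julia
# A bump allocator: the allocation fast path is one compare and one add.
mutable struct Arena
    base::Ptr{UInt8}
    size::Int
    offset::Int
end
Arena(size::Int) = Arena(Ptr{UInt8}(Libc.malloc(size)), size, 0)

function bump!(a::Arena, n::Int)
    a.offset + n <= a.size || error("arena exhausted")
    p = a.base + a.offset   # the "allocation" is the current pointer...
    a.offset += n           # ...plus a bump
    return p
end
```

That is what compaction buys Java: the free region is always contiguous, so every small allocation is effectively `bump!`.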

This is indeed how it works in Java. The GC selects a region with few live objects and then evacuates it: it relocates all the live objects to a new region and rewrites all object references that point to one of the evacuated objects.

Now, with FFI, you have a problem: if some C code has pointers to such an object, then the GC cannot find those pointers and hence cannot evacuate the region. In other words, the region needs to be pinned, and the user needs to make sure to do the right incantations for this, on pain of hard-to-debug crashes.

This is a royal pain. JNI / JNA are famous for how painful they are to use. At my day job, we recently got hosed by https://bugs.openjdk.org/browse/JDK-8276094: before Java 22, the default G1 garbage collector was bad at region pinning. Hence, your program would randomly crash with OOM if too many JNI critical sections happened to coincide with moments of GC pressure.

What I’m saying is: Compacting GC is better for performance, but makes FFI a pain.

Julia has chosen relatively painless FFI over performance of small heap allocations.

Julia can get away with that choice because it makes an effort to avoid too many small heap allocs: through language design, API patterns, and escape analysis (JVM HotSpot C2 escape analysis is a joke; Graal got better, though!).
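As a concrete instance of that language design: immutable isbits structs never touch the heap, so a loop can construct one per iteration with zero allocations (a toy example, unrelated to any specific benchmark):

```julia
# Immutable isbits structs live in registers or on the stack, so this
# loop heap-allocates nothing despite building a struct per iteration.
struct Vec2
    x::Float64
    y::Float64
end

function total(n::Int)
    s = Vec2(0.0, 0.0)
    for i in 1:n
        s = Vec2(s.x + i, s.y - i)   # no heap allocation here
    end
    return s
end

total(1)                                # warm up so compilation isn't counted
@assert @allocated(total(10_000)) == 0
```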

4 Likes

You mean without indirection? OK, you can change pointers on the heap, in theory or seemingly in practice (but not in Julia? because of its semantics? You can call C from Java too, so how do they manage that?), but you also need to change them in the call stack?! And even in registers? It seems like an almost impossible task, or one that rules out some register-allocation optimizations. In Julia, GC is triggered by allocations, i.e. in the middle of a function. I suppose it can be triggered there in Java too, but it likely needs to happen after returning from a call, between calls/stack frames?

Is that about opting out of compacting? I was thinking the same thing: maybe a split heap is worthwhile, one part with compacting (probably for smaller objects) and another non-compacting part for larger objects. Moving is O(n) in object size, so you likely don’t want to do it for large objects and/or too often. I guess you rely on things settling down, so after a while objects don’t move much. (As I understand it, Julia already does something similar: objects above a size threshold are malloc’d individually rather than allocated from the GC’s pools.)

But it’s only allocations that get faster, not deallocations, unless they’re done in reverse order. So it seems to me Libc.malloc and Libc.free have roughly symmetric performance, while in Java “malloc” is basically no cost (I read at some point it’s almost one assembly instruction) but freeing is not. You of course never call free directly there, but the work is still done by the GC, and the total work is more, even if each object is moved only once. So do you make up for it in cache effects? Is fragmentation really that bad?
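For comparison, the manual round trip in Julia (`Libc.malloc` and `Libc.free` are thin wrappers over the C functions, so both directions cost a real call):

```julia
# Manual allocation: both halves of the round trip do real work,
# whereas under a compacting GC "malloc" is a bump and "free" happens
# in bulk during collection.
p = Libc.malloc(64)                 # Ptr{Cvoid} from the system allocator
unsafe_store!(Ptr{UInt8}(p), 0x01)  # use the memory
Libc.free(p)                        # explicit, symmetric cost
```

As I understand it, the asymmetry is that a copying collector does work proportional to the live data rather than to the garbage: dead objects are never touched individually, so “free” is amortized to nearly nothing, and the contiguous nursery is also where the cache effects come from.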