Julia’s position in the Debian Benchmarks Game can be improved, and the categorization of some Julia entries there is unfair

I don’t think this is fair. I don’t necessarily agree that the rules Isaac (@igouy in this thread) has set out for the benchmarks game are the most interesting ones, and I have argued with him a little on this exact issue, but he has always been consistent in how he applies them. Just above the bit you quote, the binary-trees directions say:

When possible, use default GC; otherwise use per node allocation or use a library memory pool.

He’s clearly drawing a line between “library” memory pools and “custom” (inline) memory pools. Languages without a default GC are explicitly allowed to use the former but not the latter, while languages with a default GC are allowed neither “library” nor “custom” memory pools.

Not sure if there was a different effort I wasn’t aware of, but I don’t think this is generally true of the ones in the BenchmarksGame.jl repo. As far as I know, all the fastest Julia implementations in the benchmarks game are now as fast as or faster than those in that repo, with two exceptions:

  1. Binary trees, which implemented a custom memory pool rather than using the default GC. The reason this isn’t allowed, and whether it should be, has been discussed several times, both on this forum and in the benchmarks game issue tracker.
  2. Mandelbrot, which used 32-bit floats while all the other implementations used 64-bit floats, and so couldn’t be accepted since it gave wrong results.

Unfortunately, at the time I was contributing to the benchmarks game, the extra load time from using an external package had a huge effect on total runtime, since the game explicitly includes JIT compilation, so it was never worthwhile. This might no longer be the case with recent Julia versions.

:eyes:!

8 Likes

I think you meant “with” here, not “without”.

In any case, the point stands that some languages (those with a default GC) are handed a different problem to solve than other languages (those without a default GC).

1 Like

Oops. Thanks. I’ll edit so later readers don’t get confused.

1 Like

To add to that, I think it’s pretty unique that Julia can so easily avoid the GC and build a memory pool in so few lines.
I have no idea what goal is served by putting a penalty on that, especially since there are already tons of different versions for the same language and problem (with threads, with SIMD, without, etc.), so why not permit one version with GC and one without?
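For the curious, here is a minimal sketch of that kind of pool (my own illustration, not the actual rejected submission): nodes are stored inline in one `Vector`, so the GC never sees them individually, and the whole tree is freed by resetting a counter.

```julia
# Minimal sketch of a per-node memory pool for binary-trees
# (illustrative only; not the actual rejected submission).
struct Node
    left::Int32    # index into the pool; 0 means leaf
    right::Int32
end

mutable struct Pool
    nodes::Vector{Node}
    top::Int
end
Pool(capacity::Int) = Pool(Vector{Node}(undef, capacity), 0)

function alloc!(pool::Pool, left::Int32, right::Int32)
    pool.top += 1
    pool.nodes[pool.top] = Node(left, right)
    return Int32(pool.top)
end

# Build a perfect tree of the given depth, returning its root index.
make!(pool::Pool, depth::Int) =
    depth == 0 ? alloc!(pool, Int32(0), Int32(0)) :
                 alloc!(pool, make!(pool, depth - 1), make!(pool, depth - 1))

# Count the nodes, as the benchmark's check pass does.
check(pool::Pool, i::Int32) =
    pool.nodes[i].left == 0 ? 1 :
        1 + check(pool, pool.nodes[i].left) + check(pool, pool.nodes[i].right)

pool = Pool(2^21)           # room for a depth-20 perfect tree
root = make!(pool, 20)
@assert check(pool, root) == 2^21 - 1
pool.top = 0                # "free" the whole tree in O(1)
```

Because `Node` is an isbits struct, nothing on the allocation path touches the GC at all.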

Anyways, don’t get me wrong, the rules are clear and not unfairly applied, but I do wonder why the Julia community should put any time into such a comparison, where there’s no chance to break even with the fast languages whatsoever, even though Julia can mostly reach the same performance :person_shrugging:

Juliac will be interesting, to at least get rid of the compilation penalty, but even then it still stands that Julia isn’t allowed to do the same optimisations as other languages.

8 Likes

For that specific benchmark, I think most languages can make an adequate memory pool within a reasonably sized file. I can only hazard a guess that the intent of differentiating “library” vs “custom” allocation strategies is that libraries are available to anyone and are designed to handle more problems than one benchmark.

Could at least compete with the other GC languages. I agree on separate benchmarks, but it’s not too difficult a caveat to filter out the non-GC-so-import-whatever languages. Earlier a Haskell entry was mentioned as using GHC.Compact to evade its GC; it’s interesting to note that it’s not much of an improvement over the next-lower Haskell entry, which is documented to predate that optimization. So after wading through the caveats of a GC benchmark, Julia has room to improve.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-java-7.html

Why is Java here 2.7x faster than Julia? It doesn’t seem to be cheating, i.e. it is not using memory pools (it does use a thread pool, which is OK; Julia also uses threads). All the faster programs, i.e. the non-GC languages, use memory pools.

(Lisp and) OCaml aren’t far behind, and the OCaml program has this comment:

(* Gc.set { (Gc.get()) with Gc.minor_heap_size = 1024 * 1024; max_overhead = -1; }; *)

It’s actually a hint to the GC (commented out in the submitted source), and maybe Java is similarly good without needing a hint. Can Julia match this (i.e. without a memory pool)? Feel free to also implement a version with a pool.
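For what it’s worth, the closest Julia analogs to that hint that I know of are the `--heap-size-hint` startup flag (available since Julia 1.9) and toggling collection around an allocation-heavy phase; a sketch:

```julia
# Closest Julia analogs to OCaml's Gc.set hint, as far as I know.
#
# From the command line (Julia 1.9+), ask the GC to aim for a bigger heap:
#   julia --heap-size-hint=2G binarytrees.jl 21
#
# Or, inside the program, postpone collection across a hot phase:
GC.enable(false)   # disable collection...
# ...allocation-heavy work here...
GC.enable(true)    # ...then re-enable it,
GC.gc()            # and collect once at a moment of our choosing
```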

I’m unsure: does Java (and seemingly C#) do heap compaction (and is this still used in practice, or only in old GC implementations)? Does it make things faster? I can’t actually see how it can be faster if it moves memory around (that’s overhead), and it isn’t possible in Julia (since Julia must be able to call C?). But why was it ever done, historically at least, if not for performance reasons?

https://www.oracle.com/webfolder/technetwork/tutorials/obe/java/gc01/index.html

Step 2a: Deletion with Compacting
To further improve performance, in addition to deleting unreferenced objects, you can also compact the remaining referenced objects. By moving referenced objects together, this makes new memory allocation much easier and faster.

I’m unclear on how this can work at all without changing all pointers into the still-used regions, and indirection doesn’t seem like it would help performance. Pointers aren’t used much in Julia, but you can get pointers to heap objects, e.g. for calling C, so that rules this out. Is there a way to support it when you know pointers aren’t needed?
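To make the constraint concrete, any Julia code can take the raw address of a heap object and hand it to C, which is exactly what a moving collector would invalidate (`pointer` and `GC.@preserve` are the standard Base tools; the snippet itself is just an illustration):

```julia
# Julia hands out raw addresses into GC-managed memory.
a = zeros(UInt8, 64)
p = pointer(a)              # raw Ptr{UInt8} to the array's data
GC.@preserve a begin
    # Inside this block `a` is kept alive, and because Julia's GC is
    # non-moving, `p` is guaranteed to stay valid, e.g. for a ccall.
    unsafe_store!(p, 0x7f)
end
@assert a[1] == 0x7f
```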

All the faster GC languages there seem to have or support a compacting GC (Java is using its default GC, one of many; the default since Java 9, G1, does compact).

Haskell is even faster, 3.4x faster than Julia, but it opts into some non-GC-tracked memory; I’m unsure if it’s compacting, it seems more like a memory pool.

https://www.reddit.com/r/haskell/comments/fxr4oj/ghccompact_and_the_new_gc/

We’re 47% slower than even Erlang, and it compacts (unless it’s no longer using such a GC).

Go is NOT compacting, and is 2x slower than Julia, which kind of supports my theory.

Go trades the complexity of a compacting GC for forcing you to restart the app regularly if you have a workload with plenty of heap allocation.

Lisp’s SBCL (2.36x faster) probably compacts; i.e., this paper is about an alternative collector that doesn’t(?), and isn’t good enough yet:
Parallel garbage collection for SBCL https://applied-langua.ge/~hayley/swcl-gc.pdf

The collector reclaims memory and allows for bump allocation without the collector needing to move objects, using a mark-region heap based on Immix … The parallel garbage collector using one core usually is slower than the copying collector of SBCL, outperforms copying with two cores, and continues to scale with more cores. …
The collector is not ready to be used in production yet, lacking support for the immobile space of SBCL, and lacking any kind of compaction.

Go moving to a compacting/copying GC (but not yet?):

Three garbage collectors: Java, Python, and Julia https://indico.cern.ch/event/1329685/contributions/5597296/attachments/2777423/4840807/hsf-accelerator-pivarski-gc.pdf

physicists are more interested in Julia than, say, Rust or Lua

Python’s GC has 3 generations (it is also not compacting), and Julia’s has “1 bit = 2 generations”.

1 Like

The big advantage of heap compaction is in allocation: smallish allocations can be done with a bump allocator, which is very fast, because all new allocations come out of an entirely contiguous region of free memory. Hence, heap allocation in Java is about as fast as stack allocation in C.
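To show how cheap that fast path is, here is a bump allocator in a handful of lines (a sketch over a malloc’d arena, not any particular runtime’s implementation):

```julia
# A bump allocator: the allocation fast path is one compare and one add.
mutable struct Arena
    base::Ptr{UInt8}
    size::Int
    offset::Int
end
Arena(size::Int) = Arena(Ptr{UInt8}(Libc.malloc(size)), size, 0)

function bump!(a::Arena, n::Int)
    a.offset + n <= a.size || error("arena exhausted")
    p = a.base + a.offset   # the "allocation" is the current pointer...
    a.offset += n           # ...plus a bump
    return p
end
```

That is what compaction buys Java: the free region is always contiguous, so every small allocation is effectively `bump!`.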

This is indeed how it works in Java. The GC selects a region with few live objects and then evacuates it: it relocates all the live objects to a new region and rewrites all object references that point to one of the evacuated objects.

Now, with FFI, you have a problem: if some C code has pointers to such an object, then the GC cannot find those pointers and hence cannot evacuate the region. In other words, the region needs to be pinned, and the user needs to make sure to do the right incantations for this, on pain of hard-to-debug crashes.

This is a royal pain. JNI / JNA are famous for how painful they are to use. At my day job, we recently got hosed by https://bugs.openjdk.org/browse/JDK-8276094: before Java 22, the default G1 garbage collector was bad at region pinning. Hence, your program would randomly crash with OOM if too many JNI critical sections happened to coincide with moments of GC pressure.

What I’m saying is: Compacting GC is better for performance, but makes FFI a pain.

Julia has chosen relatively painless FFI over performance of small heap allocations.

Julia can get away with that choice because it makes an effort to avoid too many small heap allocs: through language design, API patterns, and escape analysis (JVM HotSpot C2 escape analysis is a joke; Graal got better, though!).
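As a concrete instance of that language design: immutable isbits structs never touch the heap, so a loop can construct one per iteration with zero allocations (a toy example, unrelated to any specific benchmark):

```julia
# Immutable isbits structs live in registers or on the stack, so this
# loop heap-allocates nothing despite building a struct per iteration.
struct Vec2
    x::Float64
    y::Float64
end

function total(n::Int)
    s = Vec2(0.0, 0.0)
    for i in 1:n
        s = Vec2(s.x + i, s.y - i)   # no heap allocation here
    end
    return s
end

total(1)                                # warm up so compilation isn't counted
@assert @allocated(total(10_000)) == 0
```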

4 Likes

You mean without indirection? OK, you can change pointers on the heap, in theory or seemingly in practice (but not in Julia? because of its semantics? You can call C from Java too, so how do they manage that?), but you also need to change them in the call stack?! And even in registers? It seems like an almost impossible task, or one that rules out some register-allocation optimizations. In Julia, GC is triggered by allocations, i.e. in the middle of a function. I suppose it can be triggered there in Java too, but it likely needs to happen after returning from a call, between calls/stack frames?

Is that about opting out of compacting? I was thinking the same thing: maybe a split heap is worthwhile, one part with compacting (probably for smaller objects) and another non-compacting part for larger objects. Moving is O(n) in object size, so you likely don’t want to do it for large objects and/or too often. I guess you rely on things settling down, so after a while objects don’t move much. (As I understand it, Julia already does something similar: objects above a size threshold are malloc’d individually rather than allocated from the GC’s pools.)

But it’s only allocations that get faster, not deallocations, unless they’re done in reverse order. So it seems to me Libc.malloc and Libc.free have roughly symmetric performance, while in Java “malloc” is basically no cost (I read at some point it’s almost one assembly instruction) but freeing is not. You of course never call free directly there, but the work is still done by the GC, and the total work is more, even if each object is moved only once. So do you make up for it in cache effects? Is fragmentation really that bad?
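For comparison, the manual round trip in Julia (`Libc.malloc` and `Libc.free` are thin wrappers over the C functions, so both directions cost a real call):

```julia
# Manual allocation: both halves of the round trip do real work,
# whereas under a compacting GC "malloc" is a bump and "free" happens
# in bulk during collection.
p = Libc.malloc(64)                 # Ptr{Cvoid} from the system allocator
unsafe_store!(Ptr{UInt8}(p), 0x01)  # use the memory
Libc.free(p)                        # explicit, symmetric cost
```

As I understand it, the asymmetry is that a copying collector does work proportional to the live data rather than to the garbage: dead objects are never touched individually, so “free” is amortized to nearly nothing, and the contiguous nursery is also where the cache effects come from.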