Blog post: Rust vs Julia in scientific computing

Can you show some examples of those type instabilities in cases where they are actually relevant (not by design)? I have type instabilities in my packages, but only where they are actually convenient. I don’t remember ever being unable to make some code type-stable because of an inherent type instability in Base functions (or in functions of packages that are meant to be used in performant code). Do you mean things that should be type-stable in Base, but are not?

(or, put more simply, these should be (are?) open issues?)

3 Likes

I wonder if the fraction of people adopting JET will change once static compilation becomes available.

2 Likes

What’s intriguing here is that we’ve managed to separate static compilation from static analysis. Static compilation is available in several forms today, but they do not necessarily require us to do static analysis. Most statically compiled languages combine the two into one step.

2 Likes

Apologies if someone has already mentioned this, but I wanted to address one small technical point about GC vs manually-managed memory. First, let us all agree that allocation-free code will nearly always be better (which it sounds like both Julia and Rust are sometimes capable of), and restrict ourselves to the case where any decent solution to a problem requires some form of dynamic memory allocation. In such cases, is it really clear that you get better performance from manual management? I don’t know if you’ve looked at the implementation of C/C++'s free, but it’s not exactly trivial. There are papers addressing this question, and they often find that the two are pretty competitive:

https://www.researchgate.net/publication/340314454_Garbage_Collection_vs_Manual_Memory_Management_Performance_and_Memory_Consumption_Issues

A possibly-tolerable analogy: if I’m cooking a meal, does the strategy I use to load dirty dishes into the dishwasher matter? If I load them one-by-one (i.e., like manual memory management), I never have a big pile of dishes next to the sink, but I pay a small overhead for opening and closing the dishwasher each time. Conversely, if I let them pile up (i.e., like GC), I get some economies of scale from “batch-loading”; but if I let the piles grow too large, I end up wasting a lot of time managing the risk of the entire stack tipping over.

In my view, given that they are fairly equivalent in performance, I’d choose GC for any system that isn’t very memory-restricted (e.g., excluding embedded systems which profit a lot from a manually-managed memory scheme).

In Julia, because type instability is often associated with GC pressure, there’s a natural tendency to think “GC = bad performance.” But this is more a statement about bad code than a real comparison between the two memory-management strategies.
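To make that association concrete, here is a minimal sketch (the names are just illustrative): a value stored through an Any-typed Ref must be boxed on the heap, so the type-unstable call allocates on every loop iteration while the concretely typed one allocates nothing.

function accumulate_ref(r)
  for i = 1:10^6
    r[] += i                     # with Ref{Any}, each store boxes a fresh Float64
  end
  return r[]
end

r64 = Ref{Float64}(0.0)
rany = Ref{Any}(0.0)
accumulate_ref(r64); accumulate_ref(rany)   # compile both specializations first
@time accumulate_ref(r64)    # effectively 0 allocations
@time accumulate_ref(rany)   # ~10^6 allocations, i.e. steady work for the GC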

34 Likes

Some companies are switching from Go (which uses GC) to Rust (which uses ownership) because Rust provides more consistent performance.

4 Likes

With manual memory management you can choose the appropriate allocation and collection strategy for your use case. “One by one” isn’t the only option.

With GC you can also choose the appropriate allocation and collection strategy for your use case. “One by one” isn’t the only option.


If, in your manually managed language, you’re okay with pre-allocating a bunch of memory (or an arena, or whatever) ahead of time, re-using those buffers during a long-running calculation, and freeing them at the end, you can do exactly the same in Julia, and the performance differences between GC and manual management (insofar as they exist) become even less noticeable or relevant.
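For instance, a sketch of that pattern in Julia (the names are illustrative): allocate the working buffer once up front, mutate it in place inside the long-running loop, and let the GC reclaim it only when the whole computation is done.

function simulate!(out, buf, steps)
  for _ = 1:steps
    @. buf = 0.99 * buf + 0.01   # in-place update: no allocation
    out .+= buf
  end
  return out
end

buf = rand(10_000)               # the "arena", allocated once
out = zeros(10_000)
simulate!(out, buf, 1_000)       # the hot loop itself never touches the allocator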

8 Likes

In fact, this is the only way to build reliable high-frequency trading systems, where you are effectively inside one long-running calculation for the whole trading session: you cannot afford the “one by one” approach, and even less a GC intervention in the middle of the session. And Julia shines at the task.

I can imagine that there are lots of fields where consistency matters - and as long as you are not in a memory-restricted environment, Julia will do the job just fine.

I’m aware of a couple of HFT firms using Rust, but none using Julia; do you know of one? That’s not bait; I’m genuinely curious.

1 Like

The thing that always annoys me in discussions about GC is that GC is a feature added to the language; not having GC is the absence of a feature. You can do manual memory management in Julia just fine if that’s what you’re into; it’s just not especially common (though not unheard of either).

Rust’s borrow checker is different, because it actually is a positive feature (though one that’s only really possible by removing a bunch of other features), and it can do things that are legitimately hard to replicate in Julia. But the performance the borrow checker enables is not something that sets Rust apart from Julia: you can replicate it just fine.

The borrow checker is different because it gives you most of the performance benefits of manual memory management and buffer re-use, without the safety headaches of verifying by hand that what you’re doing is memory safe.

9 Likes

In the Discord article, they said Go’s GC latency was the problem, because it showed spikes every 2 minutes. In particular, they traced it back to the GC having to scan all live objects every 2 minutes, among which was a very large LRU cache with lots of entries. So it was not the GC’s overall overhead relative to the program’s uptime that made them switch to Rust; it was the spikes.

Of course you can preallocate everything in Julia. But IIUC, Julia’s GC would then still need to do a full sweep like Go’s every once in a while, and this might cause similar spikes, no?

Unless someone is specifically concerned with latency, I think @ranocha’s post above gave a nice example of a scientific computation where Julia’s GC time is really not the computation’s bottleneck, as long as the simulation runs for more than ~10 s or so.

2 Likes

Sounds like it’s simply a bad implementation of the GC if it gets triggered and pauses execution in a program that doesn’t allocate, but I don’t know and I don’t care how Go’s GC works. The existence of bad GCs doesn’t mean that GC is categorically bad.

Julia’s GC will not stop execution if you’re not allocating memory.
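And if you want belt-and-braces control, here is a sketch (run_hot_loop! stands in for a hypothetical allocation-free workload): you can switch collection off around the critical region and re-enable it afterwards.

GC.enable(false)      # no collections from here on
try
  run_hot_loop!()     # hypothetical allocation-free hot path
finally
  GC.enable(true)     # re-enable; any backlog gets collected later
end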

2 Likes

Indeed. However, one must then make sure not to make a single allocation anywhere else, which automatically rules out all uses of function barriers to hide type-unstable code, does it not?

I am not arguing against Julia having a GC, or saying that its GC is bad; the same goes for Go.

No, just don’t do any allocations in the part of your program that’s actually latency sensitive and you’ll be fine.

If you’re worried about GC performance, you should be even more worried about type unstable code in a hot loop.
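One way to enforce the first point during development, as a sketch (hot_step! is just an illustrative kernel): assert that the hot path really is allocation-free once it has been compiled.

function hot_step!(buf)   # illustrative in-place kernel
  @. buf = 2 * buf
  return nothing
end

buf = rand(1024)
hot_step!(buf)                            # compile first
@assert @allocated(hot_step!(buf)) == 0   # fails if an allocation sneaks in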

2 Likes

(Specifically, in cases like this you probably want to use something like FunctionWrappers.jl rather than function barriers to make your code type-stable.)
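A sketch of what that looks like (the callback set here is made up): a FunctionWrapper pins the call signature, so a collection of heterogeneous callbacks gets a concrete element type, and calls through it infer, at the cost of one indirection per call.

using FunctionWrappers: FunctionWrapper

# Concrete eltype, unlike a Vector{Function}:
callbacks = [FunctionWrapper{Float64,Tuple{Float64}}(fn)
             for fn in (sin, cos, x -> 2x)]

apply_all(cbs, x) = sum(cb(x) for cb in cbs)   # return type infers as Float64
apply_all(callbacks, 1.0)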

3 Likes

Modest type instabilities can be handled by union splitting. Indeed, this is the only way finite iterators can be type-“stable”: iterate must return a (value, state) tuple while elements remain, and nothing once exhausted. But union splitting is relevant in plenty of other places too:

julia> using BenchmarkTools

julia> A = convert(Vector{Union{Int,Float64}}, rand((1.0, 2), 100)); B = similar(A);

julia> @btime $B .= 2 .* $A;
  75.386 ns (0 allocations: 0 bytes)
1 Like

It seems the Discord example was one where (1) the code allocates, and (2) they need performance to be predictable.

That doesn’t mean the total time was larger with the GC, but that they couldn’t afford pauses, given the nature of the application (a user-facing service at huge scale). It seems reasonable to me that for those specific requirements a GC is not the best approach.

At the same time, the fact that Discord got by acceptably with a GC for so long makes me pretty confident that anything I’ll ever do won’t be bottlenecked by that.

P.S. They do mention that Go forces the GC to be triggered every 2 minutes even without allocations, which is sort of bizarre.

1 Like

That’s what I mean by a bad GC. It wasn’t that the pauses were long or whatever; it’s that the GC interferes with your program even if you’re not interacting with it. That’s the sort of horrible crap that gives GC such a bad reputation.

2 Likes

I am using Julia personally, and I actually hope other people don’t, not because Julia is slow, but because it’s super fast to iterate on your model and push to production, so if the competition doesn’t use it, I’m happy :)).
Coding a super-fast HFT system needs a whole different level of unsafe practices in threads/IO/… and shenanigans like writing branchless code while constantly checking the generated ASM. Writing it in Rust kinda defeats the point. I have also heard that some shops are switching to Rust, which I find hard to understand.

4 Likes

I’ve often seen sentiments like this repeated, but given the choice, I’d take RAII over GC.
Unfortunately, Julia’s semantics don’t make RAII possible.

So keep in mind that a C++ implementation does not need to, and probably should not, manually call malloc/free or use the new/delete keywords.

using StrideArraysCore

# Three cheap elementwise kernels, one per column of X.
f(x) = @fastmath typeof(x)(2) * x
g(x) = @fastmath typeof(x)(3) * x
h(x) = @fastmath typeof(x)(4) * x

# Apply f, g, h to the three columns of X, writing the results into m.
@inline function rowwise!(m, X, f, g, h)
  num_rows, num_cols = size(X)
  @assert num_cols == 3
  @inbounds @simd ivdep for j = 1:num_rows
    m[j, 1] = f(X[j, 1])
    m[j, 2] = g(X[j, 2])
    m[j, 3] = h(X[j, 3])
  end
  return m
end

# Dispatch tags selecting which memory-management strategy to benchmark.
struct GarbageCollector end;
struct LibcMalloc end;
struct MiMalloc end;
struct JeMalloc end;
using jemalloc_jll, mimalloc_jll

# The GC variant allocates an ordinary Array and lets the collector reclaim
# it; the others call straight into the respective C allocators.
malloc(::GarbageCollector, N) = Array{UInt8}(undef, N)
malloc(::LibcMalloc, N) = Libc.malloc(N)
malloc(::MiMalloc, N) =
  @ccall mimalloc_jll.libmimalloc.mi_malloc(N::Int)::Ptr{Cvoid}
malloc(::JeMalloc, N) =
  @ccall jemalloc_jll.libjemalloc.malloc(N::Int)::Ptr{Cvoid}
free(::GarbageCollector, x) = nothing # nothing to do; the GC owns the Array
free(::LibcMalloc, x) = Libc.free(x)
free(::MiMalloc, x) =
  @ccall mimalloc_jll.libmimalloc.mi_free(x::Ptr{Cvoid})::Nothing
free(::JeMalloc, x) =
  @ccall jemalloc_jll.libjemalloc.free(x::Ptr{Cvoid})::Nothing

# Allocate a num_rows x num_cols Float64 buffer, fill it with rowwise!,
# reduce it, and free it again, n times in a row, using the strategy
# selected by the `mem` tag.
function foo(mem, X, f::F, g::G, h::H, n) where {F,G,H}
  x = 0.0
  num_rows, num_cols = size(X)
  for _ = 1:n
    p = malloc(mem, 8 * num_rows * num_cols)
    GC.@preserve p begin
      A = PtrArray(Base.unsafe_convert(Ptr{Float64}, p), (num_rows, num_cols))
      rowwise!(A, X, f, g, h)
      x += sum(A)
    end
    free(mem, p)
  end
  x
end

# Each thread runs the full single-threaded workload, so perfect scaling
# means the threaded version takes the same wall time as foo.
function foo_threaded(mem, X, f::F, g::G, h::H, n) where {F,G,H}
  x = Threads.Atomic{Float64}(0.0)
  Threads.@threads for _ in 1:Threads.nthreads()
    Threads.atomic_add!(x, foo(mem, X, f, g, h, n))
  end
  x[]
end

X = rand(100, 3);
@time foo(GarbageCollector(), X, f, g, h, 30)
@time foo(LibcMalloc(), X, f, g, h, 30)
@time foo(MiMalloc(), X, f, g, h, 30)
@time foo(JeMalloc(), X, f, g, h, 30)
@time foo_threaded(GarbageCollector(), X, f, g, h, 30)
@time foo_threaded(LibcMalloc(), X, f, g, h, 30)
@time foo_threaded(MiMalloc(), X, f, g, h, 30)
@time foo_threaded(JeMalloc(), X, f, g, h, 30)

@time foo(GarbageCollector(), X, f, g, h, 30_000_000)
@time foo(LibcMalloc(), X, f, g, h, 30_000_000)
@time foo(MiMalloc(), X, f, g, h, 30_000_000)
@time foo(JeMalloc(), X, f, g, h, 30_000_000)
@show Threads.nthreads();
@time foo_threaded(GarbageCollector(), X, f, g, h, 30_000_000)
@time foo_threaded(LibcMalloc(), X, f, g, h, 30_000_000)
@time foo_threaded(MiMalloc(), X, f, g, h, 30_000_000)
@time foo_threaded(JeMalloc(), X, f, g, h, 30_000_000)

I get

julia> @time foo(GarbageCollector(), X, f, g, h, 30_000_000)
 19.134882 seconds (30.00 M allocations: 71.526 GiB, 12.61% gc time)
1.4034906850760502e10

julia> @time foo(LibcMalloc(), X, f, g, h, 30_000_000)
  3.349162 seconds (1 allocation: 16 bytes)
1.4034906850760502e10

julia> @time foo(MiMalloc(), X, f, g, h, 30_000_000)
  3.026271 seconds (1 allocation: 16 bytes)
1.4034906850760502e10

julia> @time foo(JeMalloc(), X, f, g, h, 30_000_000)
  3.278054 seconds (1 allocation: 16 bytes)
1.4034906850760502e10

julia> @show Threads.nthreads();
Threads.nthreads() = 36

julia> @time foo_threaded(GarbageCollector(), X, f, g, h, 30_000_000)
235.488171 seconds (1.08 G allocations: 2.515 TiB, 53.87% gc time)
4.957047359932209e11

julia> @time foo_threaded(LibcMalloc(), X, f, g, h, 30_000_000)
  7.382309 seconds (221 allocations: 21.234 KiB)
4.957047359932209e11

julia> @time foo_threaded(MiMalloc(), X, f, g, h, 30_000_000)
  3.553753 seconds (221 allocations: 21.234 KiB)
4.957047359932209e11

julia> @time foo_threaded(JeMalloc(), X, f, g, h, 30_000_000)
  3.986570 seconds (224 allocations: 21.328 KiB)
4.957047359932209e11

We have a single-threaded and a multithreaded version.
The multithreaded version just calls the single-threaded one Threads.nthreads() times, so perfect scaling means they take the same amount of time.

Single-threaded, we have 12.61% GC time, but we’re over 6x slower than the non-GC implementations.
Multithreaded, we have 53.87% GC time, and we’re over 66x and 59x slower than the good non-GC implementations (mimalloc and jemalloc), and about 32x slower than the bad one (the Linux default).

So

  1. GC scales far worse with multithreaded allocations than even a bad malloc/free.
  2. GC’s overhead is far higher than simply the reported GC time.

Elaborating…

  1. There are of course much worse malloc/free implementations than the Linux default, but reasonable ones are built to handle multiple threads much better than Julia’s GC. For example, glibc’s malloc (the Linux default) keeps separate per-thread heaps, so multiple threads have no contention as long as each thread frees the memory it allocated. Many allocators are also built to handle thread migration of memory efficiently; e.g., the leanM and mstress benchmarks test that here, although I’ve no idea how a GC would compare, or how much slower mimalloc gets per allocation.
  2. mallocs are much more memory-efficient than GC. A GC run has a lot of overhead, so the key to reducing that overhead is to run less often and collect more memory per run; i.e., you allocate much more memory than fits into L3 cache between GC collections. In other words, you keep spilling out of L3 cache, and new allocations all land in cold memory. In comparison, eager freeing lets a malloc hand back memory that is still hot in cache, so you get far fewer cache misses. This is the primary reason the performance difference is so many times larger than the reported GC percentage.

Point 2 can be mitigated by escape analysis in Julia inserting eager frees where possible (which it should be able to do in this example), so this is not a fundamental Julia limitation. But it relies on the compiler analysis succeeding*, whereas RAII gives you this for free.
*Which I think, once implemented, can work very reliably, just like devirtualization is very reliable in type-stable code.

30 Likes