Blog post: Rust vs Julia in scientific computing

I’ve often seen sentiments like this repeated, but given the choice, I’d take RAII over GC.
Unfortunately, Julia’s semantics don’t make RAII possible.
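To make that concrete, here is a minimal sketch (my illustration, not from the blog post) of the closest Julia gets to RAII today: an explicit try/finally, which is deterministic but has to be written at every use site, or a finalizer, which is automatic but runs only whenever the GC decides to collect the object. The names with_buffer and CBuffer are made up for the example.

# Deterministic cleanup, but only because we spell it out at every call site:
function with_buffer(f, nbytes)
  p = Libc.malloc(nbytes)
  try
    return f(p)
  finally
    Libc.free(p)  # always freed here, RAII-style, but written by hand
  end
end

# Automatic cleanup, but at a time of the GC's choosing, not at end of scope:
mutable struct CBuffer
  ptr::Ptr{Cvoid}
  function CBuffer(nbytes)
    buf = new(Libc.malloc(nbytes))
    finalizer(b -> Libc.free(b.ptr), buf)
    return buf
  end
end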

So keep in mind that a C++ implementation does not need to, and probably should not, manually call malloc/free or use the new/delete keywords.

using StrideArraysCore
f(x) = @fastmath typeof(x)(2) * x
g(x) = @fastmath typeof(x)(3) * x
h(x) = @fastmath typeof(x)(4) * x
# Apply f, g, and h elementwise to columns 1, 2, and 3 of X, storing the results in m.
@inline function rowwise!(m, X, f, g, h)
  num_rows, num_cols = size(X)
  @assert num_cols == 3
  @inbounds @simd ivdep for j = 1:num_rows
    m[j, 1] = f(X[j, 1])
    m[j, 2] = g(X[j, 2])
    m[j, 3] = h(X[j, 3])
  end
  return m
end
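As a quick sanity check of what rowwise! computes (a throwaway example, not part of the benchmark), each column is mapped through its own function, so columns 1, 2, 3 get scaled by 2, 3, 4:

Xdemo = [1.0 1.0 1.0;
         2.0 2.0 2.0]
rowwise!(similar(Xdemo), Xdemo, f, g, h)
# 2×3 Matrix{Float64}:
#  2.0  3.0  4.0
#  4.0  6.0  8.0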

struct GarbageCollector end;
struct LibcMalloc end;
struct MiMalloc end;
struct JeMalloc end;
using jemalloc_jll, mimalloc_jll

malloc(::GarbageCollector, N) = Array{UInt8}(undef, N)
malloc(::LibcMalloc, N) = Libc.malloc(N)
malloc(::MiMalloc, N) =
  @ccall mimalloc_jll.libmimalloc.mi_malloc(N::Int)::Ptr{Cvoid}
malloc(::JeMalloc, N) =
  @ccall jemalloc_jll.libjemalloc.malloc(N::Int)::Ptr{Cvoid}
free(::GarbageCollector, x) = nothing
free(::LibcMalloc, x) = Libc.free(x)
free(::MiMalloc, x) =
  @ccall mimalloc_jll.libmimalloc.mi_free(x::Ptr{Cvoid})::Nothing
free(::JeMalloc, x) =
  @ccall jemalloc_jll.libjemalloc.free(x::Ptr{Cvoid})::Nothing
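A quick smoke test of the allocator interface, just to confirm each backend round-trips an allocation (the size is arbitrary):

for mem in (GarbageCollector(), LibcMalloc(), MiMalloc(), JeMalloc())
  p = malloc(mem, 64)
  free(mem, p)
end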

function foo(mem, X, f::F, g::G, h::H, n) where {F,G,H}
  x = 0.0
  num_rows, num_cols = size(X)
  for _ = 1:n
    p = malloc(mem, 8 * num_rows * num_cols) # 8 bytes per Float64 element
    GC.@preserve p begin
      A = PtrArray(Base.unsafe_convert(Ptr{Float64}, p), (num_rows, num_cols))
      rowwise!(A, X, f, g, h)
      x += sum(A)
    end
    free(mem, p)
  end
  x
end
function foo_threaded(mem, X, f::F, g::G, h::H, n) where {F,G,H}
  x = Threads.Atomic{Float64}(0.0)
  Threads.@threads for _ in 1:Threads.nthreads()
    Threads.atomic_add!(x, foo(mem, X, f, g, h, n))
  end
  x[]
end

X = rand(100, 3);
# Run everything once with a small `n` first, so the timings further down exclude compilation.
@time foo(GarbageCollector(), X, f, g, h, 30)
@time foo(LibcMalloc(), X, f, g, h, 30)
@time foo(MiMalloc(), X, f, g, h, 30)
@time foo(JeMalloc(), X, f, g, h, 30)
@time foo_threaded(GarbageCollector(), X, f, g, h, 30)
@time foo_threaded(LibcMalloc(), X, f, g, h, 30)
@time foo_threaded(MiMalloc(), X, f, g, h, 30)
@time foo_threaded(JeMalloc(), X, f, g, h, 30)

@time foo(GarbageCollector(), X, f, g, h, 30_000_000)
@time foo(LibcMalloc(), X, f, g, h, 30_000_000)
@time foo(MiMalloc(), X, f, g, h, 30_000_000)
@time foo(JeMalloc(), X, f, g, h, 30_000_000)
@show Threads.nthreads();
@time foo_threaded(GarbageCollector(), X, f, g, h, 30_000_000)
@time foo_threaded(LibcMalloc(), X, f, g, h, 30_000_000)
@time foo_threaded(MiMalloc(), X, f, g, h, 30_000_000)
@time foo_threaded(JeMalloc(), X, f, g, h, 30_000_000)

I get

julia> @time foo(GarbageCollector(), X, f, g, h, 30_000_000)
 19.134882 seconds (30.00 M allocations: 71.526 GiB, 12.61% gc time)
1.4034906850760502e10

julia> @time foo(LibcMalloc(), X, f, g, h, 30_000_000)
  3.349162 seconds (1 allocation: 16 bytes)
1.4034906850760502e10

julia> @time foo(MiMalloc(), X, f, g, h, 30_000_000)
  3.026271 seconds (1 allocation: 16 bytes)
1.4034906850760502e10

julia> @time foo(JeMalloc(), X, f, g, h, 30_000_000)
  3.278054 seconds (1 allocation: 16 bytes)
1.4034906850760502e10

julia> @show Threads.nthreads();
Threads.nthreads() = 36

julia> @time foo_threaded(GarbageCollector(), X, f, g, h, 30_000_000)
235.488171 seconds (1.08 G allocations: 2.515 TiB, 53.87% gc time)
4.957047359932209e11

julia> @time foo_threaded(LibcMalloc(), X, f, g, h, 30_000_000)
  7.382309 seconds (221 allocations: 21.234 KiB)
4.957047359932209e11

julia> @time foo_threaded(MiMalloc(), X, f, g, h, 30_000_000)
  3.553753 seconds (221 allocations: 21.234 KiB)
4.957047359932209e11

julia> @time foo_threaded(JeMalloc(), X, f, g, h, 30_000_000)
  3.986570 seconds (224 allocations: 21.328 KiB)
4.957047359932209e11

We have a single-threaded and a multithreaded version.
The multithreaded version just calls the single-threaded one Threads.nthreads() times, once per thread, so perfect scaling would mean the two take the same amount of time.

Single threaded, the reported GC time is 12.61%, yet we’re roughly 6x slower than the non-GC implementations.
Multithreaded, the reported GC time is 53.87%, and we’re over 66x and 59x slower than the good non-GC implementations (mimalloc and jemalloc), and about 32x slower than the bad one (the Linux default).
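Those factors are just the timings above divided out:

19.134882 / 3.026271     # ≈ 6.3x  (single threaded, GC vs mimalloc)
235.488171 / 3.553753    # ≈ 66x   (threaded, GC vs mimalloc)
235.488171 / 3.986570    # ≈ 59x   (threaded, GC vs jemalloc)
235.488171 / 7.382309    # ≈ 32x   (threaded, GC vs glibc malloc)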

So

  1. GC scales far worse with multithreaded allocations than even a bad malloc/free.
  2. GC’s overhead is far higher than simply the reported GC time.

Elaborating…

  1. There are of course much worse malloc/free implementations than the Linux default, but reasonable ones are built to handle multithreaded allocation much better than Julia’s GC. For example, glibc’s malloc (the Linux default) keeps separate per-thread heaps, so threads see no contention so long as each thread frees the memory it allocated. Many allocators are also built to handle memory migrating across threads efficiently, e.g. the leanM and mstress benchmarks exercise exactly that, although I’ve no idea how the GC would compare there, or how much slower mimalloc gets per allocation in that case.
  2. mallocs are much more memory efficient than GC. GC has a lot of overhead, so the key to reducing that overhead is to run collections less often and reclaim more memory per run. That means you allocate far more memory than fits in L3 cache between collections: you’re constantly spilling L3, and every fresh allocation hands you cold memory. Eager freeing, by contrast, lets a malloc hand back memory that is still hot in cache, so you get far fewer cache misses. This is the primary reason the performance gap is so many times larger than the reported GC percentage; the rough arithmetic after this list makes it concrete.
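To put rough numbers on point 2 (the L3 size is an assumption about a typical desktop chip, not a measurement):

buf_bytes   = 100 * 3 * 8             # one temporary matrix: 2_400 bytes
total_bytes = 30_000_000 * buf_bytes  # ≈ 67 GiB churned through per foo call
l3_bytes    = 32 * 2^20               # e.g. a 32 MiB L3 cache (machine-dependent)
# With eager freeing the same ~2.4 KiB block is reused while still hot in cache;
# with GC, many MiB of allocations pile up between collections, so each new
# buffer comes back as cold memory.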

“2.” can be mitigated by escape analysis in Julia doing the eager freeing in cases where it is possible (which it should be in this example), so this is not a fundamental limitation of Julia. But it relies on the compiler analysis succeeding*, whereas RAII gives you this for free.
*Which I think, once implemented, can work very reliably, just like devirtualization is very reliable in type-stable code.
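For concreteness, this is the sort of pattern such an analysis would target; a minimal sketch of the idea, not a description of what the compiler currently does:

function scaled_colsum(X)  # hypothetical example; the name is mine
  tmp = similar(X)         # tmp never escapes this function...
  tmp .= 2.0 .* X
  return sum(tmp)          # ...so in principle it could be freed right after sum
end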
