How to avoid GC of Julia structs within C/C++/Rust data structures?

Disclaimer — I’m still pretty new to Julia, but I think I’ve done enough diligence to warrant asking this.

I’m trying to use moodycamel::ConcurrentQueue as a high-performance replacement for Julia’s DualLinkedConcurrentRingQueue. There’s a C interface for moodycamel::ConcurrentQueue which makes the Julia wrapper side pretty trivial, at least for isbits types. However, the issue I’m running into is around garbage collection. I’d like to have the Julia object get enqueued into the moodycamel::ConcurrentQueue and not get garbage collected till it gets dequeued. (And more generally, if I have a native data structure holding Julia objects, I’d like them not to get GC’d for the duration of them being accessible in the data structure.)

Following a perusal of the relevant pages in the Julia manual, as well as a few related Discourse topics, here are the only applicable options I’m aware of:

  1. Create a global (or at least whose lifetime doesn’t exceed that of the elements in the queue) ConcurrentDict (or ReentrantLock-protected IdDict, or similar structure) holding the objects, which ensures their GC liveness.
  2. On enqueue, copy the objects to a C-friendly format; on dequeue, convert back to Julia format, and deallocate the native memory.
  3. Port the implementation to Julia.

A helpful trick might be to “untrack” the objects in question from GC on enqueue, and then “retrack” them on dequeue. (Or if they show up in multiple native structures, I suppose something like refcounting could work.) But I don’t think this is possible.

Am I missing something? Is non-isbits Julia interop with native data structures “doomed” (either sacrifice performance or reinvent the wheel)?

1 Like

I’m not aware of any other options for yanking objects out of the GC. There is an internal Base.@_gc_preserve_begin, but it won’t preserve things after leaving the scope, which isn’t much better than GC.@preserve. However, instead of a ReentrantLock-protected IdDict you could try a mostly lock-free ConcurrentDict? Though, it sounds like a performance killer in this context.

const d = ConcurrentDict{UInt, Any}()
d[objectid(obj)] = obj  # on enqueue
modify!(_ -> nothing, d, objectid(obj)) # after dequeue

It’s a good point @sgaureConcurrentDict may well have better performance characteristics in this case. However, the use of nearly any such “shadow data structure” (except perhaps a preallocated vector) incurs sufficient overhead as to make it pointless to use a native data structure in the first place.

Also, the way the GC works would make it hard to implement anything faster. It has a “mark” pass, where it marks all objects which are reachable from module globals, tasks, etc. It does this by traversing recursively down through structs. Then there’s the “sweep” pass. Those objects in the heap which are not marked, are freed.

One could imagine that some objects in the heap are directly marked as not to be GC’ed, but this would have to be done recursively too, when it disappears from sight (i.e. are enqueued in the external queue). It would be resource demanding, probably more so than inserting it in a global structure.

There are some hooks for using different GCs, and even custom types with custom mark & sweep functionality (like @ccall jl_new_foreign_type(..., mymarkfun, mysweepfun, ...), but that would be quite involved to implement, I think.

Do you have benchmarks that show moodycamel::ConcurrentQueue is faster than DualLinkedConcurrentRingQueue? In general, writing this type of concurrent data-structure is much easier with a GC (since you don’t have to worry about free races), so it wouldn’t surprise me if the Julia implementation is faster.

I second that, it would be very interesting to see some numbers and this kind of comparison.

It’s also worth figuring out how you’re handling these Julia objects on the Julia side. If the risk is premature GC, then presumably there are no references to the object from the Julia program. We can’t unsafely reassign such an object back to any Julia reference yet, nor can we hack its type’s instantiation to readopt such an object. If the point is to allocate memory in Julia for use in other languages, then this would be fine. The problem would indeed be abstractly moving data across a language barrier without interference from the original language’s automatic memory management. Incidentally, “untracking” reminds me of Rust’s ManuallyDrop or std::mem::forget, but I also don’t know if Julia’s GC implementation can allow a parallel.

Bear in mind that a non-isbits type doesn’t necessarily put all its data on the heap either. An immutable struct often stores some fields on the stack (which is also how it can be stored inline a separate heap allocation). A mutable struct would usually end up on the heap for memory-safety, but the compiler may put it on the stack if it conservatively proved method lifetime on the Julia side; Julia really doesn’t have semantics that map perfectly to kinds of allocation. We can work around heap-to-stack optimizations with global caches, but we have to copy some data from immutable struct instances over to the other language’s side.

Julia does have an interface to register “external” object with the GC and to write a custom scanner for them.

So I think you could write a C interface that exposes a new foreign type that wraps your queue.
Then during GC a callback is invoke that let’s you traverse your object and queue any Julia object that you are holding onto.

This is using a very low-level interface so there isn’t that much documentation, there was a detailed Julep Juleps/GcExtensions.md at master · JuliaLang/Juleps · GitHub and it is used by GAP.

5 Likes

Thanks everyone for these responses! Will look at + process them soon in more detail. For now, posting some very basic perf results (enqueue only, for a fixed size) for those interested. More complete test results forthcoming.

Testing MPMCQueue
  18.006 ms (0 allocations: 0 bytes)
Testing NativeMPMCQueue
  256.020 ms (0 allocations: 0 bytes)
Testing DualLinkedConcurrentRingQueue
  899.966 ms (20 allocations: 640.02 MiB)
import Chairmarks: @b
import ConcurrentCollections
import OntonCollections

...

mutable struct TestElement
    v::Int64
end

function perf_test_queue(q, n, refs)
    for i in 1:n
        push!(q, @inbounds refs[i])
    end
end

function perf_test_queues()
    Concurrent.with_pin() do # thread pinning
        n = 1048576
    
        # Preallocate refs to avoid allocation overhead during benchmark + preserve for GC
        refs = [TestElement(i) for i in 1:n]
    
        # I implemented `OntonCollections.MPMCQueue` as a Julia port of https://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
        println("Testing MPMCQueue")
        q1 = OntonCollections.MPMCQueue{TestElement}(n)
        b1 = @btime perf_test_queue($q1, $n, $refs)
    
        # This is moodycamel::ConcurrentQueue
        println("Testing NativeMPMCQueue") 
        q2 = OntonCollections.NativeMPMCQueue{TestElement}()
        b2 = @btime perf_test_queue($q2, $n, $refs)
    
        println("Testing DualLinkedConcurrentRingQueue")
        q3 = ConcurrentCollections.DualLinkedConcurrentRingQueue{TestElement}(log2ringsize=Int(log2(n)))
        b3 = @btime perf_test_queue($q3, $n, $refs)
    end
end
2 Likes

I’m not sure if it’s going to be better but JavaList/CDLL in WorkstealingQueues.jl/src/CDLL.jl at main · gbaraldi/WorkstealingQueues.jl · GitHub has performed quite well for me

@sgaure It’s a good thought about the recursive GC trace being potentially more expensive than inserting into a global structure. Unfortunate — I hope it’s not the case, for the sake of not requiring a port (thereby potentially introducing subtle bugs) in order to use battle-tested high-performance data structures in Julia.

@Oscar_Smith @j_u See benchmarks above — more forthcoming! Good observation about free races — didn’t think about that.

@Benny Good thoughts! A Julia analogue of std::mem::forget is pretty much exactly what I wish existed. But for now, other workarounds seem to be necessary.

@vchuravy Very helpful — thank you! This might be just the solution. I’ll explore it! (I know you alluded to this @sgaure)

@gbaraldi Thanks for the suggestion!! Ostensibly CDLL.jl will be quite a bit slower due to its linked structure as opposed to a preallocated array structure. But for unbounded queues, it’s of course a good solution.

TBH, on balance, despite its cons, porting is probably the best option given the relative effectiveness of Claude et al. And we can already see in the preliminary results that simply porting BoundedMPMCQueue from C to Julia ends up being >14x more performant (on enqueue, anyway) than moodycamel::ConcurrentQueue plus the very light Julia wrapper.

3 Likes

I would be very careful with trusting Claude or any other AI agent with correctly handling the details of lock free programming. Memory models are notoriously complex and the subtle details can easily break correctness.

4 Likes

It’s a very good point @Oscar_Smith. I attempted to port three different relatively simple lock-free queue implementations from C/C++ to Julia using Cursor and only one of them succeeded. For the other ports, Claude got stuck in a self-modification loop that it could never seem to emerge from. Good tests (and large numbers of iterations to evince race conditions) helped with confidence as to correctness. Formal verification is better, but I’m not familiar with those tools as they relate to lock-free algorithms.

1 Like

I’m not sure this will be faster than the thin wrapper used now? You need a root scanner callback which adds those objects to the root list. They must be added when the object is in the foreign queue. So one must keep track of them, put them in a list when entering the queue, take them out when they are dequeued. Which is exactly what the thin wrapper does.

@sgaure — the thin wrapper used now (shown below) only uses scoped GC preservation via GC.@preserve; the perf tests show a manual interprocedural GC preservation approach (just append to a vector). The alternatives were going to be either 1) do a deep conversion between Julia structs and C structs (Julia → C on enqueue; C → Julia on dequeue) or 2) put the Julia struct into a Julia data structure on enqueue and take it out on dequeue. You make a good point — it remains to be experimentally seen whether e.g. ConcurrentDict is faster vs. adding to root list. Without knowing almost anything about Julia GC internals, I’d probably put my money on root list perf.


"""
    offer!(q::NativeMPMCQueue{T}, v::T)::Bool where {T}

Tries to enqueue `v` to `q`. Returns `true` on success, `false` on failure. Like Java's `offer`.

!!! warning
    The caller must maintain a reference to `v` until it is dequeued, otherwise it may be
    garbage collected while still in the queue.
"""
function offer!(q::NativeMPMCQueue{T}, v::T)::Bool where {T}
    if v === nothing
        throw(ArgumentError("Can't enqueue `nothing`"))
    end

    if q.handle == C_NULL
        throw(ArgumentError("Queue has been finalized"))
    end

    GC.@preserve v begin # NOTE the lifetime of this is too short
        vptr = Base.pointer_from_objref(v)
        result = @ccall Libonton.path.moodycamel_cq_enqueue(
            q.handle::Ptr{Cvoid},
            vptr::Ptr{Cvoid}
        )::Cint

        return result == 1
    end
end

Just a minor tip. In such a thin wrapper it might be wise to “hide” the throw, otherwise they may be inlined and bloat the machine code, preventing some optimizations.

    if v === nothing
        ( () -> throw(ArgumentError("Can't enqueue `nothing`")) )()
    end

    if q.handle == C_NULL
        ( () -> throw(ArgumentError("Queue has been finalized")) )()
    end

Compare the outputs of these:

function f(i)
    if i < 0
        throw(ArgumentError("i < 0"))
    end
    i+1
end

function g(i)
    if i < 0
        (()->throw(ArgumentError("i < 0")))()
    end
    i+1
end
julia> @code_native f(42)
julia> @code_native g(42)

You need a root scanner callback which adds those objects to the root list. They must be added when the object is in the foreign queue.

You are trading a constant runtime overhead “put thing into IdDict” vs overhead during GC.
Also don’t use the root_scanner_callback but rather a foreign type with a custom mark function.

In reality this means that during GC you have to simply walk the items in the list and enqueue them.

But you must also put them in the list, and take them out. Concurrently. Which is, more or less, ConcurrentDict.

1 Like

Great point @sgaure! I missed those throw calls when refactoring our codebase to use ex (essentially error with some additional features, and @noinline to address this very issue you bring up.)


@vchuravy so it’s O(n) on the size of the queue on every GC run?


@vchuravy because it’s a concurrent queue, it would probably need to be ConcurrentDict or equivalent, no? I don’t see a way that it could safely use an IdDict, whether global or task-local. I’m very curious what the relative overhead is as between ConcurrentDict (or ostensible ConcurrentIdDict) and GC. Either way I wonder if it defeats the purpose of using a native data structure and forces our hand into porting the native code to Julia.

Perhaps a custom marker could loop through the actual C-queue? I.e. keep a foreign object representing the c-queue as a global, and its gc-marker calls into the C-queue code and marks all the objects in there.