Manually Trigger Mark Phase of GC

Hello,

I was wondering: is there a way to manually trigger the mark phase of the garbage collector, so that finalizers get called on objects that are marked as unreachable?

I have lots of foreign objects owned by C++, and Julia does not get rid of them fast enough for the wrapped library to work as intended (OOM and other issues). The structs must be mutable because they hold a pointer back to C++, so everything is heap-allocated and managed by the GC. Each object is essentially just a std::shared_ptr to some allocation, but that shared_ptr is never decremented because Julia doesn’t bother to run the GC. The Python wrapper for this library “just works” because Python uses reference counting for its GC.
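For context, the wrapper pattern is roughly this (the release function and library name below are placeholders, not the real API):

mutable struct Handle
    ptr::Ptr{Cvoid}  # points back to the C++ object
    function Handle(ptr::Ptr{Cvoid})
        h = new(ptr)
        # the shared_ptr copy is only released when the GC finalizes `h`
        finalizer(h) do x
            ccall((:lib_release_handle, "libexample"), Cvoid, (Ptr{Cvoid},), x.ptr)
        end
        return h
    end
end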

Is this just something Julia cannot do? It would be nice if there were some kind of escape analysis, or a way to query whether objects could be GC’ed but just have not been yet. I know CUDA.jl basically re-implements garbage collection to circumvent this, but memory pressure is not the only problem I need to avoid (it’s a good start though). Just the fact that the objects appear alive to the C++ library causes performance issues.

Are you looking for GC.gc()?

help?> GC.gc
  GC.gc([full=true])

  Perform garbage collection. The argument full determines the kind of collection:
  a full collection (default) traverses all live objects (i.e. full mark) and
  should reclaim memory from all unreachable objects. An incremental collection
  only reclaims memory from young objects which are not reachable.

  The GC may decide to perform a full collection even if an incremental collection
  was requested.

  │ Warning
  │
  │  Excessive use will likely lead to poor performance.

Manually calling that does fix the issue, but it is incredibly slow, and it’s not something the end user should have to call in their code. CUDA.jl has a heuristic for automatically calling GC.gc() based on memory pressure, but that’s a couple hundred lines of code and lots of careful engineering just to manage the lifetime of their GPU arrays. With how much of the Julia ecosystem is built on top of foreign objects, what would it take to get better support for managing their lifetimes? For foreign objects, ref-counting makes more sense in my opinion, but I don’t think that is an easy lift to add, or even strictly possible. Maybe some kind of @foreign macro that wraps your type in a ref-counted object that Julia’s GC handles differently? Something like the sketch below.
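Purely hypothetical, but conceptually:

# completely made-up sketch of a ref-counted foreign wrapper; `obj` would be
# any finalizable wrapper object like the handles described above
mutable struct RefCounted{T}
    obj::T
    count::Threads.Atomic{Int}
    RefCounted(obj::T) where {T} = new{T}(obj, Threads.Atomic{Int}(1))
end

function acquire!(r::RefCounted)
    Threads.atomic_add!(r.count, 1)
    return r.obj
end

function release!(r::RefCounted)
    # atomic_sub! returns the old value, so 1 means this was the last reference
    if Threads.atomic_sub!(r.count, 1) == 1
        finalize(r.obj)  # eagerly run the wrapper's finalizer
    end
    return nothing
end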

The problem code I have is basically this:

for i in 1:large_number
    handle = new_resource()
    # ... create a bunch of views into the handle
    # ... do stuff with the views
end

so lots of arrays are spawned, which leaks memory and eventually OOMs.

It does? CuArray instances are documented to be handled normally by Julia’s garbage collector. It’s the underlying CUDA memory pools for optimizing future allocations that get reclaimed via a memory pressure heuristic, separately from GC.gc.

Maybe I misunderstand; I was just reading their source code. I was mostly looking at this function, which is the one calling GC.gc.

If that’s for memory pools, how does CUDA possibly ensure that Julia’s GC acts often enough? A CuArray is more or less just a fancy wrapper for a pointer, so Julia still has no idea how much memory a CuArray takes up, and therefore OOM would still be possible, no?


Okay, I get it now: you weren’t talking about the lifetime of CuArrays, but rather about the device memory allocations. Device memory can get used up even when host memory isn’t under enough pressure to spur the garbage collector, so CUDA.jl’s reclamation (there might be a more proper term; I’m calling it this to differentiate it from Julia’s garbage collection, and because the function is called CUDA.reclaim) does indeed need to run GC.gc(), because uncollected CuArrays prevent reclamation of their device memory. In the code you linked, maybe_collect is a preparatory step in pool_alloc.
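The shape of that heuristic is roughly this (the device memory queries below are made-up stand-ins, not CUDA.jl’s actual code):

# rough sketch of a memory-pressure heuristic in the spirit of maybe_collect;
# device_memory_used and device_memory_total are hypothetical queries
function maybe_collect(; pressure_threshold = 0.75)
    pressure() = device_memory_used() / device_memory_total()
    if pressure() > pressure_threshold
        GC.gc(false)  # try a cheap incremental pass first
        # still under pressure? pay for a full mark so finalizers can run
        pressure() > pressure_threshold && GC.gc(true)
    end
    return nothing
end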

In your case, though, it doesn’t sound like you have another memory region whose pressure could trigger memory management; you just want Julia to manage memory allocated in another language, to the point that you’re allocating mutables on the Julia side in order to use finalizers to free the C++ objects. I can’t remember the GitHub issue for it, but memory-managing finalizers inherently relying on Julia’s allocations and memory pressure is still a problem for interop. I did, however, find a thread about switching to jl_malloc/jl_calloc/jl_free in C code to let Julia’s GC see that memory.


Yeah, I’ve seen that thread while poking around. We unfortunately cannot set custom allocators for the library we are wrapping. There is technically a memory pool, but it is managed by the external library rather than by Julia, whereas in CUDA the memory pool is managed by Julia (even though it’s really also a C object).

I don’t really want Julia to manage the memory of another language, because C++ destructors are doing that already. It’s just that Julia’s GC actively hinders interoperability with other languages by delaying when that memory can be handled by the external language/library. C/C++ and Julia will always fundamentally differ in this way, but I feel there should be a solution here other than hoping the GC will run often enough.

Typical GCs delay when memory is freed; that’s how they maintain decent throughput despite the bookkeeping. Like the Julia wrappers you’re using, unreachable Julia objects on the heap generally stick around longer than they strictly need to, waiting for a GC cycle. If there were a ready, automatic way to identify and free them more quickly, the GC would already be doing it for you.

Manually triggering the GC was mentioned earlier, but performance obviously suffers if you run a cycle so often that it’s no longer delaying frees. Manual frees have cropped up in very niche data structures (StaticTools.free(::MallocArray), CUDA.unsafe_free!(::CuArray)), but there isn’t a working API for Julia objects in general; it would sacrifice safety and isn’t guaranteed to have better throughput than a GC. You can manually finalize an object before it becomes unreachable, but if you are willing to go as far as manual calls AND you know a safe spot to put them, then you might as well ditch the wrapper’s mutability and call a dedicated cleanup API directly. If you also know when the object is instantiated, you can make a higher-order function to do the instantiation and cleanup more easily, like open(workonfile, filename), though you still have to avoid unsafe things like caching the instance and using it after cleanup. A sketch of that pattern follows.
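For example, with the new_resource from your loop (assuming it’s safe to run the handle’s finalizer eagerly):

# sketch of the open(workonfile, filename) pattern for your new_resource;
# finalize runs the handle's registered finalizer immediately and deterministically
function with_resource(f)
    handle = new_resource()
    try
        return f(handle)
    finally
        finalize(handle)
    end
end

# usage: the views must not escape the do block, or they'll dangle after cleanup
with_resource() do handle
    # create views into the handle and do stuff with them here
end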

Earlier you mentioned that (C)Python’s GC uses more immediate reference counting (only partially: it uses a generational GC for reference cycles, with the same delays). Independently developed languages and implementations can have very different approaches, so it’s more coincidence than anything when interop gels somewhere. CPython just happens to currently use reference counting, which works well with that of shared_ptr. CPython is free to lose it in the future, like it’s recently losing the infamous GIL; PyPy is a Python implementation without reference counting, so that’s proven possible. Reference counting isn’t universal because of its overheads, and it takes extra work to be safe across multiple threads; the GIL is only going away because reference counting can now save that extra work for single-threaded objects.