I have a Julia interface to a C library where each C struct represents a number. In Julia, the mutable struct just contains a Ptr to the C struct, and the finalizer calls the C destructor.
These C structs can be very memory intensive (they represent Taylor series), but I’ve noticed Julia’s GC does not seem aware of how much memory is actually being used by the C code.
For example, if I call the function:
function foo(x)
    for i = 1:10000
        # operations using the C number type
    end
end
while keeping an eye on top in my Mac’s command line, Julia’s memory usage can skyrocket to 5+ GB; calling it again pushes it to 10+ GB, and so on up to 30+ GB (I’m on a Mac M2 Ultra).
If I instead call the function
function foo(x)
    for i = 1:10000
        # operations using the C number type
        GC.gc(false)
    end
end
there is no skyrocketing of memory usage, and in fact the speed is basically unchanged.
So my questions are:
1. Why is this happening? Is Julia’s GC not aware of how much memory is actually being used by the C code?
2. If it is unaware, how can I tell the GC at object creation, in a performant way, how much memory is actually being used? If it is aware, how can I fix this?
Julia is not aware of how much memory is behind any given pointer (because it’s just an opaque pointer, probably allocated by some other memory allocator). But it should be aware of the overall system memory pressure (including the allocations made by your C library) and run GC when that is tight.
I believe the Julia GC can only run when an allocation occurs (or when called explicitly via GC.gc). If you aren’t making any Julia allocations in your code, there may not be an opportunity for the GC to trigger; see the sketch below.
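For illustration, a minimal sketch (the loop body is just a stand-in for the real C operations): a periodic incremental collection gives the GC a chance to run even in an otherwise allocation-free loop.

for i = 1:10_000
    # ... ccalls that allocate C-side memory but nothing on the Julia heap ...
    i % 1_000 == 0 && GC.gc(false)   # quick incremental sweep every 1000 iterations
end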
You have to allocate the memory in Julia and pass a pointer to it to C.
E.g.
mutable struct TheCtype
    x::Cint
end

function do_something(ct::TheCtype)
    p = pointer_from_objref(ct)
    GC.@preserve ct begin
        @ccall my_c_function(p::Ptr{Cvoid})::Cvoid
    end
end
On my smaller laptop, the OS itself will actually kill Julia unless I include GC.gc(false).
On the Mac, things get extremely slow until, at around 115 GB of memory usage, the garbage collector finally runs, and that run itself takes a significant amount of time. Shouldn’t it run long before things get this slow?
There are a few things here that make it hard to help: what version of Julia are you using? Is the high memory usage you’re seeing from just the Julia process, or overall utilization? How much of it is actually in use vs. just reserved memory (check the RSS memory utilization)?
Ideally, could you provide a small reproducible example that we could run on our machines?
Also, you mention using a finalizer to free the object. This is generally not a good idea: finalizers are not guaranteed to run immediately, so they may keep your objects alive much longer than necessary.
If what you’re doing is trying to emulate RAII, there are better ways to do that - see the sketch below.
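For example, a minimal sketch of the usual do-block pattern; with_number, c_create, and c_destroy are hypothetical names standing in for the real API:

# scoped resource management: cleanup is deterministic and does not rely on the GC
function with_number(f)
    t = c_create()      # hypothetical C-side constructor returning a Ptr
    try
        return f(t)
    finally
        c_destroy(t)    # hypothetical C-side destructor, runs as soon as the block exits
    end
end

# usage: the C object is freed immediately when the do-block returns
with_number() do t
    # ... operations using t ...
end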
The 115 GB is what is shown under MEM in top for only the julia process. I’m not exactly sure how to check whether it’s RSS or not, but I’ll figure that out.
Absolutely! Here you go
import Pkg
Pkg.add("GTPSA")
using GTPSA

d1 = Descriptor(6, 10)
r = vars()
iter = 20000

function test!(r, iter)
    for i = 1:iter
        normL = 1/sqrt((1 + r[6])^2 - r[2]^2 - r[4]^2)
        r[5] = r[5] + normL*(1 + r[6]) - 1
        r[1] = r[1] + normL*r[2]
        r[3] = r[3] + normL*r[4]
    end
end

function testgc!(r, iter)
    for i = 1:iter
        normL = 1/sqrt((1 + r[6])^2 - r[2]^2 - r[4]^2)
        r[5] = r[5] + normL*(1 + r[6]) - 1
        r[1] = r[1] + normL*r[2]
        r[3] = r[3] + normL*r[4]
        GC.gc(false)
    end
end

test!(r, iter)
testgc!(r, iter)
Thank you! The allocations are due to your TPS object having to live on the heap because of your finalizer (for the output below, test! was redefined with AllocCheck’s @check_allocs, which makes a call throw if the compiled code contains allocations):
julia> using AllocCheck

julia> try
           test!(r, iter)
       catch e
           e.errors[1]
       end
Allocation of TPS in /home/sukera/.julia/packages/GTPSA/YxRfI/src/tps.jl:5
  | t = new(t1)

julia> try
           test!(r, iter)
       catch e
           e.errors[2]
       end
Allocating runtime call to "jl_f_finalizer" in ./gcutils.jl:87
  | Core.finalizer(f, o)
The allocation can’t be elided because of the unknown lifetime. Since finalizers are not guaranteed to run as soon as the object is “dead”, they accumulate until the next GC run occurs. If I simply insert a GC.gc(), the continuous high memory usage disappears. Note that “high memory usage” is not always a bad thing - just having “empty” memory around is not magically going to make code more performant.
Thanks for your checks! Yes, if I put GC.gc(false) at the end of the loop, then there is no memory overload: on my small laptop the OS does not kill Julia, and on my Mac things do not get really slow after many runs.
My point is: shouldn’t Julia’s GC be able to detect this and run more frequently, so that this memory overload - and the subsequent OS kill or significant slowdown - does not occur?
Note that GC.gc() is distinct from GC.gc(false): the former forces a full collection, while the latter only runs a quick, incremental collection. (Neither disables the GC; that would be GC.enable(false).)
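For reference:

GC.gc()           # full collection, same as GC.gc(true)
GC.gc(false)      # quick, incremental collection
GC.enable(false)  # this is what actually disables collection entirely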
You haven’t mentioned a slowdown so far - if it’s getting slower because of this, that’s of course worth looking into. I suspect that any slowdown that does occur (in your particular case) would be explained by the OS swapping out to disk. That’s something you’ll have to investigate, though.
Generally, Julia tries not to run GC that often, because empty memory is (more or less) “free real estate”: garbage collection runs are relatively slow, while getting new memory is relatively quick. In recent versions (I don’t know the exact one off the top of my head, sorry) you can pass --heap-size-hint to force GC to occur once a threshold is reached, which might be beneficial in your case.
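For example, to make the GC start collecting once the heap grows past roughly 4 GB (the size here is just an illustration):

julia --heap-size-hint=4G script.jl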
You can use jl_malloc/jl_calloc/jl_free to allocate memory in a way that is visible to the GC. You could set a function pointer in your library to these functions, and then use that function pointer.
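A minimal sketch, assuming the library exposes some way to install allocator hooks (the setter set_allocator_fns and the library name libmylib are hypothetical); libjulia is already loaded in a running Julia process, so cglobal can resolve the symbols:

# addresses of Julia’s GC-aware allocator functions
jl_malloc_ptr = cglobal(:jl_malloc)
jl_free_ptr   = cglobal(:jl_free)

# hypothetical setter - the C library would need to expose something like
#   void set_allocator_fns(void *(*m)(size_t), void (*f)(void *));
@ccall "libmylib".set_allocator_fns(jl_malloc_ptr::Ptr{Cvoid},
                                    jl_free_ptr::Ptr{Cvoid})::Cvoid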
This is interesting. The library uses its own special allocator (a thread-safe pool of memory) instead of malloc by default, but I’d like to try jl_malloc since it might actually be faster in this case. I haven’t worked with function pointers in C before - could you point me to some documentation, or give a simple example of how to change my malloc to jl_malloc? Do I need to include and link against any Julia library?
If the library doesn’t provide a way to use a custom malloc, it’ll be difficult to just drop jl_malloc in there.
One thing I’m wondering about, looking at the struct definition, is why you’re wrapping a pointer. Ordinarily, I’d make the RTPSA struct mutable (to communicate to Julia that it has a stable address) and just use that directly, perhaps attaching the finalizer there and retrieving the object pointer through pointer_from_objref (which should be safe here, since the object is still alive in the finalizer). Why the additional indirection?
There is an option to use the standard malloc, calloc, free, etc., so I could drop jl_malloc in there. Then I would just have to include libjulia and link against it, right?
The RTPSA is initialized entirely in the C code, and the constructors just return pointers to it - C owns all of the memory here. Every operation/function in the C library takes pointers to RTPSA, so doesn’t it make sense to just have the Julia side handle moving the pointers around properly? If I instead use only the RTPSA struct, then at each RTPSA construction I’d have to do an unsafe_load, and at each operation a pointer_from_objref for each argument. And even then, Julia doesn’t know how long the coef array in the RTPSA is, which is the most memory-intensive array in the struct.
Julia itself already loads libjulia, so that should just work if you’re calling the library from julia.
What I’m saying is that the pointer is the object itself; Julia just abstracts that detail away. Mutable structs in Julia have pointer identity. In that way, Julia already handles the pointer correctly for you and you don’t have to emulate that manually.
You don’t have to do that manually though - the ccall interface handles all of that for you. See here. You’d implement Base.unsafe_convert for RTPSA to return pointer_from_objref, and then you can just pass the object as-is back to C, as in the sketch below.
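A minimal sketch of that pattern (the field layout, the library name libgtpsa, and the C function name rtpsa_del are placeholders, not the real API):

mutable struct RTPSA
    # fields mirroring the C struct layout go here; coef is the pointer
    # to the coefficient array (placeholder layout)
    coef::Ptr{Cdouble}
end

# lets ccall/@ccall accept an RTPSA wherever a Ptr{RTPSA} is expected
Base.unsafe_convert(::Type{Ptr{RTPSA}}, t::RTPSA) =
    Ptr{RTPSA}(pointer_from_objref(t))

# a regular named finalizer function, as mentioned below
free_rtpsa(t::RTPSA) = @ccall "libgtpsa".rtpsa_del(t::Ptr{RTPSA})::Cvoid

A call like @ccall "libgtpsa".rtpsa_print(t::Ptr{RTPSA})::Cvoid then takes t directly; ccall performs the conversion and keeps t rooted for the duration of the call.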
Julia doesn’t need to know that though - your struct would be exactly the same, pointer fields and all. If you want to inspect the array, you can implement some accessor sugar to handle the unsafe_load for you (or use unsafe_wrap with own=false for printing), but other than that, you don’t need to change anything about your struct.
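For instance (a sketch; ncoef(t) is a stand-in for however the library reports the coefficient count):

# zero-copy view of the C-owned coefficient array; own = false tells Julia
# not to free this memory when the wrapping Array is garbage collected
coefs(t::RTPSA) = unsafe_wrap(Vector{Cdouble}, t.coef, ncoef(t); own = false)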
The advantage of this approach lies mainly in the fact that there are no extra allocations anymore: the memory was already allocated by C, there’s no wrapper, and since every struct is freed in the same way, the function passed to finalizer can just be a regular named function instead of an anonymous one.
OK, I might be convinced… I just have two questions:
1. When I construct an RTPSA, the C constructor gives me a Ptr{RTPSA}. If I understand correctly, I then unsafe_load this pointer into my RTPSA struct. When the finalizer is eventually called, I pass it pointer_from_objref for that RTPSA - but this pointer is different from the one originally returned by the C code during construction. Do they both point to the exact same place in memory, and is the memory properly freed on the C side when it receives this different pointer?
2. I cannot compile the C code to a shared library with the undefined symbols jl_free and jl_malloc, so I’m not exactly sure what you mean by it will “just work”?