I really like how CUDA.jl mixes memory management strategies: memory is garbage collected, as in normal Julia, but, optionally and at the programmer's discretion, you can manually release objects via CUDA.unsafe_free!.
I want that in Base julia: If you know that something is not needed anymore, then permit manual freeing.
The main aim is manual freeing of large temporary arrays. Sure, reduce > reuse > recycle, but let’s not make the perfect the enemy of the good.
What do you think about that?
I'll later add some more details on what I imagine the API could look like:
Compiler or command-line flags to disregard manual frees (like we have for bounds-checking! Important for debugging)?
Maybe two separate functions: Base.unsafe_free! for best-effort, mostly-safe freeing (does nothing for unsupported types) and Base.must_free!(item::Any, recurse::Bool) that throws if it cannot free the item (important for people who want to disable GC altogether, due to soft realtime-ish use-cases).
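A rough sketch of how those two could relate (hypothetical names and semantics from this proposal; nothing like this exists in Base today):

# Best-effort freeing: silently a no-op for types we don't know how to free.
unsafe_free!(x) = nothing

# Strict variant: error instead, so GC-free soft-realtime code fails loudly.
function must_free!(x, recurse::Bool=false)
    throw(ArgumentError("must_free!: don't know how to free a $(typeof(x))"))
end

# Supported types would then add methods, e.g. (illustrative only):
# unsafe_free!(A::Matrix) = ... free the backing memory, null the pointer ...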
Another point when freeing an array: we need to free the underlying memory, and then we probably want to zero the array's pointer, in order to turn a potential memory corruption into a mere null-pointer deref. We should do that for e.g. Matrix as well (even then we can still get memory corruption: the pointer is considered constant, so it may have been LICMed up the callstack).
A difficult point is arrays with shared data that are not bitstype (i.e. multiple arrays pointing to the same data). Can we prevent the GC from corrupting memory then? The memory we're freeing is still reachable for the GC; it's just that the user promises that they won't access it anymore (fingers crossed!).
However, I’m most concerned about largish (16+ MB) bitstype arrays (which tend to come from libc / mmap / kernel instead of julia’s memory pools).
For the purpose of large-ish temporary arrays, I'd much rather have escape analysis work better on Memory than introduce more unsafeness. If you want to handle pointers/lifetimes manually, you can already do that by calling Libc.malloc/Libc.free and wrapping your own array struct, à la the various PtrArray types floating around the ecosystem.
You could even just use unsafe_wrap to wrap the malloc'ed pointer, although this does allocate itself. If you say own = false, it becomes your responsibility to free the memory.
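For example (this uses only existing APIs; freeing is then entirely the caller's job):

p = Libc.malloc(4096)                                     # raw allocation, invisible to the GC
A = unsafe_wrap(Array, Ptr{UInt8}(p), 4096; own = false)  # the wrapper itself is GC-managed
fill!(A, 0x00)
# ... use A, making sure it does not escape ...
Libc.free(p)                                              # own = false: freeing is our responsibility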
Some niche hurdles with allocations would be Task and Sockets: most functions there allocate. It is easy to circumvent the GC in numerical code, but as you interact more and more with underlying system packages, it gets harder to eliminate allocations. More deterministic behavior of allocations in these situations would be really nice.
This Base function is responsible for building a BitVector from an iterable (of Bool). I will quote it in full:
# The aux functions gen_bitarray_from_itr and fill_bitarray_from_itr! both
# use a Vector{Bool} cache for performance reasons
function gen_bitarray_from_itr(itr)
    B = empty!(BitVector(undef, bitcache_size))
    C = Vector{Bool}(undef, bitcache_size)
    Bc = B.chunks
    ind = 1
    cind = 1
    y = iterate(itr)
    while y !== nothing
        x, st = y
        @inbounds C[ind] = x
        ind += 1
        if ind > bitcache_size
            resize!(B, length(B) + bitcache_size)
            dumpbitcache(Bc, cind, C)
            cind += bitcache_chunks
            ind = 1
        end
        y = iterate(itr, st)
    end
    if ind > 1
        @inbounds C[ind:bitcache_size] .= false
        resize!(B, length(B) + ind - 1)
        dumpbitcache(Bc, cind, C)
    end
    return B
end
So we allocate a 4096-byte Vector{Bool}, fill it with iterator values, and then use a pretty hand-crafted function dumpbitcache to compress the Bools down to bits (8x compression).
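For intuition, the packing amounts to something like this (an illustrative stand-in, not Base's actual dumpbitcache; packbits is a made-up helper):

# Pack up to 8 Bools into one byte, LSB-first (the order used by BitArray chunks).
function packbits(bools::AbstractVector{Bool})
    byte = UInt8(0)
    for (i, b) in enumerate(bools)
        byte |= UInt8(b) << (i - 1)
    end
    return byte
end

packbits([true, false, true, true, false, false, false, true])  # 0x8d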
Once we're done… we leave the 4 kB Vector{Bool} for the garbage collector.
If we had good language support for that, we could unsafe_free!(C). Literally one line, no bigger mental / safety overhead than @inbounds or @simd ivdep. And we'd reduce GC pressure by 4 kB. Not essential, but small wins are also wins.
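Concretely, only the tail of gen_bitarray_from_itr would change (unsafe_free! being the proposed, hypothetical Base function):

    if ind > 1
        @inbounds C[ind:bitcache_size] .= false
        resize!(B, length(B) + ind - 1)
        dumpbitcache(Bc, cind, C)
    end
    unsafe_free!(C)   # hypothetical: hand the 4 kB cache back right away instead of leaving it to GC
    return B
end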
Will escape analysis succeed here? Kinda unlikely, unless dumpbitcache is inlined (but we don’t want that inlined – it is an expensive and rare call, and we really want iterate inlined instead).
I don't really care for the container, i.e. the handful of bytes in the Vector; I primarily care about the backing memory, i.e. the 4 kB.
We could resize!(C,0) to maybe get the same effect, but I’d rather have an explicit function that allows different / better implementations in the future.
This should probably be modified to use a Memory{Bool} now instead of a Vector, since the size never changes and is a const global anyway. That might already be an improvement.
It should also be noted that quite a lot of that code is pretty old by now, so it's not unthinkable that just rewriting it with the newer patterns we have now (like Memory) would make all of those hand-crafted ones obsolete.
That’s what I mean with “improving escape analysis”. For these cases, it’s sufficient to know that C will not escape through dumpbitcache, even if it’s not inlined.
As far as I know, the current escape analysis is already able to figure that out, it’s just not hooked up to every part of the compiler pipeline (see e.g. this PR or this one for some challenges with that).
You can always GC.@preserve the variable and go to town with pointers, if that's up your alley. I'm doubtful you're going to have an easier time writing that performant code with an unsafe_free!, though; such a call is very unlikely to cause that C to be stack allocated (which seems to be what you want to achieve, to reduce GC pressure), so I'm a bit skeptical of the approach.
I don’t want it stack allocated. I want good old malloc/free management where appropriate, with a GC that can pick up the inevitable memory leaks. And let’s face it, manual free is mostly worth it for largish allocations.
I want mostly the same effect as resize!(C, 0): the backing memory is freed, and Julia's memory management knows that it has 4 kB fewer "possible new garbage bytes" / 4 kB more "free space to serve allocs from". GC needs to run less often: this single free is equivalent to avoiding ~150 small short-lived allocations, in terms of pressure.
Sure, stack allocation is even better when appropriate. Sure, malloc / free still cost cycles (which we’re paying anyway, first during allocation and then during the sweep phase of GC).
In e.g. the JVM, my request would make no sense for normal heap-allocated stuff. That is because the JVM has a moving/evacuating GC with a bump allocator: freeing a single object in a page is absolutely useless for relieving GC pressure unless the entire page is evacuated. (This would make sense for DirectBuffer stuff, though!)
I would cast a vote in favor of an unsafe_free! mechanism. While escape analysis is a big win for memory usage reduction, it’s not infallible, and will always have situations where it fails to find the static lifetime of an allocation.
unsafe_free! would be an escape hatch for when escape analysis fails, allowing our ecosystem to move closer towards supporting memory-constrained use cases without putting as heavy a reliance on the compiler, when library authors already have the information they need to communicate strict allocation lifetimes.
Another interesting question is what optimization opportunities are unlocked when unsafe_free! ends the lifetime of an allocation - the compiler could have license to remove the unsafe_free! and instead forward the targeted object to the next allocation point, effectively allowing compiler-driven buffer reuse.
I would also like an unsafe_free!, and/or explicit stack allocation. I often find that temporary arrays fill up memory and cause GC. It's particularly annoying in parallel processing, where GC is devastating for performance. So one must resort to various remedies: the Fortran way of allocating everything in advance, caching buffers in TLS or in channels, using malloc, wrap, whatever. It's a frequent source of clutter in programs.
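For instance, the TLS variant of that clutter typically looks something like this (a sketch; get_scratch! is a made-up helper):

# Reuse one scratch buffer per task, so parallel workers neither share nor reallocate it.
function get_scratch!(n::Integer)
    buf = get!(() -> Float64[], task_local_storage(), :scratch)::Vector{Float64}
    length(buf) < n && resize!(buf, n)
    return buf
end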
I worry that unsafe_free! will bring use-after-free to Julia, much more often than it happens now through mismanagement of GC.@preserve.
History has shown that programmers are not fit for the job of manual object lifetime management. The numerous CVEs and vulnerabilities caused year after year by C's memory-safety issues are testament to that.
The prevalence of CVEs in C code is quite possibly because manual deallocation is the only mechanism to free memory - when users are forced to always manage memory lifetimes, then of course they'll sometimes get it wrong, and the chance of the user being wrong is thus multiplied across every heap allocation written in any C program.
What’s being proposed here is the opposite configuration: the vast majority of allocations will be managed by the GC, but a very small number of allocations might be manually freed.
Will bugs happen because of incorrect lifetime reasoning by library authors? Most certainly! But those sorts of bugs already happen with GC.@preserve and associated pointer wrangling. Yet, without those tools, we’d never be able to have:
GPU integration
Callbacks run on foreign threads/event loops
Integration with a variety of non-Julia libraries that communicate data via pointer
Similarly, without a feature like unsafe_free!, our ecosystem will not be able (on any reasonable time scale) to:
Implement algorithms which clean up after themselves to avoid GC pressure
Avoid GC pauses for known “static” code
Compile generic code reliably to GPUs/MCUs/other accelerators/runtimes
Avoid OOMs due to GPU GC pressure
Could we solve all of the above with really, really good escape analysis and finalizer inlining? Yeah, definitely! But, there are both not enough people who understand how to implement those analyses (and maintain them!), and too many edge cases and problematic code patterns to support. It’s just not something that we can reliably do, right now.
Of course, there’s no reason that unsafe_free! has to exist forever as a wart on our language - once Julia’s own compiler analyses are sufficiently powerful for most uses of unsafe_free!, we can start warning users away from it, and submitting PRs to packages to remove its use. Or, replace it with another safer mechanism like borrow checking or the like.
I don't love that unsafe_free! seems like a necessity right now - I would love if arrays would just free themselves rapidly and leave my memory alone! But that is not the state of Julia right now, and we have fast, low-memory code to write today! So let's implement this "hack" for now, see how we can limit its harm (through social and/or technical solutions), and then re-evaluate once the compiler is really up to the task of freeing memory in a timely manner.
I have yet to see a single use-after-free in Julia code being caused by GC.@preserve. Quite the opposite, in fact - it is a lack of proper GC.@preserve that causes them!
I don’t see why any of those use cases require unsafe_free!. All of those are, to the best of my knowledge, already working & well supported.
That list seems unnecessarily hyperbolic to me.
Clean up how, exactly? The only way that unsafe_free! could clean up is by forcefully removing an object from the GC pool it’s currently tracked in, requiring the expensive walking of that pool to shuffle the other objects being tracked by that pool. If you want objects passed to unsafe_free! to not be tracked through GC at all, then you also need an escape hatch to allocate these objects. We already have that pair though: Libc.malloc and Libc.free.
GC pauses in your loops won't be prevented by sprinkling some unsafe_free! onto your buffers; some user code on another task may trigger a GC collection far from your code, too.
Both of us have done that exact thing, so I don’t see it as an argument for unsafe_free!.
Again, how can unsafe_free! manage that? Are you saying that any object that may be passed to unsafe_free! should not be tracked by the GC, through some form of backpropagated “may be passed to unsafe_free!, allocate it somewhere else, but not with GC”? What if that’s not always the final place the object passes through (memory leak, hello!)? What if that’s not easily inferable? How is this different from calling Libc.malloc & Libc.free manually?
We have a case study for that sort of removal, --check-bounds=no, and it's not going well.
So I’m VERY doubtful that an “eventual weaning off of unsafe_free!” would go over differently.
I do NOT think that supporting a short-term hack is worth the long-term pain this will cause. I linked #52382 up above for a reason, and that is to show that the work on hoisting Memory to the stack (thereby achieving all of those same goals you posted above, but in a SAFE way) is actively under way, as recently as last December.
Does this mean it will happen soon? Of course not! But suggesting that “just use unsafe_free! for now” is a good solution is IMO missing the forest for the trees. We already have Libc.malloc/Libc.free. Those tools are incredibly unwieldy, yes, and that’s precisely because they don’t give any of the safety guarantees that regular Julia code gives.
Fundamentally, unsafe_free! introduces an additional escape hatch in safe Julia code that can suddenly make any interaction with objects passed to it unsafe. Code that may interact with such objects is as gloves-off a boxing match as handling raw Ptr objects. If you want that, more power to you, but please do it within the confines of GC.@preserve, and not by introducing the equivalent of Libc.free for arbitrary Julia objects.
As I said initially, my primary concern is large arrays. Or rather, in the new world, Memory instances with large backing memory. Large allocations get passed through to malloc anyway, and at least glibc passes those directly through to the kernel / mmap. There should be no expensive pool walking necessary: check whether the allocation is libc-based, and if so, zero out the pointer in the Memory and call libc free. (jl_gc_free_memory would need a null-pointer check as well.)
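In pseudocode, roughly (is_libc_backed and null_data_pointer! are made-up placeholders for the runtime checks described above):

function unsafe_free!(m::Memory)
    if is_libc_backed(m)          # placeholder: did this allocation bypass the GC pools?
        Libc.free(pointer(m))     # hand the backing memory straight back to libc
        null_data_pointer!(m)     # placeholder: turn later UAFs into null-pointer derefs
    end                           # pooled allocations: do nothing, let GC handle them
    return nothing
end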
Indeed, #define GC_MAX_SZCLASS (2032 - sizeof(void*)), so 2 kB arrays already always come from libc.
Memory.ptr is const, but that simply means that not all UAFs lead to a null-pointer deref / segfault; some might still corrupt memory under very rare conditions.
Consider my example from Base. If the 4 kB temp came from unmanaged malloc, then we'd need a try/finally block to protect the cleanup from exceptions thrown in the iterator.
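That is, the cache handling would have to look something like this sketch:

p = Libc.malloc(bitcache_size)
C = unsafe_wrap(Array, Ptr{Bool}(p), bitcache_size; own = false)
try
    # ... fill C from the iterator and flush via dumpbitcache ...
finally
    Libc.free(p)   # must run even if iterate(itr) throws
end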
In my proposal, we could just let GC handle the rare exceptional case.
Apart from the coding difficulties, try blocks are not free, since Julia uses setjmp/longjmp for exception handling.
PS. Apart from 2 kB+ short-lived temps, I'm also concerned with 100 MB+ probably-old-generation arrays.
Are there issues with solving this class of problem with Bumper.jl? I haven’t used it, but it seems specifically designed to deal with this category of temporary allocation.
Bumper.jl is probably the best idea for explicit memory management. It makes the mental load of managing many allocations much easier, since you only have to check into which enclosing scope you will return an allocated result, and use that scope's buffer. It does have limitations though, listed in the README.
The problem is that it's not integrated into the Julia ecosystem. It does not use regular Julia arrays, so you will encounter issues. It is also used in StaticTools.jl with malloc-backed arrays, where I have seen worse performance in some instances than with StrideArrays.jl-backed buffers or just plain Julia arrays. Malloc'ed arrays need to give LLVM better information to optimize well, and that is currently lacking.
I really do wish Bumper.jl were adopted as an allocator interface where you can set a buffer at the start of a scope, and all allocations in that scope are served from that buffer (i.e. arena allocators, as in Odin and Zig).
BTW, my AllocArrays.jl tries to fake that via dispatch. If you can control what's getting passed in (and you pass an AllocArray), and the allocations happen via similar (in generic code that tries to match the type you provide), then it can bump-allocate those allocations. In the README I show it taking CPU inference with a simple Flux model from 300 MB of allocations down to 3 MB.
I like what Bumper is doing. But there are various limitations. For example, consider:
julia> function foo(t)
           try
               @no_escape begin
                   z = @alloc(Int, 10)
                   t && throw("oops")
                   nothing
               end
           catch e
           end
           nothing
       end
julia> @show Bumper.default_buffer(); foo(false); @show Bumper.default_buffer();
Bumper.default_buffer() = SlabBuffer{1048576}(Ptr{Nothing} @0x0000000003054290, Ptr{Nothing} @0x0000000003154240, Ptr{Nothing}[Ptr{Nothing} @0x0000000003054240], Ptr{Nothing}[])
Bumper.default_buffer() = SlabBuffer{1048576}(Ptr{Nothing} @0x0000000003054290, Ptr{Nothing} @0x0000000003154240, Ptr{Nothing}[Ptr{Nothing} @0x0000000003054240], Ptr{Nothing}[])
julia> @show Bumper.default_buffer(); foo(true); @show Bumper.default_buffer();
Bumper.default_buffer() = SlabBuffer{1048576}(Ptr{Nothing} @0x0000000003054290, Ptr{Nothing} @0x0000000003154240, Ptr{Nothing}[Ptr{Nothing} @0x0000000003054240], Ptr{Nothing}[])
Bumper.default_buffer() = SlabBuffer{1048576}(Ptr{Nothing} @0x00000000030542e0, Ptr{Nothing} @0x0000000003154240, Ptr{Nothing}[Ptr{Nothing} @0x0000000003054240], Ptr{Nothing}[])
We see a similar issue as with explicit malloc/free: exceptions lead to memory leaks that the GC cannot ever clean up. try/finally blocks are not the answer for performance-critical code, since setjmp is not free, both in terms of cycles and in terms of forgone compiler optimizations (and in terms of weird LLVM bugs that returns_twice functions can trigger). It's less bad than malloc memory leaks, because it will heal when the task finishes or when we hit the next checkpoint_restore! up in the callstack.
I would not consider that as a bug in Bumper.jl, but maybe @Mason could document that limitation more explicitly.