Julia GC, heap fragmentation, out of memory, push!/append!

I saw the juliacon22 video from Christine Flood about Julia GC. I’m glad to see that the GC is finally getting some attention and not just the compiler. It’s long overdue.

Let’s just address the elephant in the room: The Julia GC has a memory fragmentation problem. There I said it. I brought this possibility up (I wasn’t sure at that time) on this forum a few times over the years (all the way back since julia 0.2) but this was always brushed aside. The standard answer seems to be: Julia GC doesn’t have a problem, it’s you (i.e. you have a memory leak). My code didn’t, but proving that was pretty impossible due to the lack of tooling to find memory leaks in user code.

Anyway it’s good to see that Christine finally acknowledged the fact that even without a user memory leak Julia can take more and more of host memory until you run out (I’ve experienced this myself). This is of course heap fragmentiation although she never uses that term. Admitting that you have a problem is the start of recovery :slightly_smiling_face:

Unfortunately the immediate things planned seem to be about performance (parallel GC, shorter collect time) rather than fixing fragmentation. Due to the current constraint of non-movable heap memory I’m not sure if this problem can ever be fully solved. To fix it completely you’ll probably have to relax that constraint. Hopefully the future GC work will also try to address (or at least mitigate) fragmentation.

What I do now is adapt my code to avoid fragmentation as much as possible. This is also good for performance. So instead of using push!/append! and only try to use just enough memory as required (like you would in languages like Java/C# with GCs that don’t have fragmentation issues) I preallocate a fixed amount of memory (more than strictly needed sometimes) and initialize everything with zeros. That’s of course an old school trick. That solved an out of memory problem I had in an application.

I feel that in the documentation push! and append! (and similar functions) should have a warning saying that the use of these for large number of elements (tens of thousands) will cause heap fragmentation, slow run time and possibly an out of memory. A warning like “push!/append! considered dangerous” might not be out of place.

Next I’d like to see some tooling that lets you look at the heap structure (like you have for JVM/CLR) and detect where a lot of allocations/deallocations happen from (I believe from one of the presentations there is in the profiler now some info about allocations). Ideally in a graphical way and integrated into the VSCode tooling.

But of course the best solution would be a better GC that does’t take more memory than needed. A region based GC would be sweet!

There is now a new argument to julia for a memory limit. That’s a step in the right direction but it’s not a hard limit so it’s still possible to run out of host memory and crash the host itself. That’s bad for servers! I’d like to see another option that is a hard heap size limit just like for the JVM where it the julia heap would need more than the specified limit it exits with an OutOfMemory exception. So only julia exits cleanly and you host server never crashes. I’ m sure sysadmins will appreciate that :slightly_smiling_face:

2 Likes

Could you elaborate on that a bit? How does using push!/append! lead to fragmentation (of what, exactly)?

Switching to a compacting GC would be awesome, but unfortunately it’s not as simple as one would hope. Ruby changed to one on 3.0 and they faced similar challenges Ruby Garbage Collection Deep Dive: Compaction | Jemma Issroff. There is some work, currently a bit stalled, for a heap snapshot feature like you mentioned.

As such improving our current GC is a more viable mid term solution, and it will help in lots of cases.

1 Like

It’s simple really: instead of starting with a big array preallocated of zeros and then filling in the actual values, you start with a empty array and push/append values to it (in case you do not know before hand how many it will be). If you have a lot of values (at least ten of thousand and depending on how much ram your machine has) the runtime will have to periodically allocate a bigger array, copy over the existing values and deallocate the old array. Now if you GC doesn’t defragment your heap from time to time , after a while your heap will look like swiss cheese and if then you need to allocate a chunk bigger than the biggest contiguous chunk left in the heap , the runtime will ask the OS for more merory. And so the julia process will grow and grow in absolute memory (what you see with your os tooling like top). Eventually your host is going to run out of memory.

Ok, you mean fragmentation in the sense that there’s more and more disjoint allocations (from the OS’ POV) that don’t get freed. I’d count that as a memory leak, not as heap fragmentation (which to me is relevant when talking about chasing pointers).

That I find quite unlikely, since all memory allocated is virtual memory anyway. I assume you mean when you finally run out of SWAP as well (in which case you’d OOM either way).

That said - freeing memory that isn’t needed anymore and reusing the chunks would certainly be beneficial.

No it’s not a memory leak in the strict sense it’s julia heap fragmentation. However it’s an effective memory leak from the OS point of view since all it sees is a julia process that keep on asking more memory.

I meant when your process needs more than available RAM. Yes you could go on using swap space for a while but you definitely don’t want that, your speed would be very slow. Also it’s not a linear progression: at first the GC will be able to cope and free memory and you wan’t see an increase. Everything appears stable and normal but then you see an accelerating increase for memory and larger and larger fluctuations (GC struggling to free enough memory but not quite succeeding) till at the end your process memory is fluctuating very closely to 100 RAM all the time and you process speed grinds to a halt.

Not a solution, but I would point out that languages like C, C++ and Rust do not have any way to defragment their heaps either. Malloc implementations will try to avoid fragmentation, but it can still happen, and in those languages, when it does, it’s on the programmer to figure out how to deal with it.

1 Like

I think move objects in memory would be terrible for integrating Julia with other languages

I get that. I wouldn´t move memory held by native code but maybe for pure julia code, so partially relax the no move constraint

How would that work? Julia already compiles to native code, so regular julia code would have to be treated just the same, no?

I meant memory held by external libs.

Yes but in those languages you know that you’re responsible for memory management. In fact in those languages they want to manage all of that explicitely for full control including allocation/deallocation strategies like only use static memory in C for embedded systems or arena allocators in C++ etc. In higher level languages with GC you usually do not want to think too much about that

Right now in the documentation of Julia (AFAIK) there is nothing that says that the GC doesn’t take care of fragmentation and hence that the user should himself think about that and use for example strategies like preallocate as much as possible. Documenting that would be a good first step to handle the problem.

Note that we are actively collecting GC benchmarks at GitHub - JuliaCI/GCBenchmarks. If you have good problem cases for GC aren’t yet represented in the test cases we would love PRs.

1 Like

When I have some time I’ll try to go back to some old crappy code and extract just the code needed to blow up the memory :slight_smile:

5 Likes

Thanks for bringing it up, but I think it’s a non-issue, or I mean with an alternative allocator a small issue.

It’s time to revisit mimalloc (I looked into linking alternatives to Julia before):

mimalloc is a drop-in replacement for malloc and can be used in other programs without code changes, for example, on dynamically linked ELF-based systems (Linux, BSD, etc.) you can use it as:

> LD_PRELOAD=/usr/lib/libmimalloc.so  myprogram

It also has an easy way to override the default allocator in Windows. Notable aspects of the design include:
[…]

  • free list sharding: instead of one big free list (per size class) we have many smaller lists per “mimalloc page” which reduces fragmentation and increases locality – things that are allocated close in time get allocated close in memory. […]
  • free list multi-sharding: the big idea! Not only do we shard the free list per mimalloc page, but for each page we have multiple free lists.
  • secure: mimalloc can be built in secure mode, adding guard pages, randomized allocation, encrypted free lists, etc. to protect against various heap vulnerabilities. […]
  • bounded: it does not suffer from blowup [1], has bounded worst-case allocation times (wcat), bounded space overhead (~0.2% meta-data, with low internal fragmentation), and has no internal points of contention using only atomic operations.
  • fast: In our benchmarks (see below), mimalloc outperforms other leading allocators (jemalloc, tcmalloc, Hoard, etc), and often uses less memory. A nice property is that it does consistently well over a wide range of benchmarks. There is also good huge OS page support for larger server programs.

The thing is when I tried it, it surprisingly didn’t help, but that’s because each allocation in Julia doesn’t directly lead to malloc in C. Julia does something similar, tries to allocate its own larger blocks (to reduce fragmentation, just likely not in a state-of-the-art way, that seems to be it, and I’m not sure it conflicts with our GC). I didn’t look into bypassing that, but would likely be great, as simplifying Julia’s codebase, and opting into this 8k LOC library (which might be larger than the code we eliminate, but still a win, then not part of our codebase).

Bypassing (eliminating) the Julia code, as also great because then we can try other options too, not just the devs, but also users, choosing the alternatives with LD_PRELOAD above, since:

General memory allocators are interesting as there exists no algorithm that is optimal – for a given allocator one can usually construct a workload where it does not do so well. The goal is thus to find an allocation strategy that performs well over a wide range of benchmarks without suffering from (too much) underperformance in less common situations.

I’m skeptical that’s the solution. Moving stuff around means actual copying (EDIT: usually, not always see MESH allocator in my post below) and CPU/RAM bandwidth overhead, while, yes eliminating fragmentation, but the fragmentation we do have, and will not eliminate completely (just reduce), is in virtual space (mostly), which is free? It used to be a limited resource too (in 32-bit era, we shouldn’t optimize for that), but in 64-bit it’s at least close to free. I’m not sure where the holes in address space end up, if in a swap file, then that’s just stupid. This is a solved problem for sparse files, so I doubt it’s done that way.

From memory (and claim above), that’s likely the best alternative malloc alternatice, but see also my older thread:

As I said, I’m not sure, I didn’t yet read the Ruby … Deep Dive, I guess it does help most far small objects, because of cache issues, not (huge) arrays of small, i.e. matrices Julia is most optimized for. At least I would first look into the simple change I proposed, and as a last resort, compacting, or a hybrid solution. If the small objects would most likely be strings, then those oare object very likely to be passed to or from e.g. C. For interop you can’t use compacting, so it seems then the user would have to predict which objects, e.g. Strings will be passed to C (as CString).

I note from the Deep Dive (outdated?):

It’s important to also note that as of Ruby 3.0, compaction is still under development and actually currently degrades performance on most apps. It’s therefore not automatically turned on.

You might think with the blog post from May 2021 and Ruby 3.0 rather recent they went with good algorithms, something recent, but:

The algorithm Ruby uses for compaction is called a two-finger compaction algorithm [.] actually first written about in “The Programming Language LISP: Its Operation and Applications” by Edmund C. Berkeley and Daniel G. Bobrow in 1964.

i.e. in an era where memory was expensive. So I agree:

It’s quite incredible that this is the algorithm Ruby uses today (57 years later!),

I’m puzzled why, do they target tiny microcontrollers? I thought mostly webserving, and then maybe string processing tilt in their favor over large matrix handling?

In a classic result, Robson showed that all such allocators can
suffer from catastrophic memory fragmentation [30]. This
increase in memory consumption can be as high as the log
of the ratio between the largest and smallest object sizes allo-
cated. For example, for an application that allocates 16-byte
and 128KB objects, it is possible for it to consume 13× more
memory than required.

Surprisingly, I found MESH that can do that (a drop in replacement, and could apply to Julia too): https://raw.githubusercontent.com/plasma-umass/Mesh/master/mesh-pldi19-powers.pdf

Crucially and counterintuitively, Mesh performs com-
paction without relocation; that is, without changing the
addresses of objects. This property is vital for compatibility
with arbitrary C/C++ applications. To achieve this, Mesh
builds on a mechanism which we call meshing
[…]
Our analysis shows that Mesh breaks the above-mentioned Robson worst
case bounds for fragmentation with high probability [30], as
memory reclaimed by meshing is available for use by any
size class. This ability to redistribute memory from one size
class to another enables Mesh to adapt to changes in an ap-
plication’s allocation behavior in a way other segregated-fit
allocators cannot.

We implement Mesh as a library for C/C++ applications
running on Linux or Mac OS X. Mesh interposes on memory
management operations, making it possible to use it without
code changes or recompilation by setting the appropriate
environment variable to load the Mesh library (e.g., export
LD_PRELOAD=libmesh.so on Linux). Our evaluation demon-
strates that our implementation of Mesh is both fast and
efficient in practice. It generally matches the performance of
state-of-the-art allocators while guaranteeing the absence of
catastrophic fragmentation with high probability. In addition,
it occasionally yields substantial space savings: replacing the
standard allocator with Mesh automatically reduces memory
consumption by 16% (Firefox) to 39% (Redis).
[…]
Mesh enables compaction without relocating object addresses;
it depends only on hardware-level virtual memory support,
which is standard on most computing platforms like x86 and
ARM64. Mesh works by finding pairs of pages and merging
them together physically but not virtually: this merging lets
it relinquish physical pages to the OS.

Still mimalloc is always fastest (or close to within 3%) of the other allocators benchmarked there, including MESH, which can be up to 155x slower…

I guess we could run the GCBenchmarks with different mallocs, not sure how easy/hard that would be, but maybe there is some possible benefit there

systemd-oomd does something like this at a system level. My desktop machine never goes into a multi minute thrash freeze anymore.

Imagine your heap has say 10MB chunks each of which has 100kB of randomly located integer arrays… Let’s say you’ve got 1000 of these arrays, so you have 100MB allocated, but in 10GB of virtual address… The OS knows nothing about the fact that these chunks are mostly not in use by Julia… Now try to allocate a Gig of space on your 8G machine with a couple gigs of swap… Boom it’s broken.