Julia GC, heap fragmentation, out of memory, push!/append!

Ok, you mean fragmentation in the sense that there’s more and more disjoint allocations (from the OS’ POV) that don’t get freed. I’d count that as a memory leak, not as heap fragmentation (which to me is relevant when talking about chasing pointers).

That I find quite unlikely, since all memory allocated is virtual memory anyway. I assume you mean when you finally run out of SWAP as well (in which case you’d OOM either way).

That said - freeing memory that isn’t needed anymore and reusing the chunks would certainly be beneficial.

No, it’s not a memory leak in the strict sense; it’s Julia heap fragmentation. However, it’s an effective memory leak from the OS’s point of view, since all it sees is a Julia process that keeps asking for more memory.

I meant when your process needs more than the available RAM. Yes, you could go on using swap space for a while, but you definitely don’t want that; it would be very slow. Also, it’s not a linear progression: at first the GC is able to cope and free memory, and you won’t see an increase. Everything appears stable and normal, but then you see an accelerating increase in memory use and larger and larger fluctuations (the GC struggling to free enough memory but not quite succeeding), until at the end your process memory is fluctuating very close to 100% RAM all the time and your process speed grinds to a halt.
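One way to see the divergence described above between what Julia considers live and what the OS sees is to compare `Base.gc_live_bytes()` with `Sys.maxrss()`. If the live-heap estimate stays flat while peak RSS keeps climbing, memory is being retained at the OS level rather than by live Julia objects. A minimal sketch (the helper name `memory_report` is made up for illustration):

```julia
# Sketch: compare Julia's live-heap estimate with the OS-level peak RSS.
# A flat live-heap number alongside growing RSS suggests memory held by
# the process (fragmentation / effective leak) rather than live objects.
function memory_report(label::AbstractString)
    live_mb = Base.gc_live_bytes() / 2^20   # Julia's estimate of live heap
    rss_mb  = Sys.maxrss() / 2^20           # peak resident set size (OS view)
    println(label, ": live ≈ ", round(live_mb; digits=1),
            " MiB, peak RSS ≈ ", round(rss_mb; digits=1), " MiB")
    return live_mb, rss_mb
end

memory_report("start")
tmp = [rand(10_000) for _ in 1:1_000]   # roughly 80 MiB of arrays
tmp = nothing                            # drop the only reference
GC.gc()
memory_report("after free + GC")
```

The exact numbers depend on the OS and Julia version; the pattern over many allocate/free cycles is what matters.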

Not a solution, but I would point out that languages like C, C++ and Rust do not have any way to defragment their heaps either. Malloc implementations will try to avoid fragmentation, but it can still happen, and in those languages, when it does, it’s on the programmer to figure out how to deal with it.


I think moving objects in memory would be terrible for integrating Julia with other languages.

I get that. I wouldn’t move memory held by native code, but maybe do it for pure Julia code, so partially relax the no-move constraint.

How would that work? Julia already compiles to native code, so regular julia code would have to be treated just the same, no?

I meant memory held by external libs.

Yes, but in those languages you know that you’re responsible for memory management. In fact, in those languages you want to manage all of that explicitly for full control, including allocation/deallocation strategies like using only static memory in C for embedded systems, or arena allocators in C++, etc. In higher-level languages with GC you usually do not want to think too much about that.

Right now the Julia documentation (AFAIK) says nothing about the GC not taking care of fragmentation, and hence that users should think about it themselves and use strategies like preallocating as much as possible. Documenting that would be a good first step toward handling the problem.
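As an aside on the preallocation strategy mentioned above, here is a minimal sketch (the function names are made up for illustration):

```julia
# Naive: the vector grows by repeated push!, so its underlying buffer is
# reallocated several times as capacity doubles, churning the allocator.
function accumulate_naive(n)
    out = Float64[]
    for i in 1:n
        push!(out, sqrt(i))
    end
    return out
end

# Preallocated: allocate the full buffer once and write into it, so the
# allocator sees a single request. (sizehint!(out, n) is a middle ground
# when the final length is only approximately known.)
function accumulate_prealloc(n)
    out = Vector{Float64}(undef, n)
    for i in 1:n
        out[i] = sqrt(i)
    end
    return out
end

accumulate_naive(10) == accumulate_prealloc(10)  # same result, fewer allocations
```

Reusing such preallocated buffers across iterations keeps the heap layout stable, which is exactly what helps against fragmentation.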

Note that we are actively collecting GC benchmarks at GitHub - JuliaCI/GCBenchmarks. If you have good problem cases for the GC that aren’t yet represented in the test cases, we would love PRs.


When I have some time I’ll try to go back to some old crappy code and extract just the code needed to blow up the memory :slight_smile:


Thanks for bringing it up, but I think it’s a non-issue, or at least a small one with an alternative allocator.

It’s time to revisit mimalloc (I looked into linking alternatives to Julia before):

mimalloc is a drop-in replacement for malloc and can be used in other programs without code changes, for example, on dynamically linked ELF-based systems (Linux, BSD, etc.) you can use it as:

> LD_PRELOAD=/usr/lib/libmimalloc.so  myprogram

It also has an easy way to override the default allocator in Windows. Notable aspects of the design include:
[…]

  • free list sharding: instead of one big free list (per size class) we have many smaller lists per “mimalloc page” which reduces fragmentation and increases locality – things that are allocated close in time get allocated close in memory. […]
  • free list multi-sharding: the big idea! Not only do we shard the free list per mimalloc page, but for each page we have multiple free lists.
  • secure: mimalloc can be built in secure mode, adding guard pages, randomized allocation, encrypted free lists, etc. to protect against various heap vulnerabilities. […]
  • bounded: it does not suffer from blowup [1], has bounded worst-case allocation times (wcat), bounded space overhead (~0.2% meta-data, with low internal fragmentation), and has no internal points of contention using only atomic operations.
  • fast: In our benchmarks (see below), mimalloc outperforms other leading allocators (jemalloc, tcmalloc, Hoard, etc), and often uses less memory. A nice property is that it does consistently well over a wide range of benchmarks. There is also good huge OS page support for larger server programs.

The thing is, when I tried it, it surprisingly didn’t help, but that’s because each allocation in Julia doesn’t directly lead to a malloc in C. Julia does something similar itself: it allocates its own larger blocks (to reduce fragmentation, though likely not in a state-of-the-art way; I’m not sure whether that conflicts with our GC). I didn’t look into bypassing that, but it would likely be great, as it would simplify Julia’s codebase, opting into this 8k-LOC library instead (which might be larger than the code we eliminate, but still a win, since it would no longer be part of our codebase).

Bypassing (eliminating) that Julia code would also be great because then we could try other options too: not just the devs, but also users could choose alternative allocators with LD_PRELOAD as above, since:

General memory allocators are interesting as there exists no algorithm that is optimal – for a given allocator one can usually construct a workload where it does not do so well. The goal is thus to find an allocation strategy that performs well over a wide range of benchmarks without suffering from (too much) underperformance in less common situations.

I’m skeptical that’s the solution. Moving stuff around means actual copying (EDIT: usually, not always; see the MESH allocator in my post below) and CPU/RAM bandwidth overhead. Yes, it eliminates fragmentation, but the fragmentation we do have, and will never eliminate completely (only reduce), is mostly in virtual address space, which is essentially free. It used to be a limited resource too (in the 32-bit era, which we shouldn’t optimize for), but on 64-bit it’s at least close to free. I’m not sure where the holes in the address space end up; if in a swap file, then that’s just stupid. This is a solved problem for sparse files, so I doubt it’s done that way.

From memory (and the claim above), that’s likely the best malloc alternative, but see also my older thread:

As I said, I’m not sure; I haven’t yet read the Ruby … Deep Dive. I guess compaction helps most for small objects, because of cache issues, not for (huge) arrays of them, i.e. the matrices Julia is most optimized for. At least I would first look into the simple change I proposed, and only as a last resort into compacting, or a hybrid solution. The small objects would most likely be strings, and those are objects very likely to be passed to or from e.g. C. For interop you can’t use compacting, so it seems the user would have to predict which objects, e.g. Strings, will be passed to C (as Cstring).

I note from the Deep Dive (outdated?):

It’s important to also note that as of Ruby 3.0, compaction is still under development and actually currently degrades performance on most apps. It’s therefore not automatically turned on.

You might think, with the blog post from May 2021 and Ruby 3.0 being rather recent, that they went with good algorithms, something recent, but:

The algorithm Ruby uses for compaction is called a two-finger compaction algorithm […] actually first written about in “The Programming Language LISP: Its Operation and Applications” by Edmund C. Berkeley and Daniel G. Bobrow in 1964.

i.e. in an era where memory was expensive. So I agree:

It’s quite incredible that this is the algorithm Ruby uses today (57 years later!),

I’m puzzled why; do they target tiny microcontrollers? I thought mostly web serving, and then maybe string processing tilts in their favor over large matrix handling?

In a classic result, Robson showed that all such allocators can suffer from catastrophic memory fragmentation [30]. This increase in memory consumption can be as high as the log of the ratio between the largest and smallest object sizes allocated. For example, for an application that allocates 16-byte and 128KB objects, it is possible for it to consume 13× more memory than required.
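The 13× figure in the quote is just the log of the size ratio, which is easy to check:

```julia
# Robson's bound: worst-case fragmentation blowup can grow with the log
# of the ratio between the largest and smallest allocation sizes.
smallest = 16          # bytes
largest  = 128 * 1024  # 128 KiB, in bytes
blowup   = log2(largest / smallest)  # → 13.0
```

So a workload mixing 16-byte and 128 KiB allocations can, in the worst case, consume about 13× the memory it actually needs under any such allocator.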

Surprisingly, I found MESH, which can do that (it’s a drop-in replacement, and could apply to Julia too): https://raw.githubusercontent.com/plasma-umass/Mesh/master/mesh-pldi19-powers.pdf

Crucially and counterintuitively, Mesh performs compaction without relocation; that is, without changing the addresses of objects. This property is vital for compatibility with arbitrary C/C++ applications. To achieve this, Mesh builds on a mechanism which we call meshing
[…]
Our analysis shows that Mesh breaks the above-mentioned Robson worst case bounds for fragmentation with high probability [30], as memory reclaimed by meshing is available for use by any size class. This ability to redistribute memory from one size class to another enables Mesh to adapt to changes in an application’s allocation behavior in a way other segregated-fit allocators cannot.

We implement Mesh as a library for C/C++ applications running on Linux or Mac OS X. Mesh interposes on memory management operations, making it possible to use it without code changes or recompilation by setting the appropriate environment variable to load the Mesh library (e.g., export LD_PRELOAD=libmesh.so on Linux). Our evaluation demonstrates that our implementation of Mesh is both fast and efficient in practice. It generally matches the performance of state-of-the-art allocators while guaranteeing the absence of catastrophic fragmentation with high probability. In addition, it occasionally yields substantial space savings: replacing the standard allocator with Mesh automatically reduces memory consumption by 16% (Firefox) to 39% (Redis).
[…]
Mesh enables compaction without relocating object addresses; it depends only on hardware-level virtual memory support, which is standard on most computing platforms like x86 and ARM64. Mesh works by finding pairs of pages and merging them together physically but not virtually: this merging lets it relinquish physical pages to the OS.

Still, mimalloc is always the fastest (or within about 3% of it) of the allocators benchmarked there, including MESH, which can be up to 155× slower…

I guess we could run the GCBenchmarks with different mallocs; I’m not sure how easy or hard that would be, but maybe there is some possible benefit there.

systemd-oomd does something like this at a system level. My desktop machine never goes into a multi-minute thrash freeze anymore.

Imagine your heap has, say, 10MB chunks, each of which holds 100kB of randomly located integer arrays. Let’s say you’ve got 1000 of these chunks, so you have 100MB allocated, but spread across 10GB of virtual address space. The OS knows nothing about the fact that these chunks are mostly not in use by Julia. Now try to allocate a gig of space on your 8GB machine with a couple of gigs of swap… Boom, it’s broken.
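The scenario above can be sketched with scaled-down numbers. Whether the freed space is actually returned to the OS depends on the OS, the allocator, and the Julia version (large arrays are often malloc’d and freed directly), so treat this purely as an illustration of the allocation pattern, not as a guaranteed reproducer:

```julia
# Sketch of the scenario above, scaled down: allocate big chunks, keep
# only a small slice of each alive, drop the rest. The live heap shrinks,
# but the survivors' original homes were scattered across large chunks.
chunks    = [zeros(UInt8, 10 * 2^20) for _ in 1:20]        # 20 × 10 MiB chunks
survivors = [copy(view(c, 1:100 * 1024)) for c in chunks]  # keep ~100 KiB each
chunks    = nothing                                        # drop the big chunks
GC.gc()
println("live heap ≈ ", round(Base.gc_live_bytes() / 2^20; digits=1), " MiB")
println("peak RSS  ≈ ", round(Sys.maxrss() / 2^20; digits=1), " MiB")
```

Here the copies compact the survivors; the pathological case in the post is when the small live arrays stay in place inside the big chunks, pinning them from the OS’s point of view.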

Nice. I didn’t know that. Guess I’m too old school :wink:
Although I deploy on Linux, I develop on Windows, and there you can still get into the situation where you reach 100% RAM usage, and then the Julia GC frantically tries to free memory, with very large (GBs) fluctuations in Julia’s memory use but never stabilizing as long as you keep working (serving requests).

Thanks for clarifying; I’m not sure how often the address space gets (very) fragmented in practice. I may have downplayed the risk too much. I guess there’s a reason Ruby looked into a compacting GC (e.g. for long-running web servers); that may (or may not) apply less to the more typical HPC use of Julia. Anyway, there’s a solution with MESH I posted in my latest post. I think you can try it right now without code changes in Julia; maybe there’s already some benefit (or none). It may need the same simplification of Julia’s code I mentioned for mimalloc, so that it sees all the frees that would happen.

This assumes the OS doesn’t do anything clever with the holes in the virtual address space, and I suppose it can’t, because they were allocated as memory (maybe used previously) and the OS was never informed that they are free/garbage by now. I suppose it could be, but that would need help from the libc. Or just allocate a very large swap file; it’s cheaper than RAM. Or use MESH.

Incidentally, it’s very likely this issue bit me recently…

Many people are being hurt by this problem; I am facing the same problem.

I have a computer with 128GB of RAM, but because of this problem, it turns out that all that RAM is useless.