Julia GC, heap fragmentation, out of memory, push!/append!

Thanks for bringing it up, but I think it’s a non-issue, or I mean with an alternative allocator a small issue.

It’s time to revisit mimalloc (I looked into linking alternatives to Julia before):

mimalloc is a drop-in replacement for malloc and can be used in other programs without code changes, for example, on dynamically linked ELF-based systems (Linux, BSD, etc.) you can use it as:

> LD_PRELOAD=/usr/lib/libmimalloc.so  myprogram

It also has an easy way to override the default allocator in Windows. Notable aspects of the design include:
[…]

  • free list sharding: instead of one big free list (per size class) we have many smaller lists per “mimalloc page” which reduces fragmentation and increases locality – things that are allocated close in time get allocated close in memory. […]
  • free list multi-sharding: the big idea! Not only do we shard the free list per mimalloc page, but for each page we have multiple free lists.
  • secure: mimalloc can be built in secure mode, adding guard pages, randomized allocation, encrypted free lists, etc. to protect against various heap vulnerabilities. […]
  • bounded: it does not suffer from blowup [1], has bounded worst-case allocation times (wcat), bounded space overhead (~0.2% meta-data, with low internal fragmentation), and has no internal points of contention using only atomic operations.
  • fast: In our benchmarks (see below), mimalloc outperforms other leading allocators (jemalloc, tcmalloc, Hoard, etc), and often uses less memory. A nice property is that it does consistently well over a wide range of benchmarks. There is also good huge OS page support for larger server programs.

The thing is when I tried it, it surprisingly didn’t help, but that’s because each allocation in Julia doesn’t directly lead to malloc in C. Julia does something similar, tries to allocate its own larger blocks (to reduce fragmentation, just likely not in a state-of-the-art way, that seems to be it, and I’m not sure it conflicts with our GC). I didn’t look into bypassing that, but would likely be great, as simplifying Julia’s codebase, and opting into this 8k LOC library (which might be larger than the code we eliminate, but still a win, then not part of our codebase).

Bypassing (eliminating) the Julia code, as also great because then we can try other options too, not just the devs, but also users, choosing the alternatives with LD_PRELOAD above, since:

General memory allocators are interesting as there exists no algorithm that is optimal – for a given allocator one can usually construct a workload where it does not do so well. The goal is thus to find an allocation strategy that performs well over a wide range of benchmarks without suffering from (too much) underperformance in less common situations.

I’m skeptical that’s the solution. Moving stuff around means actual copying (EDIT: usually, not always see MESH allocator in my post below) and CPU/RAM bandwidth overhead, while, yes eliminating fragmentation, but the fragmentation we do have, and will not eliminate completely (just reduce), is in virtual space (mostly), which is free? It used to be a limited resource too (in 32-bit era, we shouldn’t optimize for that), but in 64-bit it’s at least close to free. I’m not sure where the holes in address space end up, if in a swap file, then that’s just stupid. This is a solved problem for sparse files, so I doubt it’s done that way.

From memory (and claim above), that’s likely the best alternative malloc alternatice, but see also my older thread: