Julia vs Microsoft's (Rust-inspired) Verona language and e.g. snmalloc allocator

The research project doesn’t mean “Microsoft is no longer using C++/C#/Rust”. See under C.

The project is meant to be learned from, to integrate the ideas into other languages. I’m not sure if it’s possible to add to Julia their “new concurrency model that seamlessly integrates ownership.” Still, it seems we could reuse their new allocator for a performance benefit.

A.
The allocator is intriguing “enables 1000s of remote deallocations to be performed with only a single atomic operation enabling great scaling with core count. […] The fast paths are highly optimised with just two branches on the fast path for malloc […] Deallocations occur in large batches”; runs on exactly all the platforms Julia supports (and ARM, including Android and iOS), “running in both the FreeBSD kernel and userspace”. It’s “a header-only C++ library, it can be LD_PRELOAD ed on Elf platforms (e.g. Linux, BSD), and there is a crate to use it from Rust.”

From the paper on the allocator: “[Such asymmetric scenarios] also occur in some GC implementations.” A problem they solve, for at least non-GC languages, I’m not yet sure if the solution also applies to Julia (or other GC languages).

Note, the paper is a bit outdated, and the difference.md doc has “[2-4] Are changes that are directly inspired by (mimalloc)” (another Microsoft allocator, probably older, and at least that one is a drop-in replacement).

Section 2.6 of the paper was intriguing, describing how SuperMalloc uses floating-point for size classes, “2-bit mantissa, m, and 6-bit exponent, e”, with their allocator doing “similar calculation” also using 8-bit number for size, while ensuring “16-byte aligned” allocations.

Also intriguing in the paper is section 2.9 Portability with e.g. “lazy commit”, and how Windows doesn’t support that, and how they do things differently.

And “WarmupTime” [for SuperMalloc]

B.

mimalloc (pronounced “me-malloc”) is a general purpose allocator with excellent performance characteristics. Initially developed by Daan Leijen for the run-time systems of the Koka and Lean languages. Latest release: v1.6.2 (2020-04-20).

It is a drop-in replacement for malloc

C.

We are providing a new concurrency model that seamlessly integrates ownership.

And from the FAQ:

There are several areas we are investigating in Project Verona. A few of the high-level questions are here:

  • If we design a language without concurrent mutation, can we build scalable memory management?
  • Can linear regions be used to remove the restrictions of per-object linearity without sacrificing memory management?
  • Can language level regions be used to support compartmentalisations?
6 Likes

Nice :sailboat:.
Is Julia’s memory API (or the appropriate subset) amenable to whichever would be best for us?

1 Like

I tried both allocators, and building them was simple (on Linux, there are also instructions for Windows), as advertised, and FYI, I needed to do e.g.:

sudo apt-get install ninja-build

I could run Julia with LD_PRELOAD, but the running time was very similar, so I doubt the allocator is actually used (except at least LD_PRELOAD doesn’t something as it interferes with exit of Julia); I think Julia would need to dynamically link for this to work (but very likely Julia statically links malloc and free) and likely then faster:

$ time LD_PRELOAD=/home/../build/libsnmallocshim.so julia -t4 ~/binary-trees.jl 20
stretch tree of depth 21	 check: 4194303
1048576	 trees of depth 4	 check: 32505856
262144	 trees of depth 6	 check: 33292288
65536	 trees of depth 8	 check: 33488896
16384	 trees of depth 10	 check: 33538048
4096	 trees of depth 12	 check: 33550336
1024	 trees of depth 14	 check: 33553408
256	 trees of depth 16	 check: 33554176
64	 trees of depth 18	 check: 33554368
16	 trees of depth 20	 check: 33554416
long lived tree of depth 20	 check: 2097151
munmap_chunk(): invalid pointer

signal (6): Aborted
in expression starting at none:0
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7f1f5f9b5896)
unknown function (ip: 0x7f1f5f9bc909)
cfree at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
close_unit_1 at /src/gcc-7.3.0/libgfortran/io/unit.c:727
close_units at /src/gcc-7.3.0/libgfortran/io/unit.c:780
unknown function (ip: 0x7f1f61d8ab72)
unknown function (ip: 0x7f1f5f96f040)
exit at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4015d4)
Allocations: 306019175 (Pool: 306019023; Big: 152); GC: 210
Aborted (core dumped)

real	0m9,103s
user	0m8,797s
sys	0m0,967s

However mimalloc worked perfectly on a corresponding C program (I’m not actually sure they’re limited to 4 threads) for a 3x speedup:

$ time LD_PRELOAD=/home/../mimalloc-master/out/release/libmimalloc.so ./binarytrees.gcc-5.gcc_run 20
[..]
real	0m0,996s   # vs. 0m2,981s without LD_PRELOAD
user	0m5,370s
sys	0m0,080s

and kind of for the snmalloc one:

$ time LD_PRELOAD=/home/../snmalloc-master/build/libsnmallocshim.so ./binarytrees.gcc-5.gcc_run 20
stretch tree of depth 21	 check: 4194303

[Here in-between there's a long pause.]

1048576	 trees of depth 4	 check: 32505856
262144	 trees of depth 6	 check: 33292288
65536	 trees of depth 8	 check: 33488896
16384	 trees of depth 10	 check: 33538048
4096	 trees of depth 12	 check: 33550336
1024	 trees of depth 14	 check: 33553408
256	 trees of depth 16	 check: 33554176
64	 trees of depth 18	 check: 33554368
16	 trees of depth 20	 check: 33554416
long lived tree of depth 20	 check: 2097151

real	0m12,657s
user	1m26,205s
sys	0m0,140s

Note, a was using the fastest C program–that actually uses malloc, not a memory pool–and the former allocator makes it competitive with the fastest C program, that uses a memory pool. That fastest C program is slightly faster at 0m0,671s. Predictably the allocator doesn’t help the memory pool version, but since Julia is disallowed to use such, we could gain from it.

I was using the fastest Julia program and that C program: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-gcc-5.html

2 Likes

FYI: There’s no wander the Julia program is slow, it’s not multi-threaded, and I used -t4 (new in Julia 1.5), rather than -p4.

It does scale from 1 up to 4 processes, but not really much after that, can get slower with many. I should compare some C program to a threaded Julia program, so far only ran the Julia fasta below. Note, I have 16 (I think hyperthreaded) virtual cores, and if the C programs use all, rather than 4, as it seems, it’s part of the speed difference.

$ time julia -t2 ~/fasta.jl 200000000 >/dev/null

real	0m14,548s
user	0m27,697s
sys	0m0,828s

$ time julia -t2 ~/fasta.jl 200000000 >/dev/null
real	0m9,919s
user	0m33,111s
sys	0m1,051s


I get roughtly the same speed/scalability with either allocator (consistent with them not running), as without LD_PRELOAD, but I do get interesting errors, likely to not really be the "double free" option:

$ time LD_PRELOAD=/home/pharaldsson_sym/mimalloc-master/out/release/libmimalloc.so julia -t16 ~/binary-trees.jl 20
stretch tree of depth 21	 check: 4194303
1048576	 trees of depth 4	 check: 32505856
262144	 trees of depth 6	 check: 33292288
65536	 trees of depth 8	 check: 33488896
16384	 trees of depth 10	 check: 33538048
4096	 trees of depth 12	 check: 33550336
1024	 trees of depth 14	 check: 33553408
256	 trees of depth 16	 check: 33554176
64	 trees of depth 18	 check: 33554368
16	 trees of depth 20	 check: 33554416
long lived tree of depth 20	 check: 2097151
free(): invalid pointer

signal (6): Aborted
in expression starting at none:0
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7f785438c896)
unknown function (ip: 0x7f7854393909)
cfree at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
close_unit_1 at /src/gcc-7.3.0/libgfortran/io/unit.c:727
close_units at /src/gcc-7.3.0/libgfortran/io/unit.c:780
unknown function (ip: 0x7f7855758b72)
unknown function (ip: 0x7f7854346040)
exit at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4015d4)
Allocations: 305998941 (Pool: 305998742; Big: 199); GC: 210
Aborted (core dumped)

real	0m9,047s
user	0m8,990s
sys	0m0,781s
pharaldsson_sym@SYMLINUX011:~/mimalloc-master/out/release$ time LD_PRELOAD=/home/pharaldsson_sym/mimalloc-master/out/release/libmimalloc.so julia -t4 ~/fasta.jl 200000000 >/dev/null

double free or corruption (out)

signal (6): Aborted
in expression starting at none:0
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7ffb78016896)
unknown function (ip: 0x7ffb7801d909)
cfree at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
close_unit_1 at /src/gcc-7.3.0/libgfortran/io/unit.c:727
close_units at /src/gcc-7.3.0/libgfortran/io/unit.c:780
unknown function (ip: 0x7ffb793e2b72)
unknown function (ip: 0x7ffb77fd0040)
exit at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4015d4)
Allocations: 1649495 (Pool: 1649237; Big: 258); GC: 140
Aborted (core dumped)

real	0m10,071s
user	0m32,741s
sys	0m0,778s

$ time LD_PRELOAD=/home/pharaldsson_sym/mimalloc-master/out/release/libmimalloc.so julia -t2 ~/fasta.jl 200000000 >/dev/null
free(): invalid pointer

signal (6): Aborted
in expression starting at none:0
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7fb284082896)
unknown function (ip: 0x7fb284089909)
cfree at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
close_unit_1 at /src/gcc-7.3.0/libgfortran/io/unit.c:727
close_units at /src/gcc-7.3.0/libgfortran/io/unit.c:780
unknown function (ip: 0x7fb28544eb72)
unknown function (ip: 0x7fb28403c040)
exit at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4015d4)
Allocations: 1571179 (Pool: 1570934; Big: 245); GC: 145
Aborted (core dumped)

real	0m14,901s
user	0m28,171s
sys	0m0,572s

$ time LD_PRELOAD=/home/pharaldsson_sym/mimalloc-master/out/release/libmimalloc.so julia -t4 ~/fasta.jl 200000000 >/dev/null
double free or corruption (out)

signal (6): Aborted
in expression starting at none:0
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7f5a1e77e896)
unknown function (ip: 0x7f5a1e785909)
cfree at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
close_unit_1 at /src/gcc-7.3.0/libgfortran/io/unit.c:727
close_units at /src/gcc-7.3.0/libgfortran/io/unit.c:780
unknown function (ip: 0x7f5a1fb4ab72)
unknown function (ip: 0x7f5a1e738040)
exit at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4015d4)
Allocations: 1636302 (Pool: 1636043; Big: 259); GC: 140
Aborted (core dumped)

real	0m11,974s
user	0m42,593s
sys	0m0,697s



$ time LD_PRELOAD=/home/pharaldsson_sym/snmalloc-master/build/libsnmallocshim.so julia -t4 ~/fasta.jl 200000000 >/dev/null
munmap_chunk(): invalid pointer

signal (6): Aborted
in expression starting at none:0
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7f32b1cff896)
unknown function (ip: 0x7f32b1d06909)
cfree at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
close_unit_1 at /src/gcc-7.3.0/libgfortran/io/unit.c:727
close_units at /src/gcc-7.3.0/libgfortran/io/unit.c:780
unknown function (ip: 0x7f32b40d4b72)
unknown function (ip: 0x7f32b1cb9040)
exit at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4015d4)
Allocations: 1627951 (Pool: 1627694; Big: 257); GC: 140
Aborted (core dumped)

real	0m11,449s
user	0m35,675s
sys	0m1,644s

$ time LD_PRELOAD=/home/pharaldsson_sym/snmalloc-master/build/libsnmallocshim.so julia -t2 ~/fasta.jl 200000000 >/dev/null
munmap_chunk(): invalid pointer

signal (6): Aborted
in expression starting at none:0
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7f76b53a8896)
unknown function (ip: 0x7f76b53af909)
cfree at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
close_unit_1 at /src/gcc-7.3.0/libgfortran/io/unit.c:727
close_units at /src/gcc-7.3.0/libgfortran/io/unit.c:780
unknown function (ip: 0x7f76b777db72)
unknown function (ip: 0x7f76b5362040)
exit at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4015d4)
Allocations: 1569819 (Pool: 1569577; Big: 242); GC: 144
Aborted (core dumped)

real	0m15,124s
user	0m28,330s
sys	0m0,934s

There is:

help?> Libc.malloc
  malloc(size::Integer) -> Ptr{Cvoid}

  Call malloc from the C standard library.

I was actually thinking of converting the binary-tree program at BenchmarkGame to use it, to at least see if faster. Since this is an option in Julia, they should accept…, while doing this is very non-ideomatic.++

There’s a possibility that with LD_PRELOAD it might work, i.e. actually bypass the C library. I don’t know, but if not ccall should be simple enough to use those allocators. If I can demonstrate a speedup, then likely it would also work without calling directly, by statically linking the allocators.

++ Another option, simple to try out, is simply to disable GC temporarily.

Hi, it is pretty cool to have some tests with julia.

However, I am little bit concerned that you are probably using the Debug version of snmalloc.

I am investigating the call stack when the segment fault happens and it seems that Julia is still trying to free the memory using libc rather than snmalloc:

__libc_message at /usr/bin/../lib/libc.so.6 (unknown line)
malloc_printerr at /usr/bin/../lib/libc.so.6 (unknown line)
_int_free at /usr/bin/../lib/libc.so.6 (unknown line)
3 Likes

Really?

@Palli Do you compare with https://github.com/plasma-umass/Mesh? IIUC this also supports “monkey-patching” malloc via LD_PRELOAD.

No, not for sure, you would know?

I just inferred from no speedup.

With a multi-threaded Julia program (Fasta), this time around, which actually gives multi-threaded speedup, I do not see snmalloc help (actually a bit slower) getting same speed as without LD_PRELOAD, unlike for a C program (binary-tree) where snmalloc gives 2.8x speedup.

$ time LD_PRELOAD=/home/pharaldsson_sym/snmalloc-master/build/libsnmallocshim.so ./binarytrees.gcc-5.gcc_run 20
stretch tree of depth 21	 check: 4194303

[this time a round, no stop here, when not compiled with debug mode.]

real	0m1,036s   # without LD_PRELOAD: 0m2,941s
user	0m5,568s
sys	0m0,108s

From what I can tell, the C program calls malloc for every object allocation. Julia does not do that; it only calls malloc occasionally and then does its own pool allocation. So I would expect Julia to be less sensitive to which malloc is used for workloads that allocate a large number of small objects.

Would it be worth having a way of turning this behavior off? It seems plausible that a well designed allocator would make it so that attempts to call it less would end up backfiring.

We have that; if you enable MEMDEBUG in src/options.h all objects will be allocated with malloc. It would be interesting to see comparisons using that.

I think that’s probably true for the case of malloc and free, but the problem is the malloc/free API itself — it just doesn’t lend itself to an efficient implementation. Or I suppose malloc might be fine, if free were deleted. Generational garbage collectors can be O(1) in the number of dead objects, while free requires you to call it for every dead object, making that impossible. Region-based manual memory management might give you something similar, but malloc/free by themselves, of course, don’t provide that.

So the allocation pools we use aren’t just an attempt to work around slow mallocs; they are integrated with the GC strategy to handle larger blocks of objects at once.

2 Likes

I did, but it seems not “all objects will be allocated with malloc”:

LD_PRELOAD=/home/pharaldsson_sym/snmalloc-master/build/libsnmallocshim-1mib.so ./julia -t16

julia> const MEMDEBUG = ccall(:jl_is_memdebug, Bool, ())
true  # for my Julia build (my first Julia build!), otherwise false, so seems right

julia> @timev fasta(20000000, devnull)
  1.942194 seconds (581.46 k allocations: 848.437 MiB, 10.06% gc time)
elapsed time (ns): 1942193511
gc time (ns):      195396685
bytes allocated:   889650782
non-pool GC allocs:576253  # this number is not constant, even after first run
malloc() calls:    5210   # note, here I get same number without MEMDEBUG
GC pauses:         13
full collections:  2

There’s probably no speed difference w/MEMDEBUG (but maybe a small speedup, 2%, for -t4 for reasons I guess unrelated to MEMDEBUG, -t16 is slower for all mallocs) for latest nightly vs. my older Julia 1.5-DEV, and there no speed-up when adding any of allocators with LD_PRELOAD, I tried the other snmalloc .so variant too, and another (mim) alloc.

This number does not include objects allocated by the GC.